Will inference move to the edge?

Dec 18, 2025, 48 min

Summary

Shayle Kann and Dr. Ben Lee examine the potential migration of AI inference workloads from large centralized data centers to localized edge computing or consumer devices. They discuss the current dominance of hyperscale centers due to efficiency, contrast the distinct technical challenges of AI training versus inference, and explore how new latency-sensitive applications like autonomous vehicles could drive a move to the edge. The conversation also covers the trade-offs and future prospects of on-device AI, concluding with a projection for a significant shift towards distributed inference by 2035 and its impact on overall energy consumption.

Episode description

Today virtually all AI compute takes place in centralized data centers, driving the demand for massive power infrastructure.

But as workloads shift from training to inference, and AI applications become more latency-sensitive (autonomous vehicles, anyone?), there's another pathway: migrating a portion of inference from centralized computing to the edge. Instead of a gigawatt-scale data center in a remote location, we might see a fleet of smaller data centers clustered around an urban core. Some inference might even shift to our devices.

So how likely is a shift like this, and what would need to happen for it to substantially reshape AI power?

In this episode, Shayle talks to Dr. Ben Lee, a professor of electrical engineering and computer science at the University of Pennsylvania, as well as a visiting researcher at Google. Shayle and Ben cover topics like:

  • The three main categories of compute: hyperscale, edge, and on-device

  • Why training is unlikely to move from hyperscale

  • The low latency demands of new applications like autonomous vehicles

  • How generative AI is training us to tolerate longer latencies 

  • Why distributed inference doesn't face the same technical challenges as distributed training

  • Why consumer devices may limit model capability 

Credits: Hosted by Shayle Kann. Produced and edited by Daniel Woldorff. Original music and engineering by Sean Marquand. Stephen Lacey is our executive editor. 

Catalyst is brought to you by EnergyHub. EnergyHub helps utilities build next-generation virtual power plants that unlock reliable flexibility at every level of the grid. See how EnergyHub helps unlock the power of flexibility at scale, and deliver more value through cross-DER dispatch with their leading Edge DERMS platform, by visiting energyhub.com.

Catalyst is brought to you by Bloom Energy. AI data centers can’t wait years for grid power, and with Bloom Energy’s fuel cells, they don’t have to. Bloom Energy delivers affordable, always-on, ultra-reliable onsite power, built for chipmakers, hyperscalers, and data center leaders looking to power their operations at AI speed. Learn more by visiting BloomEnergy.com.

Catalyst is supported by Third Way. Third Way’s new PACE study surveyed over 200 clean energy professionals to pinpoint the non-cost barriers delaying clean energy deployment today and offers practical solutions to help get projects over the finish line. Read Third Way's full report, and learn more about their PACE initiative, at www.thirdway.org/pace.

Transcript

Intro / Opening

A very brief word before we start the show, we've got a survey for listeners of Catalyst and Open Circuit, and we would be so grateful if you could take a few moments to fill it out. As our audience continues to expand, it's an opportunity to understand how and why you listen to our shows, and it helps us continue bringing relevant content on the tech and markets you care about in Clean Energy.

If you fill it out, you'll get a chance to win a $100 gift card from Amazon, and you can find it at latitudemedia.com/survey or just click the survey link in the show notes. Thank you so much. Latitude Media, covering the new frontiers of the energy transition. I'm Shayle Kann, and this is Catalyst. We could be getting 80% of our compute done locally and leaving 20% of the heavy lifting for the data center cloud.

Of that eighty percent, I would say most of that will be on the edge. I think maybe on the order of one percent ends up being put on your consumer electronics. Coming up: will the age of edge inference blunt the big data center boom? Catalyst is supported by Fishtank PR, an award-winning PR firm focused on climate and energy tech, renewables, and sustainability. Fishtank is known for generating prominent and effective media coverage for the brands they work with.

If you want a PR partner that's thoughtful, shoots straight, and gets results, you'll like Fishtank PR. To learn more about Fishtank's approach, visit fishtankpr.com. That's f-i-s-c-h fishtankpr.com. When utilities need flexible capacity they can count on, they turn to EnergyHub.

EnergyHub works with more than 170 utilities, coordinating over 2.5 million devices to manage 3.4 gigawatts of flexibility, built for the moments when utilities can't afford uncertainty.

EnergyHub builds and operates virtual power plants that utilities actually stake their grid planning on, coordinating EVs, batteries, thermostats, and more through a single platform built for utility scale: predictive, verifiable, and designed to perform when it counts. Learn more at energyhub.com.

AI Compute and Energy Grid Challenges

I'm Shayle Kann. I lead the early-stage venture strategy at Energy Impact Partners. Welcome. Okay, so here's an energy question disguised as an AI infrastructure question. What proportion of the world's AI compute in 2035 will be cloud, i.e., in large centralized data centers, versus edge, versus edge-edge, i.e., on device? It's an energy question because the answer today is effectively a hundred percent in that first category.

And that's why we have this crazy dynamic in the electricity sector, and actually in the natural gas sector too, where hyperscalers and neoclouds and developers and real estate speculators and crypto-miners-turned-AI-companies and more are hunting for sites that can accommodate hundreds of megawatts or gigawatts of power. And the whole thing, as we know, is crashing through the electricity sector, affecting generation, transmission, distribution, prices, now politics, and so on.

But there's a narrative that I've heard a number of times that, if borne out, would potentially present a very different future from the present. This is one where AI workloads, first of all, shift significantly from training to inference, and then where those inference workloads become highly latency sensitive and are also able to be executed in a more distributed fashion. And as a result, much of that compute, and thus the power demand, shifts from these big centralized data centers to the edge.

That could mean it shifts to 10 megawatt data centers clustered around an urban core or an autonomous vehicle corridor. Or at the limit, it could mean inference compute happens on device, and centralized data centers fall back into a pure training position. Any version of this that takes significant share of the market would have profound implications for the energy question and for the grid. So it is worth exploring, which is what I'm doing today with my guest, Dr. Ben Lee.

Ben is a professor of electrical engineering and computer science at the University of Pennsylvania. He's also a visiting researcher at Google. By the way, this edge AI infrastructure world and the energy implications thereof is super interesting to me, as you will hear. So if you are building something in the space, please come get in touch. In the meantime, here's Ben. And welcome. Great to be here. Thanks so much.

Defining AI Compute Categories

I'm very excited for this conversation because this is a topic that, in the energy circles I travel in, I've heard scuttlebutt about a bunch of times, but have never actually spent the time to really try to understand. The topic basically being: how much of inference compute might move from central cloud infrastructure to the edge? And then how far to the edge, of course, being another question.

But I think we should start by actually defining those categories a little bit. How do you think about the categorization of where compute can occur? And then we'll talk about each of those categories individually.

Right. So even before we talk about generative AI, for classical cloud computing in general, all the services we loved and that changed the way we live and work today, there are three levels I generally think about for computing. The first is massive hyperscale data centers, the ones run by Microsoft and Google and Amazon: hundreds of thousands of machines, massive facilities. That's what most people think about when they think about cloud computing. At the other end of the extreme would be personal devices, consumer electronics. So you think about your phone, your tablet, your laptop. Plenty of compute can happen there as well.

There is a perhaps less understood middle layer or intermediate layer called edge computing. And edge computing really means that there are times where you don't want to go all the way to this remote, massive facility and wait for the data to go out to that data center and then come back. You might want to access some compute that's a little bit closer to you, maybe in the same city, maybe in the same geographic region. That's edge computing. So they're still going to supply really capable, high-performance machines, these servers, but you don't suffer those longer communication times or latencies that you might if you were to go to that remote massive data center.

And my recollection is that the advent of cloud computing meant the build-out of lots of big centralized data centers. There was a fair amount of conversation some number of years ago, in the first wave of excitement around autonomous vehicles in particular, that you might see a fair amount of edge infrastructure get built because of the latency tolerance requirement for AVs. I mean, I'm on the outside, so tell me if I've got the narrative wrong here. But then it seems to me that because AVs were generally delayed, or maybe the need wasn't as high, if you just look at the infrastructure today, it seems like the vast, vast majority of even classical compute, except for stuff that's sitting in mainframes at companies, is in the cloud, in the big centralized data centers. Do I have that right?

Why Centralized Cloud Dominates Today

That's right. And this is a decades-long trend. I mean, we've seen this progression, this adoption of cloud computing, over the last fifteen to twenty years, and there are a couple of reasons we have seen that shift. The first is that computing in a massive data center run by the hyperscaler companies, the big tech companies, is much more energy efficient. They know how to deploy these facilities, they know how to cool them and build HVAC systems efficiently. So they're incurring very small overheads per watt of compute. There's this industry-standard metric called power usage effectiveness, or PUE. And that's the ratio of the total power you're using in the building compared to the power that's going to compute. So Google's PUE is close to 1.1, which is to say for every watt going to compute, there's an additional 0.1 watt going to the overheads of power delivery or cooling or whatever.

So that's really incredibly efficient. And most mom-and-pop data center operators, most enterprise data center operators, don't get the scale and efficiency that these hyperscalers do. The scale also gives a second key advantage, which is the ability to share hardware. So you buy the hardware once and you have lots of users sharing the same physical hardware.

That, again, allows the hyperscaler operators to drive the costs down. And that essentially gets a massive increase in efficiency. So most compute now is being done in these large data centers and in the cloud.
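The PUE arithmetic described in this answer can be written out as a short sketch. The 1.1 figure is the one quoted in the episode; the 100 MW IT load and the 1.6 comparison value are assumptions chosen only for illustration, not numbers from the conversation.

```python
def facility_power_mw(it_power_mw: float, pue: float) -> float:
    """Total facility power, using PUE = total facility power / IT (compute) power."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_power_mw * pue


def overhead_power_mw(it_power_mw: float, pue: float) -> float:
    """Power spent on cooling, power delivery, and other non-compute overheads."""
    return facility_power_mw(it_power_mw, pue) - it_power_mw


if __name__ == "__main__":
    it_load_mw = 100.0  # assumed IT load, for illustration only
    scenarios = [
        ("PUE 1.1 (figure quoted in the episode)", 1.1),
        ("PUE 1.6 (assumed less efficient operator)", 1.6),
    ]
    for label, pue in scenarios:
        total = facility_power_mw(it_load_mw, pue)
        overhead = overhead_power_mw(it_load_mw, pue)
        print(f"{label}: total {total:.0f} MW, overhead {overhead:.0f} MW")
```

Under these assumptions, the same 100 MW of compute needs roughly 10 MW of overhead at a PUE of 1.1 and roughly 60 MW at a PUE of 1.6.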

Training Versus Inference Workloads

Talk about the world of AI now, which is where all this growth in compute is happening. AI workloads, of course, are divided into two major categories, one being training of models and the other being inference. I think we'll spend most of our time today talking about inference, probably, but let's spend one minute on training.

Is there any movement or argument that training should take place anywhere other than large centralized data centers? It seems very clear to me that the trend right now is just build the largest possible data center to train the largest possible model. So is there anyone who thinks that might turn in the other direction?

Some, but that really hasn't gotten much traction. I think the reason why we see most training happening in massive data centers is because of the scale. You need large data sets, you need lots of GPUs, all closely coordinated, learning the model parameters. The only scenario that some people have explored for training away from the data center is if you've got private data and somehow you want to refine or fine-tune your model with that private data. You don't want to share it with the hyperscalers. That has been primarily a research question rather than a production system that people have deployed.

Okay, so let's assume then that the vast majority of training compute is still gonna happen in centralized data centers. As it stands today, I don't know if you know the numbers, but just high level, of all AI workloads, how much is training versus inference? Because I think the other big point people have made is that over time the proportion of workloads going toward inference is going to increase, and the proportion of workloads going toward training may decrease as we sort of asymptote the next model or something like that. But today it's mostly training still.

I would agree with that. I think, to first order, the training costs are historically what people have cared about the most because the data sets are massive, and then you're talking about these massive one thousand megawatt data centers for the training workloads. There was a study we did when I was a visiting research scientist at Meta where we found that energy costs for AI were roughly broken into three categories. There's a data pre-processing aspect, and that's about a third. The training is another third. And then the inference, or the use of the model, is the last third. But clearly those fractions are evolving rapidly. And I would agree with you when you're saying that the training costs are probably flatlining. They're reaching a plateau in how quickly they're growing, perhaps. And if the optimism about AI is to be justified, you're gonna have to see inference costs go way up, because that will be an indicator that adoption has gone up in a fairly significant way, both among individual users but also among companies and enterprise use. So I think it's true to say that inference costs are large and potentially will grow very rapidly.

Okay. So then we're getting to the crux of our question today, which is: inference workloads and inference costs increase over time, usage of the models increases over time. That's the presumption of everything going on in the AI world. And then the question is, will that inference compute predominantly still take place in these big centralized cloud data centers, or will some or much of it potentially shift to one of the other two categories you described, edge localized or fully localized on device? So let's talk about the edge version first, which is essentially smaller data centers, still data centers, but smaller and more local. What's the argument for why that might happen and what are the limitations?

The Case for Edge Inference: Latency

So the argument in favor of edge computing is mainly the proximity to the end user, right? We have been conditioned, in an era before generative AI, that when we access internet-based services like a search engine, we expect the answer to come back on the order of a hundred milliseconds. That is the order of magnitude that we're talking about.

And as a result, to get those hundred millisecond latencies, oftentimes you require computation closer to the user. So the data doesn't have to travel across the internet, from the West Coast out to the East Coast and back again, and you get that answer back in a timely way.

What is interesting with generative AI is that we are being reconditioned to tolerate much longer delays. So if you use something like GPT or something like Claude or your favorite chatbot, oftentimes it's just sitting there thinking for seconds and seconds, maybe tens of seconds, before it gets you the first token. So the question there is to what extent we care about that latency and need that really fast, responsive access to the answer.

Yeah. And I think we've been trained even further in that direction with the introduction of things like deep research, where, you know, even in the name, you sort of think, well, of course that has to take time. It is deep research that they're doing. So it's an interesting point that maybe we are becoming reconditioned to allowing more latency. The argument that I've heard for why latency is really gonna matter, apart from just wanting search queries or chat queries to come back quicker, is the next wave of applications for AI, right? And so maybe we go back to the autonomous vehicle world and things like that, where latency, making decisions in near real time, does become really important. Robotics is another category that could be a major user of AI compute but needs really, really low latency. Is that part of the argument for shifting some compute to the edge?

Yes, absolutely. So the class of compute you mentioned, autonomous vehicles, robotics, fits into what we call cyber-physical AI. Cyber-physical systems are those that have a cyber component, a computational component, but also interact with the physical world. And once those interactions with the physical world arise, then we care about responsiveness, because that underpins safety guarantees and the ability to make sure that your robotic arm is able to respond quickly enough to hazards, and that your autonomous vehicles are able to do so. So I agree that there will be cases where we will need those really low latencies, and that is gonna require edge computing much closer to the user so we have much shorter internet delays, network delays.
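To put rough numbers on the proximity argument, here is a minimal sketch of propagation delay alone. It assumes light travels through optical fiber at about two thirds of its vacuum speed (roughly 200 km per millisecond). The distances are illustrative assumptions, not figures from the episode, and real round trips add routing, queuing, and compute time on top.

```python
# Approximate signal speed in optical fiber: about 2/3 of c, or ~200 km per ms.
FIBER_KM_PER_MS = 200.0


def round_trip_ms(distance_km: float) -> float:
    """Propagation-only round-trip time over fiber for a one-way distance in km."""
    return 2.0 * distance_km / FIBER_KM_PER_MS


if __name__ == "__main__":
    # Assumed one-way distances, for illustration only.
    routes = {
        "cross-country data center (~4,000 km)": 4000.0,
        "regional edge site (~50 km)": 50.0,
    }
    budget_ms = 100.0  # the "order of a hundred milliseconds" expectation
    for name, km in routes.items():
        rtt = round_trip_ms(km)
        share = 100.0 * rtt / budget_ms
        print(f"{name}: {rtt:.1f} ms round trip, {share:.1f}% of a {budget_ms:.0f} ms budget")
```

Under these assumptions, a cross-country round trip uses a large share of a 100 ms budget before any computation happens, while a nearby edge site uses well under one percent.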

Technical Reasons for Centralized Training

I'm curious to understand the trade-offs here, right? I know with model training there are technical reasons why you want all your compute clustered together as closely as possible. You want every GPU as close to every other GPU as you can make them, minimizing the copper between them, or the optics, or whatever it is that's communicating between them. And that, for some reason that you can explain to me, makes model training more effective. Is there a similar dynamic in inference? Is there a technical reason why you're paying a penalty if you shift to smaller data centers at the edge, or is there no technical reason why it's suboptimal?

Right. Yeah. Let's talk about the training piece first. The reason why we need thousand-megawatt data centers, where we have hundreds of thousands of GPUs connected so closely together, is because the data sets are massive and the models are massive. We're trying to learn on the order of a trillion parameters for these machine learning models, these AI models. And we're trying to do it on the wealth of data we find on the internet.

There's no way that any single GPU can handle that much data. So what we end up doing is partitioning the data into smaller pieces and then handing each GPU a slice or a partition of this data. And each GPU will churn on its own partition of the data and learn the model that works best for its piece of the data. And all the other GPUs in the data center are doing the same thing on their partitions of the data.

Periodically, what they will do is compare notes. They will share the weights that they've learned, and this sharing is really, really expensive. And some of the people in the energy space may know that there are massive power fluctuations we will see in data center usage when the GPUs go from this computationally intensive phase, where you're learning the model weights, to this communication-intensive phase, where they're comparing notes and sharing their intermediate results with each other. So that's why we're talking about these massive data centers for training. They all need to communicate frequently to share what they've learned from their own data sets. For inference, we don't see that effect.
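The compute-then-compare-notes pattern described here can be illustrated with a toy sketch. It is not how any production training system is implemented: it fits a single weight on synthetic data, and each "GPU" is just a list entry. All function names and parameter values are hypothetical. The point is only the alternation between a local compute phase and a communication phase in which every worker's weight is averaged.

```python
def make_shards(num_workers, points_per_worker):
    """Partition a synthetic dataset (y = 3x) so each worker gets its own slice."""
    xs = [0.1 * (i + 1) for i in range(num_workers * points_per_worker)]
    data = [(x, 3.0 * x) for x in xs]
    return [data[w::num_workers] for w in range(num_workers)]


def local_step(weight, shard, lr):
    """One gradient-descent step on mean squared error for the model y = weight * x."""
    grad = sum(2.0 * (weight * x - y) * x for x, y in shard) / len(shard)
    return weight - lr * grad


def train(num_workers=4, rounds=20, local_steps=5, lr=0.05):
    shards = make_shards(num_workers, points_per_worker=8)
    weights = [0.0] * num_workers
    for _ in range(rounds):
        # Compute phase: every worker learns from its own partition only.
        for _ in range(local_steps):
            weights = [local_step(w, s, lr) for w, s in zip(weights, shards)]
        # Communication phase: workers "compare notes" by averaging their weights.
        average = sum(weights) / num_workers
        weights = [average] * num_workers
    return weights[0]


if __name__ == "__main__":
    print(f"learned weight: {train():.6f} (the synthetic data uses 3.0)")
```

In a real cluster the communication phase moves a copy of a very large set of weights between many GPUs, which is why training favors tightly coupled hardware in one place.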

Just to add, the craziest thing to me about how model training data centers operate right now, the absolute craziest thing, is, as you said, there are these surprisingly large spikes in power demand as a result of how the models are trained. Those spikes are actually problematic, not just to the grid but to the equipment inside the data center as well. So what they do, at least sometimes, to manage that is create dummy workloads.

So they keep the power profile basically flat, but you are literally just wasting energy on absolutely nothing. Nothing is happening during those times. They're dummy workloads. At that scale, the fact that that is happening is wild to me.

Absolutely. And I think we've seen this in other contexts as well, but not perhaps at this scale. In electrical engineering we call it the di/dt problem, the change in current divided by the change in time: large current swings over very short periods of time.

You could imagine building batteries to sort of damp things out or decouple, and certainly a lot of people are thinking about that. But the easiest thing to do might be to just modulate the software, as you say, because we have very precise control over what the software does. So that is an active and ongoing area of research that needs to get further developed.
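A toy sketch of the trade-off being discussed: padding the low-power communication phases with dummy load shrinks the swings but burns extra energy. The power trace and the 90 MW floor are made-up values for illustration, and the function names are hypothetical.

```python
def largest_step(trace_mw):
    """Largest change in power between consecutive time intervals."""
    return max(abs(b - a) for a, b in zip(trace_mw, trace_mw[1:]))


def pad_with_dummy_load(trace_mw, floor_mw):
    """Raise every interval to at least floor_mw, as a dummy workload would."""
    return [max(p, floor_mw) for p in trace_mw]


if __name__ == "__main__":
    # Assumed profile: 100 MW while computing, 40 MW while GPUs exchange weights.
    trace = [100, 100, 40, 100, 100, 40]
    padded = pad_with_dummy_load(trace, floor_mw=90)
    wasted = sum(padded) - sum(trace)  # in MW-intervals
    print(f"largest swing before padding: {largest_step(trace)} MW")
    print(f"largest swing after padding:  {largest_step(padded)} MW")
    print(f"extra energy spent on dummy work: {wasted} MW-intervals")
```

The flatter profile is easier on the grid and on the equipment, but every padded interval is energy spent on work that produces nothing.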

Are you tired of overpaying for big-name PR firms but not really knowing what they're delivering? Is your comms team wasting time reviewing lengthy messaging briefs and decks instead of engaging journalists or producing content? Are you wondering why your competitors are getting press and you aren't? Fishtank PR is an award-winning climate and energy tech, renewables, and sustainability focused PR firm dedicated to elevating the work of both early-stage and established companies. Whether you need to position yourself as a thought leader in between project announcements or translate complex ideas and technologies into tangible, compelling stories that resonate with the media, Fishtank can help. Check out fishtankpr.com. That's f-i-s-c-h fishtankpr.com.

Virtual power plants are becoming a reliable way for utilities to manage capacity, but enrolling devices is just the start. What really matters is confidence: knowing those resources will perform when dispatched, and being able to prove it from the control room to the living room. EnergyHub's platform handles the full picture, from near real-time forecasting to locational dispatch and the kind of rigorous verification that holds up when regulators, grid operators, or leadership ask: did it deliver? Easy enrollment creates momentum. Proven performance builds trust. That's why more than 170 utilities rely on EnergyHub to manage over 2.5 million devices, delivering 3.4 gigawatts of flexible capacity. See what that looks like at energyhub.com.

No Technical Downside for Edge Inference

Okay, so then on to inference. So you're saying inference does not contain that same challenge. What is the downside to shifting inference workloads to the edge?

To my knowledge, there isn't much of a downside. The reason why inference is amenable to edge computing is because when you send a prompt for processing by a large language model, that prompt is probably handled by one GPU or maybe eight GPUs inside a single machine. And the reason is that the model sits in that machine, the data sits in that machine, and all of your prior conversations with that bot are sitting in that machine. It's a very localized piece of compute that needs to be done.

And you don't need tens or hundreds of GPUs to be coordinating to give you an answer back. You've got that one GPU or eight tightly coupled GPUs giving you that answer back. And that is great for edge computing, and we can certainly supply that.

Scaling Edge Data Centers

So a thought experiment that I've given people recently in thinking about this is: let's just say that you need a gigawatt of inference compute five years from now or seven years from now, something like that, and the demand for that gigawatt is geographically centralized somewhere. Let's just say you think you're gonna need a gigawatt of inference compute to serve the Dallas metropolitan area, whatever it might be.

At that point, a few years from now, and this is back to the power perspective, is it going to be easier for you to find and site one gigawatt site, or one hundred 10-megawatt sites within that geographic region? Today, I think it is still probably easier to find the gigawatt site, or at least the past couple of years it has been.

But there aren't that many gigawatt sites out there from a power availability perspective. So at some point is that going to flip, and is it going to be easier to build one hundred 10-megawatt sites, which sounds really hard to do and indeed is, but these are all hard problems.

So if that happens, do you think that we are going to see a significant portion of that inference workload move to that type of scale? Is that the right scale? Should we be looking at 10 megawatt sites, 100 megawatt sites, one megawatt sites? How far to the edge do we want to go?

Yeah, absolutely. And I agree with the premise of that question one hundred percent. I think that there are two reasons to go to many smaller data centers. The first is the one you mentioned, power provisioning and connections to the grid. The second is the fact that you don't need massive GPU coordination for an inference workload. I guess the catch might be that if you are thinking about your existing edge data centers, maybe you've got data centers in downtown Los Angeles or something like that already serving workloads, those data centers may not be configured to handle GPU and AI compute. They may have power delivery infrastructure that was optimized for CPUs. They might have HVAC systems optimized for the much lower power density of CPUs. So it's not simply a matter of pulling out your CPUs and replacing them with GPUs. You may have to retrofit the facility itself to support that. But I agree. I think finding capacity there may eventually become easier than finding the next thousand megawatts.

Is there any limitation? I'm trying to think of why you wouldn't do that. You need to have a fair amount of memory, you need to house all the model weights and so on in every individual data center if you're going to do that at the edge, right? So there's gotta be some minimum viable scale, I assume.

Right. And maybe to give you a sense of the type of data centers we were talking about in the past, again in a study that we had done with Meta, we looked at fifteen of their data centers before generative AI, and the scale of those facilities was somewhere between fifteen and fifty megawatts, right? So less than a hundred megawatts. And certainly it was fairly conventional, uncontroversial, to build those sizes of data centers in the past. So that's the starting point, I think, in terms of the scale. Now as you scale down towards, for example, one megawatt, it's not clear at what point things start making less sense.
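The thought experiment reduces to simple arithmetic, sketched here for the site sizes mentioned in this exchange. The one-gigawatt demand figure comes from the host's example; the function name is hypothetical, and the calculation ignores redundancy headroom and PUE overheads.

```python
import math


def sites_needed(total_demand_mw: float, site_capacity_mw: float) -> int:
    """Number of equally sized sites required to cover the total demand."""
    if site_capacity_mw <= 0:
        raise ValueError("site capacity must be positive")
    return math.ceil(total_demand_mw / site_capacity_mw)


if __name__ == "__main__":
    demand_mw = 1000.0  # the one-gigawatt metro-area example from the episode
    for site_mw in (1000.0, 100.0, 50.0, 15.0, 10.0, 1.0):
        count = sites_needed(demand_mw, site_mw)
        print(f"{site_mw:>6.0f} MW sites: {count} needed")
```

At the fifteen to fifty megawatt scale described above, a gigawatt of demand works out to somewhere between twenty and sixty-seven sites.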

Geographic Distribution of Edge

I guess the other point here: the way that the data center build-out has gone historically, the cloud data center build-out, it's been fairly clustered in these regions, right? And there's a reason why Northern Virginia is the data center hub of the world. And there are others as well: Chicago, Dallas, Phoenix, etc. And that, as I understand it, is largely because the cloud providers needed to offer a certain level of reliability to their customers. And so they could have redundancy within a given region. And that was helpful to them in terms of what they were offering. Do you think that this future world wherein a bunch of inference compute moves to the edge, let's call it fifteen to fifty megawatt data centers instead of hundreds or thousands of megawatt data centers, looks similar, in that you have a small number of regions that have a really high concentration of those 15 to 50 megawatt data centers? Or could it be much more dispersed, because the whole point of this is really low latency and local, and you don't need them to be as clustered?

I think there are lots of different aspects at play in terms of data center siting. I think the redundancy is definitely one of them. And I have trouble disentangling the role that some of these other factors play as well. Some people talk about tax breaks and incentives from local companies and local states. Some people talk about proximity to internet exchange points. So not only are we talking about congestion-free power movement, but you're also talking about congestion-free data movement into and out of the data centers. Northern Virginia has that. And then of course the availability of the power itself. I guess I would say that when you start talking about many of these smaller data centers, from a redundancy perspective, it might be okay that they're not all geographically clustered, as long as you have a strategy for rolling over the workload to spare capacity somewhere within that region that has a similar performance profile or similar latency or delay characteristic. And so that's really the concern: whether you have robust geographical redundancy and resilience there.

Current Barriers to Edge Inference Adoption

Is this happening? It's interesting. It sounds like you're saying there's not a big downside. We already have significant inference workloads, so it's not like we're waiting on workloads to show up that could accommodate this. And yet, if you look at most everybody building data centers, certainly the hyperscalers and I think the colos and folks as well, the focus continues to be on: we've got to find big sites for big data centers. Why don't we see more development of this smaller-scale edge AI inference world?

I think it really depends on the workload and the application. I would view AI as a more fundamental, basic technology, and we don't necessarily know what application or capability will be layered on top of it.

I'd say that we've been talking about edge data centers a lot. There are other words for this type of data center. A content distribution network, a CDN, is one example, or a point of presence, a PoP, as these facilities are sometimes called, and they exist in fairly significant numbers. Content distribution networks ensure that when you want to access, for example, New York Times dot com or WSJ dot com, your web page is not being served from the other end of the country. Those web pages are sitting close to you because the content distribution network took those updated web pages and moved them to facilities near you, data centers near you.

Likewise, companies like Meta, when they have Instagram or these social media applications, also have points of presence that supply data locally rather than retrieving content for your feed from across the country. So we already see that, but these are application-level performance requirements, whether they be for social media or for news content.

Once it becomes clear what applications of AI really drive further inference deployments, then we'll know what sort of performance requirements are needed, and what sort of caching techniques or strategies might be useful, so that we can keep fresher data, or more recent, more frequently used models, closer to these users and then serve them more quickly. I think it will become clearer as we see which models really get traction, which applications really get traction.
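
One way to picture keeping frequently used models close to users is a least-recently-used cache at an edge site. The conversation does not name a caching policy; LRU is chosen here only because it is simple. The class name, the model names, and the sizes are all hypothetical.

```python
from collections import OrderedDict


class EdgeModelCache:
    """Toy LRU cache of models held at one edge site (sizes in GB)."""

    def __init__(self, capacity_gb: float):
        self.capacity_gb = capacity_gb
        self._models = OrderedDict()  # model name -> size in GB, oldest first

    def request(self, name: str, size_gb: float) -> str:
        """Return where the request is served from: 'edge' or 'central'."""
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return "edge"
        # Miss: this request goes to the central data center. Afterwards,
        # pull the model to the edge if it can fit at all.
        if size_gb <= self.capacity_gb:
            while sum(self._models.values()) + size_gb > self.capacity_gb:
                self._models.popitem(last=False)  # evict least recently used
            self._models[name] = size_gb
        return "central"


if __name__ == "__main__":
    cache = EdgeModelCache(capacity_gb=10)
    for model in ["mail", "photos", "mail", "search", "photos"]:
        print(model, "->", cache.request(model, size_gb=4))
```

A real deployment would also need to account for model versions and warm-up time, which this sketch ignores.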

Right. So maybe the state of affairs today is: anybody who's developing data centers knows we need the big centralized data centers, because there is currently essentially endless demand to train models, at least relative to the availability of compute today. And so we know we need to build the big centralized ones. We might as well use those big centralized ones that we know we need right now for inference workloads, such as they are today. But we don't have enough certainty yet about what the inference workloads are going to be long term to invest the kind of capital and time it would take to build out a network of 1 to 10 megawatt data centers in a particular geographic region, something like that.

That's right. And I would say my crystal ball is only as clear as anyone else's, but I feel like there's a huge amount of GPU capacity being discussed in the pipeline in these large data centers. And maybe it turns out that there are diminishing returns from training larger and larger models, or maybe we run out of data because we've exhausted all the data that's available on the internet. When those things happen, it may be that demand for these GPUs in these largest data centers will flatten out, and we're gonna have spare capacity. At which point, as you say, they will be used or repurposed to serve inference. And then it will be hard to make the case for building yet more data centers, smaller ones with GPUs closer to the users.

I think the catch there will be if one of these model providers or one of these application developers makes performance a distinguishing feature of their offering, right? If they start competing on performance rather than on capability, then we're gonna see: well, I may have a thousand GPUs in the middle of Nebraska that are already deployed, but if I really wanna break into the San Francisco market, I've got to build my GPUs right there and have them available.

On-Device Inference: Pros and Cons

All right. So speaking of performance, let's transition to the full extreme version of this, which is also, I think, theoretically the most disruptive from an energy perspective: shifting any significant portion of these inference workloads all the way onto the device.

Either skip the middle ground of edge data centers of five megawatts, or fifteen, or fifty, or include them, but shift workloads that would have gone to a big data center that requires a lot of power straight onto your iPhone or your iPad or whatever it is. And we've heard some glimmers of this as well. Give me the similar sort of pros and cons of shifting that workload straight onto the device.

Right. Pros, primarily two things. One is performance, right? You don't have to go across the internet. The model is right there and the compute is right there. Assuming that you get really capable hardware on your device as well, you get really quick, responsive answers from your AI.

The second is also something we've mentioned earlier, which is the notion of privacy. You don't necessarily need to send your data out into this hyperscale data center, where it gets blended with lots of other user data and you have maybe fewer guarantees about what happens to it. Localized compute is certainly more private than compute on shared systems. So those are the two key advantages. And then I guess the third would be that it gets more tightly integrated with the capabilities on a particular platform. So, for example, Apple's ecosystem.

Right. And Apple seems like the obvious candidate to do this. You mentioned privacy; Apple is particularly focused on privacy. They have the hardware, the device, right? And Apple is notoriously, or at least reputationally, behind in the AI race. So it's not hard to picture that if somebody's gonna move a lot of this inference on device, it's gonna be Apple. Okay, but there is a real trade-off here, I assume.

Yes. And the trade-off is primarily with respect to the capabilities of the device. So if we have a very large model, we're gonna have to deploy that model on a much more capable hardware platform than we've got today. This means having some number of gigabytes of memory to hold the model weights, and then also some additional gigabytes of memory to hold the context as you develop this conversation with the model.

In addition to the memory, you're also gonna need the compute. You're not gonna have this high-performance GPU sitting inside your phone, so you're gonna have to have specialized chips. Those specialized chips in your hand are gonna be less powerful or less capable than the ones in the data center. So all of this speaks to not getting exactly the same model that you would get in the data center. You would get a shrunk-down model.

Maybe in the data center you would have a trillion parameters, this massive GPT-5 model, for example. But on a personal consumer electronics device, you might only have seven billion parameters, so orders of magnitude smaller. And that smaller model will be less capable. It will give you less capable answers. It will be capable of doing fewer tasks.

But maybe that's okay, because you've identified only a handful of tasks that you really care about on your personal device. So that is really the trade-off. As you go towards the device, you're gonna have to shrink the size of the model down, and you're gonna get less and less capability out of your AI.
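
To put rough numbers on the memory point above, here is a back-of-the-envelope sketch. The parameter counts (seven billion and one trillion) come from the conversation; the precisions, the layer and head counts, and the context length are assumptions chosen only for illustration, and the key/value cache formula is the standard one for transformer attention rather than anything stated by the speakers.

```python
GB = 10**9


def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to hold the model weights at a given precision."""
    return num_params * bits_per_param // 8


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bits_per_value: int = 16) -> int:
    """Bytes for the attention key/value cache of one conversation.

    The factor of 2 covers one key and one value vector per token per layer.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * context_tokens
    return values * bits_per_value // 8


# Hypothetical small-model shape, not the configuration of any real model.
SMALL_MODEL = dict(num_layers=32, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for label, params in [("7B on device", 7 * 10**9),
                          ("1T data center", 10**12)]:
        for bits in (16, 4):
            size = weight_bytes(params, bits) / GB
            print(f"{label:>15} at {bits:>2}-bit: {size:,.1f} GB of weights")
    ctx = kv_cache_bytes(context_tokens=8192, **SMALL_MODEL)
    print(f"context (8,192 tokens): {ctx / GB:.2f} GB")
```

Under these assumptions the seven-billion-parameter model needs 14 GB of weights at 16-bit precision and 3.5 GB at 4-bit, while the trillion-parameter model needs 2,000 GB at 16-bit, which is why only the shrunk-down model is a candidate for a phone.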

The final thing, of course, is the power and energy profile. At data center scale, we care primarily about power, because power influences infrastructure and power delivery, and influences thermal management and so on. For device-level compute, there are two considerations. We care about energy rather than power, because that affects battery life.

Right. So even if you could deliver a really capable GPU chip onto your phone, the question is how long your phone would last if you were using that chip on a fairly consistent basis. So the energy aspect will continue to be challenging. And then the thermal aspect will also be challenging: if you have a really powerful device, that's gonna be a hot brick inside your pocket, and that's gonna be a deal breaker as well.
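
The battery-life question above reduces to simple arithmetic: stored energy divided by average power draw. The sketch below works through it. The battery capacity and the power figures are invented round numbers, not measurements of any real phone or chip.

```python
def battery_hours(battery_wh: float, baseline_w: float,
                  accelerator_w: float, duty_cycle: float) -> float:
    """Hours of battery life when an AI chip runs part of the time.

    duty_cycle is the fraction of time (0 to 1) the accelerator is active.
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    average_power_w = baseline_w + accelerator_w * duty_cycle
    return battery_wh / average_power_w


if __name__ == "__main__":
    # Illustrative assumptions: 15 Wh battery, 1 W baseline, 4 W accelerator.
    for duty in (0.0, 0.25, 1.0):
        hours = battery_hours(15.0, 1.0, 4.0, duty)
        print(f"accelerator active {duty:.0%} of the time: {hours:.1f} h")
```

With these assumed values, running the accelerator continuously cuts battery life from 15 hours to 3, which is the sense in which energy, not peak power, is the binding constraint on a device.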

So when you say deal breaker, is there progress toward on-device inference? To your point on performance, that strikes me as: okay, we're now again in the context of specific workloads. For certain types of workloads a 7 billion parameter model might be fine, and for others it wouldn't be. And so maybe there will be an on-device chip and some inference that you could do on device, but you pull up your ChatGPT app or whatever, and of course it's gonna send you back out to the cloud or maybe to the edge. But these other challenges of thermal management and things like that are hardware challenges. Where are we in the progression of on-device inference? Is it coming? Is it not coming? Do we not know?

I think the assumption with on-device inference is that you'll be able to shrink the model without loss in performance for the tasks you care about. That is the primary strategy the computer scientists have been taking. On the hardware side, we have made strides in developing custom chips, custom silicon, for the specific types of tensor algebra that are required for machine learning models. So we know how to build those chips, and that gives us energy-efficient, higher-performance compute.

We know how to build really capable memory systems or solid-state disks, so your phone now has hundreds of gigabytes of storage on it. So there's a question of, well, maybe you'll end up using less of it for your photos and more of it for your AI model, something like that.

So I think there are fairly significant resource constraints, but I don't think that they are insurmountable, in the sense that more intelligent hardware design and more intelligent hardware management could go some way toward making these AI models feasible on the device.

Future AI Compute Distribution

Okay, so I'm gonna put you on the spot, and we promise not to hold you to these numbers, but just to give a sense of where we think things are heading. If we're fast-forwarding 10 years, let's just say we're in 2035, imagine there's a total volume of inference compute in the world; let's just say it's a hundred megawatts total. What would be your guess of the ranges of how much of that compute is going to take place in large centralized data centers versus at the edge? We'll draw a line. Let's say a hundred megawatts and above is large centralized; sub-hundred megawatts, but not on device, is edge. And then the third category, of course, being on device. How much of it can go anywhere but the centralized data centers?

So I would go straight to this idea of having a 20/80 rule, because we see this all the time in computer systems, where you have 20% of your tasks being extremely popular. Maybe there are twenty things that you always want to do, and you spend 80% of your AI compute doing those things. That could be email processing, that could be photo analysis. So we can identify what those really compelling applications and tasks are, and we're going to be spending most of our time doing that.

And then for the remainder, the long, heavy tail of other tasks that people might want to do, there will always be backup capabilities residing in the cloud data center. So I would say that we could be getting eighty percent of our compute done locally and leaving twenty percent, the heavy lifting, or the more esoteric, more corner-case compute, for the data center cloud. That is, of course, excluding the training. The training will continue to all reside in the massive facilities. But in terms of the inference, I think there's huge potential.

Right. But that's actually a very significant shift. I appreciate that that doesn't include training, but still, if eighty percent of the inference workload could end up local, that's a significant shift and has pretty profound implications for the energy picture as well. Just to pin you down even a little bit more, is that local in the sense of being at the edge, or is that local in the sense of being on device? What do you think the split ends up being there?

Yeah, so I think of the eighty percent, I would say most of that will be on the edge, like I suspect it is today. If you look at what we talked about earlier, the content delivery networks and points of presence, they've probably identified the twenty percent of the content that eighty percent of the people will be looking at most of the time, and they're putting it at the edge. I think maybe on the order of one percent ends up being put on your consumer electronics.

Actually, even for today's compute, when we set aside AI, there is a trend towards consumer electronics hiding that flow of data back and forth between the device and the edge for you. So if you use a cloud storage service like Dropbox, or if you're using a photo storage service, they will let you pretend that you have access to all of your videos, all of your photos, and all of your documents, and they will transparently, behind the scenes, move things back and forth between the data center and your local device. So you may think you have all of it, but maybe you've only got a tiny sliver, less than one percent, on your local device.
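
The 20/80 rule of thumb mentioned above can be explored with a toy popularity model. The sketch below assumes task popularity follows a Zipf distribution, which is a common modeling choice but is not something the speakers specify; the function name, the task count, and the exponents are illustrative assumptions.

```python
def zipf_coverage(num_tasks: int, top_fraction: float,
                  exponent: float = 1.0) -> float:
    """Share of all requests that hit the most popular tasks.

    Task popularity is assumed proportional to 1 / rank**exponent.
    """
    weights = [1.0 / rank**exponent for rank in range(1, num_tasks + 1)]
    top_k = max(1, round(num_tasks * top_fraction))
    return sum(weights[:top_k]) / sum(weights)


if __name__ == "__main__":
    for exponent in (0.8, 1.0, 1.2):
        share = zipf_coverage(num_tasks=100, top_fraction=0.2,
                              exponent=exponent)
        print(f"exponent {exponent}: top 20% of tasks get {share:.0%} of requests")
```

With 100 tasks and an exponent of 1, the top 20 percent of tasks account for roughly 69 percent of requests, and a steeper exponent pushes that share higher. How much can be served locally therefore depends heavily on how skewed real usage turns out to be.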

Right. Certain things in my Box instance open up much faster than others when I try to open them, and it's occurred to me that that is why. If I step back then: it sounds like what you're saying is that in this scenario, your painting of the future, roughly 80% of the inference workloads are at the edge, very little of it actually on device, and then the other twenty percent or so sits in big cloud data centers.

When I think about the energy implications of that, there are a couple of ways to think about it that are pretty interesting. One is this: maybe a fair amount of the energy consumption of at least inference compute is going to shift to these five megawatt, fifteen megawatt, fifty megawatt type local sites. That has big implications for the grid, in ways that are, I don't know, both good and bad; probably harder to manage in some ways, easier to manage in others.

But the overall energy consumption of inference compute, I would expect, and you can tell me if I'm wrong, would actually be higher in this scenario than it would be if it was all centralized, because I assume the PUE that you get for these edge data centers isn't quite as good as it is for the large centralized data center. So on balance, this probably means more overall AI energy consumption. Do you think that's right?

Yes. I think you get economies of scale when you go to a gigawatt or two gigawatts. You have a single facility, you're managing it in a highly optimized, coordinated way, and you've got hundreds of thousands of these machines all managed very precisely.

As you shrink the system down, you will lose efficiency. You will be trying to build these 20 megawatt data centers in maybe footprints or facilities that weren't designed initially for those workloads. So yes, I think total energy costs may go up as a result.
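
The PUE argument above can be made concrete. PUE (power usage effectiveness) is total facility energy divided by IT equipment energy, so facility energy is IT energy multiplied by PUE. The sketch below compares an all-centralized scenario with the 80/20 edge and cloud split from the conversation. The PUE values are hypothetical placeholders, and the sketch assumes the same IT energy per query at every tier, which may not hold in practice.

```python
def facility_energy(it_energy: float, shares: dict[str, float],
                    pue: dict[str, float]) -> float:
    """Total facility energy for an IT load split across tiers.

    shares maps tier name to its fraction of the IT load (must sum to 1);
    pue maps tier name to that tier's power usage effectiveness.
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(it_energy * share * pue[tier] for tier, share in shares.items())


# Assumed PUE values for illustration only, not measured figures.
ASSUMED_PUE = {"hyperscale": 1.1, "edge": 1.5}

if __name__ == "__main__":
    centralized = facility_energy(100.0, {"hyperscale": 1.0}, ASSUMED_PUE)
    distributed = facility_energy(
        100.0, {"edge": 0.8, "hyperscale": 0.2}, ASSUMED_PUE)
    print(f"all centralized:   {centralized:.1f} units")
    print(f"80% edge, 20% hub: {distributed:.1f} units")
```

Under these assumed values, the same 100 units of IT energy cost 110 units at the facility level when fully centralized and 142 units in the distributed scenario, which illustrates the direction of the effect rather than its real size.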

Types of Inference Workloads and Location

We're talking about inference workloads to some extent as a monolith. I'm sure they are not. So are there big distinctions in your mind in terms of the different types of inference workloads, and how does that influence where they should be housed?

Right. Yes. So that's a really great question, actually. I would say that there are fundamental limits to the number of inference queries a human user can actually produce to query the models, because we're ultimately limited by the speed of our typing. So there is some of that, where humans will continue to send requests to agents. But I think increasingly most of the inference workload will come from other software agents.

This could be a search engine retrieving web pages and then asking the large language model to summarize them into a coherent discussion for you. This could be your photo app learning something about your images, or this could be your mail app doing something with your mail and helping you compose messages. So all of that is done behind the scenes. And those inference workloads are potentially much larger, because of course software can generate those requests at a much, much higher rate.

From the perspective of where that computation happens: to the extent that the data center already has servers running your mail workloads, or to the extent that your search engines are already running in the same data center, the communication to the model will be a bottleneck, right? So if you have a data center in Nebraska running your search engine for you, or doing some of these other heavy software jobs, then potentially they could query and execute inference in these largest hyperscale data centers.

All right, Ben, this was super interesting. Really appreciate your time.

It was my pleasure. I really enjoyed the conversation. Thanks so much.

Dr. Ben Lee is a professor of electrical engineering and computer science at the University of Pennsylvania. He's also a visiting researcher at Google. This show is a production of Latitude Media. You can head over to latitudemedia.com for links to today's topics. Latitude is supported by Prelude Ventures. This episode was produced by Daniel Waldorf, mixing and theme song by Sean Marquand. Stephen Lacey is our executive editor. I'm Shayle Kann, and this is Catalyst.

This transcript was generated by Metacast using AI and may contain inaccuracies.