¶ The Critical Role of Systems in AI
Continued innovation in systems and efficiency and cost are going to be crucial to drive the next generation of AI advances. The last 10 years have been huge for deep learning and AI and primary reason for that has been the significant advance in... both hardware, you know, in terms of emergence of GPUs and so on, as well as software infrastructure to actually parallelize jobs, you know, run large distributed jobs efficiently and so on.
And if you think about the theory of deep learning, people knew about backpropagation, about neural networks, you know, 25 years ago. And we largely use very similar techniques today. But why have they really taken off in the last 10 years? The main catalyst has been...
sort of advancement in systems. And if you look at the trajectory of current deep learning models, the rate at which they are growing larger and larger, systems innovation will continue to be the bottleneck in sort of determining the next generation of advancement in AI. Welcome to the Microsoft Research India podcast, where we explore cutting-edge research that's impacting technology and society. I'm your host, Sridhar Vedantam.
Artificial intelligence, machine learning, deep learning and deep neural networks are today critical to the success of many industries. But they are also extremely compute intensive and expensive to run in terms of both time and cost. and resource constraints can even slow down the pace of innovation. Join us as we speak to Moten Sivathanu, Partner Research Manager at Microsoft Research India.
about the work he and his colleagues are doing to enable optimal utilization of existing infrastructure to significantly reduce the cost of AI. So, Muthayan, welcome to the podcast and thanks for making the time for this. Thanks, Rida. Pleasure to be here. And what I'm really looking forward to, given that we seem to be in some kind of final stages of the pandemic, is to actually...
be able to meet you face-to-face again after a long time. Unfortunately, we've had to again do a remote podcast, which isn't all that much fun. Right, right. Yeah, I'm looking forward to the time when we can actually do this again in office. Yeah. Okay, so let me jump right into this. You know, we keep hearing about things like AI and deep learning and deep neural networks and so on and so forth.
What's very interesting in all of this is that we kind of tend to hear about the end product of all this, which is kind of... you know, what actually impacts businesses, what impacts consumers, what impacts the healthcare industry, for example, right, in terms of AI. It's a little bit of a mystery, I think, to a lot of people as to how all this works, because what goes on behind the scenes to actually make AI work.
is generally not talked about. So before we get into the meat of the podcast, you just want to speak a little bit about what goes on in the background? Sure. So machine learning, Sridhar, as you know, and deep learning in particular, is essentially about learning patterns from data.
¶ Unpacking AI's Intensive Compute Demands
right and deep learning system is fed a lot of training examples you know examples of input and output and then it automatically learns a model that fits that data right and this is typically called the training phase so training phase is where it takes data builds a model out of it. Now, what is interesting is once this model is built, which was really meant to fit the training data, the model is really good at answering queries on data that it had never seen before.
And this is where it becomes useful. These models, you know, are built in various domains. You know, it could be for recognizing an image, for converting speech to text and so on. Right. And, you know, what has... in particular, happened over the last 10 or so years is that there has been significant advancement both on the theory side of machine learning.
which is new algorithms, new model structures that do a better job at fitting the input data to a generalizable model, as well as rapid innovation in systems infrastructure, which... actually enable the model to sort of do its work, which is, you know, very compute intensive in a way that's actually scalable, that's actually feasible, economically, you know, cost effective and so on.
Okay, Muthian, so it sounds like there's a lot of compute actually required to make things like AI and ML happen. Can you give me a sense of what kind of resources or how intensive the resource requirement is? Yeah. So the resource usage in a machine learning model is a direct function of how many parameters it has. So the more complex the data set, the larger the model gets.
correspondingly requires more compute resources. To give you an idea, the early machine learning models, which perform simple tasks like recognizing digits and so on, they could run on a single server machine in... a few hours. But models now, you know, just over the last two years, for example, the size of the largest model that's useful, that state-of-the-art, that achieved state-of-the-art accuracy, has grown by nearly three orders of magnitude.
right and what that means is today to train these models you need thousands and thousands of servers and that's infeasible so accelerators or GPUs have really taken over over the last six, seven years. And GPUs, a single V100 GPU today, a Volta GPU from NVIDIA, can run about 140 trillion operations per second.
And you need several hundreds of them to actually train a model like this. And they run for months together to train a 175 billion model, which is called GPT-3 recently. You need on the order of thousands of such GPUs. And it still takes... a month. A month? That sounds like a humongous amount of time.
Exactly, right? So that's why I think just as I told you how the advance in the theory of machine learning in terms of new algorithms, new models, structures, and so on, have been crucial to the recent advance in the relevance. practical utility of deep learning. Equally important has been this advancement in systems, right? Because given this huge explosion of compute demands that these workloads place, we need fundamental innovation in systems to actually keep pace.
make sure that you can train them in reasonable time, you can actually do that with reasonable cost. Right. Okay. So, you know, for a long time, I was generally under the impression that if you wanted to run bigger and bigger models and bigger jobs, Essentially, you had to throw more hardware at it because at one point hardware was cheap, but I guess that kind of applies only to the CPU kind of scenario, whereas the GPU scenario tends to become really expensive, right?
Okay, so in which case, when there is basically some kind of a limit being imposed because of the cost of GPUs, how does one actually go about tackling this problem of scale?
¶ Tackling AI Scale and Cost Challenges
Yeah. So, you know, the high-level problem ends up being you have... limited resources so let's say you can view this in two perspectives right one is from the perspective of a machine learning developer or a machine learning researcher who wants to build a model to accomplish a particular task right so from The perspective of the user, there are two things you need. A, you want to iterate really fast, right? Because deep learning, incidentally, is this...
special category of machine learning where the exploration is largely by trial and error. So if you want to know which model actually works, which parameters or which hyperparameter set actually gives you the best accuracy, the only way to really know for sure is to train.
the model to completion, measure accuracy, and then you would know which model is better, right? So as you can see, the iteration time, the time to train a model to run inference on it directly impacts the rate of progress you can achieve. The second aspect that the machine learning researcher cares about is cost. You want to do it without spending a lot of dollar cost. Now, from the perspective of, let's say, a cloud provider who...
runs this huge form of GPUs and then offers this as a service for researchers, for users to run machine learning models, their objective function is cost, right? So to support a given workload, you need to support it with... as minimal GPUs as possible. Or in other words, if you have a certain amount of GPU capacity, you want to maximize the utilization, the throughput you can get out of those GPUs. And that's where a lot of the work we've been doing at MSR has focused on.
how do you sort of multiplex lots and lots of jobs onto a finite set of GPUs while maximizing the throughput that you can get from them. Right. So I know you and your team have been working on this problem for a while now. Do you want to share with us some of the key insights and some of the...
results that you've achieved so far. Because it is interesting, right? Schedulers have been around for a while. It's not that there aren't schedulers. But essentially what you're saying is that the... schedulers that exist do not really cut it given the
intensity of the compute requirements as well as the size of the jobs and models that are being run today in terms of deep learning or even machine learning models, right? That's right. So what are your key insights and what are some of the results that you guys have achieved?
¶ Gandiva: Optimizing GPU Scheduling for Deep Learning
So you raise a good point. I mean, schedulers for distributed systems have been around for decades. But what makes deep learning somewhat special is that it turns out, in contrast to traditional schedulers, which... have to view a job as a black box because they are meant to run arbitrary jobs. There is a limit to how efficient they can be. Whereas in deep learning, first of all, because deep learning is...
such high impact area with lots and I mean, from an economic perspective, there are billions of dollars spent in these GPUs and so on. So there is enough economic incentive to extract. the last bit of performance out of these expensive GPUs, right? And that lends itself into this realm of what if we co-design, what if we custom design a scheduler for the specific case of deep learning?
And that's what we did in the Gandiva project, which we published at OSDI in 2018. What we said was, instead of viewing a deep learning job as just another distributed job, which is opaque to us, let's actually exploit some key characteristics that are unique to deep learning jobs, right? And one of those characteristics is that, you know, although, as I said, single deep learning training job can run for days or even months, right?
Deep within, it is actually composed of millions and millions of these what are called mini batches. So what is a mini batch? A mini batch is an iteration in the training where it reads... uh one set of you know input training examples runs it through the model and then back propagates the laws and essentially changes the parameters to fit that input right and
This sequence, this mini-batch repeats over and over again across millions and millions of mini-batches. And what makes it particularly interesting and relevant from a systems optimization viewpoint is that from a resource usage perspective, And from a performance perspective, mini-batches are identical. They may be operating on different data in each mini-batch, but the computation they do is pretty much identical. And what that means is we can look...
at the job for a few mini-batches, and we can know what exactly it's going to do for the rest of its lifetime. And that allows us to, for example, do things like, we can automatically decide which hardware generation.
is the best fit for this job because you can just measure it in a whole bunch of hardware configurations or when you're distributing the job, you can compare it across a whole bunch of parallelism configurations and you can automatically figure out this is the right configuration, right hardware assignment for this particular job. couldn't do in an arbitrary job with a distributed scheduler because
the job could be doing different things at different times, like a MapReduce job, for example. It would keep fluctuating across how we'd use a CPU, network, storage, and so on. Whereas with deep learning, there is this remarkable repeatability and predictability. What it also allows us to do is we can then look within a mini-batch what happens. And it turns out one of the things that happens is if you look at the memory usage, how much GPU memory the training loop itself is consuming.
Somewhere at the middle of a mini-batch, the memory peaks to almost fill the entire GPU memory, right? And then by the time the mini-batch ends, the memory usage drops down by like a factor of anywhere between 10 to 50x.
And so there is this sawtooth pattern in the memory usage. And so one of the things we did in Gandiva was propose this mechanism of transparently migrating a job. So you should be able to... on-demand checkpoint a job, the scheduler should be able to do it and just move it to a different machine. maybe even essentially a different GPU, different machine, and so on, right? And this is very powerful from load balancing. Lots of scheduling things become easy if you do this.
Now, when you're doing that, when you're actually moving a job from one machine to another, it helps if the amount of state you need to move is small, right? And so that's where this awareness of mini batch boundaries and so on helps us because now you can... choose when exactly to move it so that, you know, you move 50x smaller amount of state. Right. Very interesting.
¶ Quiver: High-Throughput Storage for AI
Another part of this whole thing about resources and compute and all that is I think the demands on storage itself, right? Yeah. Because if the models are that big that you need some really high-powered GPUs to compute, how do you manage the storage requirements? Right, right. So it turns out the biggest requirement from storage that deep learning poses is on the throughput that you need from storage, right? So as I mentioned, because GPUs are the most expensive resource in this whole
infrastructure stack, the single most important objective is to keep GPUs busy all the time. You don't want them idling at all. What that means is... the input training data that the model needs in order to run its mini batches, that has to be fed to it at a rate that is sufficient to keep the GPUs busy. And GPUs process... I mean, the amount of data that a GPU can process from a compute perspective has been growing at a very rapid pace, right? And so what that means is...
you know, when between a Volta series and an Ampere series, for example, of GPUs, there is like a 3x improvement in compute speed, right? Now, that means the storage bandwidth should keep up with that pace otherwise a faster gpu doesn't help it will be stalling on io so in that context you know one of the systems we built was a system called quiver where we say
A traditional remote storage system, like the standard model for running this training is the data sets are large. I mean, the data sets can be in terabytes. So you place it on some remote cloud. storage system, like Azure Blob or something like that. And you read it remotely from whichever machine does the training. And that bandwidth simply doesn't cut it because it goes through network backbone switches and so on. And it becomes insanely expensive to sustain that level of bandwidth.
from a traditional cloud storage system, right? So what we need to achieve here is hyperlocality. So ideally, the data should reside on the exact machine that runs the training, then it's a local read, and it has to reside on SSD and so on. So you need several gigabytes per second read bandwidth. And this is to reduce network latency.
Yes, this is to reduce network latency and congestion. When it goes through lots of back-end like T1 switches, T2 switches, etc., the end-to-end throughput that you get across the network is not as much as what you can get locally.
So ideally, you want to keep the data local in the same machine. But as I said, for some of these models, the data set can be in tens of terabytes. So what we really need is a distributed... cache so to speak right but a cache that is locality aware so what we have is a mechanism by which within each locality domain like a rack for example we have a copy of the entire training data
So a rack would comprise of maybe 20 or 30 machines. So across them, you can still fit the training data. And then you do peer-to-peer across machines in the rack for the access to the cache. And within a rack network... Bandwidth is not a limitation. You can get nearly the same performance as you could from local SSD. So that's what we did in Quiver. And, you know, there are a bunch of challenges here because, you know, if every model wants the entire training data...
to be local, to be within the rack, then there is just no cache space for keeping all of that. So we have this mechanism by which we can... transparently share the cache across multiple jobs or even multiple users without compromising security. And we do that by sort of intelligent content addressing of the cache entries so that even though two users may be accessing different copies of the cache,
same data. Internally in the cache, they will refer to the same instance. Right. I was actually just going to ask you that question about how do you maintain security of data, given that you're talking about... distributed caching, right? Because it's very possible that multi-user jobs will be running simultaneously. But that's good. You answered it yourself. So, you know, I've heard you...
¶ Micro Co-design for Infrastructure Optimization
speak a lot about things like micro design and so on. How do you bring those principles to bear in these kind of projects here? Right, right. So I alluded to this a little bit in one of my earlier...
points, which is the interface. I mean, if you look at a traditional scheduler, which views the job as a black box, right? That is an example of a traditional... philosophy to system design where you build each layer independent of the layer above or below it right so that there are good reasons to do it because you know like multiple use cases can
use the same underlying infrastructure. Like if you look at an operating system, it's built to run any process, whether it is office or a browser or whatever, right? But... In workloads like deep learning, which place particularly high demands on compute and that are super expensive and so on, there is a benefit to sort of relaxing this tight layering to some extent.
So that's the philosophy we take in Gandiva, for example, where we say the scheduler no longer needs to think of it as a black box. It can make use of internal knowledge. It can know what mini-batch boundaries are. It can know that mini-batch times are repeatable. and stuff like that, right? So, core design is a philosophy that has been gaining traction over the last several years. And people typically refer to hardware, software core design, for example.
What we do in micro co-design is sort of take a more pragmatic view to co-design where we say, look, it's not always possible to rebuild entire software layers from scratch to make them more tightly coupled. But the reality is, in existing large systems, we have these software stacks, infrastructure stacks.
What can we do without rocking the ship, without essentially throwing away everything and building away everything from a clean slate? So what we do is very surgical, carefully thought through interface changes that allow us... to expose more information from one layer to another. And then, you know, we also introduce some control points which allow one layer to control. For example, the scheduler can have a control point to ask a job to suspend. And it turns out by...
Opening up those carefully thought through interface points, you leave the bulk of the infrastructure unchanged, but yet achieve these efficiencies that result from richer information and richer control. So micro-co-design is something we have been adopting not only in Gandiva and Quiver, but in several other projects in MSR.
And micro stands for, you know, minimally invasive, cheap and retrofitable co-design. So it's a more pragmatic view to co-design in the context of, you know, large cloud infrastructures. Right, where you can do the co-design with the minimum disruption to the existing systems. That's right. Excellent.
¶ Real-World Impact and AI Cost Reduction
We have spoken a lot about the work that you've been doing, and it's quite impressive. Do you have some numbers in terms of, you know, how jobs were run faster? savings of any nature. Do we have any numbers that you can share with us? Yeah, sure. So the numbers, as always, depend on the workload and several aspects. But I can give you some examples.
So in the Gandiva work that we did, we introduced this ability to time-slice jobs, right? So the idea is today when you launch a job in a GPU machine, that job essentially holds on to that machine until it completes. And until that time, it has exclusive possession of that GPU. No other job can use it, right?
And this is not ideal in several scenarios. You know, one classic example is hyperparameter tuning, where you have a model and you need to decide, you know, what exact hyperparameter values like learning rate, etc. actually are the best fit and give the best.
accuracy for this model. So people typically do what is called a hyperparameter search, where you run maybe 100 instances of the model, see how it's doing, maybe kill some instances, spawn off new instances, and so on. And hyperparameter exploration really benefits from You want to run all these instances at the same time so that you have an apples to apples comparison of how they are doing.
And if you want to run like 100 configurations and you have only 10 GPUs, that significantly slows down hyperparamide exploration. It serializes it, right? What Gandiva has is an ability to... perform fine-grained time slicing of the same GPU across multiple jobs. Just like how, you know, an operating system time slices multiple processes, multiple programs on the same CPU, we do the same in GPU context, right? And because we make use of mini-batch boundaries.
so on, we can do this very efficiently. And with that, we showed that, you know, for typical hyperparameter tuning, we can sort of speed up the end-to-end time to accuracy by nearly 5 to 6x. Right. And so this is one example of how time slicing can help. We also saw that from a cluster-wide utilization perspective, some of the techniques that Gandiva adopted can improve overall cluster utilization by 20% to 30%.
And this directly translates to cost incurred to the cloud provider running those GPUs because it means... with the same GPU capacity, I can serve 30% more workload or vice versa, right? For a given workload, I only need 30% lesser number of GPUs. Yeah. I mean, those savings sound huge. And I think you're also, therefore, talking about reducing the cost of AI, making the process of AI itself more efficient.
That's correct. That's correct. So the more we are able to extract performance out of the same infrastructure, the cost per model or the cost per user goes down. And so the cost of AI reduces. And for large companies like Microsoft or Google, which have first-party products that require deep learning, like Search and Office and so on, it reduces the capital expenditure.
running such clusters to support those workloads. Right. And we've also been thinking about areas such as, today there is this limitation that large models need to run in really tightly coupled hyperclusters which are connected via infinite band and so on and that brings up another dimension of cost escalation to the equation because, you know, these are sparse, you know, the networking itself is expensive, there is fragmentation across hyperclusters and so on.
What we showed in some recent work is how can you actually run training of large models in just commodity VMs? These are commodity GPU VMs, but without any requirement on them being part of the same infinite band cluster or hyper cluster, but just they can be scattered anywhere in the data center.
More interestingly, we can actually run these off of spot VMs. So Azure, AWS, all cloud providers provide these bursty VMs or low-priority VMs, which is a way essentially for them to sell spare capacity. So you get them at a significant discount, maybe 5 to 10x cheaper price.
And the disadvantage, I mean, the downside of that is they can go away at any time. They can be preempted when, you know, real demand shows up. So what we showed is, you know, it's possible to train such massive models at the same performance. despite these being on spot VMs and spread over a commodity network without custom InfiniBand and so on. So that's another example how you can bring down the cost of AI by reducing constraints on what hardware you need.
¶ Systems Innovation Drives Future AI Advances
Muthian, we are kind of reaching the end of the podcast. And is there anything that you want to leave the listeners with based on your insights and learning from the work that you've been doing? Yeah, so taking a step back, right? I think... Continued innovation in systems and efficiency and cost are going to be crucial to drive the next generation of AI advances.
And the last 10 years have been huge for deep learning and AI. And primary reason for that has been the significant advance in both hardware, in terms of emergence of GPUs and so on, as well as software infrastructure to actually parallelize jobs, run large distributors. jobs efficiently and so on. And if you think about the theory of deep learning, people knew about backpropagation, about neural networks, you know, 25 years ago.
And we largely use very similar techniques today. But why have they really taken off in the last 10 years? The main catalyst has been sort of advancement in systems. And if you look at the trajectory of current deep learning models, the rate at which, you know, they are growing larger and larger. systems innovation will continue to be the bottleneck in sort of determining the next generation of advancement in AI.
Okay, Muthi, and I know that we are kind of running out of time now, but thank you so much. This has been a fascinating conversation. Thanks, Reeder. It was a pleasure. Thank you.
