Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes

Tobias Macey

00:11

Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Maci, and today I'm interviewing Robert Nishihara about the challenges of maximizing the utility of your available hardware for AI and data intensive applications.

Robert Nishihara

00:26

Robert, can you start by introducing yourself? Thanks for having me on. I'm Robert. I'm one of the co founders of AnyScale, and we are commercializing Ray, which is an open source project distributed system that a lot of companies use to scale and run compute intensive AI workloads ranging from model training to training data preparation,

00:49

inference, reinforcement learning. We can go into a lot more detail on all of those, but this is an open source project that we started as grad students at Berkeley, and then started any scale to commercialize.

Tobias Macey

01:01

And do you remember how you first got started working in the data and AI space?

Robert Nishihara

01:05

Yeah, well, I actually got my start more with AI research, more on the theoretical side of things. This was just around the time that deep learning was taking off. If you remember in 2012, 2013, deep learning was delivering amazing results, groundbreaking results in computer vision, and the whole world was It felt like AI was just exploding at the time, so that's when I got into AI research.

01:35

I had no background or interest in distributed systems at the time, on the infrastructure side of things. My interest was really more on the algorithmic side. Can we come up with better algorithms for learning from data, for training these models? Can we develop better general purpose techniques for learning? So at the time, we were doing research on deep learning training methods, optimization algorithms, reinforcement learning algorithms.

02:00

And the bottleneck that we faced, or one of the bottlenecks, was that in order to AI is very empirical, so if you come up with a new algorithm, in order to validate whether it's good or not, you have to try it out and see how it performs in practice. Often need to really convince yourself you need to not just try it out on a small toy problem, you need to try it out at some meaningful scale with a largish model, a large amount of data, and see if it can really solve the problem.

02:29

In order to run those experiments, you need to scale your algorithm across a bunch of machines, across a bunch of GPUs, or you have to And so we found ourselves spending all of our time I'm talking about my fellow grad students and myself. We were spending all of our time building tools for managing clusters, for handling machine failures, like moving data across machines, getting stuff to run on cheaper spot instances or on GPUs, and we thought, Wow, surely there's

03:00

some reusable tooling that can be built here so that not everyone has to build their own tools and redo all of this all the time. And that led us to start RAID. Basically, we thought that AI was taking off and that the need for scale was only going to grow, right? That the era of doing machine learning research on a single machine was going to be a thing of the past, and so the need for scale and the degree of scale was only going to grow.

03:26

And that was just going to introduce a lot of hard systems and infrastructure challenges, and so there 's a big opening to try to build something useful there. Initially, we were just building it for ourselves, but we thought, Hey, distributed computing is really hard. We'd love to build tools that are useful for a lot of people. That's kind of how we got started. And it ended up being very prescient

Tobias Macey

03:47

and also coincided fairly closely with the introduction and rise of Kubernetes. I'm wondering what are some of the evolutions from when you first started Ray and introduced it as an available project through to where we are now where we have moved beyond

04:06

the niche aspect of these deep learning. Despite the popularity that it grew to, it was still fairly bespoke in terms of who was using it to LLM training and inference, which has gained much broader adoption, and the distributed compute ecosystem has also evolved substantially. Just wondering if you can talk through some of that growth and how Ray has taken advantage of it.

Robert Nishihara

04:30

Yeah, and you mentioned far more people are building models today than before, and that is really going to accelerate with the rise of coding agents, which it's still, even today, it's hard to build models and make use of your data.

04:45

As coding agents like Cursor and Cloud Code and others reduce that barrier and the amount of expertise required to really make use of your data and train models, I think far more people are going to do it. And you're right, the landscape has changed a tremendous amount. Kubernetes was not anything like what it is today, the standard it is today.

05:09

Today, the vast majority of Ray users are using Ray on top of Kubernetes, and Kubernetes has really just emerged as the dominant standard for container orchestration and managing provisioning compute and managing container life cycles across different clouds. And that was not the case when we started Ray.

05:32

We were in a distributed systems and AI lab at Berkeley, and so there was a lot of The people who had created Apache Spark were in that lab, and so we had a lot of experience with other distributed systems, and we launched a bunch of Spark clusters, and the standard way to launch Spark clusters was not on Kubernetes, it was to run this script that another grad student had written to directly talk to EC2 and spin up virtual machines and SSH to them and install all the relevant stuff, and

06:04

that has changed quite a bit. But I would say the change you've seen over those years is the shift from a high degree of fragmentation to consolidation of the infrastructure tech stack. And Kubernetes is one example. There used to be a number of container orchestrators and ways of doing container orchestration. At another layer, think about deep learning frameworks. Of course, you're familiar with PyTorch and TensorFlow. There used to be Theano was a popular one, Torch.

06:38

Actually, folks at Berkeley created Caffe, which was one of the early popular deep learning frameworks. There were a dozen of these. There are a ton of different deep learning frameworks, and now it is primarily PyTorch. So what often happens when you have a new use case or new workload, like the emergence of AI, there's a proliferation of different frameworks and then consolidation,

07:06

because you tend to have a standard emerge over time that gains momentum that many people are using and contributing to, especially with open source. Open source really lends itself to the emergence of a standard. And so the PyTorch layer that we see, the Kubernetes layer, those are really

07:28

some of the most important layers that we see, pieces of software that we see people using together with Ray. And so typically, our users are not just using Ray on its own, they're using Ray plus PyTorch plus Kubernetes, often plus a VLLM or SGLANG to implement their overall workloads.

Tobias Macey

07:48

And I know too that one of the early standout features of Ray was the RayTune library, which was focused on hyperparameter tuning, which was a very frequent topic of conversation and one that I hear a lot less now, particularly because there are so many more of them.

08:08

And I'm just wondering if you can talk to some of the ways that the focus of Ray has shifted from when you first started it and in those early days of deep learning growth and adoption to where we are now, where deep learning is still valuable and there are areas of utility for it, but the majority of the focus is on these transformer based language models or vision models?

Robert Nishihara

08:31

Yeah, hyperparameter tuning itself is a fun area to talk about, and then I'll come back to how Ray relates to all of that. But was always hard to difficult to optimize neural networks, because there are many choices you have to make. Choices around the architecture, choices of the parameters of the optimization algorithm, learning rates, things like dropout and so forth. And if you got the parameters a little bit wrong, it was easy for the optimization

09:04

to not work. And so there were a lot of papers written and a lot of experiments done to figure out the best way to just search over these different parameters. Hyperparameter search is basically train the model a bunch of times with different settings of the parameters and see what works the best. And you can be more clever about that, you can be less clever about that and search randomly. In fact, random search is a very strong baseline.

09:33

And of course that lends itself well to Ray because if you're training the same model a bunch of times, it's a compute intensive workload, you're using a bunch of things in parallel,

09:46

and so you need a good way to express that. And it can get more complex than just run a bunch of experiments in parallel, because you're often looking at the results of some of the experiments, stopping them early, investing more resources in the promising ones, spinning up new experiments based on what worked well or what worked poorly, but it's not that complicated. Now, what has changed in hyperparameter search? First, a few things have changed.

10:14

One is we're starting to train really big models, and you just, for your big run, you can't actually run 100 copies of the training run, because you just don't have the compute resources to do that. So that is no longer viable in a lot of cases. Second, we came up with a lot better initialization techniques for neural networks.

10:39

Lot of the problems that hyperparameter search was solving was just not knowing how to initialize the weights of the neural network or how to set learning rates and things like that, and that has been a lot more best practices have emerged around that, so you can get it right more frequently. And the third thing is where the emphasis is now is not about run a bunch of experiments and take the best result. It's really more, I'm only going to do one big training run.

11:11

So I need to get that one right. I can do a bunch of small scale experiments first. And so how do I do small scale experiments and learn how to set the parameters at that small scale and configure things properly so that when I do my big run, will work? And so there's a progression of run more experiments at a small scale, use that to set a smaller number of experiments at the next scale, and gradually go up in scale, but reduce the number of experiments,

11:44

and use what you learned at the smaller scale to ensure that the larger scale runs go well. So there's a lot of shift in perspective of how to do what it means to do hyperparameter search. Now, just to say hyperparameter search is one use case that people used Ray for. Another very early

12:03

use case for Ray was reinforcement learning. This was actually kind of the motivating use case because the whole world was excited about this previous wave of reinforcement learning with Atari and MuJoCo and AlphaGo, and we wanted to do research on those types of algorithms. And now, of course, we can talk about how the use cases we see people running with Ray have shifted, but

12:29

reinforcement learning, of course, has made a comeback. And now we see a ton of people using Ray for reinforcement learning for post training LLMs. And one of the actually, just the other day, Cursor released their Composer two model, which uses a lot of reinforcement learning to build a great coding model. And they use Ray for all of that for basically the RL infrastructure.

Tobias Macey

12:57

Then in order to be able to build these models, whether you're doing a from scratch training run or you're doing fine tuning of the model, there's also a lot of data preparation and data manipulation involved, which Ray is also very well situated for, particularly given the fact that it's easily integrated into the broader Python ecosystem, which has become the de facto language for a lot of these workloads. And for people who are building these data workflows, data pipelines,

13:28

there are numerous tools to choose from. I'm just wondering if you can give some of the ways that you think about the juxtaposition of Ray versus something like an orchestrator like Airflow or Dagster or a distributed compute system like Spark and some of the selection criteria for when somebody would use Ray for a particular use case versus reaching to some of the other tools that are more of these, I'm going to say, pipeline native workflows?

Robert Nishihara

13:56

Yeah, that's a great question. So data preparation, training data preparation and preprocessing is something that has changed so much since the early days of deep learning. If you remember early on when people were training deep learning models, you would preprocess your data, but the preprocessing you would do was very limited. You might in the case of ImageNet, a computer vision data set, you might try to

14:24

take each data point and generate multiple data points to augment your data set. And you could do that by taking each image and randomly cropping or scaling it to get some more versions of that same image. Going back to that point in time, all of the research was on model architecture. The ImageNet dataset was a fixed dataset and split into your training set and your test set, so it's this fixed benchmark.

14:58

All of the research was, how do I choose the best optimization algorithm and the best neural network architecture so that I can when I train that on my training set and then test it on my test set, I get the best results, get the best score. That has almost entirely flipped. Of course, there's still interesting research happening on model architectures, but people have largely converged on the transformer architecture.

15:24

And optimization algorithms are largely variations of stochastic gradient descent, and despite tons and tons of research in that area. And people have realized that really getting great results comes down to getting the data right.

15:43

And so there was this maybe previous mental block where the dataset was treated as static. It was treated as kind of a given that you don't optimize, and now that's the thing you really optimize over, and you invest money in collecting data, and you spend a lot of your experimentation and compute budget

16:02

on curating the data. And so now we see it's not simple data cleaning where you just strip trailing white space from your sentences and you crop your images so they're all the same size and things like that, and normalize your data. You are really doing a ton of experimentation and actually running a ton of models to filter out low quality data, to augment your data with high quality synthetic data, and annotate your data. For example,

16:32

often using models to generate a lot of structure in your data. For example, if you have a data set of images or videos, you might use a vision language model to generate captions for those videos and images, and then to later use in training. You may compute embeddings. It's very common to run dozens of classifiers in your data preparation process so that you can use those tags, you can filter out subsets of your data to train on specific subsets, and you can filter out low quality data.

17:08

So it's very common to see data curation pipelines that involve dozens of stages of filtering and annotation and are really, really complex. It's just enormously different from how people did prepared training data in the past. So a lot of things have changed about this data preparation stage. Some of the big differences are that it's now model driven and GPU driven instead of CPU driven.

17:38

Basically, are More and more data processing is shifting to GPUs because more and more data processing is being done with inference, and that is a massive shift. So you ask about how does Ray relate to the data world, and when would you use Ray versus Apache Spark, or when would you use Airflow and workflow orchestrators like that? Let me start with the workflow orchestrators, because those are actually quite complementary with Ray. They operate at different levels of granularity.

18:13

Of course, there's always overlap, but I think of Ray as running an individual workload. I'm running my data processing pipeline, and it has this op stage where I download the data and decompress the videos, and then I stream that into some CPU based heuristic filtering, and then I stream that into some model based filtering, which judges aesthetic score and

18:43

filters out explicit content or these kinds of things. And then this streams into some vision language model stage where I'm running using transformers, and then maybe I'm writing out to Parquet or writing to some blob storage.

18:56

And Ray would be used to express this workload, assign different compute resources to each stage of computation, manage the processes sort of that are executing each operation, stream data between them, handle back pressure if one stage is too slow, handle failures if one process dies and needs to be recreated, handle auto scaling of the compute resources to match the throughput of all the different operations and things like that, really solving these core distributed computing challenges.

19:32

Now, that data processing workload might just be one part of some overall broader workload. For example, maybe you want to set up something like you're collecting new data every day. You're a robotics company. You have robots out in the real world that are streaming video back, so you've got new data coming in every day. Every day, you want to run that data processing job to curate your data. Once that finishes, you want to take the curated data, maybe copy that over to a different cloud,

20:03

and then your Neo Cloud, where you run training, and then kick off a training job. That kind of coarse grained workflow orchestration is where you would use something like Airflow or a different one, and that's very compatible with Ray. You might use Ray to run the individual components of each stage. So we see Ray being used together with Airflow, Flight, and all of these different engines. The Spark comparison is interesting. So there are First, the high order thing, of course, is that

20:34

Ray is used for a huge range of workloads, and Spark is specific for big data processing. So Ray is also used for big data processing, but it's also used for reinforcement learning, inference, training, and stuff like that. But you can ask the question specifically about data processing. If I have a bunch of data and I need to transform and manipulate my data, when would I use Ray and when would I use Spark?

20:58

They are actually designed for very different use cases. So to oversimplify, Spark was built at a time when, with the emergence of big data, when people were not really thinking about GPUs. So if I to, if I have a bunch of tabular data, if I have a bunch of data that is nicely structured in tables, and I want to do analytics, I want to join a bunch of tables together and run SQL queries and do analytics like that, Spark is a fantastic choice. Where Spark starts to run into limitations

21:33

is when I have more multimodal data. Perhaps I'm working with images, video, robotic sensor data, all of these types of things. And the way that I manipulate this type of data, the way I get value out of this really unstructured, really multimodal data is not by running SQL queries on it. Instead, it is by running inference on it. So what am I gonna do with just a PDF or a video?

22:02

I'm going to feed it into a model. And that because that model is is able to understand it and and and reason about it. And so we're entering this regime where data processing is becoming

22:16

an inference heavy workload and is running on GPUs. You've still got CPUs, of course. You've got it's a mixture of CPUs and GPUs. And that heterogeneity, this scenario where you have tons of multimodal data and it is being processed both with inference and with regular processing on a mixture of CPUs and GPUs, that is a scenario where Ray is the best system for is really designed for using that for that kind of scenario.

22:45

And so to oversimplify, Spark is amazing for running SQL queries on tabular data, and Ray is amazing for running inference on multimodal data.

Tobias Macey

22:58

You mentioned GPUs being a critical hardware need for a lot of these data processing workloads. You mentioned the focus on inference that Ray has been investing in. And we also briefly touched on some of the distributed systems capabilities around Ray and Kubernetes and some of the overlap there.

23:21

And I think that all of that focuses in on an interesting question as well that a lot of people are dealing with right now is how do I actually get the most out of my hardware because these GPUs are super expensive. I wanna make sure that they're not just sitting idle when I could be putting them to some good use or deallocating them from my cloud environment if you're in a cloud and using some form of auto scaling.

23:43

I also know that, in particular, Kubernetes, their orchestration system is optimized for a different style of use case than what a lot of these high data throughput

23:55

workloads need. And so you might sometimes find yourself fighting with the Kubernetes orchestrator where it's saying, hey. You're done over there, and you're saying, no. I'm really not. And I'm just wondering if you could talk to some of those aspects of how teams are dealing with some of this question of resource optimization at the hardware level to be able to get the most out of their expensive compute.

Robert Nishihara

24:16

Yeah, that's a great question. I'll talk about the Ray Kubernetes relationship, and also about just really getting compute optimization, getting the best utilization. I want to say one more thing on the data topic, which is that I mentioned Ray is amazing for this world of GPU data processing, processing data with inference, and multimodal data. I think it's worth pointing out that all of this multimodal data was previously useless.

24:43

So if you think about And I'm exaggerating a little bit, but think about all the random PDFs and documents sitting in your organization, or video recordings of meetings, or of sales calls, or audio recordings of things. You would store that data and then not do anything with it. Think about how many meetings people have been in where you might record the meeting, but no one's going to go back and actually listen to it. And so that data was There's tremendous

25:14

value and information in all of this data. It's just really hard to manipulate and get insights out of it. So the thing that has changed is that AI is making it possible to programmatically analyze and manipulate all different types of data because you have powerful multimodal models. And so of course, reason Spark and other systems have been primarily working with tabular data is not that that's the only valuable data, but rather it's just the easiest to work with and to ask questions about.

25:49

Tabular data is really a tiny, it's a miniscule fraction of the world's data, And now that we can unlock value in all the rest of the data, we're going to start storing way more of it. We're going to start using it all the time, and this is going to be tremendously valuable for her. So I wanted to share that. Now, on your point about cost of GPUs, yes, GPUs are tremendously expensive. And so getting good utilization of your expensive resource is a hard problem, and it has to be solved.

26:27

And this is why a lot of companies have infrastructure teams that are responsible for managing all of the compute and making that compute available to their AI researchers and practitioners. So think about the challenge. It's also much harder at different scales. There are many different dimensions of complexity. So it's one thing if you have one AI person running

26:54

one training job on one cluster. But if I have five different Kubernetes clusters across different clouds, and I've got a team of dozens of researchers

27:05

that need to share that compute. All of a sudden, I need to solve for a few things. I need my researchers to be productive. I need them to be able to really move quickly, debug easily, run things at scale, search for capacity across different clouds, and take advantage of compute resources wherever they are so that they can be productive. I also need to get great utilization, right? So I need to be able to prioritize different workloads against each other.

27:33

I may have a big training run that needs all the GPUs and needs to run all at once. But when that finishes, I don't want the GPUs to just sit idle. So I might need some background elastic job that can just soak up all the unused compute and just expand elastically to fill the unused compute. And how do I set that up? How do I enable teams to run different workloads with different priorities and appropriately

27:58

share their compute resources among them. The kinds of challenges that we see these infrastructure teams needing to solve are, one, making their developers and their end users, researchers productive. Two is having a standardized interface to all of their compute so that you can easily plug in new sources of compute. That's important for two reasons. One is it lets you shop around for the cheapest GPUs and plug them in. And second,

28:24

it lets researchers and the users run their workloads everywhere, so that factors into the productivity point. So that's a standardized interface to your compute. Third is being able to really do a good job of assigning the right resources to the right workloads. And that will mean workload prioritization. It will mean elasticity of low priority workloads so that they can expand and make sure you always have

28:49

stuff running so that your GPUs are not sitting idle. And then fourth is being able to search for capacity across different regions and different clouds, which if you have bursty jobs, you're going back and reprocessing all of your data to compute some new feature, being able to quickly acquire a lot of capacity across a bunch of different regions is very helpful for that. So those are some of the in order to do a good job with GPU utilization,

29:17

those are all of some of the challenges you have to solve. That is in addition to Those are challenges that sit around what Ray does. Ray is responsible for running your training workload, your RL workload, your data workload, and running that individual workload and making sure it is performance and reliable and fast and cost efficient. And then everything else I just described

29:40

sits outside of the workload and has to be solved at a different layer. So you have to think about many different layers of the stack. Now, Ray and Kubernetes, you asked about the relationship between Ray and Kubernetes. They're highly complementary, and they sit at different layers of the stack. So for the AI infrastructure stack, I like to think about the PyTorch layer, the Ray layer, and the Kubernetes layer. All of this is the software stack that sits on top of your GPUs and cloud providers.

30:07

So each layer on its own is not sufficient. Each layer solves some fraction of the infrastructure problems that you need to solve. What PyTorch is responsible for is squeezing the most performance out of the model on the GPU, just running the model both for training and inference in the most performant way possible. And that is

30:30

not just PyTorch. There's a rich ecosystem around PyTorch. So think about frameworks like VLLM and SGLANG for optimizing inference with transformers, or frameworks like Megatron for training. But fundamentally, the responsibility of that layer is about squeezing the most performance out of the chips by running the model. Ray, we talked about a bunch. The responsibility at that layer is about solving the distributed computing challenges.

30:59

So this means process management, process lifecycle management, process coordination and communication, data movement, data ingest, failure handling, because a lot of the hardware is unreliable, resource allocation to different portions or components of your workload, stuff like that. And Kubernetes is responsible for container orchestration, so provisioning of the compute, managing container life cycles, and things like that. And those are all very complementary.

31:31

And we've also seen them co evolve with each other. So the Ray open source community and the Kubernetes open source community collaborate deeply to really share information and have the right interfaces and be able to optimize in a way that they couldn't without each other. For example, Kubernetes has the ability to resize containers, resize containers to add more memory or things like that. Kubernetes

32:02

on its own doesn't know when it's appropriate to resize the container. On the other hand, Ray is running the workload inside of those containers. And so Ray has knowledge of what the workload is and what its resource requirements are. And so Ray is actually in the perfect position to say, hey, we need more resources over here. But Ray is not managing the containers, and so is not able to execute that without Kubernetes'

32:25

help. And so there are lots of things like this where, by working together and co evolving, the different layers of the infrastructure stack are developing to work better together. And this is also especially true with Ray and VLLM, where the VLM community and the Ray community have worked very deeply together to make it possible to do performant cross node inference. So LLM inference gets far more complex when you have large models that span many machines.

32:58

If you have a single model on a single machine, that's a simpler scenario. But a large model that spans multiple machines is its own little distributed system. A single query to the model may touch many different experts your expert layers, and those experts may be sharded across different machines. And so the query is getting routed around in complex ways. Different parts of the computation may get disaggregated

33:23

into separate pools of compute. There are many things like this where Ray and VLM work together to enable complex forms of cross node parallelism.

Tobias Macey

33:34

The whole idea of sharding the models is also interesting, particularly for people who aren't deep in the weeds of the actual model architectures.

33:42

And I know that the models themselves, they present as being this monolithic thing when in reality, they're just a hierarchy of the actual neural layers. I'm wondering if you can maybe talk a little bit too to some of those challenges of being able to distribute the model effectively across different GPUs for the case where you can't fit it all on a single chip.

Robert Nishihara

34:02

This is an area that's become far more complex recently, because in the past, when you were scaling inference, you would just stick your model in a container, and then you would replicate the container however many times you need it. And it really just wasn't that complicated. If you need to scale more, you replicate the container more. Now you may have many containers and many machines that are running a single replica of the model.

34:31

And that is so questions like elasticity become a lot more complex, failure handling become a lot more complex, because you can't just think of failure handling at the container level. Locality matters. I mentioned when you are running a big model, you may separate So with transformers especially, there's this pre fill stage where you process the input tokens,

34:56

which might be more compute bound, GPU compute bound. Then you've got your decode stage where you're generating the output tokens one at a time, which might be more GPU memory bandwidth bound. And it might make sense to separate out those two stages into different pools of compute and assign different GPUs to each one. But there may be

35:16

certain shards of your prefill workers and certain shards of your decode workers that correspond to each other, and actually you want them to be co located. And so now you need to think about co locating containers in the same nodes.

35:29

And that never happened before when you were thinking about just, I have a single model and a single container and replicating it. This has especially gotten more complex with large mixture of expert models, where the expert stages are often sharded across a bunch of different GPUs and a single query needs to be routed around to different experts. So that is something that is, and there are many different

35:56

ways of sharding and partitioning your model. It is not like, oh, there's just one strategy that always works the best. So that is an area that's grown far more complex.

Tobias Macey

36:08

In order to be able to use and manage these systems effectively, obviously, there's also the question of observability and being able to understand what is being executed, how efficient it's being executed. I'm wondering if you can just talk to some of the leading reasons for wasted or inefficient compute.

36:30

We obviously talked a lot about some of the ways that Ray helps to alleviate that situation, but also just some of the organizational and team education that's required to make sure that they understand how best to effectively apply the capabilities of a framework like Ray to such a complex and multivariate problem space.

Robert Nishihara

36:51

Yeah. One of the biggest reasons, I would say, is not having a good way to share GPUs between training and inference. Fundamentally, and this is at the fleet organizational level,

37:02

not at the individual workload level. If I have training workloads and inference workloads, and I am partitioning my GPUs between them and trying to provision each one or inference for peak capacity, then there are going to be a lot of times when my inference GPUs are idle and not being used because we're not at the peak capacity.

37:26

And so being able to efficiently share GPUs between training and inference in a way that doesn't risk your most important production workloads, but allows excess capacity to be used by lower priority workloads is probably the number one thing. And that is complex to get right. That's the kind of problems that are solved by the MECL platform and the layers that sit around Ray. And I would say that's outside of the workload. Within the workload itself, there can be many different bottlenecks.

38:06

It's important to be able to get good performance within a single workload. It's very important to have the tools to tackle whatever bottleneck is you're running into at the moment. So what do I mean by that? I mentioned with in inference, might separate out the prefill and decode stages into separate pools of compute, because

38:29

they are different shaped workloads. They have different types of bottlenecks, and so it's natural that you might want different compute resources for each one into potentially different accelerators and to right size the compute for each stage. The same thing can be true of a data processing pipeline or a training pipeline. You might have stages that are GPU bound. You might have stages that are IO bound. You might have components that are memory bound.

38:57

And if you don't have a system that can support a large degree of heterogeneity and assigning, breaking down a workload into different pieces, assigning different compute resources to each piece, and rightsizing and scaling those resources appropriately, then you are going to likely have inefficiencies. And this is just to give an example with training. As you scale training on more GPUs,

39:25

it's very easy to become bottlenecked by the data ingest and preprocessing side of things. I might be using some It might be very CPU heavy processing. I might be using some models in my preprocessing right before it gets fed into training. Now, if I can't separate out that preprocessing into a different pool of compute and then scale that pool of compute independently from training, then I'm not going have a way to eliminate the data in just bottleneck.

39:55

And so this is really where Ray shines. It's giving you the control to separate out different pieces of your workload into different pools of compute of any different type, connect them all together, scale them independently, manage them independently, handle failures independently. And so when there is a bottleneck in one component of your workload, you can address that without everything being so tightly coupled. And that is, I would say, the key for eliminating

40:27

bottlenecks within a workload. So that's probably the biggest thing inside the workload. Then the biggest thing outside of the workload is sharing resources between workloads, especially training and inference.

Tobias Macey

40:41

Yeah, the sharing resources is one of the pieces that I think is most complex and probably least effectively executed by most teams because it requires so much of that coordination of understanding what the workloads are, which also brings up the question of, well, what if I have two different Ray clusters running or I have one system?

41:04

Maybe I just have an independently deployed VLLM that's using up a portion of my GPU, and then I have Ray using the other portion of that GPU for data preprocessing, and I'm just curious how you can give some level of visibility to Ray to be aware of the other workloads that are co located within that same pool of hardware.

Robert Nishihara

41:28

Yeah. This is a great example of how you can do a better job by co designing different layers of the stack, because the layer outside of Ray, the overall compute provisioning and orchestration layer, is the layer that's responsible for deciding which resources to allocate to which workloads, which nodes to preempt, things like that.

41:55

But to do a great job of that, you really want to know what's running inside the workload, and that's information that Ray has, information that exists at the distributed compute, like the workload layer. And so if you can combine those pieces of information, then you can do a much better job of preempting non critical portions of your workload that can be more easily scaled down or restarted later. And that is something that the kind of thing that we think about when co designing

42:28

Ray with these other layers of the stack. So you're calling out a very important problem.

Tobias Macey

42:34

As you have been working on and building the AnyScale company around it and just evolving along with this fast moving ecosystem, what are some of the most interesting or innovative or unexpected ways that you've seen Ray being applied and maybe some of the lessons that you've learned from these unexpected uses that have helped guide the future trajectory of the project?

Robert Nishihara

43:00

Yeah, it's been really interesting to see how the broad categories of workloads have largely remained the same. If you think about training, inference, data processing, but each one has evolved a ton. We talked about how inference has grown in complexity with large models and multi node models. We talked a bit about how the data processing side has evolved, to accommodate all this multimodal data and really these heterogeneous inference heavy pipelines.

43:37

On the training side, we've seen the evolution of, or the resurgence of reinforcement learning, which is far more complex than regular training because, well, it has all the complexities of regular training. You're still doing regular training, but you're also doing inference. Are also running To generate new data, you're running simulations or environments that are tightly coordinating with the inference, going back and forth to generate data.

44:06

You are shuffling data around. You're moving the data back to training. You're moving the new model weights over to the inference portion. You can have failures in each of these different stages.

44:17

Training can fail, of course, but as models, we have more powerful agentic models that can reason and take actions over long periods of time, you need to think about failures on the rollout side of things, the data generation side, because if rollout is happening over the course of ten hours or a day or multiple days, you really don't want to lose that progress. And so you have to handle failures on that side of things as well.

44:49

So this introduces a ton of both algorithmic complexities as well as infrastructure complexities. And Ray is, the more challenging AI workloads become, the more important Ray becomes. Because if you're just taking a single process and replicating that a bunch of times and running the identical thing everywhere, then that's fairly simple to do and you don't need Ray.

45:25

When there is heterogeneity, when your workload is broken down into different components that have different responsibilities and need to coordinate with each other, like in reinforcement learning, like in multi node inference, like in these AI data pipelines, then you really need Ray.

Tobias Macey

45:44

And to that point of statefulness and failure recovery, it also brings question of systems such as Temporal that are focused on that.

45:55

They're termed crash proof, which is a little bit of a misnomer, but I'm wondering what you're seeing as some of the ways that people are integrating with some of those more stateful workflow management systems like Temporal to be able to do some of that failure recovery, for instance, if you're in the middle of a training run and you want to be able to make sure that you can resume from the appropriate checkpoint or whatever.

Robert Nishihara

46:16

Yeah, failure recovery is essential, right? Because failures happen all the time. And not only is failure recovery important, but fast failure recovery is important because the larger your cluster, the more frequent the failures. And if you have a failure in extreme cases happening every few hours, and you spend thirty minutes recovering, reloading from a checkpoint, that is a significant fraction of your overall time. So systems that Great recovery from failure is really a non negotiable.

46:53

And one interesting trend we've seen there is that the application needs to have quite a bit of control over how to handle failures, because the appropriate way to handle failures may be quite custom. And we see this more and more with the GB200s and 300s, some of the latest generation of NVIDIA hardware, which previously you would reason about a node of maybe eight GPUs. And now the

47:27

right level to reason about is not a node, but a rack. You might have 72 GPUs in a rack where each rack has a number of nodes and communication within the rack is far more faster than communication between racks. And there are multiple levels of topology here. And because of the performance difference, the application needs to precisely control

47:57

which parts of the application run-in the same rack, which ones run-in different rack. And so if a GPU fails, let's say there's 72 GPUs in the rack, I know these things are going to fail, and they're going to fail frequently. And so I may start my application and run without using all of the GPUs. Maybe I'll use 64 and keep some of them as backups so that I can swap them in when something goes wrong. And now I need to know, okay, GPU dies.

48:27

My application needs to have control over which precise processes to tear down and making sure we can recreate them in the same rack, and it needs to track how many spares it has, and then wire everything up with specific other processes. But then it may need to do something different when I've used up the spares. It may need to recreate, tear down a bunch of stuff and acquire a whole new rack. Or maybe it will decide at that point, Actually, we can't acquire a new rack. We need to

49:01

proceed with training on just a smaller number of workers, a smaller number of racks, and that might mean adapting batch sizes and things like that. And so there is a ton of logic and control that the application needs to have over what to do in the presence of a failure, And that is something that Ray uniquely provides, is really this level of control over process failures and what to do about them.

Tobias Macey

49:31

And in your work of building these technologies and understanding the use cases and applications, what are some of the most interesting or unexpected or challenging lessons that you've learned personally in your work of this very data and AI heavy workloads?

Robert Nishihara

49:49

Yeah. First of all, there are many potential bottlenecks all over the place. And when you're trying to build a flexible tool like Ray that supports a variety of use cases, it's not enough to do one thing well, you have to really have the tools to eliminate any bottleneck that could come up. Now, I would say another lesson, there is a or there can be a tension between building general purpose tools and really optimizing for a specific use case. And this is actually a common

50:29

question people will ask about Ray. They'll say, Ray is so general purpose, it's so flexible. Doesn't that mean it's less optimized for each individual use case? And there is some truth to that, but the way that Ray, the way we approach that with Ray is to really build two layers. There's the lower level primitives, which are highly flexible.

50:54

And really, this is the array as an actor framework. So the ability to The low level primitives is basically the ability to spin up a Python class as an actor and to manage processes. And that is very flexible. You can build anything with functions and classes, and so you can build any kind of distributed system with actors.

51:18

And at that level, that core part of Ray, we focus on keeping it simple, having a very small number of highly flexible primitives, and making sure they are battle tested, they're hardened, their performance. And we are very conservative about adding new functionality at that layer. Then there is the library ecosystem on top of Ray, in the same way that you have Python and then the core primitives, then you've got the library ecosystem around Python.

51:50

And there are libraries built on top of Ray for data processing, for training, for reinforcement learning, for inference. And that's the layer at which you can specialize and really deeply optimize for a specific use case, building on top of the lower level primitives. And so that's the approach we've taken to trying to get the best of both worlds, combining optimization with generality. And I think that's worked quite well.

52:21

We've actually seen a rich ecosystem emerge on top of Ray. So nearly every open source RL framework is built on top of Ray. Companies like NVIDIA are building their data curation frameworks on top of Ray. We see inference frameworks and agent frameworks built on top of Ray. So that is one of the lessons.

Tobias Macey

52:44

And as you continue to build and invest in this technology and ecosystem, what are some of the predictions that you have for some of the next set of challenges or the next evolution of use cases as we maybe move into newer model architectures or invest more in optimizing the existing set of transformer applications?

Robert Nishihara

53:10

Yeah. Complexity is going to continue to grow along every axis. So on the hardware side, there will be more and more levels of topology that we need to reason about, more and more topology aware scheduling. Hardware failure rates will continue to be very prominent. There will be more heterogeneity, just way more types of hardware accelerators and different cloud providers that people will use. So the complexity and heterogeneity will continue to grow on the hardware side.

53:49

The same will be true on the application side. Data scale will grow, model scale will grow, the types and heterogeneity of data will grow. So for example, one thing you don't really have with tabular data, with tabular data, your data roughly all looks the same. With multimodal data, you might have a video that's ten seconds long, and you might have one that's three hours long. And you're going to have to deal with both.

54:18

And really, every different type of data you work with might introduce different systems challenges. Working with PDFs is different from working with textbooks, and there's going to be more heterogeneity there. You're going to see much longer running and much larger scale workloads. Like with reinforcement learning, you're going to see rollouts that are happening at much larger timescales. You're going to see models continue to grow in size.

54:48

So all of these are things that will continue to happen.

Tobias Macey

54:54

Are there any other aspects of the Ray project, the ecosystem, the use cases that it supports that we didn't discuss yet that you'd like to cover before we close out the show?

Robert Nishihara

55:07

I would say that the shift to the world becoming GPU centric really plays to Ray's strengths. A lot of our investment areas are in really natively supporting high performance communication between GPUs, natively supporting RDMA and Ray, different transport back ends. This is growing in importance and is really essential for performance across a variety of use cases.

Tobias Macey

55:45

Well, thank you for taking the time. For anybody who wants to get in touch with you, I'll have you add your preferred contact information to the show notes. And as the final question, I'm interested in getting your perspective on what you see as being the biggest gap in the tooling technology or training that's available for data and AI management today.

Robert Nishihara

56:04

Yeah, it is still not a solved problem.

56:08

Would say one of the biggest gaps is the tooling around multi cloud workloads, and it's very common now for companies to work with one hyperscaler and one or more Neo Clouds, and it's quite complex to find capacity in the right location at the right time to make your data available in a performance and cost efficient way in the right location at the right time to make your workloads portable so they can run-in different locations and elastic, fault tolerance so they can

56:45

take advantage of whatever compute resources are available. So at the same time, it's incredibly valuable when you can get it right and you can have a setup so that your infra team can just continuously shop around for the cheapest compute, the cheapest GPUs, the latest hardware accelerator, plug that in into a uniform interface that your

57:08

researchers and AI teams can take advantage of. So that is a lot of the stuff that we're excited about solving at any scale, and one of the biggest challenges we see.

Tobias Macey

57:18

All right. Well, I appreciate you taking the time today to join me and share all the great work that you and your team are doing on the Ray project and helping to address the myriad complexities of all of the technological utility that we get from these large language models and the whole ecosystem of ML and data processing.

57:39

It's definitely a very complex problem as we discussed, so I appreciate all of the energy that you're putting into it. And I hope you enjoy the rest of your day. Thank you so much. Thank you for listening, and don't forget to check out our other shows. Podcast.net covers the Python language, its community, and the innovative ways it is being used. And the AI engineering podcast is your guide to the fast moving world of building AI systems.

58:08

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript