NeurIPS 2024: The co-evolution of AI and systems with Lidong Zhou

00:04

Welcome to the Microsoft Research Podcast, where Microsoft's leading researchers bring you to the cutting edge. This series of conversations showcases the technical advances being pursued at Microsoft through the insights and experiences of the people driving them. I’m Eliza Strickland, a senior editor at IEEE Spectrum and your guest host for a special

00:21

edition of the podcast. [MUSIC FADES] Joining me today in the Microsoft Booth at the 38th annual Conference on Neural Information Processing Systems, or NeurIPS, is Lidong Zhou. Lidong is a Microsoft corporate vice president, chief scientist of the Microsoft Asia-Pacific Research and Development Group, and managing director of Microsoft Research Asia. Earlier today, Lidong gave a keynote here at NeurIPS on

00:45

the co-evolution of AI and systems engineering. Lidong, welcome to the podcast.

LIDONG ZHOU

00:50

Thank you, Eliza. It's such a pleasure to be here.

STRICKLAND

00:53

You said in your keynote that progress in AI is now outpacing progress in the systems supporting AI. Can you give me some concrete examples of where the current infrastructure is struggling to keep up? ZHOU: Yeah. So actually, we have been working on supporting AI from the infrastructure perspective, and I can say, you know, there are at least three dimensions where it's actually posing a lot of challenges. One dimension is the scale of the

01:19

AI systems that we have to support. You know, you heard about the scaling law in AI and, you know, demanding even higher scale every so often. And when we scale, as I mentioned in the talk this morning, every time you scale the system, you actually have to rethink how to design a system, develop a new methodology, revisit all the assumptions. And it becomes very challenging for the community to keep up. And the other dimension is if you look at AI systems, it's actually a

01:49

whole-stack kind of design. You have to understand not only the AI workloads, the model architecture, but also the software and also the underlying hardware. And you have to make sure they are all aligned to deliver the best performance. And the third dimension is the temporal dimension, where you really see accelerated growth in the pace of innovation in AI, and actually not only in AI but also in the underlying hardware. And that puts a lot of pressure on how fast we innovate

02:17

on the systems side because we really have to keep up in that dimension, as well. So all those three dimensions add up. It's becoming a pretty challenging task for the whole systems community. STRICKLAND: I like how in your talk you proposed a marriage between systems engineering and AI. What does this look like in practice, and how might it change the way we approach both fields?

ZHOU

02:35

Yeah, so I'm actually a big fan of the systems community and the AI community working together to tackle some of the most challenging problems. Of course, you know, we have been working on systems that support AI. But now increasingly, we're seeing opportunities where AI can actually help developers to become more productive and develop systems that are better in many dimensions in terms of efficiency, in terms of reliability, in terms of trustworthiness. So I really want

03:04

to see the two communities work together even more closely going forward. You know, I talk about, sort of, the three pillars, right: the efficiency; there's trust; there's also the infusion of the two (AI and systems engineering). Those are three ambitions that we are actually working on. And we see very encouraging early results that make us believe that there's much more to be achieved going forward with the two communities working together.

STRICKLAND

03:28

You mentioned the challenge of scaling. I think everyone at NeurIPS is talking about scaling. And you've highlighted efficiency as a key opportunity for improvement in AI. What kind of breakthroughs in systems engineering or new ideas in systems engineering could help AI achieve greater efficiencies? ZHOU: Yeah, that's another great question. I think there are a couple of aspects to efficiency. So this morning, I talked about

03:50

some of the innovations in model architecture. So our researchers have been looking into BitNet, which essentially tries to use one bit or, actually, a ternary representation for the weights in all those AI models rather than using FP16 and so on. And that potentially creates a lot of opportunities for efficiency and energy gains. But that cannot be done without rethinking the software and even the hardware stack so that, you know, those innovations that you have

04:23

in the model architecture can actually have the end-to-end benefits. And that's, you know, one of the dimensions where we see the co-innovation of AI and underlying system to deliver some efficiency gains for AI models, for example. But there's another dimension, which I think is also very important. With all the AI infrastructure that we build to support AI, there’s

04:49

actually huge room for improvement, as well. And this is where AI can actually be utilized to solve some of the very challenging systems problems, for optimization, for reliability, for trustworthiness. And I used some of the examples in my talk, but this is a very early stage. I think the potential is much larger going forward. STRICKLAND: Yeah. It's interesting to think about how GPUs and large language models are so intertwined at this point. You can't really have

05:20

one without the other. And you said in your talk you sort of see the need to decouple the architectures and the hardware. Is that right? ZHOU: Yes. Yeah, so this is always, you know, like very system type of thinking where, you know, you really want to decouple some of the elements so that they can evolve and innovate independently. And this gives more opportunities,

05:41

you know, larger design space, for each field. And what we are observing now, which is actually very typical in relatively mature fields, is that we have GPUs that are dominating in the hardware land and all the model architecture has to be designed for, and, you know, proven very efficient on, GPUs. And that

06:02

limits the design space for model architecture. And similarly, you know, if you look at hardware, it’s very hard for hardware innovations to happen because now you have to show that those hardware designs

06:14

are actually great for all the models that have been actually optimized for GPUs. So I think, you know, from a systems perspective, if you design the right abstraction between the AI and the hardware, it's possible for these two domains to actually evolve separately and have a much larger design space, actually, to find the best solution for both. STRICKLAND: And when you think about systems engineering, are there ways

06:43

that AI can be used to optimize your own work? ZHOU: Yes, I think there are. Two examples that I gave this morning, one is, you know, in systems there's this what we call a holy grail of system research because we want to build trustworthy systems that people can depend on. And one of the approaches is called verified systems. And this has been a very active research area in systems because there are a lot of advancements in formal methods in how we can infuse the formal method

07:14

into building real systems. But it's still very hard for the general system community because, you know, you really have to understand how formal methods work and so on. And so it's still not within reach. You know, like when we build mission-critical systems, we want them to be completely verified so, you know, you don't have to do a lot of testing to show that there are no bugs. You’ll never be able to show there are no bugs with testing. But if you …

07:41

STRICKLAND: Sorry, can I pause you for one moment? Could you define formal verification for our listeners, just in case they don't know? ZHOU: Yeah, that's a good point. I think the easy way to think about this is formal verification, it uses mathematical logic to describe, say, a program and, you know, it can represent some properties in math, essentially, in logic.

08:01

And then you can use a proof to show that the program has certain properties that you desire, and a simple form, like, a very preliminary form of formal (specification for) verification is, you know, just assertions in the program, right, where it, say, asserts A is not equal to zero.

08:20

And that's a very simple form of logic that must hold (or be proven to hold), and then, you know, the proof system is also much more complicated to talk about more advanced properties of programs, their correctness, and so on. STRICKLAND: Mm-hm.

ZHOU

08:33

So I think that the opportunity that we're seeing is that with the help of AI, I think we are on the verge of providing the capability of building verified systems, at least for some of the mission-critical pieces of systems. And that would be a very exciting area for systems and AI to tackle together. And I think we're going to see a paradigm shift in

08:58

systems where some pieces of system components will actually be implemented using AI. [What] is interesting is, you know, a system is generally deterministic because, so, you know, when you look at the traditional computer system, you want to know that it's actually acting as you expected, but AI, you know, it can be stochastic, right. And it might not always give you the same answer. But how you combine these two is another area where I see a lot of opportunities for breakthroughs.

STRICKLAND

09:29

Yeah, yeah. I wanted to back up in your career a little bit and talk about the concept of gray failures because you were really instrumental in defining this concept, which for people who don't know, gray failures are subtle and partial failures in cloud-scale systems. They can be very difficult to detect and can lead to major problems. I wanted to see if you're still thinking about gray failures in the context of your thinking about AI and systems.

09:51

Are gray failures having an impact on AI today? ZHOU: Yes, definitely. So when we were looking at cloud systems, we realized the … so in systems, we developed a lot of mechanisms for reliability. And when we look at the cloud systems, when they reach a certain scale, a lot of the methodology we developed in systems for reliability actually no longer applies. One of the reasons is we have those gray failures. And then we moved to looking at AI infrastructure.

10:19

The problem is actually even worse because what we realize is there's a lot of built-in redundancy at every level, like in GPUs, memory, or all the communication channels. And because of those built-in redundancies, sometimes the system is experiencing failures, but they’re being masked because of the redundancies. And that makes it very hard for us to actually maintain the system, debug the system, or troubleshoot. And for AI infrastructure, what we have developed

10:53

is a very different approach using proactive validation rather than reactive repair. And this is actually a paper that we wrote recently in USENIX ATC that talks about how we approach reliability in AI infrastructure, where the same concept happens to apply in a new meaning. STRICKLAND: Mm. I like that. Yeah. So tell me a little bit about your vision for where AI goes from here. You talked a little bit in your keynote about

11:21

AI-infused systems. And what would that look like? ZHOU: Yeah, so I think AI is going to transform almost everything, and that includes systems. That's why I'm so happy to be here to learn more from the AI community. But I also believe that for every domain that AI is going to transform, you really need the domain expertise and, sort of, the combination of AI and that particular

11:47

domain. And the same for systems. So when we look at what we call AI-infused systems, we really see the opportunity where there are a lot of hard system challenges that can be addressed by AI. But we need to define the right interface between the system and the AI so that we can leverage the advantage of both, right. Like, AI is creative. It comes up with solutions that, you know, people might not think of, but it's also a little bit random sometimes. It could, you know, give you

12:18

wrong answers. But systems are very grounded and very deterministic. So we need to figure out what is the design paradigm that we need to develop so that we can get the best of both worlds. STRICKLAND: Makes sense. In your talk you gave an example of OptiFlow. Could you tell our listeners a bit about that? ZHOU: Yeah. This is a pretty interesting project that is actually done in Microsoft Research Asia jointly with the Azure team where we look

12:46

at collective communication, which is a major part of AI infrastructure. And it turns out, you know, there's a lot of room for optimization. It was initially done manually. So an expert had to take a look at the system and look at the different configurations and do all kinds of experiments, and, you know, it takes about two weeks to come up with a solution. This is why I say, you know, the productivity is becoming a bottleneck for our AI infrastructure because people are in the

13:16

loop who have to develop solutions. And it turns out that this is a perfect problem for AI, where AI can actually come up with various solutions. It can actually develop good system insights based on the observations from the system. And so OptiFlow, what it does is it comes up with the, sort of, the algorithm or the schedule of communications for different collective communication primitives. And it turns out to be able to discover algorithms that are much better

13:47

than the default one or, you know, for different settings. And it's giving us the benefits of the productivity; also, efficiency. STRICKLAND: And you said that this is in production today, right? ZHOU: Yes. It is in production. STRICKLAND: That's exciting. So thinking still to the future, how might the co-evolution of AI and systems change the skills needed for future computer scientists?

ZHOU

14:10

Yeah, that's a very deep question. As I mentioned, I think being fluent in AI is very important. But I also believe that domain expertise is probably undervalued in many ways. And I see a lot of needs for this interdisciplinary kind of education where someone who not only understands AI and what AI technology can do but also understands a particular domain very well. And those are the people who will be able to figure out

14:45

the future for that particular domain with the power of AI. And I think for students, certainly it's no longer sufficient for you to be an expert in a very narrow domain. I think we see a lot of fields sort of merging together, and so you have to be an expert in multiple domains to see new opportunities for innovations. STRICKLAND: So what advice would you give to a high school student who's just starting out and thinks, ah, I want to get into AI?

15:14

ZHOU: Yeah, I mean certainly there's a lot of excitement over AI, and it would be great for high school students to, actually, to have the firsthand experience. And I think it's their world in the future. Because they probably can imagine a lot of things from scratch. I think they probably have the opportunity to disrupt a lot of the things that we take for granted today. So I think just use their imagination. And I don't think we have really good advice for the young generation.

15:43

It's going to be their creativity and their imagination. And AI is definitely going to empower them to do something that's going to be amazing. STRICKLAND: Something that we probably can't even imagine. ZHOU: Right.

STRICKLAND

15:54

Yeah. ZHOU: I think so. STRICKLAND: I like that. So as we close, I'm hoping you can look ahead and talk about what excites you most about the potential of AI and systems working together, but also if you have any concerns, what concerns you most? ZHOU: Yeah, I think in terms of AI systems, I'm certainly pretty excited about what we can do together, you know, with a combination of AI and systems. There are a lot of low-hanging fruit, and there are also a lot of potential grand challenges

16:23

that we can actually take on. I mentioned a couple in this morning's talk. And certainly, you know, we also want to look at the risks that could happen, especially when we have systems and AI start to evolve together. And this is also in an area where having some sort of trust foundation is very important so we can have some assurance of the kind of system or AI system that we are going

16:52

to build. And this is actually fundamental in how we think about trust in systems. And I think that concept can be very useful for us to guard against unintended consequences or unintended issues. [MUSIC] STRICKLAND: Well, Lidong Zhou, thank you so much for joining us on the podcast. I really enjoyed the conversation. ZHOU: It's such a pleasure, Eliza.

17:15

STRICKLAND: And to our listeners, thanks for tuning in. If you want to learn more about research at Microsoft, you can check out the Microsoft Research website at Microsoft.com/research. Until next time. [MUSIC FADES]

