NeurIPS 2024: The co-evolution of AI and systems with Lidong Zhou

Dec 17, 2024, 18 min

Episode description

Just after his NeurIPS 2024 keynote on the co-evolution of systems and AI, Microsoft CVP Lidong Zhou joins the podcast to discuss how rapidly advancing AI impacts the systems supporting it and the opportunities to use AI to enhance systems engineering itself.

Learn more:

Verus: A Practical Foundation for Systems Verification | Publication, November 2024

SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation | Publication, July 2024

BitNet: Scaling 1-bit Transformers for Large Language Models | Publication, October 2023

Transcript

ELIZA STRICKLAND

Welcome to the Microsoft Research Podcast, where Microsoft's leading researchers bring you to the cutting edge. This series of conversations showcases the technical advances being pursued at Microsoft through the insights and experiences of the people driving them. I'm Eliza Strickland, a senior editor at IEEE Spectrum and your guest host for a special edition of the podcast. [MUSIC FADES]

Joining me today in the Microsoft Booth at the 38th annual Conference on Neural Information Processing Systems, or NeurIPS, is Lidong Zhou. Lidong is a Microsoft corporate vice president, chief scientist of the Microsoft Asia-Pacific Research and Development Group, and managing director of Microsoft Research Asia. Earlier today, Lidong gave a keynote here at NeurIPS on the co-evolution of AI and systems engineering. Lidong, welcome to the podcast.

LIDONG ZHOU

Thank you, Eliza. It's such a pleasure to be here.

STRICKLAND

You said in your keynote that progress in AI is now outpacing progress in the systems supporting AI. Can you give me some concrete examples of where the current infrastructure is struggling to keep up?

ZHOU

Yeah. So actually, we have been working on supporting AI from the infrastructure perspective, and I can say, you know, there are at least three dimensions where it's actually posing a lot of challenges. One dimension is the scale of the AI systems that we have to support. You know, you heard about the scaling law in AI and, you know, demanding even higher scale every so often. And when we scale, as I mentioned in the talk this morning, every time you scale the system, you actually have to rethink how to design a system, develop a new methodology, revisit all the assumptions. And it becomes very challenging for the community to keep up.

And the other dimension is if you look at AI systems, it's actually a whole-stack kind of design. You have to understand not only the AI workloads, the model architecture, but also the software and also the underlying hardware. And you have to make sure they are all aligned to deliver the best performance.

And the third dimension is the temporal dimension, where you really see accelerated growth and the pace of innovation in AI and not actually only in AI but also in the underlying hardware. And that puts a lot of pressure on how fast we innovate on the systems side because we really have to keep up in that dimension, as well. So all those three dimensions add up. It's becoming a pretty challenging task for the whole systems community.

STRICKLAND

I like how in your talk you proposed a marriage between systems engineering and AI. What does this look like in practice, and how might it change the way we approach both fields?

ZHOU

Yeah, so I'm actually a big fan of the systems community and the AI community working together to tackle some of the most challenging problems. Of course, you know, we have been working on systems that support AI. But now increasingly, we're seeing opportunities where AI can actually help developers to become more productive and develop systems that are better in many dimensions in terms of efficiency, in terms of reliability, in terms of trustworthiness. So I really want to see the two communities work together even more closely going forward. You know, I talk about, sort of, the three pillars, right: the efficiency; there's trust; there's also the infusion of the two (AI and systems engineering). Those are three ambitions that we are actually working on. And we see very encouraging early results that make us believe that there's much more to be achieved going forward with the two communities working together.

STRICKLAND

You mentioned the challenge of scaling. I think everyone at NeurIPS is talking about scaling. And you've highlighted efficiency as a key opportunity for improvement in AI. What kind of breakthroughs in systems engineering or new ideas in systems engineering could help AI achieve greater efficiencies?

ZHOU

Yeah, that's another great question. I think there are a couple of aspects to efficiency. So this morning, I talked about some of the innovations in model architecture. So our researchers have been looking into BitNet, which essentially tries to use one bit or, actually, a ternary representation for the weights in all those AI models rather than using FP16 and so on. And that potentially creates a lot of opportunities for efficiency and energy gains. But that cannot be done without rethinking the software and even the hardware stack so that, you know, those innovations that you have in the model architecture can actually have the end-to-end benefits. And that's, you know, one of the dimensions where we see the co-innovation of AI and underlying system to deliver some efficiency gains for AI models, for example.

But there's another dimension, which I think is also very important. With all the AI infrastructure that we build to support AI, there's actually huge room for improvement, as well. And this is where AI can actually be utilized to solve some of the very challenging systems problems, for optimization, for reliability, for trustworthiness. And I use some of the examples in my talk, but this is a very early stage. I think the potential is much larger going forward.

STRICKLAND

Yeah. It's interesting to think about how GPUs and large language models are so intertwined at this point. You can't really have one without the other. And you said in your talk you sort of see the need to decouple the architectures and the hardware. Is that right?

ZHOU

Yes. Yeah, so this is always, you know, like very system type of thinking where, you know, you really want to decouple some of the elements so that they can evolve and innovate independently. And this gives more opportunities, you know, larger design space, for each field. And what we are observing now, which is actually very typical in relatively mature fields, is that we have GPUs that are dominating in the hardware land and all the model architecture has to be designed and, you know, proven very efficient on GPUs. And that limits the design space for model architecture. And similarly, you know, if you look at hardware, it's very hard for hardware innovations to happen because now you have to show that the new hardware is actually great for all the models that have been actually optimized for GPUs. So I think, you know, from a systems perspective, it's actually possible, if you design the right abstraction between the AI and the hardware, for these two domains to actually evolve separately and have a much larger design space, actually, to find the best solution for both.

STRICKLAND

And when you think about systems engineering, are there ways that AI can be used to optimize your own work?

ZHOU

Yes, I think there are. Two examples that I gave this morning, one is, you know, in systems there's this what we call a holy grail of system research because we want to build trustworthy systems that people can depend on. And one of the approaches is called verified systems. And this has been a very active research area in systems because there are a lot of advancements in formal methods in how we can infuse the formal method into building real systems. But it's still very hard for the general system community because, you know, you really have to understand how formal methods work and so on. And so it's still not within reach. You know, like when we build mission-critical systems, we want them to be completely verified so, you know, you don't have to do a lot of testing to show that there are no bugs. You'll never be able to show there are no bugs with testing. But if you …

STRICKLAND

Sorry, can I pause you for one moment? Could you define formal verification for our listeners, just in case they don't know?

ZHOU

Yeah, that's a good point. I think the easy way to think about this is formal verification, it uses mathematical logic to describe, say, a program and, you know, it can represent some properties in math, essentially, in logic. And then you can use a proof to show that the program has certain properties that you desire, and a simple form, like, a very preliminary form of formal (specification for) verification is, you know, just assertions in the program, right, where it, say, asserts A is not equal to zero. And that's a very simple form of logic that must hold (or be proven to hold), and then, you know, the proof system is also much more complicated to talk about more advanced properties of programs, their correctness, and so on.

STRICKLAND

Mm-hm.
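As a small illustration of the assertion example just described, the sketch below states the property "A is not equal to zero" directly in a program. It is not from the episode: the function name `safe_ratio` and its parameters are hypothetical, chosen only to show the idea.

```python
def safe_ratio(total: float, a: float) -> float:
    """Divide total by a, stating the precondition as an assertion."""
    # Hypothetical example: "a is not equal to zero" is the simple
    # logical property that must hold at this point in the program.
    assert a != 0, "precondition violated: a must not be zero"
    return total / a


if __name__ == "__main__":
    # The assertion holds for this input, so the division proceeds.
    print(safe_ratio(10.0, 4.0))
```

A runtime assertion like this only checks the inputs that actually occur during a run. Formal verification, as described in the conversation, would instead aim to prove that the property holds for every possible input.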

ZHOU

So I think that the opportunity that we're seeing is that with the help of AI, I think we are on the verge of providing the capability of building verified systems, at least for some of the mission-critical pieces of systems. And that would be a very exciting area for systems and AI to tackle together. And I think we're going to see a paradigm shift in systems where some pieces of system components will actually be implemented using AI. [What] is interesting is, you know, system is generally deterministic because, so, you know, when you look at the traditional computer system, you want to know that it's actually acting as you expected, but AI, you know, it can be stochastic, right. And it might not always give you the same answer. But how you combine these two is another area where I see a lot of opportunities for breakthroughs.

STRICKLAND

Yeah, yeah. I wanted to back up in your career a little bit and talk about the concept of gray failures because you were really instrumental in defining this concept, which for people who don't know, gray failures are subtle and partial failures in cloud-scale systems. They can be very difficult to detect and can lead to major problems. I wanted to see if you're still thinking about gray failures in the context of your thinking about AI and systems.

Are gray failures having an impact on AI today?

ZHOU

Yes, definitely. So when we were looking at cloud systems, we realized the … so in systems, we developed a lot of mechanisms for reliability. And when we look at the cloud systems, when they reach a certain scale, a lot of the methodology we developed in systems for reliability actually no longer applies. One of the reasons is we have those gray failures. And then we moved to looking at AI infrastructure. The problem is actually even worse because what we realize is there's a lot of built-in redundancy at every level, like in GPUs, memory, or all the communication channels. And because of those built-in redundancies, sometimes the system is experiencing failures, but they're being masked because of the redundancies. And that makes it very hard for us to actually maintain the system, debug the system, or to troubleshoot. And for AI infrastructure, what we have developed is a very different approach using proactive validation rather than reactive repair. And this is actually a paper that we wrote recently in USENIX ATC that talks about how we approach reliability in AI infrastructure, where the same concept happens to apply in a new meaning.

STRICKLAND

Mm. I like that. Yeah. So tell me a little bit about your vision for where AI goes from here. You talked a little bit in your keynote about AI-infused systems. And what would that look like?

ZHOU

Yeah, so I think AI is going to transform almost everything, and that includes systems. That's why I'm so happy to be here to learn more from the AI community. But I also believe that for every domain that AI is going to transform, you really need the domain expertise and, sort of, the combination of AI and that particular domain. And the same for systems. So when we look at what we call AI-infused systems, we really see the opportunity where there are a lot of hard system challenges that can be addressed by AI. But we need to define the right interface between the system and the AI so that we can leverage the advantage of both, right. Like, AI is creative. It comes up with solutions that, you know, people might not think of, but it's also a little bit random sometimes. It could, you know, give you wrong answers. But systems are very grounded and very deterministic. So we need to figure out what is the design paradigm that we need to develop so that we can get the best of both worlds.

STRICKLAND

Makes sense. In your talk you gave an example of OptiFlow. Could you tell our listeners a bit about that?

ZHOU

Yeah. This is a pretty interesting project that is actually done in Microsoft Research Asia jointly with the Azure team where we look at collective communication, which is a major part of AI infrastructure. And it turns out, you know, there's a lot of room for optimization. It was initially done manually. So an expert had to take a look at the system and look at the different configurations and do all kinds of experiments, and, you know, it takes about two weeks to come up with a solution. This is why I say, you know, the productivity is becoming a bottleneck for our AI infrastructure because people are in the loop who have to develop solutions. And it turns out that this is a perfect problem for AI, where AI can actually come up with various solutions. It can actually develop good system insights based on the observations from the system. And so OptiFlow, what it does is it comes up with the, sort of, the algorithm or the schedule of communications for different collective communication primitives. And it turns out to be able to discover algorithms that are much better than the default one or, you know, for different settings. And it's giving us the benefits of the productivity; also, efficiency.

STRICKLAND

And you said that this is in production today, right?

ZHOU

Yes. It is in production.

STRICKLAND

That's exciting. So thinking still to the future, how might the co-evolution of AI and systems change the skills needed for future computer scientists?

ZHOU

Yeah, that's a very deep question. As I mentioned, I think being fluent in AI is very important. But I also believe that domain expertise is probably undervalued in many ways. And I see a lot of need for this interdisciplinary kind of education where someone not only understands AI and what AI technology can do but also understands a particular domain very well. And those are the people who will be able to figure out the future for that particular domain with the power of AI. And I think for students, certainly it's no longer sufficient for you to be an expert in a very narrow domain. I think we see a lot of fields sort of merging together, and so you have to be an expert in multiple domains to see new opportunities for innovations.

STRICKLAND

So what advice would you give to a high school student who's just starting out and thinks, ah, I want to get into AI?

ZHOU

Yeah, I mean certainly there's a lot of excitement over AI, and it would be great for high school students to, actually, to have the firsthand experience. And I think it's their world in the future. Because they probably can imagine a lot of things from scratch. I think they probably have the opportunity to disrupt a lot of the things that we take for granted today. So I think just use their imagination. And I don't think we have really good advice for the young generation. It's going to be their creativity and their imagination. And AI is definitely going to empower them to do something that's going to be amazing.

STRICKLAND

Something that we probably can't even imagine.

ZHOU

Right.

STRICKLAND

Yeah.

ZHOU

I think so.

STRICKLAND

I like that. So as we close, I'm hoping you can look ahead and talk about what excites you most about the potential of AI and systems working together, but also if you have any concerns, what concerns you most?

ZHOU

Yeah, I think in terms of AI systems, I'm certainly pretty excited about what we can do together, you know, with a combination of AI and systems. There are a lot of low-hanging fruit, and there are also a lot of potential grand challenges that we can actually take on. I mentioned a couple in this morning's talk. And certainly, you know, we also want to look at the risks that could happen, especially when we have systems and AI start to evolve together. And this is also an area where having some sort of trust foundation is very important so we can have some assurance of the kind of system or AI system that we are going to build. And this is actually fundamental in how we think about trust in systems. And I think that concept can be very useful for us to guard against unintended consequences or unintended issues. [MUSIC]

STRICKLAND

Well, Lidong Zhou, thank you so much for joining us on the podcast. I really enjoyed the conversation.

ZHOU

It's such a pleasure, Eliza.

STRICKLAND

And to our listeners, thanks for tuning in. If you want to learn more about research at Microsoft, you can check out the Microsoft Research website at Microsoft.com/research. Until next time. [MUSIC FADES]

Transcript source: Provided by creator in RSS feed