Why AI needs a new kind of supercomputer network - Episode 18

May 06, 2026 · 38 min · Ep. 18

Episode description

Training frontier models isn’t as simple as adding more GPUs—one small problem and the whole coordinated dance falls apart. OpenAI’s Mark Handley and Greg Steinbrecher discuss how a new supercomputer network design, used to train some of the company’s latest models, keeps the whole system moving in lockstep, even with record numbers of GPUs. They break down Multipath Reliable Connection, a new protocol OpenAI developed with AMD, Broadcom, Intel, Microsoft, and Nvidia, and why they’re making it available for the whole industry to use.


Chapters

00:00 Intro

00:39 Greg and Mark's paths to OpenAI

04:34 Why training AI stresses networks differently

10:05 Bottlenecks, failures, and the cost of waiting

15:19 How Multipath Reliable Connection works

18:59 A protocol to route around failures

25:05 Why OpenAI is making MRC an open standard

35:09 Could AI compute move to space?

