El Capitan

Nov 19, 2024 | 31 min | Ep. 10

Summary

This episode explores the development of El Capitan, the world's fastest supercomputer, and its critical role in ensuring U.S. nuclear stockpile safety through advanced simulation, replacing underground testing. It details the extensive challenges overcome, including power requirements and software adaptation, highlighting innovations like GPU and APU integration. El Capitan's launch signifies a monumental leap in computational power, enabling breakthroughs in diverse scientific fields and maintaining U.S. leadership in high-performance computing.

Episode description

For decades, it was an ambitious dream: to create a supercomputer powerful enough to tackle humanity's most complex problems. Now, that dream is a reality. On November 18, 2024, El Capitan made history as the world’s fastest supercomputer, surpassing two exaflops of speed. Join us as we explore how this monumental achievement is set to redefine national security, revolutionize scientific research, and spark breakthroughs that could change the world as we know it.


Transcript

The Vision of Exascale Computing

For decades, the US Department of Energy has been pursuing a bold vision. A system powerful enough to tackle the greatest challenges facing humanity. Fears of a serious new threat to U.S. national... Russia has begun major nuclear weapons exercises... World Health Organization has declared a global public health emergency...

...system is undergoing a once-in-a-century transformation. What will the energy of the future look like? That vision? Exascale computing. Exa is a Greek prefix meaning 10 to the 18th and... An exascale system nominally is about how many calculations it can perform per second.

The U.S. government and the National Nuclear Security Administration's tri-labs, Lawrence Livermore, Los Alamos, and Sandia National Laboratories, needed a machine capable of operating on a scale that had never been done before. You can think of it as a billion billion. And so just the sheer number of calculations that you can perform in a fixed amount of time is beyond anything that we've been able to do in the past. The NNSA labs needed a computer that could simulate nuclear reactions to the tiniest detail.

Discover new materials, boost energy, advance inertial confinement fusion, and meet the nation's evolving national security demands. But building a machine of this magnitude required vision and a willingness to gamble on the unknown. More than a decade of work went into building something that would push the boundaries of what was possible. Capable of performing more than two quintillion calculations per second at its peak. And now, that vision is a reality.

This machine doesn't just turn on. It awakens. Piece by piece, system by system. Each one coming to life in perfect synchronization. And then, it happens. The future has arrived. Welcome to the world, El Capitan. We expect El Capitan to offer more total compute capability than any previously built system.

Welcome to the Big Ideas Lab, your weekly exploration inside Lawrence Livermore National Laboratory. Hear untold stories, meet boundary-pushing pioneers, and get unparalleled access inside the gates. From national security challenges to computing revolutions, discover the innovations that are shaping tomorrow, today.

On November 18, El Capitan was officially launched at supercomputing's biggest showcase, the SC Conference, where it was announced as the world's fastest supercomputer.

At a peak speed of more than two exaflops, El Capitan is not just a technological marvel, but a machine that holds the future of national security, scientific research, and breakthrough innovations in its hands. El Capitan is one of the first exascale systems deployed in the United States. It is the third in a series that the United States has been developing and is the first of these exascale systems to be deployed for the national security mission.

Rob Neely is the Associate Director for Weapons Simulation and Computing at Lawrence Livermore National Laboratory. The immense computational power of El Capitan and its unclassified companion system, Tuolumne, holds the potential to solve some of humanity's biggest challenges.

From fusion energy, to climate modeling, to renewable energy research, to breakthroughs in drug discovery and earthquake simulation. However, at its core, El Capitan was designed with a singular mission. To ensure the safety, security, and reliability of the U.S. nuclear stockpile. For the United States to maintain confidence in our nuclear stockpile.

Prior to 1992, if we wanted to understand if a change or a new design worked, we would go off to Nevada, drill a big hole in the ground, put the weapon down there and set it off. And that's called underground testing. And that's how things worked for decades. We stopped doing that in 1992 under the Clinton administration. And that left us with the big question, how are we going to retain confidence in our nuclear stockpile? And so that really spearheaded a big push in the United States to use supercomputing and modeling and simulation as one leg of a new program called science-based stockpile stewardship, designed to make sure we could retain our confidence in these weapons.

With new global threats emerging and a Cold War era arsenal still in play, ensuring that the U.S. maintains its nuclear deterrence and its competitive advantage over its adversaries have become some of the nation's most critical challenges.

For the first time in decades, we're designing new weapons that are similar, but safer, higher performing. So that national security mission is core to a lot of what we do and what we plan to use El Capitan for.

This mission is not new. It's one Lawrence Livermore National Laboratory has been working on since its founding in the 1950s.

The nice thing about DOE labs is they do make these long-term investments in science and technology that we think we're going to need for the mission, so they can take 20 and 30 years to come to fruition, which is a really interesting work environment for us.

Teresa Bailey is the Associate Program Director for the Weapons Simulation and Computing Computational Physics program at Lawrence Livermore National Laboratory.

Her job is to oversee the development of a wide range of modeling and simulation tools that can be run on Lawrence Livermore National Laboratory's high-performance computers. She points out that El Capitan in many ways represents the culmination of decades of research and development, bringing to life the vision of the Accelerated Strategic Computing Initiative, or ASCI, that was established over 25 years ago.

The ASCI program was designed to deliver modeling and simulation tools aimed at stockpile stewardship using high-performance computers so that we would never have to go back to nuclear testing. So El Capitan really represents that end product for the original vision of ASCI.

The Imperative of Computational Precision

Supercomputing is bringing to bear as much computational power as we can assemble to solve the hardest problems that are out there.

Bronis de Supinski is the chief technology officer for Livermore Computing.

We do modeling and simulation of a variety of processes. Most of them are related to stockpile stewardship, but climate science, things like that. To do those kinds of simulations in ways that actually model things close to reality, it takes quite a lot of computation. And so it takes much more computing capability than you have in, say, your laptop.

Today, supercomputers far surpass the computational power of any device you have at home. What truly sets them apart is their precision and interconnectedness.

Supercomputers are designed to have thousands of compute nodes work together to run simulations that mimic reality with incredible accuracy, which requires immense computational power to achieve the 64-bit double-precision calculations necessary for reliable scientific results.

Mathematically, real numbers are infinite precision. In a computer, you have to choose some finite precision. The fidelity with which you're representing that infinitely precise number in the computer is limited by the number of bits you devote to it, and so getting an accurate answer depends on how many bits you use for it.

Think about it this way: 3.000000000, continuing infinitely. Computers, however, have finite precision, meaning they have to cut off those trailing zeros at some point. In scientific computation, every extra decimal place of precision can be the difference between a simulation that is reliable and one that isn't.

Doing large simulations requires fairly significant precision. Most of our computations require 64-bit computations.

64-bit precision allows supercomputers to represent each number using 64 binary digits, enabling the highly accurate calculations needed for complex simulations, such as those used in nuclear weapons research. In national security, where the smallest margin of error can have critical consequences, close enough simply isn't an option. This is why Lawrence Livermore National Laboratory and the NNSA have been relentlessly focused on hitting that exascale computing target.
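The point about finite precision can be made concrete with a small sketch. It is not taken from the episode: it assumes the usual IEEE 754 formats (32-bit single and 64-bit double precision) and uses Python's standard `struct` module to round a running total to 32 bits after each addition. The helper names `to_float32` and `accumulate` are illustrative only.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step: float, n: int, single_precision: bool) -> float:
    """Add `step` to a running total `n` times, optionally rounding the
    total to 32-bit precision after every addition."""
    total = 0.0
    for _ in range(n):
        total += step
        if single_precision:
            total = to_float32(total)
    return total


N = 1_000_000
EXACT = 100_000.0  # one million steps of 0.1, in exact arithmetic

sum64 = accumulate(0.1, N, single_precision=False)
sum32 = accumulate(to_float32(0.1), N, single_precision=True)

err64 = abs(sum64 - EXACT)
err32 = abs(sum32 - EXACT)

print(f"64-bit total: {sum64!r}  (error {err64:.3e})")
print(f"32-bit total: {sum32!r}  (error {err32:.3e})")
```

Both totals are slightly wrong, because 0.1 has no exact binary representation, but the 32-bit error is many orders of magnitude larger. A long simulation performs far more than a million dependent operations, which is one way to see why the speakers insist on 64-bit arithmetic.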

Architectural Evolution and Software Adaptation

But achieving this level of technological advancement requires more than just improving current capabilities. It requires holding a vision so bold and far-reaching that the path forward may not always be immediately clear.

Back in about 2008, there was a seminal paper that was released by DARPA, the Defense Advanced Research Projects Agency, foreshadowing the difficulties that the computing industry was going to have reaching this exascale target. So for decades, computers had been getting a thousand times faster approximately every decade or so. Built on first just smaller transistors, the more things you could pack on a chip. Then by parallel computing, putting more of these chips together into a single system. But getting to exascale, very early on, almost 15 years ago, it was recognized this was going to be a challenge like we haven't addressed before. So we started thinking about these systems long before we decided what the systems would actually be, because we knew there was going to be a lot of research needed to be able to utilize these systems effectively.

This DARPA report foreshadowed the immense challenges on the horizon. It highlighted key issues like power, memory, and system resiliency.

Fast forward ahead to about 2015, 2016, the United States funded something called the Exascale Computing Project, which was really about the research needed to develop the software and the applications that would ultimately run on these machines. And it also funded some research for companies like AMD and Intel and NVIDIA and HPE, big players in the supercomputing industry, to help them develop technology faster so that we could deploy those at the laboratory sooner for our mission. So all this was happening about six, seven years ago, and at that time is when we began thinking about what's our next system going to be? What's the NNSA's exascale system going to be?

As the Exascale Computing Project took shape, it became clear that the path forward would require new solutions for both software and hardware.

One of the original challenges we had when thinking about exascale computing was really around the power requirements of these computers.

Historically, the earliest supercomputers were just the earliest computers, right? Over time, they became dominated by something called vector systems. So that's a way of computing a bunch of things at the same time in parallel. There's kind of limitations on that. And so over time, we moved to networked systems of CPUs, which is the standard way of what people use in their laptops. And so for a long time, we were building systems with CPUs.

That's how for decades we've been getting faster and faster performance on these supercomputers. But if you drew sort of a straight line on where we knew technology was going in the late 2000s out to 10 years later or so, the amount of power it was going to require to field one of these systems, we were going to have to think about putting a nuclear reactor next to the building, because it was in the hundreds of megawatts, and the operational costs for that were more than even the Department of Energy was willing to accept. And so a lot of the initial challenges and a lot of the initial research was, how can we continue to ride this wave of improved computer performance without expanding the amount of energy and power that's going to be required?

The power challenge was immense. Exascale systems like El Capitan would require a completely new approach to energy efficiency, pushing computing experts to explore new ways to design and build these machines.

In order to get more parallelism, we moved to processors that are used to drive the graphics on your screen, so GPUs.

In 2018, Lawrence Livermore National Laboratory launched Sierra, a groundbreaking supercomputer that combined CPUs with GPUs, making it one of the first large-scale systems to use this integrated heterogeneous approach. Sierra delivered 125 petaflops at its theoretical peak, roughly one eighth of the computational performance of exascale.

Part of what we were able to do between the community and our vendor partners, like folks at NVIDIA and AMD and Intel, was to make these graphics processing chips suitable for scientific computing. It was really scientific computing and partnership with companies that helped us recognize that, yes, we could do this. This could become the basis for the next generation of supercomputers. And it's going to be something like that technology that's going to be required to get us to exascale computing in a power budget that we can manage.

Sierra's design was a huge leap forward.
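As a quick check on the scale figures quoted so far, the arithmetic can be written out. This is only a sketch using numbers stated in the episode (a "billion billion", Sierra's 125-petaflop theoretical peak, and El Capitan's peak of more than two exaflops); the variable names are illustrative.

```python
PETA = 10**15      # one petaflop: 10^15 calculations per second
EXA = 10**18       # one exaflop: 10^18 calculations per second
BILLION = 10**9

# "You can think of it as a billion billion."
billion_billion = BILLION * BILLION

# Sierra's theoretical peak, as quoted in the episode.
sierra_peak = 125 * PETA

# El Capitan peaks at "more than two exaflops", so 2 exaflops is a lower bound.
el_capitan_peak_floor = 2 * EXA

sierra_fraction_of_exa = sierra_peak / EXA
el_capitan_vs_sierra = el_capitan_peak_floor / sierra_peak

print(f"billion x billion == 10^18: {billion_billion == EXA}")
print(f"Sierra as a fraction of one exaflop: {sierra_fraction_of_exa}")
print(f"El Capitan peak is at least {el_capitan_vs_sierra:.0f}x Sierra's peak")
```

So "roughly one eighth" is exact for the quoted peak numbers, and El Capitan's peak is at least sixteen times Sierra's.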

But the shift to GPUs introduced a new challenge. Many of the existing codes weren't built to run on GPUs. These codes had been designed for CPU-based systems, and adapting them wasn't a simple task.

These aren't just little codes that you can rewrite over and over again. They're big codes. They're sometimes millions of lines of code coming together. So to make these big shifts of an algorithmic type takes a lot of upfront thought and research to make sure it has the payoff that we need.

Imagine being tasked with translating a complex manual into a different language. Except this manual isn't just a few pages. It's millions of lines long, and every detail is critical. Even the smallest error could derail the entire process.

This was the challenge Lawrence Livermore National Laboratory's developers faced when adapting CPU-based codes to run efficiently on GPUs. To overcome this, the team implemented RAJA and Umpire, coding tools that simplify and streamline the process of adapting and using the codes. These tools, first used for Sierra, sped up the work for El Capitan dramatically, reducing code implementation time and pushing the exascale transition forward.

The next pivotal step toward a fully functional exascale machine came with the introduction of AMD's next-generation processors known as APUs, or Accelerated Processing Units. These chips combine both CPUs and GPUs into a single hardware package, making them more efficient and easier to program. This invention marked a major leap forward in computing technology, not just for the lab, but for the world.

The APU was an innovation that AMD came up with, one of our partners in El Capitan, to basically integrate the idea of the CPU and the GPU all on a single package. Sierra had GPUs in it, but they were really completely separate from the CPUs. They had separate memory. And one of the complications of using those systems and using accelerated computing in general was that the programmer now had to make some explicit decisions about when to move data between the CPU and the GPU and when to transfer control of the program from one type of device to another. The APU now gets rid of one of those complications. That makes the system more efficient from an energy standpoint because you're not doing those useless movements of data. And it also makes it easier to program because you don't have to program that movement of data. That's technically a big advantage.

El Capitan is made up of tens of thousands of these APUs, each one linked together to create a vast system capable of calculations on a scale never before seen.
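The difference between separate CPU and GPU memories and a shared memory can be illustrated with a toy model. This is purely a hypothetical sketch in plain Python, not real GPU code and not any AMD or LLNL API: the classes `DiscreteMemorySystem` and `UnifiedMemorySystem` and their methods are invented here, and the "GPU" is just a list comprehension. It only counts how many bytes the programmer has to move explicitly in each case.

```python
from dataclasses import dataclass, field

BYTES_PER_VALUE = 8  # one 64-bit floating-point number


@dataclass
class DiscreteMemorySystem:
    """Toy model: CPU and GPU each have their own memory."""
    host: dict = field(default_factory=dict)
    device: dict = field(default_factory=dict)
    bytes_moved: int = 0

    def copy_to_device(self, name: str) -> None:
        self.device[name] = list(self.host[name])
        self.bytes_moved += BYTES_PER_VALUE * len(self.host[name])

    def copy_to_host(self, name: str) -> None:
        self.host[name] = list(self.device[name])
        self.bytes_moved += BYTES_PER_VALUE * len(self.device[name])

    def gpu_scale(self, name: str, factor: float) -> None:
        # The "GPU" can only see data that was explicitly copied over.
        self.device[name] = [factor * v for v in self.device[name]]


@dataclass
class UnifiedMemorySystem:
    """Toy model: CPU and GPU share one memory, as on an APU."""
    memory: dict = field(default_factory=dict)
    bytes_moved: int = 0

    def gpu_scale(self, name: str, factor: float) -> None:
        # No staging copies: the "GPU" works on the shared data in place.
        self.memory[name] = [factor * v for v in self.memory[name]]


data = [float(i) for i in range(1000)]

discrete = DiscreteMemorySystem(host={"x": list(data)})
discrete.copy_to_device("x")   # programmer decides when to move data over
discrete.gpu_scale("x", 2.0)
discrete.copy_to_host("x")     # ...and when to bring the result back

unified = UnifiedMemorySystem(memory={"x": list(data)})
unified.gpu_scale("x", 2.0)

print("bytes moved (discrete):", discrete.bytes_moved)
print("bytes moved (unified): ", unified.bytes_moved)
```

Both versions compute the same result, but the discrete version needs two extra calls and moves every value twice. In the toy model those copies are the "useless movements of data" the speaker describes, and removing them is what simplifies the programming.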

The way these large supercomputers are assembled is you have the basic unit of compute that's called a node, and a node, in our case, is actually already made up of multiple APUs. Then you take nodes and you assemble those into blades, and then blades get assembled into like a commercial-grade refrigerator-sized rack that sits on the floor and weighs a lot, and then we assemble those racks together, on the order of about a hundred of them for El Capitan, to make the entire system.

Building the Supercomputer's Foundation

One of the biggest challenges with exascale computing wasn't just designing the machine, but building the infrastructure to support it. At Lawrence Livermore National Laboratory, they had to overhaul the entire electrical and cooling infrastructure, doubling their capacity to handle the immense energy demands of El Capitan. A new utility yard was built, supplying enough energy to power tens of thousands of homes, just to ensure the supercomputer could run at full capacity without interruption.

As part of something called the Exascale Computing Facility Modernization Project, we deployed a significant increase in the electrical infrastructure to our main data center. And so that took us from 45 megawatts to 85 megawatts. So we're essentially 2x the energy that we can deliver to the computer floor. Now, El Capitan is not going to use all 85 of those megawatts, but it's going to use somewhere around 30 of those at any given time. That power is enough to supply around 30,000 homes.
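The power figures above are easy to sanity-check. The sketch below uses only the numbers quoted in the episode (45 and 85 megawatts of facility capacity, about 30 megawatts for El Capitan, about 30,000 homes); the per-home figure is simply what those numbers imply, not an independent statistic.

```python
OLD_CAPACITY_MW = 45      # facility capacity before the upgrade
NEW_CAPACITY_MW = 85      # facility capacity after the upgrade
EL_CAPITAN_DRAW_MW = 30   # approximate draw quoted for El Capitan
HOMES_SUPPLIED = 30_000   # homes the episode says that power could supply

capacity_ratio = NEW_CAPACITY_MW / OLD_CAPACITY_MW
headroom_mw = NEW_CAPACITY_MW - EL_CAPITAN_DRAW_MW
kw_per_home = EL_CAPITAN_DRAW_MW * 1000 / HOMES_SUPPLIED

print(f"capacity increase: {capacity_ratio:.2f}x")
print(f"capacity left for other systems: {headroom_mw} MW")
print(f"implied average draw per home: {kw_per_home:.1f} kW")
```

The upgrade works out to a little under 1.9x, which matches "essentially 2x", and the homes comparison implies an average of about one kilowatt per home.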

The extra energy capacity of the Livermore Computing Facility ensures they can sustain existing supercomputers alongside El Capitan. Despite its substantial energy requirements, El Capitan is one of the most energy efficient supercomputers ever built in terms of performance per watt. But all that power generates heat.

A tremendous amount of liquid cooling is required to keep these systems from literally melting, because they run at sometimes over a thousand watts. So you think about how hot a hundred-watt light bulb can get. Magnify that now by 10, 20, 50 times. That's how much heat you're trying to dissipate in a very small package in one of these nodes of a supercomputer, and then multiply that by the tens of thousands of nodes that make up these systems. That's a lot of heat that you've got to try to make sure you can get rid of.

And liquid cooling is the idea that you bring in cool water. You then run water across cold plates. It dissipates some of the heat away, goes out the other side of the rack, and then eventually through heat exchangers goes out to a cooling tower, and then that water is cooled and the cycle repeats.

At full operation, El Capitan will cycle through 5 to 8 million gallons of water every 24 hours to keep its systems cool and running efficiently.

The Power of Co-Design and Impact

Building El Capitan required more than just cutting-edge technology. It took a coordinated effort across multiple organizations. Years of collaboration between Lawrence Livermore National Laboratory, the Department of Energy and NNSA, and private industry were essential in overcoming the immense technical challenges of exascale computing.

Back around the time where we were first starting to talk about exascale and recognizing the challenges, we created a term that stuck called co-design, which was really the idea that we're going to have to take this from a standard customer-client relationship with these companies to something much more collaborative. We need to understand more about the long-distance roadmaps of these companies so that we can begin to angle our research and our applications development toward what their roadmaps are.

Probably more importantly, these companies really need to understand where the bottlenecks are in our applications so that they can think about how to design their hardware in ways that are going to best address our concerns and our needs.

Co-design emerged as a way to blur the lines between hardware and software development, bringing together experts from both fields to work side by side. This deeper level of collaboration, often involving clearances for security-sensitive work, allowed teams to quickly identify and address the most critical challenges, speeding up progress in ways that wouldn't have been possible with a standard customer-supplier relationship.

Being innovative is really critical to doing new things, taking new approaches. But if you have a completely new approach all on your own, you're not going to get much done because big things take lots of people.

They've built something truly extraordinary. El Capitan is not only faster and more powerful, but is also able to tackle problems that were once deemed intractable.

So there's a class of problems that are big 3D problems that we want to run at high resolution. We've been studying this class of problem for years, since the beginning of ASCI. And the first time we took it out for a spin, it took like half of our biggest supercomputer and it took over a month to run it. And in 2015, we checked again and it took maybe 20% of our supercomputer, and it took a little less than a month. Then we took that same calculation out for a spin on Sierra, and all of a sudden it took 3.3 days. Whoa!

OK, that is like game changing. Right. Think about it. Think about what you can turn around in 3.3 days as opposed to a month. Ten different types of those calculations. Think about if you're designing something, how that changes what you can do, right? It's just night and day. I get 10 shots as a designer to make a choice in a month.
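The "10 shots in a month" estimate follows directly from the run times quoted above. A minimal sketch, assuming a 30-day month (the episode does not specify one) and back-to-back runs with no queueing or setup time:

```python
DAYS_PER_MONTH = 30    # assumption: a 30-day month
SIERRA_RUN_DAYS = 3.3  # run time on Sierra, as quoted in the episode

# How many back-to-back runs fit in the time one earlier run used to take.
runs_per_month = DAYS_PER_MONTH / SIERRA_RUN_DAYS
whole_runs = int(runs_per_month)

print(f"runs per month: {runs_per_month:.2f}")
print(f"complete runs:  {whole_runs}")
```

Under those assumptions the figure is about nine complete runs, which the speaker rounds to ten. The earlier runs took "over a month" or "a little less than a month", so the exact count depends on which baseline is used.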

That's incredible. Oh, and by the way, that 3.3 days took less than 10% of Sierra. That was like, this is going to be tractable on El Capitan. We need to continue pushing. We need to get to higher mesh resolutions and do a better job with the physics. And that is our goal. Our goal is a reasonable turnaround time for a medium to high resolution full physics calculation in three dimensions that we have never been able to do before.

It's an open question for whether or not El Cap will be the fastest computer in the world for one year, two years, maybe three years. We can't predict that right now just because everybody's working always to build faster and faster computers.

El Capitan's achievement as the world's fastest supercomputer isn't just about speed. It's about what that speed can accomplish. It signals a new era of computational capability that will tackle some of the biggest challenges facing humanity, whether it's understanding complex physical phenomena or advancing national security.

We have a series of problems that are just going to challenge the entire scale of the machine. There are problems I can imagine.

They're big problems. They're things that no one has ever dreamed really trying. We have laboratory-directed research and development projects that have put things in place where like, if we could do something massive, we could solve this problem. And there are a few of them that over time using El Capitan, we will get the codes aligned and arranged and go after those problems as well.

Anticipating the Next Generation of HPC

They're probably not going to be the first thing we try, but over the life of the machine, we will take a shot. I'm very certain of that. El Capitan's journey did not begin yesterday, and it won't end tomorrow. Decades of planning, innovation, and collaboration have led to this moment. And now, even as it comes online, the lab is already looking to the future.

We always are planning ahead as much as possible. And we're already starting to think about what the next system in the 2030 timeframe is going to be. And we want to make sure we can begin to stand it up, and we're anticipating it's also going to be a very power-hungry system, while still keeping El Capitan running, because it will, of course, be being used for the mission during that time. So we sized our facility to be able to support multiple exascale systems at one time during that overlap period when we're at end of life for one system and beginning of life for the next system. So that was one of our big challenges, making sure we have the infrastructure, the power and the cooling required to field these systems.

Building world-changing technology requires looking far ahead, anticipating the limits of today's capabilities, and constantly pushing the boundaries.

The team at Livermore isn't just solving problems. They're planning for challenges that may not even exist yet.

What's the next big thing? That is the billion-dollar question. I think taking a step back and looking at the numerical methods that we can apply on these machines and looking at different ways to run sensitivities, or to understand how we not just get one solution, but a solution plus ensembles of answers, or getting gradients of the solution, is probably what we should be doing once we get through the El Capitan challenge.

Hardware advances are uncertain. The machine learning market is driving hardware changes that are complex for our codes to deal with. They don't need the precision we need. And so the hardware that vendors are creating takes that into account to sell to that market. So that's going to be a challenge for us.

We're looking at both technology slowing down and prices going up, and very worried that for the same dollars we're not going to get a lot more compute than we have in El Capitan.

So we're thinking about how can we make the systems get more work done for the same compute capability. Because of that, the future of the architectures is unclear. So for the code teams, I think thinking about new types of calculations we can do, new types of numerical methods we can employ because we do have huge compute, is what we'll do in the short term.

The real power of El Capitan isn't just in the numbers it can crunch today, but in the new frontiers it opens for tomorrow.

This is a signal of the United States' continuing leadership in high performance computing, that we can continue to do something that is the best in the world. And that alone makes El Capitan interesting. And that alone is one reason to be proud of what we're doing here at our national laboratories with U.S. industry to do something that is the best in the world.

But of course, a fast computer that doesn't actually solve any of humanity's problems, that's not terribly interesting. So it's not enough for us to just be the fastest in the world. It has to be for a purpose.

As the team at Livermore has shown, reaching for the pinnacle is just the beginning. The pursuit of what comes next, anticipating future limits and pushing past them. That's the enduring mission.

Thank you for tuning in to Big Ideas Lab. If you loved what you heard, please let us know by leaving a rating and review. And if you haven't already, don't forget to hit the follow or subscribe button in your podcast app to keep up with our latest episodes. Thanks for listening.

This transcript was generated by Metacast using AI and may contain inaccuracies.