Liquid Cooling

00:08

Hello everyone. Welcome to the next podcast From Research to Reality. I have the great honor and pleasure to host Jason Zeiler. Hello, Jason. Great to meet you. Jason is the product manager of Liquid Cooling and Next Generation Infrastructure. And this is the first time we're bringing in someone who is a product manager. We've done largely the first part, from research. Now we're getting closer to reality. Excellent.

00:38

You need the product in there to where it kind of gets in the customer's hands is always important. Yeah, exactly. So tell us a little bit what that title means. Sure. So I kind of work with an HPC group. Liquid cooling is really an integral part of our infrastructure and planning. How we're enabling you know, high TDP stuff, new GPUs and CPUs. So we need a product manager. TDP? Thermal Design Power. How much power goes into chips and how hot they're going to get.

01:05

And the next generation is really what the infrastructure is going to look like. How big are the racks going to be, how much power do we need, how much cooling. So today, liquid cooling product manager, but then also the future next generation. So that's kind of my hybrid role that I play today. So Jason, what does it mean to you to be product manager?

01:23

So for us and for me personally, product management is always about working kind of on the left with engineering and on the right with marketing and the customers understanding what we can do, what we should do and how we will do it with customers and making sure that their voice is represented every step of the way, but also that we represent our internal and kind of corporate intentions. How much these things should cost, how many can we sell, how do we actually deploy this in the market?

01:48

Product management always plays that middle role to make sure what we're doing makes sense and actually is going to align with customer needs. Can we dissect the little bit? As you said, on one hand, you work with engineers to define it. So how does that work? Please explain to us. Yeah.

02:02

So today a lot of what we do, especially with the liquid cooling, we're doing a lot of work internally, our own designs, also working with a lot of third party vendors to see where should we buy, what should we build ourselves. But we are pretty knowledgeable of our own competencies, our own abilities. But we shouldn't build and deploy every technology we think is interesting.

02:23

We really have to first see what we can do, but also what actually makes sense in the market because we even have to spend our own R&D dollars, you know, either working with labs really far out or short term with our own thermal engineering group to decide what is the actual product we're going to build. And so we really have to be focused on what we could build. But does it make sense? Because we can do a lot of really amazing things, but should we do it and can we find someone to use it?

02:48

And the answer to that question you get by talking to customers? Absolutely. So how does that work out? Yeah, so we do a lot of work surveying, we do a lot of customer site visits. We come to a lot of events like Supercomputing in the U.S., ISC, often in Germany, where we actually get a lot of face time with customers and, quite frankly, we do a lot of showing of roadmaps, but making sure that there's time at the end to say, does this actually make sense? Do you actually want this?

03:15

Because we can show products all day long. But if customers aren't giving us signals at the end saying, yes, this aligns with what we want, it often causes us to pause and go back to the team and say we didn't get the reception we wanted. I think we need to do a workshop, we need to do a deep dive to really dig into: do customers actually want this? Were they a bit shy during the meeting, or what is their long term intention?

03:39

Because the worst thing would be us getting all, you know, drinking the Kool-Aid, getting really excited, and we go to market and it's crickets, and that would be the worst case. So we need to be validating every step of the way. So let's now focus on liquid cooling itself: how did it happen and why? So really liquid cooling has been prevalent in HPC for decades. Groups like HPE, IBM, Cray, SGI, they have all been working heavily in the space and built up great expertise.

04:09

We have to use liquid cooling today to provide much better energy efficiency, but also be able to provide much denser racks, smaller racks, smaller infrastructure overall, but even enabling some high TDP processors. So today there's really a split. Like, if we painted with broad strokes, you know, in my opinion, anything over 30 kilowatts per rack is prime for liquid cooling based on really the rack densities, how much stuff is in there. It becomes quite difficult to air cool.

04:40

Anything below that, really between, you know, 15 and 30 kilowatts, is a gray zone. Sometimes it might not be totally, you know, worthwhile dollar wise to liquid cool it. There's more stuff, cold plates, rear door heat exchangers, that might not create, you know, the perfect tradeoff scenario. And then anything below that often doesn't make a lot of sense. And so HPC has been beyond 30 kilowatts for a very long time.
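As a rough illustration of the rule of thumb described here, the thresholds can be written as a small Python function. The function name and the returned labels are hypothetical; the 15 kW and 30 kW cut-offs are the speaker's broad-strokes figures, not a formal standard.

```python
def cooling_recommendation(rack_kw: float) -> str:
    """Classify a rack by power density.

    Thresholds (15 kW and 30 kW per rack) are the broad-strokes
    figures from the conversation; the labels are illustrative.
    """
    if rack_kw < 0:
        raise ValueError("rack power must be non-negative")
    if rack_kw > 30:
        return "liquid cooling"  # too dense to air cool comfortably
    if rack_kw >= 15:
        return "gray zone"       # depends on cost and site constraints
    return "air cooling"         # liquid cooling rarely pays off here


for kw in (8, 20, 45):
    print(f"{kw} kW rack -> {cooling_recommendation(kw)}")
```

In practice the gray zone is where site factors such as electricity price and floor space, discussed later in the conversation, would tip the decision.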

05:05

If you look at any of the, you know, TOP500 systems today, a very large percentage of them are liquid cooled, and anything in the top 50, really from today and going forward, will only be liquid cooled. There will be almost no existence of purely air cooled systems. And it's just because of really the density and how hot the chips are themselves. They need a more efficient thermal technology.

05:26

So it appears to me that while to most engineers scaling means how many systems, for you it's how many systems in a rack you can build or on a... whatever your dimensions are, is that the...? Totally, and it is kind of interesting because when we think about liquid cooling, there's really a lot of layers. At the very micro layer there is the individual server that has some very specific thermal needs.

05:53

There can be very hot CPUs, very hot memory, very hot, you know, voltage regulators that, if they're not cooled effectively, won't work: the boards will shut down or will experience thermal throttling. But even past that server, say you figured out the cold plate cooling at this level, now how do you manage all the heat in a rack? So now we take this, you know, 500, 600 watt server, thousand watt server, and then we put 70 of them, you know, in a rack.

06:21

So these, you know, Cray, 2000 style high density servers. Well, now you need to take all that heat away from the rack. Do you do CDUs, do you do rear door heat exchangers? What's the technology you want to use? And then it scales out to what's the infrastructure in the building. So when we look at systems like Frontier, very large exascale systems, they're doing cold plate cooling, they're doing rack level cooling. And then the building infrastructure has to manage all of that heat.
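To put numbers on the rack-level problem, here is a minimal Python sketch that multiplies out the server counts mentioned above and estimates the water flow needed to carry that heat away, using the standard relation Q = m_dot * c_p * dT. The 10 K coolant temperature rise, the use of plain water, and the assumption that all heat goes into the liquid are illustrative assumptions, not figures from the conversation.

```python
WATER_SPECIFIC_HEAT = 4186.0  # J/(kg*K), approximate value for liquid water
WATER_DENSITY = 1.0           # kg per litre, approximate


def rack_heat_watts(servers: int, watts_per_server: float) -> float:
    """Total heat a rack produces if every server runs at the given power."""
    return servers * watts_per_server


def coolant_flow_lpm(heat_watts: float, delta_t_kelvin: float) -> float:
    """Litres per minute of water needed to absorb heat_watts while
    warming by delta_t_kelvin, from Q = m_dot * c_p * dT."""
    if delta_t_kelvin <= 0:
        raise ValueError("temperature rise must be positive")
    kg_per_second = heat_watts / (WATER_SPECIFIC_HEAT * delta_t_kelvin)
    return kg_per_second / WATER_DENSITY * 60.0


# 70 servers per rack at 600 W and 1000 W each, as in the discussion.
for watts in (600, 1000):
    heat = rack_heat_watts(70, watts)
    flow = coolant_flow_lpm(heat, delta_t_kelvin=10.0)  # assumed 10 K rise
    print(f"{watts} W servers: {heat / 1000:.0f} kW per rack, "
          f"about {flow:.0f} L/min of water")
```

Both cases land well above the 30 kW per rack mark mentioned earlier, which is why the heat has to be handled at the rack and building level and not only at the cold plate.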

06:47

That is a tremendous amount of heat that has to be processed. So there's many layers. And that's what the kind of liquid cooling group is focusing on. We can build great cold plates, but if we don't have the proper technology to take the heat out of the rack itself, the solution isn't going to work. I don't want to interrupt you, but CDU for our customers is our cooling distribution unit.

07:07

Okay, so when we built Frontier, what were the biggest challenges, to the extent that you can share? This is a proprietary solution for sure. So really, of any technology, using Cray EX in that case, in almost all cases, when you're building blades that are not in watts, they're in the kilowatt range, there is a lot of density that's built in there. So you need really two kind of things to be thinking about.

07:34

One, how do we take all of that heat out of the individual blades, bring it back to the racks? In this case, we use large row-based CDUs. We're pairing big CDUs, big pumps, big heat exchangers with, you know, between three and four of these large Cray EX cabinets. But the other really interesting challenge is how do we make it fanless? And so Frontier, like all of the large supercomputers that we’ll build, has no fans in it.

07:58

They're doing 100% liquid cooling, which means there are a lot of components at the blade level. So that's definitely an engineering challenge to work through. And you wanted to make them fanless because of energy efficiency and density? And so really when we're trying to liquid cool these components, anywhere that we have to add fans is also kind of a resiliency issue, reliability. We have moving parts. We want to have as few as possible, but the big thing is really thermal management.

08:28

So if we can remove all the fans and drive everything to the liquid cooled infrastructure, it's very, very energy efficient. If you look at the TOP500 or top green kind of infrastructure rankings, always at the very top are Cray EX systems, because there are no fans; very little electricity is moving the liquid around inside the cold plates. That's really important to these systems. So how do we compare to cloud based solutions? They have, just like us, lots of homogeneity.

09:00

I mean, there are heterogeneous components, but, you know, once you account for GPUs versus CPUs, you have a homogeneous solution in effect. Yeah. So one of the interesting things when we think about cooling is it really comes back to density. So the earlier question, kind of why liquid cooling, you know, how is it evolving in HPC? The rack densities are so high, it really makes easy sense for us to justify the additional infrastructure here.

09:27

But for many cloud computing groups, rack densities are simply not that high. So they have a lot more of a gray area for where they want to play. Do they want to add liquid cooling or are they going to invest in their air cooled infrastructure? Do they move to a state that has very low electricity costs where they can buy land cheaply? We've seen a lot of kind of hyperscalers move out to, well, you know, very low density areas of the U.S.

09:53

They have cheap electricity and they can just build a new data center. But if you're building somewhere like San Francisco, you're going to have kind of a perfect storm. You know, high real estate costs, high electricity costs and not a lot of room to build. That's where high density kind of liquid cooling can really play a strong role. So it kind of depends where groups are building and really what their own corporate agenda is.

10:15

If they have money to burn and energy efficiency doesn't matter and they have the space like we talked about, liquid cooling is often not prioritized. But we see in much of the world, the U.S. and Europe in particular, that even cloud providers are really trying to figure out this technology where they can have very high energy efficiency, but also very high resiliency. They contain components out. Yeah, components fit on whatever they want to put in the racks.

10:41

And that's one of the challenges of liquid cooling. Unlike air cooling, where you have the fans in the server and you can put in kind of anything you want, and as long as there are effective heat sinks built into the servers, things are going to work, with liquid cooling we need to design the cooling components to fit the components. So you have to design a cold plate that's going to fit on the latest AMD CPU or the latest memory or voltage regulators.

11:07

So there has to be more foresight and kind of thinking about how we're going to design products. We can't kind of just slap a PCIe card in there and have it be liquid cooled without thinking.

11:17

How does it work with the greater infrastructure, given the tremendous complexity of the system that you just described and many not moving parts but moving heat paths, if you will, is there any opportunity to do co-design between your products and the whole infrastructure and then hardware and then even software, both systems software and applications? Absolutely. 100%, yes. And so that's where now kind of exciting. We get to talk about next generation infrastructure.

11:51

So today, liquid cooling, you know, really when we talk about Cray EX, Cray XD, even our ProLiant series, all of those products and platforms have liquid cooling today. Most of that is steady state. So often we'll have CDUs with pumps kind of obviously watching pressure, watching temperature, watching flow.

12:13

But they are most often providing pretty stable, constant flow to the rack, really regardless of what's happening. Next generation, and what I would expect in the industry overall, is exactly like you said: this more, you know, harmonious kind of coexistence between power management, you know, software features and the actual liquid cooling infrastructure. And so I believe we'll see much more energy efficiency, you know, kind of backing up, thinking about how server fan tables work today.

12:45

When a server ramps up, the fans ramp up; when the servers are going into a low power state, the fans ramp down. I expect there's going to be more versatility in cooling so we can give very hot processors a lot of cooling, and then for servers that are running at idle, a very low amount of cooling, but that has to be managed through software. And so I think that is absolutely where we're going. And how do you manage it? How do you make sure that some CPUs don't go above the threshold?

13:15

We've been looking at, for example, digital twins, but there's also sophisticated management infrastructure, software management infrastructure, to make sure. How do you do all of that? Yeah, that is, I think, exactly the answer: the power management software within the rack. You know, node controller software and hardware as well. Those pieces have to be talking with each other. So CPUs are running their own internal kind of thermal management so they know what their temperature is.

13:46

They're feeding it back to the kind of larger software controlling the nodes. But that also needs to then communicate with either the CDU or the cooling infrastructure so we can ramp up and down. Right now, there's not a lot of that we see in the industry where there is this kind of beautiful harmony between all the systems, but that is where we're going to go. And so that definitely comes from the software. Speaking of the direction where we are going, what comes after liquid cooling?
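The feedback loop described in this answer, with chips reporting temperatures to node software that in turn asks the cooling infrastructure to ramp up or down, could be sketched in Python roughly as follows. This is a deliberately simple proportional mapping under stated assumptions: the function name, the 70 °C target, the 85 °C limit, and the 20% minimum flow are all hypothetical values chosen for illustration, not HPE's control logic.

```python
from typing import Sequence


def pump_flow_fraction(
    cpu_temps_c: Sequence[float],
    target_c: float = 70.0,   # hypothetical: below this, minimum flow is enough
    limit_c: float = 85.0,    # hypothetical: at or above this, run pumps flat out
    min_flow: float = 0.2,    # hypothetical: never drop below 20% flow
) -> float:
    """Map the hottest reported chip temperature to a pump flow fraction
    between min_flow and 1.0, scaling linearly between target and limit."""
    if not cpu_temps_c:
        return min_flow
    hottest = max(cpu_temps_c)
    if hottest <= target_c:
        return min_flow
    if hottest >= limit_c:
        return 1.0
    span = (hottest - target_c) / (limit_c - target_c)
    return min_flow + span * (1.0 - min_flow)


# Idle rack, busy rack, and a rack with one chip near its limit.
for temps in ([40.0, 45.0], [72.0, 77.5], [60.0, 90.0]):
    print(temps, "->", round(pump_flow_fraction(temps), 2))
```

A production controller would also have to watch pressure and flow, as mentioned earlier, and would likely smooth its response over time; the point here is only the direction of the data flow, from chip telemetry to the CDU.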

14:12

That's a very good question. So it will be more liquid cooling. You know, when we talk about the future of liquid cooling today, I would say the most prevalent technology deployed today is skived fin cold plates. It's technology that has been around for, you know, dozens of years, is very reliable, very prevalent in the industry. What will come next, I think, is going to be really hybrid technologies. So we're going to see more two phase cold plate cooling.

14:41

So we're going to see, you know, within the actual cold plate, a phase change from liquid to gas. I believe we'll see more immersion. I believe we'll still see a lot of single phase. And this is, you know, always a difficult question to answer because we can see the processor roadmaps. TDP is going up very quickly. You know, what used to be a high end processor three, four years ago was 250, 300 watts for a CPU. We're going to see beyond 500 watts very soon.

15:11

GPUs are going to be over a thousand watts. And so the future is going to be, I think, a lot of different technologies for different customer uses. But to be very honest, single phase cold plates are still one of the most reliable, most resilient technologies that just work. And though there is a lot of different stuff happening in the market, many OEM groups have used that technology for a long time and it has worked very well.

15:37

And so I think we're going to see a lot of different stuff come out. The winners are going to be the most reliable, resilient. How easy can we swap out the components, But how do they just work? And the simplicity is going to be an important part. You know, we talk about this even with labs. There are a lot of really amazing technologies that are out there that from a science perspective are tremendously cool.

16:04

But how do we deploy them at scale and build them reliably and then put on an HPE warranty and say you're good to go? Yeah, because that's the problem. We can build anything we want. We have some of the smartest people in the world working on our industry's challenges.

16:18

But if you need, you know, your own supercomputer to manage the cooling infrastructure because it's a very sophisticated system that's kind of on the thin edge, you know, when we talk about future cooling that needs like absolute zero, how do you regulate that? Is that something any customer can buy? Maybe not all the time. So we need stuff that's really reliable, that just works. So you mentioned it has to work for AMD, for Intel, for GPUs.

16:49

There's some underlying implied interoperability that you need to or supporting multiple components. Is there any implied standard? How do you deal with these issues? Yeah, so, you know, from just a fit, form and function perspective, you know, between an AMD CPU and an Intel CPU, you know, an NVIDIA GPU, as long as we understand kind of the socket geometry, what its thermal characteristics are, especially around TCase.

17:22

So the maximum case temperature these CPUs or GPUs can go up to before they experience thermal issues, we can design around them. Now the nice thing for the most part is an AMD CPU, within a short period of time from one generation to the other, will most often have the same socket kind of dimensions. So if we design a cold plate for one and the thermal characteristics make sense, we can use that same cold plate for the next generation.

17:51

As long as we have the kind of visibility to the next generation, we can design very easily for that. So often it's the geometry of how the CPU is physically built with the socket and then its thermal characteristics. We work very closely with all those groups. So often we are building our solutions with them multiple generations in advance. And really those are kind of some of the key criteria we have to look at. So I'd like to touch a little bit on the business model.

18:17

It appears to me that liquid cooling is highly proprietary. That's our key differentiator for our largest top supercomputers. Is there any opportunity for open hardware equivalent for open liquid cooling? I'm not saying you give it away, but are there some interfaces that you can expose that others can innovate, motivating them to continue using our or maybe perhaps other business model for HPE?

18:48

Yeah, this is a very interesting question because a lot of people ask about this, and the thing I always like to explain is today, let's say, you know, no liquid cooling infrastructure at all, just air cooled stuff, HPE and Dell and Lenovo and all these different groups are not looking to sell multi vendor rack solutions. You know, it's very rare that you would see Dell gear with HPE gear in the same rack. You know, it's not very common.

19:19

And so when we, you know, accept that as the standard that when customers buy a rack of stuff, it's going to be primarily HPE. I believe that kind of gives way that HPE can then control the infrastructure. Now, how this rack integrates with the rest of the facility that really matters. We need to use standardized connections. Now, today that happens, which is very nice. We use sanitary flanges, tri clamps. There's a lot of kind of industry standard piping standards that we use.

19:48

So it's not very difficult to, you know, build a rack of, you know, HPE Cray XD stuff and then another vendor can build in the same data center to meet the cooling you know, componentry is part of the HPE IP differentiation. And so I think we're at a very interesting tipping point where today there's a lot of stuff on the market that is similar. It uses skived cold plates. We talked about how there's a lot of rear door heat exchanger vendors.

20:19

So the components are very similar. They use similar connections, quick disconnects, you know how the technology works. But I have always believed that, you know, when we move to this new thermal curve we're seeing for processors, there are going to be certain companies that can simply cool it and some that simply cannot. And there's kind of a perfect storm happening where, when we talk about TCase, the thermal kind of design parameters are going down.

20:48

Components want to be cooler before they experience thermal shutdown, but at the same time, customers want to use warmer water. That's all about the energy efficiency. So when this window starts getting smaller and smaller, how do you use technology that maybe isn't ready, wasn't designed for that kind of componentry? So I think we're going to see some companies really focus on their innovation because their server technology is so amazing.

21:12

But now paired with cooling solutions that competitors don't have. As long as we have, you know, components that can connect to the facility piping, that's important. But the rest of it, I think we will see more differentiation happening with groups. And I don't know if it needs to all be, you know, totally open because that's going to be part of that group's kind of value offering.
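The shrinking window between TCase and coolant temperature described above can be made concrete with a small Python calculation. Treating the path from chip case to coolant as a single thermal resistance, the cold plate must achieve R <= (TCase_max - T_inlet) / P. The function name and the temperature figures below are hypothetical, picked only to show the trend; the 500 W and 1000 W chip powers echo the CPU and GPU figures mentioned earlier in the conversation.

```python
def max_thermal_resistance(
    t_case_max_c: float, coolant_inlet_c: float, chip_watts: float
) -> float:
    """Largest case-to-coolant thermal resistance (K/W) that still keeps
    the chip at or below its maximum case temperature."""
    if chip_watts <= 0:
        raise ValueError("chip power must be positive")
    headroom = t_case_max_c - coolant_inlet_c
    if headroom <= 0:
        raise ValueError("coolant is already at or above the case limit")
    return headroom / chip_watts


# Hypothetical scenarios: the TCase and inlet temperatures are illustrative.
scenarios = [
    ("500 W chip, TCase 85 C, 32 C water", 85.0, 32.0, 500.0),
    ("1000 W chip, TCase 75 C, 40 C water", 75.0, 40.0, 1000.0),
]
for label, t_case, t_in, watts in scenarios:
    r = max_thermal_resistance(t_case, t_in, watts)
    print(f"{label}: cold plate must reach {r:.3f} K/W or better")
```

Lower TCase, warmer water, and higher power each shrink the allowed resistance, and in this illustration the three together cut it to about a third, which is the "perfect storm" the speaker refers to.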

21:33

Where I was going more is that there is a flurry of new accelerators, including very large ones, different ones, possibly different geometries. Do you have any guidance for them? Because it doesn't make any sense for them to build their own liquid cooling, because their value differentiations are accelerators, possibly compiler software, not building the whole infrastructure.

21:58

Could they somehow leverage our approaches, license, buy power, racks, and all of that so that it's win-win for them and for us? Definitely. Yeah. So I think that is going to be something that HPE does very well. We know a lot about infrastructure. We are really a solutions company.

22:18

You know, when we look to groups like AMD and Intel, they are amazing at building fantastic chips, but they need partners like us that build the rest of the solutions. Even all the way down to who's going to install these systems. And so we work very closely with a lot of those groups. On what will the full solution be, where we understand the parameters of their chips, how hot they will be.

22:39

Like I said, going back to basics, we'll build from the cold plates, to the CDU to the rack, to the whole infrastructure. That's the role that we play. So we do a lot of collaboration today. I was really excited to see with what energy you talk about all of this. It reminded me of a startup guy talking not of an enterprise guy. So you, you are definitely working at the bleeding edge of innovation. So what's the difference?

23:05

What's the difference between your enterprise corporate role as a product manager and a startup guy? Sure, yeah. So that's very relevant because my experience has primarily been working with startups and even kind of coming to HPE.

23:19

I worked for a group that worked with HPE, and HPC to me really is a startup and, you know, it's kind of cliche, but I think it is really true when we look at our organizational stack and especially when we look to our kind of colleagues in ProLiant, that enterprise level server. For them, you know, we always kind of laugh at ourselves because we ask them about run rates and about kind of predictability of their deal flow in the enterprise market.

23:48

And, you know, it's not extremely predictable, but it's much more predictable. In HPC, there's a lot that happens year over year. You know, there's only so many exascale deals, there's only so many very large HPC opportunities. And even at the mid-tier, there's so much that changes in the market. And so we really have to be adaptable. And so, you know, in my time at HPE, I have seen such tremendous change within our teams.

24:15

Out of necessity, you know, one year we are focusing on a lot of exascale systems. We are looking at HPC with a, you know, fine tuned vision on what will the software be next year. Now AI is becoming extremely prevalent. What is an AI solution? Is it just software? How does the hardware marry with it? And so we are constantly scrumming as our own teams, kind of cross group.

24:39

So right now, even for what we're doing next generation, we are running software core team, service core team, service is going to be such an important part of what we do, hardware core teams. Liquid cooling. But these groups are all talking because we ourselves are going to get left in the past if we are not, you know, two years out, we released a product, but software wasn't aware of it. Well, that's a big problem where software is working on something that doesn't work without service.

25:04

How are we going to marry them together? And so it's very startup mentality where we're not meeting, you know, quarterly, we're meeting weekly, talking about what's happening. People are very open, you know, not quite radical candor, but very transparent intentions. If we don't figure out serviceability, no one's going to buy the system. If the software doesn't make sense, how is the hardware going to work? So we're all working very much in a startup mentality.

25:28

So we try to push, push, push that we are an agile group. I love your definition of HPE as a marriage counselor for HPC and AI. Yeah, exactly. Exactly. So tell me a little bit. You're coming from Calgary. Is that right? That's right, yeah. So how is... I mean, and by the way, we are now here at SC23. Yeah. Used to be called Supercomputing, in Denver. I guess temperature wise it's similar, very similar, but other than that.

25:56

Well, so it's actually funny, when I come to Denver, it actually kind of feels like Calgary. Sometimes we can see the mountains from our hotel room, just like Calgary. Still get a lot of snow in the winter.

26:05

And what is always very interesting, you know, we talk about Denver, but then we also talk about Houston with HPE and all the different groups, is there's so many similarities that, you know, even Denver and Houston, when I tie those two together, have a very kind of oil and gas kind of historical presence. That's where money was made. That's where a lot of the smart people went. And what is so interesting, Calgary is very much like that: as oil and gas kind of booms and busts.

26:33

So does technology. And I see kind of these kind of symbiotic curves happening when oil booms, all the tech kind of groups kind of go into oil. How can we build oil focused technology as a very large kind of oil and gas focused kind of sales group? We built some really amazing technologies, but when oil crashes, we see all these very smart people coming to tech.

26:55

And so there are a lot of people we work with that have at some time worked in oil and gas that also work in supercomputing and then in Calgary also build beautiful breweries. I often see that in the U.S., too. When oil crashes, all these smart people with money use their brains to build stuff that we love to use, you know, evenings and weekends. So we see a lot of that kind of boom and bust cycle happening. We'll compare the brewery notes separately. There you go. Yeah.

27:21

I think in general HPC is very humanity friendly because we are solving not just oil and gas but also weather and many other problems. So have you ever thought about it? Did it ever fulfill you more than doing other jobs? For sure. I think that that is really, you know, an important part of our basic kind of hiring strategy. Our basic kind of corporate culture is we are not just selling computers. We're not selling kind of laptops to fill quota.

27:53

What we're doing is really pushing humanity forward. So when we look at Frontier, it's such an easy flagship thing to talk about. It is working on some of humanity's greatest problems because it is a combination of high powered hardware, software, fabrics, all kind of the machine that makes it happen. That is something definitely to feel good about. And you can bet at every interview that we run for new grads and for folks kind of moving industries, we lead with that.

28:21

We're not just building computers or laptops. We're building kind of technology that moves humanity forward. Definitely. That's what I tell my family I do. It makes me feel good about it. So, well, I've done future workforce evaluation and the current generation wants to be fulfilled. They're not caring only about salary, about, you know, all kinds of old thinking, but they want to know that their company is doing something good.

28:49

I usually end this discussion with a recommendation for a book. Do you have any good one that you can recommend, if you even have the time to read? Yeah. So one of the books that I actually use quite often, that I talk with my new team members about, it's called The First 90 Days, and it 100% was recommended to me by a friend in the industry before I joined HPE. I read it and used it, and I literally just recommended it to some new staff that joined last week.

29:22

The main kind of takeaway for me, one of the important nuggets, is really what are you going to do in the first 90 days to really let people know you exist? But to really clarify, what does the group think that you're going to be doing and what's your own kind of path forward? And though that's kind of generic advice, I think that in any organization you work in, being the new person can be difficult. Especially a group like HPE, we're so large.

29:47

We've so many different kind of departments and groups that in the first 90 days you really want to understand what's the culture of the organization, how can you contribute to it? But really, what do people think that you're going to do? And so when I joined, I introduced myself as kind of the liquid cooling guy. Here's what I think I'm going to be doing. Do you think that makes sense?

30:07

Because there's nothing more genuine in candor than the people working on the ground floor saying, you were hired to do what? No, no, no, no. We have some big problems over here. If you have time to tackle that, that would be tremendously helpful. And so you can kind of map that out. Here's what my job title says. But here's some low lying fruit that makes sense, that I can accomplish easily, but that moves our group forward. And so I like that book.

30:30

It's just filled with very, like, straightforward advice, nothing HPC related, but just kind of how to get in there quick and get your hands dirty. Jason, you had so much advice for me, for our audience. Thank you very much. Thanks. I really appreciate it. Thank you.

Transcript source: Provided by creator in RSS feed