AI is booming, and GPUs are getting more and more advanced. The demand for protection ICs is also going up, to keep these servers protected with low downtime.
Hello and welcome to the Podcast for Engineers. It's the podcast you just have to listen to if you're interested in what's going on in the semiconductor market. I'm your host, Peter Balint, and today we are joined by Nitish Agarwal. He and his team are defining and developing new products for AI servers in the power IC group here at Infineon. Welcome and thanks for joining us today.
Thank you, Peter, for having me.
Powering AI is not simply about providing power to a rack of servers within an AI data center. There are also other things to consider, such as swapping out a tray within a rack. This presents a whole new set of challenges, and this is where protection ICs come into play. And this is what you and your team are focused on.
So tell us a little bit about these protection ICs in the AI domain.
Okay, so let's break that question down. What is powering AI? AI is basically powered by GPUs. A GPU is a graphics processing unit, which does all the computations for AI, running highly complex mathematical algorithms. These server GPUs work in parallel to solve a particular query. Now, these GPUs work at a voltage of 0.8 to 1 volt.
Now, it's not feasible to distribute 0.8 to 1 volt. So the traditional approach was a 12 volt architecture, where the backplane was 12 volts and that 12 volts was converted down to 0.8. Now, as we know, there is a boom in AI, and the machine learning models are getting more and more power hungry. As the GPUs got more and more power dense, the industry moved towards a 48 volt architecture. Now, as I stated, we have many of these GPUs connected in parallel.
If one goes out of service, we need to replace that GPU without affecting anything else. So we have these trays: we take the tray with the bad GPU out and put a new one in. When we insert the new tray into the rack, the first thing the backplane sees is a capacitor. That capacitor, when connected to the 48 volts, can draw a huge inrush current.
This is because the capacitor is completely discharged. This inrush current can blow the fuse, create a surge in the existing GPUs, and even take the whole system down. So with our hot swap controllers, we limit that inrush current by operating a MOSFET in its linear region.
Can you give us an overview of how these protection ICs work?
So as I explained, we have to limit this inrush current, because connecting the 48 volts directly to the capacitor can result in a huge inrush current. Now, we could put a resistor in there to limit the inrush current, but that leads to a lot of losses, so we would lose all the advantage of having a 48 volt system. Plus, it's not efficient.
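The trade-off with a plain series resistor can be made concrete with a rough sketch. The resistor values and the 1 kW tray power below are illustrative assumptions, not figures from the conversation, and the load current is approximated as P / V, ignoring the voltage dropped across the resistor itself.

```python
def peak_inrush_a(bus_v: float, series_r_ohm: float) -> float:
    """Worst-case inrush into a fully discharged capacitor: I = V / R."""
    return bus_v / series_r_ohm


def steady_state_loss_w(load_w: float, bus_v: float, series_r_ohm: float) -> float:
    """I^2 * R loss in the resistor once the tray is running.

    Simplification (assumption): load current is taken as P / V,
    ignoring the voltage dropped across the resistor.
    """
    load_a = load_w / bus_v
    return load_a ** 2 * series_r_ohm


BUS_V = 48.0     # backplane voltage from the discussion
LOAD_W = 1000.0  # illustrative tray power (assumption)

for r_ohm in (0.1, 0.5, 1.0):  # illustrative resistor values (assumption)
    print(
        f"R = {r_ohm} ohm: peak inrush {peak_inrush_a(BUS_V, r_ohm):.0f} A, "
        f"continuous loss {steady_state_loss_w(LOAD_W, BUS_V, r_ohm):.0f} W"
    )
```

A resistor large enough to tame the inrush current burns a large amount of power continuously, which is the inefficiency described above.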
Alternatively, we could use a negative temperature coefficient (NTC) thermistor, which reduces its impedance once it gets hot, but it's not reliable because its behavior changes with temperature. So we use a hot swap controller. Basically, a hot swap controller works with an external MOSFET: the controller controls the gate voltage of the MOSFET and operates the MOSFET in its linear region.
At startup, the MOSFET acts as an impedance between the backplane and the capacitor, slowly charging this output capacitor. Once the capacitor is charged, the FET goes into a fully on state, providing very low impedance. After the hot swap event, the hot swap controller is also responsible for telemetry: voltage, current, power, temperature, fault status, and warnings. It provides all of that over a 1 megahertz PMBus communication interface.
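The soft-start behavior described here can be sketched with a very simple model. It assumes the controller holds the charging current at a constant limit and that the tray draws no load current during startup; the 2 mF capacitance and 2 A limit are illustrative assumptions, not values from the conversation.

```python
def charge_time_s(cap_f: float, bus_v: float, limit_a: float) -> float:
    """Closed form for constant-current charging from 0 V: t = C * V / I."""
    return cap_f * bus_v / limit_a


def simulate_soft_start(cap_f: float, bus_v: float, limit_a: float,
                        dt_s: float = 1e-5) -> float:
    """Step the capacitor voltage forward at a constant current limit.

    Returns the time at which the capacitor reaches the bus voltage,
    i.e. the point where the FET could be switched fully on.
    """
    v_cap = 0.0
    t = 0.0
    while v_cap < bus_v:
        # dV = I * dt / C while the FET is held in its linear region
        v_cap = min(bus_v, v_cap + limit_a * dt_s / cap_f)
        t += dt_s
    return t


CAP_F = 2e-3    # illustrative output capacitance (assumption)
BUS_V = 48.0    # backplane voltage from the discussion
LIMIT_A = 2.0   # illustrative inrush current limit (assumption)

print(f"closed form: {charge_time_s(CAP_F, BUS_V, LIMIT_A) * 1e3:.1f} ms")
print(f"simulated:   {simulate_soft_start(CAP_F, BUS_V, LIMIT_A) * 1e3:.1f} ms")
```

A lower current limit lengthens the charge time in proportion, which is the basic trade-off between inrush current and how long the MOSFET spends in its linear region.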
Plus, it also helps to isolate faults. If anything goes wrong inside the tray, it detects that fault, isolates it, and prevents it from propagating from the tray into the rack. At the same time, if it sees anything go wrong in the rack or on the backplane, like a surge, overvoltage, or undervoltage, it detects that and prevents it from going into the tray. So these are the three things that a hot swap controller does.
Okay. Now let's make things real.
Give us an example of an industry or an application where we find these protection ICs.
As an example, let's say you are writing a question in ChatGPT. You write your query in ChatGPT and it goes to a server, where several GPUs work in parallel, run these complex algorithms, and provide you an answer. But all of this runs on electronic circuits, so it's possible that one can get damaged. You have several of these processors working in parallel.
One can get damaged. The load is then transferred to the healthy ones, and the damaged one reports through PMBus. It will tell, not the user, but the maintenance guy for the server, that something is wrong with the GPU. The maintenance guy can then go in, take the bad tray out of the rack, and put a good one in without turning off the backplane. So the end user who's writing the query will not see anything that has happened in the background.
Can you give us an idea of what some of the trends are in this world of protection ICs?
Yes. For the future trends: this hot swap solution is a discrete solution. You have a controller, a discrete MOSFET, a current sensing shunt, and a temperature sensor. If you go back to the 12 volt architecture, that's how it started as well, and then it switched to something called an efuse, an integrated solution.
We are seeing the same trend in the 48 volt backplane, where the industry is looking for an integrated solution that combines the controller, the MOSFET, the current sense shunt, and the temperature sensor in one package, which improves power density. Plus, this will be a stackable solution. Let's say you have one efuse of 20 or 30 amps, about 1 kilowatt of power. If you need 2 kilowatts, you can just stack two in parallel; for 3 kilowatts, three in parallel.
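The stacking arithmetic can be written down as a small sketch. The 1 kW per-efuse rating is the round figure mentioned above; the `derating` parameter and the assumption that parallel efuses share current equally are illustrative additions, not something stated in the conversation.

```python
import math


def efuses_needed(load_w: float, efuse_rating_w: float = 1000.0,
                  derating: float = 1.0) -> int:
    """Number of parallel efuses needed to carry a given load.

    Assumes ideal, equal current sharing between the stacked devices.
    `derating` (hypothetical knob) scales the usable rating of each
    device, e.g. 0.8 to keep 20 % headroom.
    """
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    usable_w = efuse_rating_w * derating
    return math.ceil(load_w / usable_w)


for load_w in (1000.0, 2000.0, 3000.0):
    print(f"{load_w / 1000:.0f} kW -> {efuses_needed(load_w)} efuse(s)")

# With headroom, the same 2 kW load needs one more device.
print(f"2 kW at 80 % derating -> {efuses_needed(2000.0, derating=0.8)} efuse(s)")
```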
So that's the trend we are looking at right now. Plus, as you know, these AI server backplanes are growing, and as they grow, the energy demand is also growing. The 48 volt backplane is not able to meet that demand at the moment, so the industry is looking to move towards 400 volt and 800 volt backplanes. That will require a whole new family of hot swap controllers and compatible MOSFETs.
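One way to see why higher backplane voltages help is to compare the current needed to deliver the same power. This is a minimal sketch: the 6 kW figure is an illustrative assumption, and the loss comparison assumes the same distribution resistance at every voltage, which a real system would not have.

```python
def bus_current_a(power_w: float, bus_v: float) -> float:
    """Current needed to deliver a given power at a given bus voltage: I = P / V."""
    return power_w / bus_v


def relative_conduction_loss(bus_v: float, reference_v: float = 48.0) -> float:
    """I^2 * R loss relative to the reference bus, for the same power
    and the same path resistance (assumption): (V_ref / V)^2."""
    return (reference_v / bus_v) ** 2


POWER_W = 6000.0  # illustrative load (assumption)

for bus_v in (12.0, 48.0, 400.0, 800.0):
    print(
        f"{bus_v:>5.0f} V: {bus_current_a(POWER_W, bus_v):7.1f} A, "
        f"conduction loss x{relative_conduction_loss(bus_v):.4f} vs 48 V"
    )
```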
And can you tell us some of the challenges that you and your team are facing?
The challenges. So, when this 48 volt architecture was defined and we started working on hot swap controllers for it, it was roughly 1 kilowatt of power. We thought it would be 1 kilowatt, or maybe 1.5. Now we are seeing customers doing 4 kilowatts and 6 kilowatts. That means you have more and more output capacitance. The controller stays the same. It's the MOSFET that is the challenge now.
What we are seeing is that the MOSFET needs to be capable of handling this much capacitive charging current. We have some techniques we can use, such as pulsed current mode control, so that the FET does not stay in the linear region for a long time and has time in between to cool down. But this is also beneficial. Now the industry wants a second source for everything, because you cannot rely on a single source.
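The stress on the MOSFET can be estimated with a standard circuit result: when a capacitor is charged from 0 V to the bus voltage through a series element, with no load current, the series element dissipates 0.5 * C * V^2 in total, regardless of how the current is shaped. The sketch below also splits that energy across pulses of equal voltage step, which is only an illustrative model and not a description of the pulsed current mode control mentioned above. The 2 mF capacitance is an assumption.

```python
def fet_energy_j(cap_f: float, bus_v: float) -> float:
    """Total energy dissipated in the FET while charging C from 0 V to V
    (no load current during startup): E = 0.5 * C * V^2."""
    return 0.5 * cap_f * bus_v ** 2


def energy_per_pulse_j(cap_f: float, bus_v: float, pulses: int) -> list[float]:
    """FET energy for each pulse if every pulse raises the capacitor
    voltage by the same step (illustrative model, assumption)."""
    step_v = bus_v / pulses
    energies = []
    for k in range(pulses):
        v0, v1 = k * step_v, (k + 1) * step_v
        # Integral of (bus_v - v_cap) * C dv_cap from v0 to v1
        energies.append(cap_f * (bus_v * (v1 - v0) - 0.5 * (v1 ** 2 - v0 ** 2)))
    return energies


CAP_F = 2e-3  # illustrative output capacitance (assumption)
BUS_V = 48.0  # backplane voltage from the discussion

total_j = fet_energy_j(CAP_F, BUS_V)
pulse_j = energy_per_pulse_j(CAP_F, BUS_V, pulses=4)
print(f"total FET energy: {total_j:.3f} J")
print("per pulse:", [round(e, 3) for e in pulse_j])
```

In this model the total energy is unchanged by pulsing; what changes is that it arrives in smaller portions with cool-down time in between, and the first pulse, when the voltage across the FET is highest, is the hardest one. The total also scales linearly with capacitance, which matches the point that more output capacitance puts more burden on the MOSFET.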
But it's very difficult to find a MOSFET that is as capable as the first one. So when you do the design, you have to design for the worst case, for the weakest MOSFET you have in your design. That's becoming quite challenging for us right now.
And if we now take a look towards the future, what kinds of things do we see? Can you make a prediction on where things are going, where things are headed, in the next 5 to 10 years?
What I can say is that in the immediate term we will see efuses, as I explained.
Then, maybe in a two-year time frame, we will start seeing these 400 and 800 volt backplanes with their own family of hot swap controllers and discrete MOSFETs. I don't know if we will go with efuses on those or not; it's too much energy. At the same time, this is all a gamble, I would say, because it will require a lot of training for the individual who's doing the hot swapping at 400 and 800 volts. It used to be 12 volts or 48 volts; now it's 400. It's extra risk.
So it requires a whole new set of testing, which adds cost. Then, probably, if GaN technology has improved, we could see GaN coming into the picture, maybe inside an efuse, or with an external MOSFET for doing the hot swapping. One customer guides us one way, others a different way. For the 400 and 800 volt architecture, we have received some guidance. For the existing hot swap, there is some direction that customers are pointing us in. And the efuse is still new.
They have to test how it goes, and then probably it will broaden in that direction. Maybe we move away from discrete and go to the efuse, but we need to see what happens in the next year or so.
Well, thank you so much for taking some of your time today to join us. We really appreciate it.
Yeah. Thank you.
And for our listeners, thank you for being here today. If you have any comments or suggestions about future episodes, please send us an email at wepowerai@infineon.com.
Thank you and see you soon.
