How Fast Is Fast?

Matt Godbolt

00:20

Hi, Ben.

Ben Rady

00:21

Hi, Matt.

Matt Godbolt

00:22

How are you doing?

Ben Rady

00:23

I'm doing pretty well.

Matt Godbolt

00:25

We've just had a moment of exasperation, haven't we?

Ben Rady

00:27

Yeah, we're probably talking over the intro. Probably not at this point, but maybe earlier.

Matt Godbolt

00:32

Maybe earlier. I don't know. Yeah, that... it'll... I'll fix it in post, as I always try to do, or increasingly less. In fact, I try. I think I've kidded myself that the charm is all about like the fact that a dog's barking in the background and there are children screaming across the road and... we lip smack and other things. It's kind of ASMR actually, isn't it?

Ben Rady

00:59

Yeah.

Matt Godbolt

01:00

No. Yeah. ASMR. So recently there's been a lot of talk about the Dutch manufacturing company that makes the lithographic systems for chip manufacture. And they're called ASML. And I just can't help but get the two things mixed up.

Ben Rady

01:17

Ah, every time.

Matt Godbolt

01:17

They're very different things, but that's not what we came here to talk about today. Is it?

Ben Rady

01:23

It's not, it's not. I have an idea. I haven't... I have a thing.

Matt Godbolt

01:27

You have an idea. Okay. Tell me about your thing. Whoa.

Ben Rady

01:32

Let's say, careful now. Let's say for the sake of argument, I have a program. Stop it, phone.

Matt Godbolt

01:43

There we are.

Ben Rady

01:43

I have, exactly.

Matt Godbolt

01:43

Some more charm. and More charm has just entered the room. His phone, Ben's phone, unmuted phone.

Ben Rady

01:49

I have a program and I want it to run fast. How would I do that?

Matt Godbolt

01:56

Oh, wow. That is a great question. And we only have around about 30 minutes, 40 minutes to talk about it. So I've got a feeling.

Ben Rady

02:07

So immediately this is gonna be our first 10 part series.

Matt Godbolt

02:10

Maybe so. I don't know. I don't know. This is one of those questions because I think obviously deliberately you have asked me a leading and vague question. Because what do you mean by, in fact, each of those words, right?

Ben Rady

02:25

Can you please define each of the words that you just used in that sentence?

Matt Godbolt

02:29

Yeah. Right.

Ben Rady

02:29

Yes, exactly.

Matt Godbolt

02:29

My program, make my program fast.

Ben Rady

02:32

Right.

Matt Godbolt

02:33

All right. So I think make, we can, yeah, cause it to become my program. You have something that you wish to go fast. Right. So the only word we really need to define then is fast. And I'm pretty sure you don't mean stopping eating for a period of time.

Ben Rady

02:48

Right. Well, I think you could also define my program because in this hypothetical thing is I haven't written anything yet.

Matt Godbolt

02:54

Oh, oh

Ben Rady

02:54

Right. It's like, I want to write a program that doesn't exist, but I want the execution of that program to be fast for some definition of fast that we will define shortly.

Matt Godbolt

03:07

okay. Right. And so, yes, not no eating, not also unmoving, which is the weird other case of you know stuck fast, which is like one of those really weird auto-antonymic words, you know, like inflammable or something like that.

Ben Rady

03:21

Right. Right. Uh-huh. Yes. Stuck fast. It's the opposite of this. Yeah. Uh-huh. Right.

Matt Godbolt

03:22

You mean quick, to go quickly. All right. So let's talk about what that could mean. Right.

Ben Rady

03:26

Yes.

Matt Godbolt

03:26

So the instant you say that and because of what we've been doing professionally for a long time and for what I've been doing most of my career, I think of low latency as fast. Like if you if you decide you need to do a thing, if your program needs to do a thing and it discovers that that need, it should act upon it as quickly as possible having discovered that need.

Ben Rady

03:50

Mm-hmm.

Matt Godbolt

03:50

That is what I mean by sort of low latency. It reacts to a real-time thing as quickly as possible. And I think a lot of embedded systems work in this world. A lot of UI, in fact, is this, right? If you said, hey, I want my, you know responsiveness in a UI is about latency, right?

Ben Rady

04:05

Right.

Matt Godbolt

04:06

But that's one form of fast.

Ben Rady

04:08

Mm-hmm.

Matt Godbolt

04:09

But there's a very real, very important other kind of fast, which is I am... I'm writing a distributed message passing system. You know I'm... I've Mastodon, right? I've... and I'm dealing with an enormous volume of information. And if I was slow at doing that, eventually I would back up and there would be I would never keep up with with the current... Well, it doesn't matter. Individual messages may take a few seconds to go through if that's not fast.

Ben Rady

04:27

Yeah.

Matt Godbolt

04:40

I mean, maybe that is fast for that scalar network, right? But you're talking about a throughput problem, you know, or bank clearing is another one.

Ben Rady

04:46

Right.

Matt Godbolt

04:46

They've kind of got a bit of latency.

Ben Rady

04:47

Mm-hmm.

Matt Godbolt

04:47

You beep your card when you're at the ah the coffee shop and it goes beep and sort of you've got it waiting six or seven seconds and that's a completely fine human amount of time. You know, it's an eternity for computers. But on the other hand, there are... several hundred million people in that same five second period tapping their card on their reader in the same time.

Ben Rady

05:07

Right, right.

Matt Godbolt

05:07

And so there's a lot of things to go. So those things, so we've got latency and throughput. And are there any other aspects?

Ben Rady

05:17

I think for the definition of this conversation, I think latency and throughput are two dimensions of fast which we should consider.

Matt Godbolt

05:25

Okay. Okay.

Ben Rady

05:27

I think you could maybe come up with some other ones, but again, we have 30 minutes.

Matt Godbolt

05:32

Well, I mean, again, we can make this into more. but So so do you want which one do you want to start with?

Ben Rady

05:37

That's true. I think we should start with latency.

Matt Godbolt

05:39

Okay. Okay. So then we get into this sort of amazing set of kind of questions that we get into when we're dealing with computers because how fast is fast now? Right, how quick is – like we just said, if I'm a human waiting at a sales kiosk, If it takes me 30 seconds to pay for something, that is not fast. If it takes me three seconds, it's probably okay.

Ben Rady

06:08

Mm-hmm.

Matt Godbolt

06:09

If it takes 0.3 seconds, I won't notice that that anything happened, really.

Ben Rady

06:13

Yeah.

Matt Godbolt

06:13

There's a beep and I'm out of there. And any other order of magnitude less than that, like 0.03 seconds or whatever, is immaterial to me at that point, and and it doesn't matter.

Ben Rady

06:22

Right. Mm-hmm.

Matt Godbolt

06:22

But you and I work in a world where at least some of the things that we have done or are adjacent to, we are talking about maybe seven or eight orders of magnitude faster than those three seconds.

Ben Rady

06:35

Mm-hmm.

Matt Godbolt

06:35

We're talking... Hundreds of nanoseconds, hundreds of microseconds.

Ben Rady

06:39

Yeah.

Matt Godbolt

06:39

And I have to remind myself that a nanosecond is a billionth of a second, right?

Ben Rady

06:44

Yeah. Yeah.

Matt Godbolt

06:44

You know, sometimes like we throw these words around. I mean, maybe this is just, again, you and I and the weird world that we live in. But like nanoseconds are kind of a completely normal measurement for a, you know, three gigahertz computer.

Ben Rady

06:56

Right.

Matt Godbolt

06:56

A few clock cycles is a few nanoseconds. That's completely normal. Yeah, that's reasonable. And, you know, like to, you know, good old Grace Hopper's, you know, light foot measurement, that a nanosecond is. Here's a foot of copper wire. That's a nanosecond. You're like, oh, shucks, man. You know, that's the speed of light. It's the fastest thing that there is can only go a couple of feet in the time it takes me to add two 64-bit numbers together.

Ben Rady

07:07

Yeah. Yeah. Uh-huh.

Matt Godbolt

07:20

That is insane, frankly, that we're at that stage.

Ben Rady

07:24

It is insane. You know, whenever I think about that, I just think about Grace's wire just twisted around on the inside of the CPU and all the places where it would need to go in order for the light to move in that way to add those two numbers. And it just makes my brain melt.

Matt Godbolt

07:37

It's, yeah, it is quite something. And the fact that, yeah, we've taken these Factorio factories and we've crushed them down into the size of my you know thumbnail and somehow snaking around in there are many, many, many, many hundreds of feet of cable.

Ben Rady

07:56

Yeah. Yeah.

Matt Godbolt

07:57

As you say, each individual thing. Electron is – well, they don't go all the way down. It's complicated, right? Let's stop pretending that we can talk about that as well.

Ben Rady

08:07

Yeah.

Matt Godbolt

08:07

So, yeah, we've got many orders of magnitude here. So, you know, we could break it down into things that need to be measured in nanoseconds, in microseconds, in milliseconds, and seconds. And, you know, that's, I think, probably – after that, I think, you know, the latency of something – maybe, I mean, maybe you still can measure it in minutes, right? I mean, let's think about like you know publishing a video on YouTube.

08:28

I've just updated my video on YouTube and I click the button to publish it and it takes how many minutes to transcode it to all the different formats that YouTube does before it becomes visible. Maybe that is measured in minutes and maybe I do care about that latency.

Ben Rady

08:40

Mm-hmm.

Matt Godbolt

08:41

But I think the thing that I can confidently talk about, and maybe you as well, is the sub-second orders of magnitude latency.

Ben Rady

08:52

Yeah, yeah, yeah.

Matt Godbolt

08:52

And so let's talk about what can you do in that amount of time?

Ben Rady

08:55

Right, right.

Matt Godbolt

08:57

What, you know, that's a good starting point of like, only, you know, if you say I would like my program to be fast without even knowing what your program is, I've got some bounds on what you could reasonably expect to do in hundreds of nanoseconds, hundreds of microseconds, hundreds of milliseconds, right?

Ben Rady

09:14

Mm-hmm. Right, right, right, right, yeah.

Matt Godbolt

09:15

On commodity hardware, if we're just talking about things like a regular PC or a regular, you know, so a decent server in a data center somewhere or a laptop or something like that, they're all within spitting distance of each other.

Ben Rady

09:26

Right. Yeah. So if we think of latency as sort of like the time it between an input is provided and an output is produced,

Matt Godbolt

09:33

Yeah, sounds good.

Ben Rady

09:33

then automatically down in the like tens of nanoseconds range, you're sort of like, what output could you possibly consume or what input could you possibly consume and what output could you possibly produce in that span of time, right? Because it's got to go into the computer and then come out. So like maybe a theoretical example here would be like a robotics application where you have some like active control device of a a system or or you know a set of servos or something like that where you're trying to like, you know, imagine like the, you know, like the robotic dogs type thing where you're just like, you know, taking like some information from like the sensor on ah on ah on a robot and then feeding it back and using that to adjust. Like that's already going to be up probably in the like low milliseconds range, probably.

10:24

I mean, I don't know. I'm not an expert in robotics, but

Matt Godbolt

10:26

I'm not sure either. No, but I mean, you think about things that do like autobalance, or you know, drones, like a still that can't be milliseconds because it's already upside down by the time a millisecond of wind has caught it.

10:37

Right. So they must be a little faster than that. But, you know, maybe we're talking microseconds at that level. It seems not totally unreasonable to have some sensor update and put something on a bus internal to a system on a chip kind of thing that can then go, oh, wait a second, I need to speed up this rotor a little bit more to account for that.

Ben Rady

10:37

Yeah, yeah. Yeah, yeah, yeah, yeah, yeah. Mm-hmm, mm-hmm.

Matt Godbolt

10:55

Right. But you you're absolutely right.

Ben Rady

10:55

Yeah. Cause a lot of those times those outputs can be sort of like targets, right? So they don't, it's not necessarily that it needs to go like all the way out to like a servo or something. It's just kind of like, okay, I've gotten this input and now I need to adjust, you know, my targeted angle here. Right. And there's maybe even some other external system that is like actually making the adjustments to the physical objects, but like you might want that target to update on the order of,

11:21

you know, tens of mics or hundreds of mics.

Matt Godbolt

11:24

Right.

Ben Rady

11:24

Just because when you do, you get sort of like a smoother, better prediction of like where the arm needs to be or something like that.

Matt Godbolt

11:31

Right, right, right, right. And yeah, so I mean, let's to sort of break it down to consumer hardware.

Ben Rady

11:35

Yeah.

Matt Godbolt

11:35

It is difficult to get a message in and out of a computer in, you know, I mean, like you just do a ping, right? You ping a computer on your local network, and it's less than a millisecond, but it's, you know, still hundreds and hundreds of microseconds probably to go through commodity hardware.

Ben Rady

11:52

Yeah.

Matt Godbolt

11:53

And, you know, for the kind of weird stuff that you and I have been involved with, there is some very specialist hardware that happens, which we can't really talk about but on the far end into the nanosecond range of things. So I suppose if we're talking about

12:03

Things that most folks could reasonably get, put their fingers on a keyboard and program, be it a little embedded system like an Arduino or an ARM-y type thing, or indeed a, you know, a program like I'm typing on a computer and I'm expecting, when having pressed a key press to see the key appear in the terminal, that kind of thing.

Ben Rady

12:03

Mm-hmm. Mm-hmm.

Matt Godbolt

12:22

we we've We've kind of picked now our range. Now we're talking probably hundreds of mics to low millis.

Ben Rady

12:31

Mm-hmm. Mm-hmm. So I've got a program.

Matt Godbolt

12:33

So, Yeah, so we're down.

Ben Rady

12:34

It's going to...

Matt Godbolt

12:34

Yeah, that's right. I forgot where we were now. yeah

Ben Rady

12:36

It's gonna read some input delivered from outside the computer, and this is a commodity computer, you know, like, and then it's gonna produce some output that is also going to leave the computer, right?

Matt Godbolt

12:51

Yep. Yeah.

Ben Rady

12:52

And we're thinking that this is gonna have to occur, it's gonna take at least hundreds of mics for that to occur. Is that a fair statement?

Matt Godbolt

12:59

I think so. Without without getting...

Ben Rady

12:59

So that's like the upper bound of how fast we can make it for my definition of fast.

Matt Godbolt

13:03

I think so. I mean, you know, there are exotic techniques and things that you can even use on commodity hardware that is a little bit more exotic, like DPDK, which is a sort of network bypass thing that you can fiddle with.

13:16

And if you put a second network card in your computer, you can go faster. But yeah, I think let's say, yeah, already, whatever you're doing, if you're responding to an external network event, you're doing any kind of reaction to it and sending it back out again, we're in the...

13:30

Microsecond region but probably hundreds of microsecond region would be my instinct and maybe I'm wrong, maybe I've forgotten how, you know, computers are these days. But, or at least like normal computers, right. In normal computers, I mean actually I'm... I was telling you before we started recording I'm surrounded with networking gear and computers and things and AI agents running things for me as I'm setting up. I'm actually upgrading my home network for up to like two and a half gig because that's actually something that is not totally unaffordable for the home these days now and so I'm like, hey, maybe my ideas about what commodity hardware is is out of date but...

Ben Rady

13:30

Yeah. Yeah. Yeah. No one knows how computers actually work. Yeah. Yeah. So my first question would be when I'm writing my program, how do I make sure, because my assumption here is that the input here is unpredictable. It's not like you're gonna start the program and immediately be provided the input. The program is running,

Matt Godbolt

14:24

Mm-hmm. Yeah.

Ben Rady

14:25

There's some thing outside of the computer that creates this input, and then the program has to react and create an output, right? So my first question to you is how do I make sure that there's the minimum possible latency between the generation of that signal and my the code in my program running? Something, something ePoll, right?

Matt Godbolt

14:42

Yeah. Well, that's... No, exactly right. No, well, yes and no. I mean, that goes into the DPDK kind of sort of thing I alluded to. So yeah, you know a traditional server program that is maybe blocked on a socket. So you've got a socket, you've got, you know let's say UDP just to make things maybe easier. I don't know, right? You've got a blocking call to receive, which means that your process has gone to sleep

Ben Rady

15:11

Mm-hmm.

Matt Godbolt

15:12

Told the operating system, hey, anything interesting coming in on this port, this UDP port, I'd like to know, please. And then I've gone to sleep and I've been descheduled. And now I've just been chucked in the list of all the things that are currently, all the processes that are currently asleep and have nothing to do.

15:30

So that's where we are now. Packet comes in, goes through the standard kernel process of saying, well, who's this for? Oh, it's for port 9007. That is registered to this process. So we're going to write it into a buffer now rather than discard it, or rather we keep it probably around in a buffer that exists. First of all, is there space in that buffer? Yes, there is. Okay, this is something you can configure for each socket, how much space that you can have, which gets complicated if you're sharing multiple processes listening on the same.

16:01

It's, yeah. Anyway, and then we go, well, who cares about this, right? We've put it into the buffer now. We've copied the data. Who cares? Oh, there's a process. Okay, well, this process is waiting on this, so we should mark it as ready. This process is now ready to be scheduled. And then... You're done. And then as part of the whole, you're done process, the kernel goes back to, well, I guess I'm now finished, but back to the user process. Now who, who should be run?

16:30

And maybe there was something else that was going on on that CPU when the packet came in and that other process is actually still higher priority and whatever queuing fairness there is, in which case it will merrily do whatever it was doing before until it runs out of the time slice that it had allocated it and then it goes, oh, someone else's go now, oh look this thing's ready. So but what I'm sort of getting at here is there is a lot of things that happen between that packet arriving in your, even in your kernel, and your process being woken up to do anything about it, and that's a hugely variable amount of work too.

Ben Rady

17:01

Mm-hmm. Mm-hmm.

Matt Godbolt

17:02

And you can definitely reduce the amount of work. So the amount of other things that the CPU or the kernel is trying to do on a CPU so that your process is more likely to be woken up. There are schedulers that let you set these things or priorities and stuff, but, The way that I intuitively do these type of things is using some kind of, yeah, like we said, DPDK, which is this sort of networking layer that lets you use shared memory with the network card.

17:37

And then instead of the kernel being involved at all, you just keep continuously reading a place of memory that the network card will write directly to and say, hey, I've got a packet for you. And then you just pick it up. And then essentially your latency goes down to how fast a PCIe bus transaction can notify that a piece of memory has changed and you pick it up and off you go.

Ben Rady

17:57

Mm-hmm.

Matt Godbolt

17:58

Right. But that is still fairly esoteric. So it seems like that's not a fair thing. But I guess it's the difference between... right, if the program part that we've not even started talking about now is itself going to take tens of milliseconds because of the innate amount of work that it has to do.

18:19

And so the upper bound is dominated by its processing time. Then maybe the effort and all the work to put in shaving tens of microseconds off on the like the interrupt handling time isn't worth it, right?

Ben Rady

18:32

Mm-hmm. Mm-hmm.

Matt Godbolt

18:32

It's a lot of engineering effort. It's a lot of complexity. It's a lot of non-standard nonsense. But if we wanted to make something go as fast as possible, those are the kind of things that I would take to doing is to take the kernel out of the equation and burn a CPU core just spinning, doing nothing other than saying, is there anything to do? How about now? How about now? How about now? In a tight loop and potentially reschedule everything else so they can't run on the same CPU as me. There are ways that you can do this with CPU sets with isolation.

19:06

Trickery, isolcpus in a kernel command. There's a lot of weird things you can do. Anyway, right. And then you can get it down to, in the limit, single digit microseconds between a packet arriving and you going, okay, I have work to do.

Ben Rady

19:22

Okay. So if we assume that the work that needs to be done is relatively simple, right? You're going to do some adds, you're going to do some multiplies, you're going to try to structure your code so that you're really only doing adds and multiplies, and then you're going to produce some output that is purely a function of that input and whatever state you may have held in memory.

Matt Godbolt

19:46

All right. So what you're talking about, what we've got is an ALU as a service. You know, we are, have you seen this? There's a guy who's written a microservices based CPU emulator where every single instruction is handled by a different microservice, each of which is written in a different language and the thing runs and it's so... but yeah, isn't it cool?

Ben Rady

20:05

Oh my God, that's amazing. Man,

Matt Godbolt

20:07

I love that people think of these things.

Ben Rady

20:10

you think you were in microservice hell. You should.

Matt Godbolt

20:13

Yeah, you've got like literally 256 different things and a RAM service, you know, memory service so that everyone can agree on this.

Ben Rady

20:20

Oh, that's amazing.

Matt Godbolt

20:21

It's cool. but So you're talking about something like that. Let's just say you know it is...

Ben Rady

20:24

Yeah.

Matt Godbolt

20:24

In fact, honestly, probably a decent... Unless you want to... I'm going to say these and you can tell me if where you're trying to steer it is right or not. But like something like a Redis memory cache is like one of these things.

Ben Rady

20:36

Yeah. Right.

Matt Godbolt

20:36

It's like something comes in, I look it up in RAM and I'm like, yes, I've got it.

Ben Rady

20:39

Sure.

Matt Godbolt

20:39

Here it is. It's like a small piece of data. You know, a key value store that is just, you know, like a cache, right?

Ben Rady

20:46

Yeah.

Matt Godbolt

20:47

You're not doing very much work.

Ben Rady

20:48

Yeah.

Matt Godbolt

20:49

It's not as as a banal as like I'm adding the three numbers you've given to me and giving them back to you.

Ben Rady

20:54

Right.

Matt Godbolt

20:54

It is, that you can imagine there is actual value in that, but it's still, you know.

Ben Rady

20:58

Yeah. I'm trying to draw a line about the the other parts of the computer that you're going to be talking to. And what I'm trying to say by this is that you can't ignore memory, right?

Matt Godbolt

21:08

I see.

Ben Rady

21:08

It's not simply a function of the data that you've read in, but you don't have to go beyond that, right?

Matt Godbolt

21:13

Yep. So, okay. So let's so let's use like ah an in-memory cache, like a yeah Redis is like without the disk backing type stuff.

Ben Rady

21:19

Yeah. Right, right. Mm-hmm.

Matt Godbolt

21:20

Yeah. Yeah, yeah. That's perfect. Okay. So yeah, you're looking, I mean, at that point, there's a, without going into the extremely fun and interesting rabbit hole of how you actually do the lookup, and what data structures you use. But that is probably the key thing is that like, you must consider the data structures you use. But the interesting thing is that with a latency lens,

21:45

You're not looking at data structures in quite the same way as you do when you are doing like CS, whatever. I don't know what numbers that you use, but you know when you're doing your data structures courses and you're like looking at big O notation, like, you know, big O of N, whatever, N log N for your, you know, so like a not unreasonable data structure for you

Ben Rady

22:09

Right. Yeah. Right, right.

Matt Godbolt

22:11

a key value store is something like a balanced red-black tree, right? It's log N to look into it, whatever.

Ben Rady

22:20

Okay.

Matt Godbolt

22:20

Another one is a B tree or a B plus tree or one of those kinds of trees, you know, hash maps are other ones as well. And, you know, obviously hash maps do have in theory O of one. So you'd probably want to pick that out of big O notation or whatever.

Ben Rady

22:33

Right.

Matt Godbolt

22:34

But the other ones seem reasonable as well. Like it's like, but, the dominating factor with something as straightforward as the thing you've just described, where we're literally just looking something up in the key value store, is going to be reading from RAM.

22:46

Now, if you're talking about something that's so small it fits into L1, L2, or L3, which are in the of the order of tens of Ks in the L1 case up to several megabytes at L3, and you know bigger than that on bigger service servers and whatnot, but like we're still talking like not all that much information.

Ben Rady

23:07

Right. can Can you describe for our listeners, he said, covering for his own ignorance, what L1, L2, and L3 actually are?

Matt Godbolt

23:14

Oh, I'm so sorry. So there are layers of caches. Caches are expensive both in terms of the amount of space they take up on a chip and how fast, you know going back to our Grace Hopper, you know the bigger something is, the longer it takes to get to the other side of it. So the way that CPUs

Ben Rady

23:39

Mm-hmm.

Matt Godbolt

23:39

prevent us from having to read from the very slow RAM, which we'll talk about in a second, is that we have on-chip caches, and they are layered so that you have a very small, incredibly fast level one cache, which is usually... not very large, tens of kilobytes kind of area. Then you have a layer two cache, which is maybe a meg, maybe hundreds of K, that kind of area. And then a layer three cache, which is maybe tens of megs, maybe even a little more than that.

24:09

And each of them is slightly further away, like literally physically. They're still on the actual thumb size chip inside the machine. They're further away, but it takes longer to look through them to find the data. And so if you can fit something in L1, it's kind of almost free.

Ben Rady

24:26

Mm-hmm.

Matt Godbolt

24:26

I'm going to put air quotes over this. We're talking somewhere in the region of, you know, five or six cycles to read from L1. And that includes all of the other things that are like the fixed cost of reading from memory includes doing the lookup for the logical address that you've decided to read from and then turning it into the actual physical address that it needs to go to the correct chip, you know, different processes think that address 2000 has different data in it, right?

Ben Rady

24:54

Yeah, yeah, yeah.

Matt Godbolt

24:54

And something has to do that translation. And that's not at this scale. When we're talking about, you know, a three gigahertz machine, a third of a nanosecond is a clock cycle. It's that that look up itself takes time.

Ben Rady

25:06

Right, right.

Matt Godbolt

25:06

And so, you know, we need to be doing something. So that's about the limit of it. And there's some other really complicated things that the poor memory system has to do because of the weird out-of-order engine. But we'll ignore that for now. So five or six cycles for like an L1.

25:19

Now I'm going to forget this off the top of my... should have brought this up beforehand, you know, but like tens of cycles, low tens of cycles for L2, maybe late tens, you know, 80, 90 cycles, maybe L3. No, I don't think it's that much now.

25:30

Someone's, so you know, order of magnitude-y things. It's, you know, still, probably only tens of nanoseconds but then you go out to RAM and it can be hundreds of nanoseconds if you actually need to go out to RAM. So those are the layers. The other thing that's interesting is that usually on chips the L1 and maybe the L2 are unique to the CPU core that you're running on so you just get a copy of your own and then that's yours and no one can do anything with it with the

Ben Rady

25:57

Right. Okay.

Matt Godbolt

25:57

massive asterisk and footnote for older CPUs, but we maybe won't go there. And then the L3 typically is shared amongst the the other cores that are physically on the same chip as you. And so there are some contentions and there are obviously conversations that have to happen between the chips to say, hey, I'm reading that. Oh, I was just writing to it. Oh, okay. Well, we need to make sure that we do this in the right order.

Ben Rady

26:17

Right. Mm-hmm.

Matt Godbolt

26:17

So there's a little bit more complexity there. But yeah, so if we're still talking in single, you know measurable, human, countable numbers of cycles, keeping it inside the L1 or L2 is great. L3 is okay. And if we have to go to RAM, we have to be a bit careful about things. And at that point, two things are important to note about the way caches work.

26:38

One is that the sort of the whole... the whole theory behind a cache is that having read a byte of memory, it's really, really likely you're going to read the bytes that are around that byte of memory.

Ben Rady

26:52

Mm-hmm.

Matt Godbolt

26:53

And so the fundamental unit of work from the memory system on the CPU to the caches and beyond is a sequence of bytes, which is usually 64 or maybe 128 bytes long. And that's kind of like the atom that gets shipped back and forth. And some chips even do things where you say, well, you ask for this block of 64 bytes, I'll just get the next one and maybe the one after that while we're going, because we might as well having just done that. And so this sort of temporal usage pattern is incredibly important for latency because if you're following a linked list, as you may be if you just have a naive hash map, suddenly you're jumping randomly around in memory and the poor system can't make a guess.

Ben Rady

27:32

Right.

Matt Godbolt

27:33

But if you are if you just have a damn array of everything and just go, well, it's expensive, it's an order N operation for me to search through these things, but I'm going to run through them linearly, suddenly it may be more worthwhile from a cache standpoint, especially if a memory, a cache misses hundreds of cycles. Well, I can do a lot of work in a hundred cycles on an out-of-order execution machine that has 12 execution units, right?

27:58

So why don't just do tons of work and hope that that's quicker than going to RAM? So anyway, popping the stack all the way back, to keep the latency as low as possible, you're going to have to think about this.

Ben Rady

28:09

Yeah. Mm-hmm.

Matt Godbolt

28:09

And so you might want to make some concessions where the distribution of your latencies needs to be taken into account. Now, if you're going to say like, well, most of the time, I think I'm going to be... this like 80% of the stuff we're reading is hot. Like it's going to be commonly used again, in which case I'm going to use a data structure, which is fine. And that 80% fits inside. It doesn't matter what I'm doing, whether it's a balanced tree or whatever, it fits inside L3, let's say.

Ben Rady

28:43

Right.

Matt Godbolt

28:44

But the 20% that's outside of it is just outside. It's going to RAM. We're just going to have to take the hit every time. And the fact that it's incredibly expensive because we're using a balanced tree. So I guess now we turn around and say, like, it's not just how fast to go. It's like, what is the shape of your distribution? Is it better to have the average case fast and the bad case awful?

Ben Rady

29:04

Right.

Matt Godbolt

29:05

Or is it better to bound the bad case and say, well, actually, I'm going to use a linear search every single time. And I'm going to say that even the fast one where the linear search, you know, it would have been faster just to check, I don't know, the first one or pack them in a different way. But, you know, I'm paying the cost to do 16 compares,

Ben Rady

29:25

Yeah, yeah, yeah.

Matt Godbolt

29:26

Every time and just go, well, that costs me a little bit more than one.

Ben Rady

29:29

Right.

Matt Godbolt

29:29

Of course it does, but it makes the worst case go away, but at the lowest.

Ben Rady

29:36

Right.

Matt Godbolt

29:36

So yeah, sorry. I've just gibbered around in a big, as I've sort of like realized the enormity of the question you've asked. Yeah. Yeah.

Ben Rady

29:42

that was That was the intention of the question. But yes, whenever you're talking about latency, you need to not, it's not like, oh, it's 10 microseconds of latency.

Matt Godbolt

29:51

Yeah.

Ben Rady

29:51

It's no, it's 10 microseconds of latency in what case? What percentage of the time? You know, oftentimes you talk about like P95, P99, P99.999 numbers. Of like what percentage of the responses are going to be less than some particular target time. And you may have multiple layers of that, right?

Matt Godbolt

30:01

Yeah.

Ben Rady

30:11

Like you may have a maximum threshold that you need and like a P99.99 and then like a P95 and then like a median, right? And you might want to like structure things you know differently based on whatever those targets might be. So in our case, sort of this hypothetical, you know, robot dog type plugged into a PC perhaps that's receiving the signal. Maybe what we'd want to say here as one fork of the tree to go down is actually the thing that I care about is the worst case latency, right?

30:49

I want the... because if the latency gets too high, the dog is going to fall over and that will be bad, right?

Matt Godbolt

30:54

Right. That's a very common thing for like, yeah, control boards and things.

Ben Rady

30:59

Mm-hmm.

Matt Godbolt

30:59

And you know this is why in those situations, typically embedded systems who have that kind of very strong reliability requirements will use not regular operating systems. They're using you know real-time operating systems that can give you hard deadlines for things happening within a certain amount of time and or do no preemption at all.

Ben Rady

31:19

Mm-hmm.

Matt Godbolt

31:19

So it's not like a case that your your process will have its CPU time taken away from It's like, you know, it's your responsibility to to keep running until you get to the end and then yield to the next guy so that you can... But and this is a huge problem. As computers have gotten more complicated, it becomes harder and harder to reason about what an upper bound could be. You know, picking up my 6502 in its box over here.

Ben Rady

31:45

Right. It's a little easier.

Matt Godbolt

31:45

You know, it's a little easier. You know, it's it's like I can tell you exactly how long any single thing is going to take all the time.

Ben Rady

31:52

Yeah.

Matt Godbolt

31:52

It's completely deterministic and it's well documented.

Ben Rady

31:56

Right.

Matt Godbolt

31:56

But we have to reverse engineer what the hell an Intel chip is doing in order to understand... Sort of maybe what the worst case is in that particular situation, right? You know, hey, what happens if, you know, we are trying, we have an L3 cache miss at the same time as another CPU core, like a physically distant core has an L3 cache, both go for it at the same time.

Ben Rady

32:02

Essentially. Mm-hmm.

Matt Godbolt

32:20

How does that work? Is there some really weird edge case where, you know, in the fabric, it times out after some amount of time, I don't know, and it's hard to find out, but if I need to give you an actual, like, someone will get run over if the brakes do not come on when I put my foot on the brake, then maybe I can't, I certainly can't think about the world the same way, but I certainly can't use certain, like, CPUs and designs and operating systems like that.

Ben Rady

32:45

Mm-hmm. Mm-hmm.

Matt Godbolt

32:45

So that's, but, it's interesting.

Ben Rady

32:46

Yeah. So if I come to you and I'm asking this question of how do I make this program fast? And you go through all of these follow-up questions and what it turns out that I'm asking for here is how do I make sure that the round trip latency of this program that I have running on a just regular commodity PC is not going to ever be higher than 10 microseconds? The answer is you can't.

Matt Godbolt

33:14

I think that's a fair... Yeah, that the nearest people I can think of that have to deal with that specific problem are people who make digital audio workstations for you know consumer PCs because they they have... A genuine latency bound because if you're playing a keyboard and there's a one second delay, that's not, between you're pressing the button and hearing the noise change.

Ben Rady

33:40

Yeah, right. Yeah. Yeah. Yeah.

Matt Godbolt

33:40

That is not good for anybody.

Ben Rady

33:42

Yeah.

Matt Godbolt

33:43

And so the only way that you can get that is to make everything as low latency as possible. But if you go too far and you don't have enough buffering between you and the set of audio that's coming out, then if you, for whatever reason, can't keep up, you're going to get a massive pop or a crack or a click as suddenly you have to catch up and just drop a whole bunch of samples that you were generating.

Ben Rady

34:06

Yeah, yeah, yeah.

Matt Godbolt

34:06

So that's, I mean, I don't know how they do it, honestly. In fact, I do know people who have worked on them and like we should perhaps consider getting them on to talk about this kind of stuff because it's a fascinating problem. But yeah, anyway, but back to your...

Ben Rady

34:23

Yeah. So the result of this, you can't, so now it's like, okay, yeah, right.

Matt Godbolt

34:29

Yeah, you can't, yeah. Or it's really, really, really hard.

Ben Rady

34:31

It's like, you're putting constraints on this problem that you can't really solve for. For example, you'd probably be better off using some specialized hardware rather than trying to do this on a commodity machine. If you truly have a constraint that, you know, of, of let's just say 10 microseconds or something like that.

Matt Godbolt

34:47

This is to an extent, I mean, obviously, microcontrollers have been around longer than computers because they've been in things or that's the original design.

Ben Rady

34:54

Mm-hmm.

Matt Godbolt

34:54

But that's that is why the brake disc controller in my car is not controlled by the one PC that is like running the the AV display in the middle panel, right?

Ben Rady

35:05

Right. Yes.

Matt Godbolt

35:05

It is like a very specialist piece of software running on a very specialist operating system, which probably isn't an operating system by anyone else's definition of operating system.

Ben Rady

35:15

Right. Right. Right.

Matt Godbolt

35:15

It's just a library you link against and then sets up the interrupt vector and hands you a few nice abstractions, good luck.

Ben Rady

35:22

Yeah. Yeah.

Matt Godbolt

35:23

But that's why it's doing nothing but monitoring the disc because then finally you can potentially say, all right, we can upper bound this stuff. We do know what the maximum latency could be, and it is, and I can sit and work it out on paper because it's an ARM, this is an in-order ARM and it only has an L1 cache and even if it's whatever, you know, there are things that you can say about it and you can bound it that way.

Ben Rady

35:47

Mm-hmm. Mm-hmm. Okay. So I've got, we've got our inputs. Maybe we've had some discussions about our constraints here and we're like, all right, well, we can't do this to a maximum level of 10 microseconds.

Matt Godbolt

36:00

Mm-hmm.

Ben Rady

36:01

That's insane. We could maybe do something where our like P95's, maybe less than that, more than that.

Matt Godbolt

36:07

Let's just say, I mean, yeah, you and I, we don't know.

Ben Rady

36:11

Let's just something like that. Yeah. And then,

Matt Godbolt

36:13

We're making this up as we go along as if it wasn't...

Ben Rady

36:15

Yeah, and then and then our maximum is something much more reasonable like a second, right?

Matt Godbolt

36:20

Okay, yeah.

Ben Rady

36:20

It's like, all right, it's never going to take more than a second. If it does, then we have a bug and we should fix it, right? As opposed to, yeah, that's how the system works, right? So now we've gotten our message from the outside world. We've gone and looked up the value that we've cached. Hopefully that's an L1 cache and hopefully we've done some smart things to make sure that it's there. But you know again, this is why we have these ranges of latency.

36:49

How do we write out the response? Do we have the same sort of operating system packet issue when we're trying to write this thing back out?

Matt Godbolt

36:58

I mean, yes. If we're talking about standard, like, I mean, we keep changing what it is we're doing. Is it a dog that's falling over? Is it a UDP packet going into a...

Ben Rady

37:05

Yeah, yeah, yeah.

Matt Godbolt

37:05

Yeah. But, you know, like, if we're going to send a packet back out again, then yes, there's a similar dance on the way out through a traditional operating system where you write to a bit of memory that you can read and write in user space, and then you say, hey, send it, please.

Ben Rady

37:19

Mm-hmm.

Matt Godbolt

37:19

And and the kernel goes, well, I need to send it. I can't hand this bit of memory to... the network card because this bit of memory is just wherever you decided to put it. So maybe on the stack, whatever. And certainly the DMA engine for your network card does not have carte blanche to read any piece of memory anywhere in the whole system for all the obvious reasons of like separation of concerns and to do with some of the ways that they work internally.

Ben Rady

37:46

Mm-hmm.

Matt Godbolt

37:47

So the kernel has to copy it somewhere else in a buffer where it knows it could hand the address of that to a network card. And then it returns, potentially returns back to you and says, yes, it's sent, even though it isn't. It's sat in a buffer. Time will pass. Eventually, the kernel will decide to schedule sending it. I mean, probably it will actually send it immediately, all things considered, unless the network card's already busy doing something else. You know, there's these...

38:10

And then the network card will be notified through some mechanism. And that's relatively expensive and requires a ping across the network. Sorry, I say a network. It is a network. The PCIe bus itself, or however your network card is plugged into your computer, is effectively yet another network.

Ben Rady

38:26

Mm-hmm.

Matt Godbolt

38:26

And there's another hop and there are... complicated things going on there. So the message is addressed to it. The network card goes, oh, you say you have a packet for me. How quaint, how lovely. Okay, I'll go and get the address from the ring buffer. Here it is. Okay, and I'll start streaming it out. And so that process can be

Ben Rady

38:41

Mm-hmm.

Matt Godbolt

38:41

relatively latent too. And so again, you're less likely to suffer from the scheduling randomness that the kernel does, because it's not like something asynchronous happened and now the kernel has to pick you being amongst all the things that it's next to do.

Ben Rady

39:00

Mm-hmm.

Matt Godbolt

39:00

It's like, now you've given it a piece of work and it's very obvious what it should be doing with that work.

Ben Rady

39:04

Uh-huh.

Matt Godbolt

39:04

But there still is some redundant copying of data. There is still some messaging to an asynchronous process, which is the network card itself.

Ben Rady

39:14

Again?

Matt Godbolt

39:14

The network card can be like, yes, sure, fine. And then it could be busy doing something for all we know, right? You know, I don't know what what magic network cards are doing. Maybe it's, you know, a virtual network card that doesn't even exist as a physical thing.

39:25

And really it's pretending to be a network card, but in fact it is, in a data center on a shared piece of infrastructure. And what you're really doing is talking to the hypervisor, which then goes, oh, cool, a packet from one of my many virtual machines.

Ben Rady

39:40

Hypervised. Yeah, right.

Matt Godbolt

39:41

Yeah, hypervised children, right.

Ben Rady

39:42

So I'm kind of, I was maybe teeing this up a little bit, but this is like somebody comes along and says, hey, Ben, you made this program to control this robot and you did it in a very silly way. Why are you using a UDP network? That is not necessary.

Matt Godbolt

39:54

Yeah. Right.

Ben Rady

39:58

Are there other devices that one could use, I mean, like a USB-C connection or something else that would be available to, you know, just a regular consumer, commercial computer, something.

Matt Godbolt

40:12

Yeah, that's it. I don't know too much about the ins and outs of how the kernel interacts with the USB. I know that the underlying USB protocol and the thing you talk with, the hub or whoever you're talking to on the end of it is sort of a negotiated thing. You can ask for, you negotiate bandwidth, you negotiate...

40:31

I don't think you negotiate latency. I think you can negotiate isochronous, that is repeating things. So like if you've got a webcam plugged in, it can book, hey, I need to send this much data just all the time.

40:45

And it gets sort of time on the USB bus to be able to send packets over that data. But I don't know about, like, latency. And obviously things like you know USB keyboards and things and mice. And you know gamers obviously use USB mice and whatever, but we're still talking almost human level levels of latency there, you know millisecond, sub-millisecond, I'm sure.

Ben Rady

41:04

Right, Right, right.

Matt Godbolt

41:04

So yeah. I don't really know. My best guess I would be able to have is not to say something like USB, because again, there are some layers of indirection between you and the actual device, even if you're talking the USB protocol through the kernel somehow. But if you had like either, you know, GPIO on a, which is general purpose IO pins on one of the more microcontrollery type systems where you can literally just say, if I write to this memory address, it's not really a memory address.

Ben Rady

41:36

Mm-hmm.

Matt Godbolt

41:36

It is whatever data I put there is the highs and lows of these eight pins, right?

Ben Rady

41:40

Yeah, yeah, yeah. Right.

Matt Godbolt

41:40

And at that point, you're like, yeah, you're off to the races, right? Or, you know, say a serial interface, which although serial is incredibly slow and awful, it is also something where you're kind of directly attached to and you can say, like, I'm just driving it myself. Now, I'm sure that's no longer true. Yeah.

41:56

Now I say it out loud, I'm sure that it's virtualized through some mechanism in the kernel. And so it's no longer the case if you open /dev/tty0 that like if you get the OK, you are the only person talking to that now.

Ben Rady

42:07

Mm-hmm.

Matt Godbolt

42:08

And still, it would would suffer from that. If you read from it, then you would block and the kernel would put you to sleep until a byte came in.

Ben Rady

42:15

Right.

Matt Godbolt

42:15

So even that doesn't make sense. So GPIO is the nearest thing we have to the sit in a tight loop and wait kind of approach that we were talking about before, you know, for both input and output, you could say like, hey, you know, I'm doing a quiz show, whoever presses the button first kind of thing. And there are eight buttons and they go into the eight bits inside my controller.

42:38

And I just sit in a tight loop reading from it. Obviously there are better hardware ways of doing that, but I'm just sort of thinking out loud. Yeah. Yeah, I don't know. Did you have an idea?

Ben Rady

42:38

Yeah, yeah, yeah, yeah.

Matt Godbolt

42:49

You've been driving this in a direction, and I'm wondering if you have a solution in mind that I have not got to.

Ben Rady

42:54

No, no, no, I mean

Matt Godbolt

42:55

By the way, have I got the job? Or oh... is it Is it too late?

Ben Rady

43:02

Well, you passed to the next round, but I'm going to have to, yeah, I know.

Matt Godbolt

43:05

Oh, okay. We'll have to see. We're now 40 minutes in, so yeah.

Ben Rady

43:08

Do you have any questions for me?

Matt Godbolt

43:09

Do I not have questions for you? We've done far too many interviews, haven't we?

Ben Rady

43:15

Yeah, yeah, yeah, yeah, yeah.

Matt Godbolt

43:18

Yeah.

Ben Rady

43:18

I mean, I think that, I mean, we made the full kind of round trip, and I think that's good. I think the other, and if there is a part two to this, I think the part two should be, okay, now what happens when it's the actual like calculation that is the part that needs to quote unquote go fast, right?

Matt Godbolt

43:34

Oh, yeah, yeah, yeah, yeah.

Ben Rady

43:34

Like like the IO no longer dominates as it did in this example. And it's basically just this exercise. And how do you take this signal, make sure that you don't accidentally make it slow when it's being processed in the CPU and then get a thing back out.

Matt Godbolt

43:47

That's an interesting one. Yeah, we should.

Ben Rady

43:49

Now it's, okay, this is gonna take a material amount of time and we're trying to make it as fast as we can.

Matt Godbolt

43:52

All right. Yeah, which is probably what most folks think of when you say, can I make my code go fast, is the code part, not all the other stuff around the outside, but often that can dominate. Yeah, no, that would be an interesting one.

Ben Rady

44:03

Right.

Matt Godbolt

44:03

So yeah, okay, we'll... well and then we've got to talk about throughput at some point, and maybe that will leg into throughput, because, you know, I think there is part of it, but we'll see.

Ben Rady

44:13

Yes. Yeah. Mm-hmm.

Matt Godbolt

44:13

Well, all right.

Ben Rady

44:13

Maybe.

Matt Godbolt

44:14

We rather mysteriously then, and now you and I have to look each other in the eye and go, we will remember that the next episode we record will be the continuation of this one.

Ben Rady

44:23

Yeah.

Matt Godbolt

44:23

And we won't just leave our poor audience hanging.

Ben Rady

44:25

Yes. Right.

Matt Godbolt

44:26

So I apologize in advance if we, if we do in fact, but I think this has been very interesting.

Ben Rady

44:31

Right. Yeah, that's good.

Matt Godbolt

44:33

Hopefully our listener agrees and we will, I will see you next time, my friend.

Ben Rady

44:38

Next time.

Matt Godbolt

44:45

And cut.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript