Hey, Ben.
Hey Matt,
How you doing?
I'm doing great.
Awesome. Well, I've been thinking a lot about performance recently. It's not something I've, uh, been, uh, worrying about for a while. I've been doing more functional stuff, but now we're sort of trying to think about how best to measure track and, uh, improve of the performance of code I'm working on. And I wondered, you know, we should just talk about it because frankly, that's what we do on this podcast is just talk about the things we're we're doing in our day job in some way or other or alude to them.
Them, thinking about the exact same thing, actually. So this is, this is good. This is good. Yeah. Make it right. And then make it fast. Well, you make it fast part
Measure. Well, that's I think that's the, that's the, the key thing I think, you know, I think if there's, and I think we've mentioned this before, one of the key things is always measure, never trust your instincts because no matter how many years you've been doing this, and I've been doing this for nearly three decades on and off, you're always wrong. And even when you account for the fact that you know that you are always wrong, you're like, well, I know I'm always wrong. So it won't be A, it must be the B thing. And then you measure it. And it's some other thing, right? I there's a, I think I've got a present. Is this on my YouTube? I don't know. Um, but there's a presentation I give when I go around to universities, when we're trying to like, uh, encourage, uh, graduates, newly new graduate folks to apply.
And I do a, like a deep dive on like a, uh, a C++ optimization problem. And I set it up. And the funny thing is I set up for myself thinking, haha, everyone will expect A, but it will be B. And then I ran it. Of course it was neither of those things. It was something I'd never even heard of suddenly rearing its head. I'm like, this is amazing. This is exactly what I wanna do. So that's what the presentations about. And so always measure. I think we all know that and you know, profiling is, is one thing, but like what, how do we, you know, what, how what's your general approach to, to performance?
I mean, so like, I think, I think, you know, we hit on this a little bit already, but I think you gotta really start with the basics, which is, is, um, you know, don't be superstitious, right? you, you, you,
what do you mean by that? I mean, I'd say that's intriguing, right?
Like, I mean the, the superstitions, I there's so many of 'em you hear them all the time. It's like, oh, you can't use this structure or this call or this function cuz it's too slow. Oh, right, right. That's kind of stuff. Right. Um, and it's like, all of that stuff is so sometimes it's true, but it's, it's always, it's always context dependent. So it's like, if it was true, it was only cuz you got lucky
that sort of thing. Right, right, right. Um, so, so, so don't be superstitious. Right? and part of that and, and so like I think part of the reason people, um, fall into that trap is they're like, well, but if I don't like learn what things are, are good, what things are fast? What things are slow, which things are memory efficient, which things aren't. If I don't learn these things, how am I gonna write any code without performance testing, every single line of it as I write it. Right. Right. And I, and I think that is where you get the very old, you know, gray beard programmer wisdom of first make it right. And then make it fast because the first thing in measuring is figuring out which parts of the code actually matter because a lot of them just don't right.
That's true.
If you're doing sort of superstitious memory tricks in your command line, parsing arg parsing, it doesn't matter.
I'm with you. That's now I get what you mean by that. Yeah. Yeah. I would to sort of take the other side of that just a little bit. There are some things that are in particular in C++ us that you could do quote, wrong, um, forever. And it doesn't really matter. But if you've written an entire application where you've passed everything as a string by value or whatever, that you profile it and it's like, why is my application slow? It's not slow anywhere in particular. It's just slow everywhere, because you've done something that's kind of harmed you in the approach you've taken. And it can be hard to unpick that after the effect, even if it's right.
Right. Absolutely. Absolutely. So those, so in those cases, I think, I think the key there is you still need to measure and something that I do a lot with these kinds of things, um, this is like a classic example of this is let's say that you have, I'm gonna just pick on Java for a minute. Let's say that you're
Other languages are available for picking on.
Other languages are, and this technique would work in any of those languages. But let's say that you've got some, some data structure, uh, in the JDK that you want to consider using some list implementation or something like that. And you're like, I want to use, I wanna go and write my own version of this instead of using the JDK one, because my own version will be faster. Right. Let's just say you want to make that claim.
We've all, we've all made that mistake before now. X is slow. I'll just write my own. And then very quickly you discover, oh, it's harder than I thought. Or, whatever, Sorry. Continue.
Yes, yes, yes. . So if you're gonna make that very bold claim, right. I would submit that what you need to do is build a, build a performance test that shows the performance of the built-in thing, the performance of your new fancy thing at the most basic level, make sure that like when you run that once you're actually correct that it is faster in that context, but also keep that around and have it part of your pipeline. So that as you know, in again, case of Java, upgrade your JDK or run a different architecture or change some Java heap setting or whatever it is you have, but that holds true in all situations. Cuz guess what? The guys that wrote the JDK thought a lot about all of those different situations and you maybe have not. Right. Right.
Yeah. So writing those tests that that's, that's probably the real you difficult thing, right?
Yes. It is difficult.
In my experience. I mean, if we're picking, you know, you picked Java, Java has some extra fun and games on top of like what I would normally worry about when I'm writing benchmarks or similar tests that are trying to do performance in, in as much as, you know, the J the, the, the JVM and the JIT are gonna come along and like optimize and give you even more of a hard time than like your equivalent, um, in, in other compiled languages, although they have their own problems too. Right? Like, and the number of times that one's written a benchmark around a C++ thing only to discover that because you passed in constant values to your test, the compiler has just gone. You know what answer 12 we're done here.
no, there's no code runs at all. Right? Yep. So there are definitely challenges in this. So.
yes, definitely.
How does, and, you know, micro benchmarks are a great at showing very specific cases, but they're often not necessarily very representative of your program at large, there are macro, uh, effects sometimes, you know, especially if you're like, I can be more cash size efficient with my implementation, then you might discover that actually, that all goes out the window when you're actually running in a program where someone's memcopying two gig of data around before they call your function or whatever, you know, there's a lot of things to think about. So how, how do you even go about writing that apparently straightforward? Just write the test of my list versus the, the...
Right. Well, no, it's not, it's not straightforward it, and that, that was sort of my point is that if you, you get the easy stuff outta the way, don't be superstitious, right? Yeah. Like measure these things, don't assume that you're super smarter than all the people that wrote the standard library. Right. Like, don't do that. But like now if you get all the easy stuff outta the way now you're to the actual hard part,
The meat of the problem.
The meat of the problem, which is okay, you're gonna write the benchmark that shows that your, your super fancy list implementation is faster. How do you do it in a way that's really real right.
Fair as well. Because like, you know, it's like, Hey, I tested it and it's faster. Good. Don't, don't, don't worry about it. What happens if you pass null values? Oh yeah. Don't worry about that. I, there some special cases thing. Right,
Right. So you, yeah. So to me that is, that is the tricky problem. Right. Um, one of the things that I was just sort of, we were talking about this, uh, I think a month or so ago. So I was kind of wondering, it's like, I wonder if you could use, um, and this is of course on brand for me, but it's like, I wonder if you could use some of the generative testing frameworks right. To generate. Cause you're just talking about like, you know, null value, use constant values, stuff like that. And you can also generate random data. Right. But, but I was kind of wondering, it's like, you know, generative testing frameworks have these things where it's like you define, you know, a function that has, like, we can create a range of properties, can create different types for you. And it's like, you could hand roll all that stuff yourself. Um, but maybe you, could you use some of those to generate some data for, for performance testing
For your performance testing. Yeah.
And it wouldn't be totally bonkers. I don't know.
I mean, I've certainly, I've written stuff before for like maps. When I we've been doing stuff for the market data, there's a lot of map accesses in, in our, in the finance world of adding and removing and looking things up by giant, uh, 64 bit key. And if you're trying to do things fast, there's a hundred different ways you can skin, uh, skin that. And I, yeah, I've ended up generating effectively add, modify, remove loops in the generative thing and they have to balance, you know, that for every ad, there must be a remove at some point, but there could be some modifies in between. So you write this sort of, as you say, generative approach, and you're like, okay, here's a hundred thousand operations that I'm gonna stream through, and I can randomize them with a seed so that I know they're going to be the same each time to be somewhat fair.
And then I can run it through map choice one, map choice two, map choice three, and get some idea that it's representative. And maybe you could even measure your, uh, your, your actual workload and say, typically we have more modifies than we have anything else, or maybe have adds and removes or, you know, or, or, you know, get some kind of smell about the data that tells you that, that, that, that it looks representative, but yeah, to your point generative stuff. And if you can, if it's not as, as, as complicated and path dependency, as, as, as, as a long int map where you're adding and removing, if it's just, you know, like, Hey, I've written my own square root routine, then of course, then you can throw random numbers at it, or you can throw ascending numbers at it, or you can throw all those things at it, I suppose. But so that sort of covers some of the, the tricks and problems, um, of like generating representative data. Yeah.
I, I haven't actually tried, I have to put a disclaimer out there though. I haven't actually tried, I haven't gone and gotten a generative testing framework and like plugged it into a performance test
And give me the name of one of 'em I'm forgetting there's there's um, oh,
Well there's quick, quick check, which is the original one. And then there's all the quick check derivatives. Um, uh,
Sure. It was just quick check would be enough for me to remember. I knew that's what I was, I was thinking of. I couldn't,
I'm trying to remember some of the specific names, but they're all, a lot of them are quick check derivatives. So it's like, you know, it's probably J check in Java, I imagine pycheck in....
C++ check or something somewhere. I'm sure. Rings a bell anyway. Yeah, no, that's cool. So that kind of somewhat covers that. I'm sure it's a huge topic. You and I have got a little of experience in this, but maybe not as much, but then, then the thing that immediately springs to mind when I'm doing this, these kinds of performance things, particularly in, in my world where I have the luxury of knowing what kind of computer system I'm gonna be deploying my code onto. In fact, sometimes I get to choose what system it is. I get to pick the CPU and how much cache all that kind of stuff. But my development workstation may not be the same as that. And my development workstation is running a thousand other processes, all the Javas in the world for all my IDEs, my music player, it's got Chrome and 300 tabs open. And that's taking all the CPU in the world for reasons I don't know, or I mean a video conference with somebody and like,
Mining, cryptocurrency somewhere.
Yes, I, yeah.
Or a web page page that you're visiting is mining cryptocurrency
More likely, More likely as it happens. Yeah. Yes. Um, and so obviously it's not necessarily representative and certainly another sort of key piece of advice when people talk to me about like performance testing, other than, you know, may measure everything is measure what, on a system that is actually representative of where you're gonna be running your code. because it's so easy to sit there and kind of go, oh, I pick this map. It's perfect. It does the work really well. And then you discover that well, on your, it works fine on your system, but with four times as much cash in three numa nodes or whatever. Three numa nodes? I dunno,
Powers two only please.
Powers of two. Yeah, exactly. Yes. That, um, that it doesn't work as well or something else is more efficient because of it. And obviously this is very micro optimization based, and many people won't be worrying about this in general, but it's something to be aware of. Right? Yeah. We've all, we've all run something and gone, oh, that seems a lot slower. And then you run it again. You're like, oh, it's packed to normal. That's fine. And sometimes that's an interesting thing to know because you're like, well, why is it so variable? In other words, it's like, well, yeah, I was, I was, uh, mining cryptocurrency at the same time or, you know, I, I, something else happened in my computer. So what, what all can one do to, to help in that, in that regard. Do you think?
Oh man. I mean, I think a lot of this ties into how frequently you measure and check these things, what you would want to do, what you want to do. I feel like is you have it as part of your regular continuous deployment pipeline, right? Like right. Every time I make a change to the code, I want to run the unit test to make sure that the behavior is correct. Then I would wanna run some kind of performance test to make sure the performance hasn't changed. But that second part is way harder. Yeah. Than, you know, when I call my square root function, does it give me the right result? Right. Um, so it's, it's really tricky. I honestly can say, like, I know some ways that to do this, that don't work
This shouldn't take more than 10 seconds time out on a shared build server. That's running in three layers of Docker and 12 different hypervisors, and whatever.
What I mean exactly. Right, right, right, right. Right. So, so like those things I've ne I mean, maybe someone somewhere can say that that works for them. It has never worked for me. Um, and I, and I, every once in a while, I'll try it again, just to make sure that the world,
just to make sure it still doesn't work. I think the world does not move on.
Yeah. And still doesn't work. Yeah. Um, so there's that I haven't really looked at, um, sort of other ways to measure, you know, performance, like something, you know, we were talking about today is like, yeah. Measuring, measuring, like instructions, like yeah. How many instructions did this take the last time I ran it? How many is it taking now? And maybe why is it different? Yes. But I've never actually done that. So I don't know...
I have a friend who, who, who uses that approach, at least as a, as, as, as at least a starting point. I mean, there are some simulators that you can get that you can run your code through that will give you like a fairly decent approximation as well as actually measuring the instructions. Obviously, if you measure the actual instructions, do you put some amount of noise from the measurement system at self? Um, but if you're using like effectively, uh, an emulator to, to cycle level, run through your program, and it's fast enough to actually run, you can say, well, how many, how many instructions did you get processed? How many times did you access the cache? That kind of thing. Now, again, that simulation may not re represent reality, but it's something that is measurable and deterministic, and that is kind of a good thing to have.
So that's an option. Yeah. Um, another thing that I've done in this world is had a special build agent that runs on just that one machine that's only there to, to service your request, or it runs at three in the morning and you're like pretty much guarantee that nothing else is running on that, that agent and, um, rather an erroring on it, you just graph it. And that does require human persistence to kind of look over and say, oh, wait a second. When did that spike up? Right. And, you know, maybe you could use some kind of, um, anomaly detection type stuff. To, to say, Hey, it, the, you know, the last four re readings have been within this band and suddenly it's outside of it, um, in both directions. Right. You know, sometimes the, the, the "too good to be true" error is worth it. Like, Hey, suddenly this went faster. What happened? Are we upgraded the compiler. Ah, right. Now it's just not, it's the, the compiler is too smart for us now. It's not actually testing anything useful anymore.
Yeah. It's not testing anything useful anymore. So, yeah. I, I almost wonder if some of the, and you know, we've definitely been dealing with some of these things that work is sort of like the, uh, comparing test one test run to another, um, and, and, and looking for, you know, like deviation. Right, right. Um, so it's like, you know, manually looking at it in a graph is one thing, but if you want to get some amount of automation around it, like, I can imagine like a continuous deployment pipeline where it's like, if the, if the performance characteristics of this are radically different, it doesn't just automatically go to production. Someone actually has to go and bless it and be like, okay, yeah, this is fine. Like, this is fine. Like, we understand that it's gonna be 10 times slower, or like you say 10 times faster. Right. Cuz maybe your metrics
Hopefully the latter, for good reason. Yeah. Not for the bad reason
We were expecting it to be 10 times faster. We did other tests to confirm and it was great. Awesome. Yeah.
I guess the other thing about like graphing that I sort of implied in this as well, is that the noise level can, you can sustain higher levels of noise. In a graphing based thing, because you're naturally gonna say, well it took 0.7 seconds on average to do this thing. Now it's gonna 0.78. Now it's down to 0.69, and it kind of fluctuates around and you can kind of squint and look at it and kind of go, well, this is within what we'd expect. And maybe every now, and there's an anomaly and like rerun it. I know it's not a great answer. It's very manual. It's very labor intensive, but it allows you to see a trend more so than, um, than, than, than have some kind of tight bound that you set and say, well, it just should definitely take no more than 0.9 seconds to do this operation, whatever that is.
How much overlap have you found between the, that you would use to like automatically check these things and the like actual, like, you know, crack open the toolbox, get the wrench, get the screwdriver out for like doing the optimization. Right?
Interesting, yeah.
Because cause there, I think there's some areas where they overlap, but there's definitely some areas where they wouldn't.
Yeah. There's I mean, so there's the first thing that springs to mind is a lot of the, the, the tools that one would use that are very, as you say, specific to tools that you could add to the toolbox, you know, count the number of CPU instructions, retire, count the branch prediction, mismatches and things like that. Those do require, um, a system where you have access to those things. And there are normally layers of dockerism and virtualization that prevent you from actually accessing the CPU's functionality because you're sharing it, not just with other C processes on, on your operating system, but actual other operating systems probably on your build agent compared to I'm my keyboard or my actual computer. And, and I've got the freedom of, of pseudo installing and like monkey with stuff. So that that's, there's a sort of practical issue there, but that's not necessarily problematic.
You can configure these things. It's just, I often had to badger various folks who are, uh, uh, So again, they could maybe graph them. So I guess you could use them, but then once you get down into the minutia, it doesn't necessarily nothing springs to mind as what you could, um, extract. Um, because I dunno the thing about these tools when I use 'em, it's like it's a journey into the unknown, it's an exploration into the literal unknown.
Yes, exactly!
And again, like, so like the, the presentation I give where I'm like, where on earth are we spending all of our time in this simple function and it's like, oh, it's in the locale system in C++ you're like, I didn't even know there was one, how did this happen? Where are we, why are we here? And so time sort of like captured that in an automated way doesn't necessarily. Right. But obviously metrics you can get out might be the same ones that,
Yeah. Yeah. I, I, I, I think it's the difference between detecting that there's problem and then figuring out what the actual problem is. And those, those are two, sometimes there's overlap between those two things and it can make tooling simpler to use, but for the most part they're, I feel like they're very separate activities.
Right. There are some, I mean, again, the broad phase stuff. So the, the top level stuff, if I'm literally just gonna do perf uh, um, stat on a thing and just get back, like the things that like by default, which is, you know, how many CPUs, um, uh, how many CPU instructions were retired, how many were issued, you know, there's about a half dozen stats and some things that are inferred from it. And that kind of gives you some idea about it. There's also, um, an approach that I think Intel first gave out. And this is very Intel centric. Again, for folks who are listening into our, um, maybe not so of, but like CPUs have in general have multiple like stages in them. And there's a sort of front end stage, which does fetching of the instructions that need to be, uh, done, which is, you know, like, um, executed, which is branch pretty prediction feeds into that.
And the, all the badness of badness inaccurate branch prediction, branch misprediction, I suppose. Um, there's the reading of the bytes that are actually the instructions, which is, you know, cache and memory. There's the decoding of the instructions and turning them into the actual thing that the CPU can execute because nothing really executes the bytes directly anymore. So that's the front end. Once those, those things have come out, they go into like a, a, a box of things that I know I need to do. And then there's some out of order, magical stuff that happens. And then, you know, like maybe you've got only one divider on your chip and now you've done lots of divides. And now you've got like contentional divides that sort sort of in the back end. And then there's retirement where, where it's like now the end of the day, all the instructions have to come out.
So there's, there's this approach that you can kind of hierarchically zoom in on the bit that's actually the bottleneck. And so the first couple of levels of that might be something you could run automatically. And note when something changes like, Hey, this code ran before, and it was absolutely limited by the front end and we couldn't get instructions out of memory, into the CPU, fast enough to sustain the, the whatever workload you gave it. But Hey, now it's suddenly switched now, suddenly your bottleneck somewhere else. And that's an event you can say, and I don't know why or what, but I can perhaps flag that up as being something that happened. I don't know how, how, how feasible that is to actually do automatically, but it's, it strikes me as suddenly, like if someone handed me a product that did that, I'd be, yeah. I guess I'd use that. Just run it on my, my little, uh, test case and see, but yeah, I don't know. It's, it's a fun thing.
Yeah. What about like, um, visualization? Like I know that we've, we've used a few different kinds of tools for like visualizing like flame graphs and like, you know, always get the, like the tree of calls, right. That you, you know, click down and sort of see. What tools have you used for that?
Brendan Greggs, flame graph thing is, is, is just like the best, one of the best visualizations. But then, um, the, if you've ever brought up Chrome and like done a recording in Chrome, like JavaScript, uh, thing, um, Google's, um, viewer that, that, uh, shows like a timeline and gives you sort of a combination of both the flame view. And also this is what, see what things were happening on what threads, when, um, which is great for debugging webpages, but you can actually record this stuff and convert like system level traces, either on your Android phone, or if you use your own stuff, you can output the formats pretty straightforward, and then you can load it and visualize it in that. So that's been really, really helpful thing. Um, they've recently spun that out, or I say recently, I dunno when they did it, but I, I noticed recently that they, they used to embed, um, the, the viewer separately, um, in, I think it was in the Android project originally, but now it's been spun out of its own thing.
And so perfetto.dev, uh, you can go and see the web-based browser. You can load traces into that. It's pretty cool. And I think, you know, the thing that we were talking about before we started recording this was that there are a number of tools that are now starting to target Perfetto, um, traces so that you can load them and visualize them in that, um, of them was, I think Jane Street released something where they were using like the new Intel tracing stuff. Um, you know, I should, we should really have done more, uh, reading about this beforehand, but like, I, I've seen a couple of things recently, which have targeted this. And, um, and certainly the newer things that are coming in, in, in the most recent revisions of Intel chips are allowing more and more fine grained, accurate and low, late, low overhead ways of tracing what the heck is going on in your program.
which is super awesome. You know, normally, um, the, the one finds one self compromising and saying, I have to just keep running my program over, over again. And I'm doing some kind of like sample based pro profiling and that's okay for getting, oh, there's a lot of branch mispredictions, and I can sort of vaguely see the area of the code where they're happening. But if you really wanna know, like in what iteration of what loop something happened, you need to have this new, um, branch tracing. And I can't think what the other new thing is now, which is terrible, but there are some new things in the chip and effectively there's a highly compressed buffer that the CPU is streaming out. This is what I did to, which is the coolest thing. Right. I mean,
Like a journal, basically?
Yeah. But I used to have this on the PlayStation2, right. There was this crazy, this is a we're running out of time, unfortunately, because we, we maybe have to talk about this another time. But, um, there was like this crazy piece of hardware that was like a four times bigger PlayStation2 dev kit, which were already much bigger than like a PlayStation two ever was. Uh, I used to use it as a foot res mainly cuz it was the right height
You could see the graphics being drawn. You could then see cuz effectively the CPU only needs to tell you when it, um, when it took a conditional branch and which way it went, cause the rest of the time, if you know where it is, you can say, well, I know what a CPU does. It follows one instruction after another, the only time I, I need to know what happened is when it says, oh, I went either, I took the branch or I didn't take the branch and I can reconstitute the the flow then provide that I know which CPU cycle it was. And then everything else is like, oh, I branch mispredicted. And you're like, that's great. Okay. That's an event that goes into the journal. Yeah. Lots of fun. Anyway. Um, yes. What were we talking about? Visualization tools!
Visualization Profetto, uh, by the way, magic trace, I believe is the name of that Jane Street thing. And that is.
awesome.
on the Jane Street GitHub account
I should find the other thing that, uh, Travis Downs, uh, tweeted about the other day, which I was super excited about, but not apparently excited about enough to remember the name of which was to use, um, uh, the, the same kind of techniques. I believe that they're now available in these newer CPUs to do effectively a, a cycle by cycle simulation of a piece of code by running it over and over and over and over again and saying, Hey, stop after one cycle. And tell me what the, uh, that the counters look like stop after two cycles and start tell me what the counters look like. So you can kind of like slowly run the same bit of code over and over again and kind of say, well, now measure this now. Now where should that a, a lot of the problems come from the fact that the CPU has many, many, many, many, many counters that are available, like for events, interesting things that happen, but they can only count like three or four of them at a time. And so you have to keep being able to rerun the same piece of code over and over again. Um, that's very interesting. Anyway, we're actually running it out of time. I can't believe this, but I've gotten excited again.
I know be so animated about all of these things, but yes, yes.
But we should try this and talk more about this other time. I, I there's, there's so much to, to think about, yeah.
Maybe there's a part 2 here?
Maybe it could be a part 2. Let's think about that.
We should think about that.
So listener, if you, if, if there's a part two, there will be a part two after this, but maybe this is the whole thing. I dunno. Maybe we've maybe we've actually peaked
Yep. That, that sounds about right.
All right. Well I guess until next time, my friend.
Until next time.
