Building software that survives contact with reality, with Will Wilson

Sep 04, 2025 · 1 hr 17 min · Ep. 58
Summary

Patrick McKenzie and Antithesis CEO Will Wilson explore how software testing has advanced from basic approaches to cutting-edge deterministic simulation. They delve into how Antithesis creates "time machines" for distributed systems, allowing developers to identify and resolve complex, hard-to-reproduce bugs in critical systems and even classic games like Super Mario Brothers. The discussion also touches on the implications for future software development, particularly with the rise of AI-generated code.

Episode description

Patrick McKenzie (patio11) is joined by Will Wilson, CEO of Antithesis, to discuss the evolution of software testing from traditional approaches to cutting-edge deterministic simulation. Will explains how his team built technology that creates "time machines" for distributed systems, enabling developers to find and debug complex failures that would be nearly impossible to reproduce in traditional testing environments. They explore how this approach scales from finding novel bugs in Super Mario Brothers to ensuring the reliability of critical financial and infrastructure systems, and discuss the implications for a future where AI writes increasingly more code.

Full transcript available here: www.complexsystemspodcast.com/software-testing-with-will-wilson/

Sponsor: Framer is a design and publishing platform that collapses the toolchain between wireframes and production-ready websites. Design, iterate, and publish in one workspace. Start free at framer.com/design with code COMPLEXSYSTEMS for a free month of Framer Pro.

Timestamps:
(00:00) Intro
(01:23) Database scaling and the CAP theorem
(08:13) Abstraction layers and hardware reality
(15:28) The problem with traditional testing
(19:43) Sponsor: Framer
(23:16) The fuzzing revolution
(30:35) Deterministic simulation testing
(42:36) Real-world testing strategies
(47:22) Introducing Antithesis
(59:23) The CrowdStrike example
(01:01:15) Finding bugs in Mario
(01:07:37) Property-based vs conventional testing
(01:09:51) The future of AI-assisted development
(01:14:51) Wrap

Transcript

Intro

Welcome to Complex Systems where we discuss the technical, organizational, and human factors underpinning why the world works the way it does. Hi to you, everybody. My name is Patrick McKenzie, better known as Patio11 on the Internet. I'm here with Will Wilson, who is the CEO of Antithesis. Thanks for having me on. So as a bit of level setting for the audience, I know folks come here with all variety of backgrounds. Today is going to be...

I'm talking about a quite technical subject, how we do software testing and fault discovery, and how that has evolved over the last couple of years. But I will try to be the voice of the audience.

making things comprehensible. And Will will correct me where my increasingly out-of-date engineering degree does not serve me well. Yeah, and I will also try and... keep this grounded in real-world examples and try and draw out some connections to the non-software parts of the world, like financial markets or the airplanes you fly on, your power grid, that kind of stuff.

Much obliged for the general infrastructure audience here. So speaking about things in the real world that everyone has touched before, many of us use iPhones. iCloud runs behind a commanding majority of iPhones at this point.

And I think you and the team wrote the database that essentially orchestrates iCloud or some successor technology orchestrates iCloud. Do you want to talk a little bit about what people don't understand at databases running at the scale of the largest companies in the world?

Database scaling and the CAP theorem

Sure. So yeah, that's a great place to start. So my previous startup was a company called FoundationDB, which got acquired by Apple back in 2015. And basically in the early 2010s, I don't know... how much of your audience knows about this, but there was kind of a fad for, let's say, really weird databases. Because essentially what you had up to that point was single machine databases that ran on a single giant computer, and they offered you SQL, which everybody...

or a lot of people knew how to use. But these had a problem, which was that sometimes a very, very big company needs to store so much data that it doesn't all fit on one computer. Or sometimes they need to do so many transactions per second.

that all that computation can't happen on a single computer. And so lots of really smart people started trying to figure out how to write distributed databases. That is a database that runs on multiple computers at the same time. The problem, or let's just say the thing that happened... was everybody decided that it would be an awful lot of work to get that whole collection of computers all together speaking SQL or acting like it was one computer.

for the benefit of the programmer or the person who's doing the query. And there was this sort of vogue for like, let's make really, really weird seeming databases with... really strange consistency semantics, meaning different people querying the database at the same time.

might get different views on the data. And if that's a problem, well, sorry, you, the database user, need to reconfigure your business processes. Or you need to write really, really smart software that can figure out how to deal with that problem and give you, you know...

whatever sort of level of consistency it is that you need. And the people who took this view were bolstered by an academic result that came out of MIT called the CAP theorem, which... seemingly proved a conjecture by a really smart guy named Eric Brewer, who basically said that a distributed database, that is a database running on multiple computers, you can kind of have it be consistent.

meaning everybody gets the same picture when they query it, or you can have it be available, meaning somebody can always query it, but you can't have both those properties at the same time because computers might get disconnected from each other and so on. And basically what happened...

was people radically over-interpreted this result and decided that it meant all kinds of things that it didn't actually mean, and decided that it was an actual impossibility theorem, meaning that you couldn't build, like, a useful distributed database that offered you a kind of single machine-centric view of the world. And so they all gave up. Yeah, I will cop to having used the CAP theorem as a shibboleth a few times in my life. And it's interesting what...

engineering truths get used as shibboleths and then repeated until they prove something that the actual paper didn't prove. There's another paper in computer science about that diagonalization proof that essentially proves that a computer program can't answer any meaningful question in the world, which is not the takeaway that people have about computer programs because they, like, observationally, they do answer meaningful questions. And so...

The fact that they're theoretically incapable of doing that just passes by the wayside, and no one says, oh, AI is impossible because computers can't answer meaningful questions. I return the flow to you. Yeah, no, totally. There's impossibility results in AI as well, which, I mean, we'll maybe get to that later. It turns out they don't mean what people think they mean. Hopping back a few decades for people who didn't cut their teeth on SQL. SQL...

Fantastic technology. We've had it since, I think, the 70s or so. And fundamentally, it's built around a time when an enterprise's amount of data, you could bound it by things that are achievable by humans typing into machines.

And why do the largest enterprises in the world have much, much, much, much, much, much more data than we had in 1970? It's because we are recording data that is generated by computers about interactions in the economy, interactions of users and computer systems, interactions of computers, such that it is no longer bounded by the amount that the entire population of the world typing into computers all at once could give you an upper limit on the amount of data you have.

a Google or an Amazon or et cetera needs to store is greater than the entire population of people capable in typing in 1970 could produce given all of their time devoted strictly to typing into the computer. That's right, and the rate of updates is potentially much higher as well because the updates are being generated by other computer systems, which can write data very, very fast. Yeah, so the CAP theorem actually is a pretty interesting example of like the, I mean, call it the like...

production function of academic knowledge or something. When Brewer conjectured the theorem, the CAP conjecture that he came up with, it was like a very, very interesting question, very interesting hypothesis, which I think... My hot take is that it's been proven false now. And then when this conjecture got formalized into a quote-unquote theorem, what happened was that the academics involved in that effort, who are very smart people, who I have a lot of respect for, they defined the terms.

in an extremely narrow, extremely specific way, which made it a possible thing to prove. But the theorem they proved was semantically actually quite different from Brewer's original conjecture, even though it used all the same words. And so there arose this terrible ambiguity of what has actually been proven. I would argue that the existence of systems like FoundationDB or like Spanner or others is sort of proof that's not what they thought they proved. So anyway...

There was this vogue back then that was basically like, we're going to make you, the user, do all the hard parts of having a database. FoundationDB was founded on the principle that, like, no, we can have our cake and eat it too. We can make a database that scales across many, many machines very gracefully, works the way you expect intuitively a database to work, handles all of the conflicts.

handles all of the like, well, what if this machine is down at the moment you ask a question and the question goes to this other machine instead and this other one comes back up, right? Like all of that stuff handled for you under the surface, nobody can notice.

And one of the reasons this is important, particularly at the scale of the largest enterprises, is that at the scale of the largest enterprises, you have to be aware that people make mistakes all the time, inclusive of people that have engineering degrees and are employed as, say, junior programmers in the office in Brisbane, and you can't have the property of your system where, yes, data in our database, quote unquote, is reliable.

As long as every junior engineer in Brisbane understands exactly the properties of the distributed database system that we are providing them with. Yeah, that's right. And so long as all of the network switches between these

Abstraction layers and hardware reality

computers behave correctly. And so long as the sysadmin in the data center doesn't unplug a cable at the wrong moment and like dot, dot, dot, right? Like there's weird stuff that can happen in the physical environment that computing happens in as well. The mark of a good computer system is that it should hide that from you, right? You should not have to care whether the electrician at the data center is doing his job correctly.

A bit of ancient lore for computer programmers at this point: the genesis of the word bug in computer programming was that, I think it was an admiral in the U.S. Navy, observed that the root cause analysis for a failure of a program running on a mainframe was that there was physically an insect that got into a particular mainframe and caused a short. And similar things, you know, backhoes severing connection lines, literally cosmic rays corrupting databases.

happen all of the time at scale. And so we have to be robust against the world of atoms, even when we're generalizing and theorizing about the world of bits. That's exactly right. And it's actually even stronger than that.

The more distributed your application, the more you have to care about the world of atoms, because the more the world of atoms cares about you. If you are running a program on one computer, it is maybe vanishingly unlikely that a cosmic ray will hit that computer at the wrong moment.

If you are running your program on a million computers and they're all working together, the odds of a cosmic ray hitting one of those computers is now a million times greater. And there's other kinds of weird things that happen in the world. which actually are more likely to produce correlated failures. Like if you get a bad hard drive in one of those computers, there's a pretty good chance there's a second bad hard drive out there because bad hard drives come in batches and the supplier...

who sold those hard drives to the data center probably sold them a whole batch, which had a higher than average failure rate, right? And so all of a sudden, all these questions of, you know, it's actually very similar to the kinds of things that... risk managers at hedge funds have to worry about, like correlated market moves and stuff like that. Or even correlation between employee behaviors, yada, yada. The hard drive failure at scale is an interesting problem.
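As a rough sketch of the "million times greater" arithmetic above: if each machine independently has a small probability p of being hit in some time window, the chance that at least one of N machines is hit is 1 - (1 - p)^N, which stays close to N times p as long as N times p is small. The per-machine probability used here is a made-up number for illustration, not a measured rate, and the independence assumption is exactly the one that bad batches of hard drives violate.

```python
import math

def p_any_failure(p_single: float, n_machines: int) -> float:
    """Probability that at least one of n independent machines fails."""
    # log1p/expm1 keep the result accurate when p_single is tiny.
    return -math.expm1(n_machines * math.log1p(-p_single))

P_SINGLE = 1e-9  # hypothetical per-machine probability, illustration only

one = p_any_failure(P_SINGLE, 1)
million = p_any_failure(P_SINGLE, 1_000_000)
print(f"1 machine:          {one:.3e}")
print(f"1,000,000 machines: {million:.3e}")
print(f"ratio:              {million / one:.0f}x")
```

With correlated failures the picture is worse than this formula suggests, because one bad drive makes a second bad drive more likely rather than leaving the odds unchanged.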

consumers, let's say, think of their hard drive either these days. They're fairly reliable. You are unlikely to observe a hard drive failure in your lifetime with a particular computer, which I think was not the case when...

Some of us who might be a little bit older were using computers in the 90s or 2000s. But the way engineers reason about... hard drive failures, or used to back in the day, is that they had a rating from the manufacturer, mean time between failures, which is: when we had these in the lab and ran them with sample workloads, we can rate this particular model, or even this particular model sold into this particular channel, as having, let's say, 8,000 hours of continuous use between failures

on average, which means some of them might fail after 1,200 hours. Some of them might fail immediately on arrival. But that will be balanced out. And so you, the person who is buying 10,000 hard drives... You just have to do the math for this. And then the thing that the hyperscalers found out when they started deploying 10,000 hard drives at a time is, oh, that average summary statistic is a dangerous one. Because sometimes it is the case that, like, yes.

The lab didn't do anything unethical. The number is the number. However, this particular batch produced on Tuesday at this particular plant all fail at 500 hours on the dot. If we had a deployment topology where a particular American insurance company was buying 500 hard drives for us, that all happened to be...

the Tuesday batch from that particular plant, they'd be a little pissed off. And so we have to be robust against having potentially a container unit sized full of hard drives all going out within a minute of each other. Yeah, that's completely right. And there's actually other ways that the math is deceiving too. So one that I recently learned about is a physics reason.

When you're a large hyperscaler buying hard drives, you really want to maximize your hard drive density because you want to reduce the power requirements to store a given amount of data, and you want to have each physical compute node able to access the most data it can. That makes it easier to balance data across places. And so what you really care about is drive platter density. But the denser you make the drive platters, the closer the magnetic grains get to each other.

the more likely it is that the magnetic read-write head, when it comes to read or write a piece of data, actually flips an adjacent magnetic grain as well. And so basically, the odds of disk corruption increase superlinearly with the density of the data that you're storing, which is this really weird...

intrusion of the messy world of solid-state physics into the beautiful platonic realm of computer science, right? But this is actually a thing people have to worry about, and it can cause serious problems. Yep. And one of the fundamental things that... computers offer their users, that engineers offer their customers, and that the companies providing these infrastructure services supply to their customers is: the world is a messy place. Physics doesn't care what you want.

But we're going to wrap that in an abstraction layer, and to the maximum extent possible, you don't have to care about the things that are below you on the abstraction layer. And then the joy and terror of engineering is that those words, to the maximum extent possible, have like 15 asterisks on them. But it's a true fact about the world that the typical junior programmer working at an American insurance company who is writing something to disk on, without loss of generality, Amazon doesn't care that much about power distribution at Amazon data centers or how dense their disks are packed together.

That's right. And I think this is a big part of why FoundationDB got so popular and why ultimately Apple bought it and why ultimately...

they open sourced it and now everybody else is using it too, is that it just does actually do a very good job of hiding a broad class of such considerations from you. You know, we had a demo back in the day where we literally rigged up a little...

mini baby cluster of five computers and put them, you know, we'd bring this out at conferences. And then we would just have power switches in front of each one and power and like the network cables like right there. And we invite people to come in and just turn things off.

or unplug cables and plug them back in, plug them back in at different places. And it was a really compelling demo because no matter what you did, it just kept chugging along and didn't make mistakes. And that was so far outside of anybody's experience back then that, you know, they were like, oh my gosh, this is going to save me a lot, a lot of trouble. Finding and detecting trouble is an unsolved problem in computer science slash unsolved problem in management science as well.

but we continue getting better at it for the last couple of decades. I'd love to talk about what Antithesis does, but I think it might be useful context setting for people to talk a little bit about fuzzers first. Do you want to go in that direction? Yeah, I can do that.

The problem with traditional testing

So let's first take a step back and talk about how do most programmers think about software testing, QA, and finding trouble before it hits you in the real life. I'm going to describe it to you in as fair terms as I possibly can, and you're still going to think it sounds crazy. And that's because it is crazy. So software is a highly complex system. It has emergent complexity.

Software artifacts, even pretty basic ones, are among the most complicated things people have ever produced, right? And so whenever you've got something really, really, really complicated... And it needs to interact with the real world and with physics in all these ways that we've just described. What happens is behaviors emerge, which nobody designed and which nobody expected.

in which bridges have resonant modes that can cause the Tacoma Narrows Bridge to fall down. All this weird, weird stuff happens that... you didn't realize that, oh, when this particular O-ring gasket goes below a certain temperature, it's going to get brittle and then the space shuttle is going to explode. It's so much of the real...

bad stuff that happens in the real world, both in software and hardware, comes from, I think, what finance people would call like Knightian uncertainty. It's model error. It's not like I was wrong about the tolerances of this part. It's like there was this interaction between these two parts that nobody ever thought of or nobody ever saw, or this interaction between this one part and the world that nobody ever thought of and nobody ever saw.

And if I can interject for a moment here, the uncertainty starts coming in at very, very... Even if you construct toy models of programs, you are already past the point where you can... reason over the entirety of what those programs could do. So, finger to the wind for people. Imagine Twitter as a program just sitting in a box, and we still have 140 characters.

And you just want to have a programmer go through a test that none of the possible tweets break the Twitter box. That's already strictly impossible because the number of states Twitter can be in is greater than the number of atoms in the observable universe. Yes, and 140 characters is short.
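A quick back-of-the-envelope check of that claim, counting only the distinct tweets that could be typed rather than the full internal state of the system. The alphabet size is a deliberately conservative assumption (lowercase English letters only), and the atom count is the commonly quoted order-of-magnitude estimate of 10^80.

```python
ATOMS_IN_OBSERVABLE_UNIVERSE = 10 ** 80  # common order-of-magnitude estimate

ALPHABET_SIZE = 26   # assumption: lowercase English letters only
TWEET_LENGTH = 140

# Count every string of length 0 through 140 over that alphabet.
possible_tweets = sum(ALPHABET_SIZE ** n for n in range(TWEET_LENGTH + 1))

exponent = len(str(possible_tweets)) - 1
print(f"possible tweets: roughly 10^{exponent}")
print("more than atoms in the universe:",
      possible_tweets > ATOMS_IN_OBSERVABLE_UNIVERSE)
```

Allowing digits, punctuation, or the rest of Unicode only makes the count larger, so testing every input one at a time is off the table even for this toy model.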

Yeah, our favorite example of this is actually Super Mario Brothers, which is like the original Super Mario Brothers, 1985. That was a very, very simple game. But as you say, the number of states of that game, far greater than the number of atoms in the universe.

it is actually quite easy to find bugs in even that very, very simple game. You can clip through walls, you can do all kinds of crazy stuff, because there were scenarios that the makers of that game never envisioned. And they could never possibly envision them, because there's just so many.

And I think I'll asterisk this claim with when you say it is easy. It was not easy as of the development of Super Mario Brothers to find these things because the state of the art in testing software back in the day was you hired a bunch of people with the pitch "get paid to play video games," and you sat them in front of a development machine of Nintendo with the diskette inserted, and then they perturbed buttons on their controller really, really fast, and then wrote down...

like paper log sheets, and maybe those paper log sheets got pasted into Excel by somebody later in the day. We got somewhat better at software testing over the years. Somewhat. Well, we got a lot better at software testing over the years. I shouldn't undersell decades of intellectual accomplishment. But it still remains hard. And so some of the things that we...

developed in the intervening decades were automated testing, where an engineer uses their noggin and suggests, okay, I understand where this program is likely to fail. I'm going to ask a computer to use those failure states, potential failure states, and report to me whether the program works or not. And then there will be no more bugs in software ever again, and this will be wonderful. Right, right, right. Why didn't that work? Yeah, right. What could possibly go wrong with this idea?

I think the acknowledgement of an ad read sounds cooler in Japanese. This podcast is brought to you by Mercury. As many listeners are aware,

Sponsor: Framer

I love a good bit of banking. I even enjoy the sucky, frustrating bits of working with large banks, because I'm broken. You know who isn't broken? Mercury, which offers business banking services to 200,000 companies, including mine.

I've used them for business banking for more than six years and been quite happy for the duration. Everything happens in a well-designed website and mobile app. I use them for the debit card that pays for the studio rental, for paying myself profits, and for transferring money to contractors.

I even use them for wires to angel investments, and I've never gone through an involved rigmarole over a single one. And, wouldn't you know it, most of those wires go to other Mercury customers. Mercury works well for businesses at a variety of stages and industries. from quickly growing funded startups to this relatively tiny internet publishing operation. Visit mercury.com to apply online in 10 minutes.

Mercury is a financial technology company, not a bank. Banking services provided through Choice Financial Group, Column N.A., and Evolve Bank & Trust. Members FDIC.

I have this emergently complex system with all kinds of behaviors that no human being could ever foresee. And so the way I'm going to guarantee it works correctly is by sitting down and thinking really hard and thinking of all the problems that could happen.

and then I'm going to write a test to cover each one. That sounds great. No, it doesn't sound like it's going to work. It's like saying I'm going to test a plane, not by flying it. But by sitting down and looking at the blueprints and being like, oh, okay, I should make sure that if I wiggle this, that thing happens, right? And if I just do that enough times, I will eventually have a safe airplane. But of course...

As soon as I put it in those terms, it's like obvious that that will never, ever, ever give you a safe airplane. At some point, you need to actually put it together and fly it, and ideally fly it in many different weather conditions. And probably...

there's going to be some pretty scary things that happen. And then maybe if you do that enough times and get enough experience, you will have a safe airplane. But that concept came very, very late to software, even though software has many of the same problems. And we have management science, which is predicated on this incorrect conception of software being not merely correct, but this is the way we should organize our industries and lives around.

this lie, essentially, where, for example, waterfall development, where, okay, as part of the software development process, first, we're going to write a requirements document, and it's going to exhaustively specify everything the software will do. And then as part of the testing process, before we write the software, because we don't want those engineers in the room for this one, we're going to read the requirements document and then write all the ways the software can fail.

And then we are going to put them all in an Excel file. And when we get the prototype of the software, we're going to go through our Excel file and go through manually 12,000 different ways the software could fail. And when we observe zero at the end of that, the software has no bugs.

That's right. And this sounds like a parody, but if you work in government contracting, et cetera, et cetera, not merely will you be asked to do waterfall software development in 2025. It's literally the law in some places that the only way the government can buy software is...

by, you know, embracing this parody-level understanding of how the software they're buying works. Right. Well, the crazy thing is it's actually not just the government, like even very sophisticated software development organizations have imbibed a version of this. And it's actually pretty deep in their DNA. It's sort of how everybody thinks about it. You mentioned fuzzing, right? So for the listeners who don't know, fuzzing was kind of an amazing intellectual breakthrough.

The fuzzing revolution

that happened a few decades ago, which was mostly amazing. Amazing not because it's like a genius idea, amazing because it's such a dumb idea, and yet it was so powerful. And that really shows just how bad the status quo was, right? The idea of fuzzing is, hey, why don't I just throw completely random garbage into my software program and see what happens? And it turns out that if you throw completely random garbage into a software program...

pretty much any software program, it immediately crashes. And so it's like, oh, all of your 12,000 item checklists and your carefully crafted automated tests, like, they're all worth the paper they're printed on. Because as soon as I literally threw random garbage in, I found a new situation. I found a new scenario that none of your thinking ahead of time had covered. And that was like...

it was like a striking kind of humiliation for the way that people had developed software. But these days, it's become absolutely standard practice for anything security critical, for instance. And it's really actually changed how people do stuff. I think humiliation is kind of the right word. The software industry has oral folklore in a lot of cases, even more than it has research results in papers and academic conferences.

you know, the equivalent industry. And there was, for example, oral folklore that many eyes makes all bugs shallow. And so an arbitrarily well-distributed open source program is extremely unlikely to have security bugs. Of course, we'll find a few occasionally as a result of intense effort on the researchers, but no one could design a program that would find bugs at scale. And then to give a particular person some credit for an intellectual achievement, I think...

Michał Zalewski, if I'm remembering his name correctly, back in like 2013, just released this program that would do fuzzing based on an image of a rabbit. AFL. Yep, AFL. And you pointed AFL at arbitrarily important code and click go, and it would start reporting crash bugs and memory corruption errors and all the list of baddest of baddies that we had back in the day.
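To make the basic "throw random garbage at it" idea from earlier in the conversation concrete, here is a minimal sketch of a purely random fuzzer. The target, `parse_record`, is a hypothetical toy parser written for this example with two deliberate bounds-checking bugs; it does not come from any real program, and real fuzzers are far more elaborate.

```python
import random

def parse_record(data: bytes) -> bytes:
    """Toy parser (hypothetical): one length byte, a payload, one checksum byte."""
    length = data[0]                 # bug: crashes on empty input
    payload = data[1:1 + length]
    checksum = data[1 + length]      # bug: crashes if the record is truncated
    if checksum != sum(payload) % 256:
        raise ValueError("bad checksum")   # an expected, clean rejection
    return payload

def fuzz(target, trials: int, seed: int = 0):
    """Throw random byte strings at target and collect the inputs that crash it."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(trials):
        size = rng.randrange(16)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except ValueError:
            pass                      # clean rejection, not a bug
        except Exception as exc:      # anything else is an unplanned crash
            crashes.append((data, type(exc).__name__))
    return crashes

crashes = fuzz(parse_record, trials=1000)
print(f"{len(crashes)} crashing inputs out of 1000")
```

The only oracle here is "did it blow up in a way nobody planned for", which is the property the conversation describes fuzzers as checking.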

That's right. And AFL basically made one very simple change to how fuzzers work, which was another tremendous leap forward. AFL just said, hey, look, let's... We found all this stuff by throwing totally random garbage into programs. Let's just make it slightly less random. Let's feed in some random garbage. And then this is not a black box to us. It's a computer program. We can see which lines of code run.

And so then let's have a genetic algorithm. Let's use evolution. If a new piece of random garbage makes more lines of code run in the program, that has a higher evolutionary fitness, and we're going to make more garbage that looks like that piece of garbage. And he defines the reward function from new lines of code to how fit are you, and he defines a number of what he calls mutators, which take some random garbage and...

add some stuff to the end, or chop off some stuff from the end, or change a one to a two, or whatever. And you just run that for long enough, and it's exponentially more effective than pure garbage. And so that was, yeah, that was sort of the second generation of fuzzers and has spawned like a huge number of imitators ever since. My understanding, and feel free to correct me if I'm wrong, was that there was a 1.5th generation of fuzzers where...

We started with just throw random garbage or just throw this list of random garbage that we know exercises bugs in a lot of programs. And then the generation 1.5 was, okay. We're going to train machine learning models of which you could describe genetic algorithms as like one class of machine learning model, but we're going to train machine learning models to generate like a bigger list of garbage.

that we're then going to throw against programs without knowing the internals of those programs. And so the one big insight is... I can know the internals of the programs because I have source code available and because I have them actually executing here and can observe things happening in memory. And so I'm going to use that knowledge plus these.

genetic algorithms, which interesting convergence, like genetic algorithms were present enough in the literature that they were covered in my undergrad CS course in like 2004. I was doing it on a hobby basis to control ants in a simulation in the mid-2000s. And then fast forward 10 years later, and then they get implemented into...

a research result that immediately gets productionized by the largest firms in capitalism because AFL was like a stunning achievement in the state of the art. Yeah, there was, you know, it's funny, most things get invented many times.
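The coverage-feedback loop described above (mutate an input, keep it if it makes new code run, then mutate the keepers) can be sketched in a few lines. This is a simplified illustration and not how AFL itself is implemented: the target is a hypothetical toy with hand-placed coverage markers, whereas real AFL instruments compiled programs and tracks coverage at a finer grain, and the three mutators are minimal stand-ins for the "add stuff, chop stuff, change a byte" operations mentioned in the conversation.

```python
import random

def target(data: bytes, coverage: set) -> None:
    """Toy program (hypothetical) with hand-placed coverage markers."""
    coverage.add("start")
    if len(data) > 0 and data[0] == ord("F"):
        coverage.add("F")
        if len(data) > 1 and data[1] == ord("U"):
            coverage.add("FU")
            if len(data) > 2 and data[2] == ord("Z"):
                coverage.add("FUZ")
                if len(data) > 3 and data[3] == ord("Z"):
                    raise RuntimeError("crash: reached the buried bug")

def mutate(data: bytes, rng: random.Random) -> bytes:
    """Apply one small random edit: append, truncate, or overwrite a byte."""
    buf = bytearray(data)
    choice = rng.randrange(3)
    if choice == 0 or not buf:
        buf.append(rng.randrange(256))
    elif choice == 1:
        buf.pop()
    else:
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)

def coverage_guided_fuzz(max_iterations: int, seed: int = 0):
    """Return (crashing input, iterations used), or (None, max_iterations)."""
    rng = random.Random(seed)
    corpus = [b""]   # inputs that each reached something new
    seen = set()     # every coverage marker reached so far
    for i in range(max_iterations):
        candidate = mutate(rng.choice(corpus), rng)
        coverage = set()
        try:
            target(candidate, coverage)
        except RuntimeError:
            return candidate, i + 1
        if not coverage <= seen:   # reached a new marker: keep this input
            seen |= coverage
            corpus.append(candidate)
    return None, max_iterations

crasher, iterations = coverage_guided_fuzz(max_iterations=200_000)
print(crasher, iterations)
```

Purely random input would need to guess four specific bytes at once, on the order of 256^4 attempts, while the feedback loop only has to guess one byte at a time because each partial match is saved and built upon.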

And I think this is true of fuzzing as well. So another time that it got invented, right around the same time actually, was called property-based testing. I don't know if you've heard this phrase, but the idea behind property-based testing is, rather than... have, let's say, a test, an example-based test, where I insert one into my database, and then I assert, like, my database contains a one, and then, you know, now I have a test.

Instead, I write a property, which is a more general specification that says, hey, I can insert numbers into my database. If I insert one, I should then be able to find it afterwards. And now I can run that with any number, and I can run it... many times, I can run it concurrently, and I can add more operations. And this idea of basically constructing a model or a set of invariants or constraints on what my program must do, and then being able to randomize

a set of API actions. And if you think about it, the use of randomness, this idea that we're trying to check some final property at the end, in the fuzzing case, it's did I crash? In the PBT case, it's did I do something wrong? It rhymes. It just came from a totally different community. Fuzzing came from security researchers, mostly, and property-based testing came from the Haskell world.
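To make the database example concrete, here is a minimal hand-rolled property check using only the Python standard library. `TinyDB` and `DuplicateBugDB` are hypothetical stand-ins invented for this sketch, and a Python `set` plays the role of the model. Real property-based testing libraries also shrink failing cases, which this sketch does not attempt.

```python
import random

class TinyDB:
    """Hypothetical database under test: stores integers."""
    def __init__(self):
        self._rows = []
    def insert(self, value: int) -> None:
        self._rows.append(value)
    def remove(self, value: int) -> None:
        self._rows = [row for row in self._rows if row != value]
    def contains(self, value: int) -> bool:
        return value in self._rows

class DuplicateBugDB(TinyDB):
    """Same interface, but remove() only deletes the first matching row."""
    def remove(self, value: int) -> None:
        if value in self._rows:
            self._rows.remove(value)

def check_property(db_factory, seed: int = 0, runs: int = 500):
    """Apply random operations and compare against a set used as the model.

    Returns None if the property held on every run, otherwise the list of
    operations that led to the first disagreement.
    """
    rng = random.Random(seed)
    for _ in range(runs):
        db, model, history = db_factory(), set(), []
        for _ in range(rng.randrange(1, 20)):
            op = rng.choice(["insert", "remove"])
            value = rng.randrange(5)
            history.append((op, value))
            if op == "insert":
                db.insert(value)
                model.add(value)
            else:
                db.remove(value)
                model.discard(value)
            # Property: database and model agree about every possible value.
            if any(db.contains(v) != (v in model) for v in range(5)):
                return history
    return None

print(check_property(TinyDB))
print(check_property(DuplicateBugDB))
```

The single example-based test "insert one, then find one" passes against both classes. Only the randomized sequence (insert a value twice, then remove it) exposes the second one.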

And, you know, they never talked to each other. And so they just sort of invented all these ideas again. And they like invented many of the same tricks and using coverage as a form of feedback and like evolutionary stuff and ML. And like, it's sort of, it just sort of happened again.

which I think is so interesting. And, you know, it's a thing that actually happens quite often, you know, in most disciplines, I think. I think we have, this is every paper about interdisciplinary studies ever written, but... Failure to communicate between smart people that do different things as their primary day job. And then over and over again in the tech industry, for some value of the word tech, we have roughly simultaneous invention of different concepts.

A person in the financial side of the technology world failed to read a CS paper because they don't spend much time with CS papers. And therefore, many, many tens of millions of dollars was spent at a large firm in developing something which... is on undergrad curricula in a different part of the building. It is fun when we get transplanted ideas across industry and then are immediately able to operationalize them or weaponize them, to use the term sometimes used in the industry.

to great effect at actually making everything safer, making it more robust, etc. Right. So now joining together the two stories that we've been telling here, fuzzing got reinvented a third time, and it was at FoundationDB.

Deterministic simulation testing

The key insight that we had at FoundationDB, or one key insight, was people are basically trying to test for all kinds of reliability properties or fault tolerance properties the old, sort of example-based way. For example, I want my database to keep working if this network connection drops halfway through a transaction. That sounds very reasonable.

The thing that 99% of developers do is they write a special test that tries inserting some value, and then 45 milliseconds into the transaction drops the connection, and then they check and make sure that everything worked. Great. What have you now proved? Well, you've proved that under certain circumstances, if you drop the thing 45 milliseconds in, you know, the thing still works sometimes, maybe.

You have not proven that it works if you drop the connection 44 milliseconds in or 46 milliseconds in. You've not... proven that it works if you were writing a different value at that moment, and you've not proven that it works if somebody was walking by the computer and gave it a shove at that moment, and so on and so forth. And so actually it turns out that pretty much every kind of like...

"my computer effectively hides physics in the real world from me beneath a layer of abstraction" type property that you might want is a property that can really only be checked via this more randomized, constraint-based testing approach, right? What you want to say is, if I'm writing a value, you can drop the connection at any point and it will work. And then I want the computer to go try and do it at a whole bunch of different points

so that I believe that this is actually true in general, not just like the one that I happen to pick. And you want to be able to say a requirement, which is in many RFPs and similar. And also the system should be robust to hardware failure.

which is one sentence to write, but it's not one sentence to test anywhere in the world because there's an infinite potential universe of potential hardware failures and scenarios and so on. That's exactly right. And so basically at FoundationDB, we developed a new... style of testing, which, speaking of parallel discovery, may have actually been invented at AWS at the exact same moment that we were inventing it. We had a podcast recently with Mark Brooker over there about that.

But basically this new style of testing was called deterministic simulation testing. And you can think of it as fuzzing for the world, right? Instead of fuzzing the inputs to my program, I'm going to fuzz the environment in which my program runs. I'm going to try a whole bunch of different random, not random, like, what did I do, but random what was happening while I did what I did.

We're going to try running the transaction with network connections dropping in all these different moments and different combinations of machines failing and different combinations of hard drives failing. And, you know, every sort of... not just everything you can think of that could happen, but everything you can't think of that could happen. We're just going to create random generators for real-world chaos. And we're just going to keep our computers crunching and simulating

what would happen to the database in all these different situations until we find a case where it does something wrong. And then the beauty of the approach is... A real challenge, usually, with this style of testing is that correctness can be so heavily timing dependent, right? Sometimes a computer algorithm or process will work right

every single time except for one in a million. Because in that one in a million, this thread gets slightly ahead of this other thread. You can't exactly know ahead of time what the crucial condition is. That's especially true in distributed systems. But the thing about deterministic simulation testing is because all of these failures are virtual and they're happening in this simulation that you've constructed of your software, they're also perfectly repeatable and replayable.
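A minimal sketch of why determinism matters, assuming a toy two-step transfer rather than a real database: every random decision, including the injected connection drop, comes from a single seeded generator, so any failing run can be replayed exactly from its seed. The function names, the 1% fault rate, and the balance invariant are assumptions made for illustration.

```python
import random

def run_simulation(seed: int, steps: int = 50):
    """Run one simulated history. All randomness flows from `seed`."""
    rng = random.Random(seed)
    balances = {"a": 100, "b": 100}
    trace = []
    for step in range(steps):
        amount = rng.randrange(1, 10)
        balances["a"] -= amount                  # step 1: debit
        trace.append(("debit", step, amount))
        if rng.random() < 0.01:                  # injected fault
            trace.append(("connection_dropped", step))
            continue                             # bug: credit never happens
        balances["b"] += amount                  # step 2: credit
        trace.append(("credit", step, amount))
    # Invariant: money is neither created nor destroyed.
    ok = balances["a"] + balances["b"] == 200
    return ok, trace

def find_failing_seed(max_seeds: int = 10_000):
    """Try seeds in order and return the first one that breaks the invariant."""
    for seed in range(max_seeds):
        ok, _ = run_simulation(seed)
        if not ok:
            return seed
    return None

seed = find_failing_seed()
first = run_simulation(seed)
replay = run_simulation(seed)
print(seed, first == replay)
```

The failing seed is the whole bug report. Rerunning it yields the same trace, step for step, which is the property that real hardware and real networks do not give you.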

Rather than discovering one of these weird one in a billion failures in production, which might not sound like a big deal, except if you're Facebook, one in a billion is happening many times per day, right? But it will definitely not happen while you've got your microscope out and are looking for it. So instead of having to deal with that and to work backwards and try and figure out what were the conditions that caused this to happen, you have it in a time machine. You have it trapped.

with your YouTube scrubber and able to go back and forth and look at exactly what went wrong. And so it just makes it a drastically more productive environment for figuring it out and for making systems really solid. That's the secret behind how FoundationDB got so good. And it's now a technique that's escaped containment and is like sort of spreading through the industry. Yeah. I think a few years ago, Patrick Collison, who runs Stripe, which I previously worked at,

lamented that perhaps not the state-of-the-art, but the standard way that we do software development is not time-traveling debuggers everywhere. Like every other technology, it had to be invented. And you folks had a particular deployment of what is... essentially a time-traveling debugger for testing databases. And it helps sort of close the loop between we are able to speculate with high fidelity that the program will have a failure in cases that rhyme with this one. Now you...

software engineer, get that report, and you have to do something about it. Sometimes doing something about it is, eh, that'll never happen. And I think we've all been bitten by that one once or twice over our careers, but sometimes it's like, okay.

I have to go back and make my code more robust or my deployment environment more robust or include countermeasures against this particular source of corruption, etc., etc. And essentially, the software is making the... engineer and the system the engineer is embedded in much, much more efficient at going through this loop versus...

The logs say something happened. The state in the database is, after I've spent three hours looking at it, not what I expected it to be. We're across this, like, you know... artifact that is literally more complicated than the space shuttle is the problem. And ideally, I can solve that with myself and a small team of people on a budget of 12 hours.

There's actually another way that it makes you vastly more productive. The thing you said is 100% true, but there's actually another one too, which is the latency between when you introduce the bug and when you find it has a huge effect on how productive you will be debugging it. If you introduce a bug and it is immediately found, you get a red squiggly mark in your editor or whatever, you can solve that bug in one second by pressing Ctrl-Z.

If you introduce a bug and it's not found for six months or 12 months and it's found in production and... By that point, you may have left the company. Some other guy is now debugging it. He has no idea how your code works. Nobody remembers what was being modified there. Nobody remembers the context of the change that introduced the bug. That's going to be a...

orders and orders and orders of magnitude harder to debug. Plus, the environment the bug is found in gets increasingly removed from the one that it is created in. And so if you find a bug in your own code the day you are writing it, you have... It is not necessarily the case that you have written that bug. It could be literally a flaw in silicon that was laid down many years ago. It could be a bug someone else at your company created 12 months ago and you were the first person to exercise it.

On Bayesian evidence, it's probably you. And, you know, you can use the fact that it's probably me. It's probably what I'm doing right now to quickly locate it. Whereas if... 10,000 developers work at your company and they're all cranking and doing multiple deploys a day, then there might be hundreds of thousands of code changes to step through 12 months from now to figure out, okay, which code change...

was the one that introduced the issue. And we have some fun tools for this. One of the most interesting moments in my life as an engineer was being told about git bisect, which is this... fun little thing that you can get source control programs to do to say, okay, if you write a minimum test case, which can exercise a bug, and that's a big if, if you can write that minimum test case, magic software

can find where exactly in the timeline that bug was introduced and pinpoint it to the minute. And that is not primarily used for, we know who to fire now, but is primarily used for, we know where exactly to put our remediation efforts. But like find the minimum test case is hard. Yes, absolutely. And the way I would put this is that basically really powerful, robust testing, which to your point is able to consistently and quickly find a particular issue.

actually makes debugging in the sense of root cause analysis and detective investigations through log files and data files and forensic analysis. It actually just makes all of that. not happen. It actually makes it incredibly rare. And that's such a gigantic productivity boost for a team that people who have lived in one world and not the other...

often I think don't actually understand what the other world looks like. There's one world where your job as a software engineer, 90% of your time is being spent trying to figure out what on earth is causing something to explode in production. And then there's another modality where effectively 0% of your time is doing that, and you're just going 10x faster. And I think these worlds both exist out there in parallel, sometimes even within the same company. And people just like...

If you've only ever had one, you just have no idea that the other one is occurring. And a factor which I think is less true at the tech majors, although still happens at the tech majors, but is extremely observed in many other places in the economy which employ lots of developers like, say, finance, is often...

The issue is not strictly speaking with you. The part of the broader quote-unquote system that is causing the issue is one that is at a counterparty or on somebody else's computer, etc., etc., etc. And there's a...

fun saying in the software engineering community: a distributed system means a computer that you don't even know exists can blow up your machine. But when you don't have full visibility into the full state of the system, and when you don't necessarily know what is the business process that ran... prior to me getting this data file that caused this later API request that I made to have an error in it, then the forensic reconstruction very literally sometimes involves forensic accountants.

And your job cannot be that satisfying or fast when literally doing software development requires bringing in the forensic accountants to figure out what went wrong. Which sometimes you have to do because the thing that went wrong is not simply like, oh... a cat photo on Facebook got corrupted and someone had to hit Ctrl-F5 to see the cat photo again. It's like, no, people lost money over this. Or no, the computers run the world we live in now and the upper bound for...

bad things happening in the world is not "people lost money." Yeah, no, you're completely right. And this is actually a super, super interesting topic. I think that when you are faced with this kind of situation, right, the conservative choice as a software developer... is to assume the worst possible behavior on the part of your counterparty. And the problem, and the reason why this often doesn't happen, including within organizations,

Basically, we had a saying back at FoundationDB, which we've also carried over here, which is that all observable behavior of a system eventually becomes part of its interface. Meaning...

Real-world testing strategies

Let's say that I have an API, and my promise to you is the API will always give the right answer. I do not promise that the API will always return an answer or will always return an answer promptly. But... In reality, I'm a good engineer, and so I've designed this API, and it actually works really well, and it does always return the answer promptly. You will naturally grow to assume that that is part of its contract and part of its design behavior, even though it's not.

even though it's not something you should count on. And so your code will get less robust as a result because it will start to make the same assumption. And so the solution to this that we came up with was something that we call buggification. And what buggification means is basically if I am writing an API, at least when I'm running in test, maybe also when I'm running in production, if you like to live life on the edge, occasionally...

probe the outer limits of what I'm allowed to do. I know that I can always return to you within five milliseconds, but I've only promised to you that I'm going to get back to you within a second. So occasionally... deliberately just delay for 900 milliseconds, and then send you a response so that you will not come to count on my responses taking five milliseconds. And this can be done sort of at every layer in the system. And if it's done well and done thoughtfully...

it results in a much, much more resilient system overall. Can I give you an example of something that Stripe does, which is now listed as, and I'll preface this example with, I'm not speaking on Stripe's behalf, I no longer work there, and... my technical understanding might be somewhat out of date. People often ask why credit card failures happen in the real world. Like, I have a valid card. It definitely has money available on it. I tried to buy something. The website said,

My bank rejected the transaction. Why did this happen? And for many, many years, the financial industry's best answer to this question was, we don't know; gremlins. And the reason that all the king's horses and all the king's men didn't have a better answer to that than...

gremlins, was the financial industry is not one computer operating in one room where we can just inspect the code and find the bug. It's an ecosystem of computers talking to each other. And somewhere inside that ecosystem, something went wrong, but we're not exactly sure why. And so the thing that Stripe does is for retries where, you know, the first thing to do, similar to the case of like, you know, losing a photo on Facebook, just hit refresh and see if it works.

So the first thing you do, if you send a credit card transaction to the ecosystem and it comes back no, is to say, all right, let's just try it again. For some very, very small fraction of retries, which is in... absolute numbers, because Stripe does more than a trillion dollars a year of volume these days, a very large number of retries, say, okay, I'm going to make some semantically neutral edits to that transaction that I just tried to do. So the transaction is fundamentally this tweet-length

thing that you ask the world to do for you and say, all right, if I was to phrase that very slightly differently, it shouldn't change what I'm asking you to do, but maybe it will exercise a different path through... the particular bank that this transaction is going through. And you run that loop many, many millions of times. And eventually, you learn quirks about, say, various financial institutions. And so it might take a team of historians...

And system engineers looking into a particular financial institution to understand why is it the case that you really, really hate when zip codes in the United Kingdom are down-cased versus being uppercase, which is how they are traditionally rendered. No set of humans should ever have to ask that question in the real world. A computer can just learn over time. I had millions of...

bites at the apple. And this is just one of the apples in the orchard that I found. If I'm working with this particular bank in the middle of Kansas, upcase all the British zip codes. No one has to ever find it. You're basically fuzzing, but you're fuzzing your way to victory and to a transaction completing successfully rather than trying to cause it to crash.

Right. And that is exactly the end point. Like, what portion of good transactions get approved versus declined, where we have retrospective knowledge that a transaction is good because when the transactions are bad, when it should have been blocked, when it actually was fraud.

You hear from somebody later, like, nope, that was wrong. So you run this experiment, not merely running it millions of times, but running it over time, and then get to iterate towards the full shape of the complexity of the financial universe. Anyhow, that's my fun... fuzzing in the real world story. That's super interesting. So, Antithesis. We have this notion of a time-traveling debugger, which had to be invented like all good technologies have to be invented. But...

Introducing Antithesis

Antithesis is more complex and better than simply having a time-traveling debugger that works on just the code that you wrote on your own machine. Can you give people a little bit of an explanation why? Sure. When you are trying to test a piece of software that just runs on your machine...

It is relatively straightforward in some sense. I mean, we've just spent an hour talking about how not straightforward it is, but it's actually quite straightforward relative to the other possibility, which is that it involves other people's computers, or even just other computers that are also under your control. Because when there's a whole bunch of different computers, suddenly...

it becomes, well, number one, it becomes impossible to say to this entire system, please make the exact same set of events happen again. Because computers are very complicated things. They're real-world hardware. Literally what temperature they are and what the humidity is in the data center can result in operations executing in a slightly different order. The time that a packet takes to go through a network can be different each time you do it.

And if some bug, let's say some really catastrophic bug, requires some really quite specific set of circumstances to manifest, you may never be able to get that to happen on command. And so time travel debuggers... which have existed for 20 years for a single process running on a single computer, were long believed to be completely impossible for multiple computers.

Antithesis basically completely fixes that problem and gives you a fully reproducible, fully deterministic environment that can encapsulate many computers and many processes and real-world, big, complicated, old, crufty software written over decades that's spanning some crazy organization.

and lets you take that whole thing and boil it down and run it in a totally reproducible way. And this can be used for highly efficient debugging, and it can also be used for finding the bugs in the first place. Because... The other bad thing about having many computers that are doing things a little differently every time is that it's basically impossible to fuzz them. It's basically impossible to do any of this stuff.

Well, because you yourself said, right? You've got some machine learning model that's trying to learn, oh, when I give it this input, I get this result, and I've made something happen. But if things can happen differently...

on different trials for completely random reasons outside of your control with nothing to do with what you did, how can your model possibly learn anything, right? And so by... by taking all of the complexity and noise and chaos of the real world and turning it into a fully controllable, fully observable simulation of the real world.

We both solve the debugging problem, and we solve the "find all my bugs really fast, please" problem. And generally, our customers approach it in the opposite order. First, we find their bugs really fast, and then we give them... a super-powered time machine for debugging them. Basically, what we're trying to do is we're trying to bring some of these new software testing paradigms, which are, let's say, a little bit more empirical,

a little bit more like, let's actually take the airplane out there and fly it and see what happens. And we're trying to bring them out of production into test earlier in the development lifecycle so that you're much faster and more efficient when you solve the issues. And we're trying to bring them within reach of every organization that builds or consumes software, not just the very most advanced, sophisticated people who live and breathe this stuff.

And so just voicing over that for a minute, production versus test are two different environments that software might run in. And a practice of the most sophisticated companies running software at global scales for the last, let's say, two decades has been... We do quite a bit of ongoing testing, but the actual users hitting software in the real world and seeing the bit flips caused by cosmic rays and the backhoes hitting internet connections is the only thing that exercises the software.

to the degree that it happens in the real world. So we're going to tighten the loop between getting those failures in the real world and surfacing them to engineers in some sort of... facilitated fashion. That's correct. Although we actually think that even that is too much of a concession to the old bad way of doing things. You know what's better than waiting for a real backhoe to hit a real fiber optic cable?

Let's have a simulated backhoe hit a simulated fiber optic cable a million times a second in a simulation running on your computer. Yeah, and so this is, you know... Not live fire. There are no consequences in the physical universe for you running this simulation on your cloud fleet or similar, which, yeah, it was a concession by AppAmaGooFaceSoft, which is my...

fun phrase for them. But, you know, we did not have the technology available to simulate the universe in real life. And so we let real life serve as our training set, including like... some amount of regret to have, oh yeah, things actually broke for real people in the training set. Which interestingly creates a very different feeling with regards to software reliability among firms which are extremely good at

global scale, reliable software, and many other firms that write software. This has come to mean a different thing than it meant back in the day. But once upon a time, Facebook had a phrase, move fast and break things.

And many people in serious places, like say the financial industry and government and similar, said, oh, those tech idiots who are breaking things all the time for no reason or because they actually wanted to break things. Clearly, that will not fly when people's lives are on the line, when money is on the line, et cetera, et cetera.

But you end up with extremely different paradigms in different places. To what degree are we relying on the physical universe as a source of input? What is the cycle time between errors happening in the real world and that getting to an actual engineer to look at? What is the number of errors that are required to rise to our attention? Right. I think the way I think about this is the real world, the physical world, is not

optional, right? You are eventually going to deploy your software in the real world, or at least I hope you do. And you're eventually going to have lots and lots of users, at least I hope you do. And so you do always need a way of noticing what is happening there and responding quickly and rolling something back or jumping in and troubleshooting if necessary. That, you know, just like we all need fire departments, right?

Fire departments are very important because no matter how good you are about fire safety, sometimes a fire will happen. But wouldn't it be nice if that were a very rare circumstance? And wouldn't it be nice if this were a really exceptional thing? Wouldn't it be nice if we could make... fires happen 100 times less often by using extremely flame-retardant materials in all of our construction. That's how I view sophisticated, good...

testing with fault injection and simulation and all the rest of it. We are eventually going to get to the real world. It's going to happen. But I would rather it happen in the real world 1% of the time instead of 100% of the time.

And I think that that's going to make for a happier and more productive software development organization. Increasingly, it is not optional to incorporate things like, say, if you're in finance, you have counterparties, deal with it. Because otherwise, you do not get the money.

And those counterparties are running computers that are not under your control, but which will intimately influence computers that are under your control. And our set of tools for mocking, which is a... magic word of art in software testing, our set of tools for mocking the behavior of counterparties or computer systems that weren't under our control were extremely limited, bordering on primitive for most of, let's say, the last 10 years.

So one thing that you can do with Antithesis is say, okay, there's a computer. You don't know what it looks like on the inside. You do not control it. The person who runs it... does not get signed by the same person who signs for our paychecks. But you got to deal with that computer anyway. Okay, here are some things that computer could do to you. And is your code robust against that? That's right. And we try to pursue the sort of...

principle of worst behavior that I outlined before, right? So for example, many, many, many of our customers use Amazon Web Services. And so we provide in the simulation a fake Amazon Web Services that you can use. And that, you know, your software doesn't even know it's not talking to the real AWS, which is great. But our fake AWS is an evil AWS, right? It does exactly what it says on the tin and no more. And so if you have come to rely on writes to S3 always succeeding...

you will quickly discover that sometimes S3 will return a 500 error to you, and you better be prepared to handle that and to retry it atomically and whatever else you need to do. And this is really good for flushing out... So many bugs are really conceptual bugs rather than software bugs. They're a failure to imagine a particular circumstance happening in the world.
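As an illustration of the "evil" dependency idea, here is a sketch of an in-memory object store that deliberately fails some writes, plus a caller that retries. `FlakyObjectStore` and `put_with_retry` are hypothetical names for this example. This is not the real AWS API and not how Antithesis implements its simulated services, and a production retry would also want backoff and a bound on total time.

```python
import random

class FlakyObjectStore:
    """Hypothetical in-memory object store that injects write failures."""
    def __init__(self, seed: int, failure_rate: float = 0.3):
        self._rng = random.Random(seed)   # seeded, so failures are replayable
        self._failure_rate = failure_rate
        self.objects = {}

    def put(self, key: str, body: bytes) -> int:
        """Return an HTTP-style status: 200 on success, 500 on injected failure."""
        if self._rng.random() < self._failure_rate:
            return 500                    # allowed by the contract, so do it
        self.objects[key] = body
        return 200

def put_with_retry(store, key: str, body: bytes, max_attempts: int = 5) -> bool:
    """Retry a write until it succeeds or attempts run out.

    Retrying is safe here because writing the same key and body twice leaves
    the store in the same state as writing it once.
    """
    for _ in range(max_attempts):
        if store.put(key, body) == 200:
            return True
    return False

store = FlakyObjectStore(seed=7)
naive_losses = sum(store.put(f"naive-{i}", b"x") != 200 for i in range(1000))
retry_losses = sum(not put_with_retry(store, f"retry-{i}", b"x") for i in range(1000))
print(naive_losses, retry_losses)
```

Code that assumes every write succeeds silently loses data against this store, while the retrying caller loses far less. The point of the fake is to make that difference show up in test rather than in production.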

And I think that's what we are very good at turning up quickly, is like, oh, it never even occurred to you that this machine might lose communication with this other machine. But guess what? It can. And we're going to find out what happens when it does. And there were failures that...

would never happen in the life, like, let's say, you know, the predicted career of a particular engineer at a particular company. So there is no muscle memory for this kind of failure. Or even, you know, you, you personally, engineer Bob, are unlikely to experience this failure or observe this failure in your entire 40-plus year engineering career. However, that does not mean this failure will never happen. And the number of failures that are in that weight class, it's like...

individually, you are unlikely to experience any one of those failures. However, you're likely to experience a failure that is in that class. And so let's test out as much of that class as possible every time to make sure your program is robust against it. That's right, and often these things are actually very cheap to be robust against. Often it's like, oh, I just need to retry that. Sometimes the fix is like a line of code, but it's like, man, if you didn't have that line of code...

you might have your month ruined or worse, right? The thing about finance is the stakes are almost unbounded. Like when I'm talking to people in the financial services industry, I just have to say Knight Capital and they like, oh, they like recoil, right? Like this is the hedge fund that got like literally liquidated, like went to zero over a single software bug.

And I think in the crypto industry, it's even worse. People are rightly very, very jumpy about this stuff. Or if I can give another example there. Last year, CrowdStrike, a single bug by a single engineer at a single... company, which is not in the financial industry, took down a material portion of all banking services in the United States on a Friday.

Great example. And that's a bug that's not even in your software, right? And that's actually a service that we do provide to some people. We have some customers who are basically like, look, my entire company... is like resting, it's like that old XKCD comic, right? It's like all resting on this one toothpick, which is this open source library, and I have no idea what it will do and whether it works correctly. And so we end up testing some open source project or some vendor's

software on behalf of the person who's consuming that vendor software, which is kind of an interesting dynamic. And that's particularly critical to test. And also, I won't say untested in the status quo, but neither here nor there.

The CrowdStrike example

One part of that dynamic is, of course, in serious businesses, you have a quote-unquote service level agreement with your vendor that you can get them on the phone when there's an issue. But the time to detect and the time to resolve are often orders of magnitude more than when you, or your team, or your organization, control all the moving parts.

And indeed, you can find this in write-ups of, say, military plane crashes where, okay, well, one of the parts of resolving this, while there's still a real human being in the air whose life is at risk, is we need to get on a conference call with engineers at the company that built the airframe and figure this out. This actually happened very recently for the F-35. I don't know if you saw that story.

Yep, that was exactly what I was subtweeting, as it were. Right, well, and the service level agreement can be cold comfort if, you know, you get your money back. Meanwhile, you've like just booked a $20 million loss because some customers canceled because they're furious that something went wrong. Usually the SLA doesn't give you credit for that. There's very possible...

Possibly, you know, a Knight Capital kind of incident where a company vanishes overnight. And good news, the creditors to the bankruptcy estate will have a $2,000 claim for that day's cloud services available to them. That's exactly right. Yeah. Let's dig into the weeds more here about how it works, because I saw this technology and I'm fascinated by it. One thing I'm fascinated about, and I know, like, on the one hand...

kind of a marketing stunt given the kinds of people who get into software development. On the other hand, it's awesome. Let's talk about it anyhow. You point this infrastructure that you've created at Mario and find novel bugs in Mario. How does that even work?

Finding bugs in Mario

Yeah, so it really, I think, just speaks to the sheer power of a little bit of randomization and then a lot of smarts. So the thing that's sort of philosophically interesting about randomness... This is going to sound a little paradoxical, but pure randomness isn't very random. If I just flip a coin every time I need to make a decision, the resulting distribution

has a very, very predictable shape. It's a binomial distribution. It trends towards the Gaussian as the number of coin flips approaches infinity. That's one of an infinite class of possible distributions, right? And it's not even a very interesting one. It's exponentially unlikely to get really, really positive or really, really negative. And so just randomizing what you do doesn't actually let you get very far into software.
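To put rough numbers on the "exponentially unlikely" point, here is a minimal sketch that computes the exact chance of an extreme outcome from fair coin flips. The function name `tail_probability` and the 90% threshold are illustrative assumptions, not anything taken from the episode or from Antithesis.

```python
from math import comb

def tail_probability(flips: int, min_heads: int) -> float:
    """Exact probability of at least `min_heads` heads in `flips` fair coin flips."""
    # Count the outcomes with k heads for every k at or above the threshold.
    favorable = sum(comb(flips, k) for k in range(min_heads, flips + 1))
    return favorable / 2 ** flips

# The 90% threshold is an arbitrary stand-in for a "really, really positive" outcome.
for flips in (10, 100, 1000):
    p = tail_probability(flips, flips * 9 // 10)
    print(f"{flips:>4} flips, at least 90% heads: {p:.3e}")
```

The probability collapses as the number of flips grows, which is the sense in which purely random decisions rarely wander far from the middle of the distribution.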

What you need to do to get very far into software is a much trickier, more sneaky exploration of the software using randomness sometimes and using structure sometimes. And then the thing that can give you a super exponential boost to your efficiency at doing this all is remembering where you've gotten and retrying from points that are fruitful. So this is where the time machine comes in.

People think that a time machine is mostly useful on the debugging side, but the time machine is also very useful on finding the bugs and getting deeper into a program side. Let's use Mario as an example here. Level one of Mario. Imagine I'm playing Mario with a fuzzer. I'm playing with AFL. People have actually done this stunt. I run into that first Goomba, and let's say I've got a 1 in 8 chance of just random button mashing, getting over that Goomba, and getting to the other side of him.

That's great. One in eight, our computer can eat that for breakfast, right? It's going to find that really fast. But then there's another Goomba and there's a one in eight chance of getting past him. Now I'm one in 64. And then there's like a pit I have to jump over. That's another one in eight, let's say.

In order to get to the end of Mario, there's hundreds and hundreds and hundreds of such obstacles. And so when I multiply together the low probability of getting past each of these obstacles with randomness alone, the result is infinitesimally small. I'm never going to beat Mario with a fuzzer this way, right? The thing that helps, though, is if I can notice that I have gotten past one particular obstacle

and I can remember what I did to get over that obstacle, then I don't have to rewind to before that obstacle. I can just pick up from where I left off. And that lets me chop terms off this very long product, which means that I can get deeper into the software really fast. Now, you, like, anybody, not just Antithesis, could do that with Nintendo, because a Nintendo emulator is already a deterministic simulation of...

a Nintendo device, right? That's because speedrunners use them and tool-assisted speedruns use them and they have to be frame-perfect and so on. But we've just talked about all the ways in which the real world is not deterministic. It's not like that. And so that strategy for exponentially increasing the speed at which I find bugs could not possibly work for real-world software unless I have a magical simulation,

a deterministic simulation of the world, which I can rewind and get back to any moment I care about. So what that means now is I can take your software or some big bank software or whatever, and I can put it in there and I can...

run stuff. I can just try stuff. Not necessarily until I see a bug, but until I see something different, right? Until I see something that my machine learning model says, oh, that's interesting. We haven't seen that yet. And then I can remember the exact sequence of steps.

that got me to that moment. And I can run hundreds more trials from that point, having saved my game at that moment, basically. And this gives me an exponential speedup in getting deeper and deeper into their code, provided that I can actually recreate everything perfectly leading up to that moment. And that's sort of the magic of the system we've built, that we can do that. One of the things I love about Mario is the proof of work here: you mentioned tool-assisted speedrunners.
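The save-and-resume idea described above can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not how Antithesis or AFL actually work: the level is modeled as a row of identical obstacles, each cleared with the 1-in-8 odds used in the conversation, and the names `naive_search`, `checkpointed_search`, `OBSTACLES` and `PASS_CHANCE` are all hypothetical. Seeding the random number generator stands in for determinism: the same seed replays exactly the same sequence of attempts.

```python
import random

OBSTACLES = 6        # hypothetical level length, kept small so the naive search finishes
PASS_CHANCE = 1 / 8  # the illustrative per-obstacle odds from the discussion

def naive_search(rng: random.Random, obstacles: int = OBSTACLES) -> int:
    """Button-mash from the start every time; return total obstacle attempts."""
    attempts = 0
    while True:
        cleared = 0
        while cleared < obstacles:
            attempts += 1
            if rng.random() < PASS_CHANCE:
                cleared += 1
            else:
                break  # failed: the next run starts over at the first obstacle
        if cleared == obstacles:
            return attempts

def checkpointed_search(rng: random.Random, obstacles: int = OBSTACLES) -> int:
    """Save progress after each cleared obstacle and always resume from there."""
    attempts = 0
    checkpoint = 0  # number of obstacles cleared in the saved state
    while checkpoint < obstacles:
        attempts += 1
        if rng.random() < PASS_CHANCE:
            checkpoint += 1  # remember this state; later failures resume here
    return attempts

print("naive attempts:       ", naive_search(random.Random(0)))
print("checkpointed attempts:", checkpointed_search(random.Random(0)))
```

In this model the naive search needs on the order of 8^n full runs for n obstacles, while the checkpointed search needs roughly 8n attempts, which is the "chopping terms off the product" effect. The real difficulty the speaker points to is not this loop but making arbitrary software deterministic enough that a saved state can be restored exactly.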

This is a community I find fascinating because I think objectively these people are extremely underemployed software engineers. But we spend what appears to be minimally millions of hours of effort distributed over a group of underemployed software engineers. Like, trying to figure out all the ways that a software artifact, be that Mario or the N64 Zelda game or whatever, works in ways that are not anticipated by the global knowledge base about that system.

They fight like hell for tiny, tiny research results. I'm like, oh, there is a skip in level three where if you do this particular frame-perfect button sequence at this particular moment, you get like three milliseconds off the global best-ever speedrun of level three. And the thing that Mario is a proof of work for is, yes, there are a million hours of...

well, there's many, many millions of hours into Mario, but let's say there's a million hours that has directly impacted the scientific literature created by these underemployed software engineers for Mario. Computer knows none of that. Computer just has Mario in front of it. Go. And it will start discovering things that we put a lot of time by very smart people who are extremely, extremely passionate about level three, and it finds novel bugs in level three immediately after you turn it on.

And the sort of mic drop moment when we do a Mario Geist demo, which we've done once or twice, is then I take a ROM hack, right? I take a fan-made Mario map that nobody's ever seen before, and you feed that to the system, and it just goes, and it does the same thing again. And that makes a very important point, which is conventional tests are incredibly brittle and flaky because anything you change about the system can sort of change incidental things about how the test runs,

which can then lead the test to report a false positive or a false negative, which is very, very dangerous and very time-consuming and consumes a lot of effort and maintenance and so on.

Property-based vs conventional testing

When you have property-based tests, though, when you have these more loosely specified assertions that particular invariants should always be met or whatever, those are much less likely to change, because the specification of software, the interface of software, changes a lot more slowly than the implementation does. And so the really cool thing about a system like Antithesis is if you say to me, hey, Will,

my software should never lose data, I can check that property on your current software and on all new versions of your software without doing any new work per new version of your software. And that is kind of a game changer as far as human effort goes. Yep. Also, I think we're experiencing this in different ways in different places and at different rates, but...

We're on the precipice of the rate of change of software development, like the raw number of commits that a particular organization produces, going (exponentially might not even be the right word) much, much higher than our experience to date has been, due to computer systems being able to, not for the first time, but for the first time in the way that we think about this, write their own computer code, basically. And so I've been programming with

Claude Code and Cursor for the last month, and the cool kids got to it about eight months ago. But the majority of capitalism has not yet cottoned on that this is definitely the future. But this is definitely the future. And in that future that we will arrive at very shortly, there are going to be bugs discovered by computer code, which causes remediations written against those bugs, where a human engineer...

will be aware in some vague sense that that remediation is happening. In the same fashion that Congress is aware in some vague sense that the banks are working about money laundering today.

But I don't know that Steve at this particular bank is busting this particular person for money laundering because that would be crazy. And given that this pace of iteration is going to pick up massively, and that we as humans supervising the system want to know that, hey, if we're making, let me pick a number, 10,000 times more changes to the software system on any given Tuesday than we were on Tuesday like 12 months ago.

The future of AI-assisted development

Am I still as robust as I was 12 months ago? And technologies like Antithesis are part of the way in which this huge fleet of currently junior software developers, but maybe they're not junior software developers 12 months from now... How can we know that that huge fleet is not compromising guarantees we make about our systems?

That's exactly right. And I basically think that really good software verification techniques, whether it's testing systems like Antithesis, or whether it's formal verification systems like Amazon's looking into, whatever the answer ends up being. I think that that is actually going to be one of the key enabling technologies of the revolution you're describing, because otherwise you get the problem of lemons.

and everybody gets too scared and paranoid to go all in on this. I think an interesting historical analog that I've been thinking about for a while is the 1990s and the wave of offshoring.

that everybody was told was going to happen and that didn't really end up happening quite the way people thought. And I think there's a lot of reasons why that didn't happen the way people thought. But one reason was when you pay an offshore shop in Argentina for a software artifact, and they hand you something in response, it's actually very difficult to assess the quality of that software artifact without putting in a tremendous investment.

In fact, such a large investment that you could have just written the software yourself. And so what that meant was that there was basically, there was this problem of lemons, right? And it meant that people had a lot of uncertainty and that sort of prevented this industry from really getting its feet under it in the way that people thought it would.

I feel like there's a fork in the road for the AI-assisted coding. There's one world in which it's very transformative and disrupting for prototyping and for web apps and things that don't have critical reliability requirements. And then there's another world where it truly takes over everything and becomes how everything in the software industry gets done always. And I feel like we only can reach that second world if we develop very fast, automated...

no human in the loop, no bottleneck ways of checking that the systems are doing what we think they're doing. And so that's one reason that I'm very excited to be in this business right now. I would broadly agree, although I put my metaphorical and/or literal chips on: we end up in world number two, and probably faster than many people think we've got to. I'm not predicting the end of the engineer by any stretch.

The first labor-saving technology that was supposed to put engineers out of a job was the compiler. And we love our compilers, but still have engineering employment. But I think this will be as transformative to the craft of engineering as the compiler was.

Right. Well, elasticity is a funny thing, right? Like the switch to LED light bulbs from incandescents, this is like one of my favorite facts, hugely increased the power consumption devoted to lighting in the United States because they're so cheap

that you leave your lights on all the time now, which was like enough to overwhelm the like massive efficiency savings. And it's very hard to predict when that will happen. Like I think the same thing happened with weaving, right? Like with the original looms, the first industrial revolution.

One of my favorite examples of this is, up until quite recently, the introduction of the automated teller machine, a machine which, like, it's literally in the name, this is supposed to replace tellers, increased employment of tellers in the United States. Right. I'll drop a graph here for people who don't know this fun factoid. And the mechanism there is with more ATMs, you have more bank branches. Makes sense.

The elasticity is difficult to predict in advance. And the difficulty of predicting in advance is why you have simulated universes that you can just run and then see what bugs pop out. That's right, yeah. But it will be interesting to see what the...

what future the world holds. I will say in my limited experience, and again, I got to this new feature in the last 30 days, but my limited experience, any amount of automated testing that you can have the AI repeatedly run on your behalf to say, like, before you tell me the thing is done, run the test suite, see what broke, and if you find a breakage, try thinking about it before you tell me, the operator, about it. Any amount of that, you get better outputs out of the AI, the loop

runs faster, et cetera, et cetera. And you, the human operator, spend less brain sweat on it obviously writing itself into a corner where it will never succeed in the path that it's going down. And so I have to assume that Antithesis or an Antithesis-like system with tight integration with the AIs that are quickly coming down the pipeline is going to be quite interesting. But we shall see what the future holds.

I feel like I could continue having this conversation about software reliability and alternate universes all day long, but you actually have a company to run. So, Will, where can people find you and Antithesis on the internet?

Wrap

So best place is our website, antithesis.com. We are also active on LinkedIn and on the social media site formerly known as Twitter. Awesome. I'll drop links to all those in the show notes.

Will, thanks very much for coming on the program today. And for the rest of you, thanks very much for listening to Complex Systems, and we'll see you next week. Thank you so much for having me. Thanks for tuning in to this week's episode of Complex Systems. If you have comments, drop me an email or hit me up at patio11 on Twitter. Ratings and reviews are the lifeblood of new podcasts for SEO reasons, and also because they let me know what you like.

Complex Systems is produced by Turpentine, the podcast network behind Econ 102, Riff with Byrne Hobart, Turpentine VC, and more shows for experts by experts in tech.

This transcript was generated by Metacast using AI and may contain inaccuracies.