TechStuff Classic: Bad Computer Bugs

Jul 14, 2023 · 45 min

Episode description

Software bugs range from annoying to catastrophic. It's time to explore some of the most famous flaws in computer history!

See omnystudio.com/listener for privacy information.

Transcript

Speaker 1

Welcome to TechStuff, a production from iHeartRadio. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeartRadio, and how the tech are you? It is time for a TechStuff Classics episode. This episode originally published on December seventh, twenty sixteen. It is called Bad Computer Bugs. Now, was twenty sixteen the year where, like, bedbugs became a big news item?

Like there was a year where bedbugs were just in the news, and I'm wondering if that was twenty sixteen. At this point it goes to show that I haven't actually listened back to this episode, but yeah, Bad Computer Bugs. It originally published December seventh, twenty sixteen. Let's take a listen now, shall we? I have to address a bit of apocryphal history, and regrettably it's a

story that we've repeated on TechStuff. So I'm sad to admit that I was complicit, although unknowingly, in the spread of misinformation, and that all has to do with the origin of the term bug to describe a flaw in programming. So here's the popular story, the one that we have accidentally promoted on TechStuff without knowing that

we were in the wrong. It goes that Grace Hopper, who was an early computer scientist who rose to the rank of Rear Admiral in the US Navy, coined the phrase bug after discovering a moth gumming up Harvard's Mark II calculator, a literal bug. Generally speaking, the story tends to be set in nineteen forty five, and there is even a note in the logbook that reads "first actual case of bug being found" that's attributed to Grace Hopper. But there are several points that are wrong in this story. First,

the year. It didn't happen in nineteen forty five. It happened on September ninth, nineteen forty seven. We know because there's a logbook, and the logbook that marks the incident not only has the note, it actually has the moth taped into the book itself. It's taped onto the page. Second, Grace Hopper wasn't the person to discover the moth or make that log entry. She did tell the story about the moth several times, but it wasn't in the context

of finding it or logging it. She just told the story that, yeah, we really did have a bug in the system. And most importantly, the word bug had already been used to describe design flaws for decades before the Mark II was even designed. In fact, if you look at the logbook, this makes sense. It says "first actual case of bug being found." That sentence doesn't make sense unless you'd already been using the word bug to describe a flaw, because otherwise you wouldn't say "first actual case" of a bug being found.

The wording doesn't make any sense. The context makes no sense. Sadly, there are documented quotes dating back to the nineteenth century using the word bug to mean a design fault, and it could go back even further than that. So it is with much regret that I admit I have unwittingly contributed to a bit of misleading folklore making the rounds. But I'm glad I can take this opportunity to address it.

All right, so let's talk about design bugs, and I'll be covering several goofs, mistakes, flubs, flaws, and outright catastrophes in this episode. But one thing I'm not necessarily going to cover is software vulnerabilities that were later exploited, either by opportunistic hackers or white hats who are just trying to improve system security. Those vulnerabilities are common in many types of software and arise not just through mistakes, but

sometimes simple oversights. And I think it might be more fun to look at some real bugs, like stuff that made things go wrong, stuff that may have rendered a program defunct or otherwise caused headaches. Now I'm going to make an exception to this. I'm going to start off with the Ping of death. And I only mention it because it has an awesome name. Now, this flaw caused

headaches back in nineteen ninety five and ninety six. It was a flaw in IP fragmentation reassembly code, and it became possible to crash lots of different types of computers using different operating systems, although Windows machines were particularly vulnerable, and this particular flaw would make a Windows machine revert to the dreaded blue screen of death. And it all happened

by sending a special ping packet over the Internet. So, for those of you who aren't familiar with what that is, a ping is essentially a simple message that checks for a connection between two computers. You send one ping from a computer to another one and look for a response, so that way you verify there is in fact a connection. You can also tell other things, like how

fast that connection between those two computers is. Now, in this case, you would have to actually design a malformed ping request and send that to a target, and it would bring that target down.
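
To put some rough numbers on that, here's a quick sketch of the arithmetic involved. It isn't from the episode and it isn't exploit code, just an illustration of how IPv4's fragmentation fields can describe a reassembled packet bigger than the 65,535-byte legal maximum, which is what tripped up the reassembly code on vulnerable systems.

```python
# Back-of-the-envelope arithmetic only; this is not exploit code and not from the
# episode. The IPv4 field sizes are standard; the payload figure is illustrative.

IPV4_MAX_PACKET = 65_535        # the 16-bit total-length field caps a legal packet here
FRAGMENT_UNIT = 8               # the fragment offset field counts in 8-byte units
MAX_FRAGMENT_OFFSET = 0x1FFF    # 13-bit offset field, so 8,191 units at most

def reassembled_size(offset_units: int, final_payload_bytes: int) -> int:
    """Size of the reassembled datagram implied by its final fragment."""
    return offset_units * FRAGMENT_UNIT + final_payload_bytes

# A malformed final fragment: maximum offset plus a typical full-size payload.
size = reassembled_size(MAX_FRAGMENT_OFFSET, 1_480)
print(size, size > IPV4_MAX_PACKET)   # 67008 True: bigger than any legal packet,
                                      # which overflowed fixed-size reassembly buffers
```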

That's the only security vulnerability story I really wanted to focus on; the others are all just design flaws. And let's begin with the bug that inspired me to do this episode in the first place: that Spotify bug I mentioned earlier. Ars Technica wrote a piece on it in November twenty sixteen, but the problem seems to date back at least as far as June twenty sixteen, and that's when a few savvy Spotify users noticed some unusual activity on their computers. It took a little bit of detective work, but they discovered that Spotify was apparently generating a huge amount of data on a daily basis, like gigabytes of data per day. And the culprit turned out to be a vacuum process for

a database file containing the string mercury dot dB. Now, the vacuum process is the digital equivalent of vacuum sealing. It's meant to repack data so that it takes up less space on a drive. Now, this involves building a new file to maximize efficiency, which is a good thing generally speaking. The problem was that Spotify's version was making it happen way too frequently, like on the order of

once every few minutes. So that's not generally necessary. You don't need to rebuild a database file every few minutes to make sure it's the most efficient size it can be. So each rebuild represented a relatively small amount of data, but over time it added up, which meant that if you had Spotify on your computer, even if it was just running in the background, it would be generating gigabytes worth of information rewriting this file over and over.
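
For a concrete picture of what that vacuum process is, here's a minimal sketch. It assumes the mercury dot dB file is an SQLite database, which is what the VACUUM terminology suggests; the table schema and the file size below are invented for illustration, and only the file name and the every-few-minutes figure come from the episode.

```python
# Assumptions: mercury.db is an SQLite database (the VACUUM terminology suggests it),
# and the file size and schema here are invented purely for illustration. The point
# is that VACUUM rebuilds the entire file, so calling it every few minutes multiplies
# the writes enormously.
import sqlite3

conn = sqlite3.connect("mercury.db")   # a local stand-in, not Spotify's real file
conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)")
conn.commit()

conn.execute("VACUUM")                 # each call rewrites the whole database file
conn.close()

assumed_db_size_gb = 0.5               # hypothetical size for the client database
vacuums_per_day = (24 * 60) // 5       # roughly "once every few minutes"
print(f"~{assumed_db_size_gb * vacuums_per_day:.0f} GB of writes per day from VACUUM alone")
```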

Now, it wasn't filling up a hard drive. It was just overwriting the same file. If it had been filling up a hard drive, people would have noticed much earlier, and it wouldn't have just been savvy Spotify users, because you would suddenly notice, hey, I can't save anything to

my hard drive because everything's filling up. Instead, again, it was just sort of writing and deleting, and writing and deleting the same file over and over again, and that probably doesn't sound like a big deal, but it is a problem if you're using a solid state drive or SSD. So one of the drawbacks of an SSD is that over time it loses storage capacity. Like you can store

less data on an SSD over time. Now, by over time I generally mean over a great deal of time and a lot of different data being written to it and overwritten. Generally speaking, most of us end up replacing our drives before we get to a point where the loss of

capacity is a real issue. But similar in a way to how a battery can lose its ability to hold a full charge after you've gone through lots of charging and discharging cycles, you know how a battery won't be able to hold as much even if it says it's up to one hundred percent, But one hundred percent doesn't last you as long as it used to. That's because its capacity to hold a full charge has decreased over time.

But let's say you've got a program that's just constantly overwriting data to your drive; you might discover that your SSD's useful lifespan has been drastically reduced. So as I record this episode, Spotify has already rolled out an updated version of its desktop application, and that, by the way, is the only version of Spotify that was affected. If you use web based Spotify or mobile Spotify, you're in

the clear already. If you use a desktop version, as long as you have version one point zero point four two or later, you are fine. But if you did have that earlier version and you just had Spotify running in the background, chances are it was writing to your hard drive like crazy. So what about some of the other big bugs in computer history? Well, some of the real doozies involve our attempts to explore the

final frontier. So we'll be talking about space a few times in this episode, and we'll start with an early US satellite. So first up is a nineteen sixty two blunder involving the Mariner one. So some backstory on this one. We're going to talk a lot about the Soviet Union

in this episode too. It plays a couple of roles as we go on. But in this case, the then USSR had launched Sputnik into orbit in nineteen fifty seven, which really kicked off the space race and was also a big shot in the Cold War, because the Soviet Union was essentially saying, hey, we can launch this into space, we could also launch something at you. In response, the

US did sort of the same thing. They had launched some satellites into space, and the Mariner one was going to be a big, big feather in the cap of the US. The whole idea was to launch a probe that would be a flyby probe and it would go

by Venus. So NASA, which had been formed only a few years earlier, was taking control of this, and the budget for this particular project was eighteen point five million dollars, which, if you were to adjust for inflation, would be almost one hundred and fifty million dollars today. So, a one hundred and fifty million dollar project to launch the Mariner one

and have it fly by Venus. But, as I'm sure you guys have figured out by now based upon the topic of this podcast, not all went according to plan. Not long at all after the rocket left the launch pad, it began to veer off course, and neither the computer controls on the rocket nor manual controls back

at HQ could correct for the problem. The rocket's course was such that it was going to take it over shipping lanes, which meant there could be a potential catastrophe, and so a range safety officer made the difficult call and issued the command to blow the whole thing up just shy of three hundred seconds after it launched. So what happened? Why did it go off course in the first place? Well, there was a flaw in the spacecraft's guidance software which diverted the rocket, and no amount

of commands from ground control could correct for it. After a lengthy investigation, NASA discovered the error was the result of a mistake transcribing handwritten notes into computer code. So someone just took some handwritten notes and misinterpreted one of them, and that one mistake was enough to crash the rocket

or to necessitate it being destroyed. The great science fiction author Arthur C. Clarke wrote that the Mariner one was wrecked by the most expensive hyphen in history, which isn't quite right, but it's pretty funny. I mean, come on, it's a humorous phrase. So the actual punctuation mark that caused the problem was not technically a hyphen. It was a superscript bar. Superscript bars, by the way, are not a place

where playwrights hang out to get tore up. A superscript bar is just a horizontal bar that sits above some other symbol. In this case, it sat above a radius symbol, or rather the time derivative of the radius, and the bar indicated a smoothing function, which means the formula was meant to calculate smoothed values of the time derivative of a radius. Now, without the smoothing function, tiny deviations in course sent commands to the rocket's thrusters to kick in

big time to overcorrect for that problem. As an analogy, imagine you're driving a vehicle and you see a pothole in the road and you're approaching it, and instead of gently steering out of the way, you wrench the wheel really hard to the left or to the right in order to try and get around this pothole. That's kind of what was happening with the rocket. It didn't have the smoothing function, and so as a result, it was having these wild deviations in course. So it wasn't a hyphen that caused the problem, but it was close enough.
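
Here's a toy sketch of what that missing bar meant in practice. It isn't the actual Mariner guidance code and the numbers are invented; it just shows a smoothed rate staying calm while the raw, noisy rate yanks the steering around.

```python
# A toy, not the actual Mariner guidance code, and the numbers are invented. The point
# is the difference between steering off the raw, noisy rate and steering off the
# smoothed (barred) rate the equations actually called for.
import random

random.seed(1)
true_rate = 0.0                                                  # the vehicle is actually on course
readings = [true_rate + random.gauss(0, 5) for _ in range(10)]   # noisy radar-rate data

def smoothed(values, alpha=0.2):
    """Simple exponential smoothing, standing in for the overbar in the equations."""
    s = values[0]
    for v in values[1:]:
        s = alpha * v + (1 - alpha) * s
    return s

gain = 2.0                                                       # made-up steering gain
print("correction from the raw rate:     ", round(gain * readings[-1], 2))       # chases every glitch
print("correction from the smoothed rate:", round(gain * smoothed(readings), 2))  # much steadier
```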

Our next space story takes place in nineteen ninety six with the European Space Agency's Ariane 5 flight five oh one rocket. Now, this rocket was to launch into space on June fourth, nineteen ninety six, and instead the rocket disintegrated about forty seconds after taking off. So what the heck happened? Well, it largely had to do with the ESA reusing old work.

This actually becomes a theme in this episode. One of the morals of this entire podcast is, if you're designing something, a successor to an earlier product, and you want to reuse some of the features that you created in your previous product, test the heck out of it in its new form factor, because it could be that things that worked perfectly fine in the earlier model will

go awry in the new one. That's what happened here. So, as you might guess from the name, the Ariane 5 marked the fifth generation of launch vehicles under that name. The Ariane 4's inertial reference system would convert sixty four bit floating point numbers into a sixteen bit signed integer, and it worked just fine. But the Ariane 5's stats were beefier than its predecessor's, with faster engines, and that

was where the problem really started. The engine output meant those sixty four bit floating point numbers were significantly larger than the ones generated by the engines on the Ariane 4. They didn't anticipate this, so during the conversion process there was actually data overflow, and that overflow caused both the backup computer and the primary computer aboard the Ariane 5 to crash, and they crashed in that order. The backup computer crashed first, followed by the primary computer a

couple of seconds later. The whole thing took less than a minute to go from launch to disintegration. Oops.
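
Here's a small sketch of that failure mode. The real software was written in Ada and inherited from the Ariane 4 program; this just shows the general idea of a 64-bit floating point value being forced into a 16-bit signed integer, with the values chosen for illustration.

```python
# Just the shape of the failure; the real software was Ada inherited from Ariane 4,
# and the specific values here are invented.

INT16_MIN, INT16_MAX = -32_768, 32_767

def to_int16(value: float) -> int:
    """Convert a 64-bit float to a 16-bit signed integer, as the inertial code did."""
    converted = int(value)
    if not INT16_MIN <= converted <= INT16_MAX:
        # This case was left unprotected because, on the older rocket, the value could
        # never get that large; on flight 501 the unhandled overflow took down both
        # inertial reference computers.
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return converted

print(to_int16(20_000.0))   # the kind of value the old rocket produced: fine
print(to_int16(64_000.0))   # the Ariane 5 case: raises OverflowError
```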

Now, we're going to stick with space, but jump forward to nineteen ninety eight and the Mars Climate Orbiter. This was an unfortunate problem. So this particular spacecraft was meant to study Mars's climate, atmosphere, and surface changes, and it was also supposed to be a kind of relay station for landers that would explore Mars's surface, but none of that would last because of some pretty significant goofs. So on September twenty third, nineteen ninety nine, the orbiter passed into the upper atmosphere of Mars and did so at a pretty low altitude. And this is what folks in the space industry call a bad thing. The drag on the spacecraft was significant; it began to fall apart and it was destroyed in Mars's atmosphere. That's what happened. So the software guiding the orbiter was to blame, and it's

a dumb, dumb mistake. It was supposed to make adjustments to the orbiter's flight in SI units, specifically in newton seconds. That's what the contract between Lockheed and NASA said: newton seconds, use SI units for all of your calculations. But the software instead made calculations in non SI units, namely pound seconds. So Lockheed software gave information to NASA's

systems using the wrong units of measure. NASA's systems then took that information, assuming it was in the right units of measure, and executed commands based upon it. So this is why, if you're ever in a math course and the teacher makes you stop in the middle of writing a problem on the board and says, where are your units?
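
Here's the unit mix-up as a tiny worked example. The conversion factor is a physical constant, roughly 4.45 newton seconds per pound-force second, and the impulse figure itself is made up for illustration.

```python
# The conversion factor is a physical constant; the impulse value itself is made up.
LBF_S_TO_N_S = 4.448222             # one pound-force second in newton seconds

reported_impulse = 10.0             # hypothetical figure delivered in pound-force seconds
actual_impulse_si = reported_impulse * LBF_S_TO_N_S

print(f"ground software read the figure as {reported_impulse} newton seconds")
print(f"the thrusters actually delivered {actual_impulse_si:.1f} newton seconds,")
print(f"about {LBF_S_TO_N_S:.2f} times more than the trajectory model assumed")
```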

This is why you have to make sure you're using the right units, because if you're saying a number and you don't associate a unit with it, someone could make an incorrect decision based on that, and it could be disastrous, as it was in the case of this orbiter. The thrusters fired at four point four five times the power they were supposed to, and the orbiter didn't stand a chance. And this was a pretty expensive mistake. That mission's cost came in at three hundred and twenty seven

point six million dollars. But on the bright side, with all of these stories, at least no human lives were ever in real danger as a result of the mistake. We're gonna take a quick break, and when we come back, we'll talk more about bad computer bugs. All right, now let's make a switch to AT&T, which is a company that had a pretty big problem with switches once upon a time. I'm talking about an issue that

popped up on January fifteenth, nineteen ninety. That's when AT&T long distance customers discovered they were unable to make any long distance calls. Why could they no longer reach anybody? Well, AT&T's long distance switches, which control that and allow for the actual connections to be made, were on the fritz. They were trying to reboot over and over again. They were just stuck in

a reboot cycle. Now, initially the company thought it was being hacked, but like I said at the top of the show, I'm not covering stories about hackers here. I'm talking about big design flaws that caused problems. So they weren't getting hacked. That's not what was going on with those one hundred and fourteen long distance switches. No, there was a design problem at fault. So what had happened was AT&T had rolled out an update to the code that managed the switches, and it was meant

to increase the efficiency. It was meant to speed things up. But the problem was it sped things up so much that the system got caught up in itself. It gets pretty technical, but I can give you kind of an overview of what the problem was. All right, so each switch had a function that allowed it to alert the next switch down the line if things were starting to get hairy. So imagine that switch number one is handling traffic,

but it's getting really close to capacity. So it sends a message over to switch number two and says, I can't take on any more work, because if I do, I'll be overloaded. Switch number two then says, no problem, I'll take on any oncoming work for you and we'll handle it from there. And if switch number two were to get into the same sort of situation, it would say the same thing to switch number three, and so on

and so forth. Now, eventually each switch will contact the one below it and say, hey, how are you doing there? And if the answer is okay, then everything switches back and you go back to normal operation. That's how it's supposed to work. But AT&T's updated code sped things up so much it caused some real issues, and there was some poor timing, just coincidental timing, that

made things worse. So switch number one starts to get overwhelmed and sends a message over to switch number two. But switch number two was just in the middle of resetting itself, so switch number two goes into reset mode, puts up its do not disturb sign, and sends a message over to switch number three. That prompted switch number three to overload and put up a do not disturb sign, and pass that down to switch number four. This whole thing goes down

the entire line of one hundred and fourteen switches. They all end up getting overloaded as a result of this, and all go into reset mode and they get stuck there. That problem lasted for nine hours before AT&T was finally able to address the message load on the entire system and get the switches back to normal.
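
If it helps to picture the cascade, here's a toy model of it. This is emphatically not AT&T's actual switching code; it just plays out the chain reaction described above, with the unlucky timing baked in as an assumption.

```python
# A toy model, not AT&T's actual switching code. The unlucky timing is baked in as an
# assumption: every switch happens to be mid-reset when its neighbor's message lands,
# which is the condition the flawed update couldn't handle.

NUM_SWITCHES = 114
mid_reset = [True] * NUM_SWITCHES   # assume the worst-case timing everywhere...
mid_reset[0] = False                # ...except the switch that kicks things off
stuck = []

def announce(sender: int) -> None:
    """`sender` has just reset and announces its status to the next switch in line."""
    receiver = sender + 1
    if receiver >= NUM_SWITCHES:
        return
    if mid_reset[receiver]:
        # The flaw: a message landing mid-reset corrupts the receiver, which then
        # resets again and fires off its own announcement to its neighbor.
        stuck.append(receiver)
        announce(receiver)

announce(0)   # switch 0 finishes a routine reset and tells switch 1 about it
print(f"{len(stuck)} of {NUM_SWITCHES} switches got dragged into the reset loop")
```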

The estimated cost of lost revenue for that time was about sixty million dollars in long distance calls, and there were a lot of angry customers to boot, so to placate them, AT&T offered reduced long distance rates on Valentine's Day. Pretty ugly, but AT&T tried to handle it, at least in a way that didn't turn it into a PR nightmare. Not so with Intel. That's what brings us to the Pentium problem. I don't know if you guys remember when Pentium processors first came out,

but they were a big deal. It was a redesign of the architecture of the microprocessor and it was meant to really speed things up. Well, Intel had a massive nightmare in nineteen ninety four thanks to a flaw in the entire first generation of Pentium processors. Now, when you break it all down, a CPU is all about performing mathematical operations on data, so it's kind of important that

it does this correctly. Unfortunately, the flaw in the Pentium processors kind of messed that up, and the issue has to do with floating point division. So the predecessor to the Pentium, the four eighty six, used a shift and subtract algorithm for floating point division, which was effective but relatively slow compared to what Intel thought they could do

by totally redesigning that structure and using a lookup table approach. Now, the table was supposed to have one thousand and sixty six entries programmed directly onto the logic array of the Pentium processor, but for some reason only one thousand and sixty one entries made it. Five entries went missing and essentially returned an answer of zero instead of what they were supposed to say, so if a calculation accessed one of those missing cells, it got zero, even though that's

not the correct answer. All the first generation Pentiums went out with this error because it was so minor that it wasn't even picked up by Intel's quality control at the time. Now, calculations worked just fine up to the eighth decimal place. Beyond that things got messy, but for most folks that wasn't a problem because they weren't doing mathematical calculations that needed that level of precision. It just

wasn't a thing. In fact, there was only a one in three hundred and sixty billion chance that this error would cause a big enough problem to reach up to the fourth decimal place. So most calculations that were simple were bulletproof. You were fine. But if you needed that precision, if you needed that really fine degree, that's when you would encounter the flaw.
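
The division that made the flaw famous is easy to try. The two test values below are the ones that circulated widely in late nineteen ninety four, and the flawed result is the figure reported for the buggy chips at the time; any working FPU, including whatever runs this snippet, produces the correct one.

```python
# The widely circulated test case from late 1994. The "correct" value is what any
# working FPU (or Python, which uses 64-bit floats) produces; the flawed figure is
# the result reported for first generation Pentiums, where the division hits one of
# the five missing lookup table entries.
x, y = 4_195_835, 3_145_727

correct = x / y
flawed_pentium = 1.333739068902037589   # the widely reported buggy result

print(f"correct result: {correct:.12f}")       # 1.333820449136...
print(f"flawed Pentium: {flawed_pentium:.12f}")
print(f"they diverge in the 4th decimal place: difference = {correct - flawed_pentium:.6f}")
```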

And that happened because there are math professors in this world, and one of them, Thomas Nicely, discovered in October nineteen ninety four that he was getting errors because of this issue. He needed the processor to work correctly, and so he contacted Intel about the problem. And this is where we take a moment to acknowledge there's a right way and a wrong way to handle an issue that's your fault. Intel decided to go the

wrong way. My opinion is, if you make a mistake, it's usually a good idea to just own up to it and try to make it better. But Intel's response was more along the lines of, yeah, we didn't think it was a big deal. And then Intel made other PR blunders, because people began to hear, hey, that Pentium processor in your computer that you just bought

doesn't work properly. So people wanted to get replacements, but Intel said, oh, we're only going to replace the ones if you can prove that the mistakes that it makes affect you in some meaningful way. So they weren't denying that there was a problem. They were just saying, hey, unless you can prove the problem affects you, we don't care.

That didn't go well. If you create a product and you market it as the future of computing, and then it's discovered there's a flaw in the design, and then you say we'll replace it, but only if you prove you deserve it, it doesn't tend to make your customer base very happy. So ultimately Intel reversed that decision and offered to replace the processor for anyone who wanted it who had a first generation Pentium, and that mistake ended

up costing the company four hundred and seventy five million dollars. Yikes. All right, now we're gonna switch gears over to Microsoft. First, I think you could claim that all of Microsoft Bob, the nineteen ninety five product that was supposed to be an easy, accessible computer interface, was really just a massive software bug. I mean, it introduced Comic Sans, for goodness sakes. The cluttered organization system, the lack of meaningful security, and

numerous other issues plagued that software. But we did an entire episode of TechStuff about Microsoft Bob a couple of years ago, so I'm not going to dwell on it anymore, but if you want to hear more about it, go find that episode. It was a fun one. Now, in two thousand and seven, Microsoft experienced a massive headache when a bug on their servers notified thousands of Windows customers that they were filthy, dirty software pirates and they

should be punished. These included people who actually had legitimate, legally purchased copies of Windows XP or Vista. So the problem here was Microsoft had an initiative called Windows Genuine Advantage, and it was a nice name for a strategy meant to curtail operating system piracy. Essentially, it was a component in Windows that would allow Microsoft to figure out if the copy of Windows on any given computer was legit.

In other words, it was a DRM strategy. But in two thousand and seven, a buggy install of software on a server misidentified thousands of legitimate, law abiding customers as pirates. For nineteen hours, the software just laid down the law, and so people began to receive sternly written warnings about

their choice to indulge in bad behavior. And if you were a Windows Vista customer, you had it the worst, not just because you were using Windows Vista, which I think we all agree was not one of the bright points in Microsoft's operating system history, but also because Microsoft had built in the ability for Windows Genuine Advantage to switch off certain operating system features in Windows Vista if it determined that the copy someone was using was a

pirated version. So it was misidentifying real versions as pirated ones, turning off features. And these were people who had bought legitimate copies. This, by the way, is one of the big arguments people have against DRM. It has the tendency to punish legitimate customers. And you feel like you're stupid for buying a copy of a piece of software rather than just stealing one that has had those

features or those defenses removed. Like, why are you creating more incentives for people to go outside and get a pirated copy? All right, so imagine you've purchased this legitimate copy of Windows Vista. First of all, you already feel bad. Then you're told you're a thief, so you feel worse. Then someone remotely switches off several features of your operating system. That was not a great PR message, so that was a real issue. They did eventually fix it after that

nineteen hours, but by then people were already very upset. Also, I don't want to just, you know, pile lots of abuse onto Microsoft. I gotta talk about Apple here too. So the company prides itself on a high standard of quality, and in general it's pretty good about living up to that standard of quality, depending upon your point of view of their various products. But that hasn't stopped a few clunkers getting through and into the public's hands. And that

was the case in twenty twelve with Apple Maps. If you owned an iPhone back in twenty twelve when Apple Maps came out, you may remember this problem. It was pretty well publicized. Maps were inaccurate, sometimes leaving out important details, like, you know, a river or a lake between you and your destination, things that might be important if, I don't know, you don't drive an amphibious vehicle. It might not have a road on there that's important, or it might misidentify the location

of a historical landmark. For instance, it thought the Washington Monument was across the street from where it is, but nope, it's just where we left it. Despite all of Roland Emmerich's best attempts to move it or destroy it, it's still there. The real problem here was that the Apple software just wasn't ready for public unveiling. It needed a lot more testing. It was trying to play catch up to Google Maps, but Google had the advantage of working with companies that

had been doing mapping software for years. Google acquired those companies and acquired the expertise of people who had been working on that software, and Apple was really just trying to create its own version and get it out as fast as it could. But it got out a little too early, and the company spent the next several months tweaking Maps and trying to keep control of the situation. But by that time, many of Apple's fans, even the most devoted ones, had kind of given up and switched over to Google

Maps instead. All right, we've got another break ahead of us, but don't worry. Once that's done, we're back to conclude our discussion about bad computer bugs. Now I'm going to transition into some serious bugs. These are ones that either threatened the lives of people or contributed to people dying. The ones I've talked about up to now have cost companies millions of dollars, but no one's life was truly threatened. Unfortunately, that's not the case with all

software bugs. Now, a couple of bugs had the potential to kill millions of people. One of those happened in nineteen eighty, a famous bug, or at least a faulty circuit, and that was a faulty circuit in NORAD's computer system, which caused it to mistakenly conclude the US was under nuclear attack from the Soviet Union. So displays on NORAD systems showed seemingly random attacks, and they didn't correspond with each other. So the display might show, hey,

there are two missiles heading over from the Soviet Union. No, there are two hundred. No, there are fifty. No, there are three. And it wasn't consistent, and command posts around the US all had conflicting information, which led leaders to conclude the whole thing was a regrettable computer error, and they were right

to do so. To be fair, they were kind of prepared for this because there was another incident that had actually happened in nineteen seventy nine that was way scarier, and in that case, someone mistakenly inserted a training scenario into the computer system that made it seem like the Soviet Union had launched an all out nuclear attack on the US. But that wasn't a bug. That was a mistake on the part of a human who had accidentally

uploaded the wrong, or rather executed the wrong, command. It didn't have anything to do with a flaw in the computer system itself. However, because that thing happened and everybody was freaked out and then was able to determine that, in fact, it was a false alarm, it meant that calmer heads could prevail in the nineteen eighty incident. So the Soviets also had a close call just a few

years later. It was a bug in the early warning detection software that the USSR was using in the early eighties, and on September twenty sixth, nineteen eighty three, the Soviet Union received an alert that the US had launched a nuclear attack in the form of five nuclear warheads, technically two

different attacks. The first would have been a single nuclear warhead and the second was four nuclear warheads, and this was during a particularly stressful period in the history of both countries and their relationship with each other, at the height of the Cold War in nineteen eighty three. Now, fortunately, Soviet Air Defense Forces Lieutenant Colonel Stanislav Petrov suspected that this report was an error and that there was some sort of bug in the software or a mistake in

the reporting system that caused this. He gave a command to hold off on any sort of retaliatory strike, which would have initiated a full scale nuclear war had it happened. Petrov was the officer in charge at a bunker that served as the command center for this early warning system, and he had said afterward that his reckoning was any real attack would consist of hundreds of warheads, not five.

No one would start an attack with just five warheads, so it was more likely to be an error than a genuine attack. So he gave the command to wait until the reported missiles would pass into the range of radar, which only extended as far as the horizon, so if it had in fact been a real attack, it would have potentially limited the Soviet Union's ability to respond. But

no missile showed up, and he was vindicated in his decision. Now, the cause of the false alarm in this case was a combination of factors that the designers didn't anticipate, which largely consisted of sunlight hitting high altitude clouds at a particular angle from a particular perspective of the satellites. So

the satellites misidentified that reflection as a warhead. Now, the Soviets were able to address this error in the future by adding another step in which these satellites would cross-reference data from other geostationary satellites to make certain that they were identifying actual rockets as opposed to high altitude clouds. Now, there are several cases of software bugs leading to actual deaths. For example, the Therac

twenty five was such a case. Now, that was a radiation therapy machine that could deliver two different modes of radiation treatment. The first was a low-powered direct electron beam and the second was a megavolt X-ray beam. Now, the X-ray beam was far more intense and it required physicians to provide shielding to patients to limit exposure to the beam. But the Therac twenty five had inherited its code from its predecessor, which had different hardware constraints.

Now, the new machine meant that these constraints weren't there, and it created a deadly problem. If operators changed the machine's mode too quickly from one to the other, it would actually send two sets of instructions to the processor, one for each mode of operation, and whichever set of instructions reached the processor first, that's what the machine would switch to. So let's say you've been operating the Therac twenty five in the megavolt X-ray mode, but

now you're going to have a patient come in. You need to administer radiation therapy, so you want to switch it to the low-powered electron beam. You switch it too quickly, it sends two sets of instructions to the processor, and the one that arrives first is the megavolt X-ray instruction. So instead of switching, the machine stays on the more intense, deadlier radiation.
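
Here's a loose illustration of that kind of race, written as a small thread experiment. The real Therac twenty five ran custom PDP-11 software and the actual race was more involved, so treat this purely as a sketch of the shape of the bug.

```python
# A loose illustration of that kind of race, not the real Therac code (which was
# custom PDP-11 software with a more involved race). The idea: a mode edit that
# arrives while the slow setup task is already running gets silently ignored.
import threading
import time

beam_mode = "xray"              # machine is currently set for the megavolt X-ray mode
applied_mode = None

def apply_settings():
    """Slow hardware setup task: snapshots the mode, then takes a while to apply it."""
    global applied_mode
    snapshot = beam_mode        # reads the requested mode once, at the start
    time.sleep(0.5)             # magnets and other hardware take time to move
    applied_mode = snapshot     # a mode edited during that window is silently lost

setup = threading.Thread(target=apply_settings)
setup.start()
time.sleep(0.1)
beam_mode = "electron"          # the operator quickly switches to the low-power mode...
setup.join()

print("operator asked for:      ", beam_mode)      # electron
print("machine actually applied:", applied_mode)   # xray: the late edit was lost
```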

The tragic news is this did happen several times. Six patients were documented as receiving massive overdoses from Therac twenty five machines between nineteen eighty five and nineteen eighty seven, several of them fatal, and while the machine would send error messages when these conditions were present, the documentation for the machine didn't explain what the errors meant. It didn't say, hey, if you get this error, it means that you've switched modes too quickly

and you need to address this. So, since operators weren't told that this was necessarily a hazardous condition, they would just clear the error and proceed, and there were deadly results. In a similar vein in Panama City, Panama, there was an incident involving a Cobalt sixty system, actually several incidents involving this Cobalt sixty system that was running therapy planning

software made by a company called Multidata Systems International. Now, the software's purpose was to calculate the amount of radiation

that cancer patients should receive in radiation therapy sessions. Now, during these radiation therapy sessions, the therapists were meant to place metal shields on the patient to protect healthy tissue from radiation damage, and the software would let the therapists draw where those shields were placed on the patient. But they could only draw up to four shields, and the doctors in Panama wanted to use five shields for particular

therapy sessions. They were overloaded, they had a long waiting list of patients, and they were trying to make things more efficient, and they discovered that they could kind of work around this limitation of four shields by drawing a design on the computer screen as if they were using just one large shield that has a hole in the

middle of it. And so what they would do is they would arrange the five shields to essentially be in the same sort of shape, with the middle of it being open, so that they could have the radiation therapy

pass through it. But they didn't realize that the software had a bug in it, and that bug was, if you drew the hole in one direction, you got the correct dose of radiation, but if you drew it in the other direction, so like clockwise versus counterclockwise, the software would recommend a dosage twice as strong as what was needed. And the result was devastating. Eight patients died as a result of this, and another twenty received doses high enough

to potentially cause health problems. Later on, the physicians were actually arrested and brought up on murder charges because they were supposed to double check all calculations by hand to ensure that they were going to give the proper dose of radiation treatment. So while the software was calculating the incorrect dose, the physicians were responsible for making sure that any dose that was calculated was in fact the correct one, and they failed to do so, or at least that

was the charge. There are also bugs that involve military applications that have resulted in the loss of life. During the Persian Gulf War, an Iraqi-fired Scud missile hit a US base in Saudi Arabia and it killed twenty eight soldiers. Now, the base had detected the missile and had launched and fired a Patriot missile in return.

The purpose of the Patriot missile was to intercept and destroy incoming missiles, and the way a Patriot missile did this was to use radar pulses to guide trajectory calculations so that it would end up getting close to the incoming missile. This is harder than it sounds, because both missiles are moving very, very quickly, so it needed very precise information in order to adjust its trajectory properly and make sure it was on target. Now, once it gets

within range, which is between five and ten meters, I think, it would then fire out one thousand pellets from the Patriot missile at high velocity, with the goal of causing the incoming warhead to explode prematurely. In this case, the Patriot missile missed, and the military investigated the issue in the wake of the loss of life and found a problem with the software guiding the Patriot missile. And it was a problem that actually the military kind of

knew about already. So one of the processes in the Patriot's programming was to convert time into floating point numbers for increased accuracy, but not all the subroutines that depended on tracking time did this. Some of them kept time in raw clock units rather than floating point numbers, which meant that they would get out of sync after a while. There'd be a disagreement in various subroutines as to how much time had

actually passed. And like I said, the military was aware of this issue and they had a workaround which was not ideal. The workaround was you would occasionally reboot the system, which would reset the clocks and synchronize them, but over time they would fall out of sync because they're not tracking time the same way. And since there was no hard and fast rule as to how frequently you'd reset the system, problems like this one were possible, and in fact,

in this case it did happen. So prior to this particular incident, that specific Patriot system had been running for one hundred hours without a reboot, and the clock disagreement amounted to about one third of a second. Now, that seems like it's no time at all. One third of a second is so short. But a Scud missile's top speed is about one point one miles per second, or one point seven kilometers per second, which means if you take a third of a second, the missile could travel more than five hundred meters. And since a Patriot needs to be within ten meters of a target to destroy it, that resulted in a catastrophic failure.
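
The arithmetic behind that third of a second is worth seeing. This follows the commonly cited post-mortem analysis of the incident: the system counted time in tenths of a second and converted using a value of one tenth truncated to fit a 24-bit fixed point register, leaving the stored constant about 0.000000095 too small per tick. That register detail comes from the published analyses, not from the episode.

```python
# Following the commonly cited post-mortem analysis (the 24-bit register detail comes
# from those published reports, not the episode): the stored value of one tenth is
# slightly too small, and the error accumulates with every tick of the clock.
import math

stored_tenth = math.floor(0.1 * 2**23) / 2**23   # 1/10 truncated to fit the fixed-point register
error_per_tick = 0.1 - stored_tenth              # about 0.000000095 seconds per tick

ticks = 100 * 3600 * 10                          # tenths of a second in 100 hours of uptime
clock_drift = error_per_tick * ticks             # roughly a third of a second

scud_speed_m_per_s = 1_676                       # about 1.7 kilometers per second
print(f"accumulated drift after 100 hours: {clock_drift:.2f} s")
print(f"tracking error at Scud speed: {clock_drift * scud_speed_m_per_s:.0f} m")
```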

So software bugs can be a matter of life or death. It's not all just, hey, this irritating thing meant people couldn't make long distance phone calls, or this issue caused my computer to start writing massive amounts of data to its hard drive.

And this is why it's so important to have really qualified QA personnel go through code and make sure it's doing what it's supposed to do, because the problems that can arise can be non trivial and in fact life or death situations depending upon the application of technology. So technology is a fascinating thing. It's a wonderful thing. It has benefited us in ways that I can't even begin

to describe. It's just too broad a topic, and it's something I've been tackling for, you know, eight years, and I haven't even gotten close to getting toward a finishing point. So I don't want to suggest that technology is bad, but we definitely have the need to check, double-check, and triple-check all this work to make certain things are working properly before we release them out into the wild.

That particularly applies if, again, you are reusing old code or old components in a new way, because you have to make absolutely certain that there's not going to be some unintended problem that results when a new form factor is using old code. Alrighty, And that was our Bad Computer Bugs episode, which I still haven't listened back to yet. So maybe I talked about bedbugs. I probably talked about

Grace Hopper in that episode. I just did an episode about Grace Hopper earlier this year for Memorial Day, remembering Grace Hopper and her contributions, because of the somewhat apocryphal story that she coined the term computer bug, but I have a feeling I might have addressed that in this episode too. Hopefully I did. If not, make sure you

find that episode about Grace Hopper. She was a truly phenomenal innovator and computer scientist, someone who is extremely important in the history of computer science, especially here in the United States. Check that out, and I hope you enjoyed this classic episode. I hope you're all well, and I'll talk to you again really soon. TechStuff is an iHeartRadio production. For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, or wherever you listen to your favorite shows.
