Patriot missile software failure, 28 soldiers died. Fix: reboot the system

patio11 · on Sept 6, 2010

This was covered in our C classes in college, and it is probably more interesting for programmers here if you understand what the bug actually was.

The "software error" Wiki alludes to is that the Patriot missile kept track of its internal clock with floating point numbers. When the machine had been booted in the recent past, such as every time in testing, the floating point number spent most of its precision to the right of the decimal point. This let it able to do the designed behavior, which was calculate very small delta(time) to be able to do velocity/position calculations and get fairly close to fast moving objects then go boom.

The problem is that floating point numbers have a limited amount of precision available to them, and if you are using a few billion milliseconds (2 weeks), almost all of your precision is lost to the left of the decimal point (and, given that this is precision-intensive work, you didn't need to wait that long to see anomalies).

Lower precision meant that taking delta(time) got increasingly less precise as time went on. Which meant that velocity/position calculations got progressively more screwed up. Which meant the missile did not go boom in the general vicinity of incoming missiles. Which killed Americans and allies.

Thus the moral of the lecture: a) your computer is a powerful, tricksy beast which has many ways to trap you in even straightforward code and b) you should treat software quality like some 19 year old's life depends on it, because it might.

davidw · on Sept 6, 2010

> you should treat software quality like some 19 year old's life depends on it, because it might.

You should treat it like that if someone's life does depend on it and you have the resources to develop accordingly.

If you're developing something like, say, bingo software, you're probably better off devoting time to improving the product or marketing it, rather than working on it being 100% bug free.

It's all a tradeoff - time spent on eliminating every last little bug is time not spent on adding features or making it faster or marketing it or whatever.

Edit somewhat less clear-cut cases might be bits of software that you release publicly, and subsequently get used for life-critical systems. However, in that case, the onus is on those adapting the code for use in that environment to provide the testing/review/etc... rather than blaming the upstream developer.

extension · on Sept 6, 2010

That approach works if you know exactly what applications your code will ever be used for. If you are writing a library or a compiler or anything that will potentially be reused by unknown 3rd parties, then you can't be sure just how critically it will be put to use. There is no certification to distinguish software that lives can depend on.

That doesn't mean it's your fault if someone uses your free XML parser in an amusement park ride and your bug makes it fly off the rails. But I still wouldn't feel very good about it and would like to do everything I can to avoid it.

Also, when it comes time for you to write life or death code, it would be good if you already knew how to meet the required quality standard.

Dove · on Sept 6, 2010

There is no certification to distinguish software that lives can depend on.

Actually, there is. Google up "safety-critical software". And maybe "trusted software".

I work in military avionics; every dang line of code in the product, including any libraries we use, is vetted to death. If I were to try to just download a library off the internet and include it in the flight control software, I (A) wouldn't get away with it (B) would probably lose my job and (C) would confuse the hell out of my coworkers who all know I know better than that.

So don't worry someone will include your hastily-developed XML parser in safety-critical software without your knowledge. They won't unless you're willing to prove you've certified it to the level they require. And I promise you, that's not an exercise you'll forget having gone through. ;)

ssp · on Sept 6, 2010

But then, how on earth did they manage to get this software certified if it did something as DailyWTF-worthy as using floating point for a clock?

Dove · on Sept 6, 2010

Yeah. You should see some of the other stuff that makes it into production code.

I said there was a process. I didn't say the result was good software. ;)

gnaffle · on Sept 7, 2010

According to http://www.cs.unc.edu/~dm/UNC/COMP205/LECTURES/ERROR/lec23/n..., the system stored the time in integers, but it was converted to floating point when doing the conversions. This conversion contained a small error that accumulated over time.

Keep in mind that things that seem very WTF to you, might seem more plausible when given more details about the subject.

tetha · on Sept 7, 2010

Well I have to give them that this is kind of a curious bug. I mean, you have to wait over 4 days until it triggers. Probably, this just worked whenever it was booted and tested, and booted and certified, because probably the certification did not involve ignoring the thing for four full days.

phaedrus · on Sept 7, 2010

According to a book I read about game testing it's common for commercial games to be run through a test of simply leaving the game on for hours or days to see if there are bugs that only show up in this way.

regularfry · on Sept 7, 2010

In part because the user manual included instructions to regularly reboot before precision became a problem. The users did not, because they didn't want to risk being in the middle of a reboot when a target went overhead.

fshaun · on Sept 7, 2010

Collection of bits are put to truly horrible uses in the name of efficiency.

extension · on Sept 6, 2010

My mistake, but this kind of methodology seems to be specific to a few industries like aviation and space, while being conspicuously absent from e.g. medicine. Ideally, the software industry would have its own life-critical standard that was applied across all domains, not that I'm suggesting such a standard is necessarily feasible right now.

Until we have such a universal standard, it is entirely possible for a bug in your free library to indirectly kill someone, in the course of everyday best practices.

slavak · on Sept 7, 2010

I have an incredibly hard time believing the medical industry does not have stringent standards in place for safety-critical software. Do you have anything to demonstrate this, and what country are we talking about here? (To remind me not to ever see a doctor there... :) )

extension · on Sept 8, 2010

http://www.nytimes.com/2010/01/24/health/24radiation.html

farmerbuzz · on Sept 8, 2010

In the US, medical devices are regulated by the FDA. When it comes to using 3rd party software, the company manufacturing the device has the responsibility to ensure any 3rd party components function correctly. Its unrealistic to impose regulations on every piece of software that is written "just in case". It is far more practical to put the responsibility on the company doing the integration and selling the device.

Here is the guidance from the FDA on off-the-shelf software: http://www.fda.gov/MedicalDevices/DeviceRegulationandGuidanc...

davidw · on Sept 6, 2010

You seem to have written this after my edit, which addresses that point. I'd feel bad too, but I'd feel bad if I had to hire a team of coders to review every line of my open source projects and document any change:

> "At the on-board shuttle group, about one-third of the process of writing software happens before anyone writes a line of code. NASA and the Lockheed Martin group agree in the most minute detail about everything the new code is supposed to do -- and they commit that understanding to paper, with the kind of specificity and precision usually found in blueprints. Nothing in the specs is changed without agreement and understanding from both sides. And no coder changes a single line of code without specs carefully outlining the change. Take the upgrade of the software to permit the shuttle to navigate with Global Positioning Satellites, a change that involves just 1.5% of the program, or 6,366 lines of code. The specs for that one change run 2,500 pages, a volume thicker than a phone book. The specs for the current program fill 30 volumes and run 40,000 pages."

From: http://www.fastcompany.com/node/28121/print

If there's no certification for ready-made life critical components, then that means the burden of reviewing, checking and verifying everything in the system is on whoever wants to use it in a life-critical environment.

kiujygtyujik · on Sept 6, 2010

This did meet the requirements for what it would be used for.

Intended to be based in Germany in the 60s facing the Russians it wouldn't be powered up for 600 hours, because in 60 hours you would have been overrun.

ImperatorLunae · on Sept 6, 2010

There's an old Tom Lehrer song, "So Long, Mom," about just that: http://www.youtube.com/watch?v=pklr0UD9eSo

kiujygtyujik · on Sept 6, 2010

The British vulcan bombers of the same era had the range to reach Moscow - but not the range to return to the UK.

gaius · on Sept 7, 2010

I don't think that's true - Vulcans hold the record for the longest bombing mission even today (during the Falklands war). The secret is air-to-air refuelling...

fgf · on Sept 7, 2010

might be hard to do that somewhere over poland if ww3 just started

gaius · on Sept 7, 2010

USAF wargames discovered against F22s the most effective strategy to use would be tanker denial. Tho' with 4th generation fighters you would still need to outnumber them 6:1 (!)

kiujygtyujik · on Sept 7, 2010

The astronomer Freeman Dyson was part of the operational analysis team in WWII that worked out it made most sense to only search for and attack refuelling and resupply U-Boats (milkcows)

The Vulcan flight gear famously included hiking boots so the crews could walk to Turkey after completing their mission

stretchwithme · on Sept 7, 2010

ultimately, it is the responsibility of the integrator to test their systems, including the software. And they have the source code for everything. Its like Apple X 100.

patio11 · on Sept 7, 2010

Didn't see this last night. FWIW I agree with you, but that was not my recollection of the lecture as presented.

jacquesm · on Sept 6, 2010

There's an in-joke amongst embedded systems programmers, which is that if you need floating point you don't fully grok the problem (yet).

Floats are subtle, they can have exactly the kind of nasty little side effects that you overlook during testing and that bite you in a terrible way once you hit production.

Fixed point is the way to go to implement stuff like this, floating point is asking for trouble. Of course floating point is more convenient, but most people that use it don't really understand what's going on under the hood, and most of the time they don't need to.

But bank balances, clock values, GPS coordinates and other values of some importance are best represented in fixed point format. You'll have to do a bit more work when manipulating them but that pays off in reliability.

jpr · on Sept 8, 2010

Is there really any place where floating point is the right solution or default? I mean, all general purpose languages generally use floating point, but for general purposes, floating point isn't what you want. Or maybe it is, but it certainly isn't what I want.

pvg · on Sept 6, 2010

the Patriot missile kept track of its internal clock with floating point numbers

It didn't, really, and the bug itself is more convoluted than the loss of precision you are describing.

See the references:

http://www.mc.edu/campus/users/travis/syllabi/381/patriot.ht... and http://www.fas.org/spp/starwars/gao/im92026.htm

extension · on Sept 6, 2010

A potentially more specific moral in this case is "beware of floating point numbers". Don't use them unless you really, really understand how they work, or if you are using them in a very conventional way, like with 3D graphics.

Inexperienced programmers often regard FP as magical numbers that do everything. It doesn't help that programming languages like JavaScript essentially treat them as such.

gaius · on Sept 6, 2010

See, there've been guided missiles since the 1950s. There'll be 60-something old geezers at Raytheon who have done missiles their whole careers, 30-40 years, and they would have been apprenticed to the previous generation of old geezers who had merely done missiles most of their careers. These guys should know this stuff by now.

Fatal errors are almost never a single thing going wrong unpredictably - they're a chain of events leading to an inexorable conclusion. Root cause - where were these guys?

It is possible to get this right: http://en.wikipedia.org/wiki/Sea_Dart_missile#Gulf_War_.2819...

azernik · on Sept 7, 2010

The problem is, they never had to deal with software engineering; my mom, who served in the Israeli Navy in the late 70s, told me horror stories about the ship-to-ship guided missiles they dealt with. They were essentially analog electronics - addition done by adding voltages, for example - with error creeping in at every component. The way I remember her telling it, something like 30% to 60% of the things were down at any given time because some analog multiplier or adder had an error that was just a little bit too large. No one in the field was even thinking about software bugs at that point.

The reliable (and programmable) digital electronics were originally developed for ICBMs in the 60s, and only after quite a bit of miniaturization were they available for smaller guided missiles, meaning that there wasn't quite as much institutional experience of software engineering among guided missile designers as you'd think.

varjag · on Sept 7, 2010

Well, here SCUD (a technology obsoleted by 1960s) certainly worked, while the head-to-toes digital Patriot didn't.

Symmetry · on Sept 7, 2010

The SCUD was doing a much easier job, though. And the amount the patriot missed by is actually pretty close to the accuracy of a SCUD at hitting its target.

varjag · on Sept 7, 2010

The SCUD was developed in mid-1950s using technology pioneered in 1940s. Back then, the only feasible electronics were based on vacuum tubes. Noir films were hip, chrome fins on the cars were just in, ball joints in suspensions were about to appear, and it was socially acceptable to call black man a negro.

I think it did amazing job in the 90s against the arguably most technically advanced and innovative military, even more so considering it was pushed past its original specs.

regularfry · on Sept 7, 2010

They did get it right; the specs for the Patriot system allowed for rebooting every couple of days specifically to avoid this problem.

To my eyes, this is a classic training issue. Either the men on the ground were never told to reboot, or they were never told the consequences of failing to do so. Whether that failure was on the manufacturer's part for making crap manuals, or on the Army's part for screwing up the training, I don't know.

extension · on Sept 6, 2010

Well, nevermind. According to this, they were using 24bit fixed point, not floating:

http://www.ima.umn.edu/~arnold/disasters/patriot.html

Maybe a better moral is "avoid absolute clocks or counters, if at all possible".

sundarurfriend · on Sept 7, 2010

This may be a naive question but: why did the error get accumulated over time? The original system clock was correct, and if they read from it every time and converted, the error would have remained at 0.000000095. Could someone explain how exactly this happened?

regularfry · on Sept 7, 2010

The figure of 0.000000095 refers to the inaccuracy in the multiplier to get from the clock conter value to a value of time in seconds, not the inaccuracy in the clock itself. As the clock counter increases, the difference between the true time and the clock counter times the multiplier increases linearly.

Edited for clarity.

jacquesm · on Sept 6, 2010

In 3D graphics you need to do a lot of work beforehand on range checking your input data to make sure you don't end up with runtime overflows, which usually exhibit in visually very disturbing ways, such as planes that suddenly become transparent, pixels that flip from black to white or vv and so on.

It's very easy to destroy the illusion of a three dimensional graphic on a screen, these are very effective ways of doing so.

bl4k · on Sept 6, 2010

I thought that the experience of developing the GPS network (where all the sattelites and ground stations must be time co-ordinated to within 10+ decimal places in seconds) would have helped them with this missile system.

A GPS satellite travels much faster than a missile, and is accurate to within meters, so the tolerances are much higher (and the system was designed prior to patriot).

There is a definite 'not invented here' syndrome amongst defense contractors - I doubt they share any information, research or solutions amongst one other, which means the US tax payer foots the bill each time one of these contractors must independently develop and implement a system that has likely already been built in another part of defense.

rdtsc · on Sept 6, 2010

> There is a definite 'not invented here' syndrome amongst defense contractors

Yes. Especially because of contractors. A defense contractor is not going to its competitors in order to reveal and exchange know-how. That know-how is a trade secret that gives them competitive advantage. Except that in this case, lack of sharing results in making the same mistake multiple times, which results in loss of human life.

Interestingly lately I have observed that the government is trying to tighten up its defense spending belt and there is a tendency to develop more in-house rather than contract out. In order to gain public support the Congress keeps voting to increase the salary of service personnel. That leaves less money to spend on contracting out. So perhaps we'll see more sharing in the future at least between DoD's own projects.

gaius · on Sept 6, 2010

The DoD should have stuck to its guns (hah) and continued to mandate Ada.

weaksauce · on Sept 6, 2010

I would wager that some of the NIH syndrome is due to top secret requirements and blackops development that makes things tough to get at source code for other systems.

Retric · on Sept 7, 2010

It's much simpler to track and predict the position of objects orbiting the earth than moving though the atmosphere. Wind, temperature gradients, particulate matter, mist, etc add a lot of uncertainty to the predictions.

PS: As you increase accuracy more things become important factors. Large Ship guns actually started to track things like temperature at different altitudes to increase accuracy.

regularfry · on Sept 7, 2010

Also, a GPS satellite is actively designed to be found. A ballistic missile? Not so much.

some1else · on Sept 6, 2010

Oh, Interesting. I always figured GPS satellites were geo-stationary, but in fact (as you pointed out), they travel at 7000 mp/h :-0

pmjordan · on Sept 6, 2010

Geostationary orbits would tie the satellites to being directly above the equator, which (a) would prevent the system from working beyond a certain latitude and (b) would cause an extremely bad distribution of "visible" satellites and angles between signals, precision would suffer.

kiujygtyujik · on Sept 6, 2010

Not to mention the transmitter power you would need if they were 25,000miles away in GSO rather than 90mi away in LEO

chrisbolt · on Sept 6, 2010

GPS satellites are actually in MEO, at 12,500 miles away.

http://en.wikipedia.org/wiki/Medium_Earth_orbit

tetha · on Sept 7, 2010

Note that while you sit on your chair, you are travelling pretty fast. Relativity for the win!

dagw · on Sept 6, 2010

Moral c) The floating point numbers you use when programming computers are fundamentally very different from the numbers you'll see in math class. Any assumption that computer numbers and math numbers behave the same will eventually lead to bugs.

some1else · on Sept 6, 2010

It's truly remarkable how nobody cared to attend to the bug report about the drift, given that they are in the business of making missile pilot software, and it turned out they can issue a patch in one day. My teachers didn't have such a terrifying example at hand (they emphasize it with financial calculations), yet I still remember the issue and consider it when necessary. Poor QA :-S

joegaudet · on Sept 6, 2010

I thought the issue was due to the binary representation of numbers like 0.1, 0.01, 0.001 being in fact non terminating fractions ala: 1/3 thus the longer the system ran the more small amounts of round off error it accumulated but adding two incomplete fractions.

I seem to remember this being the case in my numeric methods course, but maybe that was a different systme.

slavak · on Sept 7, 2010

This baffles me. Did the software engineers creating this not take a course on numerical analysis in college? Or were those, as so often are, electrical engineers that had been hired to do software work? The amount of misunderstanding in the industry about what it takes to create mission-critical software is incredible.

assemble · on Sept 7, 2010

Trust me, most of us EEs know -plenty- about floating point problems. We also have to take entire classes about how our work could kill somebody.

Bugs are bugs, errors and oversights happen. You have to deal with it, document what happened, and be vigilant to make sure it never happens again.

ars · on Sept 6, 2010

Do you know how they fixed it? I can think of a few ways, but I'm curious which they used.

Natsu · on Sept 6, 2010

I find it interesting how "just reboot" has become something of a user expectation: often, users expect it to fix most anything. To be fair, it does seem to work fairly often.

It's a strong enough expectation that when I had some industrial machines running DOS (albeit on more modern hardware), I added my update, backup & diagnostic scripts to autoexec.bat so that rebooting would fix most of their problems.

It made my life a lot easier, though, because I could update the files and configurations via a master copy on the network, then tell them to reboot everything whenever it was convenient for them (usually between shifts) and the machines would all grab their updates and upload some log files for me to monitor.

viraptor · on Sept 6, 2010

I wonder how did they actually find out the reason for the failure? They had a system which worked perfectly (almost) and probably could be tested in every standard way without showing the problem. They must've had a seriously good logging system that showed something suspicious, or someone had a really interesting "a-ha" moment...

I'd like to hear the story of debugging this one. Also how they managed to identify that this incident was caused by that specific bug.

dotBen · on Sept 6, 2010

Imagine being the developer who wrote the line of code (who didn't understanding floating point variables). Or the QA tester who didn't spot it, or didn't decide it was worth reporting.

Aside from being a pacifist, this is why a number of engineer friends have stepped out of building defense systems (including missile guidance systems) and into more civilian engineering because the stress and moral burden is just too great.

extension · on Sept 6, 2010

Better than being the executive who decided not to release the patch until 28 people were dead.

viraptor · on Sept 6, 2010

As long as the workaround was known, well... I don't know exactly what the military procedures are for situations like that, but updating an active rocket defence system in area where you don't necessarily have trained engineers -vs- rebooting it every day or so to make sure it works. It looks like a simple choice to me.

Also looking at who actually makes the mistake - if someone gives you an update and the system fails, they're at fault. If you give clear instructions for operation and users don't follow it...

extension · on Sept 7, 2010

According to Wikipedia, the workaround came from the Israeli army, who found the bug, not the manufacturer. The instructions for this workaround did not propogate clearly and thus weren't followed.

Being defenseless for a few minutes every day during a reboot hardly seems like a reasonable fix anyway, especially if it becomes standard procedure that your enemy may learn about.

The fact that the manufacturer did release a patch, rather than a workaround, the day after the accident suggests that this was indeed the safest course of action. It was just taken too late.

nitfol · on Sept 6, 2010

More information about the errors in the (fixed point) math on the patriot: http://www.ima.umn.edu/~arnold/disasters/patriot.html

joshzayin · on Sept 7, 2010

That's interesting.

Given that it was an issue with a non-terminating binary representation, what would be the way to handle this, without somehow resetting the clock (restart or otherwise)?

Obviously, you could, in a modern system, have more memory and be able to store more bits of the number, but there would still be a limit that you would run into after some amount of time that would cause similar problems.

tetha · on Sept 7, 2010

I'm just thinking about this. I think I would try to split the clock into a precise part, which (for example) tracks how many milliseconds we are into an hour and a precise part which stores the hours, or the date up to the hours or whatever. Given this, I can reset the part with the degrading accuracy in a duration which maintains enough accuracy in order to maintain enough accuracy overall.

stretchwithme · on Sept 6, 2010

Interesting how complex it is to determine the accuracy of the missiles. Multiple Patriot missiles fired at each Scud, several possible outcomes, the Scud can break apart making multiple targets for the Patriots.

patio11 · on Sept 6, 2010

I agree, though "We got the missile but missed the warhead. That has to count for something." strikes me as a little off. Pretend Saddam Hussein has ordered his team of crack experts to use crappy engineering as an active countermeasure. They just beat you. Do better.

Anechoic · on Sept 6, 2010

One of my college professors was an outspoken opponent of missile defense systems (and the Patriot system specifically) that worked during the missile's reentry/descent phase.

His objection is that it's too easy for an opponent to defeat the system either by overwhelming it (MIRVs for instance) or by designing the reentry vehicle to make random movements which would make it really difficult for a intercepter to track. Even if the interceptor can hit it, it's more likely to knock the warhead off course instead of destroying it. If a nuclear missile is aimed at NY and an interceptor hits it so that it falls to Philly instead, that's still a net loss.

Saddam inadvertently hit upon the both methods - his engineers tried to improve on the Soviet scud design to give them more range (which they were successful at) but their improvements made the missile more likely to break up on reentry (which presented more targets than the tracking radar anticipated) and the lack of aerodynamics of the resulting pieces (including the warhead) the missiles fall in unpredictable ways which caused tracking problems. An opponent that actually tried to game the system could make his missiles more difficult to hit.

The professor is advocating for boost-phase missile defense since the missile movements are much more predicable.

icegreentea · on Sept 6, 2010

But it's not like boost-phase intercept is a magic bullet (ha!). Well, really, it generally takes a magic bullet. Launch sites are typically far away. Unless you have a interceptor several times faster (or interceptor site much closer to the launcher than the target), then it's very hard to actually reach the missile while it's in boost phase.

That airborne laser that's been in development for seemingly forever was pretty much determined to be the best way to intercept. You'll notice that it combines BOTH elements. That 747 is flying within LOS of the missile path while in boost phase, and it's also using the fastest projectile possible.

You'll also note that the Patriot's role is medium tactical air defense as well as theater anti-ballistic missile defense. The second role was basically tacked on, and then later massively expanded (once it became obvious that there aren't many air forces in the world that can actually fight the US).

In the end, I'm sure every general and admiral actually out to improve their warfighting abilities would want both systems. It's all about defense in depth. It's the reason why warships have three different sets of anti-air missiles, while we still have stingers when we have patriot missiles, and why US fighter aircraft still carrying short range missiles and guns.

jacquesm · on Sept 6, 2010

But it makes for more defense spending.

If you catch something on the launchpad that's a relatively low expense, to catch it i in flight you need to actively defend each and every possible target and a lot of sexy hardware to do so.

stretchwithme · on Sept 7, 2010

yeah, I think its very hard to defend against missiles.

I've always thought they should prove that it will work in a realistic physics-based simulation before spending billions on these systems. But they seem to have some value against relatively crude missiles. Plus their value as a bargaining chip, although that depends on how smart the adversary is.

duncanj · on Sept 7, 2010

The really interesting lesson that engineers have learned from the Patriot is that "never reboot" might not be the best target for critical systems. Rather, controlled rebooting can help clean up problems before they affect the function of the system.

http://portal.acm.org/citation.cfm?id=1251254.1251257 http://www.computer.org/portal/web/csdl/doi/10.1109/HOTOS.20...

daemin · on Sept 7, 2010

I seem to recall reading somewhere that the system was originally designed to be a mobile platform against Soviet missiles in somewhere like west Berlin. Where they needed something that would be moved around every day or two so that the enemy would not know its location. That meant that the system would be reset whenever it was moved, and therefore using a floating point clock was a reasonable design trade off.

dRother · on Sept 7, 2010

Sure, rebooting is often the most straightforward way to fix runtime issues because it resets everything. In this case, sounds like resetting the clock would have been just as effective. I'm sure these days, you'd have something like the equivalent of a ntpd update every hour to take care of that.