Let’s talk about bugs for a moment. Bad bugs. The kind of bugs that make headlines. Bugs like these:
- From 1985 to 1987, Therac-25 radiation therapy machines overdosed patients with massive amounts of radiation, killing several of them.
- In 1996, the Ariane 5 rocket exploded spectacularly during its first flight.
- In 2004, the NASA Mars rover “Spirit” was inoperable for several days as it rebooted itself over and over.
- In August 2003, a bug in GE energy management software contributed to the devastating blackout that cut off electricity to 50 million people in the northeastern United States and Canada.
So why do I want to talk about these bugs? Because they provide fascinating examples of how variables—things we can change while testing—are sometimes subtle and tricky. Variables can be difficult to identify, and even more difficult to control. And yet, if we want to design interesting tests that will give us the information we need about vulnerabilities in our software and systems, we need to identify those subtle variables and the interesting ways in which we can tweak them.
About “Variables” in Testing
But first, let’s take a step back and talk about what I mean by “variable.”
If you’re a programmer, a variable is a named location in memory, declared with a statement like “int foo;”. However, as a tester, I mean “variable” in the more garden-variety English sense of the word. According to Merriam-Webster (www.m-w.com), a variable is something that changes. And as a system tester, I’m always alert for things I can change through external interfaces (like the UI or the file system) while executing the software.
Sometimes variables are obviously changeable things like the value in a field on a form. Sometimes they’re obvious, but not intended to be changed directly, like the key/value pairs in a URL string for a web-based application. Sometimes they’re subtle things that can only be controlled indirectly, like the number of users logged in at any given time or the number of results returned by a search. And as the bugs listed above demonstrate, the subtle variables are the ones we often miss when analyzing the software to design tests.
Horror Stories Provide Clues to Subtle Variables
So let’s consider the variables involved in these disastrous bugs.
In the case of the Therac-25 incidents, there were numerous contributing causes involved in the deaths of the patients, including both software bugs and hardware safety deficiencies. This was not a simple case of one oversight but rather a cavalcade of factors. Some of those factors, however, were entirely controlled by the software. Nancy Leveson explains in Safeware that in at least one of the incidents, the malfunction could be traced back to the technician entering and then editing the treatment data in under eight seconds, the time the machine took to set its magnets, so the edits went undetected. So here are two key subtle variables: speed of input and user actions. Leveson further explains how a one-byte counter in the setup routine rolled over to zero on every 256th run, causing the software to bypass an important safety check. This provides yet another subtle variable: the number of times the setup routine ran.
The Ariane 5 rocket provides an example of code re-use gone awry. In investigating the incident, the review board concluded that the root cause of the explosion was the conversion of a 64-bit floating-point value to a 16-bit signed integer (maximum value 32,767). The value was too large to fit, the conversion overflowed, and, compounding the problem, the system interpreted the resulting diagnostic error codes as flight data and attempted to act on them, causing the rocket to veer off course. The rocket then self-destructed as designed when it detected the navigation failure. The conversion problem stemmed from differences between the Ariane 5 rocket and its predecessor, the Ariane 4, for which the control software was originally developed. It turns out that the Ariane 5 accelerated significantly faster than the Ariane 4, and the software simply could not handle the horizontal velocity its sensors were registering. The variables involved here are both velocity and the presence of an error condition.
An article in Spaceflight Now explains that the Mars rover “Spirit” rebooted over and over again because of the number of files in flash memory. Every time the rover created a new file, the DOS table of files grew. Some operations created numerous small files, and over time the table of files became huge. Part of the system mirrored the flash memory contents in RAM, and there was half as much RAM as flash memory. Eventually the DOS table of files swamped the RAM, causing the continuous reboots. Note the number of variables involved, all interdependent: number of files, size of the DOS table of files, space on flash memory and available RAM.
Finally, the GE energy management software provides a cautionary tale about silent failures. As in the other cases, numerous factors contributed to the massive blackout: everything from lack of situational awareness to inadequate operator training to inadequate tree-trimming is named in the final report of the joint US–Canada task force. However, that report also contains tantalizing hints that software problems contributed to the operators’ blindness to the deteriorating state of the power grid. According to the report, FirstEnergy, the company responsible for monitoring the grid, had reported problems with the alarm system in GE’s XA/21 software in the past. In an article published on SecurityFocus, Kevin Poulsen quotes GE manager Mike Unum, who pinned the blame for the software failure on a race condition that allowed two processes write access to the same data structure simultaneously. Event timing and concurrent processes turned out to be critical variables, and ones that took weeks to track down.
Looking for Variables
Most of us aren’t testing software that can heal or kill people, that blasts into space, or that manages a nation’s energy supply, but we can still apply these lessons to our own projects. The software I work on has variables around timing, speed, user actions, the number of times a given routine or method executes, files, memory, and concurrent processes.
The final lesson in all these cases is that testing involves looking at all the variables, not just the obvious ones. So the next time you’re thinking up test cases, consider this question:
What variables can I change in the software under test, its data, or the environment in which it operates, either directly or indirectly, that might affect the system behavior? And what might be interesting ways to change them?
It’s not a simple question to answer. But just thinking about it is likely to improve your testing.
References

The Ariane 5 Explosion:
- Gleick, James. “A Bug and a Crash: Sometimes a Bug Is More Than a Nuisance”
- Society for Industrial and Applied Mathematics. “Inquiry Board Traces Ariane 5 Failure to Overflow Error”
- Wikipedia. “Ariane 5 Flight 501”
Mars Rover “Spirit”:
- Spaceflight Now. “Thousands of files deleted on Spirit to fix computer trouble”
- Hachman, Mark. “NASA: DOS Glitch Nearly Killed Mars Rover”
- NASA Office of Logic Design. “MER Spirit Flash Memory Anomaly (2004)”
The East Coast Blackout of 2003: