Top 10 Testing Mistakes XP Teams Tend to Make

XP practices like Test-Driven Development (TDD), continuous integration, refactoring, and pair programming make it the most rigorous process I’ve ever encountered. The resulting code is usually extremely high quality. But, unfortunately, XP is not a panacea. Nor is code produced by XP teams guaranteed to delight users. Pulling from my own observations as well as the stories others have told me, here is my top 10 list of testing mistakes XP teams tend to make that can result in bad surprises when the software is deployed.

  1. Attempting to substitute unit tests for acceptance tests. The automated unit tests that result from TDD provide fabulous fast feedback about the possible harmful side effects of any given change. That feedback is so fast, and so useful, that teams are sometimes tempted to claim that the software is good because the unit tests are green. But no matter how comprehensive the unit tests are, end-to-end acceptance tests give us information that unit tests can’t. Where unit tests tell us how well the internals of the system conform to explicit expectations, acceptance tests give us information about an end user’s experience. You can’t substitute one for the other. (A small sketch contrasting the two kinds of tests follows this list.)
  2. Not automating the acceptance tests. Numerous XP teams around the globe are re-discovering that automating end-to-end acceptance tests through a UI is painfully hard. All too often, I see XP teams skimp on end-to-end acceptance test automation because it’s too time-consuming and painful. But this is one case where the adage “if it hurts, do more of it” is especially helpful. If it’s too time-consuming or painful to automate the acceptance tests, it means something needs to change to make it easier. Maybe that means using a different automated testing tool, maybe it means changing the interface to make it more testable, or maybe it means the team just needs more practice. Whatever the problem, doing it more will give the team more opportunities to find remedies for the pain. Avoiding it just causes more pain in the long run when the manual testing becomes too big a burden and bugs start slipping through the cracks.
  3. Thinking the automated tests are sufficient. Having a fully automated suite of tests at both the unit and acceptance levels is such a seductive goal that it’s easy to forget we can’t predict and code tests for every interesting condition. Some amount of manual testing will always be necessary to catch those surprises we couldn’t possibly foresee.
  4. Letting the Customer accept features with insufficient testing. Sometimes Customers let their eagerness to see the software deployed get the better of their skepticism, and they skimp on the acceptance testing. If that happens, it’s up to the team to gently but firmly make the Customer understand the risks of accepting Stories too readily.
  5. Overly relying on the Customer to specify every detail of both desired and undesired behavior on every Story. Some XP teams place all the burden of specifying behavior on the Customer, saying “if the Customer didn’t ask for it specifically in the Story, it doesn’t count, and we shouldn’t do it.” Consider an application that crashes when the user enters invalid data in a field. Some XP teams will say, “We don’t need to write any code to guard against bad input unless the Customer explicitly asks for it.” The problem is that the Customer usually assumes that some acceptance criteria are obvious. Not crashing if the user happens to enter an ampersand (“&”) in a description field would be right up there in their minds with “obvious.” But how should the team draw the fine line between gold-plating a release with features the Customer didn’t request and anticipating the Customer’s needs to avoid rework? The best way I know is to discuss assumptions about these “Level 0” requirements: the acceptance criteria the Customer assumes will be in place without having to explicitly state them in each and every Story. (A sketch of one such “Level 0” test follows this list.)
  6. Underestimating the need for integration testing. Story A has automated unit and acceptance tests. Story B has automated unit and acceptance tests. Story A works great. Story B works great. The Customer has carefully reviewed, tested, and accepted the Stories. Everyone’s happy. End of, er, well, Story. Right? The problem is that Story A and Story B might not work so well together. Perhaps Story A has a side effect of corrupting the data used by Story B. The solution is to include end-to-end scenarios that touch multiple Stories when testing. (A sketch of one such cross-Story test follows this list.)
  7. Underestimating the need for extended sequence testing. Tests in XP environments tend to be straightforward. Set up the conditions. Perform the actions. Verify the results. Repeat. But that’s not how real-world users use software. A real-world user is more likely to set up some conditions, perform some actions, change some conditions, take a coffee break, undo then redo some of the actions (but not all), view the results, revisit the actions, and so on. The real world is messy. Simplistic, linear tests don’t tell us enough about the risks lurking in the software when real users use it in a real-world way. (A sketch of an extended-sequence test follows this list.)
  8. Forgetting about non-functional criteria. How many XP teams write automated tests to detect memory leaks? Or random, high-volume automated tests designed to find reliability problems? My guess is just those bitten by memory- or reliability-related bugs. XP teams rarely articulate non-functional quality criteria such as reliability, usability, performance, scalability, and memory footprint in Stories. And that means XP teams rarely have tests designed to provide information about these attributes. Non-functional quality criteria are by their nature more ambiguous and vague than feature Stories. But they’re just as important to the overall user experience. It’s worth the extra effort to articulate acceptance criteria for the non-functional attributes of the system and to test them. (A sketch of one such test follows this list.)
  9. “Fixing” a build by commenting out a test that “shouldn’t be failing.” The JUnit mantra “Keep the code clean, keep the bar green” is so powerful that XP teams have been known to cheat by simply commenting out the tests that are failing to get the build back to Green. I know: you’re all shocked. “No one on my team would ever do such a thing!” you protest. Perhaps not. But I’ve seen it happen. And I’ve been tempted to do it myself. “There is no good reason this test should be failing,” I say to myself. “It must be something unrelated to this particular test.” And sometimes that’s true: after digging around, I discover that the problem is not with the failing test but with some data pollution caused by another test. Sometimes the assertions in the test are no longer valid; if the test is truly invalid, delete it rather than commenting it out. But sometimes the failing test is giving me a very important message, one that I’d be a fool to ignore. Commenting out the test without investigating the problem more deeply is like applying heavy cologne to cover a bad smell. And in the case of code, it’s risky behavior that undermines the power of those unit tests.
  10. Not including testing activities in the Planning Game. I’ve been in a number of Planning Game meetings by now, and I’ve noticed that when I suggest we include time for activities related to creating test data or setting up test configurations, I usually encounter resistance. Sometimes the resistance is a Catch-22: “The Customer has to tell us he wants those activities done by putting them in Stories,” says the team. When I propose we add the activities as Stories, the team objects: “But those are infrastructure activities that have no inherent value to the Customer. They’re not Stories.” No matter how we account for the time, we’re going to have to do the testing tasks. (Or accept the risk of inadequate testing. See Item 4.) So if we don’t want our Velocity to suffer because we spend unbudgeted time on testing tasks, we should budget the time, whether in a Story or by reducing our Velocity estimates. And in order to ensure the testing tasks are done, we should track them the same way we track other infrastructure activities.
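
To make mistake #1 concrete, here is a minimal JUnit-style sketch. Everything in it is hypothetical: the tiny PriceCalculator stands in for real production code, and the commented-out acceptance check stands in for a test that drives a deployed system the way a user would. The point is only that the two tests answer different questions.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class PricingTests {

        // A tiny inline class standing in for real production code.
        static class PriceCalculator {
            double totalFor(double unitPrice, int quantity) {
                double total = unitPrice * quantity;
                return quantity >= 10 ? total * 0.9 : total;   // 10% bulk discount
            }
        }

        // Unit test: fast, isolated feedback on one internal rule.
        @Test
        public void bulkDiscountAppliesAtTenItems() {
            assertEquals(90.0, new PriceCalculator().totalFor(10.0, 10), 0.001);
        }

        // Acceptance test (outline only, since it needs a deployed system):
        // drive the application through the interface a user actually touches,
        // then check what the user actually sees.
        //
        //     store.logIn("pat");
        //     store.addToCart("WIDGET", 10);
        //     assertEquals("$90.00", store.checkoutPage().displayedTotal());
    }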
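
For mistake #5, here is a sketch of the kind of “Level 0” test nobody wrote a Story for. DescriptionField is a hypothetical stand-in for whatever code actually handles the input; the only claim the tests make is that bad input doesn’t blow up.

    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    public class Level0InputTest {

        // Hypothetical stand-in for the code that stores a description field.
        static class DescriptionField {
            String save(String text) {
                // Escape, rather than choke on, characters with special meaning downstream.
                return text == null ? "" : text.replace("&", "&amp;");
            }
        }

        @Test
        public void ampersandInDescriptionDoesNotBlowUp() {
            String stored = new DescriptionField().save("Nuts & bolts");
            assertTrue(stored.contains("bolts"));   // the data survived and nothing threw
        }

        @Test
        public void nullInputDoesNotBlowUpEither() {
            new DescriptionField().save(null);      // passing simply means "no exception"
        }
    }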
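
For mistake #6, here is a sketch of a test that deliberately crosses Story boundaries. The two tiny classes are hypothetical stand-ins for Story A (“rename a customer”) and Story B (“report on customers”); they share one data store, which is exactly the kind of seam where each Story can pass on its own yet break the other.

    import static org.junit.Assert.assertEquals;
    import java.util.HashMap;
    import java.util.Map;
    import org.junit.Test;

    public class CrossStoryScenarioTest {

        // Shared data store: the seam between the two Stories.
        static final Map<String, String> customers = new HashMap<String, String>();

        static class RenameCustomer {                 // Story A
            void rename(String id, String newName) {
                customers.put(id, newName.trim());
            }
        }

        static class CustomerReport {                 // Story B
            String nameLine(String id) {
                return "Name: " + customers.get(id);
            }
        }

        @Test
        public void renamedCustomerShowsUpCorrectlyInTheReport() {
            customers.put("42", "Acme");              // data left over from earlier work
            new RenameCustomer().rename("42", "  Acme Holdings  ");
            assertEquals("Name: Acme Holdings", new CustomerReport().nameLine("42"));
        }
    }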
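
For mistake #7, here is a sketch of an extended-sequence test: a long, randomized stream of actions, undos, and redos, checked against a dirt-simple model after every step. The UndoableCounter is a hypothetical stand-in for real application state, and the fixed random seed keeps any failure reproducible.

    import static org.junit.Assert.assertEquals;
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;
    import java.util.Random;
    import org.junit.Test;

    public class ExtendedSequenceTest {

        // Hypothetical stand-in for real application state with undo/redo.
        static class UndoableCounter {
            private int value = 0;
            private final Deque<Integer> undoStack = new ArrayDeque<Integer>();
            private final Deque<Integer> redoStack = new ArrayDeque<Integer>();

            void add(int amount) { undoStack.push(value); redoStack.clear(); value += amount; }
            void undo() { if (!undoStack.isEmpty()) { redoStack.push(value); value = undoStack.pop(); } }
            void redo() { if (!redoStack.isEmpty()) { undoStack.push(value); value = redoStack.pop(); } }
            int value() { return value; }
        }

        @Test
        public void longRandomSequenceMatchesASimpleModel() {
            Random random = new Random(2003);           // fixed seed: failures are reproducible
            UndoableCounter counter = new UndoableCounter();

            // The oracle: every value ever reached, plus a cursor into that history.
            List<Integer> history = new ArrayList<Integer>();
            history.add(0);
            int cursor = 0;

            for (int step = 0; step < 10000; step++) {
                int action = random.nextInt(3);
                if (action == 0) {                      // do something
                    int amount = random.nextInt(10);
                    counter.add(amount);
                    history = new ArrayList<Integer>(history.subList(0, cursor + 1));
                    history.add(history.get(cursor) + amount);
                    cursor++;
                } else if (action == 1) {               // undo
                    counter.undo();
                    if (cursor > 0) cursor--;
                } else {                                // redo
                    counter.redo();
                    if (cursor < history.size() - 1) cursor++;
                }
                assertEquals("diverged at step " + step,
                             history.get(cursor).intValue(), counter.value());
            }
        }
    }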
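
For mistake #8, here is a crude sketch of the kind of non-functional test that rarely gets written: hammer one operation a couple of hundred thousand times and watch the heap. A serious memory-leak hunt needs a profiler, and anything that leans on System.gc() will be noisy, but even a blunt instrument like this can catch a cache that only ever grows. SessionCache is a hypothetical stand-in with a deliberately bounded size.

    import static org.junit.Assert.assertTrue;
    import java.util.HashMap;
    import java.util.Map;
    import org.junit.Test;

    public class MemoryFootprintTest {

        // Hypothetical cache that evicts an arbitrary entry once it holds 1,000 items.
        static class SessionCache {
            private final Map<String, byte[]> entries = new HashMap<String, byte[]>();

            void put(String key, byte[] payload) {
                if (entries.size() >= 1000) {
                    entries.remove(entries.keySet().iterator().next());
                }
                entries.put(key, payload);
            }
        }

        @Test
        public void repeatedUseDoesNotGrowTheHeapWithoutBound() {
            SessionCache cache = new SessionCache();
            Runtime runtime = Runtime.getRuntime();

            System.gc();
            long before = runtime.totalMemory() - runtime.freeMemory();

            for (int i = 0; i < 200000; i++) {
                cache.put("session-" + i, new byte[1024]);   // roughly 1 KB per entry
            }

            System.gc();
            long after = runtime.totalMemory() - runtime.freeMemory();

            // Generous threshold: the bounded cache should retain about 1,000 x 1 KB,
            // nowhere near the roughly 200 MB an unbounded one would hold on to.
            assertTrue("heap grew by " + (after - before) + " bytes",
                       after - before < 50L * 1024 * 1024);
        }
    }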

Better Testing, Worse Testing

I presented “Better Testing, Worse Quality” in 2001 at the SM/ASM conference. The paper remains one of the most popular on my site. In it, I use a diagram of system effects to explain how a big improvement in system-level independent testing can, ironically, lead to worse quality as the level of developer testing goes down.

A few months ago, S.R. Ramachandran contacted me to point out that the paper only looked at the feedback loop from one direction. What happens, he asked, if development improves? Will the independent testing become worse?

I wrote “Better Testing, Worse Quality” long before I became involved in the Agile community. At the time, I didn’t know about test-infected developers, Extreme Programming, or Test-Driven Development. Re-reading my words now, I realize that the paper is one-sided. It identifies just one possible system effect: developer testing diminishing as system testing increases. So could improved developer testing, as often happens on Agile projects, ironically lead to worse quality as the level of independent testing goes down?

As I thought about the question more, I realized that I’ve seen this happen. While working with an Extreme Programming team, I overheard the Customer comment, “I don’t need to test all that because the developer tests cover it.” Uh oh.

Back before the organization adopted XP, the manager playing the Customer role would have had a swarm of testers cover every inch of the software looking for problems. But because the developer testing had improved so much, she felt that extensive system-level testing would duplicate the developers’ efforts. She accepted the new features after only a cursory examination. A few weeks later the Customer was stunned when users surfaced bugs that more rigorous system testing would have caught. Better unit testing had led to worse system testing, and worse system testing had led to worse quality.

That memory triggered another. A development manager at another company described, in animated detail, how a COM architecture would reduce the system testing burden. “If we test all the COM objects thoroughly at the code level, everything will just work when we integrate the whole system!” he declared. The team scheduled very little system test time. Some months later, the team was still battling mysterious crashes and timing bugs.

Then I remembered a situation a participant in one of my classes described. An executive in his company was pressuring him to reduce his test estimates, saying “The developers will be doing extensive unit testing. Now how much can you cut your system testing?” Notice that the executive didn’t ask, “How much time do you think efficiency gains due to quality improvements will buy us?” Instead, he pointed to increased developer testing as an argument to reduce the system testing.

Notice also that the developers weren’t yet doing all that unit testing: it was planned for the future. Apparently just the promise of more unit testing is enough for some to decide less system testing is needed.

My new, more general conclusion is that better testing at one level tends to result in worse testing at another, given no other changes in the system.

This is a problem. It means that the more information we have about the software from one perspective, the less we are likely to have from other perspectives. And that means overall risk tends to remain constant, even after significant improvements in an isolated part of the overall process. Yikes!

In my original paper, I described a difficult conversation in which a VP castigated a Test Manager saying, “We’ve given you a well-stocked lab, you’ve hired a large team of experienced professionals, you’ve brought in training for them, and you’ve established good test practices. With all this investment in testing, how is it that our software is worse?”

Now I can imagine an executive saying to an XP coach, “We’ve given you a well-stocked bull pen, you’ve hired a large team of experienced XP professionals, you’ve brought in training for them, and you’ve established good development practices. With all this investment in development, how is it that our software is worse?”

How tragic.

This leads me to my next general conclusion: an isolated improvement in one aspect of a development process tends to be offset by declines in another, resulting in no overall improvement in the final result. So how do we improve results? By paying attention to the whole process and not just isolated aspects of it.

We can’t afford to use an increase in one kind of testing to justify skimping on another for the simple reason that we can’t substitute one kind of testing for another. Different types of tests answer different types of questions. Unit tests tell us very little about how the overall system works, just as system testing gives us precious little information about how well each code module or class fulfills its responsibilities. Both code-level and system-level tests are necessary to give us a complete picture of the system under test.

Instead of slashing test efforts at any level, let’s focus on efficiency gains. How can we do the same level or even more testing in less time or using fewer resources? Will improvements in overall quality enable us to spend less time spinning our wheels? How can we leverage improvements in one kind of testing to improve the efficiency of another? What steps can we take to ensure setup scripts, test harnesses, fixtures, or data are reusable?
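
As one small example of the kind of reuse I mean, here is a sketch of a shared test-data builder that both code-level and system-level tests could call, so that neither level re-invents its setup. The Customer class and the canned customers are hypothetical, not anyone’s real domain model.

    public class TestCustomers {

        // Hypothetical domain object; a real project would reuse its own.
        public static class Customer {
            public final String id;
            public final String name;
            public final boolean active;

            public Customer(String id, String name, boolean active) {
                this.id = id;
                this.name = name;
                this.active = active;
            }
        }

        // One canonical "typical" customer every test level can start from.
        public static Customer typicalCustomer() {
            return new Customer("42", "Acme Holdings", true);
        }

        // And one awkward case, so edge conditions are set up the same way everywhere.
        public static Customer inactiveCustomer() {
            return new Customer("43", "Dormant Ltd", false);
        }
    }

A unit test can hand typicalCustomer() straight to the class it exercises, while a system test can use the same builder to load the record into a test database, so improvements to the fixtures pay off at every level instead of at one level’s expense.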

By focusing on efficiency, I’m convinced that we can leverage better testing in one area into better testing in other areas. And that means “Better Testing, Better Testing.”