Adventures with Auto-Generated Tests and RSpec

This post started out as a quick little entry about a cool parlor trick you can do with RSpec to make it work for auto-generated test data. But in the middle of writing what was supposed to be a simple post, my tests found a subtle bug with bad consequences. (Yay for tests!)

So now this post is about auto-generated tests with RSpec, and what I learned hunting down my bug.

Meet RSpec

In case you haven’t encountered RSpec before, it’s a Behavior Driven Development (BDD) developer test framework for Ruby, in the same family as JBehave, EasyB, and others.

Each RSpec test looks something like this:

  it "should be able to greet the world" do
      greet.should equal("Hello, World!")
  end

I used RSpec to TDD a solution to a slider puzzle code challenge posted on the DailyWTF.

Auto-Generating LOTS of Tests with RSpec

So let’s imagine that you’re testing something where it would be really handy to auto-generate a bunch of test cases.

In my particular case, I wanted to test my slider-puzzle example against a wide range of starting puzzle configurations.

My code takes an array representing the starting values in a 3×3 slider puzzle and, following the rules of the slider puzzle, attempts to solve it. I knew that my code would solve the puzzle sometimes, but not always. I wanted to see how often my little algorithm would work. And to measure that, I wanted to pump it through a bunch of tests and get pass/fail statistics back.

I could write individual solution tests like this:

  it "should be able to solve a board" do
      @puzzle.load([1, 2, 3, 4, 5, 6, 8, 7, nil])
      @puzzle.solve
      @puzzle.solved?.should be_true
  end

But with 362,880 possible permutations of the starting board (that’s 9!: nine positions, counting the blank), I most certainly was NOT going to hand-code all those tests. I hand-coded a few in my developer tests. But I wanted more tests. Lots more.

I knew that I could generate all the board permutations. But then what? Out of the box, RSpec isn’t designed for data-driven testing.

It occurred to me that I should try putting the “it” into a loop. So I tried a tiny experiment:

  require 'rubygems'
  require 'spec'

  describe "data driven testing with rspec" do

      10.times { | count |
          it "should work on try #{count}" do
              # purposely fail to see test names
              true.should be_false
          end
      }

  end

Lo and behold, it worked!

I was then able to write a little “permute” function that took an array and generated all the permutations of its elements (a sketch of one possible implementation appears after the spec below). And then I instantiated a new test for each permutation:

  describe "puzzle solve algorithm" do
      permutations = permute([1,2,3,4,5,6,7,8,nil])

      before(:each) do
          @puzzle = Puzzle.new
      end

      permutations.each{ |board|
          it "should be able to solve [#{board}]" do
              @puzzle.load(board)
              @puzzle.solve
              @puzzle.solved?.should be_true
          end
      }
  end
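
In case you’re curious about the permute function itself, it isn’t shown above. Here’s a minimal, hand-rolled sketch of the general idea (the implementation in my repo may differ, and newer versions of Ruby also give you Array#permutation for free):

  def permute(items)
      return [items] if items.size <= 1
      result = []
      items.each_with_index do |item, i|
          # Every ordering that starts with this item, followed by each
          # permutation of the remaining items.
          rest = items[0...i] + items[(i + 1)..-1]
          permute(rest).each { |p| result << ([item] + p) }
      end
      result
  end

  permute([1, 2, 3]).size   # => 6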

Sampling

Coming to my senses, I quickly realized that it would take a long, long time to run through all 362,880 permutations. So I adjusted, changing the loop to take just the first 1,000 permutations:

  permutations[0..999].each{ |board|
      it "should be able to solve [#{board}]" do
          @puzzle.load(board)
          @puzzle.solve
          @puzzle.solved?.should be_true
      end
  }

That returned in about 20 seconds. Encouraged, I tried it with 5,000 permutations. That took about 90 seconds. I decided to push my luck with 10,000 permutations. That stalled out. I backed it down to 5,200 permutations. That returned in a little over 90 seconds. I cranked it up to 6,000 permutations. Stalled again.

I thought it might be some kind of limitation with RSpec, and I was content to keep my test runs to a sample of about 5,000. But I decided that sampling the first 5,000 generated boards every time wasn’t that interesting. So I wrote a little more code to randomly pick the sample.
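
The random-sampling code isn’t shown above, but it amounts to something like this (a sketch; my actual code may have differed):

  # Randomly pick 1,000 boards rather than always taking the first 1,000.
  # (On newer Rubys, permutations.sample(1000) does the same job.)
  sample = permutations.sort_by { rand }[0...1000]

  sample.each { |board|
      it "should be able to solve [#{board}]" do
          @puzzle.load(board)
          @puzzle.solve
          @puzzle.solved?.should be_true
      end
  }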

My tests started hanging again.

My Tests Found a Bug! (But I Didn’t Believe It at First.)

Curious about why my tests would be hanging, I decided to pick a sample out of the middle of the generated boards by calling:

  permutations[90000..90999]

The tests hung. I chose a different sample:

  permutations[10000..10999]

No hang.

I experimented with a variety of values and found a correlation: the higher the starting index of my sample slice, the longer the tests seemed to take.

“That’s just nuts,” I thought. “It makes no sense. But…maybe…”

In desperation, I texted my friend Glen.

I was hoping that Glen would say, “Yeah, that makes sense because [some deep arcane thing].” (Glen knows lots of deep arcane things.) Alas, he gently (but relentlessly) pushed me to try a variety of other experiments to eliminate RSpec as a cause. Sure enough, after a few experiments I figured out that my code was falling into an infinite loop.
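
As an aside: one way to keep a runaway example from stalling an entire generated suite (not something I did here, just an option worth knowing about) is to put a time limit on the suspect call using Ruby’s standard Timeout module. The ten-second limit below is arbitrary and purely illustrative:

  require 'timeout'

  it "should be able to solve [#{board}]" do
      @puzzle.load(board)
      # Raise Timeout::Error instead of hanging forever if solve never returns.
      Timeout.timeout(10) { @puzzle.solve }
      @puzzle.solved?.should be_true
  end

A hung example then shows up as a failure with a backtrace instead of a stalled run.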

Once I recognized that it was my code at fault, it didn’t take long to isolate the bug to a specific condition that I had not previously checked. I added the missing low-level test and discovered the root cause of the infinite loop.

It turns out that my code had two similarly named variables, and I’d used one when I meant the other. The result was diabolically subtle: in most situations, the puzzle-solving code arrived at the same outcome it would have otherwise, just in a more roundabout way. But in a few specific situations the code ended up in an infinite loop. (And in fixing the bug, I eliminated one of the two confusing variables to make sure I wouldn’t make the same mistake again.)

I never would have found that bug if I hadn’t been running my code through its paces with a large sample of the various input permutations. So it seems fitting that, while writing about the mechanics of auto-generating tests with RSpec, I stumbled onto a bug that demonstrates the value of high-volume auto-generated testing.

In the meantime, if you would like to play with my slider puzzle sample code and tests, I’ve released it under a Creative Commons license and posted it on GitHub. Enjoy! (I’m not planning to do much more with the sample code myself, and I can’t promise to provide support for it. But I’ll do my best to answer questions. Oh, and yes, it really could use some refactoring. Seriously. A bazillion methods all on one class. Ick. But I’m publishing it anyway because I think it’s a handy example.)

Not Exhaustively Tested

It sounds like Joe Stump is having a bad time of it right now.

Joe Stump left Digg to co-found a mobile games company. They released the first of their games, Chess Wars, in late June.

Soon after, new players found serious problems that prevented them from playing the game. In response, the company re-submitted a new binary to Apple in July. As of this writing, the current version of Chess Wars is 1.1.

The trouble started with patch release #2. Apparently, even six weeks after Joe’s company submitted the new binary (release number 3 for those who are counting), Apple still hasn’t approved it.

Eventually Joe got so fed up with waiting, and with seeing an average rating of two-and-a-half out of five stars, that he wrote a vitriolic blog post [WARNING: LANGUAGE NOT SAFE FOR WORK (or for anyone with delicate sensibilities)] blaming Apple for his woes.

That garnered the attention of Business Insider, which then published an article about the whole mess.

Predictably, reactions in the comments called out Joe Stump for releasing crappy software.

I should mention here that I don’t know Joe. I don’t know anything about how he develops software. I think that there’s some delightful irony in the name of his company: Crash Corp. But I doubt he actually intended to release software that crashes.

Anyway, Joe submitted a comment to the Business Insider article defending his company’s development practices:

We have about 50 beta testers and exhaustively test the application before pushing the binary. In addition to that the application has around 200 unit tests. The two problems were edge cases that effect [sic] only users who had nobody who were friends with the application installed.

I’m having a great deal of trouble with this defense.

Problem #1: Dismissing the Problems as “Edge Cases”

The problems “only” occur when users do not have any Facebook Friends who have the application installed. But that’s not an aberrant corner case. This is a new application. As of the first release, no one has it yet. That means any given new user has a high probability of being the first in their circle of friends to have the app. So this situation is the norm for the target audience, not an edge case.

Joe seems to think that it’s perfectly understandable that they didn’t find the bugs during development. But just because you didn’t think of a condition doesn’t make it an “edge case.” It might well mean that you didn’t think hard enough.

Problem #2: Thinking that “50 Beta Testers” and “200 Unit Tests” Constitute Exhaustive Testing

Having beta testers and unit tests is a good and groovy thing. But it’s not sufficient, as this story shows. What appears to be missing is any kind of rigorous end-to-end testing.

Given an understanding of the application under development, a skilled tester would probably have identified “Number of Friends with Chess Wars Installed” as an interesting thing to vary during testing.

And since it’s a thing we can count, it’s natural to apply the 0-1-Many heuristic (as described on the Test Heuristics Cheat Sheet). So we end up testing 0-friends-with-app, 1-friend-with-app, and Many-friends-with-app.
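
To make that concrete, here’s the shape such tests might take, written in RSpec purely for illustration (Chess Wars obviously isn’t built in Ruby, and build_user_with_friends and can_start_a_game? are names I just made up):

  # Hypothetical sketch of the 0-1-Many heuristic applied to
  # "Number of Friends with Chess Wars Installed".
  { "zero" => 0, "one" => 1, "many" => 5 }.each { |label, friend_count|
      it "should work for a user with #{label} friends who have the app installed" do
          user = build_user_with_friends(friend_count)
          user.can_start_a_game?.should be_true
      end
  }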

So even the most cursory Exploratory Testing by someone with testing skill would have been likely to reveal the problem.

I’m not suggesting that Joe’s company needed to hire a tester. I am saying that someone on the implementation team should have taken a step back from the guts of the code long enough to think about how to test it. Having failed to do that, they experienced sufficiently severe quality problems to warrant not one but two patch releases.

Blaming Apple for being slow to release the second update feels to me like a cheap way of sidestepping responsibility for figuring out how to make software that works as advertised.

In short, Joe’s defense doesn’t hold water.

It’s not that I think Apple is justified in holding up the release. I have no idea what Apple’s side of the story is.

But what I really wanted to hear from Joe, as a highly visible representative of his company, is something less like “Apple sucks” and something much more like “Dang. We screwed up. Here’s what we learned…”

And I’d really like to think that maybe, just maybe, Joe’s company has learned something about testing, about risk, and about assuming that just because 50 people haphazardly pound on your app for a while, it’s been “exhaustively” tested.