Hacker News

Does anyone have thoughts on whether test suites suffer from Goodhart's law? Sometimes I feel like they only work well if people assume they don't exist and commit accordingly.


My dad was a huge proponent of this back when he was in software: don't tell your developers which tests failed, only what percentage of tests failed.

A software test is, in many ways, like an exam you might write in university; just like an exam can't possibly cover 100% of what you're supposed to know, a test (especially an integration test for a large and complex system) can't possibly cover 100% of the conditions the system is supposed to operate under. An exam is a good way to measure if you know a subject though, and similarly a test suite is a good way to check that the quality of a system meets a certain bar.

Once you've written the exam, though, if I come back and say "Here are all the questions you got wrong. Go study up and write the same exam tomorrow," it very much ceases to be a good measure of whether or not you know the subject matter. You can now "cheat the system" by studying only the parts of the subject that the exam covers.

Similarly, once your integration tests are failing, if someone tells you which tests are failing and how, you're going to go back and fix only what you need to get the tests passing. At this point, the tests stop being a good indication of code quality - 100% of the tests are passing, but you can't say that 100% of the defects have been removed, so the tests are, in a sense, now kind of worthless. They might stop a limited number of future defects getting in, but they're not doing the arguably much more important job of telling you what your overall quality level is.

If instead, when you submit a commit, I say "5% of the tests are now failing" and nothing else, you have to go look for defects in your code, and you're probably going to find a lot of defects on your own before you even get to the 5% that the tests are complaining about.


This sounds like a fun game to play with a team of developers who have no time constraints. In every organisation I've worked with you would get a very stern talking to for behaving like this.

My tests are specifically designed to show you where the defect is, so you can solve the immediate problem and get back to work. I don't expect every developer who triggered a failing test to perform a full analysis of the code base and resolve every other defect. That would be nice, if we had the time.


I'll prefix by saying that this is exactly how I write my tests, too. But, let's do a little critical thinking here and ask "Why do we write integration tests?" If the goal is to improve software quality, I'm afraid I have some disappointing results for you.

Back when my dad was working at a huge software company (BigCorp, let's say), he looked at how much effort the manual test team spent over a two-week period, and how many defects they found. Then he did the same over the next two-week period. Logically, in that second period, some of the earlier defects had been found and fixed, so it should now be harder to find defects, and the defects found per unit of test effort should be lower, right? Armed with two data points, he did a regression and worked out how many defects they'd find if they did an infinite amount of testing - effectively the undiscovered defect count left in the product. The number he got was astoundingly huge; no one believed it was possible.

So he went over to the plotter and plotted out a giant effort/defect-count curve, and every two weeks he'd put a pin in his plot to show reality vs. his prediction. For months and months, until he got tired of doing it, he was pretty dead on. And he didn't just do this for one project; he did it for lots of projects, across lots of different teams of various sizes.

On all of these projects, all the manual testing they could possibly hope to accomplish if they had their entire testing staff spend 100 years testing would have reduced the overall defect count by a tiny tiny fraction of a percent. So testing and then fixing bugs found in tests (at least in all the projects at BigCorp) didn't really have a huge impact on software quality.
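The extrapolation described above can be sketched with a simple geometric-decay model - an assumption on my part, since the comment doesn't say which curve was actually fitted. Suppose each test period finds a constant fraction of the defects still remaining; then per-period finds form a geometric series, and summing it to infinity estimates the total discoverable defect count:

```python
def estimate_total_defects(found_period1, found_period2):
    """Estimate the total discoverable defect count from two periods
    of testing, assuming each period finds a constant fraction of the
    defects still remaining (geometric decay). Hypothetical model."""
    r = found_period2 / found_period1   # per-period decay ratio
    if not 0 < r < 1:
        raise ValueError("model assumes found_period2 < found_period1")
    # Sum of the geometric series: found_period1 * (1 + r + r^2 + ...)
    return found_period1 / (1 - r)

# e.g. 120 defects found in the first fortnight, 110 in the second:
print(round(estimate_total_defects(120, 110)))  # → 1440
```

The slower the decay between periods, the larger the implied latent defect population - which is how two ordinary-looking fortnightly counts can imply an "astoundingly huge" total.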

And this should not really be a surprise; if you were manufacturing cars, you might test the power seats on every 100th car, and use this to estimate the quality of power seats across all your cars. You might discover that only 95% of power seats are working, and you might think that's unacceptable. If you do, though, you're not going to "solve" the problem by fixing just the broken power seats you happened to test; you're going to figure out where in your manufacturing process/supply chain things are going wrong, fix the problem there, and improve your quality. The testing is a measure of your process.

Software is not so different - by the time it gets to integration testing, the code has been written. The level of quality of the code has largely already been set at this point - all the defects that are going to be introduced have already been introduced. The quality level is dependent upon your process and the skill level of your developers. So testing some arbitrary fraction of the lines-of-code is going to find problems in some percentage of those lines-of-code, but fixing those particular problems? Is this going to have a huge impact on quality?


> you're not going to "solve" the problem by fixing all the broken power seats you test; you're going to go figure out where in your manufacturing process/supply chain things are going wrong

I think the point is that examining the failing seats will lead you to the points in the process that should be fixed. That way you can fix the issue more efficiently than by auditing the whole process blind. The same goes for knowing the details of failing software tests.

Imagine I hide the information about seats and tell you 10% of the finished cars have "some" defect. Where would you even start looking in the factory?


> You can now "cheat the system" by studying only the parts of the subject that the exam covers.

You can't consistently cheat the system if the exam randomly covers every topic studied over the year.

Likewise, if your unit tests are not dumb but are deliberately defined for edge cases (or at least sample random points), then you can't cheat them consistently.
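A minimal sketch of that idea using only the standard library (the function under test and the property are made up for illustration): because the inputs are freshly sampled on every run, patching only the inputs seen in one failing run won't reliably make the suite pass.

```python
import random

def my_abs(x):
    """Toy function under test (hypothetical)."""
    return x if x >= 0 else -x

def test_abs_at_random_points(trials=1000, seed=None):
    """Check the property my_abs(x) == abs(x) at random points.
    The 'exam questions' differ every time the suite executes,
    so you can't study to the test."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(-1e6, 1e6)
        assert my_abs(x) == abs(x), f"property violated at x={x}"

test_abs_at_random_points()
```

Property-based testing libraries such as Hypothesis take this further, generating inputs systematically and shrinking failures to minimal counterexamples.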


> don't tell you developers what tests failed, only what percentage of tests failed.

Puts me in mind of that famous exchange in the list of bad fault reports:

Bug: "Something is broken in the dashboard"
Engineer response: "Fixed something in the dashboard"

Seems like your dad's strategy might invite this kind of anti-response ;-)



