"One error in every hundred cells. Think about that. Now also consider the number of decisions people make based on spreadsheets. Now tell me that is not a problem."
Gertjan Filarski's face

Yesterday I learned something new. Something that I should have known. I found research by Ray Panko tracing the average number of errors in spreadsheets. The conservative number? 1 in every 100 cells is wrong.

"The average spreadsheet has an error every hundred cells"

Spreadsheet Research

Ray Panko is Professor of IT Management at the University of Hawaii. You can find his work at his website Spreadsheet Development Error Experiments.

This page looks at error rates during spreadsheet development. Specifically, it looks at the cell error rate (CER), which is the percentage of cells that is incorrect in the final model, when the developer declares the model finished to his or her satisfaction.

For the nine studies that collected data on errors when developers worked alone, the CERs averaged 1.1% to 5.6%. The results are consistent with error rates in other simple but non-trivial cognitive action such as writing, doing a calculation, or writing a line of software code.

The very clever Marieke van Erp (a former colleague of mine and leader of the Digital Humanities Research Lab at the Royal Netherlands Academy) pointed me to further research published by Springer. Jonathan Maletic and Andrian Marcus found an error average of 5% introduced by people during data acquisition and cleaning.

Finally, Felienne Hermans at the Delft University analysed the email and attachment archive of the Enron Corporation published at the 2015 International Conference on Software Engineering. The entire dataset contains over 15.000 spreadsheets with an average number of 6191 non-empty cells, and 1286 formulas per spreadsheet. She found that 24% of the spreadsheets have at least one error in a formula - and significant issues with long calculation chains to boost.

So I think I can safely say that both the data and the formulas in the spreadsheets people user are very likely wrong. Also in academia. We need to accept this. And we have known this for quite a while. Ray Panko has been working on these errors since 1998.

Decision making

Sometimes the consequences are big. 'The London Whale Trader' scandal netted JPMorgan a $6.2 billion trading loss in 2013. James Kwak wrote in his analysis on The Importance of Excel in the debacle:

After subtracting the old rate from the new rate, the spreadsheet divided by their sum instead of their average, as the modeller had intended. This error likely had the effect of muting volatility by a factor of two and of lowering the VaR (value-at-risk)...

Oops.

And juicy as something this big is, the real problem is felt everywhere all over the world. After I found these figures I quickly posted them on LinkedIN yesterday. Within half an hour there were some heartfelt replies from people in research projects:

I was working in a project and halfway we ran out of funds. Nobody understood it because there was very precise control. But somebody had forgotten to adjust the hour price of external personnel to the real price, instead of the price predicted before the project start. A small error with a huge outcome 😕

How to deal with errors

You cannot blame people for making mistakes. We are human, and science suggests we have, on average, a 5% error rate in non-trivial activities like making calculations or writing lines of code. That is what people do - and quite consistently do. Software engineers deal with this truth all the time and have built some brilliant tools to limit the number of mistakes they make. Somehow these did not make it into spreadsheets.

formula

The above is an INDEX formula with some hoops that Excel makes you jump through. I won't go into details, but it is an example of common work an Excel power-user does. And it is a nightmare to write. I know some people take pride in this kind of work - but I'd rather be sent to a penal colony doing hard labour.

This is not how you write code. And yes, even in Excel and even if you do not consider yourself a software engineer: when you are writing IF-statements and refer to variables then you are for all intents and purposes writing code. And you deserve better support. Where is the color coding to separate variables from keywords and values? And for crying out loud: Excel expects you to write formulas on a single line!! Without any indentation to see in which clause or IF statement you're working! Not to mention proper code completion... no wonder people make mistakes.

These tools exist and they are not rocket science. They are the very basic. Software engineers have far more sophisticated solutions: extensive automated mocking and testing processes before release, dedicated debugging platforms, A/B testing, and extensive version control systems. Although it would be fantastic to have that at your disposal in a spreadsheet - I would be content with simple line indentation for starters.

Why isn't that a standard part of spreadsheets? Why are these systems not actively helping to reduce the number of mistakes people make? I mean, it's not as if any big decisions are based on a spreadsheet, right?