I struck a nerve with my blog on spreadsheet errors. Feel free to read '1 error in every 100 cells '. Why do we use spreadsheets for taking crucial decisions? People reached out with horror stories of the consequences of wrong formulas and data issues. The common denominator? Many are using a spreadsheet as a database. So I continued to explore spreadsheets. This week: the problems you encounter when using a spreadsheet as a database. Are you doing it too?
Advanced calculators
One of the clearest indications that you are using a spreadsheet as a database is the comment: I have too much data for my spreadsheet. It gets too slow, or you can't add more rows.
Spreadsheets have technical limitations that databases do not have. An Excel sheet cannot contain more than 1.048.576 rows and 16.384 columns. Google spreadsheets are limited to 18.278 columns and a maximum of 5 million cells. In both cases you'd be wise to leave long before you reach those limits. Another boundary is 32.767 characters in a single cell. Microsoft has a good (and utterly boring) page on Excel's technical specs. Spreadsheets have these limitations because they are advanced calculators and not databases.
Besides these technicalities, also the design of a spreadsheet differs significantly from a database, which makes it unsuitable for proper data analysis.
Separation of purpose
A spreadsheet is a sheet with a grid of cells. Cells contain values or formulas to perform calculations. But a spreadsheet makes no difference between data, calculations, or even visuals. Among power-users it is a common practice to simulate this difference by creating separate sheets for data, calculations, and visualisation. But in the end the separation is only a convention. And abiding by this separation is a sign that you use your spreadsheet as a database.
You'd expect different behaviour from different types of sheets. When you're managing a dataset, you want to keep it clean, consistent, and concise. Most users have no such qualms when doing calculations. I often add extra columns to show the separate calculation steps, instead of a single column with one massive formula. This is called the mathematical trail. My goal is transparency for myself and for my audience. Extra columns make troubleshooting easier and improve accessibility for other users. And in a reporting sheet? I expect decent lay-out tools for screen and paper. A grid of cells doesn't help at all.
Spreadsheets make no functional distinction between this different requirements. That is because they are only designed for calculations and not for data management or visualisation.
Two dimensions
When you work with data you will notice that you often repeat yourself. Imagine a spreadsheet with books where a dozen authors drive most publications. Each record includes all author details. Another example: how would you add a scenario to research estimates? By adding another column. Put scenario 1 in a column, add another column for scenario 2, etc. Repetition is a clear indication that you are using your spreadsheet as a database. It was never designed for more than two dimensions: column and row.
The solution is to add tables, either with authors or scenarios. And congratulations: you now use your spreadsheet as a database. How are you going to link the books to the authors? Or the estimates to the scenarios? A database is designed for that. A spreadsheet is not. There are no links between separate data sheets or different ranges. Of course you can introduce identifiers to connect these dimensions. A database calls those keys and they are a standard feature. But these are missing from your spreadsheet, how will you guarantee that the identifiers are unique? A database does that by design. A spreadsheet? You'd have to write that. Writing formulas, macro's and other features that copy database behaviour in your spreadsheet, is a clear indication that you are using the wrong software.
After reinventing keys for your spreadsheet you can get the author from a book-row. VLOOKUP
and INDEX/MATCH
formulas do exactly that. Their problem? They don't perform. These functions are an excellent way of slowing your spreadsheet down to a crawl.
In short: if your spreadsheet contains different tables and has calculations that can only be done with VLOOKUP
or INDEX/MATCH
formulas, then you are using a spreadsheet as a database. What you do is writing queries and not making calculations.
Types of data
A spreadsheet does not support data types. Common types are date/time or currency. But wait! A spreadsheet supports dates and currency, right? Well, no. A date in Excel is constrained to any time after January 1st, 1900, and before December 31st 9999. That is workable for some people, but definitely not for historians. Most of the past happened before 1900.
A data type is also not required. If you create a column of dates, you cannot stop anyone from adding the value 'Vanilla-Strawberry'. A database would. And yes, you can write a spreadsheet formula to test if the column contains only dates. But that is only a check afterwards, and as I said earlier: copying database behaviour is a clear indication that you use the wrong software. Custom written checks can (and will) break, leading to ice cream among your dates.
A spreadsheet does not type data, it only formats data. Unlike a database. You want a field in which you can only put 'authors'? That's fine: link the author table and you are good to go. You only want books? Easy: make a table, fill it with books, and link it to your field. It will only accept books from that table. You can simulate that to some extend in your spreadsheet, but we already know what copying database behaviour means.
Spreadsheets don't support data types for a reason. It is almost impossible to data type a cell when writing a calculation. How would you type a formula result that can vary between 'Red', 42, or € 3.797,23? Demanding data types makes life tedious and hinders a spreadsheet's purpose: making calculations.
Why do we use a spreadsheet as a database?
I can delve into the most exciting subject: the history of the spreadsheet to answer that question. I won't.
The short answer is by accident. When the first spreadsheets were designed, somehow nobody expected that users liked the calculation cells also for creating data tables. But they did, and when a user has a table they want to do tabular things with it. Instead of explaining that a database was more suitable spreadsheet companies listened to their users. Many software engineers cursed them when they added data features and created the modern Frankenstein spreadsheet: a hybrid whose VLOOKUP
doesn't scale and Pivot-tables are constrained. It should not live.
Spreadsheets need an overhaul to really add data features. That is hardly possible without alienating a core market of multinationals who depend on their spreadsheet ecosystems. So the only real solution in digital humanities and heritage is to help users leave their spreadsheets.