Imagine you need to import a data set into your Laboratory Information Management System (LIMS). Perhaps you’re taking on a new study and need the historic data or are merging labs with another team. Whatever the cause the result is the same. You’ve got a spreadsheet of data of unknown quality that needs to be “knocked into shape” before import. What do you do? Where do you even start when it comes to cleansing data to import into a LIMS?
1. Understand what each column is supposed to represent
You need to properly understand what the data means. Whether you’ve got Yes/No, Y/N or 1/0, you need to know what this meant to the team entering the data originally. If the fields are, say, consent options then does “Yes” mean the donor opted in or opted out? It’s very important to understand the meaning. And what do you do with an empty value?
2. Agree your terminology
Choose the list of valid options. The technical term is ontology. These are often already set up in your LIMS, although they’re not always complete. You may be expanding the scope of your data with, say, a new study so new picklist options need to be agreed. Ensure there aren’t any ambiguous values. “Whole blood”, “Blood, not centrifuged” or “Blood” may mean the same thing but you only want one term for each item/concept if you’re to make any sense of validation and reporting. Then you need to carefully check that every value in that column complies with the allowed values list. This is sometimes called “vertical checks”.
3. Find and cleanse the gaps
A spreadsheet can be misleading with empty values. Not all blanks are equal. An empty text field is not the same as a null value or a zero so standardising is a good idea. If any default meanings are implied in the previous system then you’ll need to substitute those for the gaps to be consistent with your new system e.g. Does an empty Date of Death value mean that the donor was alive at donation or simply that their status was unknown to the previous system? This is very important for processing as well as legal compliance!
4. Sanity check
Talk to the previous data owners to understand how the fields relate to each other. You can then do spot-checks to make sure that everything makes sense. If it doesn’t then this might highlight a fundamental misunderstanding. This is sometimes called “horizontal checks”. I’ve seen data sets that appeared valid (vertical checks) but that had samples that the data said were taken years before the donors were even born!
5. Document what you do
The cleansing you carry out will almost certainly need to be done more than once and so you need to carefully annotate all changes you carry out (not the individual rows but what the tasks were). It’s also a very useful reference if there’s a question about the data after import.
A final thought about overcoming the challenges of cleansing data for importing into a LIMS
There’s a lot of scope for misunderstanding and invalid assumptions to ruin the value of an imported data set. Taking some simple steps can maximise the quality of your data and therefore its value before importing.