Overcoming the Challenges of Cleansing Data to Import into a LIMS
Feb 19

Overcoming the Challenges of Cleansing Data to Import into a LIMS

Imagine you need to import a data set into your Laboratory Information Management System (LIMS). Perhaps you’re taking on a new study and need the historic data or are merging labs with another team. Whatever the cause the result is the same. You’ve got a spreadsheet of data of unknown quality that needs to be “knocked into shape” before import. What do you do? Where do you even start when it comes to cleansing data to import into a LIMS?

1. Understand what each column is supposed to represent

You need to properly understand what the data means. Whether you’ve got Yes/No, Y/N or 1/0, you need to know what this meant to the team entering the data originally. If the fields are, say, consent options then does “Yes” mean the donor opted in or opted out? It’s very important to understand the meaning. And what do you do with an empty value?

2. Agree your terminology

Choose the list of valid options. The technical term is ontology. These are often already set up in your LIMS, although they’re not always complete. You may be expanding the scope of your data with, say, a new study so new picklist options need to be agreed. Ensure there aren’t any ambiguous values. “Whole blood”, “Blood, not centrifuged” or “Blood” may mean the same thing but you only want one term for each item/concept if you’re to make any sense of validation and reporting. Then you need to carefully check that every value in that column complies with the allowed values list. This is sometimes called “vertical checks”.

3. Find and cleanse the gaps

A spreadsheet can be misleading with empty values. Not all blanks are equal. An empty text field is not the same as a null value or a zero so standardising is a good idea. If any default meanings are implied in the previous system then you’ll need to substitute those for the gaps to be consistent with your new system e.g. Does an empty Date of Death value mean that the donor was alive at donation or simply that their status was unknown to the previous system? This is very important for processing as well as legal compliance!

4. Sanity check

Talk to the previous data owners to understand how the fields relate to each other. You can then do spot-checks to make sure that everything makes sense. If it doesn’t then this might highlight a fundamental misunderstanding. This is sometimes called “horizontal checks”. I’ve seen data sets that appeared valid (vertical checks) but that had samples that the data said were taken years before the donors were even born!

5. Document what you do

The cleansing you carry out will almost certainly need to be done more than once and so you need to carefully annotate all changes you carry out (not the individual rows but what the tasks were). It’s also a very useful reference if there’s a question about the data after import.

A final thought about overcoming the challenges of cleansing data for importing into a LIMS

There’s a lot of scope for misunderstanding and invalid assumptions to ruin the value of an imported data set. Taking some simple steps can maximise the quality of your data and therefore its value before importing.



About The Author

Gary Rooksby has over 25 years’ experience implementing and evolving corporate systems including manufacturing and quality systems for a range of major clients such as the MOD. For the last 18 years Gary has specialised in Sample Management Software with emphasis on process optimisation and data management. Gary works in partnership with clients and draws on his wealth of experience to help institutes and their teams to maximise the benefits from new and upgraded systems. Business needs are constantly evolving, and Gary loves the changing challenges. Gary always focuses on delivering value to the users, whether that is financial, scientific or simply easing workloads. He believes that the system is never an end in itself; it is a tool to help the users achieve their goals and that principle is always at the heart of any system or data designs.