Barriers to Big Data in Health

Big Data: a term used by the IT industry to describe the voluminous amount of unstructured data an organization creates. – Canada Health Infoway

Canada Health Infoway (CHI) released a white paper on big data analytics in health, pointing to its untapped potential and benefits. According to IBM, 90% of the world’s data has been created in the last two years, with 17MB of data created per person per second! You may have already heard several big firms and information systems giants refer to big data as “the new oil”. And while the comparison holds across industries, you can bet big data comes with unique barriers in healthcare.

What is “Big Data”?

“Big Data” in this context is neatly defined as a term used by the IT industry to describe the voluminous amount of unstructured data an organization creates. CHI breaks it down into three characteristics: volume, velocity, and variety. Never before in history have we been able to create and capture such large sets of unorganized data (volume), at such a quick rate (velocity), and with such diversity in data types (variety).

There is no doubt that Big Data will play an ever-increasing role in all industries, as it allows previously unusable data to be turned into predictive insights and trends. And while some companies have started to implement data strategies, the world of healthcare, as usual, is far behind.

Challenges

Unstructured Data

Long gone are the days when a SQL guru could mash together a query with a complex mix of joins and CASE expressions to get exactly what you’re looking for. Healthcare data is one of the most unstructured monstrosities a data scientist can come upon. Lines and lines of free-form text, with images scattered throughout and a dash of hand-written notes, make things difficult to say the least. A promising solution is the advance of natural language processing and artificial intelligence to comb through records. That being said, this technology is still in its infancy.

Patients are a Hot Mess

Imagine a typical database table schema. How do you record that a patient smoked for 20 years, quit for five, relapsed for a bit, but now considers themselves a non-smoker? Most questionnaires simply have a “Smoker: Yes/No” field. The intricacies of a patient’s health journey are hard to record and interpret. Add to that the fact that every patient is unique and is rarely the product of only one condition. Complex medical histories play a significant role in generating unstructured data, and are the hardest to reduce to a predictable pattern.
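
To make the smoking example concrete, here is a minimal sketch of one way to record the history as a list of dated episodes rather than a single Yes/No flag. Everything in it (the SmokingEpisode class, the helper functions, the status labels, and the dates) is hypothetical and chosen purely for illustration; it is not taken from the CHI paper or from any real clinical data standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class SmokingEpisode:
    """One continuous period of smoking (hypothetical structure)."""
    start: date
    end: Optional[date] = None  # None means the episode is still ongoing


def years_smoked(episodes: List[SmokingEpisode], today: date) -> float:
    """Total time spent smoking across all episodes, in years."""
    days = sum(((e.end or today) - e.start).days for e in episodes)
    return days / 365.25


def current_status(episodes: List[SmokingEpisode], today: date) -> str:
    """Derive a coarse status label from the episode list."""
    if not episodes:
        return "never smoker"
    if any(e.end is None or e.end > today for e in episodes):
        return "current smoker"
    return "former smoker"


# The patient from the text: about 20 years of smoking, a five-year
# break, a short relapse, and no smoking since. Dates are made up.
history = [
    SmokingEpisode(date(1990, 1, 1), date(2010, 1, 1)),
    SmokingEpisode(date(2015, 1, 1), date(2015, 7, 1)),
]
today = date(2024, 1, 1)

status = current_status(history, today)
total_years = years_smoked(history, today)
```

Even this small structure keeps information a Yes/No field throws away, such as the relapse and the total exposure, though it still says nothing about intensity (packs per day) or how the patient describes themselves.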

Standards

Data collection proves to be the initial bottleneck, as there is no enforced standard across the board. Most data collected by a healthcare organization sits in a self-contained database, with specifications differing even within a single organization depending on the user, type of data, and available resources. Some information is stored in flat files and spreadsheets, while a cloud-based solution may seem more appropriate in other scenarios. The bottom line is that there are numerous standards with specifications so different that it is (presently) difficult for computer systems to generate meaningful conclusions.