The Unknown Unknowns of Missing Data
by Jason S. Brinkley, PhD, MA, MS
On the Brink addresses topics related to data, analytics, and visualizations on personal health and public health research. This column explores current practices in the health arena and how both the data and mathematical sciences have an impact. (The opinions and views represented here are the author’s own and do not reflect any group for which the author has an association.)
“Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns — the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.” – Donald Rumsfeld, former US Secretary of Defense
Missing data is one of the biggest challenges currently facing many areas of health research, and it happens for a variety of reasons. Failure to follow up, integration of data from multiple sources, changes in record-keeping procedures, and simple human error are just a few of the reasons health data can be incomplete. In some cases, the amount of missing data can be pretty severe, and health researchers need to take steps to overcome the missing data hurdle in order to answer timely and relevant questions of public health interest. While there are different ways to approach missing data, choosing among them comes down to knowing whether the absent information represents known unknowns or unknown unknowns.
So how DO we deal with missing data? While there are several textbooks full of techniques and no common agreement on which method is “best,” they generally break down into three basic categories:
- Prediction – We use the information from the data that is not missing to make estimates or educated guesses as to what the missing data’s value might be.
- Replacement – We replace the missing value with a value from a “similar” data point where we actually observed the quantity of interest.
- Categorization – We categorize whether the data was observed or missing and correct the problem on the back end with analytic techniques that don’t require us to fill in the missing data.
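As a rough illustration of the first two categories, here is a minimal Python sketch. The toy records (age and systolic blood pressure), the straight-line model used for prediction, and the nearest-age rule used for replacement are all assumptions made for this example, not methods prescribed in this column.

```python
from statistics import mean

# Hypothetical toy data: (age, systolic blood pressure). None marks a missing value.
records = [(30, 118.0), (40, 124.0), (50, 130.0), (60, 136.0), (47, None)]
observed = [(age, bp) for age, bp in records if bp is not None]


def fit_line(pairs):
    """Least-squares fit of bp = intercept + slope * age on the observed pairs."""
    age_bar = mean([age for age, _ in pairs])
    bp_bar = mean([bp for _, bp in pairs])
    sxy = sum((age - age_bar) * (bp - bp_bar) for age, bp in pairs)
    sxx = sum((age - age_bar) ** 2 for age, _ in pairs)
    slope = sxy / sxx
    return bp_bar - slope * age_bar, slope


def predict_missing(age, pairs):
    """Prediction: estimate the missing value from a model fit to observed data."""
    intercept, slope = fit_line(pairs)
    return intercept + slope * age


def replace_missing(age, pairs):
    """Replacement: borrow the value from the most similar observed record."""
    donor = min(pairs, key=lambda pair: abs(pair[0] - age))
    return donor[1]


predicted_bp = predict_missing(47, observed)
replaced_bp = replace_missing(47, observed)
print(predicted_bp, replaced_bp)
```

The two approaches give different answers for the same patient: prediction interpolates along the fitted trend, while replacement copies the value observed for the 50-year-old, who is the closest match by age.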
The first two methods work well when the missing data are “known unknowns,” in this case representing scenarios where we don’t actually have the values of the data but we have a pretty good idea what they should be. When we don’t have enough quality information to build a good prediction, we might instead want to do replacement and look at data from many “similar” patients. Categorization is currently our best weapon to explore the “unknown unknowns,” those missing data values for which we don’t have a lot of information. By categorizing the data as present or absent, we can study whether the missing data contribute to any study biases. The problem with categorization is that it only gives us insight into whether we might have unknown unknowns but doesn’t necessarily give us any info on what to actually do about it.
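One simple way to picture categorization is to flag each record as observed or missing and then compare the two groups on something we did observe. The sketch below assumes hypothetical patient records of (age, lab result); the variable names and values are invented for illustration.

```python
from statistics import mean

# Hypothetical records: (age, lab_result). None means the lab result is missing.
patients = [(34, 5.1), (71, None), (45, 4.8), (68, None), (52, 5.5), (75, None)]


def missingness_summary(rows):
    """Flag each row as missing or observed, then compare ages across the two groups."""
    flagged = [(age, value is None) for age, value in rows]
    ages_missing = [age for age, is_missing in flagged if is_missing]
    ages_observed = [age for age, is_missing in flagged if not is_missing]
    return {
        "n_missing": len(ages_missing),
        "n_observed": len(ages_observed),
        "mean_age_missing": mean(ages_missing),
        "mean_age_observed": mean(ages_observed),
    }


summary = missingness_summary(patients)
print(summary)
```

In this made-up data the patients with missing lab results are much older on average than those with results, which hints that the missingness is not random. As noted above, the flag reveals the pattern but does not say what to do about it.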
To understand how difficult this can be, let’s walk through an imaginary example. Suppose physicians have two choices for intervention (call them treatments A and B). Suppose all the really sick patients get treatment A and the otherwise healthy patients get treatment B. Now suppose that there is a big block of “toss-up” patients in the middle where it isn’t obvious whether to give them treatment A or B, so doctors order a specific test to help break the tie. If we gather a bunch of patient data and include the results of this test, then we only have test results for the middle-of-the-road patients. We can’t really make predictions for the really sick or otherwise healthy patients, and we have no “similar” patients for replacement. We may not really know what the diagnostic test result should be for those unknown people. This sort of thing happens a lot more often than one might think, and it doesn’t just happen in clinical medicine. Public health researchers struggle with similar issues all the time in areas such as access to care, insurance coverage, and geographic region, where we might only observe certain data on a specific set of people.
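The imaginary example can be sketched in a few lines of Python. The severity scale, the cutoffs of 30 and 70, and the similarity tolerance are all hypothetical choices made only to show the structure of the problem.

```python
# Hypothetical severity scores (0 = healthy, 100 = very sick) and cutoffs.
SICK_CUTOFF = 70
HEALTHY_CUTOFF = 30


def triage(severity):
    """Return (treatment decision, whether the tie-breaking test is ordered)."""
    if severity >= SICK_CUTOFF:
        return "A", False
    if severity <= HEALTHY_CUTOFF:
        return "B", False
    return "toss-up", True


cohort = list(range(0, 101, 10))
tested = [s for s in cohort if triage(s)[1]]
untested = [s for s in cohort if not triage(s)[1]]


def has_similar_tested_patient(severity, tolerance=5):
    """Is there a tested patient close enough to serve as a replacement donor?"""
    return any(abs(severity - t) <= tolerance for t in tested)


without_donor = [s for s in untested if not has_similar_tested_patient(s)]
print(tested, without_donor)
```

Under these assumptions, every untested patient lacks a similar tested patient, so neither prediction nor replacement has anything trustworthy to work from outside the middle group.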
So do we know when we have “unknown unknowns”? Not really. The best we can do is study the missing data and look for patterns to see if there are clear gaps in evidence. Some argue that when incomplete information is a concern, we should do the analysis with and without dealing with the missing data and compare results. These types of “sensitivity analyses” can be very useful as long as we understand we have limitations and that the absence of evidence is not proof of anything, really. That is to say, if you do the analysis both ways and get the same result, that isn’t proof that the analysis is right. Whether you have all the information needed to do the right analysis is something that has to be decided before collecting data. Which brings me to my personal philosophy for analytics, which is also based on a Donald Rumsfeld quote: “You do analysis with the data you have, not the data you might want or wish to have at a later time.”
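A bare-bones version of such a sensitivity analysis might look like the following sketch. It compares a complete-case estimate (dropping records with missing outcomes) against an estimate that fills the gaps by regression. The data, the linear model, and the choice of the mean as the quantity of interest are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical data: (age, outcome). The outcome is missing for the oldest patients.
rows = [(30, 118.0), (40, 124.0), (50, 130.0), (60, 136.0), (70, None), (80, None)]


def fit_line(pairs):
    """Least-squares fit of outcome = intercept + slope * age."""
    x_bar = mean([x for x, _ in pairs])
    y_bar = mean([y for _, y in pairs])
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def complete_case_mean(data):
    """Analysis 1: ignore the missing data and use only observed outcomes."""
    return mean([y for _, y in data if y is not None])


def imputed_mean(data):
    """Analysis 2: fill each missing outcome with a regression prediction first."""
    observed = [(x, y) for x, y in data if y is not None]
    intercept, slope = fit_line(observed)
    filled = [y if y is not None else intercept + slope * x for x, y in data]
    return mean(filled)


cc_estimate = complete_case_mean(rows)
imp_estimate = imputed_mean(rows)
print(cc_estimate, imp_estimate, imp_estimate - cc_estimate)
```

Here the two analyses disagree, which is a useful warning. Note that the imputed estimate rests on assuming the trend seen in younger patients continues into ages where no outcome was observed, and the data cannot confirm that. Agreement between the two numbers would not have proven either one right.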
If you have a desire to repurpose your own Donald Rumsfeld quote, then a simple internet search will bring up dozens of websites; however, I would also suggest Pieces of Intelligence: The Existential Poetry of Donald H. Rumsfeld. If you come up with any good ones about data, please share them below or send them to me via Twitter @DrJasonBrinkley.
Jason S. Brinkley, PhD, MS, MA is a Senior Researcher and Biostatistician at the American Institutes for Research where he works on a wide variety of data for health services, policy, and disparities research. He maintains a research affiliation with the North Carolina Agromedicine Institute and serves on the executive committee for the NC Chapter of the American Statistical Association and the Southeast SAS Users Group. Follow him on Twitter.