Think You Aren’t Extraordinary? Odds Are You’re Wrong
by Jason S. Brinkley, PhD, MA, MS
On the Brink addresses topics related to data, analytics, and visualizations on personal health and public health research. This column explores current practices in the health arena and how both the data and mathematical sciences have an impact. (The opinions and views represented here are the author’s own and do not reflect any group for which the author has an association.)
Cartoon movies are a staple at the Brinkley household where television viewing is dominated by animated adventures of every variety. A favorite is The Lego Movie, which chronicles the tale of Emmett whose ordinariness makes him the perfect champion to defeat the diabolical villain. The question for today is whether Emmett really exists. And, as it turns out, your grandmother may have been right: it is highly likely that you are a unique and special snowflake. I can prove it with a little math. More important, if we are all snowflakes, then what does that say about the data regarding our health? If you’ve read the Malcolm Gladwell book Outliers, then you are probably familiar with the concept of individuals (or data points) that are extraordinary in comparison to everyone else. But have you heard of an inlier? Probably not, although they can be just as extraordinary.
Almost everyone is “typical” on a specific trait. Take height, for instance. We measure percentiles to gauge where people fall with respect to some larger population, and we say that if your height is in the 90th percentile for US citizens, then you are as tall as or taller than 90% of Americans. Some would say that being in the 90th percentile makes you extraordinary, but for our purposes, let’s say that if you fall in the bottom 90%, then you are “ordinary.” Now let’s get mathy and consider two independent traits (things that really should have no general association), such as your height and your IQ. Being “ordinary” on both traits means that you are of ordinary height and of ordinary IQ (90th percentile or less on each). Because the traits are independent, the chance that you are “ordinary” on both height and IQ is the product of the individual probabilities, so 0.90 × 0.90 = 0.81 or 81%.
81% is still high, so let’s continue. Suppose instead we measured 20 different independent characteristics. The chance you are ordinary on all 20 characteristics (i.e., 90th percentile or less on each) is 0.90 × 0.90 × … × 0.90 = (0.90)^20 ≈ 0.12 or 12%. Only 12% of individuals would be completely ordinary on all 20 dimensions, which means 88% of individuals are extraordinary (beyond the 90th percentile) in at least one thing. Simply put, math tells us that if we measure enough independent factors, the chance that anyone is indeed a completely ordinary Emmett shrinks toward 0. In other words, if you look hard enough, then everyone is an outlier on something.
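For readers who want to check the arithmetic, here is a minimal Python sketch of the calculation above. The function names (`prob_all_ordinary`, `simulate_share_all_ordinary`) are made up for illustration, and the simulation assumes each trait's percentile is an independent uniform draw, which is the same independence assumption the math relies on.

```python
import random


def prob_all_ordinary(n_traits: int, cutoff: float = 0.90) -> float:
    """Chance of landing at or below the cutoff percentile on every one
    of n_traits independent traits: cutoff multiplied by itself n_traits times."""
    return cutoff ** n_traits


def simulate_share_all_ordinary(n_traits: int, n_people: int = 100_000,
                                cutoff: float = 0.90, seed: int = 0) -> float:
    """Monte Carlo check (an illustrative assumption, not from the column):
    give each simulated person an independent uniform percentile per trait
    and count how many stay at or below the cutoff on all of them."""
    rng = random.Random(seed)
    ordinary = sum(
        all(rng.random() <= cutoff for _ in range(n_traits))
        for _ in range(n_people)
    )
    return ordinary / n_people


if __name__ == "__main__":
    for n in (1, 2, 20, 50):
        exact = prob_all_ordinary(n)
        simulated = simulate_share_all_ordinary(n)
        print(f"{n:>2} traits: exact {exact:.2%}, simulated {simulated:.2%}, "
              f"extraordinary on at least one: {1 - exact:.2%}")
```

Raising the number of traits in the loop shows how quickly the share of completely ordinary people falls away.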
So other than making us feel better about ourselves, how does this information influence the data on our health? First, in the world of Big Data, researchers have to understand that outliers are to be expected, which is the direct result of the math above. But the truly savvy researcher should understand that “completely average” data is also unlikely and that real data should have few or no Emmetts. This brings us to the concept of an inlier. An inlier can be thought of as the opposite of an outlier in that it is a data point that is close to the center (or in this case, the average) of a set of data. More detailed information on multivariate inliers can be found here and here. The idea is to identify extraordinary data not by looking in the extremes but by looking at data that is considered typical across many different dimensions.
So what can measuring inliers do for the study of health? A great application of this concept is discussed by Richard Zink from the JMP® Division of SAS® Institute in his book Risk-Based Monitoring and Fraud Detection in Clinical Trials Using JMP® and SAS®. Imagine that someone is committing fraud (in Zink’s case, fraud in clinical trials). One way to do it would be to fake results that fly under the radar, that is, results that are completely typical in every way conceivable. But as we have seen, being completely typical can be as rare as being completely extraordinary. If an entire block of data has values that are all close to average across many characteristics, then that block should be held back for additional scrutiny. Measuring for inliers in such data gives us an opportunity to expose this kind of too-good-to-be-true data.
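To make the idea concrete, here is a rough Python sketch of one way to screen for inliers. It is not Zink's method or anything from JMP or SAS; it is a simplified illustration under stated assumptions: the traits are treated as independent, each is standardized against a reference set of records, and a record is flagged when its summed squared z-score is smaller than almost all of the reference records. The function names, the 1% cutoff, and the simulated "genuine" and "fabricated" data are all hypothetical.

```python
import random
import statistics


def squared_distance_from_center(record, means, sds):
    """Sum of squared z-scores: how far a record sits from the average
    across all traits at once (assumes traits are independent)."""
    return sum(((x - m) / s) ** 2 for x, m, s in zip(record, means, sds))


def flag_inliers(records, reference, lower_quantile=0.01):
    """Return indices of records that sit closer to the reference center
    than roughly the closest lower_quantile share of reference records."""
    columns = list(zip(*reference))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    ref_distances = sorted(
        squared_distance_from_center(r, means, sds) for r in reference
    )
    threshold = ref_distances[int(lower_quantile * len(ref_distances))]
    return [
        i for i, r in enumerate(records)
        if squared_distance_from_center(r, means, sds) < threshold
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    n_traits = 20
    # Simulated stand-ins: genuine records vary naturally on every trait,
    # while fabricated records hug the average on every trait.
    genuine = [[rng.gauss(0, 1) for _ in range(n_traits)] for _ in range(2000)]
    fabricated = [[rng.gauss(0, 0.3) for _ in range(n_traits)] for _ in range(10)]
    flagged = flag_inliers(fabricated, genuine)
    print(f"Flagged {len(flagged)} of {len(fabricated)} fabricated records")
```

Real clinical data would need more care than this, since measurements are often correlated and a proper multivariate distance would account for that. The sketch only shows the logic: records that are close to average on every dimension at once stand out as clearly as records that are extreme.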
It is a small but powerful idea that hopefully changes the way you think about your expectations from data collected across multiple dimensions.
That’s all for today. Join us again next month to explore more big ideas in data and health. In the meantime, follow me on Twitter @DrJasonBrinkley and feel free to let me know your favorite children’s cartoon movie, which can serve as an alternative to keep me from hearing about how everything is awesome for the ten-thousandth time.
Jason S. Brinkley, PhD, MS, MA is a Senior Researcher and Biostatistician at the American Institutes for Research where he works on a wide variety of data for health services, policy, and disparities research. He maintains a research affiliation with the North Carolina Agromedicine Institute and serves on the executive committee for the NC Chapter of the American Statistical Association and the Southeast SAS Users Group.