## Think You Aren’t Extraordinary? Odds Are You’re Wrong

###### by Jason S. Brinkley, PhD, MA, MS

On the Brink addresses topics related to data, analytics, and visualizations on personal health and public health research. This column explores current practices in the health arena and how both the data and mathematical sciences have an impact. (The opinions and views represented here are the author’s own and do not reflect any group for which the author has an association.)

Cartoon movies are a staple at the Brinkley household where television viewing is dominated by animated adventures of every variety. A favorite is *The Lego Movie,* which chronicles the tale of Emmett whose ordinariness makes him the perfect champion to defeat the diabolical villain. The question for today is whether Emmett really exists. And, as it turns out, your grandmother may have been right: it is highly likely that you are a unique and special snowflake. I can prove it with a little math. More important, if we are all snowflakes, then what does that say about the data regarding our health? If you’ve read the Malcolm Gladwell book *Outliers,* then you are probably familiar with the concept of individuals (or data points) that are extraordinary in comparison to everyone else. But have you heard of an *inlier*? Probably not, although they can be just as extraordinary.

Almost everyone is “typical” on a specific trait. Take height, for instance. We measure percentiles to gauge where people fall with respect to some larger population, and we say that if your height is in the 90th percentile for US citizens, then you are as tall as or taller than 90% of Americans. Some would say that being in the 90th percentile makes you extraordinary, but for our purposes, let’s say that if you fall in the bottom 90%, then you are “ordinary.” Now let’s get mathy and suppose we talk about two independent traits (things that really should have no general association), such as your height and your IQ. Being “ordinary” on either trait means that you are of ordinary height or of ordinary IQ (90th percentile or less). Probability tells us that the chance that you are “ordinary” on both height and IQ is the product of the individual independent probabilities, so 0.90 × 0.90 = 0.81 or 81%.

81% is still high, so let’s continue. Suppose instead we measured 20 different independent characteristics. The chance you are ordinary on all 20 characteristics (ie, 90th percentile or less) is 0.90 × 0.90 × … × 0.90 = (0.90)^20 ≈ 0.12 or 12%. Only 12% of individuals would be completely ordinary on all 20 dimensions, which means 88% of individuals are extraordinary (beyond the 90th percentile) in at least one thing. Simply put, math tells us that if we measure enough independent factors, the chance that anyone is indeed a completely ordinary Emmett shrinks toward 0. Put another way, if you look hard enough, everyone is an outlier on something.
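For readers who want to check the arithmetic, here is a minimal Python sketch. It assumes the traits are fully independent and that each percentile is uniformly distributed, as in the argument above. The function names and the simulation settings (number of simulated people, random seed) are my own choices for illustration.

```python
import random

def prob_all_ordinary(n_traits, cutoff=0.90):
    """Chance of being at or below the cutoff percentile on all n independent traits."""
    return cutoff ** n_traits

def simulate_all_ordinary(n_traits, n_people=20_000, cutoff=0.90, seed=0):
    """Estimate the same chance by drawing a uniform percentile for each trait.

    Assumption: traits are independent, so each draw is a separate random number.
    """
    rng = random.Random(seed)
    ordinary = sum(
        all(rng.random() < cutoff for _ in range(n_traits))
        for _ in range(n_people)
    )
    return ordinary / n_people

# Compare the closed-form product with the simulated share of "Emmetts".
for n in (1, 2, 20, 50):
    print(n, round(prob_all_ordinary(n), 3), round(simulate_all_ordinary(n), 3))
```

The exact and simulated columns should track each other closely, and both fall quickly as the number of traits grows.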

So other than making us feel better about ourselves, how does this information influence the data on our health? First, in the world of Big Data, researchers have to understand that outliers are to be expected, which is the direct result of the math above. But the truly savvy researcher should understand that “completely average” data is also unlikely and that real data should have few or no Emmetts. This brings us to the concept of an *inlier*. An inlier can be thought of as the opposite of an outlier in that it is a data point that is close to the center (or in this case, the average) of a set of data. More detailed information on multivariate inliers can be found here and here. The idea is to identify extraordinary data not by looking in the extremes but by looking at data that is considered typical across many different dimensions.

So what can measuring inliers do for the study of health? A great application of this concept is discussed by Richard Zink from the JMP® Division of SAS® Institute in his book *Risk-Based Monitoring and Fraud Detection in Clinical Trials Using JMP® and SAS®*. Imagine that someone is committing fraud (in Zink’s case, fraud in clinical trials). Then one method would be to fake results that fly under the radar. That is to say, they are completely typical in every way conceivable. But as we have seen, completely typical can be as rare as being completely extraordinary, and the idea here is that if an entire block of data has values that are all close to average across many characteristics, then this should be held back for additional scrutiny. Measuring for inliers in such data gives us an opportunity to expose this kind of too-good-to-be-true data.
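To make the idea concrete, here is a rough Python sketch of one way to flag inliers. It is not the method from Zink's book. It assumes each trait has already been converted to a z-score, that the traits are independent and roughly normal, and that "close to average" means within half a standard deviation. The band width, the function names, and the two toy records are all invented for illustration.

```python
from statistics import NormalDist

def chance_all_near_center(n_traits, band=0.5):
    """Probability that n independent standard-normal traits all fall within +/- band."""
    std_normal = NormalDist()
    p_one = std_normal.cdf(band) - std_normal.cdf(-band)
    return p_one ** n_traits

def is_inlier(z_scores, band=0.5):
    """True when every trait sits inside the central band (hypothetical rule)."""
    return all(abs(z) < band for z in z_scores)

# Toy z-scores for two made-up records, ten traits each.
records = {
    "A": [0.1, -0.2, 0.05, 0.3, -0.1, 0.2, -0.3, 0.0, 0.15, -0.25],
    "B": [1.4, -0.2, 0.6, -1.1, 0.3, 2.0, -0.7, 0.1, -1.6, 0.9],
}

flagged = [name for name, z in records.items() if is_inlier(z)]
print("Flagged for review:", flagged)
print("Chance of this under independence:", chance_all_near_center(10))
```

Under these assumptions, a single record that is near average on all ten traits is already rare, so a whole block of them would deserve a closer look. With correlated traits the chance is higher, and a distance measure that accounts for correlation would be a better fit.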

It is a small but powerful idea that hopefully changes the way you think about your expectations from data collected across multiple dimensions.

That’s all for today. Join us again next month to explore more big ideas in data and health. In the meantime, follow me on Twitter @DrJasonBrinkley and feel free to let me know your favorite children’s cartoon movie, which can serve as an alternative to keep me from hearing about how everything is awesome for the ten-thousandth time.

**Jason S. Brinkley, PhD, MS, MA** is a Senior Researcher and Biostatistician at Abt Associates Inc. where he works on a wide variety of data for health services, policy, and disparities research. He maintains a research affiliation with the North Carolina Agromedicine Institute and serves on the executive committee for the NC Chapter of the American Statistical Association and the Southeast SAS Users Group. Follow him on Twitter.

Rick – Great comments. Thanks for sharing. I agree that in the real world many of these traits are rarely independent. I think that only slows the problem down but doesn’t solve it. As we measure more and more traits, unless they are all perfectly correlated, eventually there are distinct dimensions that act as ‘independent’ of one another. So an individual of average height may indeed be of average weight, BMI, and so on, indicating they are of typical ‘size’. Or they may be of average IQ, which means they may have average standardized test scores or other indicators of, say, ‘intelligence’. The underlying idea here is that ‘size’ and ‘intelligence’ are the latent traits that all of these other things are really trying to measure. I contend that the more unique things we measure, the more the number of latent traits grows. So I might have 100 metrics that represent only 20 independent dimensions. Then the likelihood that a person is ‘typical’ in all 20 independent dimensions should be small. In health, examples of data that measure a lot of characteristics are electronic health records and genetic data, both of which have a lot of related quantities but are known to have some independent dimensions.

An interesting article. This phenomenon is known as the “curse of dimensionality”: https://en.wikipedia.org/wiki/Curse_of_dimensionality One implication is that in a high-dimensional cube (called a hypercube), most of the volume of the cube is near the corners of the cube. Therefore a subject with many independent (uniformly distributed) traits tends to be near a corner of the cube, not near the middle, which is what you have shown.

In real life, traits are rarely independent, and in correlated data the phenomenon is less pronounced. In strongly correlated data, the average distance to the population mean is smaller than for uncorrelated data. Thus Emmett might be alive and well in correlated data... and might have many inlier friends!