Are We Asking Too Much From Surveys?
by Jason S. Brinkley, PhD, MA, MS
On the Brink addresses topics related to data, analytics, and visualizations on personal health and public health research. This column explores current practices in the health arena and how both the data and mathematical sciences have an impact.
Population-based surveys have become a staple for research into the socioeconomics, behaviors, and general health of the United States.
Much of what we know about population trends comes from the routine collection of these data, and a lot of funding and personnel are devoted to their gathering and accurate reporting. Surveys such as the National Health and Nutrition Examination Survey have been instrumental in showing the obesity and diabetes epidemics currently gripping both older retirees and our children.
These studies are designed to provide a big-picture snapshot and, in many cases, a large sample is gathered so that we can perform some deeper analyses and make comparisons across different population groups. Those deep dives are vital for exploring potential disparities, but is it possible to go too far?
I’ve been thinking about this for a while. Data collection is omnipresent in the modern industrialized world. But can researchers ask too much of data in an attempt to force answers to questions way beyond the scope of the original surveys? I experienced this issue firsthand last fall when the Brinkley household was expecting its newest bundle of joy.
My wife and I were married in 2004, and then we had three delightful (and very loud) boys. They were all very excited to learn that the Brinkley family would add a new baby in the fall of 2015. Our friends and family responded to the news with a mix of congratulations and “Y’all really want to have a girl, don’t you?” (My people are from the South.)
After hearing the gender question for what seemed like the 1000th time, I began to wonder if the gender of the baby had already been decided, statistically. Would having three sons in a row actually increase the likelihood that our fourth child would be a boy? Maybe genetics trumps basic 50/50 logic, and there is something to the idea that certain genders “run” in specific families.
Interestingly, I found a website, ingender.com, where someone had already tried to work out that problem. You can find her write-up here. She had data from the National Longitudinal Surveys (NLS), which has a 40-year record of providing data on economic and life-changing events among a representative sample of US citizens. According to the ingender.com data, the NLS says 43.6 percent of families with three sons in a row gave birth to a girl on try number four.
Like most people, I was happy to see that someone else had asked the same question. Unlike most people, I saw that the data were available for download and my interest was further piqued. “Look honey, they have data!” I exclaimed to my wife, which led to a goodnight kiss on the cheek and a request to not stay up too late.
I pulled the data into my favorite statistical analysis software and started drilling down. To me, it wasn’t enough to just replicate the website’s result. Why look at all families when I could dig deeper and look at families in my specific age and race group? Wouldn’t that provide me with a more reliable estimate of what may happen to my family? What if the overall percentage had come from examining families that aren’t really like my family at all?
An hour of digging led to a deeper result. In a large sample of more than 6,000 families, only 31 were white couples aged 30-40 with four children whose first three were sons. Of those, 14 (or 45.2 percent) had a girl on child number four.
At that point, I had moved well beyond the original purposes of the survey. The study designers likely never intended for me to take the data to such an extreme to try for a specific answer. While many would agree that the 45.2 percent estimate, resting on just 31 families, is not very reliable, there is a way to represent the result in a manner that could be interpreted as legitimate.
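To get a feel for how shaky an estimate built on 31 families is, here is a minimal Python sketch that puts a 95 percent confidence interval around 14 out of 31. The Wilson score interval is my choice for illustration, not necessarily what I or the website ran at the time, and the function name is made up for this example. It also treats the 31 families as a simple random sample, which ignores the survey's design and weights.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion (illustrative helper)."""
    # Two-sided critical value from the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts from the column: 14 girls among 31 fourth births.
low, high = wilson_interval(14, 31)
print(f"Point estimate: {14 / 31:.1%}")
print(f"95% interval: {low:.1%} to {high:.1%}")
```

Under those assumptions the interval runs from roughly 29 to 62 percent, so the data are about as comfortable with a clear tilt toward girls as with a clear tilt toward boys.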
And therein lies the problem with misusing Big Data. While it seems ludicrous to make family planning decisions based on these results, can we be certain there haven’t been any health or social policies determined from this type of deep dive into population-based surveys? Researchers have a responsibility to explore the validity and methodology of results from this type of research and request clarity and transparency in how different estimates are formed.
But, in essence, I really did have my answer, from a certain point of view. There was no evidence to suggest a fourth son was a foregone conclusion, and that was the whole point of my research. So an argument in favor of the use of such data is that having an estimate based on data rather than mere speculation is better and may potentially provide a good starting point for additional research.
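One way to make "no evidence" concrete is an exact binomial test of the 14-in-31 result against basic 50/50 logic. The sketch below is my own illustration rather than the analysis actually run that night. It assumes each birth is an independent coin flip with a 0.5 chance of a girl, and the function name is hypothetical.

```python
from math import comb


def exact_binomial_p_value(successes, n, p_null=0.5):
    """Two-sided exact binomial test (illustrative helper).

    Adds up the probability of every outcome that is no more likely
    under the null hypothesis than the outcome actually observed.
    """
    def pmf(k):
        return comb(n, k) * p_null**k * (1 - p_null) ** (n - k)

    observed = pmf(successes)
    # Small relative tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * (1 + 1e-9))


# Counts from the column: 14 girls among 31 fourth births.
p_value = exact_binomial_p_value(14, 31)
print(f"Two-sided p-value against 50/50: {p_value:.2f}")
```

A p-value this large (around 0.7) says that 14 girls in 31 births is entirely ordinary if the odds really are even. With so few families, though, the test would also struggle to detect a modest real tilt, which is the same caution as before in different clothing.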
So how did the story end for the Brinkleys? Elizabeth Ann Brinkley was born two days before Thanksgiving 2015. Affectionately referred to as Ellie Bean the Mean Machine, she is spoiled by the attention lavished upon her by her three brothers, and the Brinkley home is (sadly) as loud as ever.
For a recent family photo and more about me, follow me on Twitter @DrJasonBrinkley.
Jason S. Brinkley, PhD, MS, MA is a Senior Researcher and Biostatistician at the American Institutes for Research where he works on a wide variety of data for health services, policy, and disparities research. He maintains a research affiliation with the North Carolina Agromedicine Institute and serves on the executive committee for the NC Chapter of the American Statistical Association and the Southeast SAS Users Group.