Text Mining UFO Data: Little Green Aliens or Santa’s Elves?
by Jason S. Brinkley, PhD, MA, MS
On the Brink addresses topics related to data, analytics, and visualizations on personal health and public health research. This column explores current practices in the health arena and how both the data and mathematical sciences have an impact. (The opinions and views represented here are the author’s own and do not reflect any group for which the author has an association.)
Context is an extremely important part of understanding data collected from all areas of research. Without an understanding of where data come from and how data are created, we run the risk of drawing wrong conclusions and making very bad decisions. The use of unstructured text for data analysis is a growing frontier of research and certainly one where context is key. Today’s special holiday edition of On the Brink shows how misinterpreting results by taking data out of context can lead to unique and sometimes hilarious conclusions.
Several years ago, I was fortunate to teach a short course titled Data-Driven Decision Making to a group of very advanced high school students as part of a program that brought these students to college campuses in the summer for special opportunities. The novel curriculum was developed and piloted as a 10-day short course to immerse students in the world of data science and statistical analysis. You can find a white paper I wrote on the whole program here.
One day, the students and I tackled some basics of text analytics and language processing. In trying to find useful and fun data sources to play with, I turned to the National UFO Reporting Center. The students and I spent an afternoon with some easy-to-use tools to aggregate and explore the unstructured text of UFO sightings. Most of the insights were not surprising; for example, most reports involved lights in the sky. Typically those lights were white, yellow, or red; rarely were they listed as green, blue, or purple. My lesson plan had involved looking mostly at national data and then comparing it with data from the state of Texas to see whether there were differences in the common key words and phrases used to describe UFO sightings.
We moved on to an open time block where the class and I decided as a group to explore some new area of the data, and several students asked to compare data from California to the data from Texas. Both states are large and have numerous UFO sightings, and students wondered whether the sightings used similar key words. In looking at the California data, we found the word “Santa” near the top of the list of common terms. Hilarity ensued as twenty 16-year-olds began a series of jokes about Californians seeing Santa Claus in the sky. Maybe those little green men are just Santa’s elves? After a few moments, one young lady raised her hand and shifted the whole conversation. She said the words “Santa Barbara” and recognition immediately came to all the other students. Not being California natives, this group of North Carolina students had missed the context in which the word “Santa” appeared. It was not Santa Claus but Santa Barbara, Santa Cruz, or Santa Monica.
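The kind of check that resolves the “Santa” puzzle can be sketched in a few lines of Python: count single words first, then look at which word follows the surprising term. This is only an illustration under stated assumptions. The four reports below are invented for the example and are not taken from the National UFO Reporting Center, and the helper names `tokenize` and `following_words` are hypothetical, not the tools the class used.

```python
import re
from collections import Counter

# Invented example reports for illustration only;
# these are NOT real National UFO Reporting Center records.
reports = [
    "Bright white light hovering over Santa Barbara",
    "Red lights moving fast near Santa Cruz",
    "White light seen from the Santa Monica pier",
    "Yellow light over Santa Barbara harbor",
]


def tokenize(text):
    """Lowercase a report and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def following_words(docs, target):
    """Count the word that comes immediately after each occurrence of target."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        # Pair every token with its successor to inspect two-word phrases.
        for first, second in zip(tokens, tokens[1:]):
            if first == target:
                counts[second] += 1
    return counts


# Single-word counts: "santa" surfaces with no hint of what it refers to.
word_counts = Counter(tok for doc in reports for tok in tokenize(doc))
print(word_counts.most_common(3))

# Words that follow "santa": the place names supply the missing context.
print(following_words(reports, "santa").most_common())
```

Counting single words strips away the neighbors that give a term its meaning, so looking at the adjacent word (or at a few full reports containing the term) is a cheap way to recover context before interpreting a surprising result.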
As an educator, I found it a great experience to watch students make the connection and learn how their first inclinations were very wrong. As a researcher, I keep the story in mind quite often when I am looking at something unusual or counterintuitive. When we rely on our own subject matter expertise and previous experiences to interpret results, perspective and context are extremely important.
That’s all for today. On the Brink will be back in 2018 with more fresh perspectives and context on topics in data and health. If you have suggestions or ideas for topics to discuss, then find me on Twitter @DrJasonBrinkley.
Jason S. Brinkley, PhD, MS, MA is a Senior Researcher and Biostatistician at Abt Associates Inc. where he works on a wide variety of data for health services, policy, and disparities research. He maintains a research affiliation with the North Carolina Agromedicine Institute and serves on the executive committee for the NC Chapter of the American Statistical Association and the Southeast SAS Users Group. Follow him on Twitter. [Full Bio]
Previous posts by this author:
- Should You Know Your Doctor’s Home Address?
- The Population Bullet
- The Unknown Unknowns of Missing Data
- Communicating Science–More Than Just Good Words?
- Counting Alabamas
- The Third World in Your Own Backyard
- The Unrealistic Gold Standard
- Does MACRA Signal the Beginning of the End for Medicare Claims Data?
- Think You Aren’t Extraordinary? Odds Are You’re Wrong
- Mapping by Words
- Are We Asking Too Much From Surveys?
- Making Better Comparisons
- What Kills Us?