About a year ago, I started listening to a podcast recommended to me by my office mate, Jan. We share an interest in machine learning, and one of the shows he suggested was Linear Digressions. I pretty much instantly fell in love with this podcast, as it discusses practical, relatable applications of and stories about machine learning.
One episode in particular, though, has stayed with me ever since I listened to it. Podcast hosts Katie and Ben discussed outliers in data: Outliers are values that seem to fall outside of the general range of values that you expect or find in your dataset.
As a researcher, now and again I am confronted with outliers in my data as well, and in my own education I have been taught (like many others) that outliers are generally something to get rid of as soon as possible, since they may violate assumptions of some of the most-used analyses in social sciences (rendering them inappropriate for your data).
In Linear Digressions, though, Katie discussed a fascinating story about why outliers don’t always deserve the flak that they catch. It’s true; these strange values might be caused by typing or measurement errors. Sometimes, however, these oddballs can tell us so much about the world out there, the one we’re trying to study. In fact, in Katie’s story, the outlier helped solve a public health crisis in 19th-century London! For anyone who is intrigued, I highly recommend the episode. It’s as informative about data as it is about 19th-century history, actually.
This opened my eyes to how fascinating outliers in data can really be, and actually made me wish for more outliers in our data! Their potential for important new insights has captivated my imagination ever since.
I hope that such stories can make more people aware of the beauty of outliers. I can’t help but draw a parallel between societies and data: in both cases, outliers deserve much more care and attention than they have been getting.