When we have good graphs and statistical aids, thinking statistically can feel straightforward and intuitive. Clear charts can help us tell a story, can help us visualize trends and relationships, and can help us better conceptualize risk and probability. However, understanding data is hard, especially if the way that data is collected creates statistical artifacts.
Yesterday’s post was about extreme outcomes, and how it is the smallest counties in the United States where we see both the highest and the lowest per capita rates of cancer. Small populations allow for large fluctuations in per capita cancer diagnoses, and thus extreme outcomes in cancer rates. We could graph the per capita rates, plot them on a map of the United States, or present the data in unique ways, but all we would really be doing is creating a visual aid influenced by statistical artifacts from the samples we used. As Daniel Kahneman explains in his book Thinking, Fast and Slow, “the differences between dense and rural counties do not really count as facts: they are what scientists call artifacts, observations that are produced entirely by some aspect of the method of research – in this case, by differences in sample size.”
Counties in the United States vary dramatically. Some counties are geographically huge, while others are pretty small. Nevada is a large state, over 110,000 square miles in area, with only 17 counties, while West Virginia fits 55 counties into under 25,000 square miles of land. Across the US, some counties are exclusively within metropolitan areas, some are completely within suburbs, some are entirely rural with only a few hundred people, and some manage to incorporate major metros, expansive suburbs, and vast rural stretches (shoutout to Clark County, NV). Counties are convenient units for collecting data, but they can cause problems when analyzing population trends across the country. The variations in size and other factors create the possibility for the extreme outcomes we see in things like cancer rates across counties. When smoothed out over larger populations, the disparities in cancer rates disappear.
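To see how sample size alone can produce these extremes, here is a minimal simulation sketch in Python. It assumes every person in every county faces exactly the same underlying risk. The `TRUE_RATE`, the county populations, and the number of counties are made-up values for illustration, not real cancer data.

```python
import random
import statistics


def simulate_county_rates(population, n_counties, true_rate, rng):
    """Return the observed per capita rate for each simulated county.

    Every resident independently becomes a "case" with probability
    true_rate, so all counties share the same underlying risk.
    """
    rates = []
    for _ in range(n_counties):
        cases = sum(1 for _ in range(population) if rng.random() < true_rate)
        rates.append(cases / population)
    return rates


# Hypothetical values chosen for illustration only.
TRUE_RATE = 0.005          # same underlying risk everywhere
SMALL_POPULATION = 500     # a small rural county
LARGE_POPULATION = 20_000  # a larger county
N_COUNTIES = 100

rng = random.Random(0)  # fixed seed so the run is repeatable
small = simulate_county_rates(SMALL_POPULATION, N_COUNTIES, TRUE_RATE, rng)
large = simulate_county_rates(LARGE_POPULATION, N_COUNTIES, TRUE_RATE, rng)

for label, rates in (("small", small), ("large", large)):
    print(
        f"{label} counties: min={min(rates):.4f} "
        f"max={max(rates):.4f} stdev={statistics.stdev(rates):.4f}"
    )
```

Because every simulated person faces the same risk, any gap between counties here comes purely from chance. The small counties should show both the lowest and the highest observed rates, while the large counties cluster tightly around the true rate.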
Most of us are not collecting lots of important data for analysis each day. Most of us probably don’t have to worry too much on a day-to-day basis about some important statistical sampling problem. But we should at least be aware of how complex information is, and how difficult it can be to display and share information accurately. We should turn to people like Tim Harford for help interpreting and understanding complex statistics when we can, and we should look for factors that might interfere with a convenient conclusion before we simply believe what we would like to believe about a set of data. Statistical artifacts can play a huge role in shaping the way we understand a particular phenomenon, and we shouldn’t jump to extreme conclusions based on poor data.