From big tech companies, sci-fi movies, and policy entrepreneurs data mining is presented as a solution to many of our problems. With traffic apps collecting mountains of movement data, governments collecting vast amounts of tax data, and heath-tech companies collecting data for every step we take, the promise of data mining is that our sci-fi fantasies will be realized here on earth in the coming years. However, data mining is only a first step on a long road to the development of real knowledge that will make our world a better place. The data alone is interesting and our computing power to work with big data is astounding, but data mining can’t give us answers, only interesting correlations and statistics.
In The Book of Why Judea Pearl writes:
“It’s easy to understand why some people would see data mining as the finish rather than the first step. It promises a solution using available technology. It saves us, as well as future machines, the work of having to consider and articulate substantive assumptions about how the world operates. In some fields our knowledge may be in such an embryonic state that we have no clue how to begin drawing a model of the world. But big data will not solve this problem. The most important part of the answer must come from such a model, whether sketched by us or hypothesized and fine-tuned by machines.”
Big data can give us insights and help us identify unexpected correlations and associations, but identifying unexpected correlations and associations doesn’t actually tell us what is causing the observations we make. The messaging of massive data mining is that we will suddenly understand the world and make it a better place. The reality is that we have to develop hypotheses about how the world works based on causal understandings of the interactions between various factors of reality. This is crucial or we won’t be able to take meaningful action based what comes from our data mining. Without developing causal hypotheses we cannot experiment with associations and continue to learn, we can only observe what correlations come from big data. Using the vast amounts of data we are collecting is important, but we have to have a goal to work toward and a causal hypothesis of how we can reach that goal in order for data mining to be meaningful.