Most of us probably don’t work with huge data sets, but all of us contribute to huge data sets. We know the world of big data is out there, and we know people are working with big data, but there are not many of us who truly know what it means and how we should think about any of it. In The Book of Why, Judea Pearl argues that even many of those doing research and running companies based on big data don’t fully understand what it all means.
Pearl is critical of researchers and entrepreneurs who lack causal understanding but pursue new knowledge by pulling correlations and statistics out of large data sets. Some companies are taking advantage of the fact that huge amounts of computing power can give us insights into data sets that we could never have generated before. However, these insights are not always as meaningful as we are led to believe.
Pearl writes, “The hope – and at present, it is usually a silent one – is that the data themselves will guide us to the right answers whenever causal questions come up.”
My last post was about the overuse of the phrase “correlation is not causation.” Finding correlations and relationships in data is meaningless if we don’t also have causal understandings in mind. This is the critique that Pearl makes with the quote above. If we don’t have a way of understanding basic causal structures, then the phrase is right: correlations don’t mean anything. Many companies and researchers are at a stage where they are finding correlations and unexpected statistical results in big data, but they lack the causal understanding to do anything meaningful with them. In the world of public policy this feels like the saying “a solution in search of a problem,” or in the world of healthcare like a pay-and-chase scenario.
Pearl argues throughout the book that we are better at identifying causal structures than we are led to believe in our statistics courses. He also argues that understanding causality is key to unlocking the potential of big data and actually getting something useful out of massive data sets. Without a grounding in causality, we are wasting our time with the statistical research we do. We are running around with solutions in the form of big data correlations that don’t have a causal underpinning. It is as if we are paying fraudulent claims, then chasing down some of the money we spent and congratulating ourselves on preventing fraud. The end result is a poor use of data that we prop up as a grand solution.