Data Mining is a First Step

From big tech companies, sci-fi movies, and policy entrepreneurs, data mining is presented as a solution to many of our problems. With traffic apps collecting mountains of movement data, governments collecting vast amounts of tax data, and health-tech companies collecting data for every step we take, the promise of data mining is that our sci-fi fantasies will be realized here on earth in the coming years. However, data mining is only a first step on a long road to the development of real knowledge that will make our world a better place. The data alone is interesting, and our computing power to work with big data is astounding, but data mining can’t give us answers, only interesting correlations and statistics.
In The Book of Why Judea Pearl writes:
“It’s easy to understand why some people would see data mining as the finish rather than the first step. It promises a solution using available technology. It saves us, as well as future machines, the work of having to consider and articulate substantive assumptions about how the world operates. In some fields our knowledge may be in such an embryonic state that we have no clue how to begin drawing a model of the world. But big data will not solve this problem. The most important part of the answer must come from such a model, whether sketched by us or hypothesized and fine-tuned by machines.”
Big data can give us insights and help us identify unexpected correlations and associations, but identifying unexpected correlations and associations doesn’t actually tell us what is causing the observations we make. The messaging around massive data mining is that we will suddenly understand the world and make it a better place. The reality is that we have to develop hypotheses about how the world works based on causal understandings of the interactions between various factors of reality. This is crucial, because otherwise we won’t be able to take meaningful action based on what comes from our data mining. Without developing causal hypotheses we cannot experiment with associations and continue to learn; we can only observe what correlations come from big data. Using the vast amounts of data we are collecting is important, but for data mining to be meaningful we have to have a goal to work toward and a causal hypothesis of how we can reach that goal.
Complex Causation Continued

Our brains are good at interpreting and detecting causal structures, but often the real causal structures at play are more complicated than what we can easily see. A causal chain may include a mediator, such as citrus fruit providing vitamin C to prevent scurvy. A causal chain may involve a complex mediator interaction, as in the example from my last post where a drug leads the body to create an enzyme that then works with the drug to be effective. Additionally, causal chains can be long-term affairs.
In The Book of Why Judea Pearl discusses long-term causal chains writing, “how can you sort out the causal effect of treatment when it may occur in many stages and the intermediate variables (which you might want to use as controls) depend on earlier stages of treatment?”
This is an important question within medicine and occupational safety. Pearl writes about the fact that factory workers are often exposed to chemicals over a long period, not just in a single instance. If it is repeated exposure to chemicals that causes cancer or another disease, how do you pin that on the individual exposures themselves? Was the individual safe with 50 exposures, only to develop cancer as soon as a 51st exposure occurred? A link between long-term chemical exposure and increased cancer risk seems pretty obvious to us, but the actual causal mechanism in this situation is a bit hazy.
The same can apply in the other direction within medicine. Some cancer drugs or immune therapy treatments work for a long time and then stop working, or require changes in combinations based on how the disease has progressed or how side effects have manifested. Additionally, as we have all learned over the past year with vaccines, some medical treatments work better with boosters or time-delayed components. Thinking about causality in these kinds of situations is difficult because the differing time scales and combinations make it hard to understand exactly what is affecting what, and when. I don’t have any deep answers or insights into these questions, but I highlight them to again demonstrate complex causation and how much work our minds must do to fully understand a causal chain.
Complex Causation

In linear causal models the total effect of an action equals the sum of its direct effect and its indirect effect. We can think of an oversimplified anti-tobacco public health campaign to conceptualize this equation. A campaign could be developed to use famous celebrities in advertisements against smoking. This approach may have a direct effect on teen smoking rates if teens see the advertisements and decide not to smoke as a result of the influential messaging from their favorite celebrity. This approach may also have indirect effects. Imagine a teen who didn’t see the advertising, but whose best friend did see it. If the best friend was influenced, then the teen may adopt their friend’s anti-smoking stance. This would be an indirect effect of the advertising campaign in the positive direction. The total effect of the campaign would then be the kids who were directly deterred from smoking combined with those who didn’t smoke because their friends were deterred.
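In a purely linear model this additivity can be checked directly. Below is a minimal sketch of the campaign example as toy structural equations; the variable names and coefficients are invented for illustration, not taken from any real study:

```python
# Hypothetical linear structural model of the anti-smoking campaign:
# campaign -> teen attitude (direct), campaign -> friend -> teen (indirect).

def friend_stance(campaign, b=0.5):
    """Friend's anti-smoking stance rises with campaign exposure (coefficient b)."""
    return b * campaign

def teen_attitude(campaign, friend, a=0.3, c=0.4):
    """Teen's anti-smoking attitude: direct effect a, plus friend's influence c."""
    return a * campaign + c * friend

baseline = teen_attitude(0, friend_stance(0))

# Total effect: raise campaign exposure and let the friend respond as well.
total = teen_attitude(1, friend_stance(1)) - baseline

# Direct effect: raise campaign exposure while holding the friend at baseline.
direct = teen_attitude(1, friend_stance(0)) - baseline

# Indirect effect: hold campaign at baseline, but give the friend the exposed value.
indirect = teen_attitude(0, friend_stance(1)) - baseline

# In a linear model the identity holds exactly: total = direct + indirect.
assert abs(total - (direct + indirect)) < 1e-9
```

Because every relationship here is linear, the decomposition is exact: the total effect (0.5) splits cleanly into a direct piece (0.3) and an indirect piece through the friend (0.2).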
However, linear causal models don’t capture all of the complexity that can exist within causal models. As Judea Pearl explains in The Book of Why, there are complex causal models where the equation I started this post with doesn’t hold. Pearl uses a drug that treats a disease as an example of a situation where the direct and indirect effects of the drug don’t add up to the total effect. In situations where a drug causes the body to release an enzyme that then combines with the drug to treat the disease, we have to think beyond the equation above. In this case, he writes, “the total effect is positive but the direct and indirect effects are zero.”
The drug itself doesn’t do anything to combat the disease. It stimulates the release of an enzyme and without that enzyme the drug is ineffective against the disease. The enzyme also doesn’t have a direct effect on the disease. The enzyme is only useful when combined with the drug, so there is no indirect effect that can be measured as a result of the original drug being introduced. The effect is mediated between the interaction of both the drug and enzyme together. In the model Pearl shows us, there is only the mediating effect, not a direct or indirect effect.
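Pearl’s drug-enzyme scenario can be sketched in code. The binary functional forms below are my own illustrative assumptions rather than Pearl’s model, but they reproduce his point: the total effect is positive while the direct and indirect effects are both zero:

```python
# Toy structural model: the drug triggers the enzyme, and recovery
# occurs only when the drug and the enzyme are present together.

def enzyme(drug):
    return 1 if drug else 0  # enzyme is released only when the drug is present

def recovery(drug, enz):
    return 1 if (drug and enz) else 0  # treatment works only via the interaction

no_treatment = recovery(0, enzyme(0))

# Total effect: give the drug and let the enzyme respond naturally.
total = recovery(1, enzyme(1)) - no_treatment     # positive

# Direct effect: give the drug but hold the enzyme at its no-drug level.
direct = recovery(1, enzyme(0)) - no_treatment    # zero

# Indirect effect: withhold the drug but set the enzyme to its with-drug level.
indirect = recovery(0, enzyme(1)) - no_treatment  # zero

assert (total, direct, indirect) == (1, 0, 0)
```

Neither pathway works alone, so each effect measured in isolation vanishes, yet giving the drug still cures the patient.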
This model helps us see just how complicated ideas and conceptions of causation are. Most of the time we think about direct effects, and we don’t always get to thinking about indirect effects combined with direct effects. Good scientific studies are able to capture the direct and indirect effects, but to truly understand causation today, we have to be able to include mediating effects in complex causation models like the one Pearl describes.
Causal Illusions

In The Book of Why Judea Pearl writes, “our brains are not wired to do probability problems, but they are wired to do causal problems. And this causal wiring produces systematic probabilistic mistakes, like optical illusions.” This can create problems for us when no causal link exists and when data correlate without any causal connections between outcomes. According to Pearl, our causal thinking, “neglects to account for the process by which observations are selected.” We don’t always realize that we are taking a sample, that our sample could be biased, and that structural factors independent of the phenomenon we are trying to observe could greatly impact the observations we actually make.
Pearl continues, “We live our lives as if the common cause principle were true. Whenever we see patterns, we look for a causal explanation. In fact, we hunger for an explanation, in terms of stable mechanisms that lie outside the data.” When we see a correlation, our brains instantly start looking for a causal mechanism that can explain it and the data we see. We don’t often look at the data itself and ask whether some part of the data collection process led to the outcomes we observed. Instead, we assume the data is correct and that it reflects an outside, real-world phenomenon. This is the source of many of the causal illusions that Pearl describes in the book. Our minds are wired for causal thinking, and we will invent causality when we see patterns, even if there truly isn’t a causal structure linking the patterns we see.
It is in this spirit that we attribute negative personality traits to people who cut us off on the freeway. We assume they don’t like us, that they are terrible people, or that they are rushing to the hospital with a sick child, so that being cut off has a satisfying causal explanation. When a particular type of car stands out and we start seeing it everywhere, we fail to notice that it is our attention that has increased and instead assume there really are more of those cars on the road now. We assume people find them more reliable or more appealing, and that their deliberate purchases explain why we now see the cars everywhere. In both cases we are creating causal pathways in our minds that are little more than causal illusions, but we want to find a cause for everything, and we don’t always realize we are doing so.
Regression Coefficients

Statistical regression is a great thing. We can generate a scatter plot, fit a line of best fit, and measure how well that line describes the relationship between the individual points in the data. The better the line fits (the closer the individual points stick to the line), the better it describes the relationships and trends in our data. However, this doesn’t mean that regression coefficients tell us anything about causality. It is tempting to claim a causal relationship when we see a trend line with tightly clustered dots around it and two different variables on the X and Y axes, but this can be misleading.
In The Book of Why Judea Pearl writes, “Regression coefficients, whether adjusted or not, are only statistical trends, conveying no causal information in themselves.” It is easy to forget this, even if you have taken a statistics class and know that correlation does not imply causation. Humans are pattern recognition machines, but we go a step beyond simply recognizing a pattern: we instantly set about trying to understand what is causing it. However, our regression coefficients and scatter plots don’t always hold clear causal information. Quite often a third, hidden variable that cannot be measured directly is influencing the relationship we discover in our regression coefficients.
Pearl continues, “sometimes a regression coefficient represents a causal effect, and sometimes it does not – and you can’t rely on the data alone to tell you the difference.” Imagine a graph plotting the thickness of ceramic mugs against the pressure at which they fracture under a hydraulic press. One axis is the applied pressure and the other is the thickness of the mug, with each point representing the pressure at which an individual mug fractured. By testing the fracture strength of mugs of different thicknesses we could generate a regression line, and from that line we could develop pretty solid causal inferences about thickness and fracture rates. A clear causal link could be identified from the regression coefficients in this scenario.
However, we could also imagine a graph that plotted murder rates in European cities against the spread of Christianity. With one axis being the number of years a city has had a Catholic bishop and the other being the number of murders, we may find that murders decrease the longer a city has had a bishop. From this, we might be tempted to say that Christianity (particularly the presence of a bishop in a town) reduces murder. But what would we point to as the causal mechanism? Would it be religious beliefs adopted by people interacting with the church? Would it be that marriage rules limiting polygamy ensured more men found wives and became less murderous as a result? Would it be that some divinity smiled upon the praying people and made them less murderous? A regression like the one described above wouldn’t tell us anything about the causal mechanism at work. Our causal-thinking minds, however, would still generate causal hypotheses, some reasonable and others less so (this example comes from the wonderful The WEIRDest People in the World by Joseph Henrich).
Regression coefficients can be helpful, but they are less helpful when we don’t understand the causal mechanisms at play. Understanding those mechanisms can help us better interpret the relationship a regression coefficient represents, but the coefficient itself only represents a relationship, not a causal structure. Approaching data and simply looking for trends doesn’t, on its own, generate useful information. We must first have a sense of a potential causal mechanism, then examine the data to see whether our proposed mechanism has support. This is how we can use data and find support for causal hypotheses within regression coefficients.
Scrutinizing Causal Assumptions

Recently I have been writing about my biggest take-aways from The Book of Why by Judea Pearl. The book is more technical than I can fully grasp, since it is written primarily for an academic audience with some knowledge of the fields Pearl dives into, but I still gained some insights from it. In particular, Pearl’s idea that humans are better causal thinkers than we typically give ourselves credit for was a big lesson for me. Thinking back on the book, I have been trying to recognize our powerful causal intuitions and to understand the ways in which our causal thinking can be trusted. Still, it feels to me that indulging our natural causal-thinking tendencies can be dangerous.
However, Pearl offers guidance on how and when we can trust our causal instincts. He writes, “causal assumptions cannot be invented at our whim; they are subject to the scrutiny of data and can be falsified.”
Our ability to imagine different future states and to understand causality at an instinctual level has allowed our species to move from hunter-gatherer groups to massive cities connected by electricity and Wi-Fi. However, our collective minds have also drawn causal connections between unfortunate events and imagined demons. Dictators have used implausible causal connections to justify eugenics and genocide, and to this day society is hampered by conspiracy theories that posit improbable causal links between disparate events.
The important thing to note, as Pearl demonstrates, is that causal assumptions can be falsified and must be supported with data. Supernatural demons cannot be falsified and wild conspiracy theories often lack any supporting data or evidence. We can intuit causal relations, but we must be able to test them in situations that would falsify our assumptions if we are to truly believe them. Pearl doesn’t simply argue that we are good causal thinkers and that we should blindly trust the causal assumptions that come naturally to our mind. Instead, he suggests that we lean into our causal faculties and test causal relationships and assumptions that are falsifiable and can be either supported or disproven by data. Statistics still has a role in this world, but importantly we are not looking at the data without making causal assumptions. We are making predictions and determining whether the data falsifies those predictions.
Causal Hypotheses

In The Book of Why Judea Pearl argues that humans have a unique superpower among animals and living creatures on earth. We are great at developing causal hypotheses. Animals are able to make observations about the world and some are even able to use tools to open fruit, find insects, and perform other tasks. However, humans alone seem to be able to take a tool, develop a hypothesis for why the tool works, and imagine what could be done to improve its functioning. This step requires that we develop causal hypotheses about the nature and reality of tools and how they interact with the objects we wish to manipulate. This is a hugely consequential mental ability, and one that humans have learned to improve over time, especially through cultural learning.
Our minds are imaginative and can think about potential future states. We can understand how our tools work and imagine ways in which our tools might be better in order for us to better achieve our goals. This is how we build causal hypotheses about the world, and how we go about exploring the world in search of evidence that confirms or overturns our imagined causal structures.
In the book, Pearl writes, “although we don’t need to know every causal relation between the variables of interest and might be able to draw some conclusions with only partial information, Wright makes one point with absolute clarity: you cannot draw causal conclusions without some causal hypothesis.” (The Wright Pearl references is the geneticist Sewall Wright.)
To answer causal questions we need to develop a causal hypothesis. We don’t need to have every bit of data possible, and we don’t need to perfectly intuit or know every causal structure, but we can still understand causality by investigating imagined causal pathways. Our brains are powerful enough to draw conclusions based on observed data and imagined causal pathways. While we might be wrong, and have historically made huge errors in our causal attributions about the world, in many instances we are great causal thinkers, to the point where the causal structures we identify are common sense. We might not know exactly what is happening at the molecular level, but we can understand the causal pathway from sharpening a piece of obsidian to forming a point that can penetrate the flesh of an animal we are hunting. While some causal pathways are nearly invisible to us, a great deal are ready for us to view, and we should not forget that. If we ignore the fact that our minds are adept at identifying and imagining causal structures, we can get bogged down in statistics and become overly reliant on correlations and statistical relationships.
The Quest of Science & Life

“It is an irony of history that Galton started out in search of causation and ended up discovering correlation, a relationship that is oblivious of causation,” writes Judea Pearl in The Book of Why. Pearl examines the history of the study of causation, suggesting that Galton abandoned his original quest to define causation. Galton, along with Karl Pearson, is a titanic figure in the study of statistics. The pair are in many ways responsible for the path of modern statistics, but as Pearl describes it, that was not the original intent, at least for Galton.
Pearl describes Galton as trying to work toward universal theories and approaches to causation. Correlation, the end product of Galton’s research, is helpful and a vital part of how we understand the world today, but it is not causation. Correlation does not tell us if one thing causes another, only that a relationship exists. It doesn’t tell us which way the arrow of causation moves or whether other factors are important in causation. It tells us that as one thing changes, another changes with it, or that as other variables adjust, outcomes in the specific thing we want to see also adjust. But from correlation and statistical studies, we don’t truly know why the world works the way it does. I think Pearl would argue that at its best, statistics helps us narrow down causal possibilities and pathways, but it never tells us with any certainty that a relationship exists because of specific causal factors.
The direction of Galton’s research is emblematic of science and of our lives in general. Galton set out in search of one thing, and gave rise to an entirely different field of study. For his work he clearly became successful, influential, and well regarded, but today (as Pearl argues) we are living with the consequences of his work. We haven’t been able to move forward from the paradigm he created. A paradigm he didn’t really set out to establish.
Quite often in our lives we follow paths that we don’t fully understand, ending up in places we didn’t quite expect. We can make the most out of where our journeys take us and live full lives, even if we didn’t expect to be where we are living. We can’t fully control where the path takes us, and if we choose to stop, there is no reason the path has to stop as well. What we set out to do can become more than us, and can carry far beyond our imaginations, and the world will have to live with those consequences, even if we walk away or pass away.
The key point in this post is to remember that the world is complex. Remember that what you see is only a partial slice, that your causal explanations of the world may be inaccurate, and that the correlations you see are not complete explanations of reality. The path you walk shapes the future of the world, for you and for others, so you have a responsibility to make the best decisions you can, and to live well with the destination you reach, even if it isn’t the destination you thought you were walking toward. Your journey will end at some point, but the path you start could keep going far beyond your end-point, so consider whether you are leaving a path that others can continue to follow, or if you are forging a trail that will cause problems down the road. The lesson is to be considerate and make the most out of the winding and unpredictable path ahead of you as you set out on your quest.
Slope is Agnostic to Cause and Effect

I like statistics. I like to think statistically, to recognize that there is a percent chance of one outcome that can be influenced by other factors. I enjoy looking at best fit lines, seeing that there are correlations between different variables, and seeing how trend-lines change if you control for different variables. However, statistics and trend lines don’t actually tell us anything about causality.
In The Book of Why Judea Pearl writes, “the slope (after scaling) is the same no matter whether you plot X against Y or Y against X. In other words, the slope is completely agnostic as to cause and effect. One variable could cause the other, or they could both be effects of a third cause; for the purpose of prediction, it does not matter.”
In statistics we all know that correlation is not causation, but this quote helps us keep important information in mind when we see a statistical analysis and a plot with a linear regression line running through it. The regression line is like the owl that Pearl described earlier in the book. The owl is able to predict where a mouse is likely to be and which direction it will run, but the owl does not seem to know why a mouse is likely to be in a given location or why it is likely to run in one direction over another. It simply knows from experience and observation what a mouse is likely to do.
The regression line is a best fit for numerous observations, but it doesn’t tell us whether one variable causes another or whether both are influenced in a similar manner by another variable. The regression line knows where the mouse might be and where it might run, but it doesn’t know why.
In statistics courses we end at this point of correlation. We might look for other variables that are correlated or try to control for third variables to see if the relationship remains, but we never answer the question of causality; we never get to the why. Pearl thinks this is a limitation we do not need to put on ourselves. Humans, unlike owls, can understand causality; we can recognize the various reasons why a mouse might be hiding under a bush, and why it may choose to run in one direction rather than another. Correlations can help us start to see where relationships exist, but it is the ability of our minds to understand causal pathways that helps us determine causation.
Pearl argues that statisticians avoid these causal arguments out of caution, but that this only creates more problems down the line. Important statistical research in areas of high interest or concern to law-makers, business people, or the general public is carried beyond the cautious bounds that causality-averse statisticians place on their work. Showing correlations without making an effort to understand the causality behind them makes scientific work vulnerable to the epistemically malevolent, who would like to use correlations to their own ends. While statisticians rigorously train themselves to understand that correlation is not causation, the general public and those struck with motivated reasoning don’t hold themselves to the same standard. Leaving statistical analysis at the level of correlation means that others can attribute the cause and effect of their choice to the data, and the proposed causal pathways can be wildly inaccurate and even dangerous. Pearl suggests that statisticians and researchers are thus obligated to do more with causal structures, to round off their work and better develop ideas of causation that can be defended once their work moves beyond the world of academic journals.
Predictions & Explanations

The human mind has incredible predictive abilities, but our explanatory abilities do not always turn out to be equally incredible. Prediction is relatively easy when compared to explanation. Animals can predict where a food source will be without being able to explain how it got there. For most of human history our ancestors were able to predict that the sun would rise the next day without any way of explaining why it would rise. Computer programs today can predict our next move in chess, but few can explain their prediction or why we would make the predicted choice.
As Judea Pearl writes in The Book of Why, “Good predictions need not have good explanations. The owl can be a good hunter without understanding why the rat always goes from point A to point B.” Prediction is possible with statistics and good observations. With a large enough database, we can make a prediction about what percentage of individuals will have negative reactions to medications, we can predict when a traffic jam will occur, and we can predict how an animal will behave. What is harder, according to Pearl, is moving to the stage where we describe why we observe the relationships that statistics reveal.
Statistics alone cannot tell us why particular patterns emerge. Statistics cannot identify causal structures. As a result, we continually tell ourselves that correlation is not causation and that we can only determine which relationships are truly causal through randomized controlled trials. Pearl would argue that this is incorrect, and that the idea results from the fact that statistics is trying to answer a completely different question than causation. Approaching statistical questions from a causal lens may lead to inaccurate interpretations of data or “p-hacking,” an academic term for efforts to get the statistical results you wanted to see. The key is not hunting for causation within statistics, but understanding causation and supporting it through evidence uncovered via statistics.
Seeing the difference between causation and statistics is helpful when thinking about the world. Being stuck without a way to see and determine causation leads to situations like tobacco companies claiming that cigarettes don’t cause cancer or oil and gas companies claiming that humans don’t contribute to global warming. Causal thinking, however, utilizes our ability to develop explanations and applies those explanations to the world. Our ability to predict different outcomes based on different interventions helps us interpret and understand the data that the world produces. We may not see the exact picture in the data, but we can understand it and use it to help us make better decisions that will lead to more accurate causal understandings over time.