Data Driven Methods

In the world of big data, scientists today have a real opportunity to push the limits of scientific inquiry in ways that were never before possible. We have the collection methods and computing power to analyze huge datasets and make observations in minutes that would have taken decades just a few years ago. However, many areas of science are not being strategic with this new power. Instead, researchers often seem to plug variables into huge datasets and haphazardly look for correlations and associations. Judea Pearl is critical of this approach in The Book of Why and uses the genome-wide association study (GWAS) to demonstrate its shortcomings.
 
 
Pearl writes, “It is important to notice the word association in the term GWAS. This method does not prove causality; it only identifies genes associated with a certain disease in the given sample. It is a data-driven rather than hypothesis-driven method, and this presents problems for causal inference.”
 
 
In the 1950s and 1960s, Pearl explains, R. A. Fisher was skeptical that smoking caused cancer and argued that the correlation between smoking and cancer could simply be the result of a hidden variable. He suggested it was possible for a gene to exist that predisposed people both to smoke and to develop lung cancer. Pearl writes that such a smoking gene was indeed discovered in 2008 through a GWAS, but he also notes that the existence of the gene doesn’t actually provide us with any causal mechanism linking people’s genes to smoking behavior or cancer development. The smoking gene was not discovered by a hypothesis-driven method but by data-driven methods: researchers simply combed massive genomic datasets for genes that correlated with both smoking and lung cancer. The smoking gene stood out in that search.
 
 
Pearl goes on to say that causal investigations have shown the gene in question is important for nicotine receptors in lung cells, suggesting a causal pathway between the gene and a predisposition to smoke. However, causal studies also indicate that the gene less than doubles a carrier’s chance of developing lung cancer. “This is serious business, no doubt, but it does not compare to the danger you face if you are a regular smoker,” writes Pearl. Smoking is associated with roughly a tenfold increase in the risk of developing lung cancer, while the smoking gene accounts for less than a twofold increase. The GWAS tells us that the gene is involved in cancer, but we can’t draw causal conclusions from an association alone. We have to go deeper to understand the gene’s causal role and to relate it to other factors we can study. That is what lets us put the GWAS finding in context.
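To make the comparison concrete, here is a minimal sketch in Python. The 1% baseline lifetime risk is a made-up illustrative number, not a figure from Pearl; only the relative magnitudes (roughly tenfold for smoking, under twofold for the gene) come from the text, and the 1.8 value is an assumed stand-in for "less than double."

```python
# Illustrative only: the 1% baseline risk and the 1.8 relative risk
# are assumptions for the sketch, not figures from The Book of Why.
baseline_risk = 0.01      # hypothetical lifetime lung cancer risk
rr_smoking = 10.0         # smoking: ~10x risk (from the text)
rr_gene = 1.8             # "smoking gene": less than 2x (assumed)

risk_smoker = baseline_risk * rr_smoking
risk_gene_carrier = baseline_risk * rr_gene

print(f"smoker:       {risk_smoker:.1%}")        # 10.0%
print(f"gene carrier: {risk_gene_carrier:.1%}")  # 1.8%
```

Whatever baseline you assume, the gap between the two relative risks is what matters: the gene-only risk stays far below the smoker's risk.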
 
 
Much of science is still like the GWAS: looking for associations and hoping to identify a causal pathway, as was eventually done with the smoking gene. In some cases these data-driven methods pay off by pointing researchers toward hypotheses worth testing, but we should recognize that data-driven methods don’t answer our questions by themselves; they represent only correlations, not underlying causal structures. This matters because studies and findings based on mere associations can be misleading. Discovering a smoking gene without explaining the actual causal relationship or its magnitude could harm people’s health, especially if carriers concluded that they were certain to develop cancer because they had the gene. Association studies can ultimately be misleading, misused, misunderstood, and dangerous, and that is part of why Pearl argues we need to move beyond them.

Regression Coefficients

Statistical regression is a great thing. We can generate a scatter plot, fit a line of best fit, and measure how well that line describes the relationship between the individual points in the data. The better the line fits (the closer the individual points stick to the line), the better the line describes the relationships and trends in our data. However, this doesn’t mean that the regression coefficients tell us anything about causality. It is tempting to claim a causal relationship when we see a trend line with tightly clustered points and two different variables on the X and Y axes, but this can be misleading.
In The Book of Why Judea Pearl writes, “Regression coefficients, whether adjusted or not, are only statistical trends, conveying no causal information in themselves.” It is easy to forget this, even if you have taken a statistics class and know that correlation does not imply causation. Humans are pattern recognition machines, but we go a step beyond simply recognizing a pattern: we instantly set about trying to understand what is causing it. However, our regression coefficients and scatter plots don’t always hold clear causal information. Quite often a third, hidden variable that cannot be measured directly is influencing the relationship we discover in our regression coefficients.
Pearl continues, “sometimes a regression coefficient represents a causal effect, and sometimes it does not – and you can’t rely on the data alone to tell you the difference.” Imagine a graph with a regression line running through data on ceramic mugs tested in a hydraulic press. One axis is the pressure at which a mug fractured, and the other is the thickness of the mug; each point marks the point at which an individual mug broke. By testing the fracture strength of mugs of different thicknesses we could generate a regression line, and from that line we could develop pretty solid causal inferences about thickness and fracture strength. In this scenario, a clear causal link could be identified from the regression coefficients.
However, we could also imagine a graph that plotted murder rates in European cities against the spread of Christianity. With one axis being the number of years a city has had a Catholic bishop and the other being the number of murders, we might find that murders decrease the longer a city has had a bishop. From this, we might be tempted to say that Christianity (particularly the presence of a bishop in a town) reduces murder. But what would we point to as the causal mechanism? Would it be religious beliefs adopted by people interacting with the church? Would it be that marriage rules limiting polygamy ensured more men found wives and became less murderous as a result? Would it be that some divinity smiled upon the praying people and made them less murderous? A regression like the one described above wouldn’t tell us anything about the causal mechanism at work. Our causal-thinking minds, however, would still generate causal hypotheses, some reasonable and others less so (this example comes from the wonderful The WEIRDest People in the World by Joseph Henrich).
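The hidden-variable problem behind examples like this can be simulated. In the following sketch (hypothetical data, standard-library Python only), a confounder z causally drives both x and y; by construction x has no causal effect on y at all, yet an ordinary least-squares regression of y on x still produces a large slope.

```python
import random

random.seed(0)

# Hypothetical simulation: a hidden confounder z drives both x and y.
# By construction, x has NO causal effect on y.
n = 10_000
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    return cov / var

# A substantial slope (around 0.8 for these noise levels) appears
# even though x exerts zero causal influence on y.
print(f"slope of y on x: {ols_slope(x, y):.2f}")
```

The data alone cannot distinguish this confounded scenario from a genuinely causal one, which is exactly Pearl's point about regression coefficients.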
Regression coefficients can be helpful, but they are less helpful when we cannot understand the causal mechanisms at play. Understanding those mechanisms helps us interpret the relationship a regression coefficient represents, but the coefficient itself captures only a relationship, not a causal structure. Approaching data and simply looking for trends doesn’t, by itself, generate useful information. We must first have a sense of a potential causal mechanism, and then examine the data to see whether our proposed mechanism has support. This is how we can use data to find support for causal hypotheses within regression coefficients.
Laboratory Proof

“If the standard of laboratory proof had been applied to scurvy,” writes Judea Pearl in The Book of Why, “then sailors would have continued dying right up until the 1930s, because until the discovery of vitamin C, there was no laboratory proof that citrus fruits prevented scurvy.” Pearl’s point is that rigid scientific standards for definitive, exact causality are not always for the greater good. Sometimes modern science will spurn clear statistical relationships and evidence because statistical relationships alone cannot be counted on as concrete causal proof. A clear answer is withheld because some marginal unknowns may still exist, and that withholding can have costs of its own.
Sailors did not know why or how citrus fruits prevented scurvy, but observation demonstrated that they did. There was no clear understanding of what scurvy was or why citrus fruits helped, but it was commonly understood that a causal relationship existed. People acted on these observations, and lives were saved.
In two episodes, the Don’t Panic Geocast has discussed journal articles in the British Medical Journal that make the same point as Pearl. As a critique of the insistence on randomized controlled trials, the two articles highlight the reality that there have never been any randomized controlled trials on the effectiveness of parachutes when jumping from airplanes. The articles are hilarious and clearly satirical, but they ultimately arrive at the same point Pearl makes with the quote above – laboratory proof is not always necessary, practical, or reasonable when lives are on the line.
Pearl argues that we can rely on our ability to identify causality even without laboratory proof, provided we have sufficient statistical analysis and understanding of the relationships involved. Statisticians always tell us that correlation is not causation and that observational studies are not sufficient to determine causality, yet the citrus fruit and parachute examples show that this mindset is not always appropriate. Sometimes a more realistic, common-sense understanding of causation – even one supported only by correlational relationships and statistics – is more important than laboratory proof.
The Quest of Science & Life

“It is an irony of history that Galton started out in search of causation and ended up discovering correlation, a relationship that is oblivious of causation,” writes Judea Pearl in The Book of Why. Pearl examines the history of the study of causation, suggesting that Galton abandoned his original quest to define it. Galton, along with Karl Pearson, is a titanic figure in the study of statistics. The pair are in many ways responsible for the path of modern statistics, but as Pearl describes it, that was not the original intent, at least not for Galton.
Pearl describes Galton as working toward universal theories of and approaches to causation. Correlation, the end product of Galton’s research, is helpful and a vital part of how we understand the world today, but it is not causation. Correlation does not tell us whether one thing causes another, only that a relationship exists. It doesn’t tell us which way the arrow of causation points, or whether other factors matter causally. It tells us that as one thing changes, another changes with it, or that as other variables adjust, outcomes in the thing we care about adjust as well. But from correlation and statistical studies alone, we don’t truly know why the world works the way it does. I think Pearl would argue that at its best, statistics helps us narrow down causal possibilities and pathways, but it never tells us with certainty that a relationship exists because of specific causal factors.
The direction of Galton’s research is emblematic of science and of our lives in general. Galton set out in search of one thing and gave rise to an entirely different field of study. His work made him successful, influential, and well regarded, but today, as Pearl argues, we are living with its consequences. We haven’t been able to move beyond the paradigm he created, a paradigm he didn’t really set out to establish.
Quite often in our lives we follow paths we don’t fully understand, ending up in places we didn’t quite expect. We can make the most of where our journeys take us and live full lives, even if we didn’t expect to end up where we are. We can’t fully control where the path leads, and if we choose to stop, there is no reason the path has to stop with us. What we set out to do can become bigger than us and carry far beyond our imaginations, and the world will have to live with the consequences, even if we walk away or pass away.
The key point in this post is to remember that the world is complex. What you see is only a partial slice; your causal explanations of the world may be inaccurate, and the correlations you see are not complete explanations of reality. The path you walk shapes the future of the world, for you and for others, so you have a responsibility to make the best decisions you can and to live well with the destination you reach, even if it isn’t the one you thought you were walking toward. Your journey will end at some point, but the path you start could keep going far beyond your end-point, so consider whether you are leaving a path others can continue to follow, or forging a trail that will cause problems down the road. The lesson is to be considerate and to make the most of the winding, unpredictable path ahead as you set out on your quest.
Slope is Agnostic to Cause and Effect

I like statistics. I like thinking statistically, recognizing that each outcome has some probability that can be influenced by other factors. I enjoy looking at best-fit lines, seeing correlations between different variables, and watching how trend lines change when you control for different variables. However, statistics and trend lines don’t actually tell us anything about causality.
In The Book of Why Judea Pearl writes, “the slope (after scaling) is the same no matter whether you plot X against Y or Y against X. In other words, the slope is completely agnostic as to cause and effect. One variable could cause the other, or they could both be effects of a third cause; for the purpose of prediction, it does not matter.”
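Pearl's claim about scaling is easy to verify numerically. In this sketch (hypothetical data, standard-library Python only), standardizing both variables to mean 0 and standard deviation 1 makes the slope of y-on-x identical to the slope of x-on-y; both equal the correlation coefficient, which carries no direction of cause and effect.

```python
import random

random.seed(1)

# Hypothetical paired data: a linear relationship plus noise.
n = 5_000
x = [random.gauss(0, 2) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

def standardize(vals):
    """Rescale to mean 0 and (population) standard deviation 1."""
    m = sum(vals) / len(vals)
    sd = (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5
    return [(v - m) / sd for v in vals]

def slope(xs, ys):
    """OLS slope of ys on xs, assuming mean-zero inputs."""
    return sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)

sx, sy = standardize(x), standardize(y)

# After scaling, the two regressions give the same slope: the
# correlation coefficient r, which is symmetric in x and y.
print(slope(sx, sy), slope(sy, sx))
```

Because the number is the same whichever variable you put on which axis, it cannot, by itself, say which variable (if either) is the cause.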
In statistics we all know that correlation is not causation, but this quote helps us remember something important whenever we see a statistical analysis with a linear regression line running through a plot. The regression line is like the owl Pearl describes earlier in the book. The owl can predict where a mouse is likely to be and which direction it will run, but the owl does not seem to know why the mouse is likely to be in a given location or why it will run one way rather than another. It simply knows, from experience and observation, what a mouse is likely to do.
The regression line is a best fit for numerous observations, but it doesn’t tell us whether one variable causes another or whether both are influenced in a similar manner by a third variable. The regression line knows where the mouse might be and where it might run, but it doesn’t know why.
In statistics courses we stop at this point of correlation. We might look for other correlated variables or try to control for third variables to see if the relationship remains, but we never answer the question of causality; we never get to the why. Pearl thinks this is a limitation we do not need to impose on ourselves. Humans, unlike owls, can understand causality: we can recognize the various reasons a mouse might hide under a bush, and why it may choose to run in one direction rather than another. Correlations can help us see where relationships exist, but it is our mind’s ability to understand causal pathways that helps us determine causation.
Pearl argues that statisticians avoid causal arguments out of caution, but that this caution only creates more problems down the line. Important statistical research in areas of high interest to lawmakers, business people, or the general public is carried beyond the cautious bounds that causality-averse statisticians place on their work. Showing correlations without making an effort to understand the causality behind them leaves scientific work vulnerable to the epistemically malevolent, who would happily use correlations to their own ends. While statisticians rigorously train themselves to remember that correlation is not causation, the general public and those gripped by motivated reasoning don’t hold themselves to the same standard. Leaving statistical analysis at the level of correlation means others can attribute whatever cause and effect they choose to the data, and the proposed causal pathways can be wildly inaccurate and even dangerous. Pearl suggests that statisticians and researchers are thus obligated to do more with causal structures, to round off their work and develop ideas of causation that can be defended once the work travels beyond academic journals.
Hope in Big Data

Most of us probably don’t work with huge data sets, but all of us contribute to them. We know the world of big data is out there, and we know people are working with it, but not many of us truly know what it means or how to think about it. In The Book of Why, Judea Pearl argues that even many of those doing research and running companies based on big data don’t fully understand what it all means.
Pearl is critical of researchers and entrepreneurs who lack causal understanding but pursue new knowledge by pulling correlations and statistics out of large data sets. Some companies are taking advantage of the fact that huge amounts of computing power can surface insights from data sets we never before could have analyzed. However, these insights are not always as meaningful as we are led to believe.
Pearl writes, “The hope – and at present, it is usually a silent one – is that the data themselves will guide us to the right answers whenever causal questions come up.”
My last post was about the overuse of the phrase correlation is not causation. Finding correlations and relationships in data is meaningless if we don’t also have causal understandings in mind. That is the critique Pearl makes with the quote above. If we don’t have a way of understanding basic causal structures, then the phrase is right: correlations don’t mean anything. Many companies and researchers are finding correlations and unexpected statistical results in big data, but they lack the causal understanding to do anything meaningful with them. In public policy this feels like the saying a solution in search of a problem; in healthcare, like a pay-and-chase scenario.
Pearl argues throughout the book that we are better at identifying causal structures than our statistics courses lead us to believe. He also argues that understanding causality is the key to unlocking the potential of big data and actually getting something useful out of massive datasets. Without a grounding in causality, we are wasting our time with the statistical research we do. We are running around with solutions, in the form of big data correlations, that have no causal underpinning. It is as if we are paying fraudulent claims, then chasing down some of the money we spent and congratulating ourselves on preventing fraud. The end result is a poor use of data that we prop up as a grand solution.
Correlation and Causation - Judea Pearl - The Book of Why - Joe Abittan

I have an XKCD comic taped to the door of my office. The comic is about the mantra of statistics: correlation is not causation. I taped it to my door because I loved learning statistics in graduate school and thinking deeply about associations and how mere correlations cannot be used to demonstrate that one thing causes another. Two events can correlate yet have nothing to do with each other; a third factor may influence both, causing them to correlate without any causal link between them.
But Judea Pearl thinks that science and researchers have fallen into a trap laid by statisticians and the infinitely repeated correlation does not imply causation mantra. Regarding this perspective of statistics he writes, “it tells us that correlation is not causation, but it does not tell us what causation is.”
Pearl seems to suggest in The Book of Why that there was a time when there was too much data, too much humans didn’t know, and too many people ready to offer incomplete assessments based on anecdote and partial information. From that time sprouted the idea that correlation does not imply causation. We started to see that statistics could describe relationships and could be used to pull apart tangled causal webs, identifying each component and assessing its contribution to a given outcome. However, as his quote shows, this approach never actually answered what causation is. It never told us when we can know that a causal structure and causal mechanism is in place.
“Over and over again,” writes Pearl, “in science and in business, we see situations where mere data aren’t enough.”
To demonstrate the shortcomings of our high regard for statistics and our mantra that correlation is not causation, Pearl walks us through the congressional testimonies and trials of the big tobacco companies in the United States. The data showed a correlation between smoking and lung cancer. There was overwhelming statistical evidence that smoking was associated with lung cancer, but statistics alone could never deliver 100% certainty that smoking caused it. The companies muddied the water with misleading studies and cherry-picked results. They hid behind the veil that correlation is not causation, and behind the confusion about causation that statistics could never fully clarify.
Failing to develop a real sense of causation, failing to move beyond big data, and failing to get past statistical correlations can cause real harm. We need to be able to recognize causation, even without relying on randomized controlled trials, and to make decisions that save lives. The lesson of the comic taped to my door is helpful when we are trying to be scientific and accurate in our thinking, but it can lead us astray when we fail to trust a causal structure that we can see but can’t definitively prove with statistics.