Statistical regression is a great thing. We can generate a scatter plot, generate a line of best fit, and measure how well that line describes the relationship between the individual points within the data. The better the line fits (the more that individual points stick close to the line) the better the line describes the relationships and trends in our data. However, this doesn’t mean that the regression coefficients tell us anything about causality. It is tempting to say that a causal relationship exists when we see a trend line with lots of tight fitting dots around and two different variables on an X and Y axis, but this can be misleading.
In The Book of Why Judea Pearl writes, “Regression coefficients, whether adjusted or not, are only statistical trends, conveying no causal information in themselves.” It is easy to forget this, even if you have had a statistics class and know that correlation does not imply causation. Humans are pattern recognition machines, but we go a step beyond simply recognizing a pattern, we instantly set about trying to understand what is causing the pattern. However, our regression coefficients and scatter plots don’t always hold clear causal information. Quite often there is a third hidden variable that cannot be measured directly that is influencing the relationship we discover in our regression coefficients.
Pearl continues, “sometimes a regression coefficient represents a causal effect, and sometimes it does not – and you can’t rely on the data alone to tell you the difference.” Imagine a graph with a regression line running through a plot of force applied by a hydraulic press and fracture rates for ceramic mugs. One axis may be pressure, and the other axis may be thickness of the ceramic mug. The individual points represent the point at which individual mugs fractured We would be able to generate a regression line by testing the fracture strength of mugs of different thickness, and from this line we would be able to develop pretty solid causal inferences about thickness and fracture rates. A clear causal link could be identified by the regression coefficients in this scenario.
However, we could also imagine a graph that plotted murder rates in European cities and the spread of Christianity. With one axis being the number of years a city has had a Catholic bishop and the other axis being the number of murders, we may find that murders decrease the longer a city has had a bishop. From this, we might be tempted to say that Christianity (particularly the location of a Bishop in a town) reduces murder. But what would we point to as the causal mechanism? Would it be religious beliefs adopted by people interacting with the church? Would it be that marriage rules that limited polygamy ensured more men found wives and became less murderous as a result? Would it be that some divinity smiled upon the praying people and made them to be less murderous? A regression like the one I described above wouldn’t tell us anything about the causal mechanism in effect in this instance. Our causal-thinking minds, however, would still generate causal hypothesis, some of which would be reasonable but others less so (this example comes from the wonderful The WEIRDest People in the World by Joseph Henrich).
Regression coefficients can be helpful, but they are less helpful when we cannot understand the causal mechanisms at play. Understanding the causal mechanisms can help us better understand the relationship represented by the regression coefficients, but the coefficient itself only represents a relationship, not a causal structure. Approaching data and looking for trends doesn’t help us generate useful information. We must first have a sense of a potential causal mechanism, then examine the data to see if our proposed causal mechanism has support or not. This is how we can use data and find support for causal hypothesis within regression coefficients.