Statistical regression is a great thing. We can generate a scatter plot, fit a line of best fit, and measure how well that line describes the relationship between the individual points within the data. The better the line fits (the more closely individual points stick to the line), the better the line describes the relationships and trends in our data. However, this doesn’t mean that the regression coefficients tell us anything about causality. It is tempting to say that a causal relationship exists when we see a trend line with lots of tightly clustered dots around it and two different variables on the X and Y axes, but this can be misleading.
In The Book of Why, Judea Pearl writes, “Regression coefficients, whether adjusted or not, are only statistical trends, conveying no causal information in themselves.” It is easy to forget this, even if you have had a statistics class and know that correlation does not imply causation. Humans are pattern recognition machines, but we go a step beyond simply recognizing a pattern: we instantly set about trying to understand what is causing it. However, our regression coefficients and scatter plots don’t always hold clear causal information. Quite often a third, hidden variable that cannot be measured directly is influencing the relationship we discover in our regression coefficients.
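To make the hidden-variable problem concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the variables `x`, `y`, and `z`, the coefficients 2.0 and 3.0, and the sample size are not from Pearl or from any real data set. A hidden variable `z` drives both `x` and `y`, while `x` has no effect on `y` at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical setup: z is a hidden common cause of both x and y.
# By construction, x has NO causal effect on y.
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)
y = 3.0 * z + rng.normal(size=n)

# Simple regression of y on x: np.polyfit returns [slope, intercept].
slope_naive = np.polyfit(x, y, 1)[0]

# Regression of y on x AND z (only possible here because we simulated z).
design = np.column_stack([x, z, np.ones(n)])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
slope_adjusted = coef[0]

print(f"slope of y on x, ignoring z: {slope_naive:.2f}")
print(f"slope of y on x, adjusting for z: {slope_adjusted:.2f}")
```

Under these assumptions the unadjusted slope comes out clearly positive (about 1.2 in expectation, since cov(x, y) / var(x) = 6 / 5), while the slope after adjusting for `z` is close to zero. In real data the hidden variable is usually not available to adjust for, which is why the coefficient on its own cannot settle the causal question.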
Pearl continues, “sometimes a regression coefficient represents a causal effect, and sometimes it does not – and you can’t rely on the data alone to tell you the difference.” Imagine a graph with a regression line running through a plot of the force a hydraulic press applies to ceramic mugs before they fracture. One axis may be pressure, and the other may be the thickness of the mug. Each point marks the pressure at which an individual mug fractured. We would be able to generate a regression line by testing the fracture strength of mugs of different thicknesses, and from this line we could draw fairly solid causal inferences about thickness and fracture strength. In this scenario, where we choose the thicknesses ourselves and understand the mechanism, the regression coefficients would reflect a clear causal link.
However, we could also imagine a graph that plotted murder rates in European cities against the spread of Christianity. With one axis being the number of years a city has had a Catholic bishop and the other axis being the number of murders, we may find that murders decrease the longer a city has had a bishop. From this, we might be tempted to say that Christianity (particularly the presence of a bishop in a town) reduces murder. But what would we point to as the causal mechanism? Would it be religious beliefs adopted by people interacting with the church? Would it be that marriage rules limiting polygamy ensured more men found wives and became less murderous as a result? Would it be that some divinity smiled upon the praying people and made them less murderous? A regression like the one I described above wouldn’t tell us anything about the causal mechanism in effect in this instance. Our causal-thinking minds, however, would still generate causal hypotheses, some of which would be reasonable but others less so (this example comes from the wonderful The WEIRDest People in the World by Joseph Henrich).
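The contrast between these two scenarios can be sketched in a small simulation. The two "worlds" below are hypothetical and all numbers are invented for illustration: in world A, `y` truly responds to `x` (loosely like thickness and fracture pressure), and in world B a hidden variable drives both (one possible reading of the bishop example, not a claim about what actually happened). The observational regressions look nearly identical; only setting `x` ourselves, as in the mug test, separates the two.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

def slope(x, y):
    """Least-squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

def world_a_response(x):
    # World A (assumed): y is directly caused by x.
    return 1.2 * x + rng.normal(scale=np.sqrt(2.8), size=x.size)

# Observational data from world A.
x_a = rng.normal(scale=np.sqrt(5.0), size=n)
y_a = world_a_response(x_a)

# Observational data from world B (assumed): hidden z drives both x and y.
z = rng.normal(size=n)
x_b = 2.0 * z + rng.normal(size=n)
y_b = 3.0 * z + rng.normal(size=n)

# Intervention: we set x ourselves, independently of everything else.
x_set = rng.normal(scale=np.sqrt(5.0), size=n)
y_a_do = world_a_response(x_set)                        # y still follows x
y_b_do = 3.0 * rng.normal(size=n) + rng.normal(size=n)  # y ignores x

print(f"observed slope, world A: {slope(x_a, y_a):.2f}")
print(f"observed slope, world B: {slope(x_b, y_b):.2f}")
print(f"slope when x is set by us, world A: {slope(x_set, y_a_do):.2f}")
print(f"slope when x is set by us, world B: {slope(x_set, y_b_do):.2f}")
```

With these made-up parameters the two observed slopes are both near 1.2, yet once `x` is set by the experimenter the slope persists in world A and collapses toward zero in world B. That is one way to read Pearl’s remark that the data alone cannot tell you which kind of coefficient you are looking at.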
Regression coefficients can be helpful, but they are less helpful when we cannot understand the causal mechanisms at play. Understanding the causal mechanisms can help us better understand the relationship represented by the regression coefficients, but the coefficient itself only represents a relationship, not a causal structure. Approaching data and looking for trends doesn’t help us generate useful information. We must first have a sense of a potential causal mechanism, then examine the data to see if our proposed causal mechanism has support or not. This is how we can use data to find support for causal hypotheses within regression coefficients.
One thought on “Regression Coefficients”
“Approaching data and looking for trends doesn’t help us generate useful information. We must first have a sense of a potential causal mechanism, then examine the data to see if our proposed causal mechanism has support or not.”
I think there are fuzzier feedback mechanisms to consider here. It makes sense that merely discovering relationships among variables isn’t sufficient for causal conclusions, but I do think stumbling across a correlation (e.g., through data mining) can be helpful for generating useful information, if only because it can motivate people to dig deeper and develop a model that explains the correlation. It seems like useful information could come from randomly discovering that some health outcome is associated with an individual’s zip code. Of course, it’s silly to consider that the zip code “caused” the disparity, but I think many people will start to wonder things like “what other things associated with zip code could explain this relationship?” The initial discovery of the relationship can itself act as a signal that “Hey, there’s potentially something useful to discover here. Look deeper. Explore a variety of hypotheses.”
So, I completely agree that simply quantifying relationships isn’t enough to establish causal relationships, but I don’t think that we always need to START with the causal model. I think we can start from nothing, randomly discover relationships, and THEN generate a variety of causal models. Then, of course, the next thing to do is go out and collect more data to test those models, refine them, test again, and so on ad infinitum. I guess my (perhaps trivial) point is that, yes, an undirected search for trends will not on its own generate directly actionable or causally conclusive information. But, I do think that raw correlations can be inherently useful if only as a message from the science gods telling one: “keep going, dig deeper. X probably doesn’t cause Y, but SOMETHING causes Y.”
As I write this, I’m thinking of the story of how Alexander Fleming discovered penicillin. It all started with seeing an unexpected and unexplained relationship. I guess my point is just that models don’t always need to precede data. There’s often a lot of back-and-forth, and it doesn’t seem obvious that one should always or mostly come first.
Does that make sense?