Data Driven Methods

In the world of big data, scientists today have a real opportunity to push the limits of scientific inquiry in ways that were never before possible. We have the collection methods and computing power to analyze huge datasets and make observations in minutes that would have taken decades just a few years ago. However, many areas of science are not being strategic with this new power. Instead, researchers often seem to plug variables into huge datasets and haphazardly look for correlations and associations. Judea Pearl is critical of this approach in The Book of Why and uses the genome-wide association study (GWAS) to demonstrate its shortcomings.
 
 
Pearl writes, “It is important to notice the word association in the term GWAS. This method does not prove causality; it only identifies genes associated with a certain disease in the given sample. It is a data-driven rather than hypothesis-driven method, and this presents problems for causal inference.”
 
 
In the 1950s and 1960s, Pearl explains, R. A. Fisher was skeptical that smoking caused cancer and argued that the correlation between smoking and cancer could simply be the result of a hidden variable. He suggested it was possible for a gene to exist that predisposed people both to smoke and to develop lung cancer. Pearl writes that such a smoking gene was indeed discovered in 2008 through a GWAS, but he also notes that the existence of such a gene doesn't actually provide any causal mechanism linking people's genes to smoking behavior or cancer development. The smoking gene was not discovered by a hypothesis-driven method but by a data-driven one. Researchers simply scanned massive genomic datasets for genes correlated with both smoking and lung cancer, and the smoking gene stood out.
 
 
Pearl goes on to say that causal investigations have shown the gene in question is important for nicotine receptors in lung cells, suggesting a causal pathway between the gene and a predisposition to smoke. However, causal studies also indicate that the gene less than doubles the chance of developing lung cancer. "This is serious business, no doubt, but it does not compare to the danger you face if you are a regular smoker," writes Pearl. Smoking is associated with roughly a tenfold increase in the risk of developing lung cancer, while the smoking gene accounts for less than a twofold increase. The GWAS tells us that the gene is involved in cancer, but we can't draw causal conclusions from an association alone. We have to go deeper to understand the causality and to relate it to other factors we can study. That context is what makes the GWAS finding meaningful.
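Pearl's comparison can be made concrete with a toy relative-risk calculation. The baseline risk below is a hypothetical number chosen for illustration, not a figure from the book; only the rough multipliers (about tenfold for smoking, under twofold for the gene variant) come from the discussion above.

```python
# Toy relative-risk comparison (illustrative numbers only).
# Assumed baseline lifetime lung-cancer risk for a nonsmoker
# without the gene variant; the true figure varies by population.
baseline_risk = 0.01

smoking_rr = 10.0  # roughly a tenfold increase, per Pearl's discussion
gene_rr = 1.8      # "less than double" -- hypothetical value under 2

print(f"Baseline risk:     {baseline_risk:.1%}")
print(f"With gene variant: {baseline_risk * gene_rr:.1%}")
print(f"Regular smoker:    {baseline_risk * smoking_rr:.1%}")
```

Even with generous assumptions for the gene, the multiplied-out risks make plain why Pearl says the variant "does not compare" to the danger of regular smoking.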
 
 
Much of science is still like the GWAS, looking for associations and hoping a causal pathway can later be identified, as was done with the smoking gene. In some cases these data-driven methods pay off by pointing researchers toward hypotheses worth testing, but we should recognize that data-driven methods themselves don't answer our questions; they only reveal correlations, not underlying causal structures. This matters because studies and findings based on associations alone can be misleading. Discovering a smoking gene without explaining the actual causal relationship or its magnitude could harm people's health, especially if carriers concluded they would surely develop cancer because they had the gene. Association studies can be misleading, misused, misunderstood, and dangerous, and that is part of why Pearl argues for moving beyond them.

Patterns of Associated Ideas

In Thinking Fast and Slow, Daniel Kahneman argues that our brains try to conserve energy by operating on what he calls System 1. The part of our brain that is intuitive, automatic, and makes quick assessments of the world is System 1. It doesn’t require intense focus, it quickly scans our environment, and it simply ignores stimuli that are not crucially important to our survival or the task at hand. System 1 is our low-power resting mode, saving energy so that when we need to, we can activate System 2 for more important mental tasks.

 

Without our conscious recognition, System 1 builds mental models of the world that shape the narrative we use to understand everything that happens around us. It develops simple associations and expectations for things like when we eat, what we expect people to look like, and how we expect the world to react as we move through it. Kahneman writes, "as these links are formed and strengthened, the pattern of associated ideas comes to represent the structure of events in your life, and determines your interpretations of the present as well as your expectations of the future."

 

It isn't uncommon for different people to watch the same TV show, read the same news article, or witness the same event and walk away with completely different interpretations. We might not like a TV show that everyone else loves, we might reach a vastly different conclusion from a news article about global warming, and we might interpret the actions or words of another person completely differently. Part of why we don't all see things the same way, Kahneman might argue, is that we have each trained our System 1 in unique ways. We have different patterns of associated ideas that we use to fit information into a comprehensive narrative.

 

If you never interact with people who are different from you, then you might be surprised when people don't behave the way you expect. When you have a limited background and experience, your System 1 will develop a pattern of associated ideas that might not generalize to new situations. How you see and understand the world is in some ways automatic, determined by the pattern of associated ideas your System 1 has built over the years. It is unique to you, and it won't fit perfectly with the associated ideas that other people develop.

 

We don't have control over System 1. If we activate our System 2, we can start to influence what factors stand out to System 1, but under normal circumstances System 1 will move along building a world that fits its experiences and expectations. This works if we want to move through the world on autopilot with few new experiences, but if we want to be more engaged and to better understand the variety of humanity around us, our System 1 on its own will never be enough, and it will continually let us down.
Associative Thinking

I took a few linguistics classes in college, and I remember really enjoying studies about associative thinking, or priming, where one word triggers thoughts of another, related thing. If you read stop sign and someone then asks you to name a color, you are likely to say red. Our minds hover around a set of words associated with a topic, and words further from the topic won't register as quickly. If you read jellyfish, your mind is primed for more ocean words like water, seaweed, or Nemo. Words like cactus, x-ray, or eviction take an extra second for your brain to register because they don't seem to belong with jellyfish. If you watch Family Feud, you will see great examples of linguistic priming and associative thinking in action. The first person in a family says their answer, and it becomes hard for the rest of the family to jump to a different category to get the final item on the list.

 

Associative thinking is even more interesting and complicated than just linguistic priming. Daniel Kahneman writes about it in his book Thinking Fast and Slow:

 

“An idea that has been activated does not merely evoke one other idea. It activates many ideas, which in turn activate others. Furthermore, only a few of the activated ideas will register in consciousness; most of the work of associative thinking is silent, hidden from our conscious selves. The notion that we have limited access to the workings of our minds is difficult to accept because, naturally, it is alien to our experience, but it is true: you know far less about yourself than you feel you do.”

 

Associative thinking reveals a lot about our minds that we don’t have access to. This is why implicit association tests (IAT) have been used to measure things like racial bias in individuals and societies. Your conscious mind might know it is wrong to think of people of color as criminals, but your unconscious mind might implicitly connect words like crime, drugs, or violence to certain racial groups. Even though you can consciously overcome these biases, your immediate reaction to other people might be enough to show them that you don’t trust them and might reveal implicit fears or negative biases. A clenching fist, a narrowing gaze, or an almost imperceptible backing away from someone might not be conscious, but might be enough for someone to register a sense of unease.

 

I don't know enough about racial bias training to say whether it is effective in counteracting these implicit associations or immediate, unconscious reactions. I don't know just how harmful it is that we harbor such implicit associations, but I think it is important that we recognize they are there. It is important to know how the brain works and to think about how much thinking takes place behind the scenes, without us recognizing it. Self-awareness and knowledge about associative thinking can help us understand how we behave and interact with others, so that hopefully we can bring our best selves to the conversations and interactions we have with people who are different from us.
Thinking Statistically

In Thinking Fast and Slow, Daniel Kahneman personifies two modes of thought as System 1 and System 2. System 1 is fast. It takes in information, processes it rapidly, and doesn’t always make us cognizant of the information we took in. It reacts to the world around us on an intuitive level, isn’t good at math, but is great at positioning us for catching a football.

 

System 2 is slow. It is deliberate, calculating, and uses a lot of energy to maintain. Because it requires so much energy, we don't actually activate it very often, only when we really need to. What is worse, System 2 can usually only operate on the information that System 1 takes in (unless we have a lot of time to pause specifically for information intake), meaning it processes incomplete information.

 

System 1 and System 2 are important to keep in mind when we start to think statistically, something our minds are not good at. Looking back at the 2016 US Presidential election, we can see how hard statistical thinking is. Clinton was favored to win, but there was a real statistical chance that Trump would win, and he did. The chance was small, but that doesn't mean the models were all wrong; it just means the outcome forecast as most likely didn't materialize. We had trouble thinking statistically about win probabilities going into the election, and we had trouble understanding an unlikely outcome after it happened.
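The point about unlikely outcomes can be sketched with a quick simulation. The 30% figure below is a stand-in for a pre-election underdog probability, not any specific model's actual forecast; the idea is just that events with sizable probabilities happen routinely.

```python
import random

random.seed(0)

# Suppose a model gives the underdog a 30% chance of winning
# (a stand-in figure, not an exact 2016 forecast). Simulate many
# elections under that model and count underdog victories.
p_underdog = 0.30
trials = 100_000
underdog_wins = sum(random.random() < p_underdog for _ in range(trials))

share = underdog_wins / trials
print(f"Underdog wins in {share:.1%} of simulated elections")
```

An event with a 30% probability occurs in roughly three of every ten runs, so a single underdog victory is consistent with the forecast rather than a refutation of it.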

 

"Why is it so difficult for us to think statistically?" Kahneman asks in his book. "We easily think associatively, we think metaphorically, we think causally, but statistics requires thinking about many things at once, which is something that System 1 is not designed to do."

 

System 1 operates quickly and cheaply. It takes less energy and effort to run on System 1, but because it is subject to bias and makes judgments on incomplete information, it is not reliable for important decisions and calculations that hinge on nuance. We have to engage System 2 to be good at thinking statistically, but statistical thinking still trips up System 2 because it is hard to hold multiple competing outcomes in mind at the same time and weigh them appropriately. In Risk Savvy, Gerd Gigerenzer shows that statistical thinking can be substantially improved, that we really can think statistically, but that we need help from visual aids and tools so that our minds can grasp statistical concepts. We have to help System 1 so that it can set up System 2 for success if we want to be good at thinking statistically.
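One of the aids Gigerenzer advocates is restating conditional probabilities as natural frequencies (counts out of a concrete population), which System 1 handles far more easily than percentages. The screening numbers below are illustrative, chosen to resemble the classic textbook example rather than taken from any specific study.

```python
# Natural-frequency reframing of a screening-test problem,
# the kind of presentation Gigerenzer recommends.
# Illustrative numbers, not from any specific study.
population = 1000
prevalence = 0.01      # 1% of people have the disease
sensitivity = 0.90     # 90% of sick people test positive
false_positive = 0.09  # 9% of healthy people test positive

sick = population * prevalence                          # 10 people
true_positives = sick * sensitivity                     # 9 people
false_positives = (population - sick) * false_positive  # ~89 people

p_sick_given_positive = true_positives / (true_positives + false_positives)
print(f"Of ~{true_positives + false_positives:.0f} positive tests, "
      f"about {true_positives:.0f} are true positives: "
      f"a {p_sick_given_positive:.0%} chance of disease given a positive test")
```

Stated as "9 of roughly 98 positive tests are real," the answer is almost obvious; stated as conditional percentages, most people (including many physicians, in Gigerenzer's studies) guess far too high.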

 

From the framework Kahneman lays out, with a quick-reacting System 1 running in power-save mode with limited processing power and a System 2 operating on the incomplete information System 1 aggregates, statistical thinking is nearly impossible. System 1 can't bring in enough information for System 2 to analyze appropriately. As a result, we fall back on biases or substitute an easier question for the challenging statistical one. Gigerenzer argues that we can think statistically, but that we need the appropriate framing and cues for System 1 so that System 2 can handle the number crunching and legwork required. In the end, statistical thinking doesn't happen quickly and requires the ability to hold competing and conflicting information in mind at the same time, which is why we so often think anecdotally or metaphorically instead.