# Correlation does not prove causation, but lack of correlation doesn’t *disprove* causation either.


*UPDATE: After some feedback, I’ve added a bunch of graphs to the number lists below to help clarify.*

What does it mean when someone says “correlation does not prove causation?” It’s a common phrase uttered by individuals who deny that climate change is happening, that it is dominated by the industrial emissions of greenhouse gases like carbon dioxide (CO_{2}), and that the changes will be disruptive to both ecosystems and human society (aka industrial climate disruption, global warming, or climate change). In order to understand why these deniers of industrial climate disruption are wrong, we first have to understand what they’re talking about when they’re talking about correlation, causation, and the relationships between the two.

## What is correlation?

Correlation is a measurement of how related two different sets of numbers are to each other. For example, let’s say we have two sets of numbers as follows:

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20

We can tell by looking at the two lists that the second set of numbers is exactly two times (2x) the first set. In fact, they’re 100% correlated because each number in the second list is **exactly** 2x the first set. We can add one to each number, giving us a third set of numbers:

1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21

In this case, the third set of numbers is still 100% correlated with the first because each number is **exactly** 2x+1 the corresponding number in the first set.

But standard correlation assumes that the relationship between two sets of numbers is linear, of the form Y = mX + b (where ‘*m*’ is the slope and ‘*b*’ is the offset – 2 and 1 respectively in the example above). If we break that linear relationship, the correlation drops. In the case where we have a new set of numbers where each entry is equal to the first set times itself (x^{2}), the correlation drops from 100% (usually displayed by math programs as “1”) to 0.963.

0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100

If the linear relationship has a negative slope (for example, if the second list runs from 0 down to -10), then the correlation is -1 (-100%). And if there’s no linear relationship at all between two sets of numbers, the correlation will be 0. As a practical matter, lists of random numbers only get close to 0 correlation when the lists are very large.
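For readers who would rather check these numbers in code than in a spreadsheet, here is a minimal sketch in Python. It assumes that “correlation” here means the standard Pearson correlation coefficient, and the function name `pearson` and the list names are just labels chosen for this illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two lists vary together, and how much each varies on its own.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

first = list(range(11))                        # 0, 1, ..., 10
doubled = [2 * x for x in first]               # 0, 2, ..., 20
doubled_plus_one = [2 * x + 1 for x in first]  # 1, 3, ..., 21
squared = [x * x for x in first]               # 0, 1, 4, ..., 100
negated = [-x for x in first]                  # 0, -1, ..., -10

print(round(pearson(first, doubled), 3))           # 1.0
print(round(pearson(first, doubled_plus_one), 3))  # 1.0
print(round(pearson(first, squared), 3))           # 0.963
print(round(pearson(first, negated), 3))           # -1.0
```

The slope and the offset don’t matter to the result, only whether the relationship is a straight line, which is why 2x and 2x+1 both come out at 1 while x^{2} does not.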

## What is causation?

“Causation” is just a fancy mathematical/scientific word for a cause-effect relationship between two things. In our first example above, the second list of numbers was exactly 2x the first list of numbers because I defined them that way. If I changed the first set to be something other than the whole numbers between 0 and 10 inclusive, the second set would change too. And because the third and fourth sets are also changing with the first, we can say that they’re all **caused** by the first set of numbers and the mathematical relationship I defined for each of them.

So where does the whole “correlation does not equal causation” thing come from?

Let’s say we have two lists of numbers where one list is twice that of the other, like so:

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20

We know from above that they’ll be 100% correlated with each other. But let’s also say that the first one is days of the month while the second is the number of people sitting in a waiting area. Clearly we can’t simply multiply the days of the month by 2 and magically convert a day into a person (or 10 days into 20 people). So in this example the two sets of numbers are 100% correlated, but that’s just a coincidence of the two sets of numbers.

This is one of the more complicated problems in science, and especially climate science. There’s a high degree of correlation between rising CO_{2} levels and the rising global temperatures, but that might just be a coincidence of the numbers. Thankfully, there’s a bunch of scientists who have taken it upon themselves to figure out exactly how to determine if the relationship between CO_{2} and temperature is merely coincidence or whether one is caused by the other. The published results of this research are called “attribution studies,” and the increase in global temperature was attributed to CO_{2} by 2000 and has been checked and rechecked many times since. At this point the attribution studies have become so strong that there’s essentially no question that CO_{2} is the main driver of global warming.

## What about periods where there isn’t much correlation between two sets of numbers?

There are some people who reject the attribution studies, however, even though it pretty much takes breaking the laws of physics for CO_{2} *not* to be the main driver of global warming. One of the justifications offered for rejecting CO_{2} as the main cause is the (incorrect) claim that there wasn’t any correlation between CO_{2} and global temperature rise over the last 15-20 years. If they were right, wouldn’t the lack of correlation mean that there can’t be causation?

Not at all. In fact, there are cases where causation is well known, but there is very little obvious correlation. To see how this would work, let’s look at those lists of numbers again. We’ll start from the original list of numbers:

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

But to each number in the second list (2x the first), let’s add a random number that’s either -1, 0, or 1:

-1, 3, 4, 6, 8, 11, 12, 14, 15, 19, 21

The additional noise created by the random numbers reduces the correlation from 1 to 0.994. The correlation isn’t perfect anymore, but it’s so close that we can still easily see that the two numbers are still correlated and, since we know that I defined the mathematical relationship myself, we know that there’s still causation. But what if we make it so that the random number we add to the second list is between -10 and +10? Now the list becomes this:

8, 9, -2, 14, 16, 0, 5, 24, 15, 12, 29

In this case, the correlation fell from 1 to 0.57. That’s more correlated than random numbers, but it’s clearly not very good. So even though we **know** that there’s a causal relationship between the first and third lists of numbers, we can’t see it in the math. Or rather, we can’t see it *yet*.
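The two noisy lists above can be checked the same way. The sketch below, again assuming the standard Pearson correlation and using made-up names (`pearson`, `small_noise`, `large_noise`), recovers the noise that was added to each entry and then computes the correlation with the original list.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

first = list(range(11))                                  # 0, 1, ..., 10
small_noise = [-1, 3, 4, 6, 8, 11, 12, 14, 15, 19, 21]   # 2x plus -1, 0, or 1
large_noise = [8, 9, -2, 14, 16, 0, 5, 24, 15, 12, 29]   # 2x plus -10..10

# Subtracting the known 2x relationship leaves only the added noise.
print([y - 2 * x for x, y in zip(first, small_noise)])
# [-1, 1, 0, 0, 0, 1, 0, 0, -1, 1, 1]
print([y - 2 * x for x, y in zip(first, large_noise)])
# [8, 7, -6, 8, 8, -10, -7, 10, -1, -6, 9]

print(f"{pearson(first, small_noise):.2f}")  # 0.99
print(f"{pearson(first, large_noise):.2f}")  # 0.57
```

The underlying 2x relationship is identical in both lists. Only the size of the noise differs, and that alone is enough to pull the correlation down from nearly 1 to 0.57.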

If we add more data, however, the correlation slowly goes back up. Add 10 more numbers and the correlation increases to about 0.9. Add 20 numbers and it increases to 0.93 or so. Add 30 numbers and it increases to 0.97 or so. And so on.

The reason for this is that as the list grows, the numbers eventually get so large that the amount of error compared to the amount of signal goes down. At the number 10, an error of +/-10 whole numbers is a whopping 100% error. But at the number 100, +/-10 whole numbers of error is only a 10% error. When the signal keeps growing like this and the error doesn’t, adding more data will eventually reveal the cause-effect relationship, but it may take a huge list of numbers to accurately see that the second set of data was caused by the first. In the real world, huge lists of numbers can take lots of time to gather – decades or even centuries in the case of climate science.
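To see this effect without a spreadsheet, here is a rough simulation sketch. It makes a few assumptions that are not part of the original lists: the noise is a uniformly random whole number between -10 and +10 (mirroring what “randbetween” would produce), the result is averaged over 200 random trials so that one lucky or unlucky draw doesn’t dominate, and the names `mean_correlation`, `trials`, and `seed` are purely illustrative.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def mean_correlation(n, noise=10, trials=200, seed=0):
    """Average correlation between x = 0..n-1 and y = 2x + random noise."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    xs = list(range(n))
    total = 0.0
    for _ in range(trials):
        # Same 2x relationship every time; only the noise changes.
        ys = [2 * x + rng.randint(-noise, noise) for x in xs]
        total += pearson(xs, ys)
    return total / trials

# 11 numbers (0 to 10), then 10, 20, and 30 more.
for n in (11, 21, 31, 41):
    print(n, round(mean_correlation(n), 2))
```

The exact values depend on the random draws, so they won’t match any single spreadsheet run, but the average correlation should climb steadily as the list gets longer even though the size of the noise never changes.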

One alternative is to figure out how to turn big errors into small errors. In the case above, that would be like figuring out how to turn +/-10 whole numbers of error into +/-1 whole numbers of error. When we can reduce the error somehow, we can detect the real cause-effect relationship much more quickly. There are several ways to do that in real systems, but those methods are beyond the scope of this particular article.

From the examples above we can see that short duration changes can hide a known cause-effect relationship and make it *appear* that the effect is no longer brought about by the cause.

In the case of the Earth’s climate, there are many things besides CO_{2} that influence the Earth’s average temperature. Some of those things include whether we’re in an El Nino or La Nina, whether there was a major volcanic eruption recently, what part of the 11-year solar cycle we’re in, the time of year, and how much pollution is emitted by burning coal, oil, and biomass (wood, dung, etc). When scientists look at El Nino/La Nina, for example, they find that there is a very high correlation between those episodes and short duration (on the order of a few years) changes in the Earth’s temperature. The Earth heats up during an El Nino because the tropical Pacific ocean is releasing stored energy back into the atmosphere, and the Earth cools down during a La Nina because the tropical Pacific is absorbing energy from the atmosphere. The same is true of how the Earth’s temperature cools after a volcanic eruption (due to the release of large amounts of sulfur dioxide into the upper atmosphere). But over the course of decades, the correlation breaks down because something else is driving a longer-term warming trend.

And attribution studies have shown that the “something else” is CO_{2}.

So the next time you come across someone claiming that “correlation does not prove causation,” remember to look at the data that they’re using to make that claim. If there’s not a *lot* of data, or if they’re using a small subset of the available data (say the last 15-20 years of global temperatures instead of the complete record of 136 years), be skeptical. Because lack of correlation doesn’t *disprove* causation either.

**NOTE:** All the numbers I use above can be verified by simply putting the lists of numbers I used into an Excel spreadsheet and using the “correl” function. If you want to check how things work with error added too, I used the “randbetween” function for that. Note that the random number generator means that you probably won’t ever get the exact list of “random” numbers I had above, but you’ll still be able to see how the correlations go up as you add data.
