Correlation and Causation


A well-designed graph organizes its data so that the reader can see the main conclusion the graph maker has drawn from the data set. A particularly clever graph might present enough information for the reader to draw two, three, or even more conclusions from the one graph.

When designing a graph, keep in mind what conclusions you want the reader to draw from it, and then design the graph so that it draws the reader’s attention to the data that lead to those conclusions. When reading someone else’s graph, look for the conclusions that are being presented to you. For instance, if any control data sets are plotted, what alternative explanations do they rule out? What conclusions are left to explain the experimental group data?

One way to highlight a particular conclusion you want your reader to draw from your data plot is to draw lines through the data to illustrate a correlation between your data variables.

A correlation is a measure of how strongly one variable is related to another. If, for instance, weight were perfectly correlated with height in a directly proportional way (it isn’t), then a person twice as tall as another would also be exactly twice as heavy, and a person 50% as tall as another would also weigh exactly 50% as much.
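The text does not say how the strength of a correlation is measured. One common choice is the Pearson correlation coefficient, usually written r, which runs from -1 to +1. The sketch below assumes that measure. The heights and weights in it are made up for illustration, and the name pearson_r is our own.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together, and how each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(variation_x * variation_y)

# Hypothetical heights in centimeters (not real measurements).
heights_cm = [150, 160, 170, 180, 190]

# Imaginary case: weight is exactly proportional to height.
proportional_weights_kg = [0.4 * h for h in heights_cm]

# More realistic made-up case: taller people tend to be heavier, with scatter.
scattered_weights_kg = [52, 70, 61, 88, 79]

r_perfect = pearson_r(heights_cm, proportional_weights_kg)
r_scattered = pearson_r(heights_cm, scattered_weights_kg)

print(f"perfectly proportional weights: r = {r_perfect:.2f}")
print(f"scattered weights:              r = {r_scattered:.2f}")
```

With these invented numbers the proportional case gives an r of 1, and the scattered case gives a smaller positive value (about 0.8). That is what a strong but imperfect correlation looks like.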

One reason to determine whether certain variables are correlated is to investigate whether one causes the other. For instance, if smoking causes lung cancer, then we expect the incidence of lung cancer to correlate well with smoking rates. Groups with more smokers should have higher rates of lung cancer, and groups with fewer smokers should have lower rates. If you are trying to show that one thing “causes” another (known as having a “causal relationship”), then one line of evidence would be that those two things are strongly correlated with one another.

One important thing to remember about correlated data is that correlation does not prove causation. That is to say, if one variable causes another you would generally expect to find a correlation between them, but finding a correlation does not by itself mean that there is a causal relationship.

Sometimes two things are correlated because one causes another. For instance, we are now certain that lung cancer rates are correlated with smoking rates because smoking does cause cancer.

But sometimes two things are correlated because they are both influenced by a third variable that you might or might not be aware of. For instance, there might be a correlation between ice cream sales and physical assaults. When ice cream sales are higher, assault rates are higher, and when ice cream sales are lower, assault rates are lower. However, it is rather unlikely that eating ice cream causes people to get into fights. More likely there is a third variable we are missing. Perhaps it is heat. On hotter days more people buy ice cream, and also on hotter days more people are short-tempered and get into more fights.
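The third-variable idea can be tried out with a small simulation. The sketch below is purely illustrative. Every number in it is invented, and it assumes that temperature alone drives both ice cream sales and assaults, with no direct link between the two. It uses the Pearson correlation coefficient as the measure of correlation, which is our choice rather than something the text specifies.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(variation_x * variation_y)

random.seed(1)  # fixed seed so the simulated days are the same on every run

# Simulate 5000 days. Temperature is the hidden third variable.
temps_c = [random.uniform(0, 35) for _ in range(5000)]
# Both outcomes depend on temperature plus their own random noise.
# Neither one appears in the formula for the other.
ice_cream_sales = [10 * t + random.gauss(0, 30) for t in temps_c]
assaults = [2 * t + random.gauss(0, 10) for t in temps_c]

r_all_days = pearson_r(ice_cream_sales, assaults)

# Hold the third variable nearly constant: only days between 20 and 22 degrees.
similar_days = [i for i, t in enumerate(temps_c) if 20 <= t <= 22]
r_similar_days = pearson_r(
    [ice_cream_sales[i] for i in similar_days],
    [assaults[i] for i in similar_days],
)

print(f"all days:                    r = {r_all_days:.2f}")
print(f"days of similar temperature: r = {r_similar_days:.2f}")
```

Across all simulated days, sales and assaults are strongly correlated even though neither affects the other. Among days of nearly the same temperature, most of that correlation disappears. This is the pattern you would expect when a third variable is responsible.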

The correlation between ice cream sales and assaults is evidence that ice cream might cause fights, but it isn’t definitive proof, and more evidence for that hypothesis would be needed to convince anyone. (In the case of the connection between lung cancer and smoking, the evidence started out as correlations, but eventually came to include many other types of evidence, which is why we now accept the causal relationship.)

When you are plotting data, showing that two variables correlate well is interesting, and it can be used as one piece of evidence for a possible causal relationship, but the correlation by itself will never be enough. Often a correlation is the first step in establishing the causal relationship.


Lab 2 Exercises 2.5

  1. Plot the data below in the grid of squares provided. You will have two data sets, one for men and one for women. Plot them both on the same axes.
  2. Keep in mind the rules established in earlier exercises for deciding on whether it should be a scatter plot or bar graph, for determining which variable should be the X variable and which should be the Y variable, and for the features of a well-designed graph. Hint: Country is not one of your two variables!
  3. Design your graph so that you emphasize to the reader that you have concluded that for women smoking is correlated with an increased incidence of lung cancer regardless of which country they come from, but that for men of different countries the relationship between smoking and the incidence of lung cancer is more complex.
  4. Here is the data. The grid where you will plot the data is below.
                       Percentage of the population     Deaths from lung cancer
                       who smoke (average)              (deaths per 100,000)
Country                Males        Females             Males        Females
China                  53.4         4.0                 22.7         10.5
France                 33.0         21.0                73.3         14.4
Malaysia               49.2         3.5                 5.6          2.3
New Zealand            25.1         24.8                47.3         29.2
South Africa           43.8         11.7                13.8         5.4
Trinidad and Tobago    42.1         8.0                 12.3         4.2
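
After you have drawn your graph by hand, you could check your visual impression with a quick calculation. The sketch below is optional and is not part of the exercise. It computes the Pearson correlation coefficient between smoking percentage and lung cancer deaths, separately for men and for women, from the six countries in the table. The choice of that measure and the name pearson_r are our own assumptions. With only six points per group, treat the numbers as a rough guide.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(variation_x * variation_y)

# Values copied from the exercise table, in this country order: China, France,
# Malaysia, New Zealand, South Africa, Trinidad and Tobago.
smokers_pct_men = [53.4, 33.0, 49.2, 25.1, 43.8, 42.1]
smokers_pct_women = [4.0, 21.0, 3.5, 24.8, 11.7, 8.0]
deaths_per_100k_men = [22.7, 73.3, 5.6, 47.3, 13.8, 12.3]
deaths_per_100k_women = [10.5, 14.4, 2.3, 29.2, 5.4, 4.2]

r_men = pearson_r(smokers_pct_men, deaths_per_100k_men)
r_women = pearson_r(smokers_pct_women, deaths_per_100k_women)

print(f"men:   r = {r_men:.2f}")
print(f"women: r = {r_women:.2f}")
```

For these six countries the coefficient for women comes out clearly positive, while the one for men comes out negative. That matches the conclusion you are asked to emphasize: across countries, the relationship for men is not a simple "more smoking, more lung cancer" pattern.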