The Correlation Coefficient r

Learning Outcomes

  • Describe the strength and direction of a linear relationship from a correlation coefficient

Recall: Summation

The symbol Σ (sigma) means to “add up,” or sum, everything that follows. For example, Σx means to add up all of the values of the variable x.
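As a small illustration of summation, the sketch below adds up a short list of values in Python. The data values and variable names are made up for this example. It also shows that Σx² (square each value, then add) is generally not the same as (Σx)² (add the values, then square), a distinction that matters in the formula for r.

```python
# Hypothetical data values, chosen only for illustration.
x = [1, 2, 3, 4, 5]

# Σx: add up every value of x.
sum_x = sum(x)

# Σx²: square each value first, then add up the squares.
sum_x_squared = sum(value ** 2 for value in x)

# (Σx)²: add up the values first, then square the total.
square_of_sum_x = sum_x ** 2

print(sum_x, sum_x_squared, square_of_sum_x)  # prints: 15 55 225
```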

The Correlation Coefficient r

Besides looking at the scatter plot and seeing that a line seems reasonable, how can you tell if the line is a good predictor? Use the correlation coefficient as another indicator (besides the scatter plot) of the strength of the relationship between x and y.

The correlation coefficient, r, developed by Karl Pearson in the early 1900s, is numerical and provides a measure of strength and direction of the linear association between the independent variable x and the dependent variable y.

The correlation coefficient is calculated as

[latex]{r}=\dfrac{{ {n}\sum{({x}{y})}-{(\sum{x})}{(\sum{y})} }} {{ \sqrt{\left[{n}\sum{x}^{2}-(\sum{x})^2\right]\left[{n}\sum{y}^{2}-(\sum{y})^2\right]}}}[/latex]

where n = the number of data points.

Recall: ORDER OF OPERATIONS

Please Excuse My Dear Aunt Sally

  • Please: parentheses, [latex]( \ )[/latex]
  • Excuse: exponents, [latex]x^2[/latex]
  • My Dear: multiplication or division, [latex]\times \ \mathrm{or} \ \div[/latex]
  • Aunt Sally: addition or subtraction, [latex]+ \ \mathrm{or} \ -[/latex]

First, find the numerator: calculate [latex]n \sum (xy)[/latex] and [latex](\sum x)(\sum y)[/latex], then subtract the second from the first.

Step 1: To calculate [latex]n \sum (xy)[/latex], work inside the parentheses first. For each data point, multiply the [latex]x[/latex] value by the [latex]y[/latex] value; the result is called the product. Add the products for all of the data points, then multiply that sum by [latex]n[/latex], the number of data points.

Step 2: To calculate [latex](\sum x)(\sum y)[/latex], sum all of the values of the independent variable [latex](x)[/latex], sum all of the values of the dependent variable [latex](y)[/latex], and multiply these two sums together.

Step 3: Subtract the result of Step 2 from the result of Step 1.

Second, find the denominator. You will end up taking the square root of the entire bottom of the fraction; a square root acts like a set of parentheses, so calculate everything under it before taking the root.

Step 4: To calculate [latex][ n \sum (x^2)- (\sum x)^2][/latex], you can reuse a number from Step 2: you have already calculated [latex](\sum x)[/latex], so square it to get the second term. For the first term, [latex]n \sum (x^2)[/latex], square each value of the independent variable, add the squares, and multiply the sum by [latex]n[/latex], the number of data points. Then subtract the second term from the first.

Step 5: Calculate [latex][ n \sum (y^2)-(\sum y)^2][/latex] by repeating the process in Step 4 with the values of the dependent variable instead.

Step 6: Multiply the values you found in Step 4 and Step 5.

Step 7: Find the square root of the value you found in Step 6.

Third, divide the numerator (Step 3) by the denominator (Step 7).
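The seven steps can be traced in a few lines of Python. This is only a sketch: the five data points and all variable names are invented for illustration, and the comments show the value each step produces for this particular data set.

```python
import math

# Hypothetical data set (not from the text), five data points.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)  # number of data points: 5

# Step 1: n * Σ(xy). Multiply each x by its y, add the products, multiply by n.
step1 = n * sum(xi * yi for xi, yi in zip(x, y))   # 5 * 66 = 330

# Step 2: (Σx)(Σy). Sum the x values, sum the y values, multiply the sums.
step2 = sum(x) * sum(y)                            # 15 * 20 = 300

# Step 3: the numerator is Step 1 minus Step 2.
numerator = step1 - step2                          # 30

# Step 4: n * Σ(x²) - (Σx)².
step4 = n * sum(xi ** 2 for xi in x) - sum(x) ** 2  # 275 - 225 = 50

# Step 5: n * Σ(y²) - (Σy)².
step5 = n * sum(yi ** 2 for yi in y) - sum(y) ** 2  # 430 - 400 = 30

# Step 6: multiply the results of Step 4 and Step 5.
step6 = step4 * step5                              # 1500

# Step 7: the denominator is the square root of Step 6.
denominator = math.sqrt(step6)

# Finally, divide the numerator by the denominator.
r = numerator / denominator
print(round(r, 4))  # prints: 0.7746
```

For this made-up data, r is about 0.77, which indicates a fairly strong positive linear relationship.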

If you suspect a linear relationship between x and y, then r can measure how strong the linear relationship is.

What the VALUE of r tells us:

  • The value of r is always between –1 and +1: –1 ≤ r ≤ 1.
  • The size of the correlation indicates the strength of the linear relationship between x and y. Values of r close to –1 or to +1 indicate a stronger linear relationship between x and y.
  • If r = 0, there is no linear relationship between x and y (no linear correlation).
  • If r = 1, there is perfect positive correlation. If r = –1, there is perfect negative correlation. In both of these cases, all of the original data points lie on a straight line. Of course, in the real world, this will not generally happen.
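To see these special values in practice, the sketch below wraps the formula for r in a helper function and applies it to three made-up data sets. The function name `pearson_r` and all of the data are assumptions for illustration only. Two data sets lie exactly on a line, and the third follows a symmetric curve, which gives r = 0 even though y is completely determined by x. That is why r = 0 should be read as "no linear relationship" rather than "no relationship."

```python
import math

def pearson_r(x, y):
    """Correlation coefficient r from the summation formula in this section."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a ** 2 for a in x)
    sum_y2 = sum(b ** 2 for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Points exactly on a rising line, y = 2x + 1 (hypothetical data).
x = [1, 2, 3, 4, 5]
r_rising = pearson_r(x, [2 * v + 1 for v in x])

# Points exactly on a falling line, y = -3x + 10 (hypothetical data).
r_falling = pearson_r(x, [-3 * v + 10 for v in x])

# Points on the symmetric curve y = x^2: a clear pattern, but not a linear one.
x_sym = [-2, -1, 0, 1, 2]
r_curve = pearson_r(x_sym, [v ** 2 for v in x_sym])

print(r_rising, r_falling, r_curve)
```

Up to floating-point rounding, the three results are 1, –1, and 0.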

What the SIGN of r tells us:

  • A positive value of r means that when x increases, y tends to increase, and when x decreases, y tends to decrease (positive correlation).
  • A negative value of r means that when x increases, y tends to decrease, and when x decreases, y tends to increase (negative correlation).
  • The sign of r is the same as the sign of the slope, b, of the best-fit line.
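One way to see why r and the slope b share a sign: in the standard least-squares formula, b = [nΣ(xy) – (Σx)(Σy)] / [nΣx² – (Σx)²], so b and r have the same numerator, and both denominators are positive whenever the x values (and y values) are not all equal. The sketch below checks this on two made-up data sets; the function name `slope_and_r` and the data are assumptions for illustration.

```python
import math

def slope_and_r(x, y):
    """Return the least-squares slope b and the correlation coefficient r."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    # Shared numerator of both b and r.
    numerator = n * sum_xy - sum_x * sum_y
    # Both of these terms are positive when the values are not all identical.
    x_term = n * sum(a ** 2 for a in x) - sum_x ** 2
    y_term = n * sum(b ** 2 for b in y) - sum_y ** 2
    slope = numerator / x_term
    r = numerator / math.sqrt(x_term * y_term)
    return slope, r

x = [1, 2, 3, 4, 5]

# Hypothetical data where y tends to increase as x increases.
b_up, r_up = slope_and_r(x, [2, 4, 5, 4, 5])

# Hypothetical data where y tends to decrease as x increases.
b_down, r_down = slope_and_r(x, [5, 4, 5, 4, 2])

print(round(b_up, 2), round(r_up, 4))      # prints: 0.6 0.7746
print(round(b_down, 2), round(r_down, 4))  # prints: -0.6 -0.7746
```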

Note

A strong correlation does not suggest that x causes y or that y causes x. We say “correlation does not imply causation.”

Three scatter plots with lines of best fit. The first scatter plot shows points ascending from the lower left to the upper right. The line of best fit has positive slope. The second scatter plot shows points descending from the upper left to the lower right. The line of best fit has negative slope. The points in the third scatter plot form a horizontal pattern. The line of best fit is a horizontal line.

(a) A scatter plot showing data with a positive correlation. 0 < r < 1

(b) A scatter plot showing data with a negative correlation. –1 < r < 0

(c) A scatter plot showing data with zero correlation. r = 0

The formula for r looks formidable. However, computer spreadsheets, statistical software, and many calculators can quickly calculate r. The correlation coefficient is the bottom item in the output screens for the LinRegTTest on the TI-83, TI-83+, or TI-84+ calculator (see the previous section for instructions).