Hypothesis Testing: Correlations

Hypothesis Tests with the Pearson Correlation

We test the correlation coefficient to determine whether the linear relationship in the sample data effectively models the relationship in the population.

Learning Objectives

Use a hypothesis test in order to determine the significance of Pearson’s correlation coefficient.

Key Takeaways

Key Points

  • Pearson’s correlation coefficient, [latex]\text{r}[/latex], tells us about the strength of the linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex] points on a regression plot.
  • The hypothesis test lets us decide whether the value of the population correlation coefficient [latex]\rho[/latex] is “close to 0” or “significantly different from 0” based on the sample correlation coefficient [latex]\text{r}[/latex] and the sample size [latex]\text{n}[/latex].
  • If the test concludes that the correlation coefficient is significantly different from 0, we say that the correlation coefficient is “significant”.
  • If the test concludes that the correlation coefficient is not significantly different from 0 (it is close to 0), we say that the correlation coefficient is “not significant”.

Key Terms

  • Pearson’s correlation coefficient: a measure of the linear correlation (dependence) between two variables [latex]\text{X}[/latex] and [latex]\text{Y}[/latex], giving a value between [latex]+1[/latex] and [latex]-1[/latex] inclusive, where [latex]+1[/latex] is total positive correlation, 0 is no correlation, and [latex]-1[/latex] is total negative correlation

Testing the Significance of the Correlation Coefficient

Pearson’s correlation coefficient, [latex]\text{r}[/latex], tells us about the strength of the linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex] points on a regression plot. However, the reliability of the linear model also depends on how many observed data points are in the sample. We need to look at both the value of the correlation coefficient [latex]\text{r}[/latex] and the sample size [latex]\text{n}[/latex], together. We perform a hypothesis test of the “significance of the correlation coefficient” to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.

The hypothesis test lets us decide whether the value of the population correlation coefficient [latex]\rho[/latex] is “close to 0” or “significantly different from 0”. We decide this based on the sample correlation coefficient [latex]\text{r}[/latex] and the sample size [latex]\text{n}[/latex].
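Both inputs to the test, [latex]\text{r}[/latex] and [latex]\text{n}[/latex], come directly from the paired sample. The following is a minimal sketch of how [latex]\text{r}[/latex] can be computed from its definition. The function name `pearson_r` and the data are made up for illustration and do not come from the text.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and the two sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
n = len(x)
print(f"r = {r:.3f}, n = {n}")  # r = 0.775, n = 5
```

With only five points, even a fairly large [latex]\text{r}[/latex] such as this one may not be significant, which is why the test considers [latex]\text{r}[/latex] and [latex]\text{n}[/latex] together.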

If the test concludes that the correlation coefficient is significantly different from 0, we say that the correlation coefficient is “significant.”

Conclusion: “There is sufficient evidence to conclude that there is a significant linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex] because the correlation coefficient is significantly different from 0.”

What the conclusion means: There is a significant linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex]. We can use the regression line to model the linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex] in the population.

If the test concludes that the correlation coefficient is not significantly different from 0 (it is close to 0), we say that the correlation coefficient is “not significant.”

Conclusion: “There is insufficient evidence to conclude that there is a significant linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex] because the correlation coefficient is not significantly different from 0.”

What the conclusion means: There is not a significant linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex]. Therefore we can NOT use the regression line to model a linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex] in the population.

Performing the Hypothesis Test

Our null hypothesis, [latex]\text{H}_0: \rho = 0[/latex], is that the population correlation coefficient IS NOT significantly different from 0: there IS NOT a significant linear relationship (correlation) between [latex]\text{x}[/latex] and [latex]\text{y}[/latex] in the population. Our alternative hypothesis, [latex]\text{H}_\text{a}: \rho \neq 0[/latex], is that the population correlation coefficient IS significantly different from 0: there IS a significant linear relationship (correlation) between [latex]\text{x}[/latex] and [latex]\text{y}[/latex] in the population.
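One common way to carry out this test, which the text here does not spell out, is to convert [latex]\text{r}[/latex] into the statistic [latex]\text{t} = \text{r}\sqrt{\text{n}-2}/\sqrt{1-\text{r}^2}[/latex]. When the population correlation is 0 and the assumptions listed at the end of this section hold, this statistic follows a Student's t distribution with [latex]\text{n}-2[/latex] degrees of freedom. The sketch below uses only the Python standard library and approximates the p-value by numerical integration. The function names are illustrative assumptions, not part of the source.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p_value(t, df, steps=10_000):
    """Approximate P(|T| >= |t|) with Simpson's rule (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * area_zero_to_t)

def correlation_t_test(r, n):
    """Return (t, df, p) for the test of H0: rho = 0 against Ha: rho != 0."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r * r)
    return t, df, two_sided_p_value(t, df)

# r = 0.801 with n = 10 are the values used in this section's table example.
t, df, p = correlation_t_test(0.801, 10)
print(f"t = {t:.3f}, df = {df}, p = {p:.4f}")
print("significant at the 5% level" if p < 0.05 else "not significant")
```

For these values the p-value falls well below 0.05, so the null hypothesis is rejected at the 5% level. In practice a statistics library would normally replace the hand-written integration.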

Using a Table of Critical Values to Make a Decision

The table of 95% critical values of the sample correlation coefficient gives us a good idea of whether the computed value of [latex]\text{r}[/latex] is significant or not. Compare [latex]\text{r}[/latex] to the appropriate critical value in the table. If [latex]\text{r}[/latex] is not between the positive and negative critical values, then the correlation coefficient is significant. If [latex]\text{r}[/latex] is significant, then we can use the line for prediction.

95% Critical Values of the Sample Correlation Coefficient Table: This table gives us a good idea of whether the computed value of r is significant or not.

As an example, suppose you computed [latex]\text{r}=0.801[/latex] using [latex]\text{n}=10[/latex] data points. [latex]\text{df} = \text{n}-2 =10-2 = 8[/latex]. The critical values associated with [latex]\text{df}=8[/latex] are [latex]\pm 0.632[/latex]. If [latex]\text{r}[/latex] is less than the negative critical value or [latex]\text{r}[/latex] is greater than the positive critical value, then [latex]\text{r}[/latex] is significant. Since [latex]\text{r}=0.801[/latex] and [latex]0.801 > 0.632[/latex], [latex]\text{r}[/latex] is significant and the line may be used for prediction.
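The critical values in such a table can be reproduced from the Student's t distribution: if [latex]\text{t}^*[/latex] is the two-sided 95% critical value of t with [latex]\text{df}[/latex] degrees of freedom, the matching critical value of [latex]\text{r}[/latex] is [latex]\text{t}^*/\sqrt{\text{t}^{*2}+\text{df}}[/latex]. The sketch below is one possible way to do this with the standard library alone, finding [latex]\text{t}^*[/latex] by bisection. The helper names are assumptions made for illustration, and the table itself remains the method the text describes.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def central_area(t, df, steps=2_000):
    """Approximate P(-t <= T <= t) for t >= 0 with Simpson's rule."""
    if t == 0:
        return 0.0
    h = t / steps  # steps must be even
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2.0 * total * h / 3.0

def critical_r(df, confidence=0.95):
    """Positive critical value of r for a two-sided test of rho = 0."""
    if df < 1 or not 0 < confidence < 1:
        raise ValueError("need df >= 1 and 0 < confidence < 1")
    lo, hi = 0.0, 1.0
    while central_area(hi, df) < confidence:  # widen until t* is bracketed
        hi *= 2.0
    for _ in range(60):  # bisection on the t critical value
        mid = (lo + hi) / 2
        if central_area(mid, df) < confidence:
            lo = mid
        else:
            hi = mid
    t_crit = (lo + hi) / 2
    return t_crit / math.sqrt(t_crit ** 2 + df)

def is_significant(r, n, confidence=0.95):
    """True if r lies outside the interval between the critical values."""
    return abs(r) > critical_r(n - 2, confidence)

print(round(critical_r(8), 3))    # 0.632
print(is_significant(0.801, 10))  # True
```

For [latex]\text{df}=8[/latex] this gives the same [latex]\pm 0.632[/latex] used in the example above, and the decision for [latex]\text{r}=0.801[/latex] agrees with the table lookup.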

Assumptions in Testing the Significance of the Correlation Coefficient

Testing the significance of the correlation coefficient requires that certain assumptions about the data are satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We examine the sample to decide whether the linear relationship that we see between [latex]\text{x}[/latex] and [latex]\text{y}[/latex] in the sample data is strong enough evidence to conclude that there is a linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex] in the population.

The assumptions underlying the test of significance are:

  • There is a linear relationship in the population that models the average value of [latex]\text{y}[/latex] for varying values of [latex]\text{x}[/latex]. In other words, the expected value of [latex]\text{y}[/latex] for each particular value of [latex]\text{x}[/latex] lies on a straight line in the population. (We do not know the equation for the line for the population. Our regression line from the sample is our best estimate of this line in the population.)
  • The [latex]\text{y}[/latex] values for any particular [latex]\text{x}[/latex] value are normally distributed about the line. This implies that there are more [latex]\text{y}[/latex] values scattered closer to the line than are scattered farther away. Assumption one above implies that these normal distributions are centered on the line: the means of these normal distributions of [latex]\text{y}[/latex] values lie on the line.
  • The standard deviations of the population [latex]\text{y}[/latex] values about the line are equal for each value of [latex]\text{x}[/latex]. In other words, each of these normal distributions of [latex]\text{y}[/latex] values has the same shape and spread about the line.
  • The residual errors are mutually independent (no pattern).
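These assumptions describe the population, so a sample cannot prove them, but the residuals of the fitted line are often inspected for obvious violations such as curvature, changing spread, or a visible pattern. The sketch below only computes the residuals of a least-squares line so that they can be listed or plotted. The function names and the data are hypothetical and not taken from the text.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def residuals(xs, ys):
    """Observed y minus the y predicted by the fitted line, per point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
for xi, ei in zip(x, residuals(x, y)):
    print(f"x = {xi}: residual = {ei:+.2f}")
```

Residuals from a least-squares fit always sum to zero, so the useful information lies in how they are scattered against [latex]\text{x}[/latex], not in their total.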