Correlation

An Intuitive Approach to Relationships

Correlation refers to any of a broad class of statistical relationships involving dependence.

Learning Objectives

Recognize the fundamental meanings of correlation and dependence.

Key Takeaways

Key Points

  • Dependence refers to any statistical relationship between two random variables or two sets of data.
  • Correlations are useful because they can indicate a predictive relationship that can be exploited in practice.
  • Formally, dependence refers to any situation in which random variables do not satisfy a mathematical condition of probabilistic independence.
  • In loose usage, correlation can refer to any departure of two or more random variables from independence, but technically it refers to any of several more specialized types of relationship between mean values.

Key Terms

  • correlation: One of the several measures of the linear statistical relationship between two random variables, indicating both the strength and direction of the relationship.

Researchers often want to know how two or more variables are related. For example, is there a relationship between the grade on the second math exam a student takes and the grade on the final exam? If there is a relationship, what is it and how strong is it? As another example, your income may be determined by your education and your profession. The amount you pay a repair person for labor is often determined by an initial amount plus an hourly fee. These are all examples of a statistical factor known as correlation. Note that the type of data described in these examples is bivariate (“bi” for two variables). In reality, statisticians use multivariate data, meaning many variables. As in our previous example, your income may be determined by your education, profession, years of experience or ability.

Correlation and Dependence

Dependence refers to any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring and the correlation between the demand for a product and its price. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice.

For example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example, there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling; however, statistical dependence is not sufficient to demonstrate the presence of such a causal relationship (i.e., correlation does not imply causation).

Formally, dependence refers to any situation in which random variables do not satisfy a mathematical condition of probabilistic independence. In loose usage, correlation can refer to any departure of two or more random variables from independence, but technically it refers to any of several more specialized types of relationship between mean values.
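The gap between dependence and (linear) correlation can be made concrete with a small sketch. The data below are hypothetical and the helper `pearson_r` is an illustrative name, not something defined in this text; it computes the Pearson coefficient discussed in a later section. Here y is fully determined by x, yet the linear correlation is essentially zero.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: y = x**2, so y depends completely on x.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # essentially zero: strong dependence, but no linear relationship
```

Because the x values are symmetric about zero, the positive and negative products cancel, which is why a coefficient near zero should not be read as independence.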

image

Correlation: This graph shows a positive correlation between world population and total carbon emissions.

Scatter Diagram

A scatter diagram is a type of mathematical diagram using Cartesian coordinates to display values for two variables in a set of data.

Learning Objectives

Demonstrate the role that scatter diagrams play in revealing correlation.

Key Takeaways

Key Points

  • The controlled parameter, or independent variable, is customarily plotted along the horizontal axis, while the measured or dependent variable is customarily plotted along the vertical axis.
  • If no dependent variable exists, either type of variable can be plotted on either axis, and a scatter plot will illustrate only the degree of correlation between two variables.
  • A scatter plot shows the direction and strength of a relationship between the variables.
  • You can determine the strength of the relationship by looking at the scatter plot and seeing how close the points are to a line.
  • When you look at a scatterplot, you want to notice the overall pattern and any deviations from the pattern.

Key Terms

  • trend line: A line on a graph, drawn through points that vary widely, that shows the general trend of a real-world function (often generated using linear regression).
  • Cartesian coordinate: The coordinates of a point measured from an origin along a horizontal axis from left to right (the [latex]\text{x}[/latex]-axis) and along a vertical axis from bottom to top (the [latex]\text{y}[/latex]-axis).

A scatter plot, or diagram, is a type of mathematical diagram using Cartesian coordinates to display values for two variables in a set of data. The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis, and the value of the other variable determining the position on the vertical axis.

In the case of an experiment, a scatter plot is used when a variable exists that is under the control of the experimenter. The controlled parameter (or independent variable) is customarily plotted along the horizontal axis, while the measured (or dependent) variable is customarily plotted along the vertical axis. If no dependent variable exists, either type of variable can be plotted on either axis, and a scatter plot will illustrate only the degree of correlation (not causation) between two variables. This is the context in which we view scatter diagrams.

Relevance to Correlation

A scatter plot shows the direction and strength of a relationship between the variables. A clear direction appears when either of the following patterns holds:

  • High values of one variable occurring with high values of the other variable or low values of one variable occurring with low values of the other variable.
  • High values of one variable occurring with low values of the other variable.

You can determine the strength of the relationship by looking at the scatter plot and seeing how close the points are to a line, a power function, an exponential function, or to some other type of function. When you look at a scatterplot, you want to notice the overall pattern and any deviations from the pattern. The following scatterplot examples illustrate these concepts.

image

Scatter Plot Patterns: An illustration of the various patterns that scatter plots can visualize.

Trend Lines

To study the correlation between the variables, one can draw a line of best fit (known as a “trend line”). An equation for the correlation between the variables can be determined by established best-fit procedures. For a linear correlation, the best-fit procedure is known as linear regression and is guaranteed to generate a correct solution in a finite time. No universal best-fit procedure is guaranteed to generate a correct solution for arbitrary relationships.

Other Uses of Scatter Plots

A scatter plot is also useful to show how two comparable data sets agree with each other. In this case, an identity line (i.e., a [latex]\text{y}=\text{x}[/latex] line or [latex]1:1[/latex] line) is often drawn as a reference. The more the two data sets agree, the more the points tend to concentrate in the vicinity of the identity line. If the two data sets are numerically identical, the points fall on the identity line exactly.

One of the most powerful aspects of a scatter plot, however, is its ability to show nonlinear relationships between variables. Furthermore, if the data is represented by a mixed model of simple relationships, these relationships will be visually evident as superimposed patterns.

Coefficient of Correlation

The correlation coefficient is a measure of the linear dependence between two variables [latex]\text{X}[/latex] and [latex]\text{Y}[/latex], giving a value between [latex]+1[/latex] and [latex]-1[/latex].

Learning Objectives

Compute Pearson’s product-moment correlation coefficient.

Key Takeaways

Key Points

  • The correlation coefficient was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s.
  • Pearson’s correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations.
  • Pearson’s correlation coefficient when applied to a sample is commonly represented by the letter [latex]\text{r}[/latex].
  • The size of the correlation [latex]\text{r}[/latex] indicates the strength of the linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex].
  • Values of [latex]\text{r}[/latex] close to [latex]-1[/latex] or to [latex]+1[/latex] indicate a stronger linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex].

Key Terms

  • covariance: A measure of how much two random variables change together.
  • correlation: One of the several measures of the linear statistical relationship between two random variables, indicating both the strength and direction of the relationship.

The most common coefficient of correlation is known as the Pearson product-moment correlation coefficient, or Pearson’s [latex]\text{r}[/latex]. It is a measure of the linear correlation (dependence) between two variables [latex]\text{X}[/latex] and [latex]\text{Y}[/latex], giving a value between [latex]+1[/latex] and [latex]-1[/latex]. It is widely used in the sciences as a measure of the strength of linear dependence between two variables. It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s.

Pearson’s correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations. The form of the definition involves a “product moment”, that is, the mean (the first moment about the origin) of the product of the mean-adjusted random variables; hence the modifier product-moment in the name.

Pearson’s correlation coefficient when applied to a population is commonly represented by the Greek letter [latex]\rho[/latex] (rho) and may be referred to as the population correlation coefficient or the population Pearson correlation coefficient.

Pearson’s correlation coefficient when applied to a sample is commonly represented by the letter [latex]\text{r}[/latex] and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient. The formula for [latex]\text{r}[/latex] is as follows:

[latex]\displaystyle \text{r} = \frac{\displaystyle{\frac{\sum \text{xy}}{\text{n}}} - \bar{\text{x}}\bar{\text{y}}}{\text{s}_\text{x} \text{s}_\text{y}} \left(\frac{\text{n}}{\text{n}-1}\right)[/latex]

An equivalent expression gives the correlation coefficient as the mean of the products of the standard scores. Based on a sample of paired data [latex](\text{X}_\text{i}, \text{Y}_\text{i})[/latex], the sample Pearson correlation coefficient is:

[latex]\displaystyle \text{r} = \frac{1}{\text{n}-1} \sum_{\text{i}=1}^\text{n} \left(\frac{\text{X}_\text{i}-\bar{\text{X}}}{\text{s}_\text{X}} \right)\left(\frac{\text{Y}_\text{i}-\bar{\text{Y}}}{\text{s}_\text{Y}} \right)[/latex]
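The two expressions for r can be checked against each other numerically. The following is a minimal sketch on a small hypothetical data set; the function names are illustrative, and the standard deviations use the n − 1 divisor, as the formulas above assume.

```python
from math import sqrt

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def r_from_sums(xs, ys):
    """r = ((sum of xy)/n - x_bar*y_bar) / (s_x*s_y) * n/(n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy / n - x_bar * y_bar) / (sample_sd(xs) * sample_sd(ys)) * n / (n - 1)

def r_from_standard_scores(xs, ys):
    """r = sum of products of standard scores, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sample_sd(xs)
    s_y = sample_sd(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Hypothetical paired data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_a = r_from_sums(xs, ys)
r_b = r_from_standard_scores(xs, ys)
print(round(r_a, 4), round(r_b, 4))  # the two values agree
```

The first form is convenient when only summary statistics are available; the second is less prone to rounding error because it works with centered values.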

Mathematical Properties

  • The value of [latex]\text{r}[/latex] is always between [latex]-1[/latex] and [latex]+1[/latex]: [latex]-1\leq \text{r} \leq 1[/latex].
  • The size of the correlation [latex]\text{r}[/latex] indicates the strength of the linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex]. Values of [latex]\text{r}[/latex] close to [latex]-1[/latex] or [latex]+1[/latex] indicate a stronger linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex].
  • If [latex]\text{r}=0[/latex] there is absolutely no linear relationship between [latex]\text{x}[/latex] and [latex]\text{y}[/latex] (no linear correlation).
  • A positive value of [latex]\text{r}[/latex] means that when [latex]\text{x}[/latex] increases, [latex]\text{y}[/latex] tends to increase and when [latex]\text{x}[/latex] decreases, [latex]\text{y}[/latex] tends to decrease (positive correlation).
  • A negative value of [latex]\text{r}[/latex] means that when [latex]\text{x}[/latex] increases, [latex]\text{y}[/latex] tends to decrease and when [latex]\text{x}[/latex] decreases, [latex]\text{y}[/latex] tends to increase (negative correlation).
  • If [latex]\text{r}=1[/latex], there is perfect positive correlation. If [latex]\text{r}=-1[/latex], there is perfect negative correlation. In both these cases, all of the original data points lie on a straight line. Of course, in the real world, this will not generally happen.
  • The Pearson correlation coefficient is symmetric.

Another key mathematical property of the Pearson correlation coefficient is that it is invariant to separate changes in location and scale in the two variables. That is, we may transform [latex]\text{X}[/latex] to [latex]\text{a}+\text{bX}[/latex] and transform [latex]\text{Y}[/latex] to [latex]\text{c}+\text{dY}[/latex], where [latex]\text{a}[/latex], [latex]\text{b}[/latex], [latex]\text{c}[/latex], and [latex]\text{d}[/latex] are constants with [latex]\text{b}, \text{d} > 0[/latex], without changing the correlation coefficient. This fact holds for both the population and sample Pearson correlation coefficients.
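A short numerical check of this invariance, using hypothetical data and illustrative constants: shifting and rescaling each variable with a positive scale factor leaves r unchanged, while a negative scale factor on one variable only flips its sign.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_original = pearson_r(xs, ys)

# X -> a + bX with a = 10, b = 2; Y -> c + dY with c = -5, d = 0.5 (both scales positive).
r_rescaled = pearson_r([10 + 2 * x for x in xs], [-5 + 0.5 * y for y in ys])

# Same shift, but a negative scale on Y (d = -0.5) reverses the direction.
r_flipped = pearson_r([10 + 2 * x for x in xs], [-5 - 0.5 * y for y in ys])

print(round(r_original, 4), round(r_rescaled, 4), round(r_flipped, 4))
```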

Example

Consider the following example data set of scores on a third exam and scores on a final exam:

image

Example: This table shows an example data set of scores on a third exam and scores on a final exam.

To find the correlation of this data we need the summary statistics: the means, the standard deviations, the sample size, and the sum of the products of [latex]\text{x}[/latex] and [latex]\text{y}[/latex].

To find [latex]\sum \text{xy}[/latex], multiply the [latex]\text{x}[/latex] and [latex]\text{y}[/latex] in each ordered pair together, then sum these products. For this problem, [latex]\sum \text{xy} = 122,500[/latex]. To find the correlation coefficient we also need the mean of [latex]\text{x}[/latex], the mean of [latex]\text{y}[/latex], the standard deviation of [latex]\text{x}[/latex] and the standard deviation of [latex]\text{y}[/latex].

[latex]\bar{\text{x}} = 69.1818 \\ \bar{\text{y}} = 160.4545 \\ \text{s}_\text{x} = 2.85721 \\ \text{s}_\text{y} = 20.8008 \\ \sum \text{xy} = 122,500 \\ \text{n} = 11[/latex]

Put the summary statistics into the correlation coefficient formula and solve for [latex]\text{r}[/latex], the correlation coefficient.

[latex]\displaystyle \text{r}=\frac { \frac { 122,500 }{ 11 } -\left( 69.1818 \right) \left( 160.4545 \right) }{ \left( 2.85721 \right) \left( 20.8008 \right) } \left( \frac { 11 }{ 11-1 } \right) \approx 0.6632[/latex]
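The arithmetic can be reproduced directly from the summary statistics. This sketch only re-evaluates the formula with the rounded values quoted above (the variable names are illustrative), so the last digit may differ slightly from a calculation on the raw scores.

```python
# Summary statistics quoted in the example (rounded values).
n = 11
sum_xy = 122_500
x_bar = 69.1818
y_bar = 160.4545
s_x = 2.85721
s_y = 20.8008

# r = ((sum of xy)/n - x_bar*y_bar) / (s_x*s_y) * n/(n - 1)
r = (sum_xy / n - x_bar * y_bar) / (s_x * s_y) * (n / (n - 1))
print(round(r, 2))  # 0.66
```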

Coefficient of Determination

The coefficient of determination provides a measure of how well observed outcomes are replicated by a model.

Learning Objectives

Interpret the properties of the coefficient of determination in regard to correlation.

Key Takeaways

Key Points

  • The coefficient of determination, [latex]\text{r}^2[/latex], is a statistic whose main purpose is either the prediction of future outcomes or the testing of hypotheses on the basis of other related information.
  • The most general definition of the coefficient of determination is [latex]\text{r}^2 = 1-\frac{\text{SS}_\text{err}}{\text{SS}_\text{tot}}[/latex], where [latex]\text{SS}_\text{err}[/latex] is the residual sum of squares and [latex]\text{SS}_\text{tot}[/latex] is the total sum of squares.
  • [latex]\text{r}^2[/latex], when expressed as a percent, represents the percent of variation in the dependent variable [latex]\text{y}[/latex] that can be explained by variation in the independent variable [latex]\text{x}[/latex] using the regression (best fit) line.
  • [latex]1-\text{r}^2[/latex] when expressed as a percent, represents the percent of variation in [latex]\text{y}[/latex] that is NOT explained by variation in [latex]\text{x}[/latex] using the regression line. This can be seen as the scattering of the observed data points about the regression line.

Key Terms

  • correlation coefficient: Any of the several measures indicating the strength and direction of a linear relationship between two random variables.
  • regression: An analytic method to measure the association of one or more independent variables with a dependent variable.

The coefficient of determination (denoted [latex]\text{r}^2[/latex]) is a statistic used in the context of statistical models. Its main purpose is either the prediction of future outcomes or the testing of hypotheses on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, as the proportion of total variation of outcomes explained by the model. Values for [latex]\text{r}^2[/latex] can be calculated for any type of predictive model, which need not have a statistical basis.

The Math

A data set will have observed values and modelled values, sometimes known as predicted values. The “variability” of the data set is measured through different sums of squares, such as:

  • the total sum of squares (proportional to the sample variance);
  • the regression sum of squares (also called the explained sum of squares); and
  • the sum of squares of residuals, also called the residual sum of squares.

The most general definition of the coefficient of determination is:

[latex]\displaystyle \text{r}^2 = 1-\frac{\text{SS}_\text{err}}{\text{SS}_\text{tot}}[/latex]

where [latex]\text{SS}_\text{err}[/latex] is the residual sum of squares and [latex]\text{SS}_\text{tot}[/latex] is the total sum of squares.
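As a minimal sketch of this definition, the code below fits a least-squares line to a small hypothetical data set and then computes the two sums of squares. The function names and the data are assumptions made for illustration.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def r_squared(xs, ys, intercept, slope):
    """Coefficient of determination: 1 - SS_err / SS_tot."""
    y_bar = sum(ys) / len(ys)
    ss_err = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_err / ss_tot

# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

intercept, slope = least_squares_line(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
print(round(r2, 4))  # 0.6
```

For this data set the residual sum of squares is 2.4 and the total sum of squares is 6, so 60% of the variation in y is accounted for by the fitted line.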

Properties and Interpretation of [latex]\text{r}^2[/latex]

In simple linear regression, the coefficient of determination is the square of the correlation coefficient. It is usually stated as a percent, rather than in decimal form. In the context of the data, [latex]\text{r}^2[/latex] can be interpreted as follows:

  • [latex]\text{r}^2[/latex], when expressed as a percent, represents the percent of variation in the dependent variable [latex]\text{y}[/latex] that can be explained by variation in the independent variable [latex]\text{x}[/latex] using the regression (best fit) line.
  • [latex]1-\text{r}^2[/latex] when expressed as a percent, represents the percent of variation in [latex]\text{y}[/latex] that is NOT explained by variation in [latex]\text{x}[/latex] using the regression line. This can be seen as the scattering of the observed data points about the regression line.

So [latex]\text{r}^2[/latex] is a statistic that will give some information about the goodness of fit of a model. In regression, the [latex]\text{r}^2[/latex] coefficient of determination is a statistical measure of how well the regression line approximates the real data points. An [latex]\text{r}^2[/latex] of 1 indicates that the regression line perfectly fits the data.

In many (but not all) instances where [latex]\text{r}^2[/latex] is used, the predictors are calculated by ordinary least-squares regression: that is, by minimizing [latex]\text{SS}_\text{err}[/latex]. In this case, [latex]\text{r}^2[/latex] never decreases as we increase the number of variables in the model. This illustrates a drawback to one possible use of [latex]\text{r}^2[/latex], where one might keep adding variables to increase the [latex]\text{r}^2[/latex] value. For example, if one is trying to predict the sales of a car model from the car’s gas mileage, price, and engine power, one can include such irrelevant factors as the first letter of the model’s name or the height of the lead engineer designing the car, because the [latex]\text{r}^2[/latex] will never decrease as variables are added and will probably increase due to chance alone. This leads to the alternative approach of looking at the adjusted [latex]\text{r}^2[/latex]. Its interpretation is almost the same as that of [latex]\text{r}^2[/latex], but it penalizes the statistic as extra variables are included in the model.
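The text does not give a formula for the adjusted statistic. One commonly used form, assumed here, is 1 − (1 − r²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. The sketch below applies it to made-up values to show how a small gain in r² from extra predictors can still lower the adjusted value.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted r^2 = 1 - (1 - r^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical scenario: 30 car models, r^2 values invented for illustration.
# Three sensible predictors (mileage, price, engine power).
base = adjusted_r_squared(0.70, n_obs=30, n_predictors=3)

# Two irrelevant predictors added; r^2 creeps up slightly by chance.
padded = adjusted_r_squared(0.71, n_obs=30, n_predictors=5)

print(round(base, 3), round(padded, 3))  # the padded model scores lower
```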

Note that [latex]\text{r}^2[/latex] does not indicate whether:

  • the independent variables are a cause of the changes in the dependent variable;
  • omitted-variable bias exists;
  • the correct regression was used;
  • the most appropriate set of independent variables has been chosen;
  • there is collinearity present in the data on the explanatory variables; or
  • the model might be improved by using transformed versions of the existing set of independent variables.

Example

Consider the third exam/final exam example introduced in the previous section. The correlation coefficient is [latex]\text{r}=0.6631[/latex]. Therefore, the coefficient of determination is [latex]\text{r}^2 = 0.6631^2 = 0.4397[/latex].

The interpretation of [latex]\text{r}^2[/latex] in the context of this example is as follows. Approximately 44% of the variation (0.4397 is approximately 0.44) in the final exam grades can be explained by the variation in the grades on the third exam. Therefore approximately 56% of the variation ([latex]1-0.44=0.56[/latex]) in the final exam grades can NOT be explained by the variation in the grades on the third exam.

Line of Best Fit

The trend line (line of best fit) is a line that can be drawn on a scatter diagram representing a trend in the data.

Learning Objectives

Illustrate the method of drawing a trend line and what it represents.

Key Takeaways

Key Points

  • A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression.
  • Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time.
  • The mathematical process which determines the unique line of best fit is based on what is called the method of least squares.
  • When drawn by eye, the line of best fit should (1) have roughly the same number of data points on each side of the line, i.e., lie in the median position; and (2) NOT simply run from the first data point to the last, since extreme points often deviate from the general trend and this would give a biased sense of direction.

Key Terms

  • trend: the long-term movement in time series data after other components have been accounted for

The trend line, or line of best fit, is a line that can be drawn on a scatter diagram representing a trend in the data. It tells whether a particular data set has increased or decreased over a period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression. Trend lines typically are straight lines, although some variations use higher degree polynomials depending on the degree of curvature desired in the line.

Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.

The mathematical process which determines the unique line of best fit is based on what is called the method of least squares – which explains why this line is sometimes called the least squares line. This method works by:

  1. finding the difference of each data [latex]\text{Y}[/latex] value from the line;
  2. squaring all the differences;
  3. summing all the squared differences;
  4. repeating this process for all positions of the line until the smallest sum of squared differences is reached.
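In practice the search in the last step does not have to be carried out by trial and error: for a straight line, the minimizing slope and intercept have a well-known closed form. The sketch below, on hypothetical data and with illustrative function names, computes that line and then confirms that nudging it in any direction increases the sum of squared differences.

```python
def sum_squared_differences(xs, ys, intercept, slope):
    """Steps 1-3: differences from the line, squared, then summed."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Closed-form least-squares line: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

intercept, slope = least_squares_line(xs, ys)
best = sum_squared_differences(xs, ys, intercept, slope)

# Step 4, in miniature: try nearby positions of the line and compare.
nudged = [
    sum_squared_differences(xs, ys, intercept + da, slope + db)
    for da in (-0.1, 0, 0.1)
    for db in (-0.1, 0, 0.1)
    if (da, db) != (0, 0)
]
print(round(intercept, 2), round(slope, 2), all(best < s for s in nudged))
```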

Drawing a Trend Line

When drawn by eye, the line of best fit should:

  • have roughly the same number of data points on each side of the line, i.e., lie in the median position;
  • NOT simply run from the first data point to the last, since extreme points often deviate from the general trend and this would give a biased sense of direction.

The closeness (or otherwise) of the cloud of data points to the line suggests the concept of spread or dispersion.

The graph below shows what happens when we draw the line of best fit from the first data point to the last: it does not go through the median position, as there is one data point above and three data points below the blue line. This is a common mistake to avoid.

image

Trend Line Mistake: This graph shows what happens when we draw the line of best fit from the first data to the last data.

To determine the equation for the line of best fit:

  1. draw the scatterplot on a grid and draw the line of best fit;
  2. select two points on the line which are, as near as possible, on grid intersections so that you can accurately estimate their position;
  3. calculate the gradient ([latex]\text{B}[/latex]) of the line using the formula: [latex]\text{gradient}=\frac { \text{difference in vertical measures}}{\text{difference in horizontal measures}}[/latex]
  4. write the partial equation;
  5. substitute one of the chosen points into the partial equation to evaluate the “[latex]\text{A}[/latex]” term;
  6. write the full equation of the line.

Example

Consider the data in the graph below:

image

Example Graph: This graph will be used in our example for drawing a trend line.

To determine the equation for the line of best fit:

  • a computer application has calculated and plotted the line of best fit for the data – it is shown as a black line – and it is in the median position with 3 data on one side and 3 data on the other side;
  • the two points chosen on the line are [latex](50, 700)[/latex] and [latex](110, 1100)[/latex];
  • calculate the gradient ([latex]\text{B}[/latex]) of the line using the formula:

[latex]\displaystyle \text{gradient}=\frac { 1100-700 }{ 110-50 } =6.67[/latex]

  • write the partial equation:

[latex]\displaystyle \hat { \text{Y} } =\text{A}+\left( \frac { 400 }{ 60 } \right) \text{X}[/latex]

  • substitute the point [latex](50, 700)[/latex] into the equation:

[latex]\displaystyle 700=\text{A}+\left( \frac { 400 }{ 60 } \right) 50[/latex]

[latex]\displaystyle700=\text{A}+\frac { 20,000 }{ 60 }[/latex]

[latex]366.67 =\text{A}[/latex]

  • write the full equation of the line:

[latex]\hat { \text{Y} } =366.67+6.67\text{X}[/latex]
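The same calculation can be written as a small helper. This is a sketch of the two-point procedure only; the function name `line_through` is illustrative, and the two points are the ones read off the graph in the example.

```python
def line_through(point_1, point_2):
    """Return (A, B) for the line Y = A + B*X through two points."""
    (x1, y1), (x2, y2) = point_1, point_2
    gradient = (y2 - y1) / (x2 - x1)   # difference in vertical / difference in horizontal
    intercept = y1 - gradient * x1     # substitute one point to solve for A
    return intercept, gradient

a, b = line_through((50, 700), (110, 1100))
print(f"Y-hat = {a:.2f} + {b:.2f} X")  # Y-hat = 366.67 + 6.67 X
```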

Other Types of Correlation Coefficients

Other types of correlation coefficients include intraclass correlation and the concordance correlation coefficient.

Learning Objectives

Distinguish the intraclass and concordance correlation coefficients from previously discussed correlation coefficients.

Key Takeaways

Key Points

  • The intraclass correlation is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups.
  • It describes how strongly units in the same group resemble each other.
  • The concordance correlation coefficient measures the agreement between two variables (e.g., to evaluate reproducibility or for inter-rater reliability).
  • Whereas Pearson’s correlation coefficient is immune to whether the biased or unbiased version for estimation of the variance is used, the concordance correlation coefficient is not.

Key Terms

  • concordance: Agreement, accordance, or consonance.
  • random effect model: A kind of hierarchical linear model assuming that the dataset being analyzed consists of a hierarchy of different populations whose differences relate to that hierarchy.

Intraclass Correlation

The intraclass correlation (or the intraclass correlation coefficient, abbreviated ICC) is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups. It describes how strongly units in the same group resemble each other. While it is viewed as a type of correlation, unlike most other correlation measures it operates on data structured as groups rather than data structured as paired observations.

The intraclass correlation is commonly used to quantify the degree to which individuals with a fixed degree of relatedness (e.g., full siblings) resemble each other in terms of a quantitative trait. Another prominent application is the assessment of consistency or reproducibility of quantitative measurements made by different observers measuring the same quantity.

The intraclass correlation can be regarded within the framework of analysis of variance (ANOVA), and more recently it has been regarded in the framework of a random effects model. Most of the estimators can be defined in terms of the following random effects model:

[latex]\text{Y}_{\text{ij}} = \mu + \alpha_\text{j} + \epsilon_{\text{ij}}[/latex]

where [latex]\text{Y}_{\text{ij}}[/latex] is the [latex]\text{i}[/latex]th observation in the [latex]\text{j}[/latex]th group, [latex]\mu[/latex] is an unobserved overall mean, [latex]\alpha_\text{j}[/latex] is an unobserved random effect shared by all values in group [latex]\text{j}[/latex], and [latex]\epsilon_{\text{ij}}[/latex] is an unobserved noise term. For the model to be identified, the [latex]\alpha_\text{j}[/latex] and [latex]\epsilon_{\text{ij}}[/latex] are assumed to have expected value zero and to be uncorrelated with each other. Also, the [latex]\alpha_\text{j}[/latex] are assumed to be identically distributed, and the [latex]\epsilon_{\text{ij}}[/latex] are assumed to be identically distributed. The variance of [latex]\alpha_\text{j}[/latex] is denoted [latex]\sigma_{\alpha}^2[/latex] and the variance of [latex]\epsilon_{\text{ij}}[/latex] is denoted [latex]\sigma_{\epsilon}^2[/latex]. The population ICC in this framework is shown below:

[latex]\displaystyle \frac{\sigma_{\alpha}^2}{\sigma_{\alpha}^2 + \sigma_{\epsilon}^2}[/latex]
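The text does not specify an estimator for this quantity. One standard choice for balanced data, assumed here, is the one-way ANOVA estimator, which replaces the two variances with estimates built from the between-group and within-group mean squares. The function name and the twin weights below are hypothetical.

```python
def icc_oneway(groups):
    """One-way ANOVA estimate of the ICC for equal-sized groups.

    With MSB and MSW the between- and within-group mean squares and k the
    group size, sigma_alpha^2 is estimated by (MSB - MSW) / k and
    sigma_epsilon^2 by MSW, giving (MSB - MSW) / (MSB + (k - 1) * MSW).
    """
    n_groups = len(groups)
    k = len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (n_groups * k)
    group_means = [sum(g) / k for g in groups]
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n_groups - 1)
    ms_within = sum(
        (y - m) ** 2 for g, m in zip(groups, group_means) for y in g
    ) / (n_groups * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical weights (kg) for four pairs of twins; each inner list is one group.
twin_weights = [[60, 62], [70, 69], [80, 83], [55, 54]]
icc = icc_oneway(twin_weights)
print(round(icc, 3))  # close to 1: twins in a pair resemble each other strongly
```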

Relationship to Pearson’s Correlation Coefficient

One key difference between the two statistics is that in the ICC, the data are centered and scaled using a pooled mean and standard deviation; whereas in the Pearson correlation, each variable is centered and scaled by its own mean and standard deviation. This pooled scaling for the ICC makes sense because all measurements are of the same quantity (albeit on units in different groups). For example, in a paired data set where each “pair” is a single measurement made for each of two units (e.g., weighing each twin in a pair of identical twins) rather than two different measurements for a single unit (e.g., measuring height and weight for each individual), the ICC is a more natural measure of association than Pearson’s correlation.

An important property of the Pearson correlation is that it is invariant to application of separate linear transformations to the two variables being compared. Thus, if we are correlating [latex]\text{X}[/latex] and [latex]\text{Y}[/latex], where, say, [latex]\text{Y}=2\text{X}+1[/latex], the Pearson correlation between [latex]\text{X}[/latex] and [latex]\text{Y}[/latex] is 1: a perfect correlation. This property does not make sense for the ICC, since there is no basis for deciding which transformation is applied to each value in a group. However if all the data in all groups are subjected to the same linear transformation, the ICC does not change.

Concordance Correlation Coefficient

The concordance correlation coefficient measures the agreement between two variables (e.g., to evaluate reproducibility or for inter-rater reliability). The formula is written as:

[latex]\rho_\text{c} = \dfrac{2\rho\sigma_\text{x}\sigma_\text{y}}{\sigma_\text{x}^2+\sigma_\text{y}^2+(\mu_\text{x} - \mu_\text{y})^2}[/latex]

where [latex]{ \mu }_{ \text{x} }[/latex] and [latex]{ \mu }_{ \text{y} }[/latex] are the means for the two variables and [latex]{ { \sigma }^{ 2 } }_{ \text{x} }[/latex] and [latex]{ { \sigma }^{ 2 } }_{ \text{y} }[/latex] are the corresponding variances.
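A minimal sketch of this formula, using the fact that ρσ_xσ_y is the covariance of the two variables. The function name is illustrative, the data are hypothetical, and the variances are computed with the divide-by-n convention, which is an assumption: as noted in the next section, the choice affects the result.

```python
def concordance_correlation(xs, ys):
    """Concordance correlation coefficient with divide-by-n moments."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # rho * sigma_x * sigma_y equals the covariance.
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)

# Hypothetical measurements.
xs = [1, 2, 3, 4, 5]
ys = [2 * x + 1 for x in xs]   # perfectly linear in x, but not equal to x

ccc_same = concordance_correlation(xs, xs)
ccc_shifted = concordance_correlation(xs, ys)
print(round(ccc_same, 4), round(ccc_shifted, 4))  # 1.0 0.3077
```

The second pair has a Pearson correlation of 1, yet its concordance is low, because the coefficient rewards agreement with the identity line rather than with any straight line.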

Relation to Other Measures of Correlation

Whereas Pearson’s correlation coefficient is immune to whether the biased or unbiased version for estimation of the variance is used, the concordance correlation coefficient is not.

The concordance correlation coefficient is nearly identical to some of the measures called intraclass correlations. Comparisons of the concordance correlation coefficient with an “ordinary” intraclass correlation on different data sets will find only small differences between the two correlations.

Variation and Prediction Intervals

A prediction interval is an estimate of an interval in which future observations will fall with a certain probability given what has already been observed.

Learning Objectives

Formulate a prediction interval and compare it to other types of statistical intervals.

Key Takeaways

Key Points

  • A prediction interval bears the same relationship to a future observation that a frequentist confidence interval or Bayesian credible interval bears to an unobservable population parameter.
  • In Bayesian terms, a prediction interval can be described as a credible interval for the variable itself, rather than for a parameter of the distribution thereof.
  • The concept of prediction intervals need not be restricted to the inference of just a single future sample value but can be extended to more complicated cases.
  • Since prediction intervals are only concerned with past and future observations, rather than unobservable population parameters, they are advocated as a better method than confidence intervals by some statisticians.

Key Terms

  • confidence interval: A type of interval estimate of a population parameter used to indicate the reliability of an estimate.
  • credible interval: An interval in the domain of a posterior probability distribution used for interval estimation.

In predictive inference, a prediction interval is an estimate of an interval in which future observations will fall, with a certain probability, given what has already been observed. A prediction interval bears the same relationship to a future observation that a frequentist confidence interval or Bayesian credible interval bears to an unobservable population parameter. Prediction intervals predict the distribution of individual future points, whereas confidence intervals and credible intervals of parameters predict the distribution of estimates of the true population mean or other quantity of interest that cannot be observed. Prediction intervals are also present in forecasts; however, some experts have shown that it is difficult to estimate the prediction intervals of forecasts that have contrary series. Prediction intervals are often used in regression analysis.

For example, let’s say one makes the parametric assumption that the underlying distribution is a normal distribution and has a sample set [latex]\{\text{X}_1, \dots, \text{X}_\text{n}\}[/latex]. Then, confidence intervals and credible intervals may be used to estimate the population mean [latex]\mu[/latex] and population standard deviation [latex]\sigma[/latex] of the underlying population, while prediction intervals may be used to estimate the value of the next sample variable, [latex]\text{X}_{\text{n}+1}[/latex].

Alternatively, in Bayesian terms, a prediction interval can be described as a credible interval for the variable itself, rather than for a parameter of the distribution thereof.

The concept of prediction intervals need not be restricted to the inference of just a single future sample value but can be extended to more complicated cases. For example, in the context of river flooding, where analyses are often based on annual values of the largest flow within the year, there may be interest in making inferences about the largest flood likely to be experienced within the next 50 years.
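As a rough sketch of such an extension, assume (purely for illustration) that the annual maximum flows are independent and share a known cumulative distribution function [latex]\text{F}[/latex], and let [latex]\text{M}_\text{m}[/latex] denote the largest of the next [latex]\text{m}[/latex] annual maxima. These symbols are introduced only for this example. Then:

[latex]\displaystyle \text{P}(\text{M}_\text{m} \leq \text{u}) = \text{F}(\text{u})^\text{m}[/latex]

so an upper prediction bound [latex]\text{u}[/latex] that holds with probability [latex]\gamma[/latex] must satisfy [latex]\text{F}(\text{u}) = \gamma^{1/\text{m}}[/latex]. With [latex]\text{m} = 50[/latex] and [latex]\gamma = 0.95[/latex], this gives [latex]\text{F}(\text{u}) = 0.95^{1/50} \approx 0.999[/latex]: the bound for the 50-year maximum sits much further into the upper tail than the 95% bound for a single year.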

Since prediction intervals are only concerned with past and future observations, rather than unobservable population parameters, they are advocated as a better method than confidence intervals by some statisticians.

Prediction Intervals in the Normal Distribution

Given a sample from a normal distribution, whose parameters are unknown, it is possible to give prediction intervals in the frequentist sense — i.e., an interval [latex][\text{a}, \text{b}][/latex] based on statistics of the sample such that on repeated experiments, [latex]\text{X}_{\text{n}+1}[/latex] falls in the interval the desired percentage of the time.

A general technique of frequentist prediction intervals is to find and compute a pivotal quantity of the observables [latex]\text{X}_1, \dots, \text{X}_\text{n}, \text{X}_{\text{n}+1}[/latex] – meaning a function of observables and parameters whose probability distribution does not depend on the parameters – that can be inverted to give a probability of the future observation [latex]\text{X}_{\text{n}+1}[/latex] falling in some interval computed in terms of the observed values so far. The usual method of constructing pivotal quantities is to take the difference of two variables that depend on location, so that location cancels out, and then take the ratio of two variables that depend on scale, so that scale cancels out. The most familiar pivotal quantity is the Student’s [latex]\text{t}[/latex]-statistic, which can be derived by this method.
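As an illustration of this construction, consider the standard result for a normal sample with unknown mean and variance. Here [latex]\overline{\text{X}}_\text{n}[/latex] denotes the sample mean and [latex]\text{s}_\text{n}[/latex] the sample standard deviation (computed with divisor [latex]\text{n}-1[/latex]) of the first [latex]\text{n}[/latex] observations; these symbols are introduced only for this example. Taking the difference [latex]\text{X}_{\text{n}+1}-\overline{\text{X}}_\text{n}[/latex] cancels the location, and dividing by [latex]\text{s}_\text{n}[/latex] cancels the scale:

[latex]\displaystyle \frac{\text{X}_{\text{n}+1}-\overline{\text{X}}_\text{n}}{\text{s}_\text{n}\sqrt{1+1/\text{n}}} \sim \text{t}_{\text{n}-1}[/latex]

Inverting this statement gives a prediction interval with coverage probability [latex]\gamma[/latex]:

[latex]\displaystyle \left[ \overline{\text{X}}_\text{n} - \text{t}^{*}\,\text{s}_\text{n}\sqrt{1+1/\text{n}},\quad \overline{\text{X}}_\text{n} + \text{t}^{*}\,\text{s}_\text{n}\sqrt{1+1/\text{n}} \right][/latex]

where [latex]\text{t}^{*}[/latex] is the [latex](1+\gamma)/2[/latex] quantile of Student’s [latex]\text{t}[/latex]-distribution with [latex]\text{n}-1[/latex] degrees of freedom.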

A prediction interval [latex][\text{l}, \text{u}][/latex] for a future observation [latex]\text{X}[/latex] in a normal distribution [latex]\text{N}(\mu, \sigma^2)[/latex] with known mean and variance may easily be calculated from the formula:

[latex]\displaystyle \begin{align} \gamma&=\text{P}(\text{l}< \text{X}< \text{u}) \\ &=\text{P}\left(\frac{\text{l}-\mu}{\sigma}< \frac{\text{X}-\mu}{\sigma}< \frac{\text{u}-\mu}{\sigma}\right)\\& =\text{P}\left(\frac{\text{l}-\mu}{\sigma}< \text{Z}< \frac{\text{u}-\mu}{\sigma}\right) \end{align}[/latex]

where:

[latex]\displaystyle \text{Z}=\frac { \text{X}-\mu }{ \sigma }[/latex]

the standard score of [latex]\text{X}[/latex], follows the standard normal distribution. The prediction interval is conventionally written as:

[latex]\left[ \mu -\text{z}\sigma,\quad \mu +\text{z}\sigma \right][/latex]

For example, for the 95% prediction interval of a normal distribution with a mean ([latex]\mu[/latex]) of 5 and a standard deviation ([latex]\sigma[/latex]) of 1, [latex]\text{z}[/latex] is approximately 2 (more precisely, about 1.96). Therefore, the lower limit of the prediction interval is approximately [latex]5 - (1\cdot2) = 3[/latex], and the upper limit is approximately [latex]5 + (1\cdot2) = 7[/latex], giving a prediction interval of approximately 3 to 7.
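The same calculation can be sketched with Python's standard library, which provides the standard normal quantile function through `statistics.NormalDist`. The function name `normal_prediction_interval` is hypothetical, and the sketch assumes a symmetric interval around a known mean.

```python
from statistics import NormalDist


def normal_prediction_interval(mu, sigma, coverage=0.95):
    """Symmetric prediction interval for N(mu, sigma^2) with known parameters."""
    # z is the standard normal quantile leaving (1 - coverage) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    return mu - z * sigma, mu + z * sigma


lower, upper = normal_prediction_interval(5, 1, 0.95)

# Probability that a future observation from N(5, 1) falls inside the interval.
achieved = NormalDist(5, 1).cdf(upper) - NormalDist(5, 1).cdf(lower)

print(round(lower, 2), round(upper, 2), round(achieved, 2))
```

Using the exact quantile rather than the rounded value 2 narrows the interval slightly, to roughly 3.04 to 6.96.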

image

Standard Score and Prediction Interval: Prediction interval (on the [latex]\text{y}[/latex]-axis) given from [latex]\text{z}[/latex] (the quantile of the standard score, on the [latex]\text{x}[/latex]-axis). The [latex]\text{y}[/latex]-axis is logarithmically compressed (but the values on it are not modified).

Rank Correlation

A rank correlation is a statistic used to measure the relationship between rankings of ordinal variables or different rankings of the same variable.

Learning Objectives

Define rank correlation and illustrate how it differs from linear correlation.

Key Takeaways

Key Points

  • A rank correlation coefficient measures the degree of similarity between two rankings and can be used to assess the significance of the relation between them.
  • If one variable decreases as the other increases, the rank correlation coefficients will be negative.
  • An increasing rank correlation coefficient implies increasing agreement between rankings.

Key Terms

  • Spearman’s rank correlation coefficient: A nonparametric measure of statistical dependence between two variables that assesses how well the relationship between two variables can be described using a monotonic function.
  • rank correlation coefficient: A measure of the degree of similarity between two rankings that can be used to assess the significance of the relation between them.
  • Kendall’s rank correlation coefficient: A statistic used to measure the association between two measured quantities; specifically, it measures the similarity of the orderings of the data when ranked by each of the quantities.

A rank correlation is any of several statistics that measure the relationship between rankings of different ordinal variables or different rankings of the same variable. In this context, a “ranking” is the assignment of the labels “first”, “second”, “third”, et cetera, to different observations of a particular variable. A rank correlation coefficient measures the degree of similarity between two rankings and can be used to assess the significance of the relation between them.

If, for example, one variable is the identity of a college basketball program and another variable is the identity of a college football program, one could test for a relationship between the poll rankings of the two types of program. One could then ask, do colleges with a higher-ranked basketball program tend to have a higher-ranked football program? A rank correlation coefficient can measure that relationship, and the measure of significance of the rank correlation coefficient can show whether the measured relationship is small enough to be likely to be a coincidence.

If there is only one variable—for example, the identity of a college football program—but it is subject to two different poll rankings (say, one by coaches and one by sportswriters), then the similarity of the two different polls’ rankings can be measured with a rank correlation coefficient.

Rank Correlation Coefficients

Rank correlation coefficients, such as Spearman’s rank correlation coefficient and Kendall’s rank correlation coefficient, measure the extent to which as one variable increases the other variable tends to increase, without requiring that increase to be represented by a linear relationship.

image

Spearman’s Rank Correlation: This graph shows a Spearman rank correlation of 1 and a Pearson correlation coefficient of 0.88. A Spearman correlation of 1 results when the two variables being compared are monotonically related, even if their relationship is not linear. In contrast, this does not give a perfect Pearson correlation.

If one variable decreases as the other increases, the rank correlation coefficients will be negative. It is common to regard these rank correlation coefficients as alternatives to Pearson’s coefficient, used either to reduce the amount of calculation or to make the coefficient less sensitive to non-normality in distributions. However, this view has little mathematical basis, as rank correlation coefficients measure a different type of relationship than the Pearson product-moment correlation coefficient. They are best seen as measures of a different type of association rather than as alternative measures of the population correlation coefficient.

An increasing rank correlation coefficient implies increasing agreement between rankings. The coefficient is inside the interval [latex][-1, 1][/latex] and assumes the value:

  • [latex]-1[/latex] if the disagreement between the two rankings is perfect: one ranking is the reverse of the other;
  • 0 if the rankings are completely independent; or
  • 1 if the agreement between the two rankings is perfect: the two rankings are the same.

Nature of Rank Correlation

To illustrate the nature of rank correlation, and its difference from linear correlation, consider the following four pairs of numbers [latex](\text{x}, \text{y})[/latex]:

[latex](0, 1) \\ (10, 100) \\ (101, 500) \\ (102, 2000)[/latex]

As we go from each pair to the next pair, [latex]\text{x}[/latex] increases, and so does [latex]\text{y}[/latex]. This relationship is perfect, in the sense that an increase in [latex]\text{x}[/latex] is always accompanied by an increase in [latex]\text{y}[/latex]. This means that we have a perfect rank correlation and both Spearman’s correlation coefficient and Kendall’s correlation coefficient are 1. In this example, the Pearson product-moment correlation coefficient is 0.7544, indicating that the points are far from lying on a straight line.
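These values can be checked with a short Python sketch. The helper names are hypothetical, `ranks` assumes there are no tied values, and Spearman’s coefficient is computed here as the Pearson correlation of the ranks.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank 1 for the smallest value; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman(x, y):
    """Spearman's coefficient as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [0, 10, 101, 102]
y = [1, 100, 500, 2000]

r = pearson(x, y)     # well below 1: the points are far from a straight line
rho = spearman(x, y)  # 1: the ranks of x and y agree exactly

print(round(r, 4), rho)
```

Replacing the values by their ranks discards the unequal spacing of the points, which is why the rank-based coefficient reaches 1 while the Pearson coefficient does not.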

In the same way, if [latex]\text{y}[/latex] always decreases when [latex]\text{x}[/latex] increases, the rank correlation coefficients will be [latex]-1[/latex], while the Pearson product-moment correlation coefficient may or may not be close to [latex]-1[/latex], depending on how close the points are to a straight line. In the extreme cases of perfect rank correlation, Spearman’s and Kendall’s coefficients are equal (both [latex]+1[/latex] or both [latex]-1[/latex]). This is not so in general, however, and values of the two coefficients cannot meaningfully be compared. For example, for the three pairs [latex](1, 1)[/latex], [latex](2, 3)[/latex], [latex](3, 2)[/latex], Spearman’s coefficient is [latex]\frac{1}{2}[/latex], while Kendall’s coefficient is [latex]\frac{1}{3}[/latex].
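The three-pair example can be reproduced exactly with a short Python sketch. The function names are hypothetical and the code assumes no ties. Spearman’s coefficient is computed from the squared rank differences, and Kendall’s coefficient from the counts of concordant and discordant pairs of observations; both are standard forms for untied data.

```python
from fractions import Fraction
from itertools import combinations


def ranks(values):
    """Rank 1 for the smallest value; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman(x, y):
    """1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d the rank differences."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - Fraction(6 * d2, n * (n * n - 1))


def kendall(x, y):
    """(concordant - discordant) / (number of pairs of observations)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x2 - x1) * (y2 - y1)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return Fraction(concordant - discordant, n * (n - 1) // 2)


x = [1, 2, 3]
y = [1, 3, 2]

rho = spearman(x, y)  # Fraction(1, 2)
tau = kendall(x, y)   # Fraction(1, 3)

print(rho, tau)
```

Of the three possible pairs of observations, two are concordant and one is discordant, giving Kendall’s [latex]\frac{1}{3}[/latex], while the squared rank differences sum to 2, giving Spearman’s [latex]\frac{1}{2}[/latex].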