The Regression Line

Slope and Intercept

In the regression line equation, the constant [latex]\text{m}[/latex] is the slope of the line and [latex]\text{b}[/latex] is the [latex]\text{y}[/latex]-intercept.

Learning Objectives

Model the relationship between variables in regression analysis

Key Takeaways

Key Points

  • Linear regression is an approach to modeling the relationship between a dependent variable [latex]\text{y}[/latex] and one or more independent variables denoted [latex]\text{X}[/latex].
  • The mathematical function of the regression line is expressed in terms of a number of parameters, which are the coefficients of the equation, and the values of the independent variable.
  • The coefficients are numeric constants by which variable values in the equation are multiplied or which are added to a variable value to determine the unknown.
  • In the regression line equation, [latex]\text{x}[/latex] and [latex]\text{y}[/latex] are the variables of interest in our data, with [latex]\text{y}[/latex] the unknown or dependent variable and [latex]\text{x}[/latex] the known or independent variable.

Key Terms

  • slope: the ratio of the vertical and horizontal distances between two points on a line; zero if the line is horizontal, undefined if it is vertical.
  • intercept: the coordinate of the point at which a curve intersects an axis

Regression Analysis

Regression analysis is the process of building a model of the relationship between variables in the form of mathematical equations. The general purpose is to explain how one variable, the dependent variable, is systematically related to the values of one or more independent variables. An independent variable is so called because we imagine its value varying freely across its range, while the dependent variable is dependent upon the values taken by the independent. The mathematical function is expressed in terms of a number of parameters that are the coefficients of the equation, and the values of the independent variable. The coefficients are numeric constants by which variable values in the equation are multiplied or which are added to a variable value to determine the unknown. A simple example is the equation for the regression line which follows:

[latex]\text{y}=\text{mx}+\text{b}[/latex]

Here, by convention, [latex]\text{x}[/latex] and [latex]\text{y}[/latex] are the variables of interest in our data, with [latex]\text{y}[/latex] the unknown or dependent variable and [latex]\text{x}[/latex] the known or independent variable. The constant [latex]\text{m}[/latex] is the slope of the line and [latex]\text{b}[/latex] is the [latex]\text{y}[/latex]-intercept, the value where the line crosses the [latex]\text{y}[/latex]-axis. So, [latex]\text{m}[/latex] and [latex]\text{b}[/latex] are the coefficients of the equation.
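To make the roles of [latex]\text{m}[/latex] and [latex]\text{b}[/latex] concrete, here is a minimal Python sketch that evaluates a regression line at a few values of [latex]\text{x}[/latex]; the slope and intercept values are purely illustrative, not estimates from data.

```python
# A minimal sketch: evaluating the regression line y = m*x + b.
# The slope and intercept below are illustrative values, not estimates from data.
m = 2.5   # slope: the change in y for a one-unit increase in x
b = 10.0  # y-intercept: the value of y where the line crosses the y-axis (x = 0)

for x in [0, 1, 2, 3]:
    y = m * x + b
    print(f"x = {x}: predicted y = {y}")
```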

Linear regression is an approach to modeling the relationship between a scalar dependent variable [latex]\text{y}[/latex] and one or more explanatory (independent) variables denoted [latex]\text{X}[/latex]. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, it is called multiple linear regression. (This term should be distinguished from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable).

Two Regression Lines

ANCOVA can be used to compare regression lines by testing the effect of a categorical variable on a dependent variable while controlling for a continuous covariate.

Learning Objectives

Assess how analysis of covariance (ANCOVA) is used to compare regression lines

Key Takeaways

Key Points

  • Researchers, such as those working in the field of biology, commonly wish to compare regressions and determine causal relationships between two variables.
  • Covariance is a measure of how much two variables change together and how strong the relationship is between them.
  • ANCOVA evaluates whether population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV), while statistically controlling for the effects of other continuous variables that are not of primary interest, known as covariates (CV).
  • ANCOVA can also be used to increase statistical power or to adjust for preexisting differences in nonequivalent groups.
  • It is also possible to see similar slopes between lines but a different intercept, which can be interpreted as a difference in magnitudes but not in the rate of change.

Key Terms

  • covariance: A measure of how much two random variables change together.
  • statistical power: the probability that a statistical test will reject a false null hypothesis, that is, that it will not make a type II error, producing a false negative

Researchers, such as those working in the field of biology, commonly wish to compare regressions and determine causal relationships between two variables. For example, comparing slopes between groups is a method that could be used by a biologist to assess different growth patterns of the development of different genetic factors between groups. Any difference between these factors should result in the presence of differing slopes in the two regression lines.

A method known as analysis of covariance (ANCOVA) can be used to compare two or more regression lines by testing the effect of a categorical variable on a dependent variable while controlling for the effect of a continuous covariate.

ANCOVA

Covariance is a measure of how much two variables change together and how strong the relationship is between them. Analysis of covariance (ANCOVA) is a general linear model which blends ANOVA and regression. ANCOVA evaluates whether population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV), while statistically controlling for the effects of other continuous variables that are not of primary interest, known as covariates (CV). Therefore, when performing ANCOVA, we are adjusting the DV means to what they would be if all groups were equal on the CV.
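As a rough sketch of how such a model can be specified in practice, the example below uses Python with pandas and the statsmodels formula interface (assuming both libraries are available); the column names score, group, and pretest, and the simulated data, are purely hypothetical.

```python
# A sketch of an ANCOVA-style model: a categorical IV plus a continuous covariate.
# Column names (score, group, pretest) and the simulated data are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 90
pretest = rng.normal(50, 10, n)                  # continuous covariate (CV)
group = np.repeat(["A", "B", "C"], n // 3)       # categorical independent variable (IV)
shift = {"A": 0.0, "B": 2.0, "C": 4.0}           # group effects used to simulate the DV
score = 5 + 0.8 * pretest + np.array([shift[g] for g in group]) + rng.normal(0, 3, n)

df = pd.DataFrame({"score": score, "group": group, "pretest": pretest})

# DV ~ categorical IV + covariate: group means are compared after adjusting for the CV.
model = smf.ols("score ~ C(group) + pretest", data=df).fit()
print(model.summary())
```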

Uses

Increase Power. ANCOVA can be used to increase statistical power (the ability to find a significant difference between groups when one exists) by reducing the within-group error variance.

ANCOVA: This pie chart shows the partitioning of variance within ANCOVA analysis.

In order to understand this, it is necessary to understand the test used to evaluate differences between groups, the [latex]\text{F}[/latex]-test. The [latex]\text{F}[/latex]-test is computed by dividing the explained variance between groups (e.g., gender difference) by the unexplained variance within the groups. Thus:

[latex]\text{F}=\dfrac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}}[/latex]

If this value is larger than a critical value, we conclude that there is a significant difference between groups. When we control for the effect of the CVs on the DV, we remove it from the denominator, making [latex]\text{F}[/latex] larger and thereby increasing the power to find a significant effect if one exists.
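To make the ratio concrete, the sketch below computes [latex]\text{MS}_{\text{between}}[/latex] and [latex]\text{MS}_{\text{within}}[/latex] by hand for three made-up groups and cross-checks the resulting [latex]\text{F}[/latex] against scipy's one-way ANOVA (assuming scipy is available).

```python
# A sketch: computing F = MS_between / MS_within by hand for three made-up groups.
import numpy as np
from scipy import stats

groups = [np.array([23., 25., 28., 30.]),
          np.array([31., 33., 35., 36.]),
          np.array([26., 27., 29., 32.])]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()
k = len(groups)        # number of groups
N = all_obs.size       # total number of observations

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (N - k)
print("F by hand:   ", ms_between / ms_within)

# Cross-check against scipy's one-way ANOVA.
print("F from scipy:", stats.f_oneway(*groups).statistic)
```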

Adjusting Preexisting Differences. Another use of ANCOVA is to adjust for preexisting differences in nonequivalent (intact) groups. This controversial application aims at correcting for initial group differences (prior to group assignment) that exist on the DV among several intact groups. In this situation, participants cannot be made equal through random assignment, so CVs are used to adjust scores and make participants more comparable than they would be without the CV.

Assumptions

There are five assumptions that underlie the use of ANCOVA and affect interpretation of the results:

  1. Normality of Residuals. The residuals (error terms) should be normally distributed.
  2. Homogeneity of Variances. The error variances should be equal for different treatment classes.
  3. Homogeneity of Regression Slopes. The slopes of the different regression lines should be equal (in our current context, this assumption is what will be tested).
  4. Linearity of Regression. The regression relationship between the dependent variable and concomitant variables must be linear.
  5. Independence of Error terms. The error terms should be uncorrelated.

The Test

In the context of ANCOVA, regression lines are compared by studying the interaction between the treatment effect and the independent variable. If the interaction effect is significantly different from zero (as judged by an [latex]\text{F}[/latex]-test like the one described above), we will see differing slopes between the regression lines.

It is also possible to see similar slopes between lines but a different intercept. Differing intercepts can be interpreted as a difference in magnitudes but not in the rate of change. Differing slopes would imply differing rates of change and possibly differing magnitudes, as well.
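One way to carry out this comparison in practice is to fit a reduced model (common slope) and a full model that adds a group-by-covariate interaction, then test whether the interaction is needed. The sketch below does this with statsmodels (assuming it is available); the column names and simulated data are hypothetical, and the data are generated with a common slope, so the interaction test should come out non-significant.

```python
# A sketch: comparing regression lines by testing the group x covariate interaction.
# Column names and data are hypothetical; the true slope is the same in every group.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n = 90
pretest = rng.normal(50, 10, n)
group = np.repeat(["A", "B", "C"], n // 3)
score = 5 + 0.8 * pretest + rng.normal(0, 3, n)
df = pd.DataFrame({"score": score, "group": group, "pretest": pretest})

reduced = smf.ols("score ~ C(group) + pretest", data=df).fit()  # common slope
full = smf.ols("score ~ C(group) * pretest", data=df).fit()     # group-specific slopes

# F-test of the interaction term: a test of homogeneity of regression slopes.
print(anova_lm(reduced, full))
```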

Least-Squares Regression

The criterion for determining the least squares regression line is that the sum of the squared errors is made as small as possible.

Learning Objectives

Describe how OLS is implemented in linear regression

Key Takeaways

Key Points

  • If there is a linear relationship between two variables, linear regression allows you to use one variable to predict values of the other variable.
  • The least squares regression method minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation.
  • Least squares regression provides minimum-variance, mean-unbiased estimation when the errors have finite variances.

Key Terms

  • least squares regression: a statistical technique, based on fitting a straight line to the observed data. It is used for estimating changes in a dependent variable which is in a linear relationship with one or more independent variables
  • sum of squared errors: a mathematical approach to determining the dispersion of data points; found by squaring the distance between each data point and the line of best fit and then summing all of the squares
  • homoscedastic: having the property that all random variables in a sequence or vector have the same finite variance

Least Squares Regression

The process of fitting the best-fit line is called linear regression. Finding the best-fit line is based on the assumption that the data are scattered about a straight line. The criterion for the best-fit line is that the sum of squared errors (SSE) is made as small as possible. Any other potential line would have a higher SSE than the best-fit line. Therefore, this best-fit line is called the least squares regression line.
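As a sketch of the criterion itself, the code below (with made-up data points) computes the SSE for the least squares line and for a nearby candidate line; any line other than the least squares line has a larger SSE.

```python
# A sketch: the least squares regression line minimizes the sum of squared errors (SSE).
# The data points below are made up for illustration.
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.5])

def sse(m, b):
    """Sum of squared vertical distances from the points to the line y = m*x + b."""
    return float(np.sum((y - (m * x + b)) ** 2))

m_hat, b_hat = np.polyfit(x, y, deg=1)   # least squares slope and intercept
print("SSE of the least squares line:", sse(m_hat, b_hat))
print("SSE of a nearby candidate:    ", sse(m_hat + 0.3, b_hat - 0.5))  # always larger
```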

Here is a scatter plot that shows a correlation between ordinary test scores and final exam test scores for a statistics class:

Test Score Scatter Plot: This graph shows the various scattered data points of test scores.

The following figure shows how a best-fit line can be drawn through the scatter plot:

Best Fit Line: This shows how the scatter plots form a best-fit line, implying there may be correlation.

Ordinary Least Squares Regression

Ordinary Least Squares (OLS) regression (or simply “regression”) is a useful tool for examining the relationship between two or more interval/ratio variables assuming there is a linear relationship between said variables. If the relationship is not linear, OLS regression may not be the ideal tool for the analysis, or modifications to the variables/analysis may be required. If there is a linear relationship between two variables, you can use one variable to predict values of the other variable. For example, because there is a linear relationship between height and weight, if you know someone’s height, you can better estimate their weight. Using a basic line formula, you can calculate predicted values of your dependent variable using your independent variable, allowing you to make better predictions.

This method minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation. The resulting estimator can be expressed by a simple formula, especially in the case of a single regressor on the right-hand side. The OLS estimator is consistent when the regressors are exogenous and there is no perfect multicollinearity. It is considered optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance, mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed, OLS is the maximum likelihood estimator. OLS is used in fields such as economics (econometrics), political science, and electrical engineering (control theory and signal processing), among others.
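For the single-regressor case, the "simple formula" mentioned above amounts to estimating the slope as the ratio of the sample covariance of [latex]\text{x}[/latex] and [latex]\text{y}[/latex] to the sample variance of [latex]\text{x}[/latex], with the intercept chosen so the line passes through the point of means. The sketch below (with made-up data) computes these estimates directly and checks them against numpy's general least squares solver.

```python
# A sketch: closed-form OLS estimates for a single regressor, checked against
# numpy's general least squares solver. The data are made up for illustration.
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.5])

# Slope = sample covariance of x and y divided by sample variance of x;
# intercept makes the line pass through the point of means (x-bar, y-bar).
m_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_hat = y.mean() - m_hat * x.mean()
print("closed form: slope =", m_hat, "intercept =", b_hat)

# The same fit in design-matrix form: minimize ||y - X @ beta||^2.
X = np.column_stack([x, np.ones_like(x)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq:       slope =", beta[0], "intercept =", beta[1])
```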

Model Assumptions

Standard linear regression models with standard estimation techniques make a number of assumptions.

Learning Objectives

Contrast standard estimation techniques for standard linear regression

Key Takeaways

Key Points

  • There are five major assumptions made by standard linear regression models.
  • The arrangement, or probability distribution, of the predictor variables [latex]\text{x}[/latex] has a major influence on the precision of estimates of [latex]\beta[/latex].
  • Extensions of the major assumptions make the estimation procedure more complex and time-consuming, and may even require more data in order to get an accurate model.

Key Terms

  • exogeneity: a condition in linear regression wherein a predictor variable is determined outside the model and is independent of the error terms

Standard linear regression models with standard estimation techniques make a number of assumptions about the predictor variables, the response variables, and their relationship. Numerous extensions have been developed that allow each of these assumptions to be relaxed (i.e. reduced to a weaker form), and in some cases eliminated entirely. Some methods are general enough that they can relax multiple assumptions at once, and in other cases this can be achieved by combining different extensions. Generally, these extensions make the estimation procedure more complex and time-consuming, and may even require more data in order to get an accurate model.

The following are the major assumptions made by standard linear regression models with standard estimation techniques (e.g., ordinary least squares):

Weak exogeneity. This essentially means that the predictor variables [latex]\text{x}[/latex] can be treated as fixed values rather than random variables. This means, for example, that the predictor variables are assumed to be error-free; that is, they are not contaminated with measurement errors. Although this assumption is unrealistic in many settings, dropping it leads to significantly more difficult errors-in-variables models.

Linearity. This means that the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables. Note that this assumption is far less restrictive than it may at first seem. Because the predictor variables are treated as fixed values (see above), linearity is really only a restriction on the parameters. The predictor variables themselves can be arbitrarily transformed, and in fact multiple copies of the same underlying predictor variable can be added, each one transformed differently. This trick is used, for example, in polynomial regression, which uses linear regression to fit the response variable as an arbitrary polynomial function (up to a given degree) of a predictor variable. This makes linear regression an extremely powerful inference method. In fact, models such as polynomial regression are often “too powerful” in that they tend to overfit the data. As a result, some kind of regularization must typically be used to prevent unreasonable solutions coming out of the estimation process.
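To illustrate the transformation trick described above, the sketch below (with simulated data) fits a quadratic curve by ordinary linear least squares on the transformed predictors [latex]\text{x}[/latex] and [latex]\text{x}^2[/latex]; the model remains linear in its parameters.

```python
# A sketch: polynomial regression is linear regression on transformed predictors.
# The model is quadratic in x but still linear in the parameters. Data are simulated.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 40)
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(0, 0.5, x.size)

# Design matrix with columns 1, x, x^2: copies of the same predictor, transformed.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients (intercept, x, x^2):", beta)
```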

Constant variance (aka homoscedasticity). This means that different response variables have the same variance in their errors, regardless of the values of the predictor variables. In practice, this assumption is invalid (i.e., the errors are heteroscedastic) if the response variables can vary over a wide scale; typically, for example, a response variable whose mean is large will have a greater variance than one whose mean is small. To check for heterogeneous error variance, or for a pattern of residuals that violates the homoscedasticity assumption (error that is equally variable around the best-fitting line for all values of [latex]\text{x}[/latex]), it is prudent to look for a “fanning effect” between residual error and predicted values: a systematic change in the absolute or squared residuals when plotted against the predicted values, with error not evenly distributed across the regression line. Heteroscedasticity results in the distinguishable variances around the points being averaged into a single variance that inaccurately represents all the variances of the line. In effect, residuals appear clustered and spread apart on their predicted plots for larger and smaller values of points along the linear regression line, and the mean squared error for the model will be wrong.
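A simple numerical version of the “fanning effect” check is to compare the spread of residuals at small versus large predicted values; the sketch below simulates data whose error grows with [latex]\text{x}[/latex] (so the heteroscedasticity is built in) and shows the difference.

```python
# A sketch: checking for a "fanning effect" by comparing residual spread
# at small vs. large predicted values. The simulated errors grow with x by design.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = 3 + 2 * x + rng.normal(0, 0.5 * x)   # error standard deviation grows with x

m, b = np.polyfit(x, y, deg=1)
predicted = m * x + b
residuals = y - predicted

low = residuals[predicted < np.median(predicted)]
high = residuals[predicted >= np.median(predicted)]
print("residual std at small predicted values:", low.std())
print("residual std at large predicted values:", high.std())  # noticeably larger
```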

Independence of errors. This assumes that the errors of the response variables are uncorrelated with each other. (Actual statistical independence is a stronger condition than mere lack of correlation and is often not needed, although it can be exploited if it is known to hold.) Some methods (e.g., generalized least squares) are capable of handling correlated errors, although they typically require significantly more data unless some sort of regularization is used to bias the model towards assuming uncorrelated errors. Bayesian linear regression is a general way of handling this issue.

Lack of multicollinearity in the predictors. For standard least squares estimation methods, the design matrix [latex]\text{X}[/latex] must have full column rank [latex]\text{p}[/latex]; otherwise, we have a condition known as multicollinearity in the predictor variables. This can be triggered by having two or more perfectly correlated predictor variables (e.g., if the same predictor variable is mistakenly given twice, either without transforming one of the copies or by transforming one of the copies linearly). It can also happen if there is too little data available compared to the number of parameters to be estimated (e.g., fewer data points than regression coefficients).
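The full-column-rank condition can be checked directly on a design matrix; the sketch below (with simulated predictors) duplicates one column without transforming it and shows that the rank falls short of the number of columns.

```python
# A sketch: detecting perfect multicollinearity via the rank of the design matrix.
# The predictors are simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)

X_ok = np.column_stack([np.ones(50), x1, x2])
X_bad = np.column_stack([np.ones(50), x1, x2, x1])   # same predictor given twice

print("columns:", X_ok.shape[1], " rank:", np.linalg.matrix_rank(X_ok))    # full column rank
print("columns:", X_bad.shape[1], " rank:", np.linalg.matrix_rank(X_bad))  # rank deficient
```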

Beyond these assumptions, several other statistical properties of the data strongly influence the performance of different estimation methods. The statistical relationship between the error terms and the regressors plays an important role in determining whether an estimation procedure has desirable sampling properties such as being unbiased and consistent.

The arrangement, or probability distribution, of the predictor variables [latex]\text{x}[/latex] has a major influence on the precision of estimates of [latex]\beta[/latex]. Sampling and design of experiments are highly-developed subfields of statistics that provide guidance for collecting data in such a way as to achieve a precise estimate of [latex]\beta[/latex].

Simple Linear Regression: A graphical representation of a best-fit line for simple linear regression.

Making Inferences About the Slope

The slope of the best fit line tells us how the dependent variable [latex]\text{y}[/latex] changes for every one unit increase in the independent variable [latex]\text{x}[/latex], on average.

Learning Objectives

Infer how variables are related based on the slope of a regression line

Key Takeaways

Key Points

  • It is important to interpret the slope of the line in the context of the situation represented by the data.
  • A fitted linear regression model can be used to identify the relationship between a single predictor variable [latex]\text{x}[/latex] and the response variable [latex]\text{y}[/latex] when all the other predictor variables in the model are “held fixed”.
  • The interpretation of [latex]\text{m}[/latex] (slope) is the expected change in [latex]\text{y}[/latex] for a one-unit change in [latex]\text{x}[/latex] when the other covariates are held fixed.

Key Terms

  • slope: the ratio of the vertical and horizontal distances between two points on a line; zero if the line is horizontal, undefined if it is vertical.
  • covariate: a variable that is possibly predictive of the outcome under study
  • intercept: the coordinate of the point at which a curve intersects an axis

Making Inferences About the Slope

The slope of the regression line describes how changes in the variables are related. It is important to interpret the slope of the line in the context of the situation represented by the data. You should be able to write a sentence interpreting the slope in plain English.

The slope of the best fit line tells us how the dependent variable [latex]\text{y}[/latex] changes for every one unit increase in the independent variable [latex]\text{x}[/latex], on average.

Remember the equation for a line is:

[latex]\text{y} = \text{mx}+\text{b}[/latex]

where [latex]\text{y}[/latex] is the dependent variable, [latex]\text{x}[/latex] is the independent variable, [latex]\text{m}[/latex] is the slope, and [latex]\text{b}[/latex] is the intercept.

A fitted linear regression model can be used to identify the relationship between a single predictor variable, [latex]\text{x}[/latex], and the response variable, [latex]\text{y}[/latex], when all the other predictor variables in the model are “held fixed”. Specifically, the interpretation of [latex]\text{m}[/latex] is the expected change in [latex]\text{y}[/latex] for a one-unit change in [latex]\text{x}[/latex] when the other covariates are held fixed—that is, the expected value of the partial derivative of [latex]\text{y}[/latex] with respect to [latex]\text{x}[/latex]. This is sometimes called the unique effect of [latex]\text{x}[/latex] on [latex]\text{y}[/latex]. In contrast, the marginal effect of [latex]\text{x}[/latex] on [latex]\text{y}[/latex] can be assessed using a correlation coefficient or simple linear regression model relating [latex]\text{x}[/latex] to [latex]\text{y}[/latex]; this effect is the total derivative of [latex]\text{y}[/latex] with respect to [latex]\text{x}[/latex].
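The difference between the unique and marginal effects can be seen numerically. The sketch below simulates a predictor [latex]\text{x}[/latex] that is correlated with another covariate and compares the slope on [latex]\text{x}[/latex] from a simple regression (marginal effect) with its slope in a model that also includes the covariate (unique effect); the variable names and data are purely illustrative.

```python
# A sketch: marginal effect of x (simple regression) vs. its unique effect
# (its coefficient when a correlated covariate z is also in the model). Data are simulated.
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(scale=0.6, size=n)   # x is correlated with z
y = 2.0 * z + 0.5 * x + rng.normal(size=n)    # true unique effect of x is 0.5

# Marginal effect: regress y on x alone.
beta_simple, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)

# Unique effect: regress y on x while z is held in the model.
beta_full, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x, z]), y, rcond=None)

print("marginal slope on x:", beta_simple[1])  # absorbs part of z's effect, well above 0.5
print("unique slope on x:  ", beta_full[1])    # close to the true 0.5
```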

Care must be taken when interpreting regression results, as some of the regressors may not allow for marginal changes (such as dummy variables, or the intercept term), while others cannot be held fixed.

It is possible that the unique effect can be nearly zero even when the marginal effect is large. This may imply that some other covariate captures all the information in [latex]\text{x}[/latex], so that once that variable is in the model, there is no contribution of [latex]\text{x}[/latex] to the variation in [latex]\text{y}[/latex]. Conversely, the unique effect of [latex]\text{x}[/latex] can be large while its marginal effect is nearly zero. This would happen if the other covariates explained a great deal of the variation of [latex]\text{y}[/latex], but they mainly explain said variation in a way that is complementary to what is captured by [latex]\text{x}[/latex]. In this case, including the other variables in the model reduces the part of the variability of [latex]\text{y}[/latex] that is unrelated to [latex]\text{x}[/latex], thereby strengthening the apparent relationship with [latex]\text{x}[/latex].

The meaning of the expression “held fixed” may depend on how the values of the predictor variables arise. If the experimenter directly sets the values of the predictor variables according to a study design, the comparisons of interest may literally correspond to comparisons among units whose predictor variables have been “held fixed” by the experimenter. Alternatively, the expression “held fixed” can refer to a selection that takes place in the context of data analysis. In this case, we “hold a variable fixed” by restricting our attention to the subsets of the data that happen to have a common value for the given predictor variable. This is the only interpretation of “held fixed” that can be used in an observational study.

The notion of a “unique effect” is appealing when studying a complex system where multiple interrelated components influence the response variable. In some cases, it can literally be interpreted as the causal effect of an intervention that is linked to the value of a predictor variable. However, it has been argued that in many cases multiple regression analysis fails to clarify the relationships between the predictor variables and the response variables when the predictors are correlated with each other and are not assigned following a study design.

Regression Toward the Mean: Estimation and Prediction

Regression toward the mean says that if a variable is extreme on its first measurement, it will tend to be closer to the average on its second measurement.

Learning Objectives

Explain regression towards the mean for variables that are extreme on their first measurement

Key Takeaways

Key Points

  • The conditions under which regression toward the mean occurs depend on the way the term is mathematically defined.
  • Regression toward the mean is a significant consideration in the design of experiments.
  • Statistical regression toward the mean is not a causal phenomenon.

Key Terms

  • bivariate distribution: the joint distribution of two random variables, giving the probability that each falls in a particular range or discrete set of values

In statistics, regression toward (or to) the mean is the phenomenon that if a variable is extreme on its first measurement, it will tend to be closer to the average on its second measurement—and, paradoxically, if it is extreme on its second measurement, it will tend to be closer to the average on its first. To avoid making wrong inferences, regression toward the mean must be considered when designing scientific experiments and interpreting data.

The conditions under which regression toward the mean occurs depend on the way the term is mathematically defined. Sir Francis Galton first observed the phenomenon in the context of simple linear regression of data points. However, a less restrictive approach is possible. Regression towards the mean can be defined for any bivariate distribution with identical marginal distributions. Two such definitions exist. One definition accords closely with the common usage of the term “regression towards the mean”. Not all such bivariate distributions show regression towards the mean under this definition. However, all such bivariate distributions show regression towards the mean under the other definition.

Historically, what is now called regression toward the mean has also been called reversion to the mean and reversion to mediocrity.

Consider a simple example: a class of students takes a 100-item true/false test on a subject. Suppose that all students choose randomly on all questions. Then, each student’s score would be a realization of one of a set of independent and identically distributed random variables, with a mean of 50. Naturally, some students will score substantially above 50 and some substantially below 50 just by chance. If one takes only the top scoring 10% of the students and gives them a second test on which they again choose randomly on all items, the mean score would again be expected to be close to 50. Thus the mean of these students would “regress” all the way back to the mean of all students who took the original test. No matter what a student scores on the original test, the best prediction of his score on the second test is 50.
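This test example is easy to simulate. The sketch below generates pure-guessing scores on a 100-item true/false test for a large hypothetical class, selects the top 10% of first-test scorers, and shows that their mean on an independent second test falls back near 50.

```python
# A sketch: simulating regression toward the mean with pure guessing
# on a 100-item true/false test. Class size and seed are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_items = 10_000, 100

test1 = rng.binomial(n_items, 0.5, size=n_students)   # first test: pure guessing
test2 = rng.binomial(n_items, 0.5, size=n_students)   # independent second test

top = test1 >= np.quantile(test1, 0.9)                # roughly the top 10% on test 1
print("top scorers, mean on test 1:  ", test1[top].mean())   # well above 50
print("same students, mean on test 2:", test2[top].mean())   # back near 50
```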

If there were no luck or random guessing involved in the answers supplied by students to the test questions, then all students would score the same on the second test as they scored on the original test, and there would be no regression toward the mean.

Most realistic situations fall between these two extremes: for example, one might consider exam scores as a combination of skill and luck. In this case, the subset of students scoring above average would be composed of those who were skilled and had not especially bad luck, together with those who were unskilled, but were extremely lucky. On a retest of this subset, the unskilled will be unlikely to repeat their lucky break, while the skilled will have a second chance to have bad luck. Hence, those who did well previously are unlikely to do quite as well in the second test.

The following is a second example of regression toward the mean. A class of students takes two editions of the same test on two successive days. It has frequently been observed that the worst performers on the first day will tend to improve their scores on the second day, and the best performers on the first day will tend to do worse on the second day. The phenomenon occurs because student scores are determined in part by underlying ability and in part by chance. For the first test, some will be lucky, and score more than their ability, and some will be unlucky and score less than their ability. Some of the lucky students on the first test will be lucky again on the second test, but more of them will have (for them) average or below average scores. Therefore a student who was lucky on the first test is more likely to have a worse score on the second test than a better score. Similarly, students who score less than the mean on the first test will tend to see their scores increase on the second test.

Regression toward the mean is a significant consideration in the design of experiments.

The concept of regression toward the mean can be misused very easily. In the student test example above, it was assumed implicitly that what was being measured did not change between the two measurements. Suppose, however, that the course was pass/fail and students were required to score above 70 on both tests to pass. Then the students who scored under 70 the first time would have no incentive to do well, and might score worse on average the second time. The students just over 70, on the other hand, would have a strong incentive to study and concentrate while taking the test. In that case one might see movement away from 70: scores below it getting lower and scores above it getting higher. It is possible for changes between the measurement times to augment, offset, or reverse the statistical tendency to regress toward the mean.

Statistical regression toward the mean is not a causal phenomenon. A student with the worst score on the test on the first day will not necessarily increase her score substantially on the second day due to the effect. On average, the worst scorers improve, but that is only true because the worst scorers are more likely to have been unlucky than lucky. To the extent that a score is determined randomly, or that a score has random variation or error, as opposed to being determined by the student’s academic ability or being a “true value”, the phenomenon will have an effect.

Sir Francis Galton: Sir Francis Galton first observed the phenomenon of regression towards the mean in genetics research.