Multiple Regression

Multiple Regression Models

Multiple regression is used to find an equation that best predicts the [latex]\text{Y}[/latex] variable as a linear function of the multiple [latex]\text{X}[/latex] variables.

Learning Objectives

Describe how multiple regression can be used to predict an unknown [latex]\text{Y}[/latex] value based on a corresponding set of [latex]\text{X}[/latex] values or understand functional relationships between the dependent and independent variables.

Key Takeaways

Key Points

  • One use of multiple regression is prediction or estimation of an unknown [latex]\text{Y}[/latex] value corresponding to a set of [latex]\text{X}[/latex] values.
  • A second use of multiple regression is to try to understand the functional relationships between the dependent and independent variables, to try to see what might be causing the variation in the dependent variable.
  • The main null hypothesis of a multiple regression is that there is no relationship between the [latex]\text{X}[/latex] variables and the [latex]\text{Y}[/latex] variable; i.e., that the fit of the observed [latex]\text{Y}[/latex] values to those predicted by the multiple regression equation is no better than what you would expect by chance.

Key Terms

  • multiple regression: regression model used to find an equation that best predicts the [latex]\text{Y}[/latex] variable as a linear function of multiple [latex]\text{X}[/latex] variables
  • null hypothesis: A hypothesis set up to be refuted in order to support an alternative hypothesis; presumed true until statistical evidence in the form of a hypothesis test indicates otherwise.

When To Use Multiple Regression

You use multiple regression when you have three or more measurement variables. One of the measurement variables is the dependent ([latex]\text{Y}[/latex]) variable. The rest of the variables are the independent ([latex]\text{X}[/latex]) variables. The purpose of a multiple regression is to find an equation that best predicts the [latex]\text{Y}[/latex] variable as a linear function of the [latex]\text{X}[/latex] variables.

Multiple Regression For Prediction

One use of multiple regression is prediction or estimation of an unknown [latex]\text{Y}[/latex] value corresponding to a set of [latex]\text{X}[/latex] values. For example, let’s say you’re interested in finding a suitable habitat to reintroduce the rare beach tiger beetle, Cicindela dorsalis dorsalis, which lives on sandy beaches on the Atlantic coast of North America. You’ve gone to a number of beaches that already have the beetles and measured the density of tiger beetles (the dependent variable) and several biotic and abiotic factors, such as wave exposure, sand particle size, beach steepness, density of amphipods and other prey organisms, etc. Multiple regression would give you an equation that would relate the tiger beetle density to a function of all the other variables. Then, if you went to a beach that didn’t have tiger beetles and measured all the independent variables (wave exposure, sand particle size, etc.), you could use the multiple regression equation to predict the density of tiger beetles that could live there if you introduced them.
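
A minimal sketch of this prediction workflow in Python, using the statsmodels library and made-up survey numbers (the variable names, such as wave_exposure, and all values are illustrative, not from an actual beetle study):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical measurements from beaches that already have tiger beetles
surveyed = pd.DataFrame({
    "beetle_density":   [12.1, 8.4, 15.3, 6.2, 10.8, 13.5],   # dependent (Y) variable
    "wave_exposure":    [3.2, 5.1, 2.8, 6.0, 4.0, 3.0],
    "sand_size_mm":     [0.45, 0.60, 0.40, 0.70, 0.55, 0.42],
    "amphipod_density": [110, 80, 140, 60, 95, 120],
})

# Fit beetle density as a linear function of the X variables
model = smf.ols(
    "beetle_density ~ wave_exposure + sand_size_mm + amphipod_density",
    data=surveyed,
).fit()

# Predict the density a candidate (currently beetle-free) beach could support
candidate = pd.DataFrame({
    "wave_exposure": [3.5], "sand_size_mm": [0.50], "amphipod_density": [100]
})
print(model.predict(candidate))
```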


Atlantic Beach Tiger Beetle: This is the Atlantic beach tiger beetle (Cicindela dorsalis dorsalis), which is the subject of the multiple regression example in this section.

Multiple Regression For Understanding Causes

A second use of multiple regression is to try to understand the functional relationships between the dependent and independent variables, to try to see what might be causing the variation in the dependent variable. For example, if you did a regression of tiger beetle density on sand particle size by itself, you would probably see a significant relationship. If you did a regression of tiger beetle density on wave exposure by itself, you would probably see a significant relationship. However, sand particle size and wave exposure are correlated; beaches with bigger waves tend to have bigger sand particles. Maybe sand particle size is really important, and the correlation between it and wave exposure is the only reason for a significant regression between wave exposure and beetle density. Multiple regression is a statistical way to try to control for this; it can answer questions like, “If sand particle size (and every other measured variable) were the same, would the regression of beetle density on wave exposure be significant?”

Null Hypothesis

The main null hypothesis of a multiple regression is that there is no relationship between the [latex]\text{X}[/latex] variables and the [latex]\text{Y}[/latex] variable; in other words, that the fit of the observed [latex]\text{Y}[/latex] values to those predicted by the multiple regression equation is no better than what you would expect by chance. As you are doing a multiple regression, there is also a null hypothesis for each [latex]\text{X}[/latex] variable, meaning that adding that [latex]\text{X}[/latex] variable to the multiple regression does not improve the fit of the multiple regression equation any more than expected by chance.
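
Both kinds of null hypothesis can be read off a fitted model: the overall [latex]\text{F}[/latex]-test addresses the first, and the per-coefficient tests address the second. A minimal sketch with simulated data (variable names y, x1, x2 are placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df["y"] = 2.0 + 1.5 * df["x1"] + rng.normal(size=50)   # x2 has no real effect here

fit = smf.ols("y ~ x1 + x2", data=df).fit()

# Overall null hypothesis: no relationship between the X variables and Y
print("overall F-test p-value:", fit.f_pvalue)

# Per-variable null hypotheses: adding this X does not improve the fit
print(fit.pvalues)   # one p-value per term, including the intercept
```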

Estimating and Making Inferences About the Slope

The purpose of a multiple regression is to find an equation that best predicts the [latex]\text{Y}[/latex] variable as a linear function of the [latex]\text{X}[/latex] variables.

Learning Objectives

Discuss how partial regression coefficients (slopes) allow us to predict the value of [latex]\text{Y}[/latex] given measured [latex]\text{X}[/latex] values.

Key Takeaways

Key Points

  • Partial regression coefficients (the slopes) and the intercept are found when creating the regression equation so that they minimize the squared deviations between the expected and observed values of [latex]\text{Y}[/latex].
  • If you had the partial regression coefficients and measured the [latex]\text{X}[/latex] variables, you could plug them into the equation and predict the corresponding value of [latex]\text{Y}[/latex].
  • The standard partial regression coefficient is the number of standard deviations that [latex]\text{Y}[/latex] would change for every one standard deviation change in [latex]\text{X}_1[/latex], if all the other [latex]\text{X}[/latex] variables could be kept constant.

Key Terms

  • standard partial regression coefficient: the number of standard deviations that [latex]\text{Y}[/latex] would change for every one standard deviation change in [latex]\text{X}_1[/latex], if all the other [latex]\text{X}[/latex] variables could be kept constant
  • partial regression coefficient: a value indicating the effect of each independent variable on the dependent variable with the influence of all the remaining variables held constant. Each coefficient is the slope between the dependent variable and each of the independent variables
  • p-value: The probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.

You use multiple regression when you have three or more measurement variables. One of the measurement variables is the dependent ([latex]\text{Y}[/latex]) variable. The rest of the variables are the independent ([latex]\text{X}[/latex]) variables. The purpose of a multiple regression is to find an equation that best predicts the [latex]\text{Y}[/latex] variable as a linear function of the [latex]\text{X}[/latex] variables.

How It Works

The basic idea is that an equation is found like this:

[latex]\text{Y}_{\text{exp}} = \text{a}+ \text{b}_1\text{X}_1 + \text{b}_2\text{X}_2 + \text{b}_3\text{X}_3 + \cdots[/latex]

The [latex]\text{Y}_{\text{exp}}[/latex] is the expected value of [latex]\text{Y}[/latex] for a given set of [latex]\text{X}[/latex] values. [latex]\text{b}_1[/latex] is the estimated slope of a regression of [latex]\text{Y}[/latex] on [latex]\text{X}_1[/latex], if all of the other [latex]\text{X}[/latex] variables could be kept constant. This concept applies similarly for [latex]\text{b}_2[/latex], [latex]\text{b}_3[/latex], et cetera. [latex]\text{a}[/latex] is the intercept. Values of [latex]\text{b}_1[/latex], et cetera (the “partial regression coefficients”) and the intercept are found so that they minimize the squared deviations between the expected and observed values of [latex]\text{Y}[/latex].
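
A minimal sketch of estimating the intercept and partial regression coefficients by minimizing the squared deviations, using numpy's least-squares solver on simulated data (the true coefficient values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))                                   # three X variables
y = 4.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Add a column of ones so the intercept a is estimated along with b1, b2, b3
design = np.column_stack([np.ones(n), X])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)            # minimizes squared deviations

a, b1, b2, b3 = coefs
y_exp = design @ coefs            # expected Y for each observed set of X values
print(a, b1, b2, b3)
```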

How well the equation fits the data is expressed by [latex]\text{R}^2[/latex], the “coefficient of multiple determination.” This can range from 0 (for no relationship between the [latex]\text{X}[/latex] and [latex]\text{Y}[/latex] variables) to 1 (for a perfect fit, i.e., no difference between the observed and expected [latex]\text{Y}[/latex] values). The [latex]\text{p}[/latex]-value is a function of the [latex]\text{R}^2[/latex], the number of observations, and the number of [latex]\text{X}[/latex] variables.

Importance of Slope (Partial Regression Coefficients)

When the purpose of multiple regression is prediction, the important result is an equation containing partial regression coefficients (slopes). If you had the partial regression coefficients and measured the [latex]\text{X}[/latex] variables, you could plug them into the equation and predict the corresponding value of [latex]\text{Y}[/latex]. The magnitude of the partial regression coefficient depends on the unit used for each variable. It does not tell you anything about the relative importance of each variable.

When the purpose of multiple regression is understanding functional relationships, the important result is an equation containing standard partial regression coefficients, like this:

[latex]\text{y}'_{\text{exp}} = \text{a} + \text{b}'_1\text{x}'_1 + \text{b}'_2\text{x}'_2 + \text{b}'_3\text{x}'_3 + \cdots[/latex]

where [latex]\text{b}'_1[/latex] is the standard partial regression coefficient of [latex]\text{Y}[/latex] on [latex]\text{X}_1[/latex]. It is the number of standard deviations that [latex]\text{Y}[/latex] would change for every one standard deviation change in [latex]\text{X}_1[/latex], if all the other [latex]\text{X}[/latex] variables could be kept constant. The magnitude of the standard partial regression coefficients tells you something about the relative importance of different variables; [latex]\text{X}[/latex] variables with bigger standard partial regression coefficients have a stronger relationship with the [latex]\text{Y}[/latex] variable.
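
One common way to obtain standard partial regression coefficients is to z-score every variable before fitting, so that each slope is expressed in standard deviations. A minimal sketch on simulated data (assuming this standardization approach; column names are placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(80, 3)), columns=["x1", "x2", "x3"])
df["y"] = 3.0 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=80)

# z-score every variable, then refit: the slopes are now the standard
# partial regression coefficients b'_1, b'_2, b'_3
z = (df - df.mean()) / df.std()
fit = sm.OLS(z["y"], sm.add_constant(z[["x1", "x2", "x3"]])).fit()
print(fit.params)   # a larger |b'| suggests a stronger relationship with Y
```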


Linear Regression: A graphical representation of a best fit line for simple linear regression.

Evaluating Model Utility

The results of multiple regression should be viewed with caution.

Learning Objectives

Evaluate the potential drawbacks of multiple regression.

Key Takeaways

Key Points

  • You should examine the linear regression of the dependent variable on each independent variable, one at a time, examine the linear regressions between each pair of independent variables, and consider what you know about the subject matter.
  • You should probably treat multiple regression as a way of suggesting patterns in your data, rather than rigorous hypothesis testing.
  • If independent variables [latex]\text{A}[/latex] and [latex]\text{B}[/latex] are both correlated with [latex]\text{Y}[/latex], and [latex]\text{A}[/latex] and [latex]\text{B}[/latex] are highly correlated with each other, only one may contribute significantly to the model, but it would be incorrect to blindly conclude that the variable that was dropped from the model has no significance.

Key Terms

  • independent variable: in an equation, any variable whose value is not dependent on any other in the equation
  • dependent variable: in an equation, the variable whose value depends on one or more variables in the equation
  • multiple regression: regression model used to find an equation that best predicts the [latex]\text{Y}[/latex] variable as a linear function of multiple [latex]\text{X}[/latex] variables

Multiple regression is beneficial in some respects, since it can show the relationships between more than just two variables; however, it should not always be taken at face value.

It is easy to throw a big data set at a multiple regression and get an impressive-looking output. But many people are skeptical of the usefulness of multiple regression, especially for variable selection, and you should view the results with caution. You should examine the linear regression of the dependent variable on each independent variable, one at a time, examine the linear regressions between each pair of independent variables, and consider what you know about the subject matter. You should probably treat multiple regression as a way of suggesting patterns in your data, rather than rigorous hypothesis testing.

If independent variables [latex]\text{A}[/latex] and [latex]\text{B}[/latex] are both correlated with [latex]\text{Y}[/latex], and [latex]\text{A}[/latex] and [latex]\text{B}[/latex] are highly correlated with each other, only one may contribute significantly to the model, but it would be incorrect to blindly conclude that the variable that was dropped from the model has no biological importance. For example, let’s say you did a multiple regression on vertical leap in children five to twelve years old, with height, weight, age, and score on a reading test as independent variables. All four independent variables are highly correlated in children, since older children are taller, heavier, and more literate, so it’s possible that once you’ve added weight and age to the model, there is so little variation left that the effect of height is not significant. It would be biologically silly to conclude that height had no influence on vertical leap. Because reading ability is correlated with age, it’s possible that it would contribute significantly to the model; this might suggest some interesting follow-up experiments on children all of the same age, but it would be unwise to conclude that there was a real effect of reading ability on vertical leap based solely on the multiple regression.


Linear Regression: Random data points and their linear regression.

Using the Model for Estimation and Prediction

Standard multiple regression involves several independent variables predicting the dependent variable.

Learning Objectives

Analyze the predictive value of multiple regression in terms of the overall model and how well each independent variable predicts the dependent variable.

Key Takeaways

Key Points

  • In addition to telling us the predictive value of the overall model, standard multiple regression tells us how well each independent variable predicts the dependent variable, controlling for each of the other independent variables.
  • Significance levels of 0.05 or lower are typically considered significant, and significance levels between 0.05 and 0.10 would be considered marginal.
  • An independent variable that is a significant predictor of a dependent variable in simple linear regression may not be significant in multiple regression.

Key Terms

  • significance level: A measure of how likely it is to draw a false conclusion in a statistical test, when the results are really just random variations.
  • multiple regression: regression model used to find an equation that best predicts the [latex]\text{Y}[/latex] variable as a linear function of multiple [latex]\text{X}[/latex] variables

Using Multiple Regression for Prediction

Standard multiple regression is the same idea as simple linear regression, except now we have several independent variables predicting the dependent variable. Imagine that we wanted to predict a person’s height from the person’s gender and weight. We would use standard multiple regression in which gender and weight would be the independent variables and height would be the dependent variable. The resulting output would tell us a number of things. First, it would tell us how much of the variance in height is accounted for by the joint predictive power of knowing a person’s weight and gender. This value is denoted by [latex]\text{R}^2[/latex]. The output would also tell us if the model allows the prediction of a person’s height at a rate better than chance. This is denoted by the significance level of the model. Within the social sciences, a significance level of 0.05 is often considered the standard for what is acceptable. Therefore, in our example, if the statistic is 0.05 (or less), then the model is considered significant. In other words, there is only a 5 in 100 chance (or less) that there really is not a relationship between height, weight, and gender. If the significance level is between 0.05 and 0.10, then the model is considered marginal. In other words, the model is fairly good at predicting a person’s height, but there is between a 5% and 10% probability that there really is not a relationship between height, weight, and gender.
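
A minimal sketch of this height example with simulated data, where gender enters the model as a 0/1 dummy variable (all numbers and the coding of the dummy are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 200
female = rng.integers(0, 2, size=n)                         # 0 = male, 1 = female (arbitrary)
weight = rng.normal(75, 12, size=n) - 8 * female            # kg
height = 150 + 0.30 * weight - 6 * female + rng.normal(0, 5, size=n)   # cm
df = pd.DataFrame({"height": height, "weight": weight, "female": female})

fit = smf.ols("height ~ weight + female", data=df).fit()
print("R-squared:", fit.rsquared)        # variance in height jointly accounted for
print("model p-value:", fit.f_pvalue)    # <= 0.05 significant; 0.05-0.10 marginal
print("weight slope:", fit.params["weight"])   # its sign gives the direction
```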

In addition to telling us the predictive value of the overall model, standard multiple regression tells us how well each independent variable predicts the dependent variable, controlling for each of the other independent variables. In our example, the regression analysis would tell us how well weight predicts a person’s height, controlling for gender, as well as how well gender predicts a person’s height, controlling for weight.

To see if weight is a “significant” predictor of height, we would look at the significance level associated with weight. Again, significance levels of 0.05 or lower would be considered significant, and significance levels between 0.05 and 0.10 would be considered marginal. Once we have determined that weight is a significant predictor of height, we would want to more closely examine the relationship between the two variables. In other words, is the relationship positive or negative? In this example, we would expect that there would be a positive relationship: the greater a person’s weight, the greater the height. (A negative relationship is present in the case in which the greater a person’s weight, the shorter the height.) We can determine the direction of the relationship between weight and height by looking at the regression coefficient associated with weight.

A similar procedure shows us how well gender predicts height. As with weight, we would check to see if gender is a significant predictor of height, controlling for weight. The difference comes when determining the exact nature of the relationship between gender and height. That is, it does not make sense to talk about the effect on height as gender increases or decreases, since gender is not a continuous variable.

Conclusion

As mentioned, the significance levels given for each independent variable indicate whether that particular independent variable is a significant predictor of the dependent variable, over and above the other independent variables. Because of this, an independent variable that is a significant predictor of a dependent variable in simple linear regression may not be significant in multiple regression (i.e., when other independent variables are added into the equation). This could happen because the covariance that the first independent variable shares with the dependent variable could overlap with the covariance that is shared between the second independent variable and the dependent variable. Consequently, the first independent variable is no longer uniquely predictive and would not be considered significant in multiple regression. Because of this, it is possible to get a highly significant [latex]\text{R}^2[/latex], but have none of the independent variables be significant.
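
This situation is easy to reproduce with two nearly identical predictors: the overall fit can be highly significant while neither slope is individually significant. A minimal simulated sketch (the exact p-values will vary with the random seed):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)     # nearly a copy of x1
y = x1 + rng.normal(scale=0.5, size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.rsquared, fit.f_pvalue)   # strong, highly significant overall fit
print(fit.pvalues[1:])              # yet each slope may fail to reach significance
```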


Multiple Regression: This image shows data points and their linear regression. Multiple regression is the same idea as single regression, except that we deal with more than one independent variable predicting the dependent variable.

Interaction Models

In regression analysis, an interaction may arise when considering the relationship among three or more variables.

Learning Objectives

Outline the problems that can arise when the simultaneous influence of two variables on a third is not additive.

Key Takeaways

Key Points

  • If two variables of interest interact, the relationship between each of the interacting variables and a third “dependent variable” depends on the value of the other interacting variable.
  • In practice, the presence of interacting variables makes it more difficult to predict the consequences of changing the value of a variable, particularly if the variables it interacts with are hard to measure or difficult to control.
  • The interaction between an explanatory variable and an environmental variable suggests that the effect of the explanatory variable has been moderated or modified by the environmental variable.

Key Terms

  • interaction variable: A variable constructed from an original set of variables to try to represent either all of the interaction present or some part of it.

In statistics, an interaction may arise when considering the relationship among three or more variables, and describes a situation in which the simultaneous influence of two variables on a third is not additive. Most commonly, interactions are considered in the context of regression analyses.

The presence of interactions can have important implications for the interpretation of statistical models. If two variables of interest interact, the relationship between each of the interacting variables and a third “dependent variable” depends on the value of the other interacting variable. In practice, this makes it more difficult to predict the consequences of changing the value of a variable, particularly if the variables it interacts with are hard to measure or difficult to control.

The notion of “interaction” is closely related to that of “moderation” that is common in social and health science research: the interaction between an explanatory variable and an environmental variable suggests that the effect of the explanatory variable has been moderated or modified by the environmental variable.

Interaction Variables in Modeling

An interaction variable is a variable constructed from an original set of variables in order to represent either all of the interaction present or some part of it. In exploratory statistical analyses, it is common to use products of original variables as the basis of testing whether interaction is present with the possibility of substituting other more realistic interaction variables at a later stage. When there are more than two explanatory variables, several interaction variables are constructed, with pairwise-products representing pairwise-interactions and higher order products representing higher order interactions.
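
A minimal sketch of constructing a pairwise-product interaction variable and including it in a regression, using the statsmodels formula interface (variable names are placeholders; in formula notation x1 * x2 expands to the two main effects plus their product x1:x2):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
# Non-additive effect: the slope of x1 depends on the value of x2
df["y"] = (1 + 2 * df["x1"] + 0.5 * df["x2"]
           + 1.5 * df["x1"] * df["x2"] + rng.normal(size=100))

# "x1 * x2" expands to x1 + x2 + x1:x2 (main effects plus the product term)
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params["x1:x2"], fit.pvalues["x1:x2"])   # estimated interaction and its p-value
```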

A simple setting in which interactions can arise is a two-factor experiment analyzed using Analysis of Variance (ANOVA). Suppose we have two binary factors [latex]\text{A}[/latex] and [latex]\text{B}[/latex]. For example, these factors might indicate whether either of two treatments were administered to a patient, with the treatments applied either singly, or in combination. We can then consider the average treatment response (e.g. the symptom levels following treatment) for each patient, as a function of the treatment combination that was administered. The following table shows one possible situation:


Interaction Model 1: A table showing no interaction between the two treatments — their effects are additive.

In this example, there is no interaction between the two treatments — their effects are additive. The reason for this is that the difference in mean response between those subjects receiving treatment [latex]\text{A}[/latex] and those not receiving treatment [latex]\text{A}[/latex] is [latex]-2[/latex], regardless of whether treatment [latex]\text{B}[/latex] is administered ([latex]-2 = 4-6[/latex]) or not ([latex]-2 = 5-7[/latex]). Note: it automatically follows that the difference in mean response between those subjects receiving treatment [latex]\text{B}[/latex] and those not receiving treatment [latex]\text{B}[/latex] is the same, regardless of whether treatment [latex]\text{A}[/latex] is administered ([latex]-1 = 4-5 = 6-7[/latex]).


Interaction Model 2: A table showing an interaction between the treatments — their effects are not additive.

In contrast, if the average responses shown in the second table are observed, then there is an interaction between the treatments — their effects are not additive. Supposing that greater numbers correspond to a better response, in this situation treatment [latex]\text{B}[/latex] is helpful on average if the subject is not also receiving treatment [latex]\text{A}[/latex], but is more helpful on average if given in combination with treatment [latex]\text{A}[/latex]. Treatment [latex]\text{A}[/latex] is helpful on average regardless of whether treatment [latex]\text{B}[/latex] is also administered, but it is more helpful in both absolute and relative terms if given alone, rather than in combination with treatment [latex]\text{B}[/latex].

Polynomial Regression

The goal of polynomial regression is to model a non-linear relationship between the independent and dependent variables.

Learning Objectives

Explain how the linear and nonlinear aspects of polynomial regression make it a special case of multiple linear regression.

Key Takeaways

Key Points

  • Polynomial regression is a higher order form of linear regression in which the relationship between the independent variable [latex]\text{x}[/latex] and the dependent variable [latex]\text{y}[/latex] is modeled as an [latex]\text{n}[/latex]th order polynomial.
  • Polynomial regression models are usually fit using the method of least squares.
  • Although polynomial regression is technically a special case of multiple linear regression, the interpretation of a fitted polynomial regression model requires a somewhat different perspective.

Key Terms

  • least squares: a standard approach to find the equation of regression that minimizes the sum of the squares of the errors made in the results of every single equation
  • polynomial regression: a higher order form of linear regression in which the relationship between the independent variable [latex]\text{x}[/latex] and the dependent variable [latex]\text{y}[/latex] is modeled as an [latex]\text{n}[/latex]th order polynomial
  • orthogonal: statistically independent, with reference to variates

Polynomial Regression

Polynomial regression is a higher order form of linear regression in which the relationship between the independent variable [latex]\text{x}[/latex] and the dependent variable [latex]\text{y}[/latex] is modeled as an [latex]\text{n}[/latex]th order polynomial. Polynomial regression fits a nonlinear relationship between the value of [latex]\text{x}[/latex] and the corresponding conditional mean of [latex]\text{y}[/latex], denoted [latex]\text{E}(\text{y}\ | \ \text{x})[/latex], and has been used to describe nonlinear phenomena such as the growth rate of tissues, the distribution of carbon isotopes in lake sediments, and the progression of disease epidemics. Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function [latex]\text{E}(\text{y}\ | \ \text{x})[/latex] is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.
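
A minimal sketch showing why polynomial regression is linear in the unknown parameters: the design matrix simply holds powers of [latex]\text{x}[/latex], and ordinary least squares estimates the coefficients (data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=100)
y = 1 + 2 * x - 3 * x**2 + rng.normal(scale=0.1, size=100)

# Columns 1, x, x^2: nonlinear in x, but linear in the unknown coefficients
design = np.column_stack([np.ones_like(x), x, x**2])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coefs)                       # intercept and the two polynomial coefficients

# Equivalent convenience call (returns the highest-order coefficient first)
print(np.polyfit(x, y, deg=2))
```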

History

Polynomial regression models are usually fit using the method of least-squares. The least-squares method minimizes the variance of the unbiased estimators of the coefficients, under the conditions of the Gauss–Markov theorem. The least-squares method was published in 1805 by Legendre and in 1809 by Gauss. The first design of an experiment for polynomial regression appeared in an 1815 paper of Gergonne. In the 20th century, polynomial regression played an important role in the development of regression analysis, with a greater emphasis on issues of design and inference. More recently, the use of polynomial models has been complemented by other methods, with non-polynomial models having advantages for some classes of problems.

Interpretation

Although polynomial regression is technically a special case of multiple linear regression, the interpretation of a fitted polynomial regression model requires a somewhat different perspective. It is often difficult to interpret the individual coefficients in a polynomial regression fit, since the underlying monomials can be highly correlated. For example, [latex]\text{x}[/latex] and [latex]\text{x}^2[/latex] have correlation around 0.97 when [latex]\text{x}[/latex] is uniformly distributed on the interval [latex](0, 1)[/latex]. Although the correlation can be reduced by using orthogonal polynomials, it is generally more informative to consider the fitted regression function as a whole. Point-wise or simultaneous confidence bands can then be used to provide a sense of the uncertainty in the estimate of the regression function.
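
The quoted correlation between [latex]\text{x}[/latex] and [latex]\text{x}^2[/latex] for [latex]\text{x}[/latex] uniform on [latex](0, 1)[/latex] can be checked numerically with a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=100_000)
print(np.corrcoef(x, x**2)[0, 1])   # approximately 0.97
```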

Alternative Approaches

Polynomial regression is one example of regression analysis using basis functions to model a functional relationship between two quantities. A drawback of polynomial bases is that the basis functions are “non-local,” meaning that the fitted value of [latex]\text{y}[/latex] at a given value [latex]\text{x}=\text{x}_0[/latex] depends strongly on data values with [latex]\text{x}[/latex] far from [latex]\text{x}_0[/latex]. In modern statistics, polynomial basis-functions are used along with new basis functions, such as splines, radial basis functions, and wavelets. These families of basis functions offer a more parsimonious fit for many types of data.

The goal of polynomial regression is to model a non-linear relationship between the independent and dependent variables (technically, between the independent variable and the conditional mean of the dependent variable). This is similar to the goal of non-parametric regression, which aims to capture non-linear regression relationships. Therefore, non-parametric regression approaches such as smoothing can be useful alternatives to polynomial regression. Some of these methods make use of a localized form of classical polynomial regression. An advantage of traditional polynomial regression is that the inferential framework of multiple regression can be used.


Polynomial Regression: A cubic polynomial regression fit to a simulated data set.

Qualitative Variable Models

Dummy, or qualitative, variables often act as independent variables in regression and affect the results of the dependent variable.

Learning Objectives

Break down the method of inserting a dummy variable into a regression analysis in order to compensate for the effects of a qualitative variable.

Key Takeaways

Key Points

  • In regression analysis, the dependent variables may be influenced not only by quantitative variables (income, output, prices, etc.), but also by qualitative variables (gender, religion, geographic region, etc.).
  • A dummy variable (also known as a categorical variable, or qualitative variable) is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.
  • One type of ANOVA model, applicable when dealing with qualitative variables, is a regression model in which the dependent variable is quantitative in nature but all the explanatory variables are dummies (qualitative in nature).
  • Qualitative regressors, or dummies, can have interaction effects between each other, and these interactions can be depicted in the regression model.

Key Terms

  • qualitative variable: Also known as categorical variable; has no natural sense of ordering; takes on names or labels.
  • ANOVA Model: Analysis of variance model; used to analyze the differences between group means and their associated procedures in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation.

In statistics, particularly in regression analysis, a dummy variable (also known as a categorical variable, or qualitative variable) is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. Dummy variables are used as devices to sort data into mutually exclusive categories (such as smoker/non-smoker, etc.).

Dummy variables are “proxy” variables, or numeric stand-ins for qualitative facts in a regression model. In regression analysis, the dependent variables may be influenced not only by quantitative variables (income, output, prices, etc.), but also by qualitative variables (gender, religion, geographic region, etc.). A dummy independent variable (also called a dummy explanatory variable) that has a value of 0 for some observation will cause that variable’s coefficient to have no role in influencing the dependent variable, while when the dummy takes on a value of 1, its coefficient acts to alter the intercept.

For example, if gender is one of the qualitative variables relevant to a regression, then the categories included under the gender variable would be female and male. If female is arbitrarily assigned the value of 1, then male would get the value 0. The intercept (the value of the dependent variable if all other explanatory variables hypothetically took on the value zero) would be the constant term for males but would be the constant term plus the coefficient of the gender dummy in the case of females.
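
A minimal sketch of such a gender dummy shifting the intercept, using simulated wage-style data (the response variable, coefficient values, and coding of 1 = female are all illustrative assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 300
female = rng.integers(0, 2, size=n)           # dummy: 1 = female, 0 = male (arbitrary coding)
experience = rng.uniform(0, 20, size=n)
wage = 12 + 0.8 * experience - 1.5 * female + rng.normal(scale=2, size=n)
df = pd.DataFrame({"wage": wage, "experience": experience, "female": female})

fit = smf.ols("wage ~ experience + female", data=df).fit()
# Intercept = constant term for males; intercept + dummy coefficient = constant
# term for females, i.e. the dummy shifts the intercept
print(fit.params["Intercept"], fit.params["Intercept"] + fit.params["female"])
```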

ANOVA Models

Analysis of variance (ANOVA) models are a collection of statistical models used to analyze the differences between group means and their associated procedures (such as “variation” among and between groups). One type of ANOVA model, applicable when dealing with qualitative variables, is a regression model in which the dependent variable is quantitative in nature but all the explanatory variables are dummies (qualitative in nature).

This type of ANOVA model can have differing numbers of qualitative variables. An example with one qualitative variable might be if we wanted to run a regression to find out if the average annual salary of public school teachers differs among three geographical regions in a country. An example with two qualitative variables might be if hourly wages were explained in terms of the qualitative variables marital status (married/unmarried) and geographical region (North/non-North).


ANOVA Model: Graph showing the regression results of the ANOVA model example: Average annual salaries of public school teachers in 3 regions of a country.

Qualitative regressors, or dummies, can have interaction effects between each other, and these interactions can be depicted in the regression model. For example, in a regression involving determination of wages, if two qualitative variables are considered, namely, gender and marital status, there could be an interaction between marital status and gender.

Models with Both Quantitative and Qualitative Variables

A regression model that contains a mixture of quantitative and qualitative variables is called an Analysis of Covariance (ANCOVA) model.

Learning Objectives

Demonstrate how to conduct an Analysis of Covariance, its assumptions, and its use in regression models containing a mixture of quantitative and qualitative variables.

Key Takeaways

Key Points

  • ANCOVA is a general linear model which blends ANOVA and regression. It evaluates whether population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV), while statistically controlling for the effects of covariates (CV).
  • ANCOVA can be used to increase statistical power and to adjust for preexisting differences in nonequivalent (intact) groups.
  • There are five assumptions that underlie the use of ANCOVA and affect interpretation of the results: normality of residuals, homogeneity of variances, homogeneity of regression slopes, linearity of regression, and independence of error terms.
  • When conducting ANCOVA, one should: test multicollinearity, test the homogeneity of variance assumption, test the homogeneity of regression slopes assumption, run ANCOVA analysis, and run follow-up analyses.

Key Terms

  • ANOVA Model: Analysis of variance; used to analyze the differences between group means and their associated procedures (such as “variation” among and between groups), in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation.
  • covariance: A measure of how much two random variables change together.
  • concomitant: Happening at the same time as something else, especially because one thing is related to or causes the other (i.e., concurrent).
  • ANCOVA model: Analysis of covariance; a general linear model which blends ANOVA and regression; evaluates whether population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV), while statistically controlling for the effects of other continuous variables that are not of primary interest, known as covariates.

A regression model that contains a mixture of both quantitative and qualitative variables is called an Analysis of Covariance (ANCOVA) model. ANCOVA models are extensions of ANOVA models: they statistically control for the effects of quantitative explanatory variables (also called covariates or control variables).

Covariance is a measure of how much two variables change together and how strong the relationship is between them. Analysis of covariance (ANCOVA) is a general linear model which blends ANOVA and regression. ANCOVA evaluates whether population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV), while statistically controlling for the effects of other continuous variables that are not of primary interest, known as covariates (CV). Therefore, when performing ANCOVA, we are adjusting the DV means to what they would be if all groups were equal on the CV.
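
A minimal ANCOVA sketch: a categorical IV (treatment group) plus a continuous covariate in a single linear model, using the statsmodels formula interface on simulated data (group labels, effect sizes, and the covariate are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 90
group = rng.choice(["control", "drug_a", "drug_b"], size=n)   # qualitative IV
pretest = rng.normal(50, 10, size=n)                          # covariate (CV)
effect = {"control": 0.0, "drug_a": 4.0, "drug_b": 7.0}
outcome = (10 + 0.6 * pretest
           + np.array([effect[g] for g in group])
           + rng.normal(scale=3, size=n))
df = pd.DataFrame({"outcome": outcome, "group": group, "pretest": pretest})

# ANCOVA: compare group means on the DV while statistically controlling for the CV
fit = smf.ols("outcome ~ C(group) + pretest", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))   # ANOVA table with the covariate adjustment
```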

Uses of ANCOVA

ANCOVA can be used to increase statistical power (the ability to find a significant difference between groups when one exists) by reducing the within-group error variance.

ANCOVA can also be used to adjust for preexisting differences in nonequivalent (intact) groups. This controversial application aims at correcting for initial group differences (prior to group assignment) that exist on the DV among several intact groups. In this situation, participants cannot be made equal through random assignment, so CVs are used to adjust scores and make participants more similar than without the CV. However, even with the use of covariates, there are no statistical techniques that can equate unequal groups. Furthermore, the CV may be so intimately related to the IV that removing the variance on the DV associated with the CV would remove considerable variance on the DV, rendering the results meaningless.

Assumptions of ANCOVA

There are five assumptions that underlie the use of ANCOVA and affect interpretation of the results:

  1. Normality of Residuals. The residuals (error terms) should be normally distributed.
  2. Homogeneity of Variances. The error variances should be equal for different treatment classes.
  3. Homogeneity of Regression Slopes. The slopes of the different regression lines should be equal.
  4. Linearity of Regression. The regression relationship between the dependent variable and concomitant variables must be linear.
  5. Independence of Error Terms. The error terms should be uncorrelated.

Conducting an ANCOVA

  • Test Multicollinearity. If a CV is highly related to another CV (at a correlation of 0.5 or more), then it will not adjust the DV over and above the other CV. One or the other should be removed, since they are statistically redundant.
  • Test the Homogeneity of Variance Assumption. This is most important after adjustments have been made; if the variances are homogeneous before adjustment, they are likely to remain so afterward.
  • Test the Homogeneity of Regression Slopes Assumption. To see if the CV significantly interacts with the IV, run an ANCOVA model including both the IV and the CVxIV interaction term. If the CVxIV interaction is significant, ANCOVA should not be performed. Instead, consider using a moderated regression analysis, treating the CV and its interaction as another IV. Alternatively, one could use mediation analyses to determine if the CV accounts for the IV’s effect on the DV.
  • Run ANCOVA Analysis. If the CVxIV interaction is not significant, rerun the ANCOVA without the CVxIV interaction term. In this analysis, you need to use the adjusted means and adjusted MSerror. The adjusted means refer to the group means after controlling for the influence of the CV on the DV.
  • Follow-up Analyses. If there was a significant main effect, there is a significant difference between the levels of one IV, ignoring all other factors. To find exactly which levels differ significantly from one another, one can use the same follow-up tests as for the ANOVA. If there are two or more IVs, there may be a significant interaction, so that the effect of one IV on the DV changes depending on the level of another factor. One can investigate the simple main effects using the same methods as in a factorial ANOVA.

ANCOVA Model: Graph showing the regression results of an ANCOVA model example: Public school teacher’s salary (Y) in relation to state expenditure per pupil on public schools.

Comparing Nested Models

Multilevel (nested) models are appropriate for research designs where data for participants are organized at more than one level.

Learning Objectives

Outline how nested models allow us to examine multilevel data.

Key Takeaways

Key Points

  • Three types of nested models include the random intercepts model, the random slopes model, and the random intercept and slopes model.
  • Nested models are used under the assumptions of linearity, normality, homoscedasticity, and independence of observations.
  • The units of analysis in a nested model are usually individuals (at a lower level) who are nested within contextual/aggregate units (at a higher level).

Key Terms

  • nested model: statistical model of parameters that vary at more than one level
  • homoscedasticity: A property of a set of random variables where each variable has the same finite variance.
  • covariance: A measure of how much two random variables change together.

Multilevel models, or nested models, are statistical models of parameters that vary at more than one level. These models can be seen as generalizations of linear models (in particular, linear regression); although, they can also extend to non-linear models. Though not a new idea, they have been much more popular following the growth of computing power and the availability of software.

Multilevel models are particularly appropriate for research designs where data for participants are organized at more than one level (i.e., nested data). The units of analysis are usually individuals (at a lower level) who are nested within contextual/aggregate units (at a higher level). While the lowest level of data in multilevel models is usually an individual, repeated measurements of individuals may also be examined. As such, multilevel models provide an alternative type of analysis for univariate or multivariate analysis of repeated measures. Individual differences in growth curves may be examined. Furthermore, multilevel models can be used as an alternative to analysis of covariance (ANCOVA), where scores on the dependent variable are adjusted for covariates (i.e., individual differences) before testing treatment differences. Multilevel models are able to analyze these experiments without the assumption of homogeneity of regression slopes that is required by ANCOVA.
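
A minimal sketch of a random intercepts model (pupils nested within schools), using statsmodels' mixed-effects routine on simulated data (school counts and effect sizes are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n_schools, pupils = 20, 30
school = np.repeat(np.arange(n_schools), pupils)              # higher-level units
school_effect = rng.normal(scale=2.0, size=n_schools)[school] # varies by school
ses = rng.normal(size=school.size)                            # pupil-level predictor
score = 50 + 3 * ses + school_effect + rng.normal(scale=5, size=school.size)
df = pd.DataFrame({"score": score, "ses": ses, "school": school})

# Random intercepts: each school gets its own intercept; the ses slope is fixed
fit = smf.mixedlm("score ~ ses", data=df, groups=df["school"]).fit()
print(fit.summary())
```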

Types of Models

Before conducting a multilevel model analysis, a researcher must decide on several aspects, including which predictors are to be included in the analysis, if any. Second, the researcher must decide whether parameter values (i.e., the elements that will be estimated) will be fixed or random. Fixed parameters are composed of a constant over all the groups, whereas a random parameter has a different value for each of the groups. Additionally, the researcher must decide whether to employ a maximum likelihood estimation or a restricted maximum likelihood estimation type.

  • Random intercepts model. A random intercepts model is a model in which intercepts are allowed to vary; therefore, the scores on the dependent variable for each individual observation are predicted by the intercept that varies across groups. This model assumes that slopes are fixed (the same across different contexts). In addition, this model provides information about intraclass correlations, which are helpful in determining whether multilevel models are required in the first place.
  • Random slopes model. A random slopes model is a model in which slopes are allowed to vary; therefore, the slopes are different across groups. This model assumes that intercepts are fixed (the same across different contexts).
  • Random intercepts and slopes model. A model that includes both random intercepts and random slopes is likely the most realistic type of model; although, it is also the most complex. In this model, both intercepts and slopes are allowed to vary across groups, meaning that they are different in different contexts.

Assumptions

Multilevel models have the same assumptions as other major general linear models, but some of the assumptions are modified for the hierarchical nature of the design (i.e., nested data).

  • Linearity. The assumption of linearity states that there is a rectilinear (straight-line, as opposed to non-linear or U-shaped) relationship between variables.
  • Normality. The assumption of normality states that the error terms at every level of the model are normally distributed.
  • Homoscedasticity. The assumption of homoscedasticity, also known as homogeneity of variance, assumes equality of population variances.
  • Independence of observations. Independence is an assumption of general linear models, which states that cases are random samples from the population and that scores on the dependent variable are independent of each other.

Uses of Multilevel Models

Multilevel models have been used in education research or geographical research to estimate separately the variance between pupils within the same school and the variance between schools. In psychological applications, the multiple levels are items in an instrument, individuals, and families. In sociological applications, multilevel models are used to examine individuals embedded within regions or countries. In organizational psychology research, data from individuals must often be nested within teams or other functional units.


Nested Model: An example of a simple nested set.

Stepwise Regression

Stepwise regression is a method of regression modeling in which the choice of predictive variables is carried out by an automatic procedure.

Learning Objectives

Evaluate and criticize stepwise regression approaches that automatically choose predictive variables.

Key Takeaways

Key Points

  • Forward selection involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until none improves the model.
  • Backward elimination involves starting with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable that improves the model the most by being deleted, and repeating this process until no further improvement is possible.
  • Bidirectional elimination is a combination of forward selection and backward elimination, testing at each step for variables to be included or excluded.
  • One of the main issues with stepwise regression is that it searches a large space of possible models. Hence it is prone to overfitting the data.

Key Terms

  • Akaike information criterion: a measure of the relative quality of a statistical model, for a given set of data, that deals with the trade-off between the complexity of the model and the goodness of fit of the model
  • Bayesian information criterion: a criterion for model selection among a finite set of models that is based, in part, on the likelihood function
  • Bonferroni point: how significant the best spurious variable should be based on chance alone

Stepwise regression is a method of regression modeling in which the choice of predictive variables is carried out by an automatic procedure. Usually, this takes the form of a sequence of [latex]\text{F}[/latex]-tests; however, other techniques are possible, such as [latex]\text{t}[/latex]-tests, adjusted [latex]\text{R}[/latex]-square, Akaike information criterion, Bayesian information criterion, Mallows’s [latex]\text{C}_\text{p}[/latex], or false discovery rate. The frequent practice of fitting the final selected model, followed by reporting estimates and confidence intervals without adjusting them to take the model building process into account, has led to calls to stop using stepwise model building altogether — or to at least make sure model uncertainty is correctly reflected.


Stepwise Regression: This is an example of stepwise regression from engineering, where necessity and sufficiency are usually determined by [latex]\text{F}[/latex]-tests.

Main Approaches

  • Forward selection involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until none improves the model.
  • Backward elimination involves starting with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable (if any) that improves the model the most by being deleted, and repeating this process until no further improvement is possible.
  • Bidirectional elimination, a combination of the above, tests at each step for variables to be included or excluded.

Another approach is to use an algorithm that provides an automatic procedure for statistical model selection in cases where there is a large number of potential explanatory variables and no underlying theory on which to base the model selection. This is a variation on forward selection, in which a new variable is added at each stage in the process, and a test is made to check if some variables can be deleted without appreciably increasing the residual sum of squares (RSS).
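
A minimal sketch of forward selection, using adjusted [latex]\text{R}^2[/latex] as the model comparison criterion (this criterion is chosen only for illustration; [latex]\text{F}[/latex]-tests, AIC, or BIC could be used instead):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(11)
df = pd.DataFrame(rng.normal(size=(120, 5)), columns=[f"x{i}" for i in range(1, 6)])
y = 2 * df["x1"] - df["x3"] + rng.normal(size=120)

selected, remaining, best_score = [], list(df.columns), -np.inf
while remaining:
    # Try adding each remaining variable; keep the one that improves the model most
    scores = {v: sm.OLS(y, sm.add_constant(df[selected + [v]])).fit().rsquared_adj
              for v in remaining}
    best_var, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:
        break                      # no candidate improves the model: stop
    selected.append(best_var)
    remaining.remove(best_var)
    best_score = score

print(selected)                    # variables chosen by forward selection
```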

Selection Criterion

One of the main issues with stepwise regression is that it searches a large space of possible models. Hence it is prone to overfitting the data. In other words, stepwise regression will often fit much better in-sample than it does on new out-of-sample data. This problem can be mitigated if the criterion for adding (or deleting) a variable is stiff enough. The key line in the sand is at what can be thought of as the Bonferroni point: namely, how significant the best spurious variable should be based on chance alone. Unfortunately, this means that many variables which actually carry signal will not be included.

Model Accuracy

A way to test for errors in models created by stepwise regression is to not rely on the model’s [latex]\text{F}[/latex]-statistic, significance, or multiple [latex]\text{R}[/latex], but instead to assess the model against a set of data that was not used to create the model. This is often done by building a model based on a sample of the dataset available (e.g., 70%) and using the remaining 30% of the dataset to assess the accuracy of the model. Accuracy is often measured as the standard error between the predicted value and the actual value in the hold-out sample. This method is particularly valuable when data is collected in different settings.
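
A minimal sketch of this hold-out check with a 70/30 split on simulated data (the split fraction follows the example in the text; the error measure shown is root-mean-square prediction error in the hold-out sample):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
n = 200
X = rng.normal(size=(n, 4))
y = 1 + X @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(size=n)

# 70% of the data builds the model; the remaining 30% assesses its accuracy
idx = rng.permutation(n)
train, test = idx[: int(0.7 * n)], idx[int(0.7 * n):]
fit = sm.OLS(y[train], sm.add_constant(X[train])).fit()

pred = fit.predict(sm.add_constant(X[test]))
rmse = np.sqrt(np.mean((pred - y[test]) ** 2))   # error in the hold-out sample
print(rmse)
```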

Criticism

Stepwise regression procedures are used in data mining, but are controversial. Several points of criticism have been made:

  • The tests themselves are biased, since they are based on the same data.
  • When estimating the degrees of freedom, only the variables included in the final model are counted, even though a much larger number of candidate variables was examined; this causes the fit to appear better than it is when the [latex]\text{r}^2[/latex] value is adjusted for the number of degrees of freedom. It is important to consider how many degrees of freedom have been used in the entire model, not just count the number of independent variables in the resulting fit.
  • Models that are created may be too small compared to the real models in the data.

Checking the Model and Assumptions

There are a number of assumptions that must be made when using multiple regression models.

Learning Objectives

Paraphrase the assumptions made by multiple regression models of linearity, homoscedasticity, normality, multicollinearity and sample size.

Key Takeaways

Key Points

  • The assumptions made during multiple regression are similar to the assumptions that must be made during standard linear regression models.
  • The data in a multiple regression scatterplot should be fairly linear.
  • The different response variables should have the same variance in their errors, regardless of the values of the predictor variables (homoscedasticity).
  • The residuals (predicted value minus the actual value) should follow a normal curve.
  • Independent variables should not be overly correlated with one another (they should have a correlation coefficient less than 0.7).
  • There should be at least 10 to 20 times as many observations (cases, respondents) as there are independent variables.

Key Terms

  • Multicollinearity: Statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy.
  • homoscedasticity: A property of a set of random variables where each variable has the same finite variance.

When working with multiple regression models, a number of assumptions must be made. These assumptions are similar to those of standard linear regression models. The following are the major assumptions with regard to multiple regression models:

  • Linearity. When looking at a scatterplot of data, it is important to check for linearity between the dependent and independent variables. If the data does not appear as linear, but rather in a curve, it may be necessary to transform the data or use a different method of analysis. Fortunately, slight deviations from linearity will not greatly affect a multiple regression model.
  • Constant variance (aka homoscedasticity). Different response variables have the same variance in their errors, regardless of the values of the predictor variables. In practice, this assumption is invalid (i.e., the errors are heteroscedastic) if the response variables can vary over a wide scale. To check for heterogeneous error variance, or when a pattern of residuals violates the model assumption of homoscedasticity (error is equally variable around the “best-fitting line” for all values of [latex]\text{x}[/latex]), it is prudent to look for a “fanning effect” between the residual errors and the predicted values; that is, a systematic change in the absolute or squared residuals when plotted against the predicted values, so that the error is not evenly distributed across the regression line. Heteroscedasticity causes the distinguishable variances around the points to be averaged into a single variance that inaccurately represents all the variances of the line. In effect, residuals appear clustered for some ranges of predicted values and spread apart for others along the regression line, and the mean squared error for the model will be incorrect. (A residual-plot sketch follows this list.)
  • Normality. The residuals (predicted value minus the actual value) should follow a normal curve. Once again, this need not be exact, but it is a good idea to check for this using either a histogram or a normal probability plot.
  • Multicollinearity. Independent variables should not be overly correlated with one another (they should have a correlation coefficient less than 0.7).
  • Sample size. Most experts recommend that there should be at least 10 to 20 times as many observations (cases, respondents) as there are independent variables, otherwise the estimates of the regression line are probably unstable and unlikely to replicate if the study is repeated.
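
A minimal sketch of the residual check for homoscedasticity mentioned in the constant-variance assumption above: plot residuals against predicted values and look for a fanning pattern (data simulated so that the error grows with [latex]\text{x}[/latex]):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(13)
x = rng.uniform(0, 10, size=200)
y = 2 + 3 * x + rng.normal(scale=0.5 * x, size=200)   # error grows with x: heteroscedastic

fit = sm.OLS(y, sm.add_constant(x)).fit()

# A "fanning" pattern of residuals against predicted values signals heteroscedasticity
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="grey")
plt.xlabel("predicted values")
plt.ylabel("residuals")
plt.show()
```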

Linear Regression: Random data points and their linear regression.

Some Pitfalls: Estimability, Multicollinearity, and Extrapolation

Some problems with multiple regression include multicollinearity, variable selection, and improper extrapolation assumptions.

Learning Objectives

Examine how the improper choice of explanatory variables, the presence of multicollinearity between variables, and poor-quality extrapolation can negatively affect the results of a multiple linear regression.

Key Takeaways

Key Points

  • Multicollinearity between explanatory variables should always be checked using variance inflation factors and/or matrix correlation plots.
  • Despite the fact that automated stepwise procedures for fitting multiple regression were discredited years ago, they are still widely used and continue to produce overfitted models containing various spurious variables.
  • A key issue seldom considered in depth is that of choice of explanatory variables (i.e., if the data does not exist, it might be better to actually gather some).
  • Typically, the quality of a particular method of extrapolation is limited by the assumptions about the regression function made by the method.

Key Terms

  • collinearity: the condition of lying in the same straight line
  • spurious variable: a mathematical relationship in which two events or variables have no direct causal connection, yet it may be wrongly inferred that they do, due to either coincidence or the presence of a certain third, unseen factor (referred to as a “confounding factor” or “lurking variable”)
  • Multicollinearity: a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, so that the coefficient estimates may change erratically in response to small changes in the model or data

Until recently, any review of literature on multiple linear regression would tend to focus on inadequate checking of diagnostics because, for years, linear regression was used inappropriately for data that were really not suitable for it. The advent of generalized linear modelling has reduced such inappropriate use.

A key issue seldom considered in depth is that of choice of explanatory variables. There are several examples of fairly silly proxy variables in research – for example, using habitat variables to “describe” badger densities. Sometimes, if the data does not exist, it might be better to actually gather some – in the badger case, number of road kills would have been a much better measure. In a study on factors affecting unfriendliness/aggression in pet dogs, the fact that their chosen explanatory variables explained a mere 7% of the variability should have prompted the authors to consider other variables, such as the behavioral characteristics of the owners.

In addition, multicollinearity between explanatory variables should always be checked using variance inflation factors and/or matrix correlation plots. Although it may not be a problem if one is (genuinely) only interested in a predictive equation, it is crucial if one is trying to understand mechanisms. Independence of observations is another very important assumption. While it is true that non-independence can now be modeled using a random factor in a mixed effects model, it still cannot be ignored.
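
A minimal sketch of a variance inflation factor (VIF) check with statsmodels (predictors simulated so that two of them are nearly collinear; the common rule of thumb that flags VIF values above roughly 5-10 is an assumption, not a universal threshold):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(14)
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = df["x1"] + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
df["x3"] = rng.normal(size=100)

exog = sm.add_constant(df)
for i, name in enumerate(exog.columns):
    if name != "const":
        # Large VIFs for x1 and x2 flag the collinearity; x3 stays near 1
        print(name, variance_inflation_factor(exog.values, i))
```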


Matrix Correlation Plot: This figure shows a very nice scatterplot matrix, with histograms, kernel density overlays, absolute correlations, and significance asterisks (0.05, 0.01, 0.001).

Perhaps the most important issue to consider is that of variable selection and model simplification. Despite the fact that automated stepwise procedures for fitting multiple regression were discredited years ago, they are still widely used and continue to produce overfitted models containing various spurious variables. As with collinearity, this is less important if one is only interested in a predictive model – but even when researchers say they are only interested in prediction, we find they are usually just as interested in the relative importance of the different explanatory variables.

Quality of Extrapolation

Typically, the quality of a particular method of extrapolation is limited by the assumptions about the regression function made by the method. If the method assumes the data are smooth, then a non-smooth regression function will be poorly extrapolated.

Even for proper assumptions about the function, the extrapolation can diverge strongly from the regression function. This divergence is a specific property of extrapolation methods and is only circumvented when the functional forms assumed by the extrapolation method (inadvertently or intentionally due to additional information) accurately represent the nature of the function being extrapolated.