Checking Model Assumptions Using Graphs

Multiple regression methods using the model

[latex]\displaystyle\hat{y}=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_kx_k[/latex]

generally depend on the following four assumptions:

the residuals of the model are nearly normal,
the variability of the residuals is nearly constant,
the residuals are independent, and
each variable is linearly related to the outcome.

Diagnostic plots can be used to check each of these assumptions. We will consider the model from the Mario Kart auction data, and check whether there are any notable concerns:

[latex]\displaystyle\widehat{\text{price}}=36.05+5.18\times\text{cond_new}+1.12\times\text{stock_photo}+7.30\times\text{wheels}[/latex]

Normal probability plot. A normal probability plot of the residuals is shown in Figure 1. While the plot exhibits some minor irregularities, there are no outliers that might be cause for concern. In a normal probability plot for residuals, we tend to be most worried about residuals that appear to be outliers, since these indicate long tails in the distribution of residuals.

A scatter plot of data points. The data has a tight line of best fit, with a few outliers in the top right corner of the plot.

Figure 1. A normal probability plot of the residuals is helpful in identifying observations that might be outliers.

Absolute values of residuals against fitted values. A plot of the absolute value of the residuals against their corresponding fitted values [latex]\left(\displaystyle\hat{y}_i\right)[/latex] is shown in Figure 2.

This plot is helpful to check the condition that the variance of the residuals is approximately constant. We don’t see any obvious deviations from constant variance in this example.

The y axis of this plot is the absolute value of residuals, and the x axis is the fitted values.

Figure 2. Comparing the absolute value of the residuals against the fitted values [latex]\left(\displaystyle\hat{y}_i\right)[/latex] is helpful in identifying deviations from the constant variance assumption.

Residuals in order of their data collection. A plot of the residuals in the order their corresponding auctions were observed is shown in Figure 3. Such a plot is helpful in identifying any connection between cases that are close to one another, e.g. we could look for declining prices over time or if there was a time of the day when auctions tended to fetch a higher price. Here we see no structure that indicates a problem.^[1]

A scatter plot. The y axis is the residuals, and the x axis is the order of collection.

Figure 3. Plotting residuals in the order that their corresponding observations were collected helps identify connections between successive observations. If it seems that consecutive observations tend to be close to each other, this indicates the independence assumption of the observations would fail.

Residuals against each predictor variable. We consider a plot of the residuals against the cond_new variable, the residuals against the stock photo variable, and the residuals against the wheels variable. These plots are shown in Figure 4. For the two-level condition variable, we are guaranteed not to see any remaining trend, and instead we are checking that the variability doesn’t ﬂuctuate across groups, which it does not. However, looking at the stock photo variable, we find that there is some difference in the variability of the residuals in the two groups. Additionally, when we consider the residuals against the wheels variable, we see some possible structure. There appears to be curvature in the residuals, indicating the relationship is probably not linear.

Plots comparing the residuals to the condition, photo type, and number of wheels.

Figure 4. For the condition and stock photo variables, we check for differences in the distribution shape or variability of the residuals. In the case of the stock photos variable, we see a little less variability in the unique photo group than the stock photo group. For numerical predictors, we also check for trends or other structure. We see some slight bowing in the residuals against the wheels variable in the bottom plot.

It is necessary to summarize diagnostics for any model fit. If the diagnostics support the model assumptions, this would improve credibility in the findings. If the diagnostic assessment shows remaining underlying structure in the residuals, we should try to adjust the model to account for that structure. If we are unable to do so, we may still report the model but must also note its shortcomings. In the case of the auction data, we report that there appears to be non-constant variance in the stock photo variable and that there may be a nonlinear relationship between the total price and the number of wheels included for an auction. This information would be important to buyers and sellers who may review the analysis, and omitting this information could be a setback to the very people who the model might assist.

Be Wary

“All models are wrong, but some are useful”

—George E.P. Box

The truth is that no model is perfect. However, even imperfect models can be useful. Reporting a ﬂawed model can be reasonable so long as we are clear and report the model’s shortcomings.

Caution: Don’t report results when assumptions are grossly violated

While there is a little leeway in model assumptions, don’t go too far. If model assumptions are very clearly violated, consider a new model, even if it means learning more statistical methods or hiring someone who can help.

TIP: Confidence intervals in multiple regression

Confidence intervals for coefficients in multiple regression can be computed using the same formula as in the single predictor model:

[latex]\displaystyle{b}_i\pm{t}^*_{df}SE_{b_i}[/latex]

where t*_df is the appropriate t-value corresponding to the confidence level and model degrees of freedom, df = n − k − 1.

An especially rigorous check would use time series methods. For instance, we could check whether consecutive residuals are correlated. Doing so with these residuals yields no statistically significant correlations. ↵

Module 14: Multiple and Logistic Regression

Be Wary

Caution: Don’t report results when assumptions are grossly violated

TIP: Confidence intervals in multiple regression

Candela Citations