Solutions to Selected Exercises

Introduction to Multiple Regression

Exercise 1: Baby weights, Part I

  1. [latex]\widehat{\text{baby_weight}}=123.05-8.94\times\text{smoke}[/latex]
  2. The estimated birth weight of babies born to smoking mothers is 8.94 ounces lower than that of babies born to non-smoking mothers. Smoker: 123.05 − 8.94 × 1 = 114.11 ounces. Non-smoker: 123.05 − 8.94 × 0 = 123.05 ounces.
  3. H0: β1 = 0.
    HA: β1 ≠ 0.
    T = −8.65, and the p-value is approximately 0.
    Since the p-value is very small, we reject H0. The data provide strong evidence that the true slope parameter is different from 0 and that there is an association between birth weight and smoking. Furthermore, having rejected H0, we can conclude that smoking is associated with lower birth weights.
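The arithmetic in parts 2 and 3 can be checked with a short sketch. The function name `predict_baby_weight` is an illustrative choice, and the p-value uses a standard normal approximation, which is an assumption: the exact t distribution depends on the sample size, which is not stated in this solution.

```python
import math

# Coefficients from the fitted model in part 1.
INTERCEPT = 123.05
SLOPE_SMOKE = -8.94


def predict_baby_weight(smoke: int) -> float:
    """Predicted birth weight in ounces; smoke is 1 for a smoker, 0 otherwise."""
    return INTERCEPT + SLOPE_SMOKE * smoke


smoker = predict_baby_weight(1)
non_smoker = predict_baby_weight(0)

# Two-sided p-value under a standard normal approximation (an assumption;
# the exact t distribution needs the sample size, which is not given here).
t_stat = -8.65
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"smoker: {smoker:.2f} oz, non-smoker: {non_smoker:.2f} oz")
print(f"approximate two-sided p-value: {p_value:.1e}")
```

With a test statistic this far from 0, the approximate p-value is far below any conventional significance level, which matches the conclusion above.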

Exercise 3: Baby weights, Part III

  1. [latex]\widehat{\text{baby_weight}}=-80.41+0.44\times\text{gestation}-3.33\times\text{parity}-0.01\times\text{age}+1.15\times\text{height}+0.05\times\text{weight}-8.40\times\text{smoke}[/latex].
  2. [latex]\beta_{\text{gestation}}[/latex]: The model predicts a 0.44 ounce increase in the birth weight of the baby for each additional day of pregnancy, all else held constant.
    [latex]\beta_{\text{age}}[/latex]: The model predicts a 0.01 ounce decrease in the birth weight of the baby for each additional year in mother’s age, all else held constant.
  3. Parity might be correlated with one of the other variables in the model, which complicates model estimation.
  4. [latex]\widehat{\text{baby_weight}}=120.58[/latex], so the residual is [latex]e=120-120.58=-0.58[/latex]. The model over-predicts this baby’s birth weight.
  5. [latex]R^2_{adj}=0.2468[/latex].
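The prediction and residual in part 4 can be reproduced with a minimal sketch. The predictor values in `baby` are assumptions, since this solution does not list them; they were chosen because they reproduce the stated prediction of 120.58. Consult the exercise statement for the actual values.

```python
# Coefficients from the fitted model in part 1.
INTERCEPT = -80.41
COEFS = {
    "gestation": 0.44,
    "parity": -3.33,
    "age": -0.01,
    "height": 1.15,
    "weight": 0.05,
    "smoke": -8.40,
}

# Assumed predictor values (not given in this solution); check the exercise.
baby = {"gestation": 284, "parity": 0, "age": 27,
        "height": 62, "weight": 100, "smoke": 0}
observed = 120  # observed birth weight in ounces, from part 4

# Prediction: intercept plus each coefficient times its predictor value.
predicted = INTERCEPT + sum(COEFS[name] * value for name, value in baby.items())

# Residual: observed minus predicted; a negative value means over-prediction.
residual = observed - predicted

print(f"predicted: {predicted:.2f} oz, residual: {residual:.2f} oz")
```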

Exercise 5: GPA

(a) (-0.32, 0.16). We are 95% confident that male students on average have GPAs 0.32 points lower to 0.16 points higher than females when controlling for the other variables in the model.

(b) Yes, since the p-value is larger than 0.05 in all cases (not including the intercept).

Model Selection

Exercise 7: Baby weights, Part IV

Remove age.

Exercise 9: Baby weights, Part V

Based on the p-value alone, either gestation or smoke should be added to the model first. However, since the adjusted [latex]R^2[/latex] with gestation is higher, it would be preferable to add gestation in the first step of the forward-selection algorithm. (Other explanations are possible. For instance, it would be reasonable to only use the adjusted [latex]R^2[/latex].)
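The tie-breaking rule described above can be sketched as follows. The numbers in `candidates` are placeholders invented for illustration, not the values from the exercise's table, and `first_forward_step` is a hypothetical helper name.

```python
# Placeholder single-predictor results (NOT the exercise's actual table).
candidates = {
    "gestation": {"p_value": 1e-16, "adj_r2": 0.17},
    "smoke": {"p_value": 1e-16, "adj_r2": 0.06},
    "height": {"p_value": 1e-11, "adj_r2": 0.04},
    "weight": {"p_value": 1e-7, "adj_r2": 0.02},
    "parity": {"p_value": 0.10, "adj_r2": 0.001},
    "age": {"p_value": 0.24, "adj_r2": 0.0003},
}


def first_forward_step(results):
    """Pick the smallest p-value; break ties with the largest adjusted R^2."""
    return min(
        results,
        key=lambda name: (results[name]["p_value"], -results[name]["adj_r2"]),
    )


first = first_forward_step(candidates)
print(f"add first: {first}")
```

Sorting on adjusted [latex]R^2[/latex] alone would select the same variable here, which corresponds to the alternative explanation mentioned above.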

Exercise 11: Movie lovers, Part I

She should use p-value selection since she is interested in finding out about significant predictors, not just optimizing predictions.

Checking Model Assumptions Using Graphs

Exercise 13: Baby weights, Part V

Nearly normal residuals: The normal probability plot shows a nearly normal distribution of the residuals; however, there are some minor irregularities at the tails. With a data set so large, these would not be a concern.

Constant variability of residuals: The scatterplot of the residuals versus the fitted values does not show any overall structure. However, values that have very low or very high fitted values appear to also have somewhat larger outliers. In addition, the residuals do appear to have constant variability between the two parity and smoking status groups, though these items are relatively minor.

Independent residuals: The scatterplot of residuals versus the order of data collection shows a random scatter, suggesting that there are no apparent structures related to the order in which the data were collected.

Linear relationships between the response variable and numerical explanatory variables: The residuals vs. height and weight of mother are randomly distributed around 0. The residuals vs. length of gestation plot also does not show any clear or strong remaining structures, with the possible exception of very short or long gestations. The rest of the residuals do appear to be randomly distributed around 0.

All concerns raised here are relatively mild. There are some outliers, but there is so much data that the influence of such observations will be minor.

Introduction to Logistic Regression

Exercise 15: Possum classification, Part I

(a) There are a few potential outliers, e.g. on the left in the total length variable, but nothing that will be of serious concern in a data set this large.

(b) When coefficient estimates are sensitive to which variables are included in the model, this typically indicates that some variables are collinear. For example, a possum’s gender may be related to its head length, which would explain why the coefficient (and p-value) for sex_male changed when we removed the head length variable. Likewise, a possum’s skull width is likely to be related to its head length, probably even much more closely related than the head length was to gender.

Exercise 17: Possum classification, Part II

(a) The logistic model relating [latex]\hat{p}_i[/latex] to the predictors may be written as [latex]\log\left(\frac{\hat{p}_i}{1-\hat{p}_i}\right)=33.5095-1.4207\times\text{sex_male}_i-0.2787\times\text{skull_width}_i+0.5687\times\text{total_length}_i-1.8057\times\text{tail_length}_i[/latex]. Only total length has a positive association with a possum being from Victoria.

(b) [latex]\hat{p}=0.0062[/latex]. While the probability is very near zero, we have not run diagnostics on the model. We might also be a little skeptical that the model will remain accurate for a possum found in a US zoo. For example, perhaps the zoo selected a possum with specific characteristics but only looked in one region. On the other hand, it is encouraging that the possum was caught in the wild. (Answers regarding the reliability of the model probability will vary.)
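The step from the fitted log-odds in part (a) to the probability in part (b) can be sketched as below. The predictor values in `possum` are assumptions, since this solution does not list them; they were chosen because they reproduce the stated probability. Consult the exercise statement for the actual measurements.

```python
import math

# Coefficients from the logistic model in part (a).
INTERCEPT = 33.5095
COEFS = {
    "sex_male": -1.4207,
    "skull_width": -0.2787,
    "total_length": 0.5687,
    "tail_length": -1.8057,
}

# Assumed predictor values (not given in this solution); check the exercise.
possum = {"sex_male": 1, "skull_width": 63, "total_length": 83, "tail_length": 37}

# Linear predictor: the estimated log-odds of being from Victoria.
log_odds = INTERCEPT + sum(COEFS[name] * value for name, value in possum.items())

# Invert the logit transformation to recover the probability.
p_hat = 1 / (1 + math.exp(-log_odds))

print(f"log-odds: {log_odds:.4f}, p-hat: {p_hat:.4f}")
```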