17A InClass

Question 1

1) What factors do you think determine high school students’ science test scores?

The dataset that we will be using in this in-class activity is called “High School and Beyond” and contains high school student achievement scores on math, science, reading, writing, and social studies tests. The dataset covers 200 high school students and records 11 variables for each student: identification number, whether the student is male or female, race, socio-economic status, school type, program type, and scores from tests of reading, writing, math, science, and social studies. Descriptions of the variables are as follows:

Variable name	Definition
id	Identification number of the student
female	Gender of the student (0 = male, 1 = female)
race	Ethnic background of the student (1 = Hispanic, 2 = Asian, 3 = Black, 4 = White)
ses	Socio-economic status of the student (1 = low, 2 = medium, 3 = high)
schtyp	School type (1 = public, 2 = private)
prog	Program type (1 = general, 2 = academic preparatory, 3 = vocational/technical)
read	Score from test of reading
write	Score from test of writing
math	Score from test of math
science	Score from test of science
socst	Score from test of social studies

Questions 2–4: We are interested in answering the question, “Is there a relationship between high school students’ science scores and their math and reading scores?”

Question 2

2) Based on the question, what is the response variable? Identify the variable name from the dataset.

Question 3

3) What are the explanatory variables? Identify the variable names from the dataset.

Question 4

4) In simple linear regression, you have one response variable and one explanatory variable. Explain what the purpose of the simple linear regression model is.

Question 5

5) Using the following scatterplot of math and science scores, what do you notice about the relationship?

A scatterplot titled “Scatterplot of Math and Science Scores for High School Students.” The x-axis is labeled “math test score” and the y-axis is labeled “science test score.” Points with higher x-values also tend to have higher y-values, with moderate consistency.

Question 6

6) Using the scatterplot of reading and science scores, what do you notice about the relationship?

A scatterplot titled “Scatterplot of Reading and Science Scores for High School Students.” The x-axis is labeled “reading test score” and the y-axis is labeled “science test score.” Points with higher x-values also tend to have higher y-values, with moderate consistency.

A linear regression model with two or more explanatory variables is called a multiple linear regression model. Since there is more than one explanatory variable, the model is no longer a line. In fact, we can include 𝑝 explanatory variables in our model. The equation for the estimated model that uses 𝑝 variables is

𝑦̂=𝑎+𝑏₁∙𝑥₁+𝑏₂∙𝑥₂+⋯+𝑏_𝑝∙𝑥_𝑝

where 𝑏₁, 𝑏₂, …, 𝑏_𝑝 are the regression coefficients for explanatory variables 𝑥₁, 𝑥₂,…, 𝑥_𝑝, respectively. In multiple linear regression, 𝑏₁, 𝑏₂, …, 𝑏_𝑝 are called partial slopes.
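As an illustration of how the intercept and partial slopes in this equation can be estimated, here is a minimal sketch using least squares in NumPy. The data are synthetic (not from the High School and Beyond dataset), and the names `x1`, `x2`, and `predict` are made up for this example.

```python
import numpy as np

# Synthetic data, NOT from the High School and Beyond dataset.
# y is built exactly as 2 + 3*x1 - 1*x2, so the fit should recover those values.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 2.0 + 3.0 * x1 - 1.0 * x2

# Design matrix: a column of ones for the intercept a,
# then one column per explanatory variable.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates of (a, b1, b2).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef

def predict(x1_new, x2_new):
    """Estimated response: y-hat = a + b1*x1 + b2*x2."""
    return a + b1 * x1_new + b2 * x2_new
```

With 𝑝 explanatory variables, the same approach applies: the design matrix simply gains one column per variable.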

Question 7

7) Using the following results, write the multiple linear regression equation for predicting science test scores using the explanatory variables of math and reading scores. Round the estimates to two decimal places.

	Estimate
Intercept	11.61550
*math*	0.4172
*read*	0.36542

We can interpret the regression coefficient for each explanatory variable in the model in terms of its relationship with the response variable. The interpretation is very similar to what we have seen in simple linear regression models. However, since each coefficient is a partial slope, we have to hold all of the other explanatory variables constant in our interpretation. For example, for the following regression equation,

𝑦̂=𝑎+𝑏₁∙𝑥₁+𝑏₂∙𝑥₂+⋯+𝑏_𝑝∙𝑥_𝑝

the partial slope, 𝑏₁, represents the expected change in the response variable, 𝑦, for every one-unit increase in 𝑥₁, holding the other explanatory variables 𝑥₂, …, 𝑥_𝑝 constant.
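The idea of holding the other variables constant can be checked numerically. The sketch below uses hypothetical coefficients (chosen only for illustration, not estimated from any dataset) in a model with two explanatory variables.

```python
def predict(x1, x2, a=5.0, b1=0.5, b2=0.25):
    """Estimated response for a two-variable model.

    The default coefficients are hypothetical values for illustration only.
    """
    return a + b1 * x1 + b2 * x2

# Increase x1 by one unit while keeping x2 fixed at 60.
base = predict(40, 60)
bumped = predict(41, 60)

# The change in the prediction equals the partial slope b1 (0.5 here),
# no matter which fixed value of x2 is used.
change = bumped - base
```

Repeating the comparison with a different fixed value of `x2` gives the same change, which is what "holding the other variables constant" means for a model of this form.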

Question 8

8) What is the interpretation of the coefficient for the explanatory variable of math scores in the context of the dataset?

The coefficient of determination, 𝑅², is used to determine the percentage of variability in the response variable that is accounted for by the explanatory variables. In this activity, we will call this value the unadjusted 𝑅². In simple linear regression, we would interpret the 𝑅² value as the percentage of the variation in the response variable that can be explained by the linear relationship with the explanatory variable. For multiple linear regression, the interpretation is similar, but now the variation in the response variable is explained by the linear relationship with multiple explanatory variables.
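For readers who want to see how this value is computed, the following sketch uses the standard definition of the unadjusted coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares. The observed and fitted values are made up for illustration, and the function name `r_squared` is an assumption of this example.

```python
import numpy as np

def r_squared(y, y_hat):
    """Unadjusted R-squared: 1 - SSE / SST.

    Assumes y_hat are fitted values from a least-squares model
    that includes an intercept.
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)     # variation left unexplained
    sst = np.sum((y - y.mean()) ** 2)  # total variation in the response
    return 1.0 - sse / sst

# Made-up observed and fitted values, for illustration only.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_fit = [1.5, 1.5, 3.5, 3.5]
r2 = r_squared(y_obs, y_fit)  # approximately 0.8, i.e. about 80% explained
```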

Question 9

9) The unadjusted 𝑅² value for this model is 0.4782. Interpret the unadjusted value of 𝑅² for this model.

Question 10

10) The simple linear regression model with math alone has an 𝑅² value of 39.8%. The simple linear regression model with reading alone has an 𝑅² value of 39.7%. Explain why the total amount of variability explained by the model is not 39.8% + 39.7% = 79.5%.

We can assess whether or not it is reasonable to fit a linear regression model using residual plots, similar to simple linear regression. In multiple linear regression, the y-axis shows the residual values and the x-axis shows an explanatory variable or the fitted values. For a multiple linear regression model, you create a residual plot for each continuous explanatory variable, as well as one for the fitted values. We would expect to see the residual values randomly scattered across the x-values with no clear patterns (e.g., residual plots that display curvature violate the linearity condition). Residual plots in which the residuals increase or decrease in magnitude (distance from zero) across the x-values violate the constant variance condition. The plot of the residuals vs. fitted values accounts for all the variables in the model. Plots of the residuals vs. individual explanatory variables allow us to identify a potential source of a violation. The normality condition is beyond the scope of this course.
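The quantities that go into these residual plots can be computed as in the sketch below. It uses synthetic data (not the High School and Beyond scores) and NumPy least squares. The variable names and the `plot_pairs` dictionary are assumptions of this example, not part of the activity.

```python
import numpy as np

# Synthetic data, for illustration only.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
noise = np.array([0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.2])
y = 1.0 + 2.0 * x1 + 0.5 * x2 + noise

# Fit the multiple linear regression model by least squares.
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values and residuals (observed minus fitted).
fitted = X @ coef
residuals = y - fitted

# One residual plot per x-axis choice: the fitted values,
# then each continuous explanatory variable.
plot_pairs = {
    "fitted": (fitted, residuals),
    "x1": (x1, residuals),
    "x2": (x2, residuals),
}
```

Each pair could then be passed to a plotting function (for example, matplotlib's `plt.scatter`) to look for curvature or changing spread around zero.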

Question 11

11) Looking at the three residual plots that follow, is it reasonable to fit a linear regression model to these data? Explain.

A residual plot titled “Residuals vs. Fitted,” with “Fitted Value” on the x-axis and “Residual” on the y-axis. The points appear to have no pattern.

A residual plot titled “Residuals vs. Math Test Scores,” with “Math Test Scores” on the x-axis and “Residuals” on the y-axis. The points appear to have no pattern.

A residual plot titled “Residuals vs. Reading Test Scores,” with “Reading Test Scores” on the x-axis and “Residual” on the y-axis. The points appear to have no pattern.

Module 17