Forming Connections in 6.A: Exploring Lines of Best Fit

Objectives for this activity

During this activity you will:

  • Identify the explanatory and response variables given the context of a study.
  • Decide when linear regression is appropriate and when it is not appropriate.
  • Use data analysis tools to generate appropriate scatterplots and the line of best fit.
  • Use data analysis tools to identify the equation of the line of best fit and the correlation coefficient r.

Straight Talk About Lines


In the What to Know page for this section, you learned definitions of the explanatory and response variables in a bivariate dataset and developed an understanding of when a linear relationship may exist between them. You should have enough background now to identify scenarios of bivariate data for which a linear regression analysis might be appropriate and to calculate and write the equation of a line of best fit. We’ll continue to extend that knowledge in this activity as we practice these skills and learn about a measure of strength in a linear relationship: the correlation coefficient, [latex]r[/latex]. Along the way, you’ll deepen your understanding of the concept of a line of best fit and the method of linear regression analysis on a given dataset.

Explanatory and Response Variables

During the previous What to Know, you were asked to look over your notes from [WTK 5A] and write down three different examples where you had listed an explanatory variable that could be used to predict a response variable. Both of those variables should have been quantitative for the purposes of this activity. You were asked to identify one set with a positive association, one with no (or almost no) association, and one with a negative association.

Please retrieve those examples from your notes now and discuss them with a partner to answer Question 1. For added interest, “test” your partner to see if they can identify the explanatory and response variables in your scenario.

Question 1

In the preview assignment, you were asked to think about a few scenarios in which an explanatory variable could be used to predict a response variable.

 

Part A: Share your favorite scenario with a partner and take turns identifying the explanatory and response variables.

 

Part B: Draw scatterplots describing the two scenarios, and then sketch the line of best fit for each scatterplot.

When analyzing bivariate data, it is important to first clearly identify the explanatory and response variables and plot the data to identify any visually obvious trends.

Let’s look at an example.

Guidance

[Intro: Form into groups of four to continue this activity. As you answer Questions 2 through 5, begin building a list of the steps involved in analyzing bivariate data. The first step is to clearly identify the explanatory and response variables, confirm that both are quantitative (a requirement for a least squares regression analysis), and obtain the data responsibly. Recall that your sampling methods should be random and should minimize bias as much as possible. The second step is to plot the data and visually assess any trends. What steps follow these when performing the analysis? Do any concerns arise as you follow the example below involving student test scores? Regroup after Question 5 to compare your group’s list of steps and concerns with those of other groups before moving further in the activity. ]

Linear Regression Analysis

George, a current student, got a 36 out of 50 on the first midterm (C-). He asked his instructor, “If I don’t change my study approach, how do you predict I will do on the final exam?”

One way to answer this question is to look at bivariate data on student scores from a previous class. The following dataset comes from a random sample of past students who did not seek out advice on study skills, additional tutoring, or other support between the midterm and the final exam. To protect their anonymity, only first names are shown.

Student First Name    Midterm Score (out of 50 points)    Final Exam Score (out of 100 points)
Joe                   42                                  64
Barak                 52                                  94
Hillary               44                                  87
Donald                25                                  46
Cher                  41                                  73
Katy                  39                                  73
Taylor                33                                  53
Miley                 40                                  77
Justin                35                                  60
Snoop                 31                                  62
Bruno                 37                                  71
Kanye                 49                                  95
Leonardo              38                                  70
Rosie                 45                                  80
Maya                  49                                  80
Tyra                  48                                  82
Selena                50                                  81

Using Technology in Analysis

Question 2

Identify the explanatory and response variables.

Go to the Linear Regression tool at https://dcmathpathways.shinyapps.io/LinearRegression/ and plot the data using the following inputs:

  • Under “Enter Data,” select “Enter Own.”
  • Name the X (explanatory) and Y (response) variables appropriately.
  • Copy and paste the data from DCMP_STAT_6A_Student_Scores [link this spreadsheet here] or enter the data in the table by hand. Make sure the explanatory variable is in the first column and the response variable is in the second column.
  • Under “Plot Options,” select “Regression Line.”
  • Click the “Submit Data” button.
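If you want to cross-check the line the tool draws, the sketch below computes a least squares line directly from the table of scores. It is a minimal illustration, not the tool's own code: the names `midterm`, `final`, and `least_squares_line` are made up for this example, and it assumes the midterm score is the explanatory variable. The slope is the sum of the products of the x and y deviations from their means divided by the sum of the squared x deviations, and the intercept places the line through the point of means.

```python
# Midterm (out of 50) and final exam (out of 100) scores from the table above.
midterm = [42, 52, 44, 25, 41, 39, 33, 40, 35, 31, 37, 49, 38, 45, 49, 48, 50]
final = [64, 94, 87, 46, 73, 73, 53, 77, 60, 62, 71, 95, 70, 80, 80, 82, 81]


def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of products of deviations, and sum of squared x deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    # The least squares line always passes through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = least_squares_line(midterm, final)
print(f"predicted final = {intercept:.2f} + {slope:.2f} * midterm")
```

Run it and compare the printed equation with the one the tool reports. Small differences in the last decimal place would come from rounding.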

 

Question 3

Do you think the line of best fit is a good model of the relationship between midterm and final exam score? Explain.

Question 4

Write the equation of the least squares regression line using appropriate notation.

 

Part A: Is the relationship positive or negative?

 

Part B: What is the value of r? Does this value indicate that the linear relationship between the two variables is strong, moderate, or weak?
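The value of r that the tool reports can also be computed by hand. The sketch below is one way to do so, assuming the same score table. The function name `correlation` is illustrative, and the code deliberately leaves the strong, moderate, or weak judgment to you, since the cutoffs depend on the guidelines your course uses.

```python
import math

# Midterm (out of 50) and final exam (out of 100) scores from the table above.
midterm = [42, 52, 44, 25, 41, 39, 33, 40, 35, 31, 37, 49, 38, 45, 49, 48, 50]
final = [64, 94, 87, 46, 73, 73, 53, 77, 60, 62, 71, 95, 70, 80, 80, 82, 81]


def correlation(x, y):
    """Return the correlation coefficient r for paired lists x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    # r is unitless and always falls between -1 and 1.
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(midterm, final)
direction = "positive" if r > 0 else "negative"
print(f"r = {r:.3f} ({direction} association)")
```

The sign of r matches the sign of the slope of the line of best fit, so Part A and Part B should agree with each other.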

Question 5

Do you think George should be nervous about the final exam?

Guidance

[Summary: How did you do with a list of steps for performing an LSR analysis? Generally, the steps can be listed as follows:
Step 1) Identify the explanatory and response variables, then gather data as needed.
Step 2) Plot the data on a scatterplot, placing the explanatory variable along the horizontal (x) axis and the response variable along the vertical (y) axis.

Step 3) Visually confirm that the data seem to follow a linear pattern.

Step 4) Calculate and sketch the line of best fit in the plot and obtain the correlation coefficient [latex]r[/latex]. Visually confirm that the line appears to pass through the data as closely as possible, minimizing how much the data points deviate from the line.

You may not have listed a Step 5, which we will cover later in [6E]: interpret the coefficients (correlation and determination), assess the model’s accuracy and fit, and make appropriate predictions.

Did you note any concerns about the data collection? Hopefully you discussed the need to make the data anonymous or de-identified since it involved student grades.

Continue to work in groups for the remainder of the activity. As you consider the opening question in Question 6, try not to spend too much time debating the issue. You might even take a moment to discuss the implications in Part A of answering both “yes” and “no.” ]

Line of Best Fit and the Correlation Coefficient

Question 6

Now, consider the following question: “Can steady driving speed be used to predict fuel efficiency?”

 

Part A: If you answered “yes,” do you think the relationship between driving speed and fuel efficiency would be positive or negative? If you answered “no,” explain.

 

Part B: Identify the explanatory and response variables.

Question 7

Go to the Linear Regression tool at https://dcmathpathways.shinyapps.io/LinearRegression/ and plot the data using the following inputs:

  • Under “Enter Data,” select “From Textbook.”
  • Under “Choose Dataset,” select “Fuel Efficiency and Speed.”

 

Part A: Is the relationship positive or negative?

 

Part B: Find the correlation coefficient. Does this value indicate that the linear relationship between the two variables is strong, moderate, or weak?

Question 8

Is a least squares regression line a reasonable model for the relationship between driving speed and fuel efficiency?

Guidance

[Wrap-up: Did you clearly state and justify your conclusions to answer Question 8? If you found that the LSR line is not a reasonable model for the relationship, clearly state why you believe this using both a visual analysis and the value of [latex]r[/latex]. In this case, all analysis indicates that a linear model would fail to make reasonable predictions for this dataset. There is no linear relationship. As you end the activity, take a look back at the objectives and point out the places where they appeared in the questions. ]
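A value of r near zero says only that there is no linear relationship. It does not rule out a relationship of another shape. The sketch below illustrates this with made-up numbers: the `speed` and `efficiency` lists are hypothetical values chosen to rise and then fall symmetrically, and they are not the “Fuel Efficiency and Speed” dataset from the tool. Even though efficiency is completely determined by speed in this invented example, r comes out essentially zero.

```python
import math

# Hypothetical values for illustration only; NOT the tool's dataset.
# Efficiency rises, peaks at a middle speed, then falls symmetrically.
speed = [10, 20, 30, 40, 50, 60, 70, 80, 90]
efficiency = [14, 21, 26, 29, 30, 29, 26, 21, 14]


def correlation(x, y):
    """Return the correlation coefficient r for paired lists x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(speed, efficiency)
# The positive and negative deviation products cancel, so r is about zero.
print(f"r = {r:.3f}")
```

This is why Question 8 asks for both a visual analysis and the value of r: the scatterplot shows the shape of the relationship, while r measures only how well a straight line describes it.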