{"id":3870,"date":"2022-03-15T23:22:51","date_gmt":"2022-03-15T23:22:51","guid":{"rendered":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/?post_type=chapter&#038;p=3870"},"modified":"2022-06-14T19:36:11","modified_gmt":"2022-06-14T19:36:11","slug":"forming-connections-in-6-d","status":"publish","type":"chapter","link":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/chapter\/forming-connections-in-6-d\/","title":{"raw":"Forming Connections in 6.D: Using Residual Plots with a Linear Regression Model","rendered":"Forming Connections in 6.D: Using Residual Plots with a Linear Regression Model"},"content":{"raw":"<div class=\"textbox learning-objectives\">\r\n<h3>Objectives for this activity<\/h3>\r\nDuring this activity you will:\r\n<ul>\r\n \t<li style=\"font-weight: 400;\" aria-level=\"1\">Construct and interpret a residual plot.<\/li>\r\n \t<li style=\"font-weight: 400;\" aria-level=\"1\">Informally assess the appropriateness of a linear regression model.<\/li>\r\n<\/ul>\r\n<\/div>\r\nIn the previous\u00a0<em>What to Know\u00a0<\/em>assignment for this activity, you spent quite a lot of time calculating and interpreting residuals, identifying common scenarios of datasets for which linear analysis would not be appropriate, and exploring the effect of outliers on\u00a0[latex]R^2[\/latex]. You did this work to prepare for this activity in which you'll construct a plot of the residuals for a dataset and use it to assess the appropriateness of a linear regression model. As you do so, you'll see that residual\u00a0plots can magnify potential issues with a linear model and that linear regression models may not be appropriate when we observe non-linear data trends or non-constant variance of residuals. 
You'll also see that outliers\u00a0should be investigated, as they affect\u00a0the strength of a model.\r\n\r\nThis activity will investigate whether the average income in an area is correlated with access to high-quality, nutritious foods. Let's begin.\r\n<h2>Model Adequacy and Residuals<\/h2>\r\n<img class=\"alignnone wp-image-1286\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202555\/Picture170-300x200.jpg\" alt=\"A large array of fruits, vegetables, and other fresh foods\" width=\"1318\" height=\"877\" \/>\r\n\r\nAs an introduction to this activity, read and answer Question 1 independently before sharing and discussing your answer with a partner.\r\n<div class=\"textbox key-takeaways\">\r\n<h3>question 1<\/h3>\r\nWhat factors might make it more difficult for people with modest incomes to access healthy foods (relative to individuals with higher incomes)?\r\n\r\n[reveal-answer q=\"924526\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"924526\"]What do <em>you\u00a0<\/em>think?[\/hidden-answer]\r\n\r\n<\/div>\r\n<div class=\"textbox tryit\">\r\n<h3>Guidance<\/h3>\r\n<span style=\"background-color: #e6daf7;\">[Intro: In the following Question, you'll perform some familiar tasks such as loading a dataset in the data analysis tool and creating a scatterplot with a line of best fit. Remember to closely examine the explanatory and response variables in order to fully understand the scenario before you begin your analysis. In Question 2 Part A, you'll draw in the residuals for each data point. You can perform this in the tool by clicking the Regression Option to \"Show Residuals on Plot.\" Analyzing the first 10 stores on the graph will help you transition to being able to read the Fitted Values &amp; Residual Analysis tab in the tool in order to answer the remainder of the question. 
Work together in pairs or groups to support one another as you learn this new skill.]<\/span>\r\n\r\n<\/div>\r\n<div class=\"textbox key-takeaways\">\r\n<h3>question 2<\/h3>\r\nThe following data were collected on the number of organic foods offered at 37 grocery stores in San Antonio, Texas in 2019. The number of organic foods offered at each store is plotted against the average income of the zip code in which each store is located. All stores are from the same grocery chain (same company).\r\n\r\nFind the dataset \u201cOrganic Foods\u201d in the DCMP Linear Regression tool at <a href=\"https:\/\/dcmathpathways.shinyapps.io\/LinearRegression\/\">https:\/\/dcmathpathways.shinyapps.io\/LinearRegression\/<\/a> and reproduce the following plot.\r\n\r\n<img class=\"alignnone wp-image-1287\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202559\/Picture171-300x117.png\" alt=\"A scatterplot showing &quot;Average Income in Zip Code ($)&quot; on the horizontal axis and &quot;Number of Organic Items Offered&quot; on the vertical axis. The horizontal axis is numbered in increments of 20,000 from 40,000 to 140,000. The vertical axis is labeled in increments of 20 from 0 to 100. There is a line of best fit whose equation is labeled as y = -14.7 + 0.000959x. The first ten points are located at approximately (37000, 4), (39000, 14), (41000, 15), (42000, 16), (47500, 27), (48000, 30), (49000, 36), (49000, 38), (50000, 44), (50000, 66).\" width=\"1277\" height=\"499\" \/>\r\n\r\nPart A: For the first 10 stores (going from left to right on the graph), draw the residuals on the plot. Among the first four stores, how many have positive residuals? How many have negative residuals? 
Which store (among the first four) has the highest magnitude residual?\r\n\r\n[reveal-answer q=\"174438\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"174438\"]Recall where data points having positive or negative residuals are located with respect to the line of best fit. Which of these is the furthest away from the line (highest magnitude residual)?[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart B: Go to the Fitted Values &amp; Residual Analysis tab in the DCMP Linear Regression tool. You\u2019ll see a \u201cResidual Plot\u201d that looks like the following plot. Compare the x-axis and y-axis labels of the residual plot with those from the previous regular scatterplot. What is the same? What is different?\r\n\r\n<img class=\"alignnone wp-image-1288\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202604\/Picture172-300x112.png\" alt=\"A residual plot with the x-axis labeled &quot;Average Income in Zip Code ($)&quot; and the y-axis labeled &quot;Residual&quot;. The x-axis is numbered in increments of 20,000 starting at 40,000 and continuing up to 140,000. The y-axis is numbered in increments of 20, starting at -20 and going up to 40. The first four points are at approximately (37000, -14), (39000, -9), (41000, -10), (42000, -9).\" width=\"1093\" height=\"408\" \/>\r\n\r\n[reveal-answer q=\"185617\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"185617\"]How are the axes labeled compared to the scatterplot? [\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart C: Compare the first four store values (again, reading the graphs from left to right) in the regular scatterplot and the residual plot. Are the y-values in the regular scatterplot positive or negative? Are the y-values in the residual plot positive or negative? 
Explain.\r\n\r\n[reveal-answer q=\"417105\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"417105\"]What is the residual plot measuring for each of the data points?[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart D: Based on what you saw in the previous plot, why might statisticians find residual plots to be useful?\r\n\r\n[reveal-answer q=\"723081\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"723081\"]What do <em>you<\/em> think?[\/hidden-answer]\r\n\r\n<\/div>\r\nResidual plots emphasize the residual values in our model. The following scatterplot and corresponding residual plot show that our linear regression model is appropriate: the residual values appear to be randomly scattered across the x-values, with no clear patterns or changes in variability. Recall that this was one of the conditions assumed for a linear regression to be appropriate for a dataset.\r\n\r\n<img class=\"wp-image-1289 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202612\/Picture173-300x119.png\" alt=\"Two graphs. On the left is a scatterplot with a line of best fit. On the right is a residual plot, where the dots look seemingly random in relation to the horizontal line in the middle.\" width=\"574\" height=\"228\" \/>\r\n\r\nCheck in with your group to make sure your understanding of how to read a residual plot compared to a scatterplot is clear at this point, then move on to Question 3.\r\n<div class=\"textbox key-takeaways\">\r\n<h3>question 3<\/h3>\r\nNow, let\u2019s explore how residual plots can help us assess the reasonableness of our linear model.\r\n\r\n&nbsp;\r\n\r\nPart A: For the following scatterplot, is a linear model appropriate? Does the residual plot help make assessing this clearer? Explain.\r\n\r\n<img class=\"alignnone wp-image-1290\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202620\/Picture174-300x119.png\" alt=\"Two graphs. 
On the left is a scatterplot where the points are clustered closely along the line of best fit. On the right is a residual plot where the points are in a somewhat sinusoidal pattern.\" width=\"1196\" height=\"475\" \/>\r\n\r\n[reveal-answer q=\"34123\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"34123\"]Does a pattern emerge more clearly in the residual plot than in the scatterplot?[\/hidden-answer]\r\n\r\nPart B: For the following scatterplot, is a linear model appropriate? Does the residual plot help make assessing this clearer? Explain.\r\n\r\n<img class=\"alignnone wp-image-1291\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202634\/Picture175-300x110.png\" alt=\"Two graphs. On the left is a scatterplot where the points are clustered closely along the line of best fit. On the right is a residual plot where the points are far closer to the line at low x-values and distributed more widely at higher x-values.\" width=\"1088\" height=\"399\" \/>\r\n\r\n[reveal-answer q=\"497552\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"497552\"]Recall some of the violations of assumptions you learned about in the <em>What to Know<\/em> assignment. Remember that the residual plot will exaggerate non-linear characteristics so that they are easier to see.[\/hidden-answer]\r\n\r\n<\/div>\r\nFor Questions 4 and 5, there is no single correct answer. Statistics involves making a choice and justifying it through proper reasoning. As you work in groups to answer these questions by finding support for your choices, you may ask yourself what would be clearly misleading to do in this situation. This can help you discover alternative reasonable choices.\r\n<div class=\"textbox key-takeaways\">\r\n<h3>question 4<\/h3>\r\nLet\u2019s return to the organic grocery items dataset. Real datasets rarely have \u201cperfect\u201d residual plots. 
Looking at the following residual plot, is there any reason to question if a linear model is appropriate? Explain.\r\n\r\n<img class=\"alignnone wp-image-1292\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202639\/Picture176-300x117.png\" alt=\"A scatterplot showing &quot;Average Income in Zip Code ($)&quot; on the horizontal axis and &quot;Number of Organic Items Offered&quot; on the vertical axis. The horizontal axis is number in increments of 20,000 from 40,000 to 140,000. The vertical axis is labeled in increments of 20 from 0 to 100. There is a line of best fit whose slope is labeled as y = -14.7 + 0.000959x.\" width=\"1237\" height=\"483\" \/>\r\n\r\n<img class=\"alignnone wp-image-1293\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202645\/Picture177-300x112.png\" alt=\"A residual plot with the x-axis labeled &quot;Average Income in Zip Code ($)&quot; and the y-axis labeled &quot;Residual&quot;. The x-axis is numbered in increments of 20,000 starting at 40,000 and continuing up to 140,000. The y-axis is numbered in increments of 20, starting at -20 and going up to 40. The points are arranged relatively randomly on the graph, although many points with high y-values are near the middle of the graph.\" width=\"1315\" height=\"490\" \/>\r\n\r\n[reveal-answer q=\"734002\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"734002\"]Think about what you know about reasonable expectations for a residual plot that is appropriate for linear modeling. You may refer to your notes from the <em>What to Know<\/em> assignment for clues that a residual plot indicates a linear model is inappropriate.[\/hidden-answer]\r\n\r\n<\/div>\r\n<div class=\"textbox key-takeaways\">\r\n<h3>question 5<\/h3>\r\n<span style=\"background-color: #ffff00;\">[Please note that the images below are incorrect. They are the same as the previous images but shouldn't be. 
There are<strong> slightly<\/strong> different plots given in DC Question 5 that include an outlier point at about (123, 58) on the scatterplot and about (123, -40) on the residual plot. The equation is also different in the plots that should be showing up here. Please snip them over from the DC in-class to this page.]<\/span>\r\n\r\nThe organic items dataset contained in the course web app is actually a slightly altered version of the original dataset. The original dataset is visualized in the following scatterplot, along with an accompanying residual plot:\r\n\r\n<img class=\"alignnone wp-image-1292\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202639\/Picture176-300x117.png\" alt=\"A scatterplot showing &quot;Average Income in Zip Code ($)&quot; on the horizontal axis and &quot;Number of Organic Items Offered&quot; on the vertical axis. The horizontal axis is number in increments of 20,000 from 40,000 to 140,000. The vertical axis is labeled in increments of 20 from 0 to 100. There is a line of best fit whose slope is labeled as y = -14.7 + 0.000959x.\" width=\"1237\" height=\"483\" \/>\r\n\r\n<img class=\"alignnone wp-image-1293\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202645\/Picture177-300x112.png\" alt=\"A residual plot with the x-axis labeled &quot;Average Income in Zip Code ($)&quot; and the y-axis labeled &quot;Residual&quot;. The x-axis is numbered in increments of 20,000 starting at 40,000 and continuing up to 140,000. The y-axis is numbered in increments of 20, starting at -20 and going up to 40. The points are arranged relatively randomly on the graph, although many points with high y-values are near the middle of the graph.\" width=\"1315\" height=\"490\" \/>\r\n\r\nPart A: This original dataset is identical to the dataset we saw earlier, except it contains one additional data point\u2014an outlier. 
Locate the outlier in both the scatterplot and the residual plot. How can you tell that this data point is an outlier?\r\n\r\n[reveal-answer q=\"856550\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"856550\"]Consider the magnitude of the outlier's residual. You may wish to compare this plot with the one you saw previously.[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart B: A statistician would like to remove this value from the dataset. They justify the choice by saying that removing the data value would increase the [latex]R^2[\/latex] value. Is this a proper justification? Explain.\r\n\r\n[reveal-answer q=\"454411\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"454411\"]What do <em>you\u00a0<\/em>think? Consider statistically sound and ethical practices.[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart C: The real reason this store was removed from the dataset is that it is a specialty boutique store and is therefore much smaller than the supermarkets that make up the rest of the data points. Is this a valid justification for removing the store? Explain.\r\n\r\n[reveal-answer q=\"507976\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"507976\"]Use what you know about representative samples to answer this question.[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\n<\/div>\r\nAnswer Question 6 independently before discussing your answers with your partner or group. Make sure you include specific reasoning in your answer. When discussing your answers, you may wish to consult with nearby groups to obtain the largest variety of viewpoints.\r\n<div class=\"textbox key-takeaways\">\r\n<h3>question 6<\/h3>\r\nImagine that an investigative journalist finds that this supermarket actually uses a model to estimate how many organic items it should put on shelves purely based on neighborhood income. 
The model calls for fewer organic items in low-income areas.\r\n\r\nPart A: How is this story supported by our previous data?\r\n\r\n[reveal-answer q=\"361690\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"361690\"]Consider only the mathematical portion of the analysis in your answer to this question.[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart B: Are the company\u2019s practices unethical? Explain.\r\n\r\n[reveal-answer q=\"428918\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"428918\"]What do <em>you<\/em> think? No matter your answer, be sure to include clear, specific, and reasonable support.[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\n<\/div>\r\n<div class=\"textbox tryit\">\r\n<h3>Guidance<\/h3>\r\n<span style=\"background-color: #e6daf7;\">[Wrap-up: Hopefully you have a better idea of the subjective nature of interpreting statistical results after completing this (and the previous) activity. Understanding the mathematical implications of measures such as [latex]R^2[\/latex] and residuals ensures a statistically sound analysis, but when adopting policy changes, there is still further room for interpretation. 
The most important practice you can adopt when performing analysis to implement change is to support your conclusions clearly and thoroughly and to permit a variety of viewpoints to enter into the discussion.<\/span>\r\n\r\n<span style=\"background-color: #e6daf7;\">Take a moment to look back on the introductory paragraph to this activity to find which parts of the activity addressed which objectives and desired understanding.]<\/span>\r\n\r\n<\/div>\r\n&nbsp;","rendered":"<div class=\"textbox learning-objectives\">\n<h3>Objectives for this activity<\/h3>\n<p>During this activity you will:<\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\">Construct and interpret a residual plot.<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\">Informally assess the appropriateness of a linear regression model.<\/li>\n<\/ul>\n<\/div>\n<p>In the previous\u00a0<em>What to Know\u00a0<\/em>assignment for this activity, you spent quite a lot of time calculating and interpreting residuals, identifying common scenarios of datasets for which linear analysis would not be appropriate, and exploring the effect of outliers on\u00a0[latex]R^2[\/latex]. You did this work to prepare for this activity in which you&#8217;ll construct a plot of the residuals for a dataset and use it to assess the appropriateness of a linear regression model. As you do so, you&#8217;ll see that residual\u00a0plots can magnify potential issues with a linear model and that linear regression models may not be appropriate when we observe non-linear data trends or non-constant variance of residuals. You&#8217;ll also see that outliers\u00a0should be investigated, as they affect\u00a0the strength of a model.<\/p>\n<p>This activity will investigate whether the average income in an area is correlated with access to high-quality, nutritious foods. 
Let&#8217;s begin.<\/p>\n<h2>Model Adequacy and Residuals<\/h2>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1286\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202555\/Picture170-300x200.jpg\" alt=\"A large array of fruits, vegetables, and other fresh foods\" width=\"1318\" height=\"877\" \/><\/p>\n<p>As an introduction to this activity, read and answer Question 1 independently before sharing and discussing your answer with a partner.<\/p>\n<div class=\"textbox key-takeaways\">\n<h3>question 1<\/h3>\n<p>What factors might make it more difficult for people with modest incomes to access healthy foods (relative to individuals with higher incomes)?<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q924526\">Hint<\/span><\/p>\n<div id=\"q924526\" class=\"hidden-answer\" style=\"display: none\">What do <em>you\u00a0<\/em>think?<\/div>\n<\/div>\n<\/div>\n<div class=\"textbox tryit\">\n<h3>Guidance<\/h3>\n<p><span style=\"background-color: #e6daf7;\">[Intro: In the following Question, you&#8217;ll perform some familiar tasks such as loading a dataset in the data analysis tool and creating a scatterplot with a line of best fit. Remember to closely examine the explanatory and response variables in order to fully understand the scenario before you begin your analysis. In Question 2 Part A, you&#8217;ll draw in the residuals for each data point. You can perform this in the tool by clicking the Regression Option to &#8220;Show Residuals on Plot.&#8221; Analyzing the first 10 stores on the graph will help you transition to being able to read the Fitted Values &amp; Residual Analysis tab in the tool in order to answer the remainder of the question. 
Work together in pairs or groups to support one another as you learn this new skill.]<\/span><\/p>\n<\/div>\n<div class=\"textbox key-takeaways\">\n<h3>question 2<\/h3>\n<p>The following data were collected on the number of organic foods offered at 37 grocery stores in San Antonio, Texas in 2019. The number of organic foods offered at each store is plotted against the average income of the zip code in which each store is located. All stores are from the same grocery chain (same company).<\/p>\n<p>Find the dataset \u201cOrganic Foods\u201d in the DCMP Linear Regression tool at <a href=\"https:\/\/dcmathpathways.shinyapps.io\/LinearRegression\/\">https:\/\/dcmathpathways.shinyapps.io\/LinearRegression\/<\/a> and reproduce the following plot.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1287\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202559\/Picture171-300x117.png\" alt=\"A scatterplot showing &quot;Average Income in Zip Code ($)&quot; on the horizontal axis and &quot;Number of Organic Items Offered&quot; on the vertical axis. The horizontal axis is numbered in increments of 20,000 from 40,000 to 140,000. The vertical axis is labeled in increments of 20 from 0 to 100. There is a line of best fit whose equation is labeled as y = -14.7 + 0.000959x. The first ten points are located at approximately (37000, 4), (39000, 14), (41000, 15), (42000, 16), (47500, 27), (48000, 30), (49000, 36), (49000, 38), (50000, 44), (50000, 66).\" width=\"1277\" height=\"499\" \/><\/p>\n<p>Part A: For the first 10 stores (going from left to right on the graph), draw the residuals on the plot. Among the first four stores, how many have positive residuals? How many have negative residuals? 
Which store (among the first four) has the highest magnitude residual?<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q174438\">Hint<\/span><\/p>\n<div id=\"q174438\" class=\"hidden-answer\" style=\"display: none\">Recall where data points having positive or negative residuals are located with respect to the line of best fit. Which of these is the furthest away from the line (highest magnitude residual)?<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part B: Go to the Fitted Values &amp; Residual Analysis tab in the DCMP Linear Regression tool. You\u2019ll see a \u201cResidual Plot\u201d that looks like the following plot. Compare the x-axis and y-axis labels of the residual plot with those from the previous regular scatterplot. What is the same? What is different?<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1288\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202604\/Picture172-300x112.png\" alt=\"A residual plot with the x-axis labeled &quot;Average Income in Zip Code ($)&quot; and the y-axis labeled &quot;Residual&quot;. The x-axis is numbered in increments of 20,000 starting at 40,000 and continuing up to 140,000. The y-axis is numbered in increments of 20, starting at -20 and going up to 40. The first four points are at approximately (37000, -14), (39000, -9), (41000, -10), (42000, -9).\" width=\"1093\" height=\"408\" \/><\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q185617\">Hint<\/span><\/p>\n<div id=\"q185617\" class=\"hidden-answer\" style=\"display: none\">How are the axes labeled compared to the scatterplot? <\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part C: Compare the first four store values (again, reading the graphs from left to right) in the regular scatterplot and the residual plot. 
Are the y-values in the regular scatterplot positive or negative? Are the y-values in the residual plot positive or negative? Explain.<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q417105\">Hint<\/span><\/p>\n<div id=\"q417105\" class=\"hidden-answer\" style=\"display: none\">What is the residual plot measuring for each of the data points?<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part D: Based on what you saw in the previous plot, why might statisticians find residual plots to be useful?<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q723081\">Hint<\/span><\/p>\n<div id=\"q723081\" class=\"hidden-answer\" style=\"display: none\">What do <em>you<\/em> think?<\/div>\n<\/div>\n<\/div>\n<p>Residual plots emphasize the residual values in our model. The following scatterplot and corresponding residual plot show that our linear regression model is appropriate: the residual values appear to be randomly scattered across the x-values, with no clear patterns or changes in variability. Recall that this was one of the conditions assumed for a linear regression to be appropriate for a dataset.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1289 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202612\/Picture173-300x119.png\" alt=\"Two graphs. On the left is a scatterplot with a line of best fit. 
On the right is a residual plot, where the dots look seemingly random in relation to the horizontal line in the middle.\" width=\"574\" height=\"228\" \/><\/p>\n<p>Check in with your group to make sure your understanding of how to read a residual plot compared to a scatterplot is clear at this point, then move on to Question 3.<\/p>\n<div class=\"textbox key-takeaways\">\n<h3>question 3<\/h3>\n<p>Now, let\u2019s explore how residual plots can help us assess the reasonableness of our linear model.<\/p>\n<p>&nbsp;<\/p>\n<p>Part A: For the following scatterplot, is a linear model appropriate? Does the residual plot help make assessing this clearer? Explain.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1290\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202620\/Picture174-300x119.png\" alt=\"Two graphs. On the left is a scatterplot where the points are clustered closely along the line of best fit. On the right is a residual plot where the points are in a somewhat sinusoidal pattern.\" width=\"1196\" height=\"475\" \/><\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q34123\">Hint<\/span><\/p>\n<div id=\"q34123\" class=\"hidden-answer\" style=\"display: none\">Does a pattern emerge more clearly in the residual plot than in the scatterplot?<\/div>\n<\/div>\n<p>Part B: For the following scatterplot, is a linear model appropriate? Does the residual plot help make assessing this clearer? Explain.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1291\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202634\/Picture175-300x110.png\" alt=\"Two graphs. On the left is a scatterplot where the points are clustered closely along the line of best fit. 
On the right is a residual plot where the points are far closer to the line at low x-values and distributed more widely at higher x-values.\" width=\"1088\" height=\"399\" \/><\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q497552\">Hint<\/span><\/p>\n<div id=\"q497552\" class=\"hidden-answer\" style=\"display: none\">Recall some of the violations of assumptions you learned about in the <em>What to Know<\/em> assignment. Remember that the residual plot will exaggerate non-linear characteristics so that they are easier to see.<\/div>\n<\/div>\n<\/div>\n<p>For Questions 4 and 5, there is no single correct answer. Statistics involves making a choice and justifying it through proper reasoning. As you work in groups to answer these questions by finding support for your choices, you may ask yourself what would be clearly misleading to do in this situation. This can help you discover alternative reasonable choices.<\/p>\n<div class=\"textbox key-takeaways\">\n<h3>question 4<\/h3>\n<p>Let\u2019s return to the organic grocery items dataset. Real datasets rarely have \u201cperfect\u201d residual plots. Looking at the following residual plot, is there any reason to question if a linear model is appropriate? Explain.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1292\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202639\/Picture176-300x117.png\" alt=\"A scatterplot showing &quot;Average Income in Zip Code ($)&quot; on the horizontal axis and &quot;Number of Organic Items Offered&quot; on the vertical axis. The horizontal axis is number in increments of 20,000 from 40,000 to 140,000. The vertical axis is labeled in increments of 20 from 0 to 100. 
There is a line of best fit whose slope is labeled as y = -14.7 + 0.000959x.\" width=\"1237\" height=\"483\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1293\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202645\/Picture177-300x112.png\" alt=\"A residual plot with the x-axis labeled &quot;Average Income in Zip Code ($)&quot; and the y-axis labeled &quot;Residual&quot;. The x-axis is numbered in increments of 20,000 starting at 40,000 and continuing up to 140,000. The y-axis is numbered in increments of 20, starting at -20 and going up to 40. The points are arranged relatively randomly on the graph, although many points with high y-values are near the middle of the graph.\" width=\"1315\" height=\"490\" \/><\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q734002\">Hint<\/span><\/p>\n<div id=\"q734002\" class=\"hidden-answer\" style=\"display: none\">Think about what you know about reasonable expectations for a residual plot that is appropriate for linear modeling. You may refer to your notes from the <em>What to Know<\/em> assignment for clues that a residual plot indicates a linear model is inappropriate.<\/div>\n<\/div>\n<\/div>\n<div class=\"textbox key-takeaways\">\n<h3>question 5<\/h3>\n<p><span style=\"background-color: #ffff00;\">[Please note that the images below are incorrect. They are the same as the previous images but shouldn&#8217;t be. There are<strong> slightly<\/strong> different plots given in DC Question 5 that include an outlier point at about (123, 58) on the scatterplot and about (123, -40) on the residual plot. The equation is also different in the plots that should be showing up here. 
Please snip them over from the DC in-class to this page.]<\/span><\/p>\n<p>The organic items dataset contained in the course web app is actually a slightly altered version of the original dataset. The original dataset is visualized in the following scatterplot, along with an accompanying residual plot:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1292\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202639\/Picture176-300x117.png\" alt=\"A scatterplot showing &quot;Average Income in Zip Code ($)&quot; on the horizontal axis and &quot;Number of Organic Items Offered&quot; on the vertical axis. The horizontal axis is numbered in increments of 20,000 from 40,000 to 140,000. The vertical axis is labeled in increments of 20 from 0 to 100. There is a line of best fit whose equation is labeled as y = -14.7 + 0.000959x.\" width=\"1237\" height=\"483\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1293\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202645\/Picture177-300x112.png\" alt=\"A residual plot with the x-axis labeled &quot;Average Income in Zip Code ($)&quot; and the y-axis labeled &quot;Residual&quot;. The x-axis is numbered in increments of 20,000 starting at 40,000 and continuing up to 140,000. The y-axis is numbered in increments of 20, starting at -20 and going up to 40. The points are arranged relatively randomly on the graph, although many points with high y-values are near the middle of the graph.\" width=\"1315\" height=\"490\" \/><\/p>\n<p>Part A: This original dataset is identical to the dataset we saw earlier, except it contains one additional data point\u2014an outlier. Locate the outlier in both the scatterplot and the residual plot. 
How can you tell that this data point is an outlier?<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q856550\">Hint<\/span><\/p>\n<div id=\"q856550\" class=\"hidden-answer\" style=\"display: none\">Consider the magnitude of the outlier&#8217;s residual. You may wish to compare this plot with the one you saw previously.<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part B: A statistician would like to remove this value from the dataset. They justify the choice by saying that removing the data value would increase the [latex]R^2[\/latex] value. Is this a proper justification? Explain.<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q454411\">Hint<\/span><\/p>\n<div id=\"q454411\" class=\"hidden-answer\" style=\"display: none\">What do <em>you\u00a0<\/em>think? Consider statistically sound and ethical practices.<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part C: The real reason this store was removed from the dataset is that it is a specialty boutique store and, therefore, is much smaller than the supermarkets that make up the rest of the data points. Is this a valid justification for removing the store? Explain.<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q507976\">Hint<\/span><\/p>\n<div id=\"q507976\" class=\"hidden-answer\" style=\"display: none\">Use what you know about representative samples to answer this question.<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<\/div>\n<p>Answer Question 6 independently before discussing your answers with your partner or group. Make sure you include specific reasoning in your answer. 
When discussing your answers, you may wish to consult with nearby groups to obtain the largest variety of viewpoints.<\/p>\n<div class=\"textbox key-takeaways\">\n<h3>question 6<\/h3>\n<p>Imagine that an investigative journalist finds that this supermarket actually uses a model to estimate how many organic items it should put on shelves purely based on neighborhood income. The model calls for fewer organic items in low-income areas.<\/p>\n<p>Part A: How is this story supported by our previous data?<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q361690\">Hint<\/span><\/p>\n<div id=\"q361690\" class=\"hidden-answer\" style=\"display: none\">Consider only the mathematical portion of the analysis in your answer to this question.<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part B: Are the company\u2019s practices unethical? Explain.<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q428918\">Hint<\/span><\/p>\n<div id=\"q428918\" class=\"hidden-answer\" style=\"display: none\">What do <em>you<\/em> think? No matter your answer, be sure to include clear, specific, and reasonable support.<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<\/div>\n<div class=\"textbox tryit\">\n<h3>Guidance<\/h3>\n<p><span style=\"background-color: #e6daf7;\">[Wrap-up: Hopefully you have a better idea of the subjective nature of interpreting statistical results after completing this (and the previous) activity. Understanding the mathematical implications of measures such as [latex]R^2[\/latex] and residuals ensures a statistically sound analysis, but when adopting policy changes, there is still further room for interpretation. 
The most important practice you can adopt when performing analysis to implement change is to support your conclusions clearly and thoroughly and to permit a variety of viewpoints to enter into the discussion.<\/span><\/p>\n<p><span style=\"background-color: #e6daf7;\">Take a moment to look back on the introductory paragraph to this activity to find which parts of the activity addressed which objectives and desired understanding.]<\/span><\/p>\n<\/div>\n<p>&nbsp;<\/p>\n","protected":false},"author":428269,"menu_order":19,"template":"","meta":{"_candela_citation":"[]","CANDELA_OUTCOMES_GUID":"","pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-3870","chapter","type-chapter","status-publish","hentry"],"part":4241,"_links":{"self":[{"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/chapters\/3870","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/wp\/v2\/users\/428269"}],"version-history":[{"count":12,"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/chapters\/3870\/revisions"}],"predecessor-version":[{"id":4868,"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/chapters\/3870\/revisions\/4868"}],"part":[{"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/parts\/4241"}],"metadata":[{"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/chapters\/3870\/metadata\/"}],"wp:attachmen
t":[{"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/wp\/v2\/media?parent=3870"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/chapter-type?post=3870"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/wp\/v2\/contributor?post=3870"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/wp\/v2\/license?post=3870"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}