{"id":3868,"date":"2022-03-15T23:22:36","date_gmt":"2022-03-15T23:22:36","guid":{"rendered":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/?post_type=chapter&#038;p=3868"},"modified":"2022-06-03T04:49:08","modified_gmt":"2022-06-03T04:49:08","slug":"what-to-know-about-6-d","status":"publish","type":"chapter","link":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/chapter\/what-to-know-about-6-d\/","title":{"raw":"What to Know About 6.D: Using Residual Plots with a Linear Regression Model","rendered":"What to Know About 6.D: Using Residual Plots with a Linear Regression Model"},"content":{"raw":"<div class=\"textbox learning-objectives\">\r\n<h3>Learning Goals<\/h3>\r\nAt the end of this page, you should feel comfortable performing these skills:\r\n<ul>\r\n \t<li>Calculate and interpret residual errors.<\/li>\r\n \t<li>Identify violations of assumptions needed to perform linear regression.<\/li>\r\n \t<li>Discuss the effect of influential points on [latex]R^{2}[\/latex].<\/li>\r\n<\/ul>\r\n<\/div>\r\nIn the next in-class activity, you will need to calculate and interpret residuals, identify violations of the assumptions needed to perform linear regression, and discuss the effect of outliers on [latex]R^2[\/latex]. Let's prepare for that by learning the formula for calculating residuals, seeing how to interpret a residual error for an observed value, and then challenging assumptions about the appropriateness of linear regression from the perspective of residuals. We'll end this preparation assignment with a discussion on how outliers affect the coefficient of determination.\r\n<h2>Residual Error<\/h2>\r\nWe have used scatterplots of data and constructed lines of best fit to describe the relationship in bivariate data. 
You have learned about the correlation coefficient [latex]r[\/latex] and the coefficient of determination [latex]R^2[\/latex], which are tools we have for determining whether the line of best fit is a useful model and how well the line fits the data.\r\n<h3>Calculation and Interpretation<\/h3>\r\nAnother tool we have is the analysis of residuals, which you saw for the first time in In-Class Activity 6.A. When we fit a line to the data, one thing we are interested in is how similar the linear model\u2019s predictions are to the observed data\u2014in other words, we want to know how closely the model matches the data. The residual for a data point is the difference between the observed value of the response variable and the linear model\u2019s prediction.\r\n<p style=\"text-align: center;\">Residual = observed value \u2013 predicted value<\/p>\r\n<p style=\"text-align: center;\">Residual = [latex]y-\\hat y[\/latex]<\/p>\r\n\r\n<div class=\"textbox\"><strong>Vocabulary:<\/strong> The word \u201cresidual\u201d means \u201cleft over\u201d or \u201cremaining.\u201d One way to relate the term \u201cresidual\u201d to the concept above is to think of the residual as the quantity left over that can\u2019t be explained by the linear relationship between the response variable and the explanatory variable.<\/div>\r\nTo calculate the predicted value, input a value of the explanatory variable, [latex]x[\/latex], to get a predicted value of the response variable, [latex]\\hat y[\/latex]. 
For example, suppose you have the following equation:\r\n<p style=\"text-align: center;\">[latex]\\hat y=5+3.4x[\/latex].<\/p>\r\nYou can calculate the predicted value of the response variable for a value of the explanatory variable [latex]x=6[\/latex] in the following way:\r\n\r\n[latex]\\hat y=5+3.4\\cdot 6=5+20.4=25.4[\/latex]\r\n\r\nThus, when [latex]x=6[\/latex], the predicted value [latex]\\hat y[\/latex] is 25.4.\r\n\r\nSee the video below for a demonstration before attempting Question 1.\r\n<div class=\"textbox tryit\">\r\n<h3>Video Placement<\/h3>\r\n<span style=\"background-color: #e6daf7;\">[Worked example video: A video following the process above to calculate a residual error. It should preview questions similar to the ones in Question 1 below.]<\/span>\r\n\r\n<\/div>\r\nNow you try calculating and interpreting residuals in Question 1.\r\n<div class=\"textbox key-takeaways\">\r\n<h3>Question 1<\/h3>\r\nConsider the following scatterplot, with the equation of the line of best fit given in the upper left corner. This dataset has compiled information about 21 different animal species and recorded the average longevity (or lifespan) in years of each animal, along with the animal\u2019s average gestational period (how long a fetus must develop before it is born) in days. We will pay particular attention to the observation corresponding to the bear.\r\n\r\nThe equation of the line of best fit is:\r\n\r\n[latex]\\hat y=6.29+0.0449x[\/latex]\r\n\r\nThere is text on the scatterplot that reads:\r\n<p style=\"text-align: center;\">Animal: Bear<\/p>\r\n<p style=\"text-align: center;\">Gestation (days): 220<\/p>\r\n<p style=\"text-align: center;\">Longevity (years): 22<\/p>\r\n<img class=\"alignnone wp-image-1299\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202715\/Picture183-300x154.png\" alt=\"A scatterplot showing Gestational Period and Longevity for 21 Animals. 
The x-axis is labeled &quot;Gestation (days)&quot; and the y-axis is labeled &quot;Longevity (years).&quot; The regression line with an equation y = 6.29 + 0.0449x goes from the bottom left of the graph to the upper right. One of the points is labeled &quot;Animal: Bear, Gestation (days): 220, Longevity (years): 22.&quot;\" width=\"1308\" height=\"672\" \/>\r\n\r\nPart A: The bear\u2019s average gestational period is 220 days. What is the bear\u2019s predicted average longevity in years given by the line of best fit?\r\n\r\n[reveal-answer q=\"381952\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"381952\"]Use the equation of the line of best fit.[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart B: What is the bear\u2019s actual average longevity in years?\r\n\r\n[reveal-answer q=\"149908\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"149908\"]Use the data point to identify the observed value.[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart C: What is the residual for the observation corresponding to the bear?\r\n\r\n[reveal-answer q=\"662179\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"662179\"]See the definition of <em>residual <\/em>given in the text above.[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart D: Interpret the meaning of the residual by filling in the blanks:\r\n\r\nThe bear\u2019s actual longevity is _____ years <span style=\"text-decoration: underline;\">greater\/less<\/span> than predicted.\r\n\r\n[reveal-answer q=\"940313\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"940313\"]Compare the observed value to the predicted value.[\/hidden-answer]\r\n\r\n<\/div>\r\nIn the case of the bear, we saw that the residual for the observation was a positive number and that the data point was located above the line of best fit. 
What would happen to the sign of the residual if the data point were to lie below the line of best fit?\r\n<div class=\"textbox exercises\">\r\n<h3>Example<\/h3>\r\nAs you saw in the text and video demonstration above, use the formula given to calculate the residual error for an observed value of the response variable, where [latex]y[\/latex] represents the observed value and [latex]\\hat{y}[\/latex] represents the predicted value.\r\n<p style=\"text-align: center;\">[latex]y - \\hat{y}=\\text{residual}[\/latex]<\/p>\r\nCalculate the following residuals for the statistical model given in Question 1: [latex]\\hat{y} = 6.29 + 0.0449x[\/latex].\r\n<ol>\r\n \t<li>The observed data point lies at (260, 30). What is the residual for this observation? Does the data point lie above or below the line of best fit?<\/li>\r\n \t<li>The observed data point lies at (290, 8). What is the residual for this observation?\u00a0Does the data point lie above or below the line of best fit?<\/li>\r\n \t<li>The observed data point lies at (65, 9). What is the residual for this observation?\u00a0Does the data point lie above or below the line of best fit?<\/li>\r\n<\/ol>\r\n[reveal-answer q=\"687507\"]Show Answer[\/reveal-answer]\r\n[hidden-answer a=\"687507\"]\r\n<ol>\r\n \t<li>Using the statistical model\u00a0[latex]\\hat{y} = 6.29 + 0.0449x[\/latex], we find [latex]\\hat{y}=17.964[\/latex]. Then, [latex]y - \\hat{y}=30-17.964[\/latex]. The residual is 12.036. The data point lies above the line of best fit since the residual is positive. That is, the observed value of the response variable is greater than the predicted value.<\/li>\r\n \t<li>Using the statistical model\u00a0[latex]\\hat{y} = 6.29 + 0.0449x[\/latex], we find [latex]\\hat{y}=19.311[\/latex]. Then, [latex]y - \\hat{y}=8 - 19.311[\/latex]. The residual is -11.311. The data point lies below the line of best fit since the residual is negative. 
That is, the observed value of the response variable is less than the predicted value.<\/li>\r\n \t<li>Using the statistical model\u00a0[latex]\\hat{y} = 6.29 + 0.0449x[\/latex], we find [latex]\\hat{y}=9.2085[\/latex]. Then, [latex]y - \\hat{y}=9 - 9.2085[\/latex]. The residual is -0.2085. This value is very close to zero. The data point technically lies below the line of best fit since the residual is negative; however, the point is extremely close to the line. The observed value of the response variable is 0.2085 units less than the predicted value.<\/li>\r\n<\/ol>\r\n[\/hidden-answer]\r\n\r\n<\/div>\r\nNow that you've seen examples of residuals for data points above, below, and very close to the line of best fit, use what you know to answer Questions 2, 3, and 4 below.\r\n<div class=\"textbox key-takeaways\">\r\n<h3>Question 2<\/h3>\r\nFill in the blank: If a residual error is positive, then the observed value is ______ the predicted value.\r\n<ol>\r\n \t<li>a) Greater than<\/li>\r\n \t<li>b) Less than<\/li>\r\n \t<li>c) Equal to<\/li>\r\n<\/ol>\r\n[reveal-answer q=\"518819\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"518819\"]See the Example above for guidance.[\/hidden-answer]\r\n\r\n<\/div>\r\n<div class=\"textbox key-takeaways\">\r\n<h3>Question 3<\/h3>\r\nFill in the blank: If a residual is negative, then the observed value is ______ the predicted value.\r\n<ol>\r\n \t<li>a) Greater than<\/li>\r\n \t<li>b) Less than<\/li>\r\n \t<li>c) Equal to<\/li>\r\n<\/ol>\r\n[reveal-answer q=\"588669\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"588669\"]See the Example above for guidance.[\/hidden-answer]\r\n\r\n<\/div>\r\n<div class=\"textbox key-takeaways\">\r\n<h3>Question 4<\/h3>\r\nWhich feature on the following scatterplot represents the residuals?\r\n\r\n<img class=\"alignnone wp-image-1300\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202723\/Picture184-300x154.png\" alt=\"A scatterplot of the Gestational 
Period and Longevity for 21 Animals. It is labeled Gestation (days) on the x-axis and Longevity (years) on the y-axis. There are several points on the graph marked with red dots. There is also a blue line that goes diagonally across the graph, through the center of the cluster of red dots. Lastly, there are green vertical lines connecting the red dots to the blue line.\" width=\"1452\" height=\"746\" \/>\r\n<ol>\r\n \t<li>a) The blue line<\/li>\r\n \t<li>b) The vertical green lines<\/li>\r\n \t<li>c) The red dots<\/li>\r\n<\/ol>\r\n[reveal-answer q=\"40460\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"40460\"]See the Example above for guidance.[\/hidden-answer]\r\n\r\n<\/div>\r\n<h3>Necessary Assumptions<\/h3>\r\nExamining the residuals can give us useful information about whether a line of best fit is an appropriate choice to model the data in question.\r\n<div class=\"textbox tryit\">\r\n<h3>Video Placement<\/h3>\r\n<span style=\"background-color: #e6daf7;\">[Perspective Video: a 3-instructor video presenting and annotating examples of scatterplots in which the residuals indicate the data is appropriate for a linear model, then also showing plots inappropriate for linear analysis that violate the necessary conditions in the ways indicated below. This presentation should be more intuitive and visual than technical but it may include technical discussion presented in the text above. ]<\/span>\r\n\r\n<\/div>\r\nWhen a linear regression is appropriate, the value of the residuals will be randomly scattered around 0. 
That is, some residuals will be positive (observed value above the line) and some will be negative (observed value below the line), but we do not want to see some systematic pattern (e.g., all above in order and then all below).\r\n\r\nIn particular, we might worry about the appropriateness of the model if we notice the following:\r\n<ul>\r\n \t<li style=\"font-weight: 400;\" aria-level=\"1\">The trend in the scatterplot is non-linear, indicating that the relationship between the explanatory variable and the response variable is not modeled very well by a line. The residuals tend to have a pattern. The observations are above and below the line systematically.<\/li>\r\n \t<li style=\"font-weight: 400;\" aria-level=\"1\">The observed values are further and further away from the line of best fit for a portion of the data. That is, the errors are not consistent for all values of the explanatory variable. Often, we will see that the size of the residuals tends to increase or decrease as the value of the explanatory variable increases. When this happens, it can be hard to get a handle on the accuracy of the model because the standard deviation of the residuals is not constant over the values of the independent variable.<\/li>\r\n<\/ul>\r\nConsider the conditions required for using a line to model data (listed in the text and the video above) as you explore scenarios of scatterplots in which one of those conditions has been violated.\r\n<div class=\"textbox key-takeaways\">\r\n<h3>question 5<\/h3>\r\nFor each of the following examples, one of our necessary conditions has been violated. Indicate how you know a condition has been violated.\r\n\r\nPart A:\r\n\r\n<img class=\"alignnone wp-image-1301\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202729\/Picture185-300x178.png\" alt=\"A scatterplot with a line of best fit. For lower x-values, the points are close to the line. 
As x increases, the y-distribution of the points increases as well.\" width=\"1235\" height=\"733\" \/>\r\n\r\nHow do you know a condition has been violated?\r\n<ol>\r\n \t<li>a) The trend in the scatterplot is non-linear.<\/li>\r\n \t<li>b) The size of the residuals tends to increase or decrease as the value of the explanatory variable increases.<\/li>\r\n<\/ol>\r\n[reveal-answer q=\"403288\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"403288\"]Draw a few lines representing the residuals. What happens?[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart B:\r\n\r\n<img class=\"alignnone wp-image-1302\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202734\/Picture186-300x179.png\" alt=\"A scatterplot with a line of best fit. The points are clustered closely around the line, alternating being above and below the line at semi-regular intervals.\" width=\"1463\" height=\"873\" \/>\r\n\r\nHow do you know a condition has been violated?\r\n<ol>\r\n \t<li>a) The trend in the scatterplot is non-linear.<\/li>\r\n \t<li>b) The size of the residuals tends to increase or decrease as the value of the explanatory variable increases.<\/li>\r\n<\/ol>\r\n[reveal-answer q=\"590322\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"590322\"]Do you notice a particular pattern in the data?[\/hidden-answer]\r\n\r\n<\/div>\r\n<h3>Outliers<\/h3>\r\nWe also might worry about the appropriateness of the model if we notice an extreme observation that affects the value of the line. Recall from earlier activities that an outlier is an extreme observation that is far away from the rest of the data. In fitting a regression line, an outlier can also be an observation that does not fit the trend of the data as well. We call this type of outlier an <strong>influential point<\/strong>. This point drastically changes the equation of the line, consequently increasing the values of all of the residuals. 
An influential point appears to \u201cpull\u201d the line towards its value. We will also study how these points affect [latex]R^2[\/latex]. It is important to note that not all outliers affect the equation of the line, so we will need to investigate.\r\n<div class=\"textbox exercises\">\r\n<h3>Example<\/h3>\r\nRecall that an outlier is an extreme observation far away from the rest of the data. Observe the two scatterplots below. Both contain 19 identical data points but the second plot contains an influential point. Notice the effect the outlier has on the regression line.\r\n\r\n<img class=\"aligncenter size-full wp-image-4840\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/03\/03041446\/Influential-Point.jpg\" alt=\"two scatterplots are shown, the first with a line of best fit with y-intercept of 10, positive slope, and 20 data points tightly arranged about the line of best fit. The second shows a line of best fit with y-intercept of 17, 19 identical data points as the first and one outlier far above the line. The first line of best fit is superimposed on the second plot, showing that the line of best fit has been pulled upward toward the extreme outlier. \" width=\"1044\" height=\"1086\" \/>\r\n\r\nThe red line in the second plot is the original line of best fit. 
The extreme outlier appears to have pulled the line upward in the second plot.\r\n<ol>\r\n \t<li>What is the y-intercept in the first plot, without the outlier?<\/li>\r\n \t<li>What is the y-intercept in the second plot, with the extreme outlier?<\/li>\r\n<\/ol>\r\n[reveal-answer q=\"977299\"]Show Answer[\/reveal-answer]\r\n[hidden-answer a=\"977299\"]\r\n<ol>\r\n \t<li>The first y-intercept appears to be approximately 10.<\/li>\r\n \t<li>The second y-intercept, under the influence of the outlier appears to be approximately 17.[\/hidden-answer]<\/li>\r\n<\/ol>\r\nNote the differences in the characteristics of the two lines.\r\n<table style=\"border-collapse: collapse; width: 100%; height: 36px;\" border=\"1\">\r\n<tbody>\r\n<tr style=\"height: 12px;\">\r\n<td style=\"width: 25%; height: 12px;\"><\/td>\r\n<td style=\"width: 25%; height: 12px;\">Slope<\/td>\r\n<td style=\"width: 25%; height: 12px;\">[latex]r[\/latex]<\/td>\r\n<td style=\"width: 25%; height: 12px;\">[latex]R^{2}[\/latex]<\/td>\r\n<\/tr>\r\n<tr style=\"height: 12px;\">\r\n<td style=\"width: 25%; height: 12px;\">Original line without the outlier<\/td>\r\n<td style=\"width: 25%; height: 12px;\">[latex]3.84[\/latex]<\/td>\r\n<td style=\"width: 25%; height: 12px;\">[latex]0.90[\/latex]<\/td>\r\n<td style=\"width: 25%; height: 12px;\">[latex]81 \\%[\/latex]<\/td>\r\n<\/tr>\r\n<tr style=\"height: 12px;\">\r\n<td style=\"width: 25%; height: 12px;\">Second line with the outlier<\/td>\r\n<td style=\"width: 25%; height: 12px;\">[latex]3.84[\/latex]<\/td>\r\n<td style=\"width: 25%; height: 12px;\">[latex]0.64[\/latex]<\/td>\r\n<td style=\"width: 25%; height: 12px;\">[latex]40 \\%[\/latex]<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n&nbsp;\r\n\r\n<\/div>\r\nAs we noted above, not all outliers qualify as influential points. That is, not all outliers have a large effect on the slope of the regression line. 
Questions 6 and 7 will help illustrate this idea.\r\n<div class=\"textbox key-takeaways\">\r\n<h3>Question 6<\/h3>\r\nOpen the <em>DCMP Linear Regression<\/em> tool at <a href=\"https:\/\/dcmathpathways.shinyapps.io\/LinearRegression\/\">https:\/\/dcmathpathways.shinyapps.io\/LinearRegression\/<\/a>. Open the \u201cAnimal Longevity\u201d dataset. This is the same dataset that was used in Question 1. Make sure that gestation is the explanatory variable and longevity is the response variable.\r\n\r\n&nbsp;\r\n\r\nPart A: Find the point on the graph corresponding to the elephant by hovering over the points. Is this observation an outlier that does not fit the trend?\r\n<ol>\r\n \t<li>a) Yes<\/li>\r\n \t<li>b) No<\/li>\r\n<\/ol>\r\n[reveal-answer q=\"194072\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"194072\"]Where does the point lie in relation to the line of best fit?[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart B: What is [latex]R^2[\/latex] for this dataset?\r\n\r\n[reveal-answer q=\"778353\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"778353\"]See the table below the scatterplot.[\/hidden-answer]\r\n\r\nPart C: Select the box on the left side of the screen that says \u201cClick to Remove Points.\u201d Remove the point corresponding to the elephant by either clicking on the scatterplot or by highlighting the elephant row in the data table and deleting it. How does [latex]R^2[\/latex] change when you remove the elephant from the dataset?\r\n<ol>\r\n \t<li>a) [latex]R^2[\/latex] increases<\/li>\r\n \t<li>b) [latex]R^2[\/latex] decreases<\/li>\r\n \t<li>c) [latex]R^2[\/latex] stays the same<\/li>\r\n<\/ol>\r\n[reveal-answer q=\"219512\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"219512\"]Compare the value of [latex]R^2[\/latex] before and after clicking to remove the data point.[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart D: Why do you think this is? 
How does removing the elephant affect the relative amount of variability that is explained by the linear relationship?\r\n\r\n[reveal-answer q=\"80822\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"80822\"]What do <em>you<\/em> think? If the data point corresponding to the elephant is not an outlier, would removing it increase or decrease the variability explained by the linear relationship?[\/hidden-answer]\r\n\r\n<\/div>\r\n<div class=\"textbox key-takeaways\">\r\n<h3>Question 7<\/h3>\r\nNow, reload the \u201cAnimal Longevity\u201d dataset so that the data for the elephant are included once again. You may need to reload the webpage, then reload the dataset.\r\n\r\n&nbsp;\r\n\r\nPart A: Find the point on the graph corresponding to the hippopotamus. Is this observation an outlier?\r\n<ol>\r\n \t<li>a) Yes<\/li>\r\n \t<li>b) No<\/li>\r\n<\/ol>\r\n[reveal-answer q=\"142783\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"142783\"] Where does the point lie in relation to the line of best fit?[\/hidden-answer]\r\n\r\nPart B: What is [latex]R^2[\/latex] for this dataset?\r\n\r\n[reveal-answer q=\"359610\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"359610\"]You\u2019ve already answered this in the previous question. See the table below the scatterplot.[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart C: Select the box on the left side of screen that says \u201cClick to Remove Points.\u201d Remove the point corresponding to the hippopotamus by either clicking on the scatterplot or by highlighting the hippopotamus row in the data table and deleting it. 
How does [latex]R^2[\/latex] change when you remove the hippopotamus from the dataset?\r\n<ol>\r\n \t<li>a) [latex]R^2[\/latex] increases<\/li>\r\n \t<li>b) [latex]R^2[\/latex] decreases<\/li>\r\n \t<li>c) [latex]R^2[\/latex] stays the same<\/li>\r\n<\/ol>\r\n[reveal-answer q=\"546005\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"546005\"]Compare the value of [latex]R^2[\/latex] before and after clicking to remove the data point.[\/hidden-answer]\r\n\r\n&nbsp;\r\n\r\nPart D: Why do you think this is? How does removing the hippopotamus affect the relative amount of variability that is explained by the linear relationship?\r\n\r\n[reveal-answer q=\"874512\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"874512\"]What do <em>you\u00a0<\/em>think?[\/hidden-answer]\r\n\r\n<\/div>\r\nFinally, let's move our attention from the idea of influential points to consider again the shape of a plot with respect to a very large or very small value of\u00a0[latex]R^2[\/latex]. Recall the situations you explored above that violate conditions necessary to consider a linear model appropriate for data. Then answer Question 8.\r\n<div class=\"textbox key-takeaways\">\r\n<h3>question 8<\/h3>\r\nConsider this plot again, which we saw in a previous question. The scatterplot has [latex]R^2=98.2\\%[\/latex].\r\n\r\n<img class=\"alignnone wp-image-1302\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202734\/Picture186-300x179.png\" alt=\"A scatterplot with a line of best fit. 
The points are clustered closely around the line, alternating being above and below the line at semi-regular intervals.\" width=\"1463\" height=\"873\" \/>\r\n\r\nDetermine whether the following statement is true or false: If the scatterplot of two variables has a high [latex]R^2[\/latex] value, then a line is an appropriate model for the relationship between the variables.\r\n\r\n[reveal-answer q=\"803464\"]Hint[\/reveal-answer]\r\n[hidden-answer a=\"803464\"]What conditions must be satisfied in order for a linear regression model to be appropriate?[\/hidden-answer]\r\n\r\n<\/div>\r\n<h2>Summary<\/h2>\r\nIn this <em>What to Know\u00a0<\/em>page,\u00a0you calculated and interpreted residuals and examined assumptions and the effects of outliers on the appropriateness of a linear model for bivariate data. Let\u2019s summarize these three ideas by noting the questions in which they appeared.\r\n<ul>\r\n \t<li>In Questions 1, 2, 3, and 4, you calculated and interpreted residual errors.<\/li>\r\n \t<li>In Questions 5 and 8, you identified violations of assumptions needed to perform linear regression.<\/li>\r\n \t<li>In Questions 6 and 7, you discussed the effect of influential points on\u00a0[latex]R^{2}[\/latex].<\/li>\r\n<\/ul>\r\nUnderstanding the characteristics and processes of analyzing residuals is crucial to performing linear regression analysis on data. 
Let's move on to the activity to apply these skills.","rendered":"<div class=\"textbox learning-objectives\">\n<h3>Learning Goals<\/h3>\n<p>At the end of this page, you should feel comfortable performing these skills:<\/p>\n<ul>\n<li>Calculate and interpret residual errors.<\/li>\n<li>Identify violations of assumptions needed to perform linear regression.<\/li>\n<li>Discuss the effect of influential points on [latex]R^{2}[\/latex].<\/li>\n<\/ul>\n<\/div>\n<p>In the next in-class activity, you will need to calculate and interpret residuals, identify violations of the assumptions needed to perform linear regression, and discuss the effect of outliers on [latex]R^2[\/latex]. Let&#8217;s prepare for that by learning the formula for calculating residuals, seeing how to interpret a residual error for an observed value, and then challenging assumptions about the appropriateness of linear regression from the perspective of residuals. We&#8217;ll end this preparation assignment with a discussion on how outliers affect the coefficient of determination.<\/p>\n<h2>Residual Error<\/h2>\n<p>We have used scatterplots of data and constructed lines of best fit to describe the relationship in bivariate data. You have learned about the correlation coefficient [latex]r[\/latex] and the coefficient of determination [latex]R^2[\/latex], which are tools we have for determining whether the line of best fit is a useful model and how well the line fits the data.<\/p>\n<h3>Calculation and Interpretation<\/h3>\n<p>Another tool we have is the analysis of residuals, which you saw for the first time in In-Class Activity 6.A. When we fit a line to the data, one thing we are interested in is how similar the linear model\u2019s predictions are to the observed data\u2014in other words, we want to know how closely the model matches the data. 
The residual for a data point is the difference between the observed value of the response variable and the linear model\u2019s prediction.<\/p>\n<p style=\"text-align: center;\">Residual = observed value \u2013 predicted value<\/p>\n<p style=\"text-align: center;\">Residual = [latex]y-\\hat y[\/latex]<\/p>\n<div class=\"textbox\"><strong>Vocabulary:<\/strong> The word \u201cresidual\u201d means \u201cleft over\u201d or \u201cremaining.\u201d One way to relate the term \u201cresidual\u201d to the concept above is to think of the residual as the quantity left over that can\u2019t be explained by the linear relationship between the response variable and the explanatory variable.<\/div>\n<p>To calculate the predicted value, input a value of the explanatory variable, [latex]x[\/latex], to get a predicted value of the response variable, [latex]\\hat y[\/latex]. For example, suppose you have the following equation:<\/p>\n<p style=\"text-align: center;\">[latex]\\hat y=5+3.4x[\/latex].<\/p>\n<p>You can calculate the predicted value of the response variable for a value of the explanatory variable [latex]x=6[\/latex] in the following way:<\/p>\n<p>[latex]\\hat y=5+3.4\\cdot 6=5+20.4=25.4[\/latex]<\/p>\n<p>Thus, when [latex]x=6[\/latex], the predicted value [latex]\\hat y[\/latex] is 25.4.<\/p>\n<p>See the video below for a demonstration before attempting Question 1.<\/p>\n<div class=\"textbox tryit\">\n<h3>Video Placement<\/h3>\n<p><span style=\"background-color: #e6daf7;\">[Worked example video: A video following the process above to calculate a residual error. It should preview questions similar to the ones in Question 1 below.]<\/span><\/p>\n<\/div>\n<p>Now you try calculating and interpreting residuals in Question 1.<\/p>\n<div class=\"textbox key-takeaways\">\n<h3>Question 1<\/h3>\n<p>Consider the following scatterplot, with the equation of the line of best fit given in the upper left corner. 
This dataset has compiled information about 21 different animal species and recorded the average longevity (or lifespan) in years of each animal, along with the animal\u2019s average gestational period (how long a fetus must develop before it is born) in days. We will pay particular attention to the observation corresponding to the bear.<\/p>\n<p>The equation of the line of best fit is:<\/p>\n<p>[latex]\\hat y=6.29+0.0449x[\/latex]<\/p>\n<p>There is text on the scatterplot that reads:<\/p>\n<p style=\"text-align: center;\">Animal: Bear<\/p>\n<p style=\"text-align: center;\">Gestation (days): 220<\/p>\n<p style=\"text-align: center;\">Longevity (years): 22<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1299\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202715\/Picture183-300x154.png\" alt=\"A scatterplot showing Gestational Period and Longevity for 21 Animals. The x-axis is labeled &quot;Gestation (days)&quot; and the y-axis is labeled &quot;Longevity (years).&quot; The regression line with an equation y = 6.29 + 0.0449x goes from the bottom left of the graph to the upper right. One of the points is labeled &quot;Animal: Bear, Gestation (days): 220, Longevity (years): 22.&quot;\" width=\"1308\" height=\"672\" \/><\/p>\n<p>Part A: The bear\u2019s average gestational period is 220 days. 
What is the bear\u2019s predicted average longevity in years given by the line of best fit?<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q381952\">Hint<\/span><\/p>\n<div id=\"q381952\" class=\"hidden-answer\" style=\"display: none\">Use the equation of the line of best fit.<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part B: What is the bear\u2019s actual average longevity in years?<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q149908\">Hint<\/span><\/p>\n<div id=\"q149908\" class=\"hidden-answer\" style=\"display: none\">Use the data point to identify the observed value.<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part C: What is the residual for the observation corresponding to the bear?<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q662179\">Hint<\/span><\/p>\n<div id=\"q662179\" class=\"hidden-answer\" style=\"display: none\">See the definition of <em>residual <\/em>given in the text above.<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part D: Interpret the meaning of the residual by filling in the blanks:<\/p>\n<p>The bear\u2019s actual longevity is _____ years <span style=\"text-decoration: underline;\">greater\/less<\/span> than predicted.<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q940313\">Hint<\/span><\/p>\n<div id=\"q940313\" class=\"hidden-answer\" style=\"display: none\">Compare the observed value to the predicted value.<\/div>\n<\/div>\n<\/div>\n<p>In the case of the bear, we saw that the residual for the observation was a positive number and that the data point was located above the line of best fit. 
What would happen to the sign of the residual if the data point were to lie below the line of best fit?<\/p>\n<div class=\"textbox exercises\">\n<h3>Example<\/h3>\n<p>Use the formula from the text and video demonstration above to calculate the residual error for an observed value of the response variable, where [latex]y[\/latex] represents the observed value and [latex]\\hat{y}[\/latex] represents the predicted value.<\/p>\n<p style=\"text-align: center;\">[latex]y - \\hat{y}=\\text{residual}[\/latex]<\/p>\n<p>Calculate the following residuals for the statistical model given in Question 1: [latex]\\hat{y} = 6.29 + 0.0449x[\/latex].<\/p>\n<ol>\n<li>The observed data point lies at (260, 30). What is the residual for this observation? Does the data point lie above or below the line of best fit?<\/li>\n<li>The observed data point lies at (290, 8). What is the residual for this observation?\u00a0Does the data point lie above or below the line of best fit?<\/li>\n<li>The observed data point lies at (65, 9). What is the residual for this observation?\u00a0Does the data point lie above or below the line of best fit?<\/li>\n<\/ol>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q687507\">Show Answer<\/span><\/p>\n<div id=\"q687507\" class=\"hidden-answer\" style=\"display: none\">\n<ol>\n<li>Using the statistical model\u00a0[latex]\\hat{y} = 6.29 + 0.0449x[\/latex], we find [latex]\\hat{y}=6.29+0.0449(260)=17.964[\/latex]. Then, [latex]y - \\hat{y}=30-17.964=12.036[\/latex]. The residual is 12.036. The data point lies above the line of best fit since the residual is positive. That is, the observed value of the response variable is greater than the predicted value.<\/li>\n<li>Using the statistical model\u00a0[latex]\\hat{y} = 6.29 + 0.0449x[\/latex], we find [latex]\\hat{y}=6.29+0.0449(290)=19.311[\/latex]. Then, [latex]y - \\hat{y}=8 - 19.311=-11.311[\/latex]. The residual is -11.311. 
The data point lies below the line of best fit since the residual is negative. That is, the observed value of the response variable is less than the predicted value.<\/li>\n<li>Using the statistical model\u00a0[latex]\\hat{y} = 6.29 + 0.0449x[\/latex], we find [latex]\\hat{y}=6.29+0.0449(65)=9.2085[\/latex]. Then, [latex]y - \\hat{y}=9 - 9.2085=-0.2085[\/latex]. The residual is -0.2085. This value is very close to zero. The data point technically lies below the line of best fit since the residual is negative; however, the point is extremely close to the line. The observed value of the response variable is 0.2085 years less than the predicted value.<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n<p>Now that you've seen examples of residuals for data points above, below, and very close to the line of best fit, use what you know to answer Questions 2, 3, and 4 below.<\/p>\n<div class=\"textbox key-takeaways\">\n<h3>question 2<\/h3>\n<p>Fill in the blank: If a residual error is positive, then the observed value is ______ the predicted value.<\/p>\n<ol>\n<li>a) Greater than<\/li>\n<li>b) Less than<\/li>\n<li>c) Equal to<\/li>\n<\/ol>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q518819\">Hint<\/span><\/p>\n<div id=\"q518819\" class=\"hidden-answer\" style=\"display: none\">See the Example above for guidance.<\/div>\n<\/div>\n<\/div>\n<div class=\"textbox key-takeaways\">\n<h3>question 3<\/h3>\n<p>Fill in the blank: If a residual is negative, then the observed value is ______ the predicted value.<\/p>\n<ol>\n<li>a) Greater than<\/li>\n<li>b) Less than<\/li>\n<li>c) Equal to<\/li>\n<\/ol>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q588669\">Hint<\/span><\/p>\n<div id=\"q588669\" class=\"hidden-answer\" style=\"display: none\">See the Example above for guidance.<\/div>\n<\/div>\n<\/div>\n<div class=\"textbox 
key-takeaways\">\n<h3>question 4<\/h3>\n<p>Which feature on the following scatterplot represents the residuals?<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1300\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202723\/Picture184-300x154.png\" alt=\"A scatterplot of the Gestational Period and Longevity for 21 Animals. It is labeled Gestation (days) on the x-axis and Longevity (years) on the y-axis. There are several points on the graph marked with red dots. There is also a blue line that goes diagonally across the graph, through the center of the cluster of red dots. Lastly, there are green vertical lines connecting the red dots to the blue line.\" width=\"1452\" height=\"746\" \/><\/p>\n<ol>\n<li>a) The blue line<\/li>\n<li>b) The vertical green lines<\/li>\n<li>c) The red dots<\/li>\n<\/ol>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q40460\">Hint<\/span><\/p>\n<div id=\"q40460\" class=\"hidden-answer\" style=\"display: none\">See the Example above for guidance.<\/div>\n<\/div>\n<\/div>\n<h3>Necessary Assumptions<\/h3>\n<p>Examining the residuals can give us useful information about whether a line of best fit is an appropriate choice to model the data in question.<\/p>\n<div class=\"textbox tryit\">\n<h3>Video Placement<\/h3>\n<p><span style=\"background-color: #e6daf7;\">[Perspective Video: a 3-instructor video presenting and annotating examples of scatterplots in which the residuals indicate the data is appropriate for a linear model, then also showing plots inappropriate for linear analysis that violate the necessary conditions in the ways indicated below. This presentation should be more intuitive and visual than technical but it may include technical discussion presented in the text above. 
]<\/span><\/p>\n<\/div>\n<p>When a linear regression is appropriate, the residuals will be randomly scattered around 0. That is, some residuals will be positive (observed value above the line) and some will be negative (observed value below the line), but we do not want to see a systematic pattern (e.g., a run of points all above the line followed by a run of points all below it).<\/p>\n<p>In particular, we might worry about the appropriateness of the model if we notice the following:<\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\">The trend in the scatterplot is non-linear, indicating that the relationship between the explanatory variable and the response variable is not modeled very well by a line. The residuals tend to show a pattern: the observations fall above and below the line systematically rather than randomly.<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\">The observed values are farther and farther away from the line of best fit for a portion of the data. That is, the errors are not consistent for all values of the explanatory variable. Often, we will see that the size of the residuals tends to increase or decrease as the value of the explanatory variable increases. When this happens, it can be hard to judge the accuracy of the model because the standard deviation of the residuals is not constant over the values of the explanatory variable.<\/li>\n<\/ul>\n<p>Consider the conditions required for using a line to model data (listed in the text and the video above) as you explore scenarios of scatterplots in which one of those conditions has been violated.<\/p>\n<div class=\"textbox key-takeaways\">\n<h3>question 5<\/h3>\n<p>For each of the following examples, one of our necessary conditions has been violated. 
Indicate how you know a condition has been violated.<\/p>\n<p>Part A:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1301\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202729\/Picture185-300x178.png\" alt=\"A scatterplot with a line of best fit. For lower x-values, the points are close to the line. As x increases, the y-distribution of the points increases as well.\" width=\"1235\" height=\"733\" \/><\/p>\n<p>How do you know a condition has been violated?<\/p>\n<ol>\n<li>a) The trend in the scatterplot is non-linear.<\/li>\n<li>b) The size of the residuals tends to increase or decrease as the value of the explanatory variable increases.<\/li>\n<\/ol>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q403288\">Hint<\/span><\/p>\n<div id=\"q403288\" class=\"hidden-answer\" style=\"display: none\">Draw a few lines representing the residuals. What happens?<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part B:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1302\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202734\/Picture186-300x179.png\" alt=\"A scatterplot with a line of best fit. 
The points are clustered closely around the line, alternating being above and below the line at semi-regular intervals.\" width=\"1463\" height=\"873\" \/><\/p>\n<p>How do you know a condition has been violated?<\/p>\n<ol>\n<li>a) The trend in the scatterplot is non-linear.<\/li>\n<li>b) The size of the residuals tends to increase or decrease as the value of the explanatory variable increases.<\/li>\n<\/ol>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q590322\">Hint<\/span><\/p>\n<div id=\"q590322\" class=\"hidden-answer\" style=\"display: none\">Do you notice a particular pattern in the data?<\/div>\n<\/div>\n<\/div>\n<h3>Outliers<\/h3>\n<p>We also might worry about the appropriateness of the model if we notice an extreme observation that affects the equation of the line. Recall from earlier activities that an outlier is an extreme observation that is far away from the rest of the data. In fitting a regression line, an outlier can also be an observation that does not fit the trend of the rest of the data. When such an outlier substantially changes the equation of the line, we call it an <strong>influential point<\/strong>. Because the line shifts, the residuals of many of the other observations tend to grow in size. An influential point appears to \u201cpull\u201d the line towards itself. We will also study how these points affect [latex]R^2[\/latex]. It is important to note that not all outliers affect the equation of the line, so we will need to investigate.<\/p>\n<div class=\"textbox exercises\">\n<h3>Example<\/h3>\n<p>Recall that an outlier is an extreme observation far away from the rest of the data. Observe the two scatterplots below. Both contain the same 19 data points, but the second plot also contains an influential point. 
Notice the effect the outlier has on the regression line.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-4840\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/03\/03041446\/Influential-Point.jpg\" alt=\"two scatterplots are shown, the first with a line of best fit with y-intercept of 10, positive slope, and 20 data points tightly arranged about the line of best fit. The second shows a line of best fit with y-intercept of 17, 19 identical data points as the first and one outlier far above the line. The first line of best fit is superimposed on the second plot, showing that the line of best fit has been pulled upward toward the extreme outlier.\" width=\"1044\" height=\"1086\" \/><\/p>\n<p>The red line in the second plot is the original line of best fit. The extreme outlier appears to have pulled the line upward in the second plot.<\/p>\n<ol>\n<li>What is the y-intercept in the first plot, without the outlier?<\/li>\n<li>What is the y-intercept in the second plot, with the extreme outlier?<\/li>\n<\/ol>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q977299\">Show Answer<\/span><\/p>\n<div id=\"q977299\" class=\"hidden-answer\" style=\"display: none\">\n<ol>\n<li>The first y-intercept appears to be approximately 10.<\/li>\n<li>The second y-intercept, under the influence of the outlier, appears to be approximately 17.<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<p>Note the differences in the characteristics of the two lines.<\/p>\n<table style=\"border-collapse: collapse; width: 100%; height: 36px;\">\n<tbody>\n<tr style=\"height: 12px;\">\n<td style=\"width: 25%; height: 12px;\"><\/td>\n<td style=\"width: 25%; height: 12px;\">Slope<\/td>\n<td style=\"width: 25%; height: 12px;\">[latex]r[\/latex]<\/td>\n<td style=\"width: 25%; height: 12px;\">[latex]R^{2}[\/latex]<\/td>\n<\/tr>\n<tr style=\"height: 
\n<td style=">
12px;\">\n<td style=\"width: 25%; height: 12px;\">Original line without the outlier<\/td>\n<td style=\"width: 25%; height: 12px;\">[latex]3.84[\/latex]<\/td>\n<td style=\"width: 25%; height: 12px;\">[latex]0.90[\/latex]<\/td>\n<td style=\"width: 25%; height: 12px;\">[latex]81 \\%[\/latex]<\/td>\n<\/tr>\n<tr style=\"height: 12px;\">\n<td style=\"width: 25%; height: 12px;\">Second line with the outlier<\/td>\n<td style=\"width: 25%; height: 12px;\">[latex]3.84[\/latex]<\/td>\n<td style=\"width: 25%; height: 12px;\">[latex]0.64[\/latex]<\/td>\n<td style=\"width: 25%; height: 12px;\">[latex]40 \\%[\/latex]<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<\/div>\n<p>As we noted above, not all outliers qualify as influential points. That is, not all outliers have a large effect on the slope of the regression line. Questions 6 and 7 will help illustrate this idea.<\/p>\n<div class=\"textbox key-takeaways\">\n<h3>question 6<\/h3>\n<p>Open the <em>DCMP Linear Regression<\/em> tool at <a href=\"https:\/\/dcmathpathways.shinyapps.io\/LinearRegression\/\">https:\/\/dcmathpathways.shinyapps.io\/LinearRegression\/<\/a>. Open the \u201cAnimal Longevity\u201d dataset. This is the same dataset that was used in Question 1. Make sure that gestation is the explanatory variable and longevity is the response variable.<\/p>\n<p>&nbsp;<\/p>\n<p>Part A: Find the point on the graph corresponding to the elephant by hovering over the points. 
Is this observation an outlier that does not fit the trend?<\/p>\n<ol>\n<li>a) Yes<\/li>\n<li>b) No<\/li>\n<\/ol>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q194072\">Hint<\/span><\/p>\n<div id=\"q194072\" class=\"hidden-answer\" style=\"display: none\">Where does the point lie in relation to the line of best fit?<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part B: What is [latex]R^2[\/latex] for this dataset?<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q778353\">Hint<\/span><\/p>\n<div id=\"q778353\" class=\"hidden-answer\" style=\"display: none\">See the table below the scatterplot.<\/div>\n<\/div>\n<p>Part C: Select the box on the left side of the screen that says \u201cClick to Remove Points.\u201d Remove the point corresponding to the elephant by either clicking on the scatterplot or by highlighting the elephant row in the data table and deleting it. How does [latex]R^2[\/latex] change when you remove the elephant from the dataset?<\/p>\n<ol>\n<li>a) [latex]R^2[\/latex] increases<\/li>\n<li>b) [latex]R^2[\/latex] decreases<\/li>\n<li>c) [latex]R^2[\/latex] stays the same<\/li>\n<\/ol>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q219512\">Hint<\/span><\/p>\n<div id=\"q219512\" class=\"hidden-answer\" style=\"display: none\">Compare the value of [latex]R^2[\/latex] before and after clicking to remove the data point.<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part D: Why do you think this is? 
How does removing the elephant affect the relative amount of variability that is explained by the linear relationship?<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q80822\">Hint<\/span><\/p>\n<div id=\"q80822\" class=\"hidden-answer\" style=\"display: none\">What do <em>you<\/em> think? If the data point corresponding to the elephant is not an outlier, would removing it increase or decrease the variability explained by the linear relationship?<\/div>\n<\/div>\n<\/div>\n<div class=\"textbox key-takeaways\">\n<h3>question 7<\/h3>\n<p>Now, reload the \u201cAnimal Longevity\u201d dataset so that the data for the elephant are included once again. You may need to reload the webpage, then reload the dataset.<\/p>\n<p>&nbsp;<\/p>\n<p>Part A: Find the point on the graph corresponding to the hippopotamus. Is this observation an outlier?<\/p>\n<ol>\n<li>a) Yes<\/li>\n<li>b) No<\/li>\n<\/ol>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q142783\">Hint<\/span><\/p>\n<div id=\"q142783\" class=\"hidden-answer\" style=\"display: none\">Where does the point lie in relation to the line of best fit?<\/div>\n<\/div>\n<p>Part B: What is [latex]R^2[\/latex] for this dataset?<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q359610\">Hint<\/span><\/p>\n<div id=\"q359610\" class=\"hidden-answer\" style=\"display: none\">You\u2019ve already answered this in the previous question. See the table below the scatterplot.<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part C: Select the box on the left side of the screen that says \u201cClick to Remove Points.\u201d Remove the point corresponding to the hippopotamus by either clicking on the scatterplot or by highlighting the hippopotamus row in the data table and deleting it. 
How does [latex]R^2[\/latex] change when you remove the hippopotamus from the dataset?<\/p>\n<ol>\n<li>a) [latex]R^2[\/latex] increases<\/li>\n<li>b) [latex]R^2[\/latex] decreases<\/li>\n<li>c) [latex]R^2[\/latex] stays the same<\/li>\n<\/ol>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q546005\">Hint<\/span><\/p>\n<div id=\"q546005\" class=\"hidden-answer\" style=\"display: none\">Compare the value of [latex]R^2[\/latex] before and after clicking to remove the data point.<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Part D: Why do you think this is? How does removing the hippopotamus affect the relative amount of variability that is explained by the linear relationship?<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q874512\">Hint<\/span><\/p>\n<div id=\"q874512\" class=\"hidden-answer\" style=\"display: none\">What do <em>you\u00a0<\/em>think?<\/div>\n<\/div>\n<\/div>\n<p>Finally, let's move our attention from the idea of influential points to consider again the shape of a plot with respect to a very large or very small value of\u00a0[latex]R^2[\/latex]. Recall the situations you explored above that violate conditions necessary to consider a linear model appropriate for data. Then answer Question 8.<\/p>\n<div class=\"textbox key-takeaways\">\n<h3>question 8<\/h3>\n<p>Consider this plot again, which we saw in a previous question. The scatterplot has [latex]R^2=98.2\\%[\/latex].<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1302\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/5738\/2022\/01\/12202734\/Picture186-300x179.png\" alt=\"A scatterplot with a line of best fit. 
The points are clustered closely around the line, alternating being above and below the line at semi-regular intervals.\" width=\"1463\" height=\"873\" \/><\/p>\n<p>Determine whether the following statement is true or false: If the scatterplot of two variables has a high [latex]R^2[\/latex] value, then a line is an appropriate model for the relationship between the variables.<\/p>\n<div class=\"qa-wrapper\" style=\"display: block\"><span class=\"show-answer collapsed\" style=\"cursor: pointer\" data-target=\"q803464\">Hint<\/span><\/p>\n<div id=\"q803464\" class=\"hidden-answer\" style=\"display: none\">What conditions must be satisfied in order for a linear regression model to be appropriate?<\/div>\n<\/div>\n<\/div>\n<h2>Summary<\/h2>\n<p>In this <em>What to Know\u00a0<\/em>page,\u00a0you calculated and interpreted residuals and examined assumptions and the effects of outliers on the appropriateness of a linear model for bivariate data. Let\u2019s summarize these three ideas by noting the questions in which they appeared.<\/p>\n<ul>\n<li>In Questions 1, 2, 3, and 4, you calculated and interpreted residual errors.<\/li>\n<li>In Questions 5 and 8, you identified violations of assumptions needed to perform linear regression.<\/li>\n<li>In Questions 6 and 7, you discussed the effect of influential points on\u00a0[latex]R^{2}[\/latex].<\/li>\n<\/ul>\n<p>Understanding the characteristics and processes of analyzing residuals is crucial to performing linear regression analysis on data. 
Let's move on to the activity to apply these skills.<\/p>\n","protected":false},"author":428269,"menu_order":18,"template":"","meta":{"_candela_citation":"[]","CANDELA_OUTCOMES_GUID":"","pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-3868","chapter","type-chapter","status-publish","hentry"],"part":4241,"_links":{"self":[{"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/chapters\/3868","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/wp\/v2\/users\/428269"}],"version-history":[{"count":13,"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/chapters\/3868\/revisions"}],"predecessor-version":[{"id":4842,"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/chapters\/3868\/revisions\/4842"}],"part":[{"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/parts\/4241"}],"metadata":[{"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/chapters\/3868\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/wp\/v2\/media?parent=3868"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/pressbooks\/v2\/chapter-type?post=3868"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmock
up\/wp-json\/wp\/v2\/contributor?post=3868"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/lumen-danacenter-statsmockup\/wp-json\/wp\/v2\/license?post=3868"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}