{"id":969,"date":"2016-04-21T22:43:35","date_gmt":"2016-04-21T22:43:35","guid":{"rendered":"https:\/\/courses.lumenlearning.com\/introstats1xmaster\/?post_type=chapter&#038;p=969"},"modified":"2016-04-21T22:43:35","modified_gmt":"2016-04-21T22:43:35","slug":"checking-model-assumptions-using-graphs","status":"publish","type":"chapter","link":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/chapter\/checking-model-assumptions-using-graphs\/","title":{"raw":"Checking Model Assumptions Using Graphs","rendered":"Checking Model Assumptions Using Graphs"},"content":{"raw":"<p>Multiple regression methods using the model\n<\/p><p style=\"text-align: center;\">[latex]\\displaystyle\\hat{y}=\\beta_0+\\beta_1x_1+\\beta_2x_2+\\dots+\\beta_kx_k[\/latex]<\/p>\ngenerally depend on the following four assumptions:\n<ol><li>the residuals of the model are nearly normal,<\/li>\n\t<li>the variability of the residuals is nearly constant,<\/li>\n\t<li>the residuals are independent, and<\/li>\n\t<li>each variable is linearly related to the outcome.<\/li>\n<\/ol><strong>Diagnostic plots<\/strong> can be used to check each of these assumptions. We will consider the\u00a0model from the Mario Kart auction data, and check whether there are any notable concerns:\n<p style=\"text-align: center;\">[latex]\\displaystyle\\widehat{\\text{price}}=36.05+5.18\\times\\text{cond_new}+1.12\\times\\text{stock_photo}+7.30\\times\\text{wheels}[\/latex]<\/p>\n<strong>Normal probability plot.<\/strong> A normal probability plot of the residuals is shown in Figure 1. While the plot exhibits some minor irregularities, there are no outliers that might be cause for concern. 
In a normal probability plot for residuals, we tend to be most worried about residuals that appear to be outliers, since these indicate long\u00a0tails in the distribution of residuals.\n\n[caption id=\"attachment_1462\" align=\"aligncenter\" width=\"458\"]<img class=\"wp-image-1462 size-full\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215319\/Figure8_1.png\" alt=\"A scatter plot of data points. The data has a tight line of best fit, with a few outliers in the top right corner of the plot.\" width=\"458\" height=\"367\"\/> Figure 1. A normal probability plot of the residuals is helpful in identifying observations that might be outliers.[\/caption]\n\n<strong>Absolute values of residuals against fitted values.<\/strong> A plot of the absolute value of\u00a0the residuals against their corresponding fitted values [latex]\\left(\\displaystyle\\hat{y}_i\\right)[\/latex]\u00a0is shown in Figure 2.\n\nThis plot is helpful to check the condition that the variance of the residuals is approximately constant. We don't see any obvious deviations from constant variance in\u00a0this example.\n\n[caption id=\"attachment_1463\" align=\"aligncenter\" width=\"531\"]<img class=\"wp-image-1463 size-full\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215321\/Figure8_2.png\" alt=\"The y axis of this plot is the absolute value of residuals, and the x axis is the fitted values.\" width=\"531\" height=\"357\"\/> Figure 2.\u00a0Comparing the absolute value of the residuals against the fitted\u00a0values [latex]\\left(\\displaystyle\\hat{y}_i\\right)[\/latex]\u00a0is helpful in identifying deviations from the constant variance\u00a0assumption.[\/caption]<strong>Residuals in order of their data collection.<\/strong> A plot of the residuals in the order their\u00a0corresponding auctions were observed is shown in Figure 3. 
Such a plot is helpful in identifying any connection between cases that are close to one another, e.g. we could look for declining prices over time or if there was a time of the day when auctions\u00a0tended to fetch a higher price. Here we see no structure that indicates a problem.[footnote]An especially rigorous check would use time series methods. For instance, we could check whether consecutive residuals are correlated. Doing so with these residuals yields no statistically significant correlations.[\/footnote]\n\n[caption id=\"attachment_1464\" align=\"aligncenter\" width=\"550\"]<img class=\"wp-image-1464 size-full\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215322\/Figure8_3.png\" alt=\"A scatter plot. The y axis is the residuals, and the x axis is the order of collection.\" width=\"550\" height=\"376\"\/> Figure 3. Plotting residuals in the order that their corresponding observations\u00a0were collected helps identify connections between successive observations.\u00a0If it seems that consecutive observations tend to be close to\u00a0each other, this indicates the independence assumption of the observations\u00a0would fail.[\/caption]\n\n<strong>Residuals against each predictor variable.<\/strong> We consider a plot of the residuals against\u00a0the cond_new variable, the residuals against the stock photo variable, and the residuals against the wheels variable. These plots are shown in Figure 4. For the two-level condition variable, we are guaranteed not to see any remaining trend, and instead we are checking that the variability doesn't fluctuate across groups, which it does not. However, looking at the stock photo variable, we find that there is some difference in the variability of the residuals in the two groups. Additionally, when we consider the residuals against the wheels variable, we see some possible structure. 
There appears to be curvature in the residuals, indicating the relationship is probably\u00a0not linear.\n\n[caption id=\"attachment_1465\" align=\"aligncenter\" width=\"786\"]<img class=\"wp-image-1465 size-full\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215324\/Figure8_4.jpg\" alt=\"Plots comparing the residuals to the condition, photo type, and number of wheels.\" width=\"786\" height=\"993\"\/> Figure 4. For the condition and stock photo variables, we check for\u00a0differences in the distribution shape or variability of the residuals. In the\u00a0case of the stock photos variable, we see a little less variability in the unique\u00a0photo group than the stock photo group. For numerical predictors, we also\u00a0check for trends or other structure. We see some slight bowing in the\u00a0residuals against the wheels variable in the bottom plot.[\/caption]\n\nIt is necessary to summarize diagnostics for any model fit. If the diagnostics support\u00a0the model assumptions, this would improve credibility in the findings. If the diagnostic assessment shows remaining underlying structure in the residuals, we should try to adjust the model to account for that structure. If we are unable to do so, we may still report the model but must also note its shortcomings. In the case of the auction data, we report that there appears to be non-constant variance in the stock photo variable and that there may be a nonlinear relationship between the total price and the number of wheels included for an auction. This information would be important to buyers and sellers who may review the analysis, and omitting this information could be a setback to the very people whom the\u00a0model might assist.\n<div class=\"textbox\">\n<h3>Be Wary<\/h3>\n<strong>\"All models are wrong, but some are useful\" <\/strong>\n\n<strong>\u2014George E.P. Box <\/strong>\n\nThe truth is that no model is perfect. 
However, even imperfect models can be\u00a0useful. Reporting a flawed model can be reasonable so long as we are clear and\u00a0report the model's shortcomings.\n\n<\/div>\n<div class=\"textbox\">\n<h3>Caution: Don't report results when assumptions are grossly violated<\/h3>\nWhile there is a little leeway in model assumptions, don't go too far. If model assumptions are very clearly violated, consider a new model, even if it means learning\u00a0more statistical methods or hiring someone who can help.\n\n<\/div>\n<div class=\"textbox\">\n<h3>TIP: Confidence intervals in multiple regression<\/h3>\nConfidence intervals for coefficients in multiple regression can be computed using\u00a0the same formula as in the single predictor model:\n\n[latex]\\displaystyle{b}_i\\pm{t}^*_{df}SE_{b_i}[\/latex]\n\nwhere<em> t*<sub>df<\/sub><\/em>\u00a0is the appropriate<em> t<\/em>-value corresponding to the confidence level and model\u00a0degrees of freedom,<em> df<\/em> = <em>n<\/em> \u2212 <em>k<\/em> \u2212 1.\n\n<\/div>","rendered":"<p>Multiple regression methods using the model\n<\/p>\n<p style=\"text-align: center;\">[latex]\\displaystyle\\hat{y}=\\beta_0+\\beta_1x_1+\\beta_2x_2+\\dots+\\beta_kx_k[\/latex]<\/p>\n<p>generally depend on the following four assumptions:<\/p>\n<ol>\n<li>the residuals of the model are nearly normal,<\/li>\n<li>the variability of the residuals is nearly constant,<\/li>\n<li>the residuals are independent, and<\/li>\n<li>each variable is linearly related to the outcome.<\/li>\n<\/ol>\n<p><strong>Diagnostic plots<\/strong> can be used to check each of these assumptions. 
We will consider the\u00a0model from the Mario Kart auction data, and check whether there are any notable concerns:<\/p>\n<p style=\"text-align: center;\">[latex]\\displaystyle\\widehat{\\text{price}}=36.05+5.18\\times\\text{cond_new}+1.12\\times\\text{stock_photo}+7.30\\times\\text{wheels}[\/latex]<\/p>\n<p><strong>Normal probability plot.<\/strong> A normal probability plot of the residuals is shown in Figure 1. While the plot exhibits some minor irregularities, there are no outliers that might be cause for concern. In a normal probability plot for residuals, we tend to be most worried about residuals that appear to be outliers, since these indicate long\u00a0tails in the distribution of residuals.<\/p>\n<div id=\"attachment_1462\" style=\"width: 468px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1462\" class=\"wp-image-1462 size-full\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215319\/Figure8_1.png\" alt=\"A scatter plot of data points. The data has a tight line of best fit, with a few outliers in the top right corner of the plot.\" width=\"458\" height=\"367\" \/><\/p>\n<p id=\"caption-attachment-1462\" class=\"wp-caption-text\">Figure 1. A normal probability plot of the residuals is helpful in identifying observations that might be outliers.<\/p>\n<\/div>\n<p><strong>Absolute values of residuals against fitted values.<\/strong> A plot of the absolute value of\u00a0the residuals against their corresponding fitted values [latex]\\left(\\displaystyle\\hat{y}_i\\right)[\/latex]\u00a0is shown in Figure 2.<\/p>\n<p>This plot is helpful to check the condition that the variance of the residuals is approximately constant. 
We don&#8217;t see any obvious deviations from constant variance in\u00a0this example.<\/p>\n<div id=\"attachment_1463\" style=\"width: 541px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1463\" class=\"wp-image-1463 size-full\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215321\/Figure8_2.png\" alt=\"The y axis of this plot is the absolute value of residuals, and the x axis is the fitted values.\" width=\"531\" height=\"357\" \/><\/p>\n<p id=\"caption-attachment-1463\" class=\"wp-caption-text\">Figure 2.\u00a0Comparing the absolute value of the residuals against the fitted\u00a0values [latex]\\left(\\displaystyle\\hat{y}_i\\right)[\/latex]\u00a0is helpful in identifying deviations from the constant variance\u00a0assumption.<\/p>\n<\/div>\n<p><strong>Residuals in order of their data collection.<\/strong> A plot of the residuals in the order their\u00a0corresponding auctions were observed is shown in Figure 3. Such a plot is helpful in identifying any connection between cases that are close to one another, e.g. we could look for declining prices over time or if there was a time of the day when auctions\u00a0tended to fetch a higher price. Here we see no structure that indicates a problem.<a class=\"footnote\" title=\"An especially rigorous check would use time series methods. For instance, we could check whether consecutive residuals are correlated. 
Doing so with these residuals yields no statistically significant correlations.\" id=\"return-footnote-969-1\" href=\"#footnote-969-1\" aria-label=\"Footnote 1\"><sup class=\"footnote\">[1]<\/sup><\/a><\/p>\n<div id=\"attachment_1464\" style=\"width: 560px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1464\" class=\"wp-image-1464 size-full\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215322\/Figure8_3.png\" alt=\"A scatter plot. The y axis is the residuals, and the x axis is the order of collection.\" width=\"550\" height=\"376\" \/><\/p>\n<p id=\"caption-attachment-1464\" class=\"wp-caption-text\">Figure 3. Plotting residuals in the order that their corresponding observations\u00a0were collected helps identify connections between successive observations.\u00a0If it seems that consecutive observations tend to be close to\u00a0each other, this indicates the independence assumption of the observations\u00a0would fail.<\/p>\n<\/div>\n<p><strong>Residuals against each predictor variable.<\/strong> We consider a plot of the residuals against\u00a0the cond_new variable, the residuals against the stock photo variable, and the residuals against the wheels variable. These plots are shown in Figure 4. For the two-level condition variable, we are guaranteed not to see any remaining trend, and instead we are checking that the variability doesn&#8217;t fluctuate across groups, which it does not. However, looking at the stock photo variable, we find that there is some difference in the variability of the residuals in the two groups. Additionally, when we consider the residuals against the wheels variable, we see some possible structure. 
There appears to be curvature in the residuals, indicating the relationship is probably\u00a0not linear.<\/p>\n<div id=\"attachment_1465\" style=\"width: 796px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1465\" class=\"wp-image-1465 size-full\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215324\/Figure8_4.jpg\" alt=\"Plots comparing the residuals to the condition, photo type, and number of wheels.\" width=\"786\" height=\"993\" \/><\/p>\n<p id=\"caption-attachment-1465\" class=\"wp-caption-text\">Figure 4. For the condition and stock photo variables, we check for\u00a0differences in the distribution shape or variability of the residuals. In the\u00a0case of the stock photos variable, we see a little less variability in the unique\u00a0photo group than the stock photo group. For numerical predictors, we also\u00a0check for trends or other structure. We see some slight bowing in the\u00a0residuals against the wheels variable in the bottom plot.<\/p>\n<\/div>\n<p>It is necessary to summarize diagnostics for any model fit. If the diagnostics support\u00a0the model assumptions, this would improve credibility in the findings. If the diagnostic assessment shows remaining underlying structure in the residuals, we should try to adjust the model to account for that structure. If we are unable to do so, we may still report the model but must also note its shortcomings. In the case of the auction data, we report that there appears to be non-constant variance in the stock photo variable and that there may be a nonlinear relationship between the total price and the number of wheels included for an auction. 
This information would be important to buyers and sellers who may review the analysis, and omitting this information could be a setback to the very people whom the\u00a0model might assist.<\/p>\n<div class=\"textbox\">\n<h3>Be Wary<\/h3>\n<p><strong>&#8220;All models are wrong, but some are useful&#8221; <\/strong><\/p>\n<p><strong>\u2014George E.P. Box <\/strong><\/p>\n<p>The truth is that no model is perfect. However, even imperfect models can be\u00a0useful. Reporting a flawed model can be reasonable so long as we are clear and\u00a0report the model&#8217;s shortcomings.<\/p>\n<\/div>\n<div class=\"textbox\">\n<h3>Caution: Don&#8217;t report results when assumptions are grossly violated<\/h3>\n<p>While there is a little leeway in model assumptions, don&#8217;t go too far. If model assumptions are very clearly violated, consider a new model, even if it means learning\u00a0more statistical methods or hiring someone who can help.<\/p>\n<\/div>\n<div class=\"textbox\">\n<h3>TIP: Confidence intervals in multiple regression<\/h3>\n<p>Confidence intervals for coefficients in multiple regression can be computed using\u00a0the same formula as in the single predictor model:<\/p>\n<p>[latex]\\displaystyle{b}_i\\pm{t}^*_{df}SE_{b_i}[\/latex]<\/p>\n<p>where<em> t*<sub>df<\/sub><\/em>\u00a0is the appropriate<em> t<\/em>-value corresponding to the confidence level and model\u00a0degrees of freedom,<em> df<\/em> = <em>n<\/em> \u2212 <em>k<\/em> \u2212 1.<\/p>\n<\/div>\n\n\t\t\t <section class=\"citations-section\" role=\"contentinfo\">\n\t\t\t <h3>Candela Citations<\/h3>\n\t\t\t\t\t <div>\n\t\t\t\t\t\t <div id=\"citation-list-969\">\n\t\t\t\t\t\t\t <div class=\"licensing\"><div class=\"license-attribution-dropdown-subheading\">CC licensed content, Shared previously<\/div><ul class=\"citation-list\"><li>OpenIntro Statistics. <strong>Authored by<\/strong>: David M Diez, Christopher D Barr, and Mine Cetinkaya-Rundel. <strong>Provided by<\/strong>: OpenIntro. 
<strong>Located at<\/strong>: <a target=\"_blank\" href=\"https:\/\/www.openintro.org\/stat\/textbook.php\">https:\/\/www.openintro.org\/stat\/textbook.php<\/a>. <strong>License<\/strong>: <em><a target=\"_blank\" rel=\"license\" href=\"https:\/\/creativecommons.org\/licenses\/by-sa\/4.0\/\">CC BY-SA: Attribution-ShareAlike<\/a><\/em>. <strong>License Terms<\/strong>: This textbook is available under a Creative Commons license. Visit openintro.org for a free  PDF, to download the textbook&#039;s source files.<\/li><\/ul><\/div>\n\t\t\t\t\t\t <\/div>\n\t\t\t\t\t <\/div>\n\t\t\t <\/section><hr class=\"before-footnotes clear\" \/><div class=\"footnotes\"><ol><li id=\"footnote-969-1\">An especially rigorous check would use time series methods. For instance, we could check whether consecutive residuals are correlated. Doing so with these residuals yields no statistically significant correlations. <a href=\"#return-footnote-969-1\" class=\"return-footnote\" aria-label=\"Return to footnote 1\">&crarr;<\/a><\/li><\/ol><\/div>","protected":false},"author":21,"menu_order":3,"template":"","meta":{"_candela_citation":"[{\"type\":\"cc\",\"description\":\"OpenIntro Statistics\",\"author\":\"David M Diez, Christopher D Barr, and Mine Cetinkaya-Rundel\",\"organization\":\"OpenIntro\",\"url\":\"https:\/\/www.openintro.org\/stat\/textbook.php\",\"project\":\"\",\"license\":\"cc-by-sa\",\"license_terms\":\"This textbook is available under a Creative Commons license. 
Visit openintro.org for a free  PDF, to download the textbook's source files.\"}]","CANDELA_OUTCOMES_GUID":"","pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-969","chapter","type-chapter","status-publish","hentry"],"part":961,"_links":{"self":[{"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/chapters\/969","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/wp\/v2\/users\/21"}],"version-history":[{"count":1,"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/chapters\/969\/revisions"}],"predecessor-version":[{"id":1237,"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/chapters\/969\/revisions\/1237"}],"part":[{"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/parts\/961"}],"metadata":[{"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/chapters\/969\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/wp\/v2\/media?parent=969"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/chapter-type?post=969"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/wp\/v2\/contributor?post=969"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/wp\/v2\/license?post=969"}],"curies":[{"name":"wp","href":"https:\
/\/api.w.org\/{rel}","templated":true}]}}