{"id":964,"date":"2016-04-21T22:43:35","date_gmt":"2016-04-21T22:43:35","guid":{"rendered":"https:\/\/courses.lumenlearning.com\/introstats1xmaster\/?post_type=chapter&#038;p=964"},"modified":"2017-07-20T19:48:57","modified_gmt":"2017-07-20T19:48:57","slug":"model-selection","status":"web-only","type":"chapter","link":"https:\/\/courses.lumenlearning.com\/suny-suffolk-introstats1\/chapter\/model-selection\/","title":{"raw":"Model Selection","rendered":"Model Selection"},"content":{"raw":"<div>\r\n\r\nThe best model is not always the most complicated. Sometimes including variables that\u00a0are not evidently important can actually reduce the accuracy of predictions. In this section we discuss model selection strategies, which will help us eliminate variables from the model\u00a0that are found to be less important.\r\n\r\nIn practice, the model that includes all available explanatory variables is often referred\u00a0to as the<strong> full model<\/strong>. The full model may not be the best model, and if it isn't, we want\u00a0to identify a smaller model that is preferable.\r\n<h2>Identifying variables in the model that may not be helpful<\/h2>\r\nAdjusted<em> <em>R<\/em><sup>2<\/sup><\/em>\u00a0describes the strength of a model fit, and it is a useful tool for evaluating\u00a0which predictors are adding value to the model, where<em> adding value<\/em> means they are (likely)\u00a0improving the accuracy in predicting future outcomes.\r\n\r\nLet's consider two models, which are shown in Tables 1\u00a0and 2. The first table\u00a0summarizes the full model since it includes all predictors, while the second does not include\u00a0the duration variable.\r\n\r\n<\/div>\r\n<em>df<\/em> = 136\r\n<table>\r\n<thead>\r\n<tr>\r\n<th colspan=\"5\">Table 1.\u00a0The fit for the full regression model, including the adjusted<em> <em>R<\/em><sup>2<\/sup><\/em>.<\/th>\r\n<\/tr>\r\n<\/thead>\r\n<tbody>\r\n<tr>\r\n<td><\/td>\r\n<td>Estimate<\/td>\r\n<td>Std. 
Error<\/td>\r\n<td>t value<\/td>\r\n<td>Pr(<em> &gt;<\/em>|t|)<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>(Intercept)<\/td>\r\n<td>36.2110<\/td>\r\n<td>1.5140<\/td>\r\n<td>23.92<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>cond_new<\/td>\r\n<td>5.1306<\/td>\r\n<td>1.0511<\/td>\r\n<td>4.88<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>stock_photo<\/td>\r\n<td>1.0803<\/td>\r\n<td>1.0568<\/td>\r\n<td>1.02<\/td>\r\n<td>0.3085<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>duration<\/td>\r\n<td>\u20130.0268<\/td>\r\n<td>0.1904<\/td>\r\n<td>\u20130.14<\/td>\r\n<td>0.8882<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>wheels<\/td>\r\n<td>7.2852<\/td>\r\n<td>0.5547<\/td>\r\n<td>13.13<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td><em><em><em>R<\/em><sup>2<\/sup><sub><em>adj<\/em><\/sub><\/em><\/em>\u00a0= 0<em>.<\/em>7108<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n<div><\/div>\r\n<div>\u00a0\u00a0\u00a0\u00a0\u00a0<em>df<\/em> = 137\r\n<table>\r\n<thead>\r\n<tr>\r\n<th colspan=\"5\">Table 2. The fit for the regression model for predictors cond_new,\u00a0stock_photo, and wheels.<\/th>\r\n<\/tr>\r\n<\/thead>\r\n<tbody>\r\n<tr>\r\n<td><\/td>\r\n<td>Estimate<\/td>\r\n<td>Std. 
Error<\/td>\r\n<td>t value<\/td>\r\n<td>Pr(<em> &gt;<\/em>|t|)<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>(Intercept)<\/td>\r\n<td>36.0483<\/td>\r\n<td>0.9745<\/td>\r\n<td>36.99<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>cond_new<\/td>\r\n<td>5.1763<\/td>\r\n<td>0.9961<\/td>\r\n<td>5.20<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>stock_photo<\/td>\r\n<td>1.1177<\/td>\r\n<td>1.0192<\/td>\r\n<td>1.10<\/td>\r\n<td>0.2747<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>wheels<\/td>\r\n<td>7.2984<\/td>\r\n<td>0.5448<\/td>\r\n<td>13.40<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td><em><em><em>R<\/em><sup>2<\/sup><sub><em>adj<\/em><\/sub><\/em><\/em>\u00a0= 0<em>.<\/em>7128<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n<\/div>\r\n&nbsp;\r\n<div class=\"textbox exercises\">\r\n<h3>Example<\/h3>\r\nWhich of the two models is better?\r\n\r\nSolution:\r\n\r\nWe compare the adjusted<em> R<sup>2<\/sup><\/em>\u00a0of each model to determine which to choose. Since the\u00a0first model has an<em> R<sup>2<\/sup><sub>adj<\/sub><\/em>\u00a0smaller than the<em> R<sup>2<\/sup><sub>adj<\/sub><\/em>\u00a0of the second model, we prefer the\u00a0second model to the first.\r\n\r\nWill the model without duration be better than the model with duration? We cannot\u00a0know for sure, but based on the adjusted<em> R<sup>2<\/sup><\/em>, this is our best assessment.\r\n\r\n<\/div>\r\n<h2>Two model selection strategies<\/h2>\r\nTwo common strategies for adding or removing variables in a multiple regression model are\u00a0called<em> backward elimination<\/em> and<em> forward selection<\/em>. These techniques are often referred to as<strong> stepwise<\/strong> model selection strategies, because they add or delete one variable at a time\u00a0as they \"step\" through the candidate predictors.\r\n\r\n<strong>Backward elimination<\/strong> starts with the model that includes all potential predictor\u00a0variables. 
Variables are eliminated one-at-a-time from the model until we cannot improve the adjusted<em> <em>R<\/em><sup>2<\/sup><\/em>. The strategy within each elimination step is to eliminate the variable that\u00a0leads to the largest improvement in adjusted<em> <em>R<\/em><sup>2<\/sup><\/em>.\r\n<p style=\"text-align: center;\"><\/p>\r\n\r\n<div class=\"textbox exercises\">\r\n<h3>Example<\/h3>\r\nResults corresponding to the<em> full model<\/em> for the Mario Kart data are shown in Table 1. How should we proceed under the backward elimination\u00a0strategy?\r\n\r\nSolution:\r\n\r\nOur baseline adjusted<em> R<sup>2<\/sup><\/em>\u00a0from the full model is\u00a0<em>R<\/em><sup>2<\/sup><sub><em>adj<\/em><\/sub> = 0<em>.<\/em>7108, and we need to\u00a0determine whether dropping a predictor will improve the adjusted<em> R<sup>2<\/sup><\/em>. To check, we fit four models that each drop a different predictor, and we record the adjusted<em> R<sup>2<\/sup><\/em> from each:\r\n\r\n[latex]\\begin{array}\\text{Exclude . . .}\\hfill&amp;\\text{cond_new}\\hfill&amp;\\text{stock_photo}\\hfill&amp;\\text{duration}\\hfill&amp;\\text{wheels}\\\\\\text{ }\\hfill&amp;R^2_{adj}=0.6626\\hfill&amp;R^2_{adj}=0.7107\\hfill&amp;R^2_{adj}=0.7128\\hfill&amp;R^2_{adj}=0.3487\\end{array}[\/latex]\r\n\r\nThe third model without duration has the highest adjusted<em> R<sup>2<\/sup><\/em>\u00a0of 0.7128, so we\u00a0compare it to the adjusted<em> R<sup>2<\/sup><\/em>\u00a0for the full model. Because eliminating duration\u00a0leads to a model with a higher adjusted<em> R<sup>2<\/sup><\/em>, we drop duration from the model.\u00a0Since we eliminated a predictor from the model in the first step, we see whether\u00a0we should eliminate any additional predictors. Our baseline adjusted<em> R<sup>2<\/sup><\/em>\u00a0is now\u00a0<em>R<sup>2<\/sup><sub>adj<\/sub><\/em>\u00a0= 0<em>.<\/em>7128. 
We now fit three new models, which consider eliminating each of the\u00a0three remaining predictors:\r\n\r\n[latex]\\begin{array}\\text{Exclude duration and . . .}\\hfill&amp;\\text{cond_new}\\hfill&amp;\\text{stock_photo}\\hfill&amp;\\text{wheels}\\\\\\text{ }\\hfill&amp;R^2_{adj}=0.6587\\hfill&amp;R^2_{adj}=0.7124\\hfill&amp;R^2_{adj}=0.3414\\end{array}[\/latex]\r\n\r\nNone of these models leads to an improvement in adjusted<em> R<sup>2<\/sup><\/em>, so we do not eliminate\u00a0any of the remaining predictors. That is, after backward elimination, we are left with the model that keeps cond_new, stock_photo, and wheels, which we can summarize\u00a0using the coefficients from Table 2:\r\n\r\n[latex]\\displaystyle\\hat{y}=b_0+b_1x_1+b_2x_2+b_4x_4[\/latex]\r\n\r\n[latex]\\displaystyle\\widehat{\\text{price}}=36.05+5.18\\times\\text{cond_new}+1.12\\times\\text{stock_photo}+7.30\\times\\text{wheels}[\/latex]\r\n\r\n<\/div>\r\nThe<strong> forward selection<\/strong> strategy is the reverse of the backward elimination technique.\u00a0Instead of eliminating variables one-at-a-time, we add variables one-at-a-time until we\u00a0cannot find any variables that improve the model (as measured by adjusted<em> <em>R<\/em><sup>2<\/sup><\/em>).\r\n\r\n&nbsp;\r\n<div class=\"textbox exercises\">\r\n<h3>Example<\/h3>\r\nConstruct a model for the Mario Kart data set using the forward\u00a0selection strategy.\r\n\r\nSolution:\r\n\r\nWe start with the model that includes no variables. Then we fit each of the possible\u00a0models with just one variable. That is, we fit the model including just cond_new, then the model including just stock_photo, then a model with just duration, and a\u00a0model with just wheels. Each of the four models provides an adjusted<em> R<sup>2<\/sup><\/em>\u00a0value:\r\n\r\n[latex]\\begin{array}\\text{Add . . 
.}\\hfill&amp;\\text{cond_new}\\hfill&amp;\\text{stock_photo}\\hfill&amp;\\text{duration}\\hfill&amp;\\text{wheels}\\\\\\text{ }\\hfill&amp;R^2_{adj}=0.3459\\hfill&amp;R^2_{adj}=0.0332\\hfill&amp;R^2_{adj}=0.1338\\hfill&amp;R^2_{adj}=0.6390\\end{array}[\/latex]\r\n\r\nIn this first step, we compare the adjusted<em> R<sup>2<\/sup><\/em>\u00a0against a baseline model that has\u00a0no predictors. The no-predictors model always has\u00a0<em>R<\/em><sup>2<\/sup><sub><em>adj<\/em><\/sub>\u00a0= 0. The model with one\u00a0predictor that has the largest adjusted<em> R<sup>2<\/sup><\/em>\u00a0is the model with the wheels predictor,\u00a0and because this adjusted<em> R<sup>2<\/sup><\/em>\u00a0is larger than the adjusted<em> R<sup>2<\/sup><\/em>\u00a0from the model with no\u00a0predictors (<em>R<\/em><sup>2<\/sup><sub><em>adj<\/em><\/sub>\u00a0= 0), we will add this variable to our model.\r\n\r\nWe repeat the process, this time considering 2-predictor models where one of\u00a0the predictors is wheels and with a new baseline of<em> R<sup>2<\/sup><sub>adj<\/sub><\/em>\u00a0= 0<em>.<\/em>6390:\r\n\r\n[latex]\\begin{array}\\text{Add wheels and . . .}\\hfill&amp;\\text{cond_new}\\hfill&amp;\\text{stock_photo}\\hfill&amp;\\text{duration}\\\\\\text{ }\\hfill&amp;R^2_{adj}=0.7124\\hfill&amp;R^2_{adj}=0.6587\\hfill&amp;R^2_{adj}=0.6528\\end{array}[\/latex]\r\n\r\nThe best predictor in this stage, cond_new, has a higher adjusted<em> R<sup>2<\/sup><\/em>\u00a0(0.7124) than\u00a0the baseline (0.6390), so we also add cond_new to the model.\r\n\r\nSince we have again added a variable to the model, we continue and see whether it\u00a0would be beneficial to add a third variable:\r\n\r\n[latex]\\begin{array}\\text{Add wheels, cond_new, and . . 
.}\\hfill&amp;\\text{stock_photo}\\hfill&amp;\\text{duration}\\\\\\text{ }\\hfill&amp;R^2_{adj}=0.7128\\hfill&amp;R^2_{adj}=0.7107\\end{array}[\/latex]\r\n\r\nThe model adding stock_photo improved adjusted<em> R<sup>2<\/sup><\/em>\u00a0(0.7124 to 0.7128), so we add\u00a0stock_photo to the model.\r\n\r\nBecause we have again added a predictor, we check whether adding the last variable,\u00a0duration, will improve adjusted<em> R<sup>2<\/sup><\/em>. We compare the adjusted<em> R<sup>2<\/sup><\/em>\u00a0for the model with duration and the other three predictors (0.7108) to the model that only considers wheels, cond_new, and stock_photo (0.7128). Adding duration does not improve the adjusted<em> R<sup>2<\/sup><\/em>, so we do not add it to the model, and we have arrived at the same\u00a0model that we identified from backward elimination.\r\n\r\n<\/div>\r\n<div class=\"textbox\">\r\n<h3>Model Selection Strategies<\/h3>\r\nBackward elimination begins with the largest model and eliminates variables one-by-one until we are satisfied that all remaining variables are important to the model. Forward selection starts with no variables included in the model, then it adds in variables according to their importance until no other important variables\u00a0are found.\r\n\r\n<\/div>\r\nThere is no guarantee that backward elimination and forward selection will arrive at\u00a0the same final model. If both techniques are tried and they arrive at different models, we\u00a0choose the model with the larger<em> <em>R<\/em><sup>2<\/sup><sub><em>adj<\/em><\/sub><\/em>; other tie-break options exist but are beyond the\u00a0scope of this book.\r\n<h2>The <em>p<\/em>-Value Approach, an Alternative to Adjusted <em>R<\/em><sup>2<\/sup><\/h2>\r\nThe <em>p<\/em>-value may be used as an alternative to adjusted <em>R<\/em><sup>2<\/sup>\u00a0for model selection.\r\n\r\nIn backward elimination, we would identify the predictor corresponding to the largest\u00a0<em>p<\/em>-value. 
If the <em>p<\/em>-value is above the significance level, usually <em>\u03b1<\/em> = 0<em>.<\/em>05, then we would drop that variable, refit the model, and repeat the process. If the largest <em>p<\/em>-value is less than <em>\u03b1<\/em> = 0<em>.<\/em>05, then we would not eliminate any predictors and the current model would be our\u00a0best-fitting model.\r\n\r\nIn forward selection with <em>p<\/em>-values, we reverse the process. We begin with a model\u00a0that has no predictors, then we fit a model for each possible predictor, identifying the model where the corresponding predictor's <em>p<\/em>-value is smallest. If that <em>p<\/em>-value is smaller than\u00a0<em>\u03b1<\/em>\u00a0= 0<em>.<\/em>05, we add it to the model and repeat the process, considering whether to add more variables one-at-a-time. When no remaining predictor would have a <em>p<\/em>-value less than 0.05 upon entering the model, we stop adding variables, and the current\u00a0model is our best-fitting model.\r\n\r\n&nbsp;\r\n<div class=\"textbox key-takeaways\">\r\n<h3>Try It<\/h3>\r\nExamine Table 2, which considers the model including the cond_new, stock_photo, and wheels predictors. If we were using the <em>p<\/em>-value approach with backward elimination and we were considering this model, which of these three variables would be up for elimination? Would we drop that variable,\u00a0or would we keep it in the model?\r\n\r\nSolution:\r\n\r\nThe stock_photo predictor is up for elimination since it has the largest <em>p<\/em>-value. Additionally, since that <em>p<\/em>-value is larger than 0.05, we would in fact eliminate stock_photo from the model.\r\n\r\n<\/div>\r\nWhile the adjusted<em> <em><em>R<\/em><sup>2<\/sup><\/em><\/em>\u00a0and <em>p<\/em>-value approaches are similar, they sometimes lead to\u00a0different models, with the adjusted<em> <em><em>R<\/em><sup>2<\/sup><\/em><\/em>\u00a0approach tending to include more predictors in the final model. 
For example, if we had used the <em>p<\/em>-value approach with the auction data, we\u00a0would not have included the stock_photo predictor in the final model.\r\n<div class=\"textbox\">\r\n<h3>When to use the adjusted <em>R<\/em><sup>2<\/sup>\u00a0and when to use the <em>p<\/em>-value approach<\/h3>\r\nWhen the sole goal is to improve prediction accuracy, use adjusted <em>R<\/em><sup>2<\/sup>. This is\u00a0commonly the case in machine learning applications.\r\n\r\nWhen we care about understanding which variables are statistically significant predictors of the response, or if there is interest in producing a simpler model at the potential cost of a little prediction accuracy, then the <em>p<\/em>-value approach is preferred.\r\n\r\n<\/div>\r\nRegardless of whether you use adjusted <em>R<\/em><sup>2<\/sup>\u00a0or the <em>p<\/em>-value approach, or whether you use\u00a0the backward elimination or forward selection strategy, your job is not done after variable\u00a0selection. You must still verify that the model conditions are reasonable.","rendered":"<div>\n<p>The best model is not always the most complicated. Sometimes including variables that\u00a0are not evidently important can actually reduce the accuracy of predictions. In this section we discuss model selection strategies, which will help us eliminate variables from the model\u00a0that are found to be less important.<\/p>\n<p>In practice, the model that includes all available explanatory variables is often referred\u00a0to as the<strong> full model<\/strong>. 
The full model may not be the best model, and if it isn&#8217;t, we want\u00a0to identify a smaller model that is preferable.<\/p>\n<h2>Identifying variables in the model that may not be helpful<\/h2>\n<p>Adjusted<em> <em>R<\/em><sup>2<\/sup><\/em>\u00a0describes the strength of a model fit, and it is a useful tool for evaluating\u00a0which predictors are adding value to the model, where<em> adding value<\/em> means they are (likely)\u00a0improving the accuracy in predicting future outcomes.<\/p>\n<p>Let&#8217;s consider two models, which are shown in Tables 1\u00a0and 2. The first table\u00a0summarizes the full model since it includes all predictors, while the second does not include\u00a0the duration variable.<\/p>\n<\/div>\n<p><em>df<\/em> = 136<\/p>\n<table>\n<thead>\n<tr>\n<th colspan=\"5\">Table 1.\u00a0The fit for the full regression model, including the adjusted<em> <em>R<\/em><sup>2<\/sup><\/em>.<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><\/td>\n<td>Estimate<\/td>\n<td>Std. Error<\/td>\n<td>t value<\/td>\n<td>Pr(<em> &gt;<\/em>|t|)<\/td>\n<\/tr>\n<tr>\n<td>(Intercept)<\/td>\n<td>36.2110<\/td>\n<td>1.5140<\/td>\n<td>23.92<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>cond_new<\/td>\n<td>5.1306<\/td>\n<td>1.0511<\/td>\n<td>4.88<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>stock_photo<\/td>\n<td>1.0803<\/td>\n<td>1.0568<\/td>\n<td>1.02<\/td>\n<td>0.3085<\/td>\n<\/tr>\n<tr>\n<td>duration<\/td>\n<td>\u20130.0268<\/td>\n<td>0.1904<\/td>\n<td>\u20130.14<\/td>\n<td>0.8882<\/td>\n<\/tr>\n<tr>\n<td>wheels<\/td>\n<td>7.2852<\/td>\n<td>0.5547<\/td>\n<td>13.13<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td><em><em><em>R<\/em><sup>2<\/sup><sub><em>adj<\/em><\/sub><\/em><\/em>\u00a0= 0<em>.<\/em>7108<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div><\/div>\n<div>\u00a0\u00a0\u00a0\u00a0\u00a0<em>df<\/em> = 137<\/p>\n<table>\n<thead>\n<tr>\n<th colspan=\"5\">Table 2. 
The fit for the regression model for predictors cond_new,\u00a0stock_photo, and wheels.<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><\/td>\n<td>Estimate<\/td>\n<td>Std. Error<\/td>\n<td>t value<\/td>\n<td>Pr(<em> &gt;<\/em>|t|)<\/td>\n<\/tr>\n<tr>\n<td>(Intercept)<\/td>\n<td>36.0483<\/td>\n<td>0.9745<\/td>\n<td>36.99<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>cond_new<\/td>\n<td>5.1763<\/td>\n<td>0.9961<\/td>\n<td>5.20<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>stock_photo<\/td>\n<td>1.1177<\/td>\n<td>1.0192<\/td>\n<td>1.10<\/td>\n<td>0.2747<\/td>\n<\/tr>\n<tr>\n<td>wheels<\/td>\n<td>7.2984<\/td>\n<td>0.5448<\/td>\n<td>13.40<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td><em><em><em>R<\/em><sup>2<\/sup><sub><em>adj<\/em><\/sub><\/em><\/em>\u00a0= 0<em>.<\/em>7128<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&nbsp;<\/p>\n<div class=\"textbox exercises\">\n<h3>Example<\/h3>\n<p>Which of the two models is better?<\/p>\n<p>Solution:<\/p>\n<p>We compare the adjusted<em> R<sup>2<\/sup><\/em>\u00a0of each model to determine which to choose. Since the\u00a0first model has an<em> R<sup>2<\/sup><sub>adj<\/sub><\/em>\u00a0smaller than the<em> R<sup>2<\/sup><sub>adj<\/sub><\/em>\u00a0of the second model, we prefer the\u00a0second model to the first.<\/p>\n<p>Will the model without duration be better than the model with duration? We cannot\u00a0know for sure, but based on the adjusted<em> R<sup>2<\/sup><\/em>, this is our best assessment.<\/p>\n<\/div>\n<h2>Two model selection strategies<\/h2>\n<p>Two common strategies for adding or removing variables in a multiple regression model are\u00a0called<em> backward elimination<\/em> and<em> forward selection<\/em>. 
These techniques are often referred to as<strong> stepwise<\/strong> model selection strategies, because they add or delete one variable at a time\u00a0as they &#8220;step&#8221; through the candidate predictors.<\/p>\n<p><strong>Backward elimination<\/strong> starts with the model that includes all potential predictor\u00a0variables. Variables are eliminated one-at-a-time from the model until we cannot improve the adjusted<em> <em>R<\/em><sup>2<\/sup><\/em>. The strategy within each elimination step is to eliminate the variable that\u00a0leads to the largest improvement in adjusted<em> <em>R<\/em><sup>2<\/sup><\/em>.<\/p>\n<p style=\"text-align: center;\"><\/p>\n<div class=\"textbox exercises\">\n<h3>Example<\/h3>\n<p>Results corresponding to the<em> full model<\/em> for the Mario Kart data are shown in Table 1. How should we proceed under the backward elimination\u00a0strategy?<\/p>\n<p>Solution:<\/p>\n<p>Our baseline adjusted<em> R<sup>2<\/sup><\/em>\u00a0from the full model is\u00a0<em>R<\/em><sup>2<\/sup><sub><em>adj<\/em><\/sub> = 0<em>.<\/em>7108, and we need to\u00a0determine whether dropping a predictor will improve the adjusted<em> R<sup>2<\/sup><\/em>. To check, we fit four models that each drop a different predictor, and we record the adjusted<em> R<sup>2<\/sup><\/em> from each:<\/p>\n<p>[latex]\\begin{array}\\text{Exclude . . .}\\hfill&\\text{cond_new}\\hfill&\\text{stock_photo}\\hfill&\\text{duration}\\hfill&\\text{wheels}\\\\\\text{ }\\hfill&R^2_{adj}=0.6626\\hfill&R^2_{adj}=0.7107\\hfill&R^2_{adj}=0.7128\\hfill&R^2_{adj}=0.3487\\end{array}[\/latex]<\/p>\n<p>The third model without duration has the highest adjusted<em> R<sup>2<\/sup><\/em>\u00a0of 0.7128, so we\u00a0compare it to the adjusted<em> R<sup>2<\/sup><\/em>\u00a0for the full model. 
Because eliminating duration\u00a0leads to a model with a higher adjusted<em> R<sup>2<\/sup><\/em>, we drop duration from the model.\u00a0Since we eliminated a predictor from the model in the first step, we see whether\u00a0we should eliminate any additional predictors. Our baseline adjusted<em> R<sup>2<\/sup><\/em>\u00a0is now\u00a0<em>R<sup>2<\/sup><sub>adj<\/sub><\/em>\u00a0= 0<em>.<\/em>7128. We now fit three new models, which consider eliminating each of the\u00a0three remaining predictors:<\/p>\n<p>[latex]\\begin{array}\\text{Exclude duration and . . .}\\hfill&\\text{cond_new}\\hfill&\\text{stock_photo}\\hfill&\\text{wheels}\\\\\\text{ }\\hfill&R^2_{adj}=0.6587\\hfill&R^2_{adj}=0.7124\\hfill&R^2_{adj}=0.3414\\end{array}[\/latex]<\/p>\n<p>None of these models leads to an improvement in adjusted<em> R<sup>2<\/sup><\/em>, so we do not eliminate\u00a0any of the remaining predictors. That is, after backward elimination, we are left with the model that keeps cond_new, stock_photo, and wheels, which we can summarize\u00a0using the coefficients from Table 2:<\/p>\n<p>[latex]\\displaystyle\\hat{y}=b_0+b_1x_1+b_2x_2+b_4x_4[\/latex]<\/p>\n<p>[latex]\\displaystyle\\widehat{\\text{price}}=36.05+5.18\\times\\text{cond_new}+1.12\\times\\text{stock_photo}+7.30\\times\\text{wheels}[\/latex]<\/p>\n<\/div>\n<p>The<strong> forward selection<\/strong> strategy is the reverse of the backward elimination technique.\u00a0Instead of eliminating variables one-at-a-time, we add variables one-at-a-time until we\u00a0cannot find any variables that improve the model (as measured by adjusted<em> <em>R<\/em><sup>2<\/sup><\/em>).<\/p>\n<p>&nbsp;<\/p>\n<div class=\"textbox exercises\">\n<h3>Example<\/h3>\n<p>Construct a model for the Mario Kart data set using the forward\u00a0selection strategy.<\/p>\n<p>Solution:<\/p>\n<p>We start with the model that includes no variables. Then we fit each of the possible\u00a0models with just one variable. 
That is, we fit the model including just cond_new, then the model including just stock_photo, then a model with just duration, and a\u00a0model with just wheels. Each of the four models provides an adjusted<em> R<sup>2<\/sup><\/em>\u00a0value:<\/p>\n<p>[latex]\\begin{array}\\text{Add . . .}\\hfill&\\text{cond_new}\\hfill&\\text{stock_photo}\\hfill&\\text{duration}\\hfill&\\text{wheels}\\\\\\text{ }\\hfill&R^2_{adj}=0.3459\\hfill&R^2_{adj}=0.0332\\hfill&R^2_{adj}=0.1338\\hfill&R^2_{adj}=0.6390\\end{array}[\/latex]<\/p>\n<p>In this first step, we compare the adjusted<em> R<sup>2<\/sup><\/em>\u00a0against a baseline model that has\u00a0no predictors. The no-predictors model always has\u00a0<em>R<\/em><sup>2<\/sup><sub><em>adj<\/em><\/sub>\u00a0= 0. The model with one\u00a0predictor that has the largest adjusted<em> R<sup>2<\/sup><\/em>\u00a0is the model with the wheels predictor,\u00a0and because this adjusted<em> R<sup>2<\/sup><\/em>\u00a0is larger than the adjusted<em> R<sup>2<\/sup><\/em>\u00a0from the model with no\u00a0predictors (<em>R<\/em><sup>2<\/sup><sub><em>adj<\/em><\/sub>\u00a0= 0), we will add this variable to our model.<\/p>\n<p>We repeat the process, this time considering 2-predictor models where one of\u00a0the predictors is wheels and with a new baseline of<em> R<sup>2<\/sup><sub>adj<\/sub><\/em>\u00a0= 0<em>.<\/em>6390:<\/p>\n<p>[latex]\\begin{array}\\text{Add wheels and . . .}\\hfill&\\text{cond_new}\\hfill&\\text{stock_photo}\\hfill&\\text{duration}\\\\\\text{ }\\hfill&R^2_{adj}=0.7124\\hfill&R^2_{adj}=0.6587\\hfill&R^2_{adj}=0.6528\\end{array}[\/latex]<\/p>\n<p>The best predictor in this stage, cond_new, has a higher adjusted<em> R<sup>2<\/sup><\/em>\u00a0(0.7124) than\u00a0the baseline (0.6390), so we also add cond_new to the model.<\/p>\n<p>Since we have again added a variable to the model, we continue and see whether it\u00a0would be beneficial to add a third variable:<\/p>\n<p>[latex]\\begin{array}\\text{Add wheels, cond_new, and . . 
.}\\hfill&\\text{stock_photo}\\hfill&\\text{duration}\\\\\\text{ }\\hfill&R^2_{adj}=0.7128\\hfill&R^2_{adj}=0.7107\\end{array}[\/latex]<\/p>\n<p>The model adding stock_photo improved adjusted<em> R<sup>2<\/sup><\/em>\u00a0(0.7124 to 0.7128), so we add\u00a0stock_photo to the model.<\/p>\n<p>Because we have again added a predictor, we check whether adding the last variable,\u00a0duration, will improve adjusted<em> R<sup>2<\/sup><\/em>. We compare the adjusted<em> R<sup>2<\/sup><\/em>\u00a0for the model with duration and the other three predictors (0.7108) to the model that only considers wheels, cond_new, and stock_photo (0.7128). Adding duration does not improve the adjusted<em> R<sup>2<\/sup><\/em>, so we do not add it to the model, and we have arrived at the same\u00a0model that we identified from backward elimination.<\/p>\n<\/div>\n<div class=\"textbox\">\n<h3>Model Selection Strategies<\/h3>\n<p>Backward elimination begins with the largest model and eliminates variables one-by-one until we are satisfied that all remaining variables are important to the model. Forward selection starts with no variables included in the model, then it adds in variables according to their importance until no other important variables\u00a0are found.<\/p>\n<\/div>\n<p>There is no guarantee that backward elimination and forward selection will arrive at\u00a0the same final model. If both techniques are tried and they arrive at different models, we\u00a0choose the model with the larger<em> <em>R<\/em><sup>2<\/sup><sub><em>adj<\/em><\/sub><\/em>; other tie-break options exist but are beyond the\u00a0scope of this book.<\/p>\n<h2>The <em>p<\/em>-Value Approach, an Alternative to Adjusted <em>R<\/em><sup>2<\/sup><\/h2>\n<p>The <em>p<\/em>-value may be used as an alternative to adjusted <em>R<\/em><sup>2<\/sup>\u00a0for model selection.<\/p>\n<p>In backward elimination, we would identify the predictor corresponding to the largest\u00a0<em>p<\/em>-value. 
If the <em>p<\/em>-value is above the significance level, usually <em>\u03b1<\/em> = 0<em>.<\/em>05, then we would drop that variable, refit the model, and repeat the process. If the largest <em>p<\/em>-value is less than <em>\u03b1<\/em> = 0<em>.<\/em>05, then we would not eliminate any predictors and the current model would be our\u00a0best-fitting model.<\/p>\n<p>In forward selection with <em>p<\/em>-values, we reverse the process. We begin with a model\u00a0that has no predictors, then we fit a model for each possible predictor, identifying the model where the corresponding predictor&#8217;s <em>p<\/em>-value is smallest. If that <em>p<\/em>-value is smaller than\u00a0<em>\u03b1<\/em>\u00a0= 0<em>.<\/em>05, we add it to the model and repeat the process, considering whether to add more variables one-at-a-time. When no remaining predictor would have a <em>p<\/em>-value less than 0.05 upon entering the model, we stop adding variables, and the current\u00a0model is our best-fitting model.<\/p>\n<p>&nbsp;<\/p>\n<div class=\"textbox key-takeaways\">\n<h3>Try It<\/h3>\n<p>Examine Table 2, which considers the model including the cond_new, stock_photo, and wheels predictors. If we were using the <em>p<\/em>-value approach with backward elimination and we were considering this model, which of these three variables would be up for elimination? Would we drop that variable,\u00a0or would we keep it in the model?<\/p>\n<p>Solution:<\/p>\n<p>The stock_photo predictor is up for elimination since it has the largest <em>p<\/em>-value. Additionally, since that <em>p<\/em>-value is larger than 0.05, we would in fact eliminate stock_photo from the model.<\/p>\n<\/div>\n<p>While the adjusted<em> <em><em>R<\/em><sup>2<\/sup><\/em><\/em>\u00a0and <em>p<\/em>-value approaches are similar, they sometimes lead to\u00a0different models, with the adjusted<em> <em><em>R<\/em><sup>2<\/sup><\/em><\/em>\u00a0approach tending to include more predictors in the final model. 
For example, if we had used the <em>p<\/em>-value approach with the auction data, we\u00a0would not have included the stock_photo predictor in the final model.<\/p>\n<div class=\"textbox\">\n<h3>When to use the adjusted <em>R<\/em><sup>2<\/sup>\u00a0and when to use the <em>p<\/em>-value approach<\/h3>\n<p>When the sole goal is to improve prediction accuracy, use adjusted <em>R<\/em><sup>2<\/sup>. This is\u00a0commonly the case in machine learning applications.<\/p>\n<p>When we care about understanding which variables are statistically significant predictors of the response, or if there is interest in producing a simpler model at the potential cost of a little prediction accuracy, then the <em>p<\/em>-value approach is preferred.<\/p>\n<\/div>\n<p>Regardless of whether you use adjusted <em>R<\/em><sup>2<\/sup>\u00a0or the <em>p<\/em>-value approach, or whether you use\u00a0the backward elimination or forward selection strategy, your job is not done after variable\u00a0selection. You must still verify that the model conditions are reasonable.<\/p>\n\n\t\t\t <section class=\"citations-section\" role=\"contentinfo\">\n\t\t\t <h3>Candela Citations<\/h3>\n\t\t\t\t\t <div>\n\t\t\t\t\t\t <div id=\"citation-list-964\">\n\t\t\t\t\t\t\t <div class=\"licensing\"><div class=\"license-attribution-dropdown-subheading\">CC licensed content, Shared previously<\/div><ul class=\"citation-list\"><li>OpenIntro Statistics. <strong>Authored by<\/strong>: David M Diez, Christopher D Barr, and Mine Cetinkaya-Rundel. <strong>Provided by<\/strong>: OpenIntro. <strong>Located at<\/strong>: <a target=\"_blank\" href=\"https:\/\/www.openintro.org\/stat\/textbook.php\">https:\/\/www.openintro.org\/stat\/textbook.php<\/a>. <strong>License<\/strong>: <em><a target=\"_blank\" rel=\"license\" href=\"https:\/\/creativecommons.org\/licenses\/by-sa\/4.0\/\">CC BY-SA: Attribution-ShareAlike<\/a><\/em>. <strong>License Terms<\/strong>: This textbook is available under a Creative Commons license. 
Visit openintro.org for a free  PDF, to download the textbook&#039;s source files.<\/li><\/ul><\/div>\n\t\t\t\t\t\t <\/div>\n\t\t\t\t\t <\/div>\n\t\t\t <\/section>","protected":false},"author":21,"menu_order":17,"template":"","meta":{"_candela_citation":"[{\"type\":\"cc\",\"description\":\"OpenIntro Statistics\",\"author\":\"David M Diez, Christopher D Barr, and Mine Cetinkaya-Rundel\",\"organization\":\"OpenIntro\",\"url\":\"https:\/\/www.openintro.org\/stat\/textbook.php\",\"project\":\"\",\"license\":\"cc-by-sa\",\"license_terms\":\"This textbook is available under a Creative Commons license. Visit openintro.org for a free  PDF, to download the textbook\\'s source files.\"}]","CANDELA_OUTCOMES_GUID":"","pb_show_title":null,"pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-964","chapter","type-chapter","status-web-only","hentry"],"part":1622,"_links":{"self":[{"href":"https:\/\/courses.lumenlearning.com\/suny-suffolk-introstats1\/wp-json\/pressbooks\/v2\/chapters\/964","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/courses.lumenlearning.com\/suny-suffolk-introstats1\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/courses.lumenlearning.com\/suny-suffolk-introstats1\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/suny-suffolk-introstats1\/wp-json\/wp\/v2\/users\/21"}],"version-history":[{"count":3,"href":"https:\/\/courses.lumenlearning.com\/suny-suffolk-introstats1\/wp-json\/pressbooks\/v2\/chapters\/964\/revisions"}],"predecessor-version":[{"id":1526,"href":"https:\/\/courses.lumenlearning.com\/suny-suffolk-introstats1\/wp-json\/pressbooks\/v2\/chapters\/964\/revisions\/1526"}],"part":[{"href":"https:\/\/courses.lumenlearning.com\/suny-suffolk-introstats1\/wp-json\/pressbooks\/v2\/parts\/1622"}],"metadata":[{"href":"https:\/\/courses.lumenlearning.com\/suny-suffolk-intros
tats1\/wp-json\/pressbooks\/v2\/chapters\/964\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/courses.lumenlearning.com\/suny-suffolk-introstats1\/wp-json\/wp\/v2\/media?parent=964"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/suny-suffolk-introstats1\/wp-json\/pressbooks\/v2\/chapter-type?post=964"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/suny-suffolk-introstats1\/wp-json\/wp\/v2\/contributor?post=964"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/suny-suffolk-introstats1\/wp-json\/wp\/v2\/license?post=964"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
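The chapter carries out backward elimination by hand, refitting the model after each candidate drop and comparing adjusted R² values. That procedure is mechanical enough to sketch in code. The following is a minimal illustration, not part of the original text: it uses plain Python with ordinary least squares solved via the normal equations, and the predictor names are synthetic stand-ins echoing the chapter's variables, not the actual Mario Kart auction data.

```python
# Hedged sketch of backward elimination by adjusted R^2 (pure Python, no
# third-party libraries). Data used with it are assumed/synthetic.

def ols_coefficients(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gaussian elimination."""
    n, k = len(X), len(X[0])
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
         for p in range(k)]
    v = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    for col in range(k):                       # forward elimination, partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    b = [0.0] * k
    for r in range(k - 1, -1, -1):             # back substitution
        b[r] = (v[r] - sum(A[r][c] * b[c] for c in range(r + 1, k))) / A[r][r]
    return b

def adjusted_r2(columns, y):
    """R^2_adj = 1 - (1 - R^2)(n - 1)/(n - k - 1), with k predictors."""
    n, k = len(y), len(columns)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]  # intercept first
    b = ols_coefficients(X, y)
    resid = [yi - sum(bj * xj for bj, xj in zip(b, row)) for yi, row in zip(y, X)]
    ybar = sum(y) / n
    ss_res = sum(e * e for e in resid)
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - (ss_res / ss_tot) * (n - 1) / (n - k - 1)

def backward_eliminate(predictors, y):
    """Drop one predictor at a time while some drop raises adjusted R^2.

    `predictors` maps a name to its column of values. Stops at one predictor
    at minimum; returns (kept predictor names, final adjusted R^2).
    """
    kept = dict(predictors)
    best = adjusted_r2(list(kept.values()), y)
    while len(kept) > 1:
        trial = {name: adjusted_r2([c for nm, c in kept.items() if nm != name], y)
                 for name in kept}
        drop = max(trial, key=lambda nm: trial[nm])  # best candidate to remove
        if trial[drop] <= best:
            break                                    # no drop improves the fit
        best = trial[drop]
        del kept[drop]
    return sorted(kept), best
```

Forward selection is the mirror image: start from the empty model (adjusted R² of 0) and add the candidate whose inclusion most increases adjusted R², stopping when no addition improves it. In practice one would use a regression library (for example, R's `lm` or Python's `statsmodels`, whose fitted results expose an adjusted R² and per-coefficient p-values) rather than hand-rolled least squares.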