{"id":973,"date":"2016-04-21T22:43:35","date_gmt":"2016-04-21T22:43:35","guid":{"rendered":"https:\/\/courses.lumenlearning.com\/introstats1xmaster\/?post_type=chapter&#038;p=973"},"modified":"2017-07-20T20:23:49","modified_gmt":"2017-07-20T20:23:49","slug":"introduction-to-logistic-regression","status":"publish","type":"chapter","link":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/chapter\/introduction-to-logistic-regression\/","title":{"raw":"Introduction to Logistic Regression","rendered":"Introduction to Logistic Regression"},"content":{"raw":"In this section we introduce<strong> logistic regression<\/strong> as a tool for building models when there is\u00a0a categorical response variable with two levels. Logistic regression is a type of<strong> generalized linear model<\/strong> (GLM) for response variables where regular multiple regression does not work very well. In particular, the response variable in these settings often takes a form\u00a0where residuals look completely different from the normal distribution.\r\n\r\nGLMs can be thought of as a two-stage modeling approach. We first model the\u00a0response variable using a probability distribution, such as the binomial or Poisson distribution. Second, we model the parameter of the distribution using a collection of predictors\u00a0and a special form of multiple regression.\r\n\r\nIn this page\u00a0we will use data about emails. These emails were\u00a0collected from a single email account, and we will work on developing a basic spam filter using these data. The response variable, spam, has been encoded to take value 0 when a message is not spam and 1 when it is spam. Our task will be to build an appropriate model that classifies messages as spam or not spam using email characteristics coded as predictor variables. 
While this model will not be the same as those used in large-scale spam filters,\u00a0it shares many of the same features.\r\n<h2>Email Data<\/h2>\r\nThere are several\u00a0variables available that might be useful for classifying spam. Descriptions of these variables are presented in Table 1. The spam variable will be the outcome, and the other 10 variables will be the model predictors. While we have limited the predictors used in this section to be categorical variables (where many are represented as indicator variables), numerical predictors may also be used in logistic\u00a0regression.[footnote]Recall that if outliers are present in predictor variables, the corresponding observations may be especially influential on the resulting model. This is the motivation for omitting the numerical variables, such as the number of characters and line breaks in emails. These variables exhibit extreme skew. We could resolve this issue by transforming these variables (e.g. using a log-transformation), but we will omit this further investigation for brevity.[\/footnote]\r\n<table>\r\n<thead>\r\n<tr>\r\n<th colspan=\"2\">Table 1. Descriptions for 11 variables in the email data set. 
Notice that all of the variables are indicator variables, which take the value 1 if the specified characteristic is present and 0 otherwise.<\/th>\r\n<\/tr>\r\n<\/thead>\r\n<tbody>\r\n<tr>\r\n<th style=\"width: 15%;\">Variable<\/th>\r\n<th style=\"width: 85%;\">Description<\/th>\r\n<\/tr>\r\n<tr>\r\n<td style=\"width: 15%;\">spam<\/td>\r\n<td style=\"width: 85%;\">Specifies whether the message was spam<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>to_multiple<\/td>\r\n<td>An indicator variable for if more than one person was listed in the To field of the email.<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>cc<\/td>\r\n<td>An indicator for if someone was CCed on the email<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>attach<\/td>\r\n<td>An indicator for if there was an attachment, such as a document or image<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>dollar<\/td>\r\n<td>An indicator for if the word \u201cdollar\u201d or dollar symbol ($) appeared in the email.<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>winner<\/td>\r\n<td>An indicator for if the word \u201cwinner\u201d appeared in the email message<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>inherit<\/td>\r\n<td>An indicator for if the word \u201cinherit\u201d (or a variation, like \u201cinheritance\u201d) appeared in the email.<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>password<\/td>\r\n<td>An indicator for if the word \u201cpassword\u201d was present in the email.<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>format<\/td>\r\n<td>Indicates if the email contained special formatting, such as bolding, tables, or links<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>re_subj<\/td>\r\n<td>Indicates whether \u201cRe:\u201d was included at the start of the email subject.<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>exclaim_subj<\/td>\r\n<td>Indicates whether any exclamation point was included in the email subject<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n<h2>Modeling the Probability of an Event<\/h2>\r\n<div class=\"textbox\">\r\n<h3>TIP: Notation for a Logistic Regression Model<\/h3>\r\nThe outcome variable for a GLM is denoted by 
<em>Y<sub>i<\/sub><\/em>, where the index<em> i<\/em> is used to\u00a0represent observation<em> i<\/em>. In the email application, <em>Y<sub>i<\/sub><\/em> will be used to represent\u00a0whether email<em> i<\/em> is spam (<em>Y<sub>i<\/sub><\/em> = 1) or not (<em>Y<sub>i<\/sub><\/em>\u00a0= 0).\r\n\r\nThe predictor variables are represented as follows: <em>x<\/em><sub>1, <em>i<\/em><\/sub> is the value of variable 1 for\u00a0observation<em> i<\/em>, <em>x<\/em><sub>2, <em>i<\/em><\/sub> is the value of variable 2 for observation <em>i<\/em>, and so on.\r\n\r\n<\/div>\r\nLogistic regression is a generalized linear model where the outcome is a two-level\u00a0categorical variable. The outcome, <em>Y<sub>i<\/sub><\/em>, takes the value 1 (in our application, this represents a spam message) with probability <em>p<sub>i<\/sub><\/em>\u00a0and the value 0 with probability 1 \u2212 <em>p<sub>i<\/sub><\/em>. It is the\u00a0probability <em>p<sub>i<\/sub><\/em>\u00a0that we model in relation to the predictor variables.\r\n\r\nThe logistic regression model relates the probability an email is spam (<em>p<sub>i<\/sub><\/em>) to the\u00a0predictors <em>x<\/em><sub>1, <em>i<\/em><\/sub>, <em>x<\/em><sub>2, <em>i<\/em><\/sub>, ..., <em>x<\/em><sub>k, <em>i<\/em><\/sub>\u00a0through a framework much like that of multiple regression:\r\n<p style=\"text-align: center;\">[latex]\\text{transformation}\\left(p_i\\right)=\\beta_0+\\beta_1x_{1,i}+\\beta_2x_{2,i}+\\dots+\\beta_k{x}_{k,i}[\/latex]<\/p>\r\nWe want to choose a transformation in the equation above\u00a0that makes practical and mathematical sense. 
For example, we want a transformation that makes the range of possibilities on the left hand side of the equation above\u00a0equal to the range of possibilities for the right hand side; if there were no transformation for this equation, the left hand side could only take values between 0 and 1, but the right hand side could take values outside of this range. A\u00a0common transformation for <em>p<sub>i<\/sub><\/em>\u00a0is the<strong> logit transformation<\/strong>, which may be written as\r\n<p style=\"text-align: center;\">[latex]\\displaystyle\\text{logit}\\left(p_i\\right)=\\log_{e}\\left(\\frac{p_i}{1-p_i}\\right)[\/latex]<\/p>\r\nThe logit transformation is shown in Figure 1. Below, we rewrite the transformation equation\u00a0using\u00a0the logit transformation of <em>p<sub>i<\/sub><\/em>:\r\n<p style=\"text-align: center;\">[latex]\\displaystyle\\log_e\\left(\\frac{p_i}{1-p_i}\\right)=\\beta_0+\\beta_1x_{1,i}+\\beta_2x_{2,i}+\\dots+\\beta_k{x}_{k,i}[\/latex]<\/p>\r\n\r\n[caption id=\"attachment_1468\" align=\"aligncenter\" width=\"768\"]<img class=\"wp-image-1468 size-full\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215327\/Figure8_5.png\" alt=\"A graph of values of pi against values of logit(pi). Points include (-5.0, 0.007), (-1.0, 0.27), (0.0, 0.50), (2.0, 0.88), and (4.0, 0.982).\" width=\"768\" height=\"435\" \/> Figure 1. Values of <i>p<sub>i<\/sub><\/i> against values of logit(<em>p<sub>i<\/sub><\/em>).[\/caption]\r\n\r\nIn our spam example, there are 10 predictor variables, so<em> k<\/em> = 10. This model isn't very\u00a0intuitive, but it still has some resemblance to multiple regression, and we can fit this model using software. 
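To see how the logit stretches probabilities onto the whole real line, matching the range of the right hand side of the model equation, here is a minimal Python sketch (the helper names logit and inv_logit are ours, not from the text):

```python
import math

def logit(p):
    """Logit transformation: maps a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse logit: maps any real number back to a probability in (0, 1)."""
    return math.exp(x) / (1 + math.exp(x))

# logit(0.5) is exactly 0, and inv_logit undoes the transformation.
# The values agree with the points plotted in Figure 1, e.g. logit(p) = -1.0
# corresponds to p of about 0.27.
print(logit(0.5))                  # 0.0
print(round(inv_logit(-1.0), 2))   # 0.27
```

Applying inv_logit to both sides of the logit model equation gives the conversion formula between the regression scale and probabilities that appears later in this section.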
In fact, once we look at results from software, it will start to feel like we're\u00a0back in multiple regression, even if the interpretation of the coefficients is more complex.\r\n\r\n&nbsp;\r\n<div class=\"textbox exercises\">\r\n<h3>Example<\/h3>\r\nHere we create a spam filter with a single predictor: to_multiple.\r\n\r\nThis variable indicates whether more than one email address was listed in the<em> To<\/em> field of the email. The following logistic regression model was fit using statistical software:\r\n\r\n[latex]\\displaystyle\\log\\left(\\frac{p_i}{1-p_i}\\right)=-2.12-1.81\\times\\text{to_multiple}[\/latex]\r\n\r\nIf an email is randomly selected and it has just one address in the<em> To<\/em> field, what is\u00a0the probability it is spam? What if more than one address is listed in the<em> To<\/em> field?\r\n\r\nSolution:\r\n\r\nIf there is only one email in the<em> To<\/em> field, then to_multiple takes value 0 and the\u00a0right side of the model equation equals \u22122.12. Solving for\u00a0<em>p<sub>i<\/sub><\/em>:\r\n\r\n[latex]\\displaystyle\\frac{e^{-2.12}}{1+e^{-2.12}}=0.11[\/latex]\r\n\r\nJust as\u00a0we labeled a fitted value of<em>\u00a0y<sub>i<\/sub><\/em>\u00a0with a \"hat\" in single-variable and multiple regression,\u00a0we will do the same for this probability: [latex]\\hat{p}_i=0.11[\/latex].\r\n\r\nIf there is more than one address listed in the<em> To<\/em> field, then the right side of the model\u00a0equation is \u22122<em>.<\/em>12 \u2212\u00a01<em>.<\/em>81 \u00d7\u00a01 = \u22123<em>.<\/em>93, which corresponds to a probability [latex]\\hat{p}_i=0.02[\/latex].\r\n\r\nNotice that we could examine \u22122.12 and \u22123.93 in Figure 1 to estimate the probability\u00a0before formally calculating the value.\r\n\r\n<\/div>\r\nTo convert from values on the regression-scale (e.g. 
\u22122.12 and \u22123.93 in Example 1),\u00a0use the following formula, which is the result of solving for <em>p<sub>i<\/sub><\/em> in the regression model:\r\n<p style=\"text-align: center;\">[latex]\\displaystyle{p}_i=\\frac{e^{\\beta_0+\\beta_1x_{1,i}+\\beta_2x_{2,i}+\\dots+\\beta_k{x}_{k,i}}}{1 + e^{\\beta_0+\\beta_1x_{1,i}+\\beta_2x_{2,i}+\\dots+\\beta_k{x}_{k,i}}}[\/latex]<\/p>\r\nAs with most applied data problems, we substitute the point estimates for the parameters\u00a0(the\u00a0<em>\u03b2<sub>i<\/sub><\/em>) so that we may make use of this formula. In Example 1, the probabilities were\u00a0calculated as\r\n<p style=\"text-align: center;\">[latex]\\displaystyle\\frac{e^{-2.12}}{1+e^{-2.12}}=0.11\\qquad\\frac{e^{-2.12-1.81}}{1+e^{-2.12-1.81}}=0.02[\/latex]<\/p>\r\nWhile the information about whether the email is addressed to multiple people is a helpful start in classifying email as spam or not, the probabilities of 11% and 2% are not dramatically different, and neither provides very strong evidence about which particular email messages are spam. To get more precise estimates, we'll need to include many more\u00a0variables in the model.\r\n\r\nWe used statistical software to fit the logistic regression model with all ten predictors\u00a0described in Table 1. Like multiple regression, the result may be presented in a summary table, which is shown in Table 2. The structure of this table is almost identical to that of multiple regression; the only notable difference is that the <em>p<\/em>-values are calculated using\u00a0the normal distribution rather than the<em> t<\/em>-distribution.\r\n\r\n&nbsp;\r\n<table>\r\n<thead>\r\n<tr>\r\n<th colspan=\"5\">Table 2. Summary table for the full logistic regression model for the\u00a0spam filter example<\/th>\r\n<\/tr>\r\n<\/thead>\r\n<tbody>\r\n<tr>\r\n<td><\/td>\r\n<td>Estimate<\/td>\r\n<td>Std. 
Error<\/td>\r\n<td>z value<\/td>\r\n<td>Pr (<em> &gt;<\/em> |z|)<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>(Intercept)<\/td>\r\n<td>\u20130.8362<\/td>\r\n<td>0.0962<\/td>\r\n<td>\u20138.69<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>to_multiple<\/td>\r\n<td>\u20132.8836<\/td>\r\n<td>0.3121<\/td>\r\n<td>\u20139.24<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>winner<\/td>\r\n<td>1.7038<\/td>\r\n<td>0.3254<\/td>\r\n<td>5.24<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>format<\/td>\r\n<td>\u20131.5902<\/td>\r\n<td>0.1239<\/td>\r\n<td>\u201312.84<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>re_subj<\/td>\r\n<td>\u20132.9082<\/td>\r\n<td>0.3708<\/td>\r\n<td>\u20137.84<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>exclaim_subj<\/td>\r\n<td>0.1355<\/td>\r\n<td>0.2268<\/td>\r\n<td>0.60<\/td>\r\n<td>0.5503<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>cc<\/td>\r\n<td>\u20130.4863<\/td>\r\n<td>0.3054<\/td>\r\n<td>\u20131.59<\/td>\r\n<td>0.1113<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>attach<\/td>\r\n<td>0.9790<\/td>\r\n<td>0.2170<\/td>\r\n<td>4.51<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>dollar<\/td>\r\n<td>\u20130.0582<\/td>\r\n<td>0.1589<\/td>\r\n<td>\u20130.37<\/td>\r\n<td>0.7144<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>inherit<\/td>\r\n<td>0.2093<\/td>\r\n<td>0.3197<\/td>\r\n<td>0.65<\/td>\r\n<td>0.5127<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>password<\/td>\r\n<td>\u20131.4929<\/td>\r\n<td>0.5295<\/td>\r\n<td>\u20132.82<\/td>\r\n<td>0.0048<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\nJust like multiple regression, we could trim some variables from the model using the\u00a0<em>p<\/em>-value. Using backward elimination with a <em>p<\/em>-value cutoff of 0.05 (start with the full model and trim the predictors with <em>p<\/em>-values greater than 0.05), we ultimately eliminate the exclaim_subj, dollar, inherit, and cc predictors. The remainder of this section will\u00a0rely on this smaller model, which is summarized in Table 3.\r\n<table>\r\n<thead>\r\n<tr>\r\n<th colspan=\"5\">Table 3. 
Summary table for the logistic regression model for the spam\u00a0filter, where variable selection has been performed<\/th>\r\n<\/tr>\r\n<\/thead>\r\n<tbody>\r\n<tr>\r\n<td><\/td>\r\n<td>Estimate<\/td>\r\n<td>Std. Error<\/td>\r\n<td>z value<\/td>\r\n<td>Pr (<em> &gt;<\/em> |z|)<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>(Intercept)<\/td>\r\n<td>\u20130.8595<\/td>\r\n<td>0.0910<\/td>\r\n<td>\u20139.44<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>to_multiple<\/td>\r\n<td>\u20132.8372<\/td>\r\n<td>0.3092<\/td>\r\n<td>\u20139.18<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>winner<\/td>\r\n<td>1.7370<\/td>\r\n<td>0.3218<\/td>\r\n<td>5.40<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>format<\/td>\r\n<td>\u20131.5569<\/td>\r\n<td>0.1207<\/td>\r\n<td>\u201312.90<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>re_subj<\/td>\r\n<td>\u20133.0482<\/td>\r\n<td>0.3630<\/td>\r\n<td>\u20138.40<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>attach<\/td>\r\n<td>0.8643<\/td>\r\n<td>0.2042<\/td>\r\n<td>4.23<\/td>\r\n<td>0.0000<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>password<\/td>\r\n<td>\u20131.4871<\/td>\r\n<td>0.5290<\/td>\r\n<td>\u20132.81<\/td>\r\n<td>0.0049<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n&nbsp;\r\n<div class=\"textbox key-takeaways\">\r\n<h3>Try It<\/h3>\r\nExamine the summary of the reduced model in Table 3,\u00a0and in particular, examine the to_multiple row. Is the point estimate the same as\u00a0we found before, \u22121.81, or is it different? Explain why this might be.\r\n\r\nSolution:\r\n\r\nThe new estimate is different: \u22122.84. This new value represents the estimated coefficient when we are also accounting for other variables in the logistic regression model.\r\n\r\n<\/div>\r\n&nbsp;\r\n\r\nPoint estimates will generally change a little\u2014and sometimes a lot\u2014depending on which other variables are included in the model. This is usually due to collinearity in the predictor variables. 
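Using the point estimates in Table 3, the fitted probability for a particular email can be computed directly. A minimal Python sketch (the dictionary layout and function name are ours; the coefficients are copied from Table 3):

```python
import math

# Point estimates from Table 3 (the reduced model).
coefs = {
    "to_multiple": -2.8372,
    "winner":       1.7370,
    "format":      -1.5569,
    "re_subj":     -3.0482,
    "attach":       0.8643,
    "password":    -1.4871,
}
intercept = -0.8595

def spam_probability(email):
    """Fitted probability p-hat = e^eta / (1 + e^eta), where eta is the
    linear predictor built from the intercept and the indicator variables."""
    eta = intercept + sum(coef * email.get(name, 0) for name, coef in coefs.items())
    return math.exp(eta) / (1 + math.exp(eta))

# A hypothetical email containing the word "winner", in HTML format,
# with all other indicator variables equal to 0:
print(round(spam_probability({"winner": 1, "format": 1}), 3))
```

Note how the positive winner coefficient pushes the fitted probability up while the negative format coefficient pushes it down; this tug-of-war between predictors is a key feature of the model.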
We previously saw this in the eBay auction example when we compared the coefficient of cond_new in a single-variable model and the corresponding coefficient in the multiple regression model that used three additional variables.\r\n\r\n&nbsp;\r\n<div class=\"textbox exercises\">\r\n<h3>Example<\/h3>\r\nSpam filters are built to be automated, meaning a piece of software\u00a0is written to collect information about emails as they arrive, and this information is put in the form of variables. These variables are then put into an algorithm that uses a statistical model, like the one we've fit, to classify the email. Suppose we write software for a spam filter using the reduced model shown in Table 3. If an incoming email has the word \"winner\" in it, will this raise or lower the model's\u00a0calculated probability that the incoming email is spam?\r\n\r\nSolution:\r\n\r\nThe estimated coefficient of winner is positive (1.7370). A positive coefficient estimate in logistic regression, just like in multiple regression, corresponds to a positive association between the predictor and response variables when accounting for the other variables in the model. Since the response variable takes value 1 if an email is spam and 0 otherwise, the positive coefficient indicates that the presence of \"winner\"\u00a0in an email raises the model probability that the message is spam.\r\n\r\n<\/div>\r\n<div class=\"textbox exercises\">\r\n<h3>Example<\/h3>\r\nSuppose the same email from Example 2\u00a0was in HTML format,\u00a0meaning the format variable took value 1. 
Does this characteristic increase or decrease the probability that the email is spam according to the model?\r\n\r\nSolution:\r\n\r\nSince HTML corresponds to a value of 1 in the format variable and the coefficient of\u00a0this variable is negative (\u22121.5569), this would lower the probability estimate returned\u00a0from the model.\r\n\r\n<\/div>\r\n<h2>Practical Decisions in the Email Application<\/h2>\r\nExamples 2\u00a0and 3\u00a0highlight a key feature of logistic and multiple regression. In the\u00a0spam filter example, some email characteristics will push an email's classification in the\u00a0direction of spam while other characteristics will push it in the opposite direction.\r\n\r\nIf we were to implement a spam filter using the model we have fit, then each future\u00a0email we analyze would fall into one of three categories based on the email's characteristics:\r\n<ol>\r\n \t<li>The email characteristics generally indicate the email is not spam, and so the resulting\u00a0probability that the email is spam is quite low, say, under 0.05.<\/li>\r\n \t<li>The characteristics generally indicate the email is spam, and so the resulting probability that the email is spam is quite large, say, over 0.95.<\/li>\r\n \t<li>The characteristics roughly balance each other out in terms of evidence for and against\u00a0the message being classified as spam. Its probability falls in the remaining range,\u00a0meaning the email cannot be adequately classified as spam or not spam.<\/li>\r\n<\/ol>\r\nIf we were managing an email service, we would have to think about what should be\u00a0done in each of these three instances. In an email application, there are usually just two possibilities: filter the email out from the regular inbox and put it in a \"spambox,\" or let\u00a0the email go to the regular inbox.\r\n\r\n&nbsp;\r\n<div class=\"textbox key-takeaways\">\r\n<h3>Try It<\/h3>\r\nThe first and second scenarios are intuitive. 
If the evidence strongly suggests a message is not spam, send it to the inbox. If the evidence strongly\u00a0suggests the message is spam, send it to the spambox. How should we handle emails\u00a0in the third category?\r\n\r\nSolution:\r\n\r\nIn this particular application, we should err on the side of sending more mail to the inbox rather than mistakenly putting good messages in the spambox. So, in summary: emails in the first and last categories go to the regular inbox, and those in the second scenario go to the spambox.\r\n\r\n<\/div>\r\n<div class=\"textbox key-takeaways\">\r\n<h3>Try It<\/h3>\r\nSuppose we apply the logistic model we have built as a\u00a0spam filter and that 100 messages are placed in the spambox over 3 months. If we\u00a0used the guidelines above for putting messages into the spambox, about how many\u00a0legitimate (non-spam) messages would you expect to find among the 100 messages?\r\n\r\nSolution:\r\n\r\nFirst, note that we proposed a cutoff for the predicted probability of 0.95 for spam. In a worst case scenario, all the messages in the spambox had the minimum probability equal to about 0.95. Thus, we should expect to find about 5 or fewer legitimate messages among the 100 messages placed in the spambox.\r\n\r\n<\/div>\r\nAlmost any classifier will have some error. In the spam filter guidelines above, we\u00a0have decided that it is okay to allow up to 5% of the messages in the spambox to be real messages. If we wanted to make it a little harder to classify messages as spam, we could use a cutoff of 0.99. This would have two effects. Because it raises the standard for what can be classified as spam, it reduces the number of good emails that are classified as spam. However, it will also fail to correctly classify an increased fraction of spam messages. No matter the complexity and the confidence we might have in our model, these practical considerations are absolutely crucial to making a helpful spam filter. 
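The three-category guideline can be sketched as a simple routing rule (a sketch assuming the 0.05 and 0.95 cutoffs discussed above; the function names are ours):

```python
def classify(p_hat, low=0.05, high=0.95):
    """Place a message in one of the three categories described above."""
    if p_hat < low:
        return "likely not spam"
    if p_hat > high:
        return "likely spam"
    return "uncertain"

def route(p_hat):
    """Err on the side of the inbox: only 'likely spam' goes to the spambox."""
    return "spambox" if classify(p_hat) == "likely spam" else "inbox"

print(route(0.02))  # inbox
print(route(0.50))  # inbox -- uncertain messages stay in the regular inbox
print(route(0.97))  # spambox
```

Raising the high cutoff from 0.95 to 0.99 implements the stricter filter described above: fewer good emails land in the spambox, at the cost of letting more spam through.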
Without them, we\u00a0could actually do more harm than good by using our statistical model.\r\n<h2>Diagnostics for the Email Classifier<\/h2>\r\n<div class=\"textbox\">\r\n<h3>Logistic Regression Conditions<\/h3>\r\nThere are two key conditions for fitting a logistic regression model:\r\n<ol>\r\n \t<li>Each predictor <em>x<sub>i<\/sub><\/em> is linearly related to logit(<em>p<sub>i<\/sub><\/em>) if all other predictors are\u00a0held constant.<\/li>\r\n \t<li>Each outcome <em>Y<sub>i<\/sub><\/em> is independent of the other outcomes.<\/li>\r\n<\/ol>\r\n<\/div>\r\nThe first condition of the logistic regression model is not easily checked without a\u00a0fairly sizable amount of data. Luckily, we have 3,921 emails in our data set! Let's first visualize these data by plotting the true classification of the emails against the model's fitted probabilities, as shown in Figure 2. The vast majority of emails (spam or not)\u00a0still have fitted probabilities below 0.5.\r\n\r\n[caption id=\"attachment_1470\" align=\"aligncenter\" width=\"779\"]<img class=\"wp-image-1470 size-full\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215329\/Figure8_6.png\" alt=\"Scatter plot of the data from the emails.\" width=\"779\" height=\"272\" \/> Figure 2: The predicted probability that each of the 3,921 emails is spam, grouped by the true classification (spam or not). Noise (small, random vertical shifts) has been added to each point so that points with nearly identical values aren't plotted exactly on top of one another. This makes it possible\u00a0to see more observations.[\/caption]\r\n\r\nThis may at first seem very discouraging: we have fit a logistic model to create a spam\u00a0filter, but no emails have a fitted probability of being spam above 0.75. Don't despair; we will discuss ways to improve the model through the use of better variables\u00a0later.\r\n\r\nWe'd like to assess the quality of our model. 
For example, we might ask: if we look\u00a0at emails that we modeled as having a 10% chance of being spam, do we find about 10% of them actually are spam? To help us out, we'll borrow an advanced statistical method called<strong> natural splines<\/strong> that estimates the local probability over the region 0.00 to 0.75 (the largest predicted probability was 0.73, so we avoid extrapolating). All you need to know about natural splines to understand what we are doing is that they are used to fit\u00a0flexible lines rather than straight lines.\r\n\r\nThe curve fit using natural splines is shown in Figure 3\u00a0as a solid black line. If\u00a0the logistic model fits well, the curve should closely follow the dashed<em> y<\/em> =<em> x<\/em> line. We have added shading to represent the confidence bound for the curved line to clarify what fluctuations might plausibly be due to chance. Even with this confidence bound, there are weaknesses in the first model assumption. The solid curve and its confidence bound dip below the dashed line from about 0.1 to 0.3, and then drift above the dashed line from about 0.35 to 0.55. These deviations indicate the model relating the parameter to the\u00a0predictors does not closely resemble the true relationship.\r\n\r\nWe could evaluate the second logistic regression model assumption\u2014independence of\u00a0the outcomes\u2014using the model residuals. The residuals for a logistic regression model are calculated the same way as with multiple regression: the observed outcome minus the expected outcome. For logistic regression, the expected value of the outcome is the fitted\u00a0probability for the observation, and the residual may be written as\r\n<p style=\"text-align: center;\">[latex]\\displaystyle{e}_i=Y_i-\\hat{p}_i[\/latex]<\/p>\r\nWe could plot these residuals against a variety of variables or in their order of collection,\u00a0as we did with the residuals in multiple regression. 
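The residual formula translates directly into code; a minimal sketch (the outcome and probability values below are hypothetical, chosen only to illustrate the computation):

```python
def residuals(outcomes, fitted_probs):
    """Logistic regression residuals: observed outcome (0 or 1) minus fitted probability."""
    return [y - p for y, p in zip(outcomes, fitted_probs)]

# Three hypothetical emails -- spam, not spam, spam -- with illustrative fitted probabilities:
e = residuals([1, 0, 1], [0.73, 0.11, 0.02])
print([round(r, 2) for r in e])  # [0.27, -0.11, 0.98]
```

A residual near 1, like the last one, flags a spam message the model assigned a very low probability, which is exactly the kind of observation to examine when revising the model.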
However, since the model will need to be revised to effectively classify spam and you have already seen similar residual plots, we won't investigate the residuals here.\r\n\r\n[caption id=\"attachment_1471\" align=\"aligncenter\" width=\"774\"]<img class=\"size-full wp-image-1471\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215331\/Figure8_7.png\" alt=\"Figure 3: The solid black line provides the empirical estimate of the probability for observations based on their predicted probabilities (confidence bounds are also shown for this line), which is fit using natural splines. A small amount of noise was added to the observations in the plot to allow\u00a0more observations to be seen.\" width=\"774\" height=\"433\" \/> Figure 3: The solid black line provides the empirical estimate of the probability for observations based on their predicted probabilities (confidence bounds are also shown for this line), which is fit using natural splines. A small amount of noise was added to the observations in the plot to allow\u00a0more observations to be seen.[\/caption]\r\n<h2>Improving the Set of Variables for a Spam Filter<\/h2>\r\nIf we were building a spam filter for an email service that managed many accounts (e.g.\u00a0Gmail or Hotmail), we would spend much more time thinking about additional variables that could be useful in classifying emails as spam or not. We also would use transformations or other techniques that would help us include strongly skewed numerical variables as\u00a0predictors.\r\n\r\nTake a few minutes to think about additional variables that might be useful in identifying spam. Below is a list of variables we think might be useful:\r\n<ol>\r\n \t<li>An indicator variable could be used to represent whether there was prior two-way\u00a0correspondence with a message's sender. 
For instance, if you sent a message to john@example.com and then John sent you an email, this variable would take value 1 for the email that John sent. If you had never sent John an email, then the variable\u00a0would be set to 0.<\/li>\r\n \t<li>A second indicator variable could utilize an account's past spam flagging information.\u00a0The variable could take value 1 if the sender of the message has previously sent\u00a0messages flagged as spam.<\/li>\r\n \t<li>A third indicator variable could flag emails that contain links included in previous\u00a0spam messages. If such a link is found, then set the variable to 1 for the email.\u00a0Otherwise, set it to 0.<\/li>\r\n<\/ol>\r\nThe variables described above take one of two approaches. Variable (1) is specially designed\u00a0to capitalize on the fact that spam is rarely sent between individuals that have two-way\u00a0communication. Variables (2) and (3) are specially designed to flag common spammers or\u00a0spam messages. While we would have to verify using the data that each of the variables is\u00a0effective, these seem like promising ideas.\r\n\r\nTable 4\u00a0shows a contingency table for spam and also for the new variable described\u00a0in (1) above. If we look at the 1,090 emails where there was correspondence with the sender in the preceding 30 days, not one of these messages was spam. This suggests variable (1) would be very effective at accurately classifying some messages as not spam. 
With this single variable, we would be able to send about 28% of messages through to the inbox with\u00a0confidence that almost none are spam.\r\n\r\n&nbsp;\r\n<table>\r\n<thead>\r\n<tr>\r\n<th colspan=\"4\">Table 4.\u00a0A contingency table for spam and a new variable that represents\u00a0whether there had been correspondence with the sender in the preceding\u00a030 days<\/th>\r\n<\/tr>\r\n<\/thead>\r\n<tbody>\r\n<tr>\r\n<td><\/td>\r\n<td colspan=\"2\">prior correspondence<\/td>\r\n<\/tr>\r\n<tr>\r\n<td><\/td>\r\n<td>no<\/td>\r\n<td>yes<\/td>\r\n<td>Total<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>spam<\/td>\r\n<td>367<\/td>\r\n<td>0<\/td>\r\n<td>367<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>not spam<\/td>\r\n<td>2464<\/td>\r\n<td>1090<\/td>\r\n<td>3554<\/td>\r\n<\/tr>\r\n<tr>\r\n<td>Total<\/td>\r\n<td>2831<\/td>\r\n<td>1090<\/td>\r\n<td>3921<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\nThe variables described in (2) and (3) would provide an excellent foundation for distinguishing messages coming from known spammers or messages that take a known form of spam. To utilize these variables, we would need to build databases: one holding email addresses of known spammers, and one holding URLs found in known spam messages. Our access to such information is limited, so we cannot implement these two variables in this textbook. However, if we were hired by an email service to build a spam filter, these would\u00a0be important next steps.\r\n\r\nIn addition to finding more and better predictors, we would need to create a customized\u00a0logistic regression model for each email account. This may sound like an intimidating task, but its complexity is not as daunting as it may at first seem. We'll save the details for a\u00a0statistics course where computer programming plays a more central role.\r\n\r\nFor what is the extremely challenging task of classifying spam messages, we have made\u00a0a lot of progress. 
We have seen that simple email variables, such as the format, inclusion of certain words, and other circumstantial characteristics, provide helpful information for spam classification. Many challenges remain, from better understanding logistic regression to carrying out the necessary computer programming, but completing such a task is very\u00a0nearly within your reach."}
In the email application, <em>Y<sub>i<\/sub><\/em> will be used to represent\u00a0whether email<em> i<\/em> is spam (<em>Y<sub>i<\/sub><\/em> = 1) or not (<em>Y<sub>i<\/sub><\/em>\u00a0= 0).<\/p>\n<p>The predictor variables are represented as follows: <em>x<\/em><sub>1, <em>i<\/em><\/sub> is the value of variable 1 for\u00a0observation<em> i<\/em>, <em>x<\/em><sub>2, <em>i<\/em><\/sub> is the value of variable 2 for observation <em>i<\/em>, and so on.<\/p>\n<\/div>\n<p>Logistic regression is a generalized linear model where the outcome is a two-level\u00a0categorical variable. The outcome,<em> <em>Y<sub>i<\/sub><\/em><\/em>, takes the value 1 (in our application, this represents a spam message) with probability <em>p<sub>i<\/sub><\/em>\u00a0and the value 0 with probability 1 \u2212 <em>p<sub>i<\/sub><\/em>. It is the\u00a0probability <em>p<sub>i<\/sub><\/em>\u00a0that we model in relation to the predictor variables.<\/p>\n<p>The logistic regression model relates the probability an email is spam (<em>p<sub>i<\/sub><\/em>) to the\u00a0predictors<em> <em>x<\/em><\/em><sub>1, <em>i<\/em><\/sub>,<em> <em>x<\/em><\/em><sub>2, <em>i<\/em><\/sub>\u00a0&#8230;,<em> <em>x<sub>k, <em>i<\/em><\/sub><\/em>\u00a0through a framework much like that of multiple regression:<\/p>\n<p style=\"text-align: center;\">[latex]\\text{transformation}\\left(p_i\\right)=\\beta_0+\\beta_1x_{1,i}+\\beta_2x_{2,i}+\\dots+\\beta_k{x}_{k,i}[\/latex]<\/p>\n<p>We want to choose a transformation in the equation above\u00a0that makes practical and mathematical sense. For example, we want a transformation that makes the range of possibilities on the left hand side of the equation above\u00a0equal to the range of possibilities for the right hand side; if there was no transformation for this equation, the left hand side could only take values between 0 and 1, but the right hand side could take values outside of this range. 
A\u00a0common transformation for<i>\u00a0<\/i><em>p<sub>i<\/sub><\/em>\u00a0is the<strong> logit transformation<\/strong>, which may be written as<\/p>\n<p style=\"text-align: center;\">[latex]\\displaystyle\\text{logit}\\left(p_i\\right)=\\log_{e}\\left(\\frac{p_i}{1-p_i}\\right)[\/latex]<\/p>\n<p>The logit transformation is shown in Figure 1. Below, we rewrite the transformation equation\u00a0using\u00a0the logit transformation of<i>\u00a0<\/i><em>p<sub>i<\/sub><\/em>:<\/p>\n<p style=\"text-align: center;\">[latex]\\displaystyle\\log_e\\left(\\frac{p_i}{1-p_i}\\right)=\\beta_0+\\beta_1x_{1,i}+\\beta_2x_{2,i}+\\dots+\\beta_k{x}_{k,i}[\/latex]<\/p>\n<div id=\"attachment_1468\" style=\"width: 778px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1468\" class=\"wp-image-1468 size-full\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215327\/Figure8_5.png\" alt=\"A graph of values of pi against values of logit(pi). Points include (-5.0, 0.007), (-1.0, 0.27), (0.0, 0.50), (2.0, 0.88), and (4.0, 0.982).\" width=\"768\" height=\"435\" \/><\/p>\n<p id=\"caption-attachment-1468\" class=\"wp-caption-text\">Figure 1. Values of <i>p<sub>i<\/sub><\/i> against values of logit(<em>p<sub>i<\/sub><\/em>).<\/p>\n<\/div>\n<p>In our spam example, there are 10 predictor variables, so<em> k<\/em> = 10. This model isn&#8217;t very\u00a0intuitive, but it still has some resemblance to multiple regression, and we can fit this model using software. 
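In code, the logit transformation and its inverse are one-liners. The Python sketch below (not part of the original text) simply re-derives a few of the (logit(<em>p<sub>i</sub></em>), <em>p<sub>i</sub></em>) pairs plotted in Figure 1.

```python
import math

def logit(p):
    """logit(p) = log_e(p / (1 - p)), mapping (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse of the logit: e^x / (1 + e^x), mapping any real x into (0, 1)."""
    return math.exp(x) / (1 + math.exp(x))

# Re-derive a few of the points shown in Figure 1.
for x in [-5.0, -1.0, 0.0, 2.0, 4.0]:
    print(f"logit(p) = {x:5.1f}  ->  p = {inv_logit(x):.3f}")
```

Note how rapidly the probability saturates: logit values of −5 and 4 correspond to probabilities of about 0.007 and 0.982, matching Figure 1.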
In fact, once we look at results from software, it will start to feel like we&#8217;re\u00a0back in multiple regression, even if the interpretation of the coefficients is more complex.<\/p>\n<p>&nbsp;<\/p>\n<div class=\"textbox exercises\">\n<h3>Example<\/h3>\n<p>Here we create a spam filter with a single predictor: to_multiple.<\/p>\n<p>This variable indicates whether more than one email address was listed in the<em> To<\/em> field of the email. The following logistic regression model was fit using statistical software:<\/p>\n<p>[latex]\\displaystyle\\log\\left(\\frac{p_i}{1-p_i}\\right)=-2.12-1.81\\times\\text{to_multiple}[\/latex]<\/p>\n<p>If an email is randomly selected and it has just one address in the<em> To<\/em> field, what is\u00a0the probability it is spam? What if more than one address is listed in the<em> To<\/em> field?<\/p>\n<p>Solution:<\/p>\n<p>If there is only one email in the<em> To<\/em> field, then to_multiple takes value 0 and the\u00a0right side of the model equation equals \u22122.12. Solving for\u00a0<em>p<sub>i<\/sub><\/em>:<\/p>\n<p>[latex]\\displaystyle\\frac{e^{-2.12}}{1+e^{-2.12}}=0.11[\/latex]<\/p>\n<p>Just as\u00a0we labeled a fitted value of<em>\u00a0y<sub>i<\/sub><\/em>\u00a0with a &#8220;hat&#8221; in single-variable and multiple regression,\u00a0we will do the same for this probability: [latex]\\hat{p}_i=0.11[\/latex].<\/p>\n<p>If there is more than one address listed in the<em> To<\/em> field, then the right side of the model\u00a0equation is \u22122<em>.<\/em>12 \u2212\u00a01<em>.<\/em>81 \u00d7\u00a01 = \u22123<em>.<\/em>93, which corresponds to a probability [latex]\\hat{p}_i=0.02[\/latex].<\/p>\n<p>Notice that we could examine \u22122.12 and \u22123.93 in Figure 1 to estimate the probability\u00a0before formally calculating the value.<\/p>\n<\/div>\n<p>To convert from values on the regression-scale (e.g. 
\u22122.12 and \u22123.93 in Example 1),\u00a0use the following formula, which is the result of solving for<em> p<sub>i<\/sub><\/em> in the regression model:<\/p>\n<p style=\"text-align: center;\">[latex]\\displaystyle{p}_i=\\frac{e^{\\beta_0+\\beta_1x_{1,i}+\\beta_2x_{2,i}+\\dots+\\beta_k{x}_{k,i}}}{1 + e^{\\beta_0+\\beta_1x_{1,i}+\\beta_2x_{2,i}+\\dots+\\beta_k{x}_{k,i}}}[\/latex]<\/p>\n<p>As with most applied data problems, we substitute the point estimates for the parameters\u00a0(the\u00a0<em>\u03b2<sub>i<\/sub><\/em>) so that we may make use of this formula. In Example 1, the probabilities were\u00a0calculated as<\/p>\n<p style=\"text-align: center;\">[latex]\\displaystyle\\frac{e^{-2.12}}{1+e^{-2.12}}=0.11\\qquad\\frac{e^{-2.12-1.81}}{1+e^{-2.12-1.81}}=0.02[\/latex]<\/p>\n<p>While the information about whether the email is addressed to multiple people is a helpful start in classifying email as spam or not, the probabilities of 11% and 2% are not dramatically different, and neither provides very strong evidence about which particular email messages are spam. To get more precise estimates, we&#8217;ll need to include many more\u00a0variables in the model.<\/p>\n<p>We used statistical software to fit the logistic regression model with all ten predictors\u00a0described in Table 1. Like multiple regression, the result may be presented in a summary table, which is shown in Table 2. The structure of this table is almost identical to that of multiple regression; the only notable difference is that the <em>p<\/em>-values are calculated using\u00a0the normal distribution rather than the<em> t<\/em>-distribution.<\/p>\n<p>&nbsp;<\/p>\n<table>\n<thead>\n<tr>\n<th colspan=\"5\">Table 2. Summary table for the full logistic regression model for the\u00a0spam filter example<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><\/td>\n<td>Estimate<\/td>\n<td>Std. 
Error<\/td>\n<td>z value<\/td>\n<td>Pr (<em> &gt;<\/em> |z|)<\/td>\n<\/tr>\n<tr>\n<td>(Intercept)<\/td>\n<td>\u20130.8362<\/td>\n<td>0.0962<\/td>\n<td>\u20138.69<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>to_multiple<\/td>\n<td>\u20132.8836<\/td>\n<td>0.3121<\/td>\n<td>\u20139.24<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>winner<\/td>\n<td>1.7038<\/td>\n<td>0.3254<\/td>\n<td>5.24<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>format<\/td>\n<td>\u20131.5902<\/td>\n<td>0.1239<\/td>\n<td>\u201312.84<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>re_subj<\/td>\n<td>\u20132.9082<\/td>\n<td>0.3708<\/td>\n<td>\u20137.84<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>exclaim_subj<\/td>\n<td>0.1355<\/td>\n<td>0.2268<\/td>\n<td>0.60<\/td>\n<td>0.5503<\/td>\n<\/tr>\n<tr>\n<td>cc<\/td>\n<td>\u20130.4863<\/td>\n<td>0.3054<\/td>\n<td>\u20131.59<\/td>\n<td>0.1113<\/td>\n<\/tr>\n<tr>\n<td>attach<\/td>\n<td>0.9790<\/td>\n<td>0.2170<\/td>\n<td>4.51<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>dollar<\/td>\n<td>\u20130.0582<\/td>\n<td>0.1589<\/td>\n<td>\u20130.37<\/td>\n<td>0.7144<\/td>\n<\/tr>\n<tr>\n<td>inherit<\/td>\n<td>0.2093<\/td>\n<td>0.3197<\/td>\n<td>0.65<\/td>\n<td>0.5127<\/td>\n<\/tr>\n<tr>\n<td>password<\/td>\n<td>\u20131.4929<\/td>\n<td>0.5295<\/td>\n<td>\u20132.82<\/td>\n<td>0.0048<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Just like multiple regression, we could trim some variables from the model using the\u00a0<em>p<\/em>-value. Using backward elimination with a <em>p<\/em>-value cutoff of 0.05 (start with the full model and trim the predictors with <em>p<\/em>-values greater than 0.05), we ultimately eliminate the exclaim_subj, dollar, inherit, and cc predictors. The remainder of this section will\u00a0rely on this smaller model, which is summarized in Table 3.<\/p>\n<table>\n<thead>\n<tr>\n<th colspan=\"5\">Table 3. 
Summary table for the logistic regression model for the spam\u00a0filter, where variable selection has been performed<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><\/td>\n<td>Estimate<\/td>\n<td>Std. Error<\/td>\n<td>z value<\/td>\n<td>Pr (<em> &gt;<\/em> |z|)<\/td>\n<\/tr>\n<tr>\n<td>(Intercept)<\/td>\n<td>\u20130.8595<\/td>\n<td>0.0910<\/td>\n<td>\u20139.44<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>to_multiple<\/td>\n<td>\u20132.8372<\/td>\n<td>0.3092<\/td>\n<td>\u20139.18<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>winner<\/td>\n<td>1.7370<\/td>\n<td>0.3218<\/td>\n<td>5.40<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>format<\/td>\n<td>\u20131.5569<\/td>\n<td>0.1207<\/td>\n<td>\u201312.90<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>re_subj<\/td>\n<td>\u20133.0482<\/td>\n<td>0.3630<\/td>\n<td>\u20138.40<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>attach<\/td>\n<td>0.8643<\/td>\n<td>0.2042<\/td>\n<td>4.23<\/td>\n<td>0.0000<\/td>\n<\/tr>\n<tr>\n<td>password<\/td>\n<td>\u20131.4871<\/td>\n<td>0.5290<\/td>\n<td>\u20132.81<\/td>\n<td>0.0049<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<div class=\"textbox key-takeaways\">\n<h3>Try It<\/h3>\n<p>Examine the summary of the reduced model in Table 3,\u00a0and in particular, examine the to_multiple row. Is the point estimate the same as\u00a0we found before, \u22121.81, or is it different? Explain why this might be.<\/p>\n<p>Solution:<\/p>\n<p>The new estimate is different: \u22122.84. This new value represents the estimated coefficient when we are also accounting for other variables in the logistic regression model.<\/p>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Point estimates will generally change a little\u2014and sometimes a lot\u2014depending on which other variables are included in the model. This is usually due to collinearity in the predictor variables. 
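The fitted probabilities implied by the estimates in Table 3 can be computed directly from the inverse-logit formula given earlier. The short Python sketch below uses the Table 3 point estimates; the example email it scores is hypothetical and chosen only for illustration.

```python
import math

# Point estimates from Table 3 (the reduced spam-filter model).
coef = {"(Intercept)": -0.8595, "to_multiple": -2.8372, "winner": 1.7370,
        "format": -1.5569, "re_subj": -3.0482, "attach": 0.8643,
        "password": -1.4871}

def spam_probability(email):
    """email: dict of 0/1 indicator values for the predictors in Table 3."""
    eta = coef["(Intercept)"] + sum(coef[k] * v for k, v in email.items())
    return math.exp(eta) / (1 + math.exp(eta))  # inverse logit

# A hypothetical plain-text email containing "winner", sent to one address:
email = {"to_multiple": 0, "winner": 1, "format": 0,
         "re_subj": 0, "attach": 0, "password": 0}
print(round(spam_probability(email), 3))
```

Setting every indicator to zero recovers the baseline probability implied by the intercept alone, so the function also serves as a quick check on the sign and size of each coefficient.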
We previously saw this in the eBay auction example when we compared the coefficient of cond_new in a single-variable model and the corresponding coefficient in the multiple regression model that used three additional variables.<\/p>\n<p>&nbsp;<\/p>\n<div class=\"textbox exercises\">\n<h3>Example<\/h3>\n<p>Spam filters are built to be automated, meaning a piece of software\u00a0is written to collect information about emails as they arrive, and this information is put in the form of variables. These variables are then put into an algorithm that uses a statistical model, like the one we&#8217;ve fit, to classify the email. Suppose we write software for a spam filter using the reduced model shown in Table 3. If an incoming email has the word &#8220;winner&#8221; in it, will this raise or lower the model&#8217;s\u00a0calculated probability that the incoming email is spam?<\/p>\n<p>Solution:<\/p>\n<p>The estimated coefficient of winner is positive (1.7370). A positive coefficient estimate in logistic regression, just like in multiple regression, corresponds to a positive association between the predictor and response variables when accounting for the other variables in the model. Since the response variable takes value 1 if an email is spam and 0 otherwise, the positive coefficient indicates that the presence of &#8220;winner&#8221;\u00a0in an email raises the model probability that the message is spam.<\/p>\n<\/div>\n<div class=\"textbox exercises\">\n<h3>Example<\/h3>\n<p>Suppose the same email from Example 2\u00a0was in HTML format,\u00a0meaning the format variable took value 1. 
Does this characteristic increase or decrease the probability that the email is spam according to the model?<\/p>\n<p>Solution:<\/p>\n<p>Since HTML corresponds to a value of 1 in the format variable and the coefficient of\u00a0this variable is negative (\u22121.5569), this would lower the probability estimate returned\u00a0from the model.<\/p>\n<\/div>\n<h2>Practical Decisions in the Email Application<\/h2>\n<p>Examples 2\u00a0and 3\u00a0highlight a key feature of logistic and multiple regression. In the\u00a0spam filter example, some email characteristics will push an email&#8217;s classification in the\u00a0direction of spam while other characteristics will push it in the opposite direction.<\/p>\n<p>If we were to implement a spam filter using the model we have fit, then each future\u00a0email we analyze would fall into one of three categories based on the email&#8217;s characteristics:<\/p>\n<ol>\n<li>The email characteristics generally indicate the email is not spam, and so the resulting\u00a0probability that the email is spam is quite low, say, under 0.05.<\/li>\n<li>The characteristics generally indicate the email is spam, and so the resulting probability that the email is spam is quite large, say, over 0.95.<\/li>\n<li>The characteristics roughly balance each other out in terms of evidence for and against\u00a0the message being classified as spam. Its probability falls in the remaining range,\u00a0meaning the email cannot be adequately classified as spam or not spam.<\/li>\n<\/ol>\n<p>If we were managing an email service, we would have to think about what should be\u00a0done in each of these three instances. In an email application, there are usually just two possibilities: filter the email out from the regular inbox and put it in a &#8220;spambox,&#8221; or let\u00a0the email go to the regular inbox.<\/p>\n<p>&nbsp;<\/p>\n<div class=\"textbox key-takeaways\">\n<h3>Try It<\/h3>\n<p>The first and second scenarios are intuitive. 
If the evidence strongly suggests a message is not spam, send it to the inbox. If the evidence strongly\u00a0suggests the message is spam, send it to the spambox. How should we handle emails\u00a0in the third category?<\/p>\n<p>Solution:<\/p>\n<p>In this particular application, we should err on the side of sending more mail to the inbox rather than mistakenly putting good messages in the spambox. So, in summary: emails in the first and last categories go to the regular inbox, and those in the second scenario go to the spambox.<\/p>\n<\/div>\n<div class=\"textbox key-takeaways\">\n<h3>Try It<\/h3>\n<p>Suppose we apply the logistic model we have built as a\u00a0spam filter and that 100 messages are placed in the spambox over 3 months. If we\u00a0used the guidelines above for putting messages into the spambox, about how many\u00a0legitimate (non-spam) messages would you expect to find among the 100 messages?<\/p>\n<p>Solution:<\/p>\n<p>First, note that we proposed a cutoff for the predicted probability of 0.95 for spam. In a worst case scenario, all the messages in the spambox had the minimum probability equal to about 0.95. Thus, we should expect to find about 5 or fewer legitimate messages among the 100 messages placed in the spambox.<\/p>\n<\/div>\n<p>Almost any classifier will have some error. In the spam filter guidelines above, we\u00a0have decided that it is okay to allow up to 5% of the messages in the spambox to be real messages. If we wanted to make it a little harder to classify messages as spam, we could use a cutoff of 0.99. This would have two effects. Because it raises the standard for what can be classified as spam, it reduces the number of good emails that are classified as spam. However, it will also fail to correctly classify an increased fraction of spam messages. No matter the complexity and the confidence we might have in our model, these practical considerations are absolutely crucial to making a helpful spam filter. 
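The three-category decision rule just described takes only a few lines of code. The sketch below (not part of the original text) uses the 0.05 and 0.95 cutoffs discussed above; per the guidelines, uncertain messages are routed to the inbox rather than the spambox.

```python
def classify(p_spam, lower=0.05, upper=0.95):
    """Assign a fitted spam probability to one of the three categories."""
    if p_spam < lower:
        return "likely not spam"
    if p_spam > upper:
        return "likely spam"
    return "uncertain"

def route(p_spam):
    """Only 'likely spam' goes to the spambox; the other two go to the inbox."""
    return "spambox" if classify(p_spam) == "likely spam" else "inbox"

for p in [0.01, 0.50, 0.97]:
    print(f"p = {p:.2f}: {classify(p):>15s} -> {route(p)}")
```

Raising `upper` to 0.99 would implement the stricter filter described above: fewer legitimate messages misfiled, at the cost of letting more spam through.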
Without them, we\u00a0could actually do more harm than good by using our statistical model.<\/p>\n<h2>Diagnostics for the Email Classifier<\/h2>\n<div class=\"textbox\">\n<h3>Logistic Regression Conditions<\/h3>\n<p>There are two key conditions for fitting a logistic regression model:<\/p>\n<ol>\n<li>Each predictor <em>x<sub>i<\/sub><\/em> is linearly related to logit(<em>p<sub>i<\/sub><\/em>) if all other predictors are\u00a0held constant.<\/li>\n<li>Each outcome <em>Y<sub>i<\/sub><\/em> is independent of the other outcomes.<\/li>\n<\/ol>\n<\/div>\n<p>The first condition of the logistic regression model is not easily checked without a\u00a0fairly sizable amount of data. Luckily, we have 3,921 emails in our data set! Let&#8217;s first visualize these data by plotting the true classification of the emails against the model&#8217;s fitted probabilities, as shown in Figure 2. The vast majority of emails (spam or not)\u00a0still have fitted probabilities below 0.5.<\/p>\n<div id=\"attachment_1470\" style=\"width: 789px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1470\" class=\"wp-image-1470 size-full\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215329\/Figure8_6.png\" alt=\"Scatter plot of the data from the emails.\" width=\"779\" height=\"272\" \/><\/p>\n<p id=\"caption-attachment-1470\" class=\"wp-caption-text\">Figure 2: The predicted probability that each of the 3,921 emails is spam is classified by their grouping, spam or not. Noise (small, random vertical shifts) has been added to each point so that points with nearly identical values aren&#8217;t plotted exactly on top of one another. This makes it possible\u00a0to see more observations.<\/p>\n<\/div>\n<p>This may at first seem very discouraging: we have fit a logistic model to create a spam\u00a0filter, but no emails have a fitted probability of being spam above 0.75. 
Don&#8217;t despair; we will discuss ways to improve the model through the use of better variables\u00a0later.<\/p>\n<p>We&#8217;d like to assess the quality of our model. For example, we might ask: if we look\u00a0at emails that we modeled as having a 10% chance of being spam, do we find about 10% of them actually are spam? To help us out, we&#8217;ll borrow an advanced statistical method called<strong> natural splines<\/strong> that estimates the local probability over the region 0.00 to 0.75 (the largest predicted probability was 0.73, so we avoid extrapolating). All you need to know about natural splines to understand what we are doing is that they are used to fit\u00a0flexible lines rather than straight lines.<\/p>\n<p>The curve fit using natural splines is shown in Figure 3\u00a0as a solid black line. If\u00a0the logistic model fits well, the curve should closely follow the dashed<em> y<\/em> =<em> x<\/em> line. We have added shading to represent the confidence bound for the curved line to clarify what fluctuations might plausibly be due to chance. Even with this confidence bound, there are weaknesses in the first model assumption. The solid curve and its confidence bound dip below the dashed line from about 0.1 to 0.3, and then they drift above the dashed line from about 0.35 to 0.55. These deviations indicate the model relating the parameter to the\u00a0predictors does not closely resemble the true relationship.<\/p>\n<p>We could evaluate the second logistic regression model assumption\u2014independence of\u00a0the outcomes\u2014using the model residuals. The residuals for a logistic regression model are calculated the same way as with multiple regression: the observed outcome minus the expected outcome. 
For logistic regression, the expected value of the outcome is the fitted\u00a0probability for the observation, and the residual may be written as<\/p>\n<p style=\"text-align: center;\">[latex]\\displaystyle{e}_i=Y_i-\\hat{p}_i[\/latex]<\/p>\n<p>We could plot these residuals against a variety of variables or in their order of collection,\u00a0as we did with the residuals in multiple regression. However, since the model will need to be revised to effectively classify spam and you have already seen similar residual plots, we won&#8217;t investigate the residuals here.<\/p>\n<div id=\"attachment_1471\" style=\"width: 784px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1471\" class=\"size-full wp-image-1471\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/132\/2016\/04\/21215331\/Figure8_7.png\" alt=\"Figure 3: The solid black line provides the empirical estimate of the probability for observations based on their predicted probabilities (confidence bounds are also shown for this line), which is fit using natural splines. A small amount of noise was added to the observations in the plot to allow\u00a0more observations to be seen.\" width=\"774\" height=\"433\" \/><\/p>\n<p id=\"caption-attachment-1471\" class=\"wp-caption-text\">Figure 3: The solid black line provides the empirical estimate of the probability for observations based on their predicted probabilities (confidence bounds are also shown for this line), which is fit using natural splines. 
A small amount of noise was added to the observations in the plot to allow\u00a0more observations to be seen.<\/p>\n<\/div>\n<h2>Improving the Set of Variables for a Spam Filter<\/h2>\n<p>If we were building a spam filter for an email service that managed many accounts (e.g.\u00a0Gmail or Hotmail), we would spend much more time thinking about additional variables that could be useful in classifying emails as spam or not. We also would use transformations or other techniques that would help us include strongly skewed numerical variables as\u00a0predictors.<\/p>\n<p>Take a few minutes to think about additional variables that might be useful in identifying spam. Below is a list of variables we think might be useful:<\/p>\n<ol>\n<li>An indicator variable could be used to represent whether there was prior two-way\u00a0correspondence with a message&#8217;s sender. For instance, if you sent a message to john@example.com and then John sent you an email, this variable would take value 1 for the email that John sent. If you had never sent John an email, then the variable\u00a0would be set to 0.<\/li>\n<li>A second indicator variable could utilize an account&#8217;s past spam flagging information.\u00a0The variable could take value 1 if the sender of the message has previously sent\u00a0messages flagged as spam.<\/li>\n<li>A third indicator variable could flag emails that contain links included in previous\u00a0spam messages. If such a link is found, then set the variable to 1 for the email.\u00a0Otherwise, set it to 0.<\/li>\n<\/ol>\n<p>The variables described above take one of two approaches. Variable (1) is specially designed\u00a0to capitalize on the fact that spam is rarely sent between individuals that have two-way\u00a0communication. Variables (2) and (3) are specially designed to flag common spammers or\u00a0spam messages. 
While we would have to verify using the data that each of the variables is\u00a0effective, these seem like promising ideas.<\/p>\n<p>Table 4\u00a0shows a contingency table for spam and also for the new variable described\u00a0in (1) above. If we look at the 1,090 emails where there was correspondence with the sender in the preceding 30 days, not one of these messages was spam. This suggests variable (1) would be very effective at accurately classifying some messages as not spam. With this single variable, we would be able to send about 28% of messages through to the inbox with\u00a0confidence that almost none are spam.<\/p>\n<p>&nbsp;<\/p>\n<table>\n<thead>\n<tr>\n<th colspan=\"4\">Table 4.\u00a0A contingency table for spam and a new variable that represents\u00a0whether there had been correspondence with the sender in the preceding\u00a030 days<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><\/td>\n<td colspan=\"2\">prior correspondence<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>no<\/td>\n<td>yes<\/td>\n<td>Total<\/td>\n<\/tr>\n<tr>\n<td>spam<\/td>\n<td>367<\/td>\n<td>0<\/td>\n<td>367<\/td>\n<\/tr>\n<tr>\n<td>not spam<\/td>\n<td>2464<\/td>\n<td>1090<\/td>\n<td>3554<\/td>\n<\/tr>\n<tr>\n<td>Total<\/td>\n<td>2831<\/td>\n<td>1090<\/td>\n<td>3921<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The variables described in (2) and (3) would provide an excellent foundation for distinguishing messages coming from known spammers or messages that take a known form of spam. To utilize these variables, we would need to build databases: one holding email addresses of known spammers, and one holding URLs found in known spam messages. Our access to such information is limited, so we cannot implement these two variables in this textbook. However, if we were hired by an email service to build a spam filter, these would\u00a0be important next steps.<\/p>\n<p>In addition to finding more and better predictors, we would need to create a customized\u00a0logistic regression model for each email account. 
This may sound like an intimidating task, but its complexity is not as daunting as it may at first seem. We&#8217;ll save the details for a\u00a0statistics course where computer programming plays a more central role.<\/p>\n<p>For what is the extremely challenging task of classifying spam messages, we have made\u00a0a lot of progress. We have seen that simple email variables, such as the format, inclusion of certain words, and other circumstantial characteristics, provide helpful information for spam classification. Many challenges remain, from better understanding logistic regression to carrying out the necessary computer programming, but completing such a task is very\u00a0nearly within your reach.<\/p>\n\n\t\t\t <section class=\"citations-section\" role=\"contentinfo\">\n\t\t\t <h3>Candela Citations<\/h3>\n\t\t\t\t\t <div>\n\t\t\t\t\t\t <div id=\"citation-list-973\">\n\t\t\t\t\t\t\t <div class=\"licensing\"><div class=\"license-attribution-dropdown-subheading\">CC licensed content, Shared previously<\/div><ul class=\"citation-list\"><li>OpenIntro Statistics. <strong>Authored by<\/strong>: David M Diez, Christopher D Barr, and Mine Cetinkaya-Rundel. <strong>Provided by<\/strong>: OpenIntro. <strong>Located at<\/strong>: <a target=\"_blank\" href=\"https:\/\/www.openintro.org\/stat\/textbook.php\">https:\/\/www.openintro.org\/stat\/textbook.php<\/a>. <strong>License<\/strong>: <em><a target=\"_blank\" rel=\"license\" href=\"https:\/\/creativecommons.org\/licenses\/by-sa\/4.0\/\">CC BY-SA: Attribution-ShareAlike<\/a><\/em>. <strong>License Terms<\/strong>: This textbook is available under a Creative Commons license. 
Visit openintro.org for a free  PDF, to download the textbook&#039;s source files.<\/li><\/ul><\/div>\n\t\t\t\t\t\t <\/div>\n\t\t\t\t\t <\/div>\n\t\t\t <\/section><hr class=\"before-footnotes clear\" \/><div class=\"footnotes\"><ol><li id=\"footnote-973-1\">Recall that if outliers are present in predictor variables, the corresponding observations may be especially influential on the resulting model. This is the motivation for omitting the numerical variables, such as the number of characters and line breaks in emails. These variables exhibit extreme skew. We could resolve this issue by transforming these variables (e.g. using a log-transformation), but we will omit this further investigation for brevity. <a href=\"#return-footnote-973-1\" class=\"return-footnote\" aria-label=\"Return to footnote 1\">&crarr;<\/a><\/li><\/ol><\/div>","protected":false},"author":21,"menu_order":4,"template":"","meta":{"_candela_citation":"[{\"type\":\"cc\",\"description\":\"OpenIntro Statistics\",\"author\":\"David M Diez, Christopher D Barr, and Mine Cetinkaya-Rundel\",\"organization\":\"OpenIntro\",\"url\":\"https:\/\/www.openintro.org\/stat\/textbook.php\",\"project\":\"\",\"license\":\"cc-by-sa\",\"license_terms\":\"This textbook is available under a Creative Commons license. 
Visit openintro.org for a free  PDF, to download the textbook\\'s source files.\"}]","CANDELA_OUTCOMES_GUID":"","pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-973","chapter","type-chapter","status-publish","hentry"],"part":961,"_links":{"self":[{"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/chapters\/973","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/wp\/v2\/users\/21"}],"version-history":[{"count":3,"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/chapters\/973\/revisions"}],"predecessor-version":[{"id":1528,"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/chapters\/973\/revisions\/1528"}],"part":[{"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/parts\/961"}],"metadata":[{"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/chapters\/973\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/wp\/v2\/media?parent=973"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/pressbooks\/v2\/chapter-type?post=973"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/wp\/v2\/contributor?post=973"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/ntcc-introstats1\/wp-json\/wp\/v2\/license?post=973"}],"curies":[{"name":"wp","href":"https
:\/\/api.w.org\/{rel}","templated":true}]}}