{"id":877,"date":"2017-05-11T17:17:28","date_gmt":"2017-05-11T17:17:28","guid":{"rendered":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/chapter\/chapter-7-correlation-and-simple-linear-regression\/"},"modified":"2017-05-11T17:17:28","modified_gmt":"2017-05-11T17:17:28","slug":"chapter-7-correlation-and-simple-linear-regression","status":"publish","type":"chapter","link":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/chapter\/chapter-7-correlation-and-simple-linear-regression\/","title":{"raw":"Chapter 7:  Correlation and Simple Linear Regression","rendered":"Chapter 7:  Correlation and Simple Linear Regression"},"content":{"raw":"<div class=\"Basic-Text-Frame\">\n<p class=\"Chapter-Number\">In many studies, we measure more than one variable for each individual. For example, we measure precipitation and plant growth, or number of young with nesting habitat, or soil erosion and volume of water. We collect pairs of data and instead of examining each variable separately (univariate data), we want to find ways to describe <strong class=\"Strong-2\">bivariate data<\/strong>, in which two variables are measured on each subject in our sample. Given such data, we begin by determining if there is a relationship between these two variables. As the values of one variable change, do we see corresponding changes in the other variable?<\/p>\nWe can describe the relationship between these two variables graphically and numerically. We begin by considering the concept of correlation.\n<p class=\"Callout\"><span class=\"pullquote-left\">Correlation is defined as the statistical association between two variables.<\/span><\/p>\nA correlation exists between two variables when one of them is related to the other in some way. A scatterplot is the best place to start. A scatterplot (or scatter diagram) is a graph of the paired (x, y) sample data with a horizontal x-axis and a vertical y-axis. 
Each individual (x, y) pair is plotted as a single point.\n\n[caption id=\"\" align=\"aligncenter\" width=\"556\"]<img alt=\"11280.png\" class=\"frame-66\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171506\/11280.png\" width=\"556\" height=\"369\" \/> Figure 1. Scatterplot of chest girth versus length.[\/caption]\n\nIn this example, we plot bear chest girth (y) against bear length (x). When examining a scatterplot, we should study the overall pattern of the plotted points. In this example, we see that the value for chest girth does tend to increase as the value of length increases. We can see an upward slope and a straight-line pattern in the plotted data points.\n\nA scatterplot can identify several different types of relationships between two variables.\n<ul><li class=\"List-Paragraph\">A relationship has <strong class=\"Strong-2\">no correlation<\/strong> when the points on a scatterplot do not show any pattern.<\/li>\n \t<li class=\"List-Paragraph\">A relationship is <strong class=\"Strong-2\">non-linear<\/strong> when the points on a scatterplot follow a pattern but not a straight line.<\/li>\n \t<li class=\"List-Paragraph\">A relationship is <strong class=\"Strong-2\">linear<\/strong> when the points on a scatterplot follow a somewhat straight line pattern. This is the relationship that we will examine.<\/li>\n<\/ul>\nLinear relationships can be either positive or negative. Positive relationships have points that incline upwards to the right. As <em>x<\/em> values increase, <em>y<\/em> values increase. As <em>x<\/em> values decrease, <em>y<\/em> values decrease. 
For example, when studying plants, height typically increases as diameter increases.\n\n[caption id=\"\" align=\"aligncenter\" width=\"602\"]<img alt=\"11268.png\" class=\"frame-80\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171508\/11268.png\" width=\"602\" height=\"399\" \/> Figure 2. Scatterplot of height versus diameter.[\/caption]\n<p class=\"Caption\">Negative relationships have points that decline downward to the right. As <em>x<\/em> values increase, <em>y<\/em> values decrease. As <em>x<\/em> values decrease, <em>y<\/em> values increase. For example, as wind speed increases, wind chill temperature decreases.<\/p>\n\n\n[caption id=\"\" align=\"aligncenter\" width=\"619\"]<img alt=\"11256.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171510\/11256.png\" width=\"619\" height=\"410\" \/> Figure 3. Scatterplot of temperature versus wind speed.[\/caption]\n\nNon-linear relationships have an apparent pattern, just not a linear one. For example, as age increases, height increases up to a point and then levels off after reaching a maximum height.\n\n[caption id=\"\" align=\"aligncenter\" width=\"627\"]<img alt=\"11245.png\" class=\"frame-4\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171513\/11245.png\" width=\"627\" height=\"419\" \/> Figure 4. Scatterplot of height versus age.[\/caption]\n\nWhen two variables have no relationship, there is neither a straight-line pattern nor a non-linear pattern. When one variable changes, it does not influence the other variable.\n\n[caption id=\"\" align=\"aligncenter\" width=\"670\"]<img alt=\"11236.png\" class=\"frame-50\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171515\/11236.png\" width=\"670\" height=\"448\" \/> Figure 5. 
Scatterplot of growth versus area.[\/caption]\n<h2>Linear Correlation Coefficient<\/h2>\nBecause visual examinations are largely subjective, we need a more precise and objective measure to define the correlation between the two variables. To quantify the strength and direction of the relationship between two variables, we use the linear correlation coefficient:\n<p class=\"Centered\"><img alt=\"11226.png\" class=\"frame-74 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171517\/11226.png\" \/><\/p>\nwhere <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>x\u0304<\/em><\/span> and <em>s<span class=\"Subscript SmallText\">x<\/span><\/em> are the sample mean and sample standard deviation of the <em>x<\/em>\u2019s, and <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0304<\/em><\/span> and <em>s<span class=\"Subscript SmallText\">y<\/span><\/em> are the mean and standard deviation of the <em>y<\/em>\u2019s. The sample size is <em>n<\/em>.\n\nAn alternate computation of the correlation coefficient is:\n<p class=\"No-Caption\"><span class=\"Picture\"><img alt=\"11679.png\" class=\"frame-45 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171518\/11679.png\" \/><\/span><\/p>\n<p class=\"Centered\">where <span class=\"Inline-Equation-Large\"><img alt=\"11691.png\" class=\"frame-7\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171519\/11691.png\" \/><\/span><\/p>\n<p class=\"Centered\"><span class=\"Inline-Equation-Large\"><img alt=\"11702.png\" class=\"frame-55\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171520\/11702.png\" \/><\/span><\/p>\n<p class=\"Centered\"><span class=\"Inline-Equation-Large\"><img alt=\"11709.png\" class=\"frame-7\" 
src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171521\/11709.png\" \/><\/span><\/p>\nThe linear correlation coefficient is also referred to as Pearson\u2019s product moment correlation coefficient in honor of Karl Pearson, who originally developed it. This statistic numerically describes how strong the straight-line or linear relationship is between the two variables and the direction, positive or negative.\n\n<strong class=\"Strong-2\">The properties of \u201cr\u201d:<\/strong>\n<ul><li class=\"List-Paragraph\">It is always between -1 and +1.<\/li>\n \t<li class=\"List-Paragraph\">It is a unitless measure so \u201cr\u201d would be the same value whether you measured the two variables in pounds and inches or in grams and centimeters.<\/li>\n \t<li class=\"List-Paragraph\">Positive values of \u201cr\u201d are associated with positive relationships.<\/li>\n \t<li class=\"List-Paragraph\">Negative values of \u201cr\u201d are associated with negative relationships.<\/li>\n<\/ul><h3>Examples of Positive Correlation<\/h3>\n[caption id=\"\" align=\"alignnone\" width=\"1021\"]<img alt=\"11215.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171524\/11215.png\" width=\"1021\" height=\"864\" \/> Figure 6. Examples of positive correlation.[\/caption]\n<h3>Examples of Negative Correlation<\/h3>\n[caption id=\"\" align=\"alignnone\" width=\"982\"]<img alt=\"11205.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171527\/11205.png\" width=\"982\" height=\"802\" \/> Figure 7. 
Examples of negative correlation.[\/caption]\n<p class=\"Callout\"><span class=\"pullquote-left\"><strong class=\"char-style-override-2\">Correlation is not causation!!!<\/strong> Just because two variables are correlated does not mean that one variable causes another variable to change.<\/span><\/p>\nExamine these next two scatterplots. Both of these data sets have an r = 0.01, but they are very different. Plot 1 shows little linear relationship between <em>x<\/em> and <em>y<\/em> variables. Plot 2 shows a strong non-linear relationship. Pearson\u2019s linear correlation coefficient only measures the strength and direction of a linear relationship. Ignoring the scatterplot could result in a serious mistake when describing the relationship between two variables.\n\n[caption id=\"\" align=\"aligncenter\" width=\"938\"]<img alt=\"11196.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171530\/11196.png\" width=\"938\" height=\"301\" \/> Figure 8. Comparison of scatterplots.[\/caption]\n<p class=\"Caption\"><span class=\"Picture\" \/>When you investigate the relationship between two variables, always begin with a scatterplot. This graph allows you to look for patterns (both linear and non-linear). The next step is to quantitatively describe the strength and direction of the linear relationship using \u201cr\u201d. Once you have established that a linear relationship exists, you can take the next step in model building.<\/p>\n\n<h2>Simple Linear Regression<\/h2>\nOnce we have identified two variables that are correlated, we would like to model this relationship. We want to use one variable as a <strong class=\"Strong-2\">predictor<\/strong> or <strong class=\"Strong-2\">explanatory<\/strong> variable to explain the other variable, the <strong class=\"Strong-2\">response<\/strong> or <strong class=\"Strong-2\">dependent<\/strong> variable. 
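The correlation coefficient defined above is straightforward to compute directly from paired sample data. Here is a minimal Python sketch of that definition; the paired data below are hypothetical and used only for illustration:

```python
# Pearson's linear correlation coefficient, computed from its definition:
# r = (1 / (n - 1)) * sum(((x_i - x_bar) / s_x) * ((y_i - y_bar) / s_y))
from statistics import mean, stdev

def pearson_r(x, y):
    """Linear correlation coefficient for paired sample data."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 divisor)
    return sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
               for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical paired measurements with a strong positive linear pattern.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
print(pearson_r(x, y))  # close to +1 for these data
```

Consistent with the properties listed above, the same value of r results if the roles of x and y are swapped, and rescaling the units of either variable leaves r unchanged.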
In order to do this, we need a good relationship between our two variables. The model can then be used to predict changes in our response variable. A strong relationship between the predictor variable and the response variable leads to a good model.\n\n[caption id=\"\" align=\"aligncenter\" width=\"471\"]<img alt=\"11187.png\" class=\"frame-172\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171532\/11187.png\" width=\"471\" height=\"311\" \/> Figure 9. Scatterplot with regression model.[\/caption]\n<p class=\"Callout\"><span class=\"pullquote-left\">A simple linear regression model is a mathematical equation that allows us to predict a response for a given predictor value.<\/span><\/p>\nOur model will take the form of <em><span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> = b<span class=\"Subscript SmallText\">0<\/span> + b<span class=\"Subscript SmallText\">1<\/span>x<\/em> where <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> is the y-intercept, <em>b<\/em><span class=\"Subscript SmallText\">1<\/span> is the slope, <em>x<\/em> is the predictor variable, and <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> is an estimate of the mean value of the response variable for any value of the predictor variable.\n\nThe y-intercept is the predicted value for the response (<em>y<\/em>) when <em>x<\/em> = 0. The slope describes the change in <em>y<\/em> for each one-unit change in <em>x<\/em>. Let\u2019s look at this example to clarify the interpretation of the slope and intercept.\n<div class=\"textbox examples\">\n<h3>Example 1<\/h3>\n<p class=\"Example\">A hydrologist creates a model to predict the volume flow for a stream at a bridge crossing with a predictor variable of daily rainfall in inches.<\/p>\n<p class=\"Example\"><span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> = 1.6 + 29<em>x<\/em>. 
The y-intercept of 1.6 can be interpreted this way: On a day with no rainfall, there will be 1.6 gal. of water\/min. flowing in the stream at that bridge crossing. The slope tells us that if it rained one inch that day, the flow in the stream would increase by an additional 29 gal.\/min. If it rained 2 inches that day, the flow would increase by an additional 58 gal.\/min.<\/p>\n\n<\/div>\n<div class=\"textbox examples\">\n<h3>Example 2<\/h3>\n<p class=\"ExampleHeading\">What would be the average stream flow if it rained 0.45 inches that day?<\/p>\n<p class=\"ExampleCenter\" style=\"text-align: center\"><span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> = 1.6 + 29<em>x<\/em> = 1.6 + 29(0.45) = 14.65 gal.\/min.<\/p>\n\n<\/div>\n\n<hr \/><p class=\"Call-out-First-line\" style=\"text-align: center\">The Least-Squares Regression Line (shortcut equations)<\/p>\n<p class=\"Call-out-Middle\" style=\"text-align: center\">The equation is given by <em><span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> = b<span class=\"Subscript SmallText\">0<\/span> + b<span class=\"Subscript SmallText\">1<\/span>x<\/em><\/p>\n<p class=\"Call-out-Middle\" style=\"text-align: center\">where <span class=\"Inline-Equation-Large\"><img alt=\"13279.png\" class=\"frame-44\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171534\/13279.png\" \/><\/span> is the slope and <em>b<span class=\"Subscript SmallText\">0<\/span> = <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0304<\/em><\/span> - b<span class=\"Subscript SmallText\">1<\/span> <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>x\u0304<\/em><\/span><\/em> is the y-intercept of the regression line.<\/p>\n<p class=\"Call-out-Middle\" style=\"text-align: center\">An alternate computational equation for slope is:<\/p>\n<p class=\"Call-out-End\" style=\"text-align: center\"><img alt=\"13297.png\" class=\"frame-74\" 
src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171535\/13297.png\" \/><\/p>\n\n\n<hr \/>\n\nThis simple model is the line of best fit for our sample data. The regression line does not go through every point; instead, it balances the difference between all data points and the straight-line model. The difference between the observed data value and the predicted value (the value on the straight line) is the error or <strong class=\"Strong-2\">residual.<\/strong> The criterion to determine the line that best describes the relation between two variables is based on the residuals.\n<p class=\"Centered\" style=\"text-align: center\">Residual = Observed - Predicted<\/p>\nFor example, if you wanted to predict the chest girth of a black bear given its weight, you could use the following model.\n<p class=\"Centered\" style=\"text-align: center\">Chest girth = 13.2 + 0.43 weight<\/p>\nThe predicted chest girth of a bear that weighed 120 lb. is 64.8 in.\n<p class=\"Centered\" style=\"text-align: center\">Chest girth = 13.2 + 0.43(120) = 64.8 in.<\/p>\nBut a measured bear chest girth (observed value) for a bear that weighed 120 lb. was actually 62.1 in.\n<p class=\"Centered\" style=\"text-align: center\">The residual would be 62.1 \u2013 64.8 = -2.7 in.<\/p>\nA negative residual indicates that the model is over-predicting. A positive residual indicates that the model is under-predicting. In this instance, the model over-predicted the chest girth of a bear that actually weighed 120 lb.\n\n[caption id=\"\" align=\"aligncenter\" width=\"899\"]<img alt=\"Image37921.PNG\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171537\/Image37921_fmt.png\" width=\"899\" height=\"537\" \/> Figure 10. 
Scatterplot with regression model illustrating a residual value.[\/caption]\n\nThis random error (residual) takes into account all unpredictable and unknown factors that are not included in the model. An ordinary least squares regression line minimizes the sum of the squared errors between the observed and predicted values to create a best fitting line. The differences between the observed and predicted values are squared to deal with the positive and negative differences.\n<h2>Coefficient of Determination<\/h2>\nAfter we fit our regression line (compute <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span>), we usually wish to know how well the model fits our data. To determine this, we need to think back to the idea of analysis of variance. In ANOVA, we partitioned the variation using sums of squares so we could identify a treatment effect opposed to random variation that occurred in our data. The idea is the same for regression. We want to partition the total variability into two parts: the variation due to the regression and the variation due to random error. And we are again going to compute sums of squares to help us do this.\n\nSuppose the total variability in the sample measurements about the sample mean is denoted by <span class=\"Inline-Equation\"><img alt=\"11856.png\" class=\"frame-71\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171540\/11856.png\" \/><\/span>, called the <strong class=\"Strong-2\">sums of squares of total variability about the mean (SST)<\/strong>. 
The squared difference between the predicted value <span class=\"Inline-Equation\"><img alt=\"13147.png\" class=\"frame-5\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171541\/13147.png\" \/><\/span> and the sample mean is denoted by <span class=\"Inline-Equation\"><img alt=\"11878.png\" class=\"frame-17\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171541\/11878.png\" \/><\/span>, called the <strong class=\"Strong-2\">sums of squares due to regression (SSR)<\/strong>. The SSR represents the variability explained by the regression line. Finally, the variability which cannot be explained by the regression line is called the <strong class=\"Strong-2\">sums of squares due to error (SSE)<\/strong> and is denoted by <span class=\"Inline-Equation\"><img alt=\"11892.png\" class=\"frame-17\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171542\/11892.png\" \/><\/span>. 
SSE is the sum of the squared residuals.\n<table class=\"Table\"><colgroup><col \/><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table\">\n<p class=\"Table-Heading\">SST<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table-Heading\">= SSR<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table-Heading\">+ SSE<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\"><span class=\"Inline-Equation\"><img alt=\"11902.png\" class=\"frame-101\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171543\/11902.png\" \/><\/span><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">= <span class=\"Inline-Equation\"><img alt=\"11906.png\" class=\"frame-102\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171545\/11906.png\" \/><\/span><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">+ <span class=\"Inline-Equation\"><img alt=\"11912.png\" class=\"frame-103\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171546\/11912.png\" \/><\/span><\/p>\n<\/td>\n<\/tr><\/tbody><\/table>\n[caption id=\"\" align=\"aligncenter\" width=\"750\"]<img alt=\"11168.png\" class=\"frame-40\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171547\/11168.png\" width=\"750\" height=\"500\" \/> Figure 11. An illustration of the relationship between the mean of the y\u2019s and the predicted and observed value of a specific y.[\/caption]\n\nThe sums of squares and mean sums of squares (just like ANOVA) are typically presented in the regression analysis of variance table. 
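The partition SST = SSR + SSE can be checked numerically once a least-squares line has been fit. A short Python sketch with hypothetical data; the slope and intercept follow the shortcut equations (b<sub>1</sub> from the deviation cross-products and b<sub>0</sub> = ȳ - b<sub>1</sub>x̄) given earlier:

```python
# Fit an ordinary least-squares line, then verify SST = SSR + SSE.
# The paired data are hypothetical, used only to illustrate the identity.
from statistics import mean

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx             # slope
b0 = y_bar - b1 * x_bar      # y-intercept

y_hat = [b0 + b1 * xi for xi in x]                      # predicted values
sst = sum((yi - y_bar) ** 2 for yi in y)                # total variability
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained (error)

r_squared = ssr / sst  # coefficient of determination, SSR as a share of SST
print(b1, b0, r_squared)
```

For these data the decomposition holds exactly (up to floating-point rounding), and R² = SSR/SST falls between 0 and 1 as described below.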
The ratio of the mean sums of squares for the regression (MSR) and mean sums of squares for error (MSE) form an F-test statistic used to test the regression model.\n\nThe relationship between these sums of square is defined as\n<p class=\"Centered\"><strong class=\"Strong-2\">Total Variation = Explained Variation + Unexplained Variation<\/strong><\/p>\nThe larger the explained variation, the better the model is at prediction. The larger the unexplained variation, the worse the model is at prediction. A quantitative measure of the explanatory power of a model is R<span class=\"Superscript SmallText\">2<\/span>, the Coefficient of Determination:\n<p class=\"Centered\"><img alt=\"11934.png\" class=\"frame-99 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171549\/11934.png\" \/><\/p>\nThe Coefficient of Determination measures the percent variation in the response variable (<em>y<\/em>) that is explained by the model.\n<ul><li class=\"List-Paragraph\">Values range from 0 to 1.<\/li>\n \t<li class=\"List-Paragraph\">An R<span class=\"Superscript SmallText\">2<\/span> close to zero indicates a model with very little explanatory power.<\/li>\n \t<li class=\"List-Paragraph\">An R<span class=\"Superscript SmallText\">2<\/span> close to one indicates a model with more explanatory power.<\/li>\n<\/ul>\nThe Coefficient of Determination and the linear correlation coefficient are related mathematically.\n<p class=\"Centered\" style=\"text-align: center\">R<sup>2<\/sup>\u00a0= r<sup>2<\/sup><\/p>\nHowever, they have two very different meanings: <em>r<\/em> is a measure of the strength and direction of a linear relationship between two variables; <em>R<\/em><span class=\"Superscript SmallText\">2<\/span> describes the percent variation in \u201c<em>y<\/em>\u201d that is explained by the model.\n<h2>Residual and Normal Probability Plots<\/h2>\nEven though you have determined, using a scatterplot, correlation 
coefficient and R<span class=\"Superscript SmallText\">2<\/span>, that <em>x<\/em> is useful in predicting the value of <em>y<\/em>, the results of a regression analysis are valid only when the data satisfy the necessary regression assumptions.\n<ol><li class=\"List-Paragraph-Number-1\">The response variable (y) is a random variable while the predictor variable (x) is assumed non-random or fixed and measured without error.<\/li>\n \t<li class=\"List-Paragraph-Number-1\">The relationship between <em>y<\/em> and <em>x<\/em> must be linear, given by the model <img alt=\"13333.png\" class=\"frame-6\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171551\/13333.png\" \/>.<\/li>\n \t<li class=\"List-Paragraph-Number-1\">The values of the random error term <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b5<\/span> are independent, have a mean of 0 and a common variance <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span><span class=\"Superscript SmallText\">2<\/span> that does not depend on <em>x<\/em>, and are normally distributed.<\/li>\n<\/ol>\nWe can use <strong class=\"Strong-2\">residual plots<\/strong> to check for a constant variance, as well as to make sure that the linear model is in fact adequate. A residual plot is a scatterplot of the residuals (observed minus predicted values) versus the predicted, or fitted, values. The center horizontal axis is set at zero. One property of the residuals is that they sum to zero and have a mean of zero. 
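The sum-to-zero property of least-squares residuals can be verified directly. A brief Python sketch with hypothetical data:

```python
# Residuals (observed - predicted) from a least-squares fit sum to zero.
# The data below are hypothetical; the line is fit by ordinary least squares.
from statistics import mean

x = [2.0, 4.0, 6.0, 8.0]
y = [3.0, 7.5, 9.0, 14.0]

x_bar, y_bar = mean(x), mean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# residual = observed - predicted, one per data point
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(sum(residuals))  # zero, up to floating-point rounding
```

These residuals, plotted against the fitted values b0 + b1*x, would form the residual plot described above.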
A residual plot should be free of any patterns and the residuals should appear as a random scatter of points about zero.\n\nA residual plot with no appearance of any patterns indicates that the model assumptions are satisfied for these data.\n\n[caption id=\"\" align=\"aligncenter\" width=\"600\"]<img alt=\"11155.png\" class=\"frame-109\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171553\/11155.png\" width=\"600\" height=\"399\" \/> Figure 12. A residual plot.[\/caption]\n\nA residual plot that has a \u201cfan shape\u201d indicates a heterogeneous variance (non-constant variance). The residuals tend to fan out or fan in as error variance increases or decreases.\n\n[caption id=\"\" align=\"aligncenter\" width=\"629\"]<img alt=\"11142.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171555\/11142.png\" width=\"629\" height=\"419\" \/> Figure 13. A residual plot that indicates a non-constant variance.[\/caption]\n\nA residual plot that tends to \u201cswoop\u201d indicates that a linear model may not be appropriate. The model may need higher-order terms of <em>x<\/em>, or a non-linear model may be needed to better describe the relationship between <em>y<\/em> and <em>x<\/em>. Transformations on <em>x<\/em> or <em>y<\/em> may also be considered.\n\n[caption id=\"\" align=\"aligncenter\" width=\"657\"]<img alt=\"11131.png\" class=\"frame-47\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171557\/11131.png\" width=\"657\" height=\"440\" \/> Figure 14. A residual plot that indicates the need for a higher order model.[\/caption]\n<p class=\"Caption\">A <strong class=\"Strong-2\">normal probability plot<\/strong> allows us to check that the errors are normally distributed. 
It plots the residuals against the expected value of the residual as if it had come from a normal distribution. Recall that when the residuals are normally distributed, they will follow a straight-line pattern, sloping upward.<\/p>\nThis plot is not unusual and does not indicate any non-normality with the residuals.\n\n[caption id=\"\" align=\"aligncenter\" width=\"657\"]<img alt=\"11121.png\" class=\"frame-47\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171559\/11121.png\" width=\"657\" height=\"440\" \/> Figure 15. A normal probability plot.[\/caption]\n\nThis next plot clearly illustrates a non-normal distribution of the residuals.\n\n[caption id=\"\" align=\"aligncenter\" width=\"657\"]<img alt=\"11111.png\" class=\"frame-47\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171602\/11111.png\" width=\"657\" height=\"440\" \/> Figure 16. A normal probability plot, which illustrates non-normal distribution.[\/caption]\n\nThe most serious violations of normality usually appear in the tails of the distribution because this is where the normal distribution differs most from other types of distributions with a similar mean and spread. 
Curvature in either or both ends of a normal probability plot is indicative of nonnormality.\n<h2>Population Model<\/h2>\nOur regression model is based on a sample of <em>n<\/em> bivariate observations drawn from a larger population of measurements.\n<p class=\"Centered\"><img alt=\"11952.png\" class=\"frame-46 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171606\/11952.png\" \/><\/p>\nWe use the means and standard deviations of our sample data to compute the slope (<em>b<\/em><span class=\"Subscript SmallText\">1<\/span>) and y-intercept (<em>b<\/em><span class=\"Subscript SmallText\">0<\/span>) in order to create an ordinary least-squares regression line. But we want to describe the relationship between <em>y<\/em> and <em>x<\/em> in the population, not just within our sample data. We want to construct a <strong class=\"Strong-2\">population model<\/strong>. Now we will think of the least-squares line computed from a sample as an estimate of the true regression line for the population.\n\n<hr \/><p class=\"Callout\"><strong class=\"char-style-override-2\">The Population Model<\/strong>\n<span class=\"Picture\"><img alt=\"11964.png\" class=\"frame-45\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171607\/11964.png\" \/><\/span>, where <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span> is the population mean response, <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> is the y-intercept, and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> is the slope for the population model.<\/p>\n\n\n<hr \/>\n\nIn our population, there could be many different responses for a value of <em>x<\/em>. 
In simple linear regression, the model assumes that for each value of <em>x<\/em> the observed values of the response variable <em>y<\/em> are normally distributed with a mean that depends on <em>x<\/em>. We use <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span> to represent these means. We also assume that these means all lie on a straight line when plotted against <em>x<\/em> (a line of means).\n\n[caption id=\"\" align=\"aligncenter\" width=\"901\"]<img alt=\"11100.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171608\/11100.png\" width=\"901\" height=\"457\" \/> Figure 17. The statistical model for linear regression; the mean response is a straight-line function of the predictor variable.[\/caption]\n\nThe sample data then fit the statistical model:\n<p class=\"Centered\" style=\"text-align: center\">Data = fit + residual<\/p>\n<p class=\"Centered\"><img alt=\"11974.png\" class=\"frame-7 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171610\/11974.png\" \/><\/p>\nwhere the errors (<span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b5<\/span><span class=\"Subscript SmallText\">i<\/span>) are independent and normally distributed <em>N<\/em> (0, <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span>). Linear regression also assumes equal variance of <em>y<\/em> (<span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span> is the same for all values of <em>x<\/em>). We use <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b5<\/span> (Greek epsilon) to stand for the residual part of the statistical model. A response <em>y<\/em> is the sum of its mean and chance deviation <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b5<\/span> from the mean. The deviations <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b5<\/span> represent the \u201cnoise\u201d in the data. 
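The statistical model Data = fit + residual can be illustrated by simulation. The sketch below reuses the stream-flow coefficients from Example 1 as hypothetical population parameters (β0 = 1.6, β1 = 29) with an assumed σ = 2; the seed and grid of x values are likewise arbitrary:

```python
# Simulate the population model y = beta0 + beta1*x + epsilon,
# with errors drawn independently from N(0, sigma).
import random

random.seed(42)
beta0, beta1, sigma = 1.6, 29.0, 2.0     # hypothetical population parameters
x_values = [0.1 * i for i in range(1, 51)]  # rainfall grid, 0.1 to 5.0 inches

# Each observed y is its mean response (a point on the line of means)
# plus a chance deviation epsilon from that mean.
y_values = [beta0 + beta1 * x + random.gauss(0.0, sigma) for x in x_values]

# The mean response at x = 1.0 lies exactly on the line of means:
mu_y_at_1 = beta0 + beta1 * 1.0
print(mu_y_at_1)  # 30.6
```

Fitting a least-squares line to such simulated pairs recovers estimates b0 and b1 that scatter around the true β0 and β1, which is precisely the sampling idea developed in the next section.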
In other words, the noise is the variation in <em>y<\/em> due to other causes that prevent the observed (<em>x, y<\/em>) from forming a perfectly straight line.\n\nThe sample data used for regression are the observed values of <em>y<\/em> and <em>x<\/em>. The response <em>y<\/em> to a given <em>x<\/em> is a random variable, and the regression model describes the mean and standard deviation of this random variable <em>y<\/em>. The intercept <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span>, slope <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span>, and standard deviation <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span> of <em>y<\/em> are the unknown parameters of the regression model and must be estimated from the sample data.\n<ul><li class=\"List-Paragraph\">The value of <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> from the least squares regression line is really a prediction of the mean value of <em>y<\/em> (<span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span>) for a given value of <em>x<\/em>.<\/li>\n \t<li class=\"List-Paragraph\">The least squares regression line ( <img alt=\"12009.png\" class=\"frame-12\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171611\/12009.png\" \/>) obtained from sample data is the best estimate of the true population regression line\n(<img alt=\"12014.png\" class=\"frame-12\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171612\/12014.png\" style=\"font-size: 0.917em;line-height: 1.273\" \/>).<\/li>\n<\/ul><p class=\"Callout\"><span class=\"pullquote-left\"><span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> is an unbiased estimate for the mean response <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span 
class=\"Subscript SmallText\">y\n<\/span><em class=\"char-style-override-2\">b<\/em><span class=\"Subscript SmallText\">0<\/span> is an unbiased estimate for the intercept <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0\n<\/span><em class=\"char-style-override-2\">b<\/em><span class=\"Subscript SmallText\">1<\/span> is an unbiased estimate for the slope <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span><\/span><\/p>\n\n<h2>Parameter Estimation<\/h2>\nOnce we have estimates of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> (from our sample data, <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span>), the linear relationship determines the estimates of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span> for all values of <em>x<\/em> in our population, not just for the observed values of <em>x<\/em>. We now want to use the least-squares line as a basis for inference about the population from which our sample was drawn.\n\nModel assumptions tell us that <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span> are normally distributed with means <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> and standard deviations that can be estimated from the data. Procedures for inference about the population regression line will be similar to those described in the previous chapter for means. 
As always, it is important to examine the data for outliers and influential observations.\n\nIn order to do this, we need to estimate <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span>, the regression standard error. This is the standard deviation of the model errors. It measures the variation of <em>y<\/em> about the population regression line. We will use the residuals to compute this value. Remember, the predicted value of <em>y<\/em> (<span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span>) for a specific <em>x<\/em> is the point on the regression line. It is the unbiased estimate of the mean response (<span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span>) for that <em>x<\/em>. The residual is:\n<p class=\"Centered\">residual = observed \u2013 predicted<\/p>\n<p class=\"Centered\"><em>e<\/em><span class=\"Subscript SmallText\">i<\/span> = <em>y<\/em><span class=\"Subscript SmallText\">i<\/span> \u2013 <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> = <span class=\"Inline-Equation\"><img alt=\"12066.png\" class=\"frame-6\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171613\/12066.png\" \/><\/span><\/p>\nThe residual <em>e<\/em><span class=\"Subscript SmallText\">i<\/span> corresponds to the model deviation <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b5<\/span><span class=\"Subscript SmallText\">i<\/span>; the residuals sum to zero (<strong class=\"SymbolsBold\" xml:lang=\"ar-SA\">\u03a3<\/strong> <em>e<\/em><span class=\"Subscript SmallText\">i<\/span> = 0) and have a mean of 0. 
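To make the residual computation concrete, here is a minimal stdlib-Python sketch; the five (x, y) pairs are made up for illustration and are not data from this chapter:

```python
# Minimal sketch of least-squares residuals, e_i = y_i - (b0 + b1 * x_i).
# The five (x, y) pairs below are made up for illustration only.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [3.1, 5.2, 6.8, 9.4, 10.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

# Residuals: observed minus predicted
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# The residuals always sum to zero (up to floating-point rounding)
print(sum(residuals))
```

Note that the zero sum is a property of the least-squares fit itself, not of this particular data set; any data run through the same formulas will show it.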
The regression standard error <em>s<\/em> is an estimate of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span>.\n<p class=\"Centered\"><img alt=\"12076.png\" class=\"frame-104\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171614\/12076.png\" \/><\/p>\nThe quantity <em>s<\/em> is the estimate of the regression standard error (<span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span>) and <em>s<\/em><span class=\"Superscript SmallText\">2<\/span> is often called the mean square error (MSE). A small value of <em>s<\/em> suggests that observed values of <em>y<\/em> fall close to the true regression line and the line <span class=\"Inline-Equation\"><img alt=\"12100.png\" class=\"frame-71\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171616\/12100.png\" \/><\/span> should provide accurate estimates and predictions.\n<h2>Confidence Intervals and Significance Tests for Model Parameters<\/h2>\nIn an earlier chapter, we constructed confidence intervals and did significance tests for the population parameter <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span> (the population mean). We relied on sample statistics such as the mean and standard deviation for point estimates, margins of error, and test statistics. Inference for the population parameters <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> (y-intercept) and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> (slope) is very similar.\n\nInference for the slope and intercept is based on the normal distribution using the estimates <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span>. 
The standard deviations of these estimates are multiples of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span>, the population regression standard error. Remember, we estimate <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span> with <em>s<\/em> (the variability of the data about the regression line). Because we use <em>s<\/em>, we rely on the student t-distribution with (<em>n<\/em> \u2013 2) degrees of freedom.\n<p class=\"Centered\"><img alt=\"12112.png\" class=\"frame-37\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171617\/12112.png\" \/><\/p>\n<p class=\"Centered\">The standard error for the estimate of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span><\/p>\n<p class=\"Centered\"><img alt=\"12122.png\" class=\"frame-37\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171619\/12122.png\" \/><\/p>\n<p class=\"Centered\">The standard error for the estimate of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span><\/p>\nWe can construct confidence intervals for the regression slope and intercept in much the same way as we did when estimating the population mean.\n\n<hr \/><p class=\"Call-out-First-line\">A <strong class=\"char-style-override-2\">confidence interval<\/strong> for <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> <em class=\"char-style-override-2\">: b<\/em><span class=\"Subscript SmallText\">0<\/span> \u00b1 t <span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span> SE<span class=\"Subscript SmallText\">b0<\/span><\/p>\n<p class=\"Call-out-Middle\">A <strong class=\"char-style-override-2\">confidence interval<\/strong> for <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript 
SmallText\">1<\/span> <em class=\"char-style-override-2\">:<\/em> <em class=\"char-style-override-2\">b<\/em><span class=\"Subscript SmallText\">1<\/span> \u00b1 t <span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span> SE<span class=\"Subscript SmallText\">b1<\/span><\/p>\n<p class=\"Call-out-End\">where SE<span class=\"Subscript SmallText\">b0<\/span> and SE<span class=\"Subscript SmallText\">b1<\/span> are the standard errors for the y-intercept and slope, respectively.<\/p>\n\n\n<hr \/>\n\nWe can also test the hypothesis H<span class=\"Subscript SmallText\">0<\/span>: <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> = 0. When we substitute <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> = 0 in the model, the x-term drops out and we are left with <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span> = <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span>. This tells us that the mean of <em>y<\/em> does NOT vary with <em>x<\/em>. 
In other words, there is no straight line relationship between <em>x<\/em> and <em>y<\/em> and the regression of <em>y<\/em> on <em>x<\/em> is of no value for predicting <em>y<\/em>.\n\n<hr \/><p class=\"Call-out-First-line\">Hypothesis test for <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span><\/p>\n<p class=\"Call-out-Middle\">H<span class=\"Subscript SmallText\">0<\/span>: <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> = 0<\/p>\n<p class=\"Call-out-Middle\">H<span class=\"Subscript SmallText\">1<\/span>: <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> <em class=\"char-style-override-2\">\u2260<\/em> 0<\/p>\n<p class=\"Call-out-Middle\">The test statistic is t = b<span class=\"Subscript SmallText\">1<\/span> \/ SE<span class=\"Subscript SmallText\">b1<\/span><\/p>\n<p class=\"Call-out-Middle\">We can also use the F-statistic (MSR\/MSE) in the regression ANOVA table*<\/p>\n<p class=\"Call-out-End\">*Recall that t<span class=\"Superscript SmallText\">2<\/span> = F<\/p>\n\n\n<hr \/>\n\nSo let\u2019s pull all of this together in an example.\n<div class=\"textbox examples\">\n<h3>Example 3<\/h3>\n<p class=\"Example\">The index of biotic integrity (IBI) is a measure of water quality in streams. As a manager for the natural resources in this region, you must monitor, track, and predict changes in water quality. You want to create a simple linear regression model that will allow you to predict changes in IBI in forested areas. The following table gives sample data from a coastal forest region: IBI and forested area in square kilometers. 
Let forest area be the predictor variable (x) and IBI be the response variable (y).<\/p>\n\n\n[caption id=\"\" align=\"aligncenter\" width=\"901\"]<img alt=\"11090.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171621\/11090.png\" width=\"901\" height=\"358\" \/> Table 1. Observed data of biotic integrity and forest area.[\/caption]\n<p class=\"Example\">We begin by computing descriptive statistics and creating a scatterplot of IBI against Forest Area.<\/p>\n<p class=\"ExampleCenter\" style=\"text-align: center\"><span class=\"Symbols\" xml:lang=\"ar-SA\"><em>x\u0304<\/em><\/span> = 47.42; <em>s<span class=\"Subscript SmallText\">x<\/span><\/em> = 27.37; <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0304<\/em><\/span> = 58.80; <em>s<span class=\"Subscript SmallText\">y<\/span><\/em> = 21.38; r = 0.735<\/p>\n\n\n[caption id=\"\" align=\"aligncenter\" width=\"621\"]<img alt=\"11080.png\" class=\"frame-4\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171623\/11080.png\" width=\"621\" height=\"415\" \/> Figure 18. Scatterplot of IBI vs. Forest Area.[\/caption]\n<p class=\"Example\">There appears to be a positive linear relationship between the two variables. The linear correlation coefficient is r = 0.735. This indicates a strong, positive, linear relationship. In other words, forest area is a good predictor of IBI. 
Now let\u2019s create a simple linear regression model using forest area to predict IBI (response).<\/p>\n<p class=\"Example\">First, we will compute <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span> using the shortcut equations.<\/p>\n<p class=\"ExampleCenter\"><span class=\"Inline-Equation-Large\"><img alt=\"12180.png\" class=\"frame-14\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171625\/12180.png\" \/><\/span>=<span class=\"Inline-Equation-Large\"><img alt=\"12189.png\" class=\"frame-12\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171626\/12189.png\" \/><\/span>=0.574<\/p>\n<p class=\"ExampleCenter\"><span class=\"Inline-Equation\"><img alt=\"12198.png\" class=\"frame-45\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171627\/12198.png\" \/><\/span><span class=\"Inline-Equation\"><img alt=\"12205.png\" class=\"frame-55\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171628\/12205.png\" \/><\/span>= 31.581<\/p>\n<p class=\"Example\">The regression equation is <span class=\"Inline-Equation\"><img alt=\"12216.png\" class=\"frame-87\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171629\/12216.png\" \/><\/span>.<\/p>\n<p class=\"Example\">Now let\u2019s use Minitab to compute the regression model. 
The output appears below.<\/p>\n\n<h4>Regression Analysis: IBI versus Forest Area<\/h4>\n<p class=\"Example\">The regression equation is IBI = 31.6 + 0.574 Forest Area<\/p>\n\n<table class=\"Table\" style=\"margin-left: 23px\"><colgroup><col \/><col \/><col \/><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table\">\n<p class=\"Table\">Predictor<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SE Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">T<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Constant<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">31.583<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">4.177<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">7.56<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Forest Area<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.57396<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">0.07648<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">7.50<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">S = 14.6505<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">R-Sq = 54.0%<\/strong><\/p>\n<\/td>\n<td class=\"Table\" colspan=\"2\">\n<p class=\"Table\">R-Sq(adj) = 53.0%<\/p>\n<\/td>\n<td class=\"Table\" \/>\n<\/tr><\/tbody><\/table><table class=\"Table\" style=\"margin-left: 23px\"><colgroup><col \/><col \/><col \/><col \/><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table-Heading\" colspan=\"6\">\n<p class=\"Table-Heading\">Analysis of Variance<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Source<\/p>\n<\/td>\n<td 
class=\"Table\">\n<p class=\"Table\">DF<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">MS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">F<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Regression<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">1<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">12089<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">12089<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">56.32<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Residual Error<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">48<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">10303<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">215<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">\u00a0<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">\u00a0<\/strong><\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Total<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">49<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">22392<\/p>\n<\/td>\n<td class=\"Table\" \/>\n<td class=\"Table\" \/>\n<td class=\"Table\" \/>\n<\/tr><\/tbody><\/table><p class=\"Example\">The estimates for <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> are 31.6 and 0.574, respectively. We can interpret the y-intercept to mean that when there is zero forested area, the IBI will equal 31.6. 
For each additional square kilometer of forested area added, the IBI will increase by 0.574 units.<\/p>\n<p class=\"Example\">The coefficient of determination, R<span class=\"Superscript SmallText\">2<\/span>, is 54.0%. This means that 54% of the variation in IBI is explained by this model. Approximately 46% of the variation in IBI is due to other factors or random variation. We would like R<span class=\"Superscript SmallText\">2<\/span> to be as high as possible (maximum value of 100%).<\/p>\n<p class=\"Example\">The residual and normal probability plots do not indicate any problems.<\/p>\n\n\n[caption id=\"\" align=\"aligncenter\" width=\"990\"]<img alt=\"11070.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171631\/11070.png\" width=\"990\" height=\"311\" \/> Figure 19. A residual and normal probability plot.[\/caption]\n<p class=\"Example\">The estimate of <strong class=\"SymbolsBold\" xml:lang=\"ar-SA\">\u03c3<\/strong>, the regression standard error, is <em>s<\/em> = 14.6505. This is a measure of the variation of the observed values about the population regression line. We would like this value to be as small as possible. The MSE is equal to 215. Remember, the <span class=\"Inline-Equation\"><img alt=\"12275.png\" class=\"frame-23\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171633\/12275.png\" \/><\/span>= <em>s<\/em>. 
The standard errors for the coefficients are 4.177 for the y-intercept and 0.07648 for the slope.<\/p>\n<p class=\"Example\">We know that the values <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> = 31.6 and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span> = 0.574 are sample estimates of the true, but unknown, population parameters <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span>. We can construct 95% confidence intervals to better estimate these parameters. The critical value (t<span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span>) comes from the student t-distribution with (<span class=\"BoldItalic Strong-2\">n<\/span> \u2013 2) degrees of freedom. Our sample size is 50 so we would have 48 degrees of freedom. The closest table value is 2.009.<\/p>\n<p class=\"ExampleCenter\" style=\"text-align: center\">95% confidence intervals for <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span><\/p>\n<p class=\"ExampleCenter\" style=\"text-align: center\"><em>b<\/em><span class=\"Subscript SmallText\">0<\/span> \u00b1 t<span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span> SE<span class=\"Subscript SmallText\">b0<\/span> = 31.6 \u00b1 2.009(4.177) = (23.21, 39.99)<\/p>\n<p class=\"ExampleCenter\" style=\"text-align: center\"><em>b<\/em><span class=\"Subscript SmallText\">1<\/span> \u00b1 t<span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span> SE<span class=\"Subscript SmallText\">b1<\/span> = 0.574 \u00b1 2.009(0.07648) = (0.4204, 
0.7277)<\/p>\n<p class=\"Example\">The next step is to test that the slope is significantly different from zero using a 5% level of significance.<\/p>\n\n<table class=\"Table\" style=\"margin-left: 23px\"><colgroup><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table-Heading\">\n<p class=\"Table-Heading\">H<span class=\"Subscript SmallText\">0<\/span>: <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> =0<\/p>\n<\/td>\n<td class=\"Table-Heading\">\n<p class=\"Table-Heading\">H<span class=\"Subscript SmallText\">1<\/span>: <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> <em>\u2260<\/em>0<\/p>\n<\/td>\n<\/tr><\/tbody><\/table><p class=\"ExampleCenter\">t = b<span class=\"Subscript SmallText\">1<\/span> \/ SE<span class=\"Subscript SmallText\">b1<\/span> = 0.574\/0.07648 = 7.50523<\/p>\n<p class=\"Example\">We have 48 degrees of freedom and the closest critical value from the student t-distribution is 2.009. The test statistic is greater than the critical value, so we will reject the null hypothesis. The slope is significantly different from zero. 
We have found a statistically significant relationship between Forest Area and IBI.<\/p>\n<p class=\"Example\">The Minitab output also reports the test statistic and p-value for this test.<\/p>\n\n<table class=\"Table\" style=\"margin-left: 23px\"><colgroup><col \/><col \/><col \/><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table\" colspan=\"5\">\n<p class=\"Table\">The regression equation is IBI = 31.6 + 0.574 Forest Area<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Predictor<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SE Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">T<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Constant<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">31.583<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">4.177<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">7.56<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Forest Area<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.57396<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.07648<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">7.50<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">0.000<\/strong><\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">S = 14.6505<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">R-Sq = 54.0%<\/p>\n<\/td>\n<td class=\"Table\" colspan=\"2\">\n<p class=\"Table\">R-Sq(adj) = 53.0%<\/p>\n<\/td>\n<td class=\"Table\" \/>\n<\/tr><\/tbody><\/table><table class=\"Table\" style=\"margin-left: 23px\"><colgroup><col \/><col \/><col \/><col \/><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table\" colspan=\"6\">\n<p class=\"Table\">Analysis of Variance<\/p>\n<\/td>\n<\/tr><tr><td 
class=\"Table\">\n<p class=\"Table\">Source<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">DF<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">MS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">F<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Regression<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">1<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">12089<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">12089<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">56.32<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">0.000<\/strong><\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Residual Error<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">48<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">10303<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">215<\/p>\n<\/td>\n<td class=\"Table\" \/>\n<td class=\"Table\" \/>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Total<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">49<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">22392<\/p>\n<\/td>\n<td class=\"Table\" \/>\n<td class=\"Table\" \/>\n<td class=\"Table\" \/>\n<\/tr><\/tbody><\/table><p class=\"Example\">The t test statistic is 7.50 with an associated p-value of 0.000. The p-value is less than the level of significance (5%) so we will reject the null hypothesis. The slope is significantly different from zero. The same result can be found from the F-test statistic of 56.32 (7.505<span class=\"Superscript SmallText\">2<\/span> = 56.32). 
The p-value (0.000) is the same, and so is the conclusion.<\/p>\n\n<\/div>\n<h2 class=\"ExampleHeading\">Confidence Interval for <strong class=\"SymbolsBold\" xml:lang=\"ar-SA\">\u03bc<\/strong><span class=\"Subscript SmallText\">y<\/span><\/h2>\nNow that we have created a regression model built on a significant relationship between the predictor variable and the response variable, we are ready to use the model for\n<ul><li class=\"List-Paragraph\">estimating the average value of <em>y<\/em> for a given value of <em>x<\/em><\/li>\n \t<li class=\"List-Paragraph\">predicting a particular value of <em>y<\/em> for a given value of <em>x<\/em><\/li>\n<\/ul>\nLet\u2019s examine the first option. The sample of <em>n<\/em> pairs drawn from a population was used to compute the regression coefficients <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span> for our model, which gives us the average value of <em>y<\/em> for a specific value of <em>x<\/em> through our population model\n\n<span class=\"Picture\"><img alt=\"12315.png\" class=\"frame-56\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171634\/12315.png\" \/><\/span>. For every specific value of x, there is an average y (<span class=\"Symbols\" xml:lang=\"ar-SA\"><em>\u03bc<\/em><span class=\"Subscript SmallText\"><em>y<\/em><\/span><\/span>), which falls on the straight line equation (a line of means). 
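To make the line of means concrete, here is a small Python sketch using the fitted coefficients from Example 3 (b0 = 31.583 and b1 = 0.57396, taken from the Minitab output); each call returns the point on the line, i.e. the estimated mean IBI for that forested area:

```python
# Line of means from Example 3: the estimated mean IBI at a given
# forested area x is simply the point on the fitted line.
b0, b1 = 31.583, 0.57396  # coefficients reported by Minitab

def mean_response(x):
    """Estimated mean IBI for a forested area of x square kilometers."""
    return b0 + b1 * x

for x0 in (20, 40, 60, 80, 100):
    print(x0, round(mean_response(x0), 2))
```

Each printed value is a point estimate of the mean response at that x; the next paragraphs ask how far such an estimate can be from the true population mean.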
Remember that there can be many different observed values of <em>y<\/em> for a particular <em>x<\/em>, and these values are assumed to have a normal distribution with a mean equal to <span class=\"Inline-Equation\"><img alt=\"12336.png\" class=\"frame-21\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171635\/12336.png\" \/><\/span> and a variance of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span><span class=\"Superscript SmallText\">2<\/span>. Since the computed values of <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span> vary from sample to sample, each new sample may produce a slightly different regression equation. Each new model can be used to estimate a value of <em>y<\/em> for a value of <em>x<\/em>. How far will our estimator <span class=\"Inline-Equation\"><img alt=\"12346.png\" class=\"frame-44\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171636\/12346.png\" \/><\/span> be from the true population mean for that value of <em>x<\/em>? 
This depends, as always, on the variability in our estimator, measured by the standard error.\n\nIt can be shown that the estimated value of <em>y<\/em> when <em>x<\/em> = <em>x<\/em><span class=\"Subscript SmallText\">0<\/span> (some specified value of <em>x<\/em>) is an unbiased estimator of the population mean, and that <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> is normally distributed with a standard error of\n<p class=\"Centered\"><img alt=\"12371.png\" class=\"frame-32 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171637\/12371.png\" \/><\/p>\nWe can construct a confidence interval to better estimate this parameter (<span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span>) following the same procedure illustrated previously in this chapter.\n\n<span class=\"Picture\"><img alt=\"12387.png\" class=\"frame-56 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171638\/12387.png\" \/><\/span>where the critical value t<span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span> comes from the student t-table with (<em>n<\/em> \u2013 2) degrees of freedom.\n\nStatistical software, such as Minitab, will compute the confidence intervals for you. 
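The computation can also be sketched by hand. The Python sketch below uses the summary values from Example 3 (s = 14.6505, n = 50, x̄ = 47.42, sx = 27.37) and recovers Σ(x − x̄)² as (n − 1)sx²; with the table value t = 2.009 used earlier, the resulting interval is close to, but not identical to, software output that uses the exact critical value:

```python
import math

# Summary values from Example 3 (s and the coefficients come from the
# regression output; x_bar and s_x from the descriptive statistics).
s, n = 14.6505, 50          # regression standard error, sample size
x_bar, s_x = 47.42, 27.37   # mean and standard deviation of forest area
b0, b1 = 31.583, 0.57396    # fitted intercept and slope
t_crit = 2.009              # table value of t with 48 df, 95% confidence

x0 = 32                          # forested area (km^2) of interest
sxx = (n - 1) * s_x ** 2         # recovers the sum of squares (x - x_bar)^2
se_fit = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)  # about 2.384
fit = b0 + b1 * x0               # about 49.95, the estimated mean response

lower = fit - t_crit * se_fit
upper = fit + t_crit * se_fit
print(round(fit, 4), round(se_fit, 4), (round(lower, 2), round(upper, 2)))
```

The hand-computed standard error of the fit (about 2.384) matches the "SE Fit" that Minitab reports for this example.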
Using the data from the previous example, we will use Minitab to compute the 95% confidence interval for the mean response for an average forested area of 32 km<span class=\"Superscript SmallText\">2<\/span>.\n<table class=\"Table\"><colgroup><col \/><col \/><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table-Heading\" colspan=\"4\">\n<p class=\"Table-Heading\">Predicted Values for New Observations<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">New\u00a0Obs<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">Fit<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SE Fit<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">95% CI<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">1<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">49.9496<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">2.38400<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(45.1562, 54.7429)<\/p>\n<\/td>\n<\/tr><\/tbody><\/table>\nIf you sampled many areas that averaged 32 km<span class=\"Superscript SmallText\">2<\/span> of forested area, your estimate of the average IBI would be from 45.1562 to 54.7429.\n\nYou can repeat this process many times for several different values of <em>x<\/em> and plot the confidence intervals for the mean response.\n<table class=\"Table\"><colgroup><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table-Heading\">\n<p class=\"Table-Heading\"><strong>x<\/strong><\/p>\n<\/td>\n<td class=\"Table-Heading\">\n<p class=\"Table-Heading\"><strong>95% CI<\/strong><\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">20<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(37.13, 48.88)<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">40<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(50.22, 58.86)<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">60<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(61.43, 70.61)<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">80<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(70.98, 
84.02)<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">100<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(79.88, 98.07)<\/p>\n<\/td>\n<\/tr><\/tbody><\/table>\n[caption id=\"\" align=\"aligncenter\" width=\"901\"]<img alt=\"11060.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171640\/11060.png\" width=\"901\" height=\"480\" \/> Figure 20. 95% confidence intervals for the mean response.[\/caption]\n<p class=\"Caption\"><span class=\"Picture\" \/>Notice how the width of the 95% confidence interval varies for the different values of <em>x<\/em>. Since the confidence interval width is narrower for the central values of <em>x<\/em>, it follows that <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span> is estimated more precisely for values of <em>x<\/em> in this area. As you move towards the extreme limits of the data, the width of the intervals increases, indicating that it would be unwise to extrapolate beyond the limits of the data used to create this model.<\/p>\n\n<h2>Prediction Intervals<\/h2>\nWhat if you want to predict a <em>particular<\/em> value of <em>y<\/em> when <em>x<\/em> = <em>x<\/em><span class=\"Subscript SmallText\">0<\/span>? Or, perhaps you want to predict the next measurement for a given value of <em>x<\/em>? This problem differs from constructing a confidence interval for <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span>. Instead of constructing a confidence interval to estimate a population parameter, we need to construct a prediction interval. Choosing to predict a particular value of <em>y<\/em> incurs some additional error in the prediction because of the deviation of <em>y<\/em> from the line of means. Examine the figure below. 
You can see that the error in prediction has two components:\n<ol><li class=\"List-Paragraph-Number-1\">The error in using the fitted line to estimate the line of means<\/li>\n \t<li class=\"List-Paragraph-Number-1\">The error caused by the deviation of y from the line of means, measured by <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span><span class=\"Superscript SmallText\">2<\/span><\/li>\n<\/ol>\n[caption id=\"\" align=\"aligncenter\" width=\"553\"]<img alt=\"136.tif\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171642\/136_fmt.png\" width=\"553\" height=\"268\" \/> Figure 21. Illustrating the two components in the error of prediction.[\/caption]\n\nThe variance of the difference between y and <span class=\"Inline-Equation\"><img alt=\"13215.png\" class=\"frame-5\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171643\/13215.png\" \/><\/span> is the sum of these two variances and forms the basis for the standard error of <span class=\"Inline-Equation\"><img alt=\"12547.png\" class=\"frame-43\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171644\/12547.png\" \/><\/span> used for prediction. 
The resulting form of a prediction interval is as follows:\n<p class=\"Centered\"><img alt=\"12568.png\" class=\"frame-710 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171645\/12568.png\" \/><\/p>\nwhere <em>x<\/em><span class=\"Subscript SmallText\">0<\/span> is the given value for the predictor variable, <em>n<\/em> is the number of observations, and t<span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span> is the critical value with (<em>n<\/em> \u2013 2) degrees of freedom.\n\nSoftware, such as Minitab, can compute the prediction intervals. Using the data from the previous example, we will use Minitab to compute the 95% prediction interval for the IBI of a specific forested area of 32 km.\n<table class=\"Table\"><colgroup><col \/><col \/><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table-Heading\" colspan=\"4\">\n<p class=\"Table-Heading\">Predicted Values for New Observations<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">New Obs<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">Fit<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SE Fit<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">95% PI<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">1<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">49.9496<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">2.38400<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(20.1053, 79.7939)<\/p>\n<\/td>\n<\/tr><\/tbody><\/table>\nYou can repeat this process many times for several different values of <em>x<\/em> and plot the prediction intervals.\n<table class=\"Table\"><colgroup><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table-Heading\">\n<p class=\"Table-Heading\"><strong>x<\/strong><\/p>\n<\/td>\n<td class=\"Table-Heading\">\n<p class=\"Table-Heading\"><strong>95% 
PI<\/strong><\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">20<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(13.01, 73.11)<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">40<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(24.77, 84.31)<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">60<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(36.21, 95.83)<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">80<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(47.33, 107.67)<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">100<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(58.15, 119.81)<\/p>\n<\/td>\n<\/tr><\/tbody><\/table>\nNotice that the prediction interval bands are wider than the corresponding confidence interval bands, reflecting the fact that we are predicting the value of a random variable rather than estimating a population parameter. We would expect predictions for an individual value to be more variable than estimates of an average value.\n\n[caption id=\"\" align=\"aligncenter\" width=\"901\"]<img alt=\"10592.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171647\/10592.png\" width=\"901\" height=\"480\" \/> Figure 22. A comparison of confidence and prediction intervals.[\/caption]\n<h2>Transformations to Linearize Data Relationships<\/h2>\nIn many situations, the relationship between <em>x<\/em> and <em>y<\/em> is non-linear. In order to simplify the underlying model, we can transform or convert either <em>x<\/em> or <em>y<\/em> or both to result in a more linear relationship. There are many common transformations such as logarithmic and reciprocal. Including higher order terms on <em>x<\/em> may also help to linearize the relationship between <em>x<\/em> and <em>y<\/em>. 
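Although this chapter's output comes from Minitab, the before-and-after effect of a transformation can be sketched in a few lines of Python. The data below are hypothetical power-law values generated for illustration (not the textbook's data), loosely shaped like the volume-versus-dbh example that follows:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical power-law data y = a * x^b with multiplicative noise
# (NOT the book's data; the exponent 2.44 is chosen for illustration).
x = rng.uniform(5.0, 25.0, 200)
y = 0.06 * x**2.44 * np.exp(rng.normal(0.0, 0.3, 200))

def r_squared(u, v):
    """Coefficient of determination for a simple linear fit of v on u."""
    b1, b0 = np.polyfit(u, v, 1)
    resid = v - (b0 + b1 * u)
    return 1.0 - (resid @ resid) / ((v - v.mean()) @ (v - v.mean()))

r2_raw = r_squared(x, y)                     # straight line on the raw scales
r2_loglog = r_squared(np.log(x), np.log(y))  # straight line after an ln-ln transform

slope_loglog, intercept_loglog = np.polyfit(np.log(x), np.log(y), 1)
print(f"R-sq, raw scales: {r2_raw:.3f}")
print(f"R-sq, ln-ln:      {r2_loglog:.3f}")
print(f"ln-ln fit: ln(y) = {intercept_loglog:.2f} + {slope_loglog:.2f} ln(x)")
```

Because the underlying relationship here is a power function, the straight-line fit on the ln-ln scale is nearly exact and its slope recovers the exponent, which is why a log transformation linearizes this kind of data.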
Shown below are some common shapes of scatterplots and possible choices for transformations. However, the choice of transformation is frequently more a matter of trial and error than set rules.\n\n<img alt=\"Ch7DataRelationship4\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171650\/Ch7DataRelationship4.png\" width=\"1638\" height=\"661\" \/><img alt=\"Ch7DataRelationship3\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171652\/Ch7DataRelationship3.png\" width=\"1638\" height=\"661\" \/><img alt=\"Ch7DataRelationship2\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171655\/Ch7DataRelationship2.png\" width=\"1638\" height=\"661\" \/>\n\n[caption id=\"\" align=\"aligncenter\" width=\"1638\"]<img alt=\"Ch7DataRelationship1\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171658\/Ch7DataRelationship1.png\" width=\"1638\" height=\"661\" \/> Figure 23. Examples of possible transformations for x and y variables.[\/caption]\n\n<div class=\"textbox examples\">\n<h3>Example 4<\/h3>\n<p class=\"Example\">A forester needs to create a simple linear regression model to predict tree volume using diameter-at-breast height (dbh) for sugar maple trees. He collects dbh and volume for 236 sugar maple trees and plots volume versus dbh. Given below is the scatterplot, correlation coefficient, and regression output from Minitab.<\/p>\n\n\n[caption id=\"\" align=\"aligncenter\" width=\"741\"]<img alt=\"10541.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171700\/10541.png\" width=\"741\" height=\"496\" \/> Figure 24. 
Scatterplot of volume versus dbh.[\/caption]\n<p class=\"Example\">Pearson\u2019s linear correlation coefficient is 0.894, which indicates a strong, positive, linear relationship. However, the scatterplot shows a distinct nonlinear relationship.<\/p>\n\n<h4>Regression Analysis: volume versus dbh<\/h4>\n<table class=\"Table\" style=\"margin-left: 23px\"><colgroup><col \/><col \/><col \/><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table\" colspan=\"5\">The regression equation is volume = - 51.1 + 7.15 dbh<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Predictor<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SE Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">T<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Constant<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">-51.097<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">3.271<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">-15.62<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">dbh<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">7.1500<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.2342<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">30.53<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">S = 19.5820<\/p>\n<\/td>\n<td class=\"Table\" colspan=\"2\">\n<p class=\"Table\">R-Sq = 79.9%<\/p>\n<\/td>\n<td class=\"Table\" colspan=\"2\">\n<p class=\"Table\">R-Sq(adj) = 79.8%<\/p>\n<\/td>\n<\/tr><\/tbody><\/table><table class=\"Table\" style=\"margin-left: 23px\"><colgroup><col \/><col \/><col \/><col \/><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table\" colspan=\"6\">\n<p class=\"Table\">Analysis of Variance<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p 
class=\"Table\">Source<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">DF<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">MS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">F<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Regression<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">1<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">357397<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">357397<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">932.04<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Residual Error<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">234<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">89728<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">383<\/p>\n<\/td>\n<td class=\"Table\" \/>\n<td class=\"Table\" \/>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Total<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">235<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">447125<\/p>\n<\/td>\n<td class=\"Table\" \/>\n<td class=\"Table\" \/>\n<td class=\"Table\" \/>\n<\/tr><\/tbody><\/table><p class=\"Example\">The R<span class=\"Superscript SmallText\">2<\/span> is 79.9% indicating a fairly strong model and the slope is significantly different from zero. However, both the residual plot and the residual normal probability plot indicate serious problems with this model. A transformation may help to create a more linear relationship between volume and dbh.<\/p>\n\n\n[caption id=\"\" align=\"aligncenter\" width=\"996\"]<img alt=\"10531.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171703\/10531.png\" width=\"996\" height=\"363\" \/> Figure 25. 
Residual and normal probability plots.[\/caption]\n<p class=\"Example\">Volume was transformed to the natural log of volume and plotted against dbh (see scatterplot below). Unfortunately, this did little to improve the linearity of this relationship. The forester then took the natural log transformation of dbh. The scatterplot of the natural log of volume versus the natural log of dbh indicated a more linear relationship between these two variables. The linear correlation coefficient is 0.954.<\/p>\n\n\n[caption id=\"\" align=\"aligncenter\" width=\"921\"]<img alt=\"10521.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171705\/10521.png\" width=\"921\" height=\"310\" \/> Figure 26. Scatterplots of natural log of volume versus dbh and natural log of volume versus natural log of dbh.[\/caption]\n<p class=\"Example\">The regression analysis output from Minitab is given below.<\/p>\n\n<h4>Regression Analysis: lnVOL vs. 
lnDBH<\/h4>\n<table id=\"Table9\" class=\"Table\" style=\"margin-left: 23px\"><colgroup><col \/><col \/><col \/><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table\" colspan=\"5\">\n<p class=\"Table\">The regression equation is lnVOL = - 2.86 + 2.44 lnDBH<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Predictor<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SE Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">T<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Constant<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">-2.8571<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.1253<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">-22.80<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">lnDBH<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">2.44383<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.05007<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">48.80<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">S = 0.327327<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">R-Sq = 91.1%<\/p>\n<\/td>\n<td class=\"Table\" colspan=\"2\">\n<p class=\"Table\">R-Sq(adj) = 91.0%<\/p>\n<\/td>\n<td class=\"Table\" \/>\n<\/tr><\/tbody><\/table><table id=\"table-20\" class=\"Table\"><colgroup><col \/><col \/><col \/><col \/><col \/><col \/><\/colgroup><tbody><tr><td class=\"Table\" colspan=\"6\">\n<p class=\"Table\">Analysis of Variance<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Source<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">DF<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">MS<\/p>\n<\/td>\n<td class=\"Table\">\n<p 
class=\"Table\">F<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Regression<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">1<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">255.19<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">255.19<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">2381.78<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Residual Error<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">234<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">25.07<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.11<\/p>\n<\/td>\n<td class=\"Table\" \/>\n<td class=\"Table\" \/>\n<\/tr><tr><td class=\"Table\">\n<p class=\"Table\">Total<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">235<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">280.26<\/p>\n<\/td>\n<td class=\"Table\" \/>\n<td class=\"Table\" \/>\n<td class=\"Table\" \/>\n<\/tr><\/tbody><\/table>\n[caption id=\"\" align=\"aligncenter\" width=\"1050\"]<img alt=\"10512.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171708\/10512.png\" width=\"1050\" height=\"367\" \/> Figure 27. Residual and normal probability plots.[\/caption]\n<p class=\"Example\">The model using the transformed values of volume and dbh has a more linear relationship and a more positive correlation coefficient. The slope is significantly different from zero and the R<span class=\"Superscript SmallText\">2<\/span> has increased from 79.9% to 91.1%. The residual plot shows a more random pattern and the normal probability plot shows some improvement.<\/p>\n<p class=\"Example\">There are many possible transformation combinations possible to linearize data. 
Each situation is unique and the user may need to try several alternatives before selecting the best transformation for <em>x<\/em> or <em>y<\/em> or both.<\/p>\n\n<\/div>\n<h2 class=\"ExampleHeading\">Software Solutions<\/h2>\n<h3>Minitab<\/h3>\n<p class=\"Centered\"><img alt=\"145_1.tif\" class=\"frame-52 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171711\/145_1_fmt.png\" \/><img alt=\"145_2.tif\" class=\"frame-52 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171716\/145_2_fmt.png\" \/><\/p>\nThe Minitab output is shown above in Ex. 4.\n<h3>Excel<\/h3>\n<p class=\"No-Caption\"><span class=\"Picture\"><img alt=\"143_1.tif\" class=\"frame-106 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171719\/143_1_fmt.png\" \/><\/span><\/p>\n<p class=\"No-Caption\"><span class=\"Picture\"><img alt=\"143_2.tif\" class=\"frame-106 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171722\/143_2_fmt.png\" \/><\/span><\/p>\n<span class=\"Picture\"><img alt=\"143_3.tif\" class=\"frame-13 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171724\/143_3_fmt.png\" \/><\/span>\n\n[caption id=\"\" align=\"aligncenter\" width=\"578\"]<img alt=\"144.tif\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171726\/144_fmt.png\" width=\"578\" height=\"489\" \/> Figure 28. Residual and normal probability plots.[\/caption]\n\n<\/div>","rendered":"<div class=\"Basic-Text-Frame\">\n<p class=\"Chapter-Number\">In many studies, we measure more than one variable for each individual. 
For example, we measure precipitation and plant growth, or number of young with nesting habitat, or soil erosion and volume of water. We collect pairs of data and instead of examining each variable separately (univariate data), we want to find ways to describe <strong class=\"Strong-2\">bivariate data<\/strong>, in which two variables are measured on each subject in our sample. Given such data, we begin by determining if there is a relationship between these two variables. As the values of one variable change, do we see corresponding changes in the other variable?<\/p>\n<p>We can describe the relationship between these two variables graphically and numerically. We begin by considering the concept of correlation.<\/p>\n<p class=\"Callout\"><span class=\"pullquote-left\">Correlation is defined as the statistical association between two variables.<\/span><\/p>\n<p>A correlation exists between two variables when one of them is related to the other in some way. A scatterplot is the best place to start. A scatterplot (or scatter diagram) is a graph of the paired (x, y) sample data with a horizontal x-axis and a vertical y-axis. Each individual (x, y) pair is plotted as a single point.<\/p>\n<div style=\"width: 566px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11280.png\" class=\"frame-66\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171506\/11280.png\" width=\"556\" height=\"369\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 1. Scatterplot of chest girth versus length.<\/p>\n<\/div>\n<p>In this example, we plot bear chest girth (y) against bear length (x). When examining a scatterplot, we should study the overall pattern of the plotted points. In this example, we see that the value for chest girth does tend to increase as the value of length increases. 
We can see an upward slope and a straight-line pattern in the plotted data points.<\/p>\n<p>A scatterplot can identify several different types of relationships between two variables.<\/p>\n<ul>\n<li class=\"List-Paragraph\">A relationship has <strong class=\"Strong-2\">no correlation<\/strong> when the points on a scatterplot do not show any pattern.<\/li>\n<li class=\"List-Paragraph\">A relationship is <strong class=\"Strong-2\">non-linear<\/strong> when the points on a scatterplot follow a pattern but not a straight line.<\/li>\n<li class=\"List-Paragraph\">A relationship is <strong class=\"Strong-2\">linear<\/strong> when the points on a scatterplot follow a somewhat straight line pattern. This is the relationship that we will examine.<\/li>\n<\/ul>\n<p>Linear relationships can be either positive or negative. Positive relationships have points that incline upwards to the right. As <em>x<\/em> values increase, <em>y<\/em> values increase. As <em>x<\/em> values decrease, <em>y<\/em> values decrease. For example, when studying plants, height typically increases as diameter increases.<\/p>\n<div style=\"width: 612px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11268.png\" class=\"frame-80\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171508\/11268.png\" width=\"602\" height=\"399\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 2. Scatterplot of height versus diameter.<\/p>\n<\/div>\n<p class=\"Caption\">Negative relationships have points that decline downward to the right. As <em>x<\/em> values increase, <em>y<\/em> values decrease. As <em>x<\/em> values decrease, <em>y<\/em> values increase. 
For example, as wind speed increases, wind chill temperature decreases.<\/p>\n<div style=\"width: 629px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11256.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171510\/11256.png\" width=\"619\" height=\"410\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 3. Scatterplot of temperature versus wind speed.<\/p>\n<\/div>\n<p>Non-linear relationships have an apparent pattern, just not a linear one. For example, as age increases, height increases up to a point, then levels off after reaching a maximum height.<\/p>\n<div style=\"width: 637px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11245.png\" class=\"frame-4\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171513\/11245.png\" width=\"627\" height=\"419\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 4. Scatterplot of height versus age.<\/p>\n<\/div>\n<p>When two variables have no relationship, the points show neither a linear nor a non-linear pattern. When one variable changes, it does not influence the other variable.<\/p>\n<div style=\"width: 680px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11236.png\" class=\"frame-50\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171515\/11236.png\" width=\"670\" height=\"448\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 5. Scatterplot of growth versus area.<\/p>\n<\/div>\n<h2>Linear Correlation Coefficient<\/h2>\n<p>Because visual examinations are largely subjective, we need a more precise and objective measure to define the correlation between the two variables. 
To quantify the strength and direction of the relationship between two variables, we use the linear correlation coefficient:<\/p>\n<p class=\"Centered\"><img decoding=\"async\" alt=\"11226.png\" class=\"frame-74 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171517\/11226.png\" \/><\/p>\n<p>where <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>x\u0304<\/em><\/span> and <em>s<span class=\"Subscript SmallText\">x<\/span><\/em> are the sample mean and sample standard deviation of the <em>x<\/em>\u2019s, and <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0304<\/em><\/span> and <em>s<span class=\"Subscript SmallText\">y<\/span><\/em> are the mean and standard deviation of the <em>y<\/em>\u2019s. The sample size is <em>n<\/em>.<\/p>\n<p>An alternate computation of the correlation coefficient is:<\/p>\n<p class=\"No-Caption\"><span class=\"Picture\"><img decoding=\"async\" alt=\"11679.png\" class=\"frame-45 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171518\/11679.png\" \/><\/span><\/p>\n<p class=\"Centered\">where <span class=\"Inline-Equation-Large\"><img decoding=\"async\" alt=\"11691.png\" class=\"frame-7\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171519\/11691.png\" \/><\/span><\/p>\n<p class=\"Centered\"><span class=\"Inline-Equation-Large\"><img decoding=\"async\" alt=\"11702.png\" class=\"frame-55\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171520\/11702.png\" \/><\/span><\/p>\n<p class=\"Centered\"><span class=\"Inline-Equation-Large\"><img decoding=\"async\" alt=\"11709.png\" class=\"frame-7\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171521\/11709.png\" \/><\/span><\/p>\n<p>The linear correlation coefficient is also 
referred to as Pearson\u2019s product moment correlation coefficient in honor of Karl Pearson, who originally developed it. This statistic numerically describes how strong the straight-line or linear relationship is between the two variables and the direction, positive or negative.<\/p>\n<p><strong class=\"Strong-2\">The properties of \u201cr\u201d:<\/strong><\/p>\n<ul>\n<li class=\"List-Paragraph\">It is always between -1 and +1.<\/li>\n<li class=\"List-Paragraph\">It is a unitless measure so \u201cr\u201d would be the same value whether you measured the two variables in pounds and inches or in grams and centimeters.<\/li>\n<li class=\"List-Paragraph\">Positive values of \u201cr\u201d are associated with positive relationships.<\/li>\n<li class=\"List-Paragraph\">Negative values of \u201cr\u201d are associated with negative relationships.<\/li>\n<\/ul>\n<h3>Examples of Positive Correlation<\/h3>\n<div style=\"width: 1031px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" alt=\"11215.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171524\/11215.png\" width=\"1021\" height=\"864\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 6. Examples of positive correlation.<\/p>\n<\/div>\n<h3>Examples of Negative Correlation<\/h3>\n<div style=\"width: 992px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" alt=\"11205.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171527\/11205.png\" width=\"982\" height=\"802\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 7. 
Examples of negative correlation.<\/p>\n<\/div>\n<p class=\"Callout\"><span class=\"pullquote-left\"><strong class=\"char-style-override-2\">Correlation is not causation!!!<\/strong> Just because two variables are correlated does not mean that one variable causes another variable to change.<\/span><\/p>\n<p>Examine these next two scatterplots. Both of these data sets have an r = 0.01, but they are very different. Plot 1 shows little linear relationship between <em>x<\/em> and <em>y<\/em> variables. Plot 2 shows a strong non-linear relationship. Pearson\u2019s linear correlation coefficient only measures the strength and direction of a linear relationship. Ignoring the scatterplot could result in a serious mistake when describing the relationship between two variables.<\/p>\n<div style=\"width: 948px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11196.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171530\/11196.png\" width=\"938\" height=\"301\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 8. Comparison of scatterplots.<\/p>\n<\/div>\n<p class=\"Caption\"><span class=\"Picture\">When you investigate the relationship between two variables, always begin with a scatterplot. This graph allows you to look for patterns (both linear and non-linear). The next step is to quantitatively describe the strength and direction of the linear relationship using \u201cr\u201d. Once you have established that a linear relationship exists, you can take the next step in model building.<\/span><\/p>\n<h2>Simple Linear Regression<\/h2>\n<p>Once we have identified two variables that are correlated, we would like to model this relationship. 
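The defining formula for "r" given above (the average product of z-scores, using sample standard deviations) can be checked numerically. This is a sketch with small hypothetical (x, y) values, not data from the text; numpy's built-in corrcoef is used only for comparison:

```python
import numpy as np

# Hypothetical paired measurements (not the textbook's data).
x = np.array([12.0, 15.0, 17.0, 19.0, 22.0, 25.0, 28.0])
y = np.array([30.0, 34.0, 33.0, 39.0, 41.0, 44.0, 49.0])

n = len(x)
# Defining formula: r = sum(z_x * z_y) / (n - 1), with sample SDs (ddof=1).
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r = (zx * zy).sum() / (n - 1)

print(f"r from the defining formula: {r:.4f}")
print(f"numpy corrcoef:              {np.corrcoef(x, y)[0, 1]:.4f}")  # matches
```

Note that r is unitless: rescaling x or y (say, inches to centimeters) leaves the z-scores, and therefore r, unchanged.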
We want to use one variable as a <strong class=\"Strong-2\">predictor<\/strong> or <strong class=\"Strong-2\">explanatory<\/strong> variable to explain the other variable, the <strong class=\"Strong-2\">response<\/strong> or <strong class=\"Strong-2\">dependent<\/strong> variable. In order to do this, we need a good relationship between our two variables. The model can then be used to predict changes in our response variable. A strong relationship between the predictor variable and the response variable leads to a good model.<\/p>\n<div style=\"width: 481px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11187.png\" class=\"frame-172\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171532\/11187.png\" width=\"471\" height=\"311\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 9. Scatterplot with regression model.<\/p>\n<\/div>\n<p class=\"Callout\"><span class=\"pullquote-left\">A simple linear regression model is a mathematical equation that allows us to predict a response for a given predictor value.<\/span><\/p>\n<p>Our model will take the form of <em><span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> = b <span class=\"Subscript SmallText\">0<\/span> + b<span class=\"Subscript SmallText\">1<\/span>x<\/em> where <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> is the y-intercept, <em>b<\/em><span class=\"Subscript SmallText\">1<\/span> is the slope, <em>x<\/em> is the predictor variable, and <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> an estimate of the mean value of the response variable for any value of the predictor variable.<\/p>\n<p>The y-intercept is the predicted value for the response (<em>y<\/em>) when <em>x<\/em> = 0. The slope describes the change in <em>y<\/em> for each one unit change in <em>x<\/em>. 
Let\u2019s look at this example to clarify the interpretation of the slope and intercept.<\/p>\n<div class=\"textbox examples\">\n<h3>Example 1<\/h3>\n<p class=\"Example\">A hydrologist creates a model to predict the volume flow for a stream at a bridge crossing with a predictor variable of daily rainfall in inches.<\/p>\n<p class=\"Example\"><span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> = 1.6 + 29<em>x<\/em>. The y-intercept of 1.6 can be interpreted this way: On a day with no rainfall, there will be 1.6 gal. of water\/min. flowing in the stream at that bridge crossing. The slope tells us that if it rained one inch that day the flow in the stream would increase by an additional 29 gal.\/min. If it rained 2 inches that day, the flow would increase by an additional 58 gal.\/min.<\/p>\n<\/div>\n<div class=\"textbox examples\">\n<h3>Example 2<\/h3>\n<p class=\"ExampleHeading\">What would be the average stream flow if it rained 0.45 inches that day?<\/p>\n<p class=\"ExampleCenter\" style=\"text-align: center\"><span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> = 1.6 + 29<em>x<\/em> = 1.6 + 29(0.45) = 14.65 gal.\/min.<\/p>\n<\/div>\n<hr \/>\n<p class=\"Call-out-First-line\" style=\"text-align: center\">The Least-Squares Regression Line (shortcut equations)<\/p>\n<p class=\"Call-out-Middle\" style=\"text-align: center\">The equation is given by <em><span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> = b <span class=\"Subscript SmallText\">0<\/span> + b<span class=\"Subscript SmallText\">1<\/span> x<\/em><\/p>\n<p class=\"Call-out-Middle\" style=\"text-align: center\">where <span class=\"Inline-Equation-Large\"><img decoding=\"async\" alt=\"13279.png\" class=\"frame-44\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171534\/13279.png\" \/><\/span> is the slope and <em>b<span class=\"Subscript SmallText\">0<\/span> = <span class=\"Symbols\" 
xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> &#8211; b<span class=\"Subscript SmallText\">1<\/span> <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>x\u0304<\/em><\/span><\/em> is the y-intercept of the regression line.<\/p>\n<p class=\"Call-out-Middle\" style=\"text-align: center\">An alternate computational equation for slope is:<\/p>\n<p class=\"Call-out-End\" style=\"text-align: center\"><img decoding=\"async\" alt=\"13297.png\" class=\"frame-74\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171535\/13297.png\" \/><\/p>\n<hr \/>\n<p>This simple model is the line of best fit for our sample data. The regression line does not go through every point; instead it balances the difference between all data points and the straight-line model. The difference between the observed data value and the predicted value (the value on the straight line) is the error or <strong class=\"Strong-2\">residual.<\/strong> The criterion to determine the line that best describes the relation between two variables is based on the residuals.<\/p>\n<p class=\"Centered\" style=\"text-align: center\">Residual = Observed &#8211; Predicted<\/p>\n<p>For example, if you wanted to predict the chest girth of a black bear given its weight, you could use the following model.<\/p>\n<p class=\"Centered\" style=\"text-align: center\">Chest girth = 13.2 +0.43 weight<\/p>\n<p>The predicted chest girth of a bear that weighed 120 lb. is 64.8 in.<\/p>\n<p class=\"Centered\" style=\"text-align: center\">Chest girth = 13.2 + 0.43(120) = 64.8 in.<\/p>\n<p>But a measured bear chest girth (observed value) for a bear that weighed 120 lb. was actually 62.1 in.<\/p>\n<p class=\"Centered\" style=\"text-align: center\">The residual would be 62.1 \u2013 64.8 = -2.7 in.<\/p>\n<p>A negative residual indicates that the model is over-predicting. A positive residual indicates that the model is under-predicting. 
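<\/p>
<p>The arithmetic in this example is easy to check with a short Python sketch (a sketch only; the function name is ours, while the model and the 62.1 in. observation come from the example above):<\/p>

```python
# Chest girth model from the example: girth = 13.2 + 0.43 * weight
def predict_girth(weight_lb):
    return 13.2 + 0.43 * weight_lb

predicted = predict_girth(120)   # predicted girth for a 120-lb. bear
observed = 62.1                  # the measured girth for this bear
residual = observed - predicted  # negative, so the model over-predicts
print(round(predicted, 1), round(residual, 1))  # 64.8 -2.7
```

<p>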
In this instance, the model over-predicted the chest girth of a bear that actually weighed 120 lb.<\/p>\n<div style=\"width: 909px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"Image37921.PNG\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171537\/Image37921_fmt.png\" width=\"899\" height=\"537\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 10. Scatterplot with regression model illustrating a residual value.<\/p>\n<\/div>\n<p>This random error (residual) takes into account all unpredictable and unknown factors that are not included in the model. An ordinary least squares regression line minimizes the sum of the squared errors between the observed and predicted values to create a best fitting line. The differences between the observed and predicted values are squared to deal with the positive and negative differences.<\/p>\n<h2>Coefficient of Determination<\/h2>\n<p>After we fit our regression line (compute <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span>), we usually wish to know how well the model fits our data. To determine this, we need to think back to the idea of analysis of variance. In ANOVA, we partitioned the variation using sums of squares so we could identify a treatment effect opposed to random variation that occurred in our data. The idea is the same for regression. We want to partition the total variability into two parts: the variation due to the regression and the variation due to random error. 
And we are again going to compute sums of squares to help us do this.<\/p>\n<p>Suppose the total variability in the sample measurements about the sample mean is denoted by <span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"11856.png\" class=\"frame-71\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171540\/11856.png\" \/><\/span>, called the <strong class=\"Strong-2\">sums of squares of total variability about the mean (SST)<\/strong>. The squared difference between the predicted value <span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"13147.png\" class=\"frame-5\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171541\/13147.png\" \/><\/span> and the sample mean is denoted by <span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"11878.png\" class=\"frame-17\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171541\/11878.png\" \/><\/span>, called the <strong class=\"Strong-2\">sums of squares due to regression (SSR)<\/strong>. The SSR represents the variability explained by the regression line. Finally, the variability which cannot be explained by the regression line is called the <strong class=\"Strong-2\">sums of squares due to error (SSE)<\/strong> and is denoted by <span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"11892.png\" class=\"frame-17\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171542\/11892.png\" \/><\/span>. 
SSE is the sum of the squared residuals.<\/p>\n<table class=\"Table\">\n<colgroup>\n<col \/>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table\">\n<p class=\"Table-Heading\">SST<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table-Heading\">= SSR<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table-Heading\">+ SSE<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\"><span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"11902.png\" class=\"frame-101\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171543\/11902.png\" \/><\/span><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">= <span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"11906.png\" class=\"frame-102\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171545\/11906.png\" \/><\/span><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">+<span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"11912.png\" class=\"frame-103\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171546\/11912.png\" \/><\/span><\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div style=\"width: 760px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11168.png\" class=\"frame-40\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171547\/11168.png\" width=\"750\" height=\"500\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 11. An illustration of the relationship between the mean of the y\u2019s and the predicted and observed value of a specific y.<\/p>\n<\/div>\n<p>The sums of squares and mean sums of squares (just like ANOVA) are typically presented in the regression analysis of variance table. 
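<\/p>
<p>The partition SST = SSR + SSE can be confirmed numerically. Here is a minimal Python sketch on a small invented data set (the numbers are illustrative, not from this chapter):<\/p>

```python
from statistics import mean

# Fit a least-squares line to invented data and partition the variability
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

xbar, ybar = mean(x), mean(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SST = sum((yi - ybar) ** 2 for yi in y)               # total variability
SSR = sum((yh - ybar) ** 2 for yh in yhat)            # explained by the line
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained
print(round(SST, 4) == round(SSR + SSE, 4))  # True: SST = SSR + SSE
```

<p>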
The ratio of the mean sums of squares for the regression (MSR) and mean sums of squares for error (MSE) form an F-test statistic used to test the regression model.<\/p>\n<p>The relationship between these sums of square is defined as<\/p>\n<p class=\"Centered\"><strong class=\"Strong-2\">Total Variation = Explained Variation + Unexplained Variation<\/strong><\/p>\n<p>The larger the explained variation, the better the model is at prediction. The larger the unexplained variation, the worse the model is at prediction. A quantitative measure of the explanatory power of a model is R<span class=\"Superscript SmallText\">2<\/span>, the Coefficient of Determination:<\/p>\n<p class=\"Centered\"><img decoding=\"async\" alt=\"11934.png\" class=\"frame-99 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171549\/11934.png\" \/><\/p>\n<p>The Coefficient of Determination measures the percent variation in the response variable (<em>y<\/em>) that is explained by the model.<\/p>\n<ul>\n<li class=\"List-Paragraph\">Values range from 0 to 1.<\/li>\n<li class=\"List-Paragraph\">An R<span class=\"Superscript SmallText\">2<\/span> close to zero indicates a model with very little explanatory power.<\/li>\n<li class=\"List-Paragraph\">An R<span class=\"Superscript SmallText\">2<\/span> close to one indicates a model with more explanatory power.<\/li>\n<\/ul>\n<p>The Coefficient of Determination and the linear correlation coefficient are related mathematically.<\/p>\n<p class=\"Centered\" style=\"text-align: center\">R<sup>2<\/sup>\u00a0= r<sup>2<\/sup><\/p>\n<p>However, they have two very different meanings: <em>r<\/em> is a measure of the strength and direction of a linear relationship between two variables; <em>R<\/em><span class=\"Superscript SmallText\">2<\/span> describes the percent variation in \u201c<em>y<\/em>\u201d that is explained by the model.<\/p>\n<h2>Residual and Normal Probability Plots<\/h2>\n<p>Even 
though you have determined, using a scatterplot, the correlation coefficient, and R<span class=\"Superscript SmallText\">2<\/span>, that <em>x<\/em> is useful in predicting the value of <em>y<\/em>, the results of a regression analysis are valid only when the data satisfy the necessary regression assumptions.<\/p>\n<ol>\n<li class=\"List-Paragraph-Number-1\">The response variable (y) is a random variable while the predictor variable (x) is assumed non-random or fixed and measured without error.<\/li>\n<li class=\"List-Paragraph-Number-1\">The relationship between <em>y<\/em> and <em>x<\/em> must be linear, given by the model <img decoding=\"async\" alt=\"13333.png\" class=\"frame-6\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171551\/13333.png\" \/>.<\/li>\n<li class=\"List-Paragraph-Number-1\">The values of the random error term <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b5<\/span> are independent, have a mean of 0 and a common variance <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span><span class=\"Superscript SmallText\">2<\/span> (independent of <em>x<\/em>), and are normally distributed.<\/li>\n<\/ol>\n<p>We can use <strong class=\"Strong-2\">residual plots<\/strong> to check for a constant variance, as well as to make sure that the linear model is in fact adequate. A residual plot is a scatterplot of the residual (= observed &#8211; predicted value) versus the predicted or fitted value. The center horizontal axis is set at zero. One property of the residuals is that they sum to zero and have a mean of zero. 
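<\/p>
<p>The zero-sum property of least-squares residuals is easy to verify; a minimal Python sketch on invented data:<\/p>

```python
from statistics import mean

# Fit a least-squares line to invented data, then check the property
x = [2, 4, 6, 8]
y = [1.0, 2.2, 2.8, 4.1]

xbar, ybar = mean(x), mean(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(round(abs(sum(residuals)), 10))  # 0.0: residuals sum to zero
```

<p>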
A residual plot should be free of any patterns and the residuals should appear as a random scatter of points about zero.<\/p>\n<p>A residual plot with no appearance of any patterns indicates that the model assumptions are satisfied for these data.<\/p>\n<div style=\"width: 610px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11155.png\" class=\"frame-109\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171553\/11155.png\" width=\"600\" height=\"399\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 12. A residual plot.<\/p>\n<\/div>\n<p>A residual plot that has a \u201cfan shape\u201d indicates a heterogeneous variance (non-constant variance). The residuals tend to fan out or fan in as error variance increases or decreases.<\/p>\n<div style=\"width: 639px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11142.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171555\/11142.png\" width=\"629\" height=\"419\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 13. A residual plot that indicates a non-constant variance.<\/p>\n<\/div>\n<p>A residual plot that tends to \u201cswoop\u201d indicates that a linear model may not be appropriate. The model may need higher-order terms of <em>x<\/em>, or a non-linear model may be needed to better describe the relationship between <em>y<\/em> and <em>x<\/em>. Transformations on <em>x<\/em> or <em>y<\/em> may also be considered.<\/p>\n<div style=\"width: 667px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11131.png\" class=\"frame-47\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171557\/11131.png\" width=\"657\" height=\"440\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 14. 
A residual plot that indicates the need for a higher order model.<\/p>\n<\/div>\n<p class=\"Caption\">A <strong class=\"Strong-2\">normal probability plot<\/strong> allows us to check that the errors are normally distributed. It plots the residuals against the expected value of the residual as if it had come from a normal distribution. Recall that when the residuals are normally distributed, they will follow a straight-line pattern, sloping upward.<\/p>\n<p>This plot is not unusual and does not indicate any non-normality with the residuals.<\/p>\n<div style=\"width: 667px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11121.png\" class=\"frame-47\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171559\/11121.png\" width=\"657\" height=\"440\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 15. A normal probability plot.<\/p>\n<\/div>\n<p>This next plot clearly illustrates a non-normal distribution of the residuals.<\/p>\n<div style=\"width: 667px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11111.png\" class=\"frame-47\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171602\/11111.png\" width=\"657\" height=\"440\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 16. A normal probability plot, which illustrates non-normal distribution.<\/p>\n<\/div>\n<p>The most serious violations of normality usually appear in the tails of the distribution because this is where the normal distribution differs most from other types of distributions with a similar mean and spread. 
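<\/p>
<p>The idea behind the normal probability plot can be sketched with the Python standard library: pair the sorted residuals with expected normal quantiles and see how close the pairing is to a straight line. This is only an illustration of the idea (the residual values below are invented, and the quantile-correlation check is our substitute for visual inspection):<\/p>

```python
from statistics import NormalDist, mean

residuals = [-2.1, -0.8, -0.3, 0.1, 0.4, 0.9, 1.8]  # invented residuals
n = len(residuals)
r_sorted = sorted(residuals)

# Expected normal quantiles at plotting positions (i + 0.5) / n
q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Correlation between sorted residuals and normal quantiles;
# a value near 1 is the "straight-line pattern" seen on the plot
mq, mr = mean(q), mean(r_sorted)
num = sum((a - mq) * (b - mr) for a, b in zip(q, r_sorted))
den = (sum((a - mq) ** 2 for a in q) * sum((b - mr) ** 2 for b in r_sorted)) ** 0.5
corr = num / den
print(round(corr, 3))
```

<p>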
Curvature in either or both ends of a normal probability plot is indicative of nonnormality.<\/p>\n<h2>Population Model<\/h2>\n<p>Our regression model is based on a sample of <em>n<\/em> bivariate observations drawn from a larger population of measurements.<\/p>\n<p class=\"Centered\"><img decoding=\"async\" alt=\"11952.png\" class=\"frame-46 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171606\/11952.png\" \/><\/p>\n<p>We use the means and standard deviations of our sample data to compute the slope (<em>b<\/em><span class=\"Subscript SmallText\">1<\/span>) and y-intercept (<em>b<\/em><span class=\"Subscript SmallText\">0<\/span>) in order to create an ordinary least-squares regression line. But we want to describe the relationship between <em>y<\/em> and <em>x<\/em> in the population, not just within our sample data. We want to construct a <strong class=\"Strong-2\">population model<\/strong>. Now we will think of the least-squares line computed from a sample as an estimate of the true regression line for the population.<\/p>\n<hr \/>\n<p class=\"Callout\"><strong class=\"char-style-override-2\">The Population Model<\/strong><br \/>\n<span class=\"Picture\"><img decoding=\"async\" alt=\"11964.png\" class=\"frame-45\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171607\/11964.png\" \/><\/span>, where <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span> is the population mean response, <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> is the y-intercept, and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> is the slope for the population model.<\/p>\n<hr \/>\n<p>In our population, there could be many different responses for a value of <em>x<\/em>. 
In simple linear regression, the model assumes that for each value of <em>x<\/em> the observed values of the response variable <em>y<\/em> are normally distributed with a mean that depends on <em>x<\/em>. We use <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span> to represent these means. We also assume that these means all lie on a straight line when plotted against <em>x<\/em> (a line of means).<\/p>\n<div style=\"width: 911px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11100.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171608\/11100.png\" width=\"901\" height=\"457\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 17. The statistical model for linear regression; the mean response is a straight-line function of the predictor variable.<\/p>\n<\/div>\n<p>The sample data then fit the statistical model:<\/p>\n<p class=\"Centered\" style=\"text-align: center\">Data = fit + residual<\/p>\n<p class=\"Centered\"><img decoding=\"async\" alt=\"11974.png\" class=\"frame-7 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171610\/11974.png\" \/><\/p>\n<p>where the errors (<span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b5<\/span><span class=\"Subscript SmallText\">i<\/span>) are independent and normally distributed <em>N<\/em> (0, <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span>). Linear regression also assumes equal variance of <em>y<\/em> (<span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span> is the same for all values of <em>x<\/em>). We use <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b5<\/span> (Greek epsilon) to stand for the residual part of the statistical model. A response <em>y<\/em> is the sum of its mean and chance deviation <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b5<\/span> from the mean. 
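<\/p>
<p>The statistical model data = fit + residual can be illustrated by simulation; in this minimal Python sketch the population parameters are invented for demonstration:<\/p>

```python
import random

random.seed(42)  # reproducible "chance deviations"
beta0, beta1, sigma = 10.0, 2.5, 1.0   # invented population parameters

# Each observed y is its mean response (the fit) plus a chance
# deviation epsilon ~ N(0, sigma) (the residual, or noise)
x_values = [1, 2, 3, 4, 5]
y_values = [beta0 + beta1 * x + random.gauss(0, sigma) for x in x_values]

for x, y in zip(x_values, y_values):
    mean_response = beta0 + beta1 * x   # point on the line of means
    epsilon = y - mean_response         # deviation from the mean
    print(x, round(y, 2), round(epsilon, 2))
```

<p>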
The deviations <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b5<\/span> represent the \u201cnoise\u201d in the data. In other words, the noise is the variation in <em>y<\/em> due to other causes that prevent the observed (<em>x, y<\/em>) from forming a perfectly straight line.<\/p>\n<p>The sample data used for regression are the observed values of <em>y<\/em> and <em>x<\/em>. The response <em>y<\/em> to a given <em>x<\/em> is a random variable, and the regression model describes the mean and standard deviation of this random variable <em>y<\/em>. The intercept <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span>, slope <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span>, and standard deviation <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span> of <em>y<\/em> are the unknown parameters of the regression model and must be estimated from the sample data.<\/p>\n<ul>\n<li class=\"List-Paragraph\">The value of <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> from the least squares regression line is really a prediction of the mean value of <em>y<\/em> (<span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span>) for a given value of <em>x<\/em>.<\/li>\n<li class=\"List-Paragraph\">The least squares regression line ( <img decoding=\"async\" alt=\"12009.png\" class=\"frame-12\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171611\/12009.png\" \/>) obtained from sample data is the best estimate of the true population regression line<br \/>\n(<img decoding=\"async\" alt=\"12014.png\" class=\"frame-12\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171612\/12014.png\" style=\"font-size: 0.917em;line-height: 1.273\" \/>).<\/li>\n<\/ul>\n<p class=\"Callout\"><span 
class=\"pullquote-left\"><span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> is an unbiased estimate for the mean response <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<br \/>\n<\/span><em class=\"char-style-override-2\">b<\/em><span class=\"Subscript SmallText\">0<\/span> is an unbiased estimate for the intercept <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<br \/>\n<\/span><em class=\"char-style-override-2\">b<\/em><span class=\"Subscript SmallText\">1<\/span> is an unbiased estimate for the slope <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span><\/span><\/p>\n<h2>Parameter Estimation<\/h2>\n<p>Once we have estimates of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> (from our sample data <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span>), the linear relationship determines the estimates of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span> for all values of <em>x<\/em> in our population, not just for the observed values of <em>x<\/em>. We now want to use the least-squares line as a basis for inference about a population from which our sample was drawn.<\/p>\n<p>Model assumptions tell us that <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span> are normally distributed with means <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> with standard deviations that can be estimated from the data. 
Procedures for inference about the population regression line will be similar to those described in the previous chapter for means. As always, it is important to examine the data for outliers and influential observations.<\/p>\n<p>In order to do this, we need to estimate <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span>, the regression standard error. This is the standard deviation of the model errors. It measures the variation of <em>y<\/em> about the population regression line. We will use the residuals to compute this value. Remember, the predicted value of <em>y<\/em> (<span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span>) for a specific <em>x<\/em> is the point on the regression line. It is the unbiased estimate of the mean response (<span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span>) for that <em>x<\/em>. The residual is:<\/p>\n<p class=\"Centered\">residual = observed \u2013 predicted<\/p>\n<p class=\"Centered\"><em>e<\/em><span class=\"Subscript SmallText\">i<\/span> = <em>y<\/em><span class=\"Subscript SmallText\">i<\/span> \u2013 <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> = <span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"12066.png\" class=\"frame-6\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171613\/12066.png\" \/><\/span><\/p>\n<p>The residual <em>e<\/em><span class=\"Subscript SmallText\">i<\/span> corresponds to model deviation <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b5<\/span><span class=\"Subscript SmallText\">i<\/span> where <strong class=\"SymbolsBold\" xml:lang=\"ar-SA\">\u03a3<\/strong> <em>e<\/em><span class=\"Subscript SmallText\">i<\/span> = 0 with a mean of 0. 
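<\/p>
<p>Computing the regression standard error from the residuals can be sketched in a few lines of Python (the data are invented; s is the square root of SSE divided by n \u2212 2, since two degrees of freedom are lost to the two estimated coefficients):<\/p>

```python
from statistics import mean

# Invented data: fit the least-squares line, then estimate sigma
x = [1, 2, 3, 4, 5, 6]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

xbar, ybar = mean(x), mean(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
SSE = sum(e ** 2 for e in residuals)   # sum of squared residuals
s = (SSE / (len(x) - 2)) ** 0.5        # regression standard error
print(round(s, 4))
```

<p>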
The regression standard error <em>s<\/em> is an estimate of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span>.<\/p>\n<p class=\"Centered\"><img decoding=\"async\" alt=\"12076.png\" class=\"frame-104\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171614\/12076.png\" \/><\/p>\n<p>The quantity <em>s<\/em> is the estimate of the regression standard error (<span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span>) and <em>s<\/em><span class=\"Superscript SmallText\">2<\/span> is often called the mean square error (MSE). A small value of <em>s<\/em> suggests that observed values of <em>y<\/em> fall close to the true regression line and the line <span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"12100.png\" class=\"frame-71\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171616\/12100.png\" \/><\/span> should provide accurate estimates and predictions.<\/p>\n<h2>Confidence Intervals and Significance Tests for Model Parameters<\/h2>\n<p>In an earlier chapter, we constructed confidence intervals and did significance tests for the population parameter <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span> (the population mean). We relied on sample statistics such as the mean and standard deviation for point estimates, margins of error, and test statistics. Inference for the population parameters <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> (y-intercept) and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> (slope) is very similar.<\/p>\n<p>Inference for the slope and intercept is based on the normal distribution using the estimates <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span>. 
The standard deviations of these estimates are multiples of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span>, the population regression standard error. Remember, we estimate <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span> with <em>s<\/em> (the variability of the data about the regression line). Because we use <em>s<\/em>, we rely on Student\u2019s t-distribution with (<em>n<\/em> \u2013 2) degrees of freedom.<\/p>\n<p class=\"Centered\"><img decoding=\"async\" alt=\"12112.png\" class=\"frame-37\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171617\/12112.png\" \/><\/p>\n<p class=\"Centered\">The standard error for the estimate of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span><\/p>\n<p class=\"Centered\"><img decoding=\"async\" alt=\"12122.png\" class=\"frame-37\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171619\/12122.png\" \/><\/p>\n<p class=\"Centered\">The standard error for the estimate of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span><\/p>\n<p>We can construct confidence intervals for the regression slope and intercept in much the same way as we did when estimating the population mean.<\/p>\n<hr \/>\n<p class=\"Call-out-First-line\">A <strong class=\"char-style-override-2\">confidence interval<\/strong> for <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> <em class=\"char-style-override-2\">: b<\/em><span class=\"Subscript SmallText\">0<\/span> \u00b1 t <span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span> SE<span class=\"Subscript SmallText\">b0<\/span><\/p>\n<p class=\"Call-out-Middle\">A <strong class=\"char-style-override-2\">confidence interval<\/strong> for <span class=\"Symbols\" 
xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> <em class=\"char-style-override-2\">:<\/em> <em class=\"char-style-override-2\">b<\/em><span class=\"Subscript SmallText\">1<\/span> \u00b1 t <span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span> SE<span class=\"Subscript SmallText\">b1<\/span><\/p>\n<p class=\"Call-out-End\">where SE<span class=\"Subscript SmallText\">b0<\/span> and SE<span class=\"Subscript SmallText\">b1<\/span> are the standard errors for the y-intercept and slope, respectively.<\/p>\n<hr \/>\n<p>We can also test the hypothesis H<span class=\"Subscript SmallText\">0<\/span>: <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> = 0. When we substitute <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> = 0 in the model, the x-term drops out and we are left with <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span> = <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span>. This tells us that the mean of <em>y<\/em> does NOT vary with <em>x<\/em>. 
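<\/p>
<p>The test statistic and the matching confidence interval can be sketched numerically in Python. The slope estimate, its standard error, and the sample size below are invented for illustration; the critical value 2.228 is t with \u03b1\/2 = 0.025 and 10 degrees of freedom, taken from a t-table:<\/p>

```python
# Test H0: beta1 = 0 using t = b1 / SE_b1 with n - 2 degrees of freedom.
b1, SE_b1 = 0.52, 0.11   # invented sample slope and its standard error
n = 12                   # invented sample size, so df = n - 2 = 10

t_stat = b1 / SE_b1      # compare to the critical value
t_crit = 2.228           # t(0.025) with 10 df, from a t-table

reject_H0 = abs(t_stat) > t_crit
ci = (b1 - t_crit * SE_b1, b1 + t_crit * SE_b1)   # 95% CI for beta1
print(round(t_stat, 2), reject_H0)
print(round(ci[0], 3), round(ci[1], 3))
```

<p>Note that the interval excludes zero exactly when the test rejects H<sub>0<\/sub> at the same \u03b1.<\/p>
<p>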
In other words, there is no straight line relationship between <em>x<\/em> and <em>y<\/em> and the regression of <em>y<\/em> on <em>x<\/em> is of no value for predicting <em>y<\/em>.<\/p>\n<hr \/>\n<p class=\"Call-out-First-line\">Hypothesis test for <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span><\/p>\n<p class=\"Call-out-Middle\">H<span class=\"Subscript SmallText\">0<\/span>: <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> = 0<\/p>\n<p class=\"Call-out-Middle\">H<span class=\"Subscript SmallText\">1<\/span>: <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> <em class=\"char-style-override-2\">\u2260<\/em> 0<\/p>\n<p class=\"Call-out-Middle\">The test statistic is t = b<span class=\"Subscript SmallText\">1<\/span> \/ SE<span class=\"Subscript SmallText\">b1<\/span><\/p>\n<p class=\"Call-out-Middle\">We can also use the F-statistic (MSR\/MSE) in the regression ANOVA table*<\/p>\n<p class=\"Call-out-End\">*Recall that t<span class=\"Superscript SmallText\">2<\/span> = F<\/p>\n<hr \/>\n<p>So let\u2019s pull all of this together in an example.<\/p>\n<div class=\"textbox examples\">\n<h3>Example 3<\/h3>\n<p class=\"Example\">The index of biotic integrity (IBI) is a measure of water quality in streams. As a manager for the natural resources in this region, you must monitor, track, and predict changes in water quality. You want to create a simple linear regression model that will allow you to predict changes in IBI from forested area. The following table gives sample data for IBI and forested area (in square kilometers) from a coastal forest region. 
Let forest area be the predictor variable (x) and IBI be the response variable (y).<\/p>\n<div style=\"width: 911px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11090.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171621\/11090.png\" width=\"901\" height=\"358\" \/><\/p>\n<p class=\"wp-caption-text\">Table 1. Observed data of biotic integrity and forest area.<\/p>\n<\/div>\n<p class=\"Example\">We begin by computing descriptive statistics and a scatterplot of IBI against Forest Area.<\/p>\n<p class=\"ExampleCenter\" style=\"text-align: center\"><span class=\"Symbols\" xml:lang=\"ar-SA\"><em>x\u0304<\/em><\/span> = 47.42; <em>s<span class=\"Subscript SmallText\">x<\/span><\/em> = 27.37; <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0304<\/em><\/span> = 58.80; <em>s<span class=\"Subscript SmallText\">y<\/span><\/em> = 21.38; r = 0.735<\/p>\n<div style=\"width: 631px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11080.png\" class=\"frame-4\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171623\/11080.png\" width=\"621\" height=\"415\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 18. Scatterplot of IBI vs. Forest Area.<\/p>\n<\/div>\n<p class=\"Example\">There appears to be a positive linear relationship between the two variables. The linear correlation coefficient is r = 0.735. This indicates a strong, positive, linear relationship. In other words, forest area is a good predictor of IBI. 
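<\/p>
<p class=\"Example\">These summary statistics are enough to reproduce the slope and intercept with the shortcut equations; a quick Python check (values taken from above; the small difference from the 31.581 computed later comes from rounding the summary statistics):<\/p>

```python
# Shortcut equations: b1 = r * (sy / sx), b0 = ybar - b1 * xbar
xbar, sx = 47.42, 27.37   # mean and SD of forest area (x)
ybar, sy = 58.80, 21.38   # mean and SD of IBI (y)
r = 0.735                 # linear correlation coefficient

b1 = r * sy / sx          # slope
b0 = ybar - b1 * xbar     # y-intercept
r_sq = r ** 2             # coefficient of determination, R-squared
print(round(b1, 3), round(b0, 2), round(r_sq, 3))  # 0.574 31.57 0.54
```

<p class=\"Example\">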
Now let\u2019s create a simple linear regression model using forest area to predict IBI (response).<\/p>\n<p class=\"Example\">First, we will compute <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span> using the shortcut equations.<\/p>\n<p class=\"ExampleCenter\"><span class=\"Inline-Equation-Large\"><img decoding=\"async\" alt=\"12180.png\" class=\"frame-14\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171625\/12180.png\" \/><\/span>=<span class=\"Inline-Equation-Large\"><img decoding=\"async\" alt=\"12189.png\" class=\"frame-12\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171626\/12189.png\" \/><\/span>=0.574<\/p>\n<p class=\"ExampleCenter\"><span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"12198.png\" class=\"frame-45\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171627\/12198.png\" \/><\/span><span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"12205.png\" class=\"frame-55\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171628\/12205.png\" \/><\/span>= 31.581<\/p>\n<p class=\"Example\">The regression equation is <span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"12216.png\" class=\"frame-87\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171629\/12216.png\" \/><\/span>.<\/p>\n<p class=\"Example\">Now let\u2019s use Minitab to compute the regression model. 
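<\/p>
<p class=\"Example\">Before turning to Minitab, the arithmetic can be verified from the summary statistics alone. The following is an illustrative sketch in Python, not part of the textbook\u2019s Minitab workflow:<\/p>

```python
# Summary statistics reported above for Example 3
xbar, sx = 47.42, 27.37    # mean and standard deviation of forest area (x)
ybar, sy = 58.80, 21.38    # mean and standard deviation of IBI (y)
r = 0.735                  # linear correlation coefficient

# Shortcut equations: b1 = r*(sy/sx), b0 = ybar - b1*xbar
b1 = r * (sy / sx)
b0 = ybar - b1 * xbar

print(round(b1, 3), round(b0, 1))   # 0.574 31.6
```

<p class=\"Example\">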
The output appears below.<\/p>\n<h4>Regression Analysis: IBI versus Forest Area<\/h4>\n<p class=\"Example\">The regression equation is IBI = 31.6 + 0.574 Forest Area<\/p>\n<table class=\"Table\" style=\"margin-left: 23px\">\n<colgroup>\n<col \/>\n<col \/>\n<col \/>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Predictor<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SE Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">T<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Constant<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">31.583<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">4.177<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">7.56<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Forest Area<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.57396<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">0.07648<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">7.50<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">S = 14.6505<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">R-Sq = 54.0%<\/strong><\/p>\n<\/td>\n<td class=\"Table\" colspan=\"2\">\n<p class=\"Table\">R-Sq(adj) = 53.0%<\/p>\n<\/td>\n<td class=\"Table\">\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table class=\"Table\" style=\"margin-left: 23px\">\n<colgroup>\n<col \/>\n<col \/>\n<col \/>\n<col \/>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table-Heading\" colspan=\"6\">\n<p class=\"Table-Heading\">Analysis of Variance<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td 
class=\"Table\">\n<p class=\"Table\">Source<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">DF<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">MS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">F<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Regression<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">1<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">12089<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">12089<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">56.32<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Residual Error<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">48<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">10303<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">215<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">\u00a0<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">\u00a0<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Total<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">49<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">22392<\/p>\n<\/td>\n<td class=\"Table\">\n<\/td>\n<td class=\"Table\">\n<\/td>\n<td class=\"Table\">\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p class=\"Example\">The estimates for <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> are 31.6 and 0.574, respectively. We can interpret the y-intercept to mean that when there is zero forested area, the IBI will equal 31.6. 
For each additional square kilometer of forested area added, the IBI will increase by 0.574 units.<\/p>\n<p class=\"Example\">The coefficient of determination, R<span class=\"Superscript SmallText\">2<\/span>, is 54.0%. This means that 54% of the variation in IBI is explained by this model. Approximately 46% of the variation in IBI is due to other factors or random variation. We would like R<span class=\"Superscript SmallText\">2<\/span> to be as high as possible (maximum value of 100%).<\/p>\n<p class=\"Example\">The residual and normal probability plots do not indicate any problems.<\/p>\n<div style=\"width: 1000px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11070.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171631\/11070.png\" width=\"990\" height=\"311\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 19. A residual and normal probability plot.<\/p>\n<\/div>\n<p class=\"Example\">The estimate of <strong class=\"SymbolsBold\" xml:lang=\"ar-SA\">\u03c3<\/strong>, the regression standard error, is <em>s<\/em> = 14.6505. This is a measure of the variation of the observed values about the population regression line. We would like this value to be as small as possible. The MSE is equal to 215. Remember, the <span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"12275.png\" class=\"frame-23\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171633\/12275.png\" \/><\/span>= <em>s<\/em>. 
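<\/p>
<p class=\"Example\">Both quantities can be recovered from the ANOVA table (a sketch; the tiny difference from Minitab\u2019s s = 14.6505 comes from rounding in the printed sums of squares):<\/p>

```python
# Sums of squares from the regression ANOVA table
ss_regression = 12089.0   # SSR
ss_error = 10303.0        # SSE
ss_total = 22392.0        # SST = SSR + SSE
df_error = 48             # n - 2

r_squared = ss_regression / ss_total   # coefficient of determination
mse = ss_error / df_error              # mean square error
s = mse ** 0.5                         # regression standard error, s = sqrt(MSE)

print(round(100 * r_squared, 1), round(s, 2))   # 54.0 14.65
```

<p class=\"Example\">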
The standard errors for the coefficients are 4.177 for the y-intercept and 0.07648 for the slope.<\/p>\n<p class=\"Example\">We know that the values <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> = 31.6 and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span> = 0.574 are sample estimates of the true, but unknown, population parameters <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span>. We can construct 95% confidence intervals to better estimate these parameters. The critical value (t<span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span>) comes from the student t-distribution with (<span class=\"BoldItalic Strong-2\">n<\/span> \u2013 2) degrees of freedom. Our sample size is 50 so we would have 48 degrees of freedom. The closest table value is 2.009.<\/p>\n<p class=\"ExampleCenter\" style=\"text-align: center\">95% confidence intervals for <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">0<\/span> and <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span><\/p>\n<p class=\"ExampleCenter\" style=\"text-align: center\"><em>b<\/em><span class=\"Subscript SmallText\">0<\/span> \u00b1 t<span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span> SE<span class=\"Subscript SmallText\">b0<\/span> = 31.6 \u00b1 2.009(4.177) = (23.21, 39.99)<\/p>\n<p class=\"ExampleCenter\" style=\"text-align: center\"><em>b<\/em><span class=\"Subscript SmallText\">1<\/span> \u00b1 t<span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span> SE<span class=\"Subscript SmallText\">b1<\/span> = 0.574 \u00b1 2.009(0.07648) = (0.4204, 
0.7277)<\/p>\n<p class=\"Example\">The next step is to test that the slope is significantly different from zero using a 5% level of significance.<\/p>\n<table class=\"Table\" style=\"margin-left: 23px\">\n<colgroup>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table-Heading\">\n<p class=\"Table-Heading\">H<span class=\"Subscript SmallText\">0<\/span>: <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> =0<\/p>\n<\/td>\n<td class=\"Table-Heading\">\n<p class=\"Table-Heading\">H<span class=\"Subscript SmallText\">1<\/span>: <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03b2<\/span><span class=\"Subscript SmallText\">1<\/span> <em>\u2260<\/em>0<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p class=\"ExampleCenter\">t = b<span class=\"Subscript SmallText\">1<\/span> \/ SE<span class=\"Subscript SmallText\">b1<\/span> = 0.574\/0.07648 = 7.50523<\/p>\n<p class=\"Example\">We have 48 degrees of freedom and the closest critical value from the student t-distribution is 2.009. The test statistic is greater than the critical value, so we will reject the null hypothesis. The slope is significantly different from zero. 
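<\/p>
<p class=\"Example\">The interval and test-statistic arithmetic can be reproduced in a few lines (an illustrative sketch using the table value 2.009):<\/p>

```python
# Coefficient estimates and standard errors from the Minitab output
b0, se_b0 = 31.6, 4.177
b1, se_b1 = 0.574, 0.07648
t_crit = 2.009            # t-table value for 48 degrees of freedom

# 95% confidence intervals for the intercept and slope
ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)   # roughly (23.21, 39.99)
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # roughly (0.420, 0.728)

# Test statistic for H0: beta1 = 0, and its link to the ANOVA F (t^2 = F)
t = b1 / se_b1
print(round(t, 3), round(t ** 2, 1))   # 7.505 56.3
```

<p class=\"Example\">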
We have found a statistically significant relationship between Forest Area and IBI.<\/p>\n<p class=\"Example\">The Minitab output also reports the test statistic and p-value for this test.<\/p>\n<table class=\"Table\" style=\"margin-left: 23px\">\n<colgroup>\n<col \/>\n<col \/>\n<col \/>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table\" colspan=\"5\">\n<p class=\"Table\">The regression equation is IBI = 31.6 + 0.574 Forest Area<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Predictor<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SE Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">T<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Constant<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">31.583<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">4.177<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">7.56<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Forest Area<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.57396<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.07648<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">7.50<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">0.000<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">S = 14.6505<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">R-Sq = 54.0%<\/p>\n<\/td>\n<td class=\"Table\" colspan=\"2\">\n<p class=\"Table\">R-Sq(adj) = 53.0%<\/p>\n<\/td>\n<td class=\"Table\">\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table class=\"Table\" style=\"margin-left: 23px\">\n<colgroup>\n<col \/>\n<col \/>\n<col \/>\n<col \/>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table\" colspan=\"6\">\n<p 
class=\"Table\">Analysis of Variance<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Source<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">DF<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">MS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">F<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Regression<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">1<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">12089<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">12089<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">56.32<\/strong><\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\"><strong class=\"Strong-2\">0.000<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Residual Error<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">48<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">10303<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">215<\/p>\n<\/td>\n<td class=\"Table\">\n<\/td>\n<td class=\"Table\">\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Total<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">49<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">22392<\/p>\n<\/td>\n<td class=\"Table\">\n<\/td>\n<td class=\"Table\">\n<\/td>\n<td class=\"Table\">\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p class=\"Example\">The t test statistic is 7.50 with an associated p-value of 0.000. The p-value is less than the level of significance (5%) so we will reject the null hypothesis. The slope is significantly different from zero. The same result can be found from the F-test statistic of 56.32 (7.505<span class=\"Superscript SmallText\">2<\/span> = 56.32). 
The p-value is the same (0.000), and so is the conclusion.<\/p>\n<\/div>\n<h2 class=\"ExampleHeading\">Confidence Interval for <strong class=\"SymbolsBold\" xml:lang=\"ar-SA\">\u03bc<\/strong><span class=\"Subscript SmallText\">y<\/span><\/h2>\n<p>Now that we have created a regression model built on a significant relationship between the predictor variable and the response variable, we are ready to use the model for<\/p>\n<ul>\n<li class=\"List-Paragraph\">estimating the average value of <em>y<\/em> for a given value of <em>x<\/em><\/li>\n<li class=\"List-Paragraph\">predicting a particular value of <em>y<\/em> for a given value of <em>x<\/em><\/li>\n<\/ul>\n<p>Let\u2019s examine the first option. The sample of <em>n<\/em> pairs drawn from a population was used to compute the regression coefficients <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span> for our model, and gives us the average value of <em>y<\/em> for a specific value of <em>x<\/em> through our population model<\/p>\n<p><span class=\"Picture\"><img decoding=\"async\" alt=\"12315.png\" class=\"frame-56\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171634\/12315.png\" \/><\/span>. For every specific value of x, there is an average y (<span class=\"Symbols\" xml:lang=\"ar-SA\"><em>\u03bc<\/em><span class=\"Subscript SmallText\"><em>y<\/em><\/span><\/span>), which falls on the straight line equation (a line of means). 
Remember that there can be many different observed values of <em>y<\/em> for a particular <em>x<\/em>, and these values are assumed to have a normal distribution with a mean equal to <span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"12336.png\" class=\"frame-21\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171635\/12336.png\" \/><\/span> and a variance of <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span><span class=\"Superscript SmallText\">2<\/span>. Since the computed values of <em>b<\/em><span class=\"Subscript SmallText\">0<\/span> and <em>b<\/em><span class=\"Subscript SmallText\">1<\/span> vary from sample to sample, each new sample may produce a slightly different regression equation. Each new model can be used to estimate a value of <em>y<\/em> for a value of <em>x<\/em>. How far will our estimator <span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"12346.png\" class=\"frame-44\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171636\/12346.png\" \/><\/span> be from the true population mean for that value of <em>x<\/em>? 
This depends, as always, on the variability in our estimator, measured by the standard error.<\/p>\n<p>It can be shown that the estimated value of <em>y<\/em> when <em>x<\/em> = <em>x<\/em><span class=\"Subscript SmallText\">0<\/span> (some specified value of <em>x<\/em>) is an unbiased estimator of the population mean, and that <span class=\"Symbols\" xml:lang=\"ar-SA\"><em>y\u0302<\/em><\/span> is normally distributed with a standard error of<\/p>\n<p class=\"Centered\"><img decoding=\"async\" alt=\"12371.png\" class=\"frame-32 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171637\/12371.png\" \/><\/p>\n<p>We can construct a confidence interval to better estimate this parameter (<span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span>) following the same procedure illustrated previously in this chapter.<\/p>\n<p><span class=\"Picture\"><img decoding=\"async\" alt=\"12387.png\" class=\"frame-56 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171638\/12387.png\" \/><\/span>where the critical value t<span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span> comes from the student t-table with (<em>n<\/em> \u2013 2) degrees of freedom.<\/p>\n<p>Statistical software, such as Minitab, will compute the confidence intervals for you. 
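<\/p>
<p>As a check on the formula, the fitted mean and its standard error can also be computed by hand. In this sketch, the sum of squared deviations of x is recovered from the sample standard deviation as (n \u2013 1)s<span class=\"Subscript SmallText\">x<\/span><span class=\"Superscript SmallText\">2<\/span>, a step not shown in the text, and x<span class=\"Subscript SmallText\">0<\/span> = 32 is used:<\/p>

```python
# Example 3 quantities needed by the standard error formula
n, xbar, sx = 50, 47.42, 27.37   # sample size, mean and std. dev. of x
s = 14.6505                      # regression standard error
b0, b1 = 31.583, 0.57396         # fitted intercept and slope
x0 = 32                          # x value at which to estimate the mean response

sxx = (n - 1) * sx ** 2          # sum of squared deviations of x
fit = b0 + b1 * x0               # point estimate of the mean response
se_fit = s * (1 / n + (x0 - xbar) ** 2 / sxx) ** 0.5

print(round(fit, 2), round(se_fit, 3))   # 49.95 2.384
```

<p>These values agree with the Fit and SE Fit columns that Minitab reports.<\/p>
<p>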
Using the data from the previous example, we will use Minitab to compute the 95% confidence interval for the mean response for an average forested area of 32 km\u00b2.<\/p>\n<table class=\"Table\">\n<colgroup>\n<col \/>\n<col \/>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table-Heading\" colspan=\"4\">\n<p class=\"Table-Heading\">Predicted Values for New Observations<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">New Obs<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">Fit<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SE Fit<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">95% CI<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">1<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">49.9496<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">2.38400<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(45.1562, 54.7429)<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>If you sampled many areas that averaged 32 km\u00b2 
of forested area, your estimate of the average IBI would be from 45.1562 to 54.7429.<\/p>\n<p>You can repeat this process many times for several different values of <em>x<\/em> and plot the confidence intervals for the mean response.<\/p>\n<table class=\"Table\">\n<colgroup>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table-Heading\">\n<p class=\"Table-Heading\"><strong>x<\/strong><\/p>\n<\/td>\n<td class=\"Table-Heading\">\n<p class=\"Table-Heading\"><strong>95% CI<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">20<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(37.13, 48.88)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">40<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(50.22, 58.86)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">60<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(61.43, 70.61)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">80<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(70.98, 84.02)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">100<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(79.88, 98.07)<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div style=\"width: 911px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"11060.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171640\/11060.png\" width=\"901\" height=\"480\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 20. 95% confidence intervals for the mean response.<\/p>\n<\/div>\n<p class=\"Caption\"><span class=\"Picture\">Notice how the width of the 95% confidence interval varies for the different values of <em>x<\/em>. 
Since the confidence interval width is narrower for the central values of <em>x<\/em>, it follows that <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span> is estimated more precisely for values of <em>x<\/em> in this area. As you move towards the extreme limits of the data, the width of the intervals increases, indicating that it would be unwise to extrapolate beyond the limits of the data used to create this model.<\/span><\/p>\n<h2>Prediction Intervals<\/h2>\n<p>What if you want to predict a <em>particular<\/em> value of <em>y<\/em> when <em>x<\/em> = <em>x<\/em><span class=\"Subscript SmallText\">0<\/span>? Or, perhaps you want to predict the next measurement for a given value of <em>x<\/em>? This problem differs from constructing a confidence interval for <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03bc<\/span><span class=\"Subscript SmallText\">y<\/span>. Instead of constructing a confidence interval to estimate a population parameter, we need to construct a prediction interval. Choosing to predict a particular value of <em>y<\/em> incurs some additional error in the prediction because of the deviation of <em>y<\/em> from the line of means. Examine the figure below. 
You can see that the error in prediction has two components:<\/p>\n<ol>\n<li class=\"List-Paragraph-Number-1\">The error in using the fitted line to estimate the line of means<\/li>\n<li class=\"List-Paragraph-Number-1\">The error caused by the deviation of y from the line of means, measured by <span class=\"Symbols\" xml:lang=\"ar-SA\">\u03c3<\/span><span class=\"Superscript SmallText\">2<\/span><\/li>\n<\/ol>\n<div style=\"width: 563px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"136.tif\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171642\/136_fmt.png\" width=\"553\" height=\"268\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 21. Illustrating the two components in the error of prediction.<\/p>\n<\/div>\n<p>The variance of the difference between y and <span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"13215.png\" class=\"frame-5\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171643\/13215.png\" \/><\/span> is the sum of these two variances and forms the basis for the standard error of <span class=\"Inline-Equation\"><img decoding=\"async\" alt=\"12547.png\" class=\"frame-43\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171644\/12547.png\" \/><\/span> used for prediction. 
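<\/p>
<p>In an illustrative sketch with the Example 3 values, the second component simply adds 1 under the square root relative to the standard error used for the mean response:<\/p>

```python
# Example 3 summary values; x0 = 32 as in the earlier confidence interval
n, xbar, sx = 50, 47.42, 27.37
s = 14.6505                 # regression standard error
x0 = 32

sxx = (n - 1) * sx ** 2              # sum of squared deviations of x
h = 1 / n + (x0 - xbar) ** 2 / sxx   # leverage of x0

se_mean = s * h ** 0.5           # component 1 only: estimating the line of means
se_pred = s * (1 + h) ** 0.5     # both components: note the extra 1 under the root

print(round(se_mean, 3), round(se_pred, 2))   # 2.384 14.84
```

<p>With the table value t = 2.009, the margin 2.009 \u00d7 14.84 \u2248 29.8 is what makes a prediction interval so much wider than the corresponding confidence interval for the mean.<\/p>
<p>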
The resulting form of a prediction interval is as follows:<\/p>\n<p class=\"Centered\"><img decoding=\"async\" alt=\"12568.png\" class=\"frame-710 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171645\/12568.png\" \/><\/p>\n<p>where <em>x<\/em><span class=\"Subscript SmallText\">0<\/span> is the given value for the predictor variable, <em>n<\/em> is the number of observations, and t<span class=\"Symbol-Subscript SmallText\" xml:lang=\"ar-SA\">\u03b1<\/span><span class=\"Subscript SmallText\">\/2<\/span> is the critical value with (<em>n<\/em> \u2013 2) degrees of freedom.<\/p>\n<p>Software, such as Minitab, can compute the prediction intervals. Using the data from the previous example, we will use Minitab to compute the 95% prediction interval for the IBI of a specific forested area of 32 km\u00b2.<\/p>\n<table class=\"Table\">\n<colgroup>\n<col \/>\n<col \/>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table-Heading\" colspan=\"4\">\n<p class=\"Table-Heading\">Predicted Values for New Observations<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">New Obs<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">Fit<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SE Fit<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">95% PI<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">1<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">49.9496<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">2.38400<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(20.1053, 79.7939)<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>You can repeat this process many times for several different values of <em>x<\/em> and plot the prediction intervals for the individual responses.<\/p>\n<table class=\"Table\">\n<colgroup>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table-Heading\">\n<p class=\"Table-Heading\"><strong>x<\/strong><\/p>\n<\/td>\n<td 
class=\"Table-Heading\">\n<p class=\"Table-Heading\"><strong>95% PI<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">20<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(13.01, 73.11)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">40<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(24.77, 84.31)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">60<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(36.21, 95.83)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">80<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(47.33, 107.67)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">100<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">(58.15, 119.81)<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Notice that the prediction interval bands are wider than the corresponding confidence interval bands, reflecting the fact that we are predicting the value of a random variable rather than estimating a population parameter. We would expect predictions for an individual value to be more variable than estimates of an average value.<\/p>\n<div style=\"width: 911px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"10592.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171647\/10592.png\" width=\"901\" height=\"480\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 22. A comparison of confidence and prediction intervals.<\/p>\n<\/div>\n<h2>Transformations to Linearize Data Relationships<\/h2>\n<p>In many situations, the relationship between <em>x<\/em> and <em>y<\/em> is non-linear. In order to simplify the underlying model, we can transform or convert either <em>x<\/em> or <em>y<\/em> or both to result in a more linear relationship. There are many common transformations such as logarithmic and reciprocal. 
Including higher order terms of <em>x<\/em> may also help to linearize the relationship between <em>x<\/em> and <em>y<\/em>. Shown below are some common shapes of scatterplots and possible choices for transformations. However, the choice of transformation is frequently more a matter of trial and error than set rules.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" alt=\"Ch7DataRelationship4\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171650\/Ch7DataRelationship4.png\" width=\"1638\" height=\"661\" \/><img loading=\"lazy\" decoding=\"async\" alt=\"Ch7DataRelationship3\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171652\/Ch7DataRelationship3.png\" width=\"1638\" height=\"661\" \/><img loading=\"lazy\" decoding=\"async\" alt=\"Ch7DataRelationship2\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171655\/Ch7DataRelationship2.png\" width=\"1638\" height=\"661\" \/><\/p>\n<div style=\"width: 1648px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"Ch7DataRelationship1\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171658\/Ch7DataRelationship1.png\" width=\"1638\" height=\"661\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 23. Examples of possible transformations for x and y variables.<\/p>\n<\/div>\n<div class=\"textbox examples\">\n<h3>Example 4<\/h3>\n<p class=\"Example\">A forester needs to create a simple linear regression model to predict tree volume using diameter-at-breast height (dbh) for sugar maple trees. He collects dbh and volume for 236 sugar maple trees and plots volume versus dbh. 
Given below is the scatterplot, correlation coefficient, and regression output from Minitab.<\/p>\n<div style=\"width: 751px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"10541.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171700\/10541.png\" width=\"741\" height=\"496\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 24. Scatterplot of volume versus dbh.<\/p>\n<\/div>\n<p class=\"Example\">Pearson\u2019s linear correlation coefficient is 0.894, which indicates a strong, positive, linear relationship. However, the scatterplot shows a distinct nonlinear relationship.<\/p>\n<h4>Regression Analysis: volume versus dbh<\/h4>\n<table class=\"Table\" style=\"margin-left: 23px\">\n<colgroup>\n<col \/>\n<col \/>\n<col \/>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table\" colspan=\"5\">The regression equation is volume = &#8211; 51.1 + 7.15 dbh<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Predictor<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SE Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">T<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Constant<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">-51.097<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">3.271<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">-15.62<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">dbh<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">7.1500<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.2342<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">30.53<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p 
class=\"Table\">S = 19.5820<\/p>\n<\/td>\n<td class=\"Table\" colspan=\"2\">\n<p class=\"Table\">R-Sq = 79.9%<\/p>\n<\/td>\n<td class=\"Table\" colspan=\"2\">\n<p class=\"Table\">R-Sq(adj) = 79.8%<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table class=\"Table\" style=\"margin-left: 23px\">\n<colgroup>\n<col \/>\n<col \/>\n<col \/>\n<col \/>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table\" colspan=\"6\">\n<p class=\"Table\">Analysis of Variance<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Source<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">DF<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">MS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">F<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Regression<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">1<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">357397<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">357397<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">932.04<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Residual Error<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">234<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">89728<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">383<\/p>\n<\/td>\n<td class=\"Table\">\n<\/td>\n<td class=\"Table\">\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Total<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">235<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">447125<\/p>\n<\/td>\n<td class=\"Table\">\n<\/td>\n<td class=\"Table\">\n<\/td>\n<td class=\"Table\">\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p class=\"Example\">The R<span class=\"Superscript SmallText\">2<\/span> is 79.9%, indicating a fairly strong model, and the slope is significantly different from zero. However, both the residual plot and the residual normal probability plot indicate serious problems with this model. A transformation may help to create a more linear relationship between volume and dbh.<\/p>\n<div style=\"width: 1006px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"10531.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171703\/10531.png\" width=\"996\" height=\"363\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 25. Residual and normal probability plots.<\/p>\n<\/div>\n<p class=\"Example\">Volume was transformed to its natural log and plotted against dbh (see scatterplot below). Unfortunately, this did little to improve the linearity of the relationship. The forester then also took the natural log of dbh. The scatterplot of the natural log of volume versus the natural log of dbh indicated a more linear relationship between these two variables. The linear correlation coefficient is 0.954.<\/p>\n<div style=\"width: 931px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"10521.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171705\/10521.png\" width=\"921\" height=\"310\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 26. Scatterplots of natural log of volume versus dbh and natural log of volume versus natural log of dbh.<\/p>\n<\/div>\n<p class=\"Example\">The regression analysis output from Minitab is given below.<\/p>\n<h4>Regression Analysis: lnVOL vs. 
lnDBH<\/h4>\n<table id=\"Table9\" class=\"Table\" style=\"margin-left: 23px\">\n<colgroup>\n<col \/>\n<col \/>\n<col \/>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table\" colspan=\"5\">\n<p class=\"Table\">The regression equation is lnVOL = &#8211; 2.86 + 2.44 lnDBH<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Predictor<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SE Coef<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">T<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Constant<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">-2.8571<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.1253<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">-22.80<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">lnDBH<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">2.44383<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.05007<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">48.80<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">S = 0.327327<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">R-Sq = 91.1%<\/p>\n<\/td>\n<td class=\"Table\" colspan=\"2\">\n<p class=\"Table\">R-Sq(adj) = 91.0%<\/p>\n<\/td>\n<td class=\"Table\">\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table id=\"table-20\" class=\"Table\">\n<colgroup>\n<col \/>\n<col \/>\n<col \/>\n<col \/>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<td class=\"Table\" colspan=\"6\">\n<p class=\"Table\">Analysis of Variance<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Source<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">DF<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">SS<\/p>\n<\/td>\n<td 
class=\"Table\">\n<p class=\"Table\">MS<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">F<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">P<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Regression<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">1<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">255.19<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">255.19<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">2381.78<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.000<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Residual Error<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">234<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">25.07<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">0.11<\/p>\n<\/td>\n<td class=\"Table\">\n<\/td>\n<td class=\"Table\">\n<\/td>\n<\/tr>\n<tr>\n<td class=\"Table\">\n<p class=\"Table\">Total<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">235<\/p>\n<\/td>\n<td class=\"Table\">\n<p class=\"Table\">280.26<\/p>\n<\/td>\n<td class=\"Table\">\n<\/td>\n<td class=\"Table\">\n<\/td>\n<td class=\"Table\">\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div style=\"width: 1060px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"10512.png\" class=\"frame-13\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171708\/10512.png\" width=\"1050\" height=\"367\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 27. Residual and normal probability plots.<\/p>\n<\/div>\n<p class=\"Example\">The model using the transformed values of volume and dbh shows a more linear relationship and a stronger correlation coefficient. The slope is significantly different from zero, and the R<span class=\"Superscript SmallText\">2<\/span> has increased from 79.9% to 91.1%. 
The residual plot shows a more random pattern, and the normal probability plot shows some improvement.<\/p>\n<p class=\"Example\">There are many possible transformation combinations to linearize data. Each situation is unique, and the user may need to try several alternatives before selecting the best transformation for <em>x<\/em>, <em>y<\/em>, or both.<\/p>\n<\/div>\n<h2 class=\"ExampleHeading\">Software Solutions<\/h2>\n<h3>Minitab<\/h3>\n<p class=\"Centered\"><img decoding=\"async\" alt=\"145_1.tif\" class=\"frame-52 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171711\/145_1_fmt.png\" \/><img decoding=\"async\" alt=\"145_2.tif\" class=\"frame-52 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171716\/145_2_fmt.png\" \/><\/p>\n<p>The Minitab output is shown above in Ex. 4.<\/p>\n<h3>Excel<\/h3>\n<p class=\"No-Caption\"><span class=\"Picture\"><img decoding=\"async\" alt=\"143_1.tif\" class=\"frame-106 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171719\/143_1_fmt.png\" \/><\/span><\/p>\n<p class=\"No-Caption\"><span class=\"Picture\"><img decoding=\"async\" alt=\"143_2.tif\" class=\"frame-106 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171722\/143_2_fmt.png\" \/><\/span><\/p>\n<p><span class=\"Picture\"><img decoding=\"async\" alt=\"143_3.tif\" class=\"frame-13 aligncenter\" src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171724\/143_3_fmt.png\" \/><\/span><\/p>\n<div style=\"width: 588px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" alt=\"144.tif\" class=\"frame-13\" 
src=\"https:\/\/s3-us-west-2.amazonaws.com\/courses-images\/wp-content\/uploads\/sites\/1888\/2017\/05\/11171726\/144_fmt.png\" width=\"578\" height=\"489\" \/><\/p>\n<p class=\"wp-caption-text\">Figure 28. Residual and normal probability plots.<\/p>\n<\/div>\n<\/div>\n\n\t\t\t <section class=\"citations-section\" role=\"contentinfo\">\n\t\t\t <h3>Candela Citations<\/h3>\n\t\t\t\t\t <div>\n\t\t\t\t\t\t <div id=\"citation-list-877\">\n\t\t\t\t\t\t\t <div class=\"licensing\"><div class=\"license-attribution-dropdown-subheading\">CC licensed content, Shared previously<\/div><ul class=\"citation-list\"><li>Natural Resources Biometrics. <strong>Authored by<\/strong>: Diane Kiernan. <strong>Located at<\/strong>: <a target=\"_blank\" href=\"https:\/\/textbooks.opensuny.org\/natural-resources-biometrics\/\">https:\/\/textbooks.opensuny.org\/natural-resources-biometrics\/<\/a>. <strong>Project<\/strong>: Open SUNY Textbooks. <strong>License<\/strong>: <em><a target=\"_blank\" rel=\"license\" href=\"https:\/\/creativecommons.org\/licenses\/by-nc-sa\/4.0\/\">CC BY-NC-SA: Attribution-NonCommercial-ShareAlike<\/a><\/em><\/li><\/ul><\/div>\n\t\t\t\t\t\t <\/div>\n\t\t\t\t\t <\/div>\n\t\t\t <\/section>","protected":false},"author":622,"menu_order":7,"template":"","meta":{"_candela_citation":"[{\"type\":\"cc\",\"description\":\"Natural Resources Biometrics\",\"author\":\"Diane Kiernan\",\"organization\":\"\",\"url\":\"https:\/\/textbooks.opensuny.org\/natural-resources-biometrics\/\",\"project\":\"Open SUNY 
Textbooks\",\"license\":\"cc-by-nc-sa\",\"license_terms\":\"\"}]","CANDELA_OUTCOMES_GUID":"","pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-877","chapter","type-chapter","status-publish","hentry"],"part":21,"_links":{"self":[{"href":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/wp-json\/pressbooks\/v2\/chapters\/877","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/wp-json\/wp\/v2\/users\/622"}],"version-history":[{"count":1,"href":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/wp-json\/pressbooks\/v2\/chapters\/877\/revisions"}],"predecessor-version":[{"id":1254,"href":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/wp-json\/pressbooks\/v2\/chapters\/877\/revisions\/1254"}],"part":[{"href":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/wp-json\/pressbooks\/v2\/parts\/21"}],"metadata":[{"href":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/wp-json\/pressbooks\/v2\/chapters\/877\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/wp-json\/wp\/v2\/media?parent=877"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/wp-json\/pressbooks\/v2\/chapter-type?post=877"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/wp-json\/wp\/v2\/contributor?post=877"},{"taxonom
y":"license","embeddable":true,"href":"https:\/\/courses.lumenlearning.com\/suny-natural-resources-biometrics\/wp-json\/wp\/v2\/license?post=877"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}