I1.06: Section 4

Section 4: Adjusting the input variables to simplify the intercept parameter of linear models

For the sediment data, both parameters of the model have natural meanings: the slope is the daily rate of sediment build-up, and the intercept is the sediment level immediately after cleaning. This is because the zero point of the input variable, days since cleaning, is naturally related to the situation: zero corresponds to the date the cleaning took place.

However, sometimes the zero point for an input variable is artificial and has no natural relationship to the situation. When this is true, the intercept of a linear model for that data will not have a useful meaning. If the input variable for a dataset of a company’s annual sales is the calendar year, for example, then the intercept of a linear model fit to that data will be the model’s “prediction” for the year 0, over 2000 years ago. This will probably be a very large positive or negative number that would be almost impossible to find just by making guesses to adjust the intercept parameter of the model. Even if found, the resulting model formula would be awkward to use because of the very large value it contains.

There are a couple of ways to avoid this problem. The one shown below, changing the input variable from “Year” to “Years since 1990”, is the same technique you used in an earlier topic on graphing. When the input is redefined in this way, it becomes as simple to find the model as it was for the sediment data. In a later topic we will show how to adjust the model itself to give the same effect.

Example 4: Find a good linear model for the sales data tabulated below, redefining the input variable as needed, and use the model to predict sales in 2010.

Solution:

  1. Redefine the input variable as “Years since 1990”, since that gives the first data point an input value of zero. To do this, copy the data from the table to a scratch-pad worksheet, make a column containing the Year data with 1990 subtracted, and copy the modified table to a copy of the linear model template.
  2. Since the first row now has an input value of zero, use the Sales figure for that row, 426, as the initial setting for the model intercept parameter in cell G3.
  3. Adjust the slope parameter in cell G4 until the model points are parallel to the trend of the data. A slope value of 26 is about right, but other nearby values would also be okay if the line looks correct.
  4. Readjust the intercept value to ensure that the model goes through the middle of the data. Increasing it to 460 improves the fit (but again, a nearby value is okay if the graph shows a good fit).
  5. The implied linear model is y = 26 x + 460.
  6. Thus the prediction of the model for sales in 2010 (20 years after 1990) is 980.
Year   House Sales
1990   426
1991   517
1992   500
1993   558
1994   611
1995   558
1996   601
1997   596
1998   683
1999   693
2000   708
2001   761
2002   771
2003   831
2004   897
2005   822
2006   889
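
The solution steps can also be checked outside the spreadsheet. The following is a minimal sketch in Python, which is an assumption of this illustration (the text itself uses a worksheet template); the variable names are hypothetical. It redefines the input as years since 1990, applies the hand-fitted parameters from step 5, and computes the average gap between data and model as a rough check that the line runs through the middle of the data.

```python
# Sales data copied from the table (calendar years 1990 through 2006).
years = list(range(1990, 2007))
sales = [426, 517, 500, 558, 611, 558, 601, 596, 683,
         693, 708, 761, 771, 831, 897, 822, 889]

# Step 1: redefine the input variable as "Years since 1990".
base_year = 1990
x = [year - base_year for year in years]

# Steps 3-5: hand-fitted parameters from the worked solution.
slope = 26
intercept = 460

def model(x_value):
    """Linear model y = 26 x + 460, with x in years since 1990."""
    return slope * x_value + intercept

# Step 6: predict sales in 2010, which is 20 years after 1990.
prediction_2010 = model(2010 - base_year)
print(prediction_2010)  # 980

# Rough check of fit: the average of (data - model) should be small
# compared with the sales values if the line passes through the
# middle of the data.
residuals = [s - model(xi) for xi, s in zip(x, sales)]
mean_residual = sum(residuals) / len(residuals)
print(round(mean_residual, 1))
```

A small average gap only shows that the line is centered on the data; judging whether the slope follows the trend still relies on the graph, as in step 3.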

To see what trouble was avoided by redefining the input variable, use this model to “predict” sales in the year 0, which of course is 1990 years before 1990. Since 26∙(−1990) + 460 = −51,280, the corresponding sales model for the unmodified input values would be y = 26 x − 51,280, whose intercept value would be very difficult to guess.
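
The arithmetic in this paragraph can be sketched in a few lines of Python (an illustrative choice, not part of the original worksheet approach; the variable names are hypothetical). The sketch computes the intercept for the unmodified calendar-year input and confirms that both forms of the model give the same 2010 prediction.

```python
slope = 26
intercept_shifted = 460   # intercept when x = years since 1990
base_year = 1990

# The year 0 is 1990 years before 1990, so its shifted input is -1990.
intercept_unshifted = slope * (0 - base_year) + intercept_shifted
print(intercept_unshifted)  # -51280

# Both forms of the model agree on the prediction for 2010.
shifted_prediction = slope * (2010 - base_year) + intercept_shifted
unshifted_prediction = slope * 2010 + intercept_unshifted
print(shifted_prediction, unshifted_prediction)  # 980 980
```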

An alternative way of dealing with this same problem is to modify the model formula itself, substituting (x − 1990) for x. If this were done, then the same model would be y = 26∙(x − 1990) + 460. This approach has the advantage of not requiring a change in the data, which means that the horizontal scale on the graph would show the year number. On the other hand, the model would have a more complicated formula. Both techniques are used. We will examine this and other ways of modifying the model formula in a later topic.
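
As a hedged illustration of this second approach, the sketch below writes the shifted formula as a Python function that takes the calendar year directly. The function name `sales_model` and its keyword arguments are hypothetical, chosen only for this example.

```python
def sales_model(year, slope=26, intercept=460, base_year=1990):
    """Model y = 26 (x - 1990) + 460, where x is the calendar year."""
    return slope * (year - base_year) + intercept

# The input stays in calendar years, so no change to the data is needed.
print(sales_model(1990))  # 460
print(sales_model(2010))  # 980
```

The shift is carried inside the formula rather than in the data, which is the trade-off the paragraph describes: the data and graph keep their year labels, while the formula gains an extra term.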