I1.09: Section 6

Section 6: Using models fitted with Models.xls to predict values

The main purpose of using data to make a model formula is that the formula can then be used to compute predictions of what the output y would be for any input x. This can be used to predict the future, to make inferences about the past (prior to the first data point), to find intermediate values between data points, and even to make a better estimate of what value you would get for the same measurement if you repeated it at one of the input values you already used.

Extrapolation

The y = 1.8 x + 11 equation that fits the sediment-depth data well, for example, can be used to predict what sediment depth can be expected after day 80, the last day for which actual data was given. Using a model to predict the output for input values outside the range of the data is called extrapolation. (“Extra” comes from a Latin word meaning “outside of.”)

Example 7: What sediment depth does the y = 1.8 x + 11 model predict at 90 days after cleaning?

Answer: Evaluate the model formula at: [latex]x=90[/latex]

[latex]\begin{align}&y=1.8x+11\\&y=1.8\cdot(90)+11\\&y=162+11\\&y=173\text{ millimeters}\\\end{align}[/latex]

Warning: Extrapolation can be very useful, but is not always dependable since its accuracy depends on whether the process continues to change in the same way as it did during the time that the data for the model was taken. In general, people use extrapolation only for values that are within a limited distance from the last data point. It would be reasonable to extrapolate the sediment depth to 100 days, or perhaps even to 150 days, but not to 1000 days unless you have other information that indicates that the rate of increase is constant for that period.

Interpolation

The most dependable use of a model formula is to estimate what the output measurement would have been for some input value that is between two of the input values for the data. This process is called interpolation. Interpolation is dependably accurate if the model is a good model.

Example 8a: What sediment depth does the y = 1.8 x + 11 model predict at 37 days after cleaning?

Answer: Evaluate the model formula at: [latex]x=37[/latex]

[latex]\begin{align}&y=1.8x+11\\&y=1.8\cdot(37)+11\\&y=66.6+11\\&y=77.6\text{ millimeters}\\\end{align}[/latex]

Note that the computational process for interpolation is exactly the same as for extrapolation. This is characteristic of the use of a model. You treat any input value the same way – just plug it into the formula and evaluate the result. You may decide you don’t trust the answer (e.g., the 1000-day sediment-depth extrapolation), but the model gives answers the same way in all cases.
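The point that every input is handled the same way can be sketched in a few lines of Python. This is only an illustration, not part of Models.xls: the function names are made up for this sketch, and the first data day (0) is an assumption, while the last data day (80) comes from the text.

```python
def predict_depth(x, slope=1.8, intercept=11.0):
    """Sediment depth in mm predicted by y = 1.8x + 11 at x days after cleaning."""
    return slope * x + intercept


def kind_of_estimate(x, first_day, last_day):
    """Label an input value relative to the range of days covered by the data."""
    if x < first_day:
        return "backward extrapolation"
    if x > last_day:
        return "extrapolation"
    return "interpolation"


# last_day=80 is from the text; first_day=0 is an assumption for illustration.
for days in (37, 90, 1000):
    depth = predict_depth(days)
    label = kind_of_estimate(days, first_day=0, last_day=80)
    print(f"{days} days: {depth:.1f} mm ({label})")
```

The formula is evaluated identically in all three cases. Only the label changes, and deciding whether to trust an extrapolated value remains a matter of judgment.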

There is no need to limit interpolation or extrapolation to whole-number inputs when fractional inputs make sense. You could estimate the sediment depth at 22.5 days, or at 98.765 days. But use some judgment here – you would not want to estimate midnight traffic flow based on a history of measurements made at noon.

Example 8b: What sediment depth does the y = 1.8 x + 11 model predict at 56.73 days after cleaning?

Answer: Evaluate the model formula at: [latex]x=56.73[/latex]

[latex]\begin{align}&y=1.8x+11\\&y=1.8\cdot(56.73)+11\\&y=102.114+11\\&y=113.114\\&y\approx113.1\text{ millimeters}\\\end{align}[/latex]

Note that the final value is rounded to a precision consistent with the precision of the data.

Backwards extrapolation

You could even compute the model’s answer for an input value that comes before any of your data. That does not make sense for the sediment data (since we are told that the tank was changed abruptly by cleaning just before this data was taken), but in other situations it is often possible to make good estimates of what conditions were before data was taken.

Example 9: An accumulated coating of rust on the siding of a building is measured on June 1 for 15 successive years, and these thickness measurements are found to fit a linear model, where y is the thickness in millimeters and x is the number of years since the first of these measurements in 1987. Estimate what the thickness of the coating was on June 1, 1980.

Answer: Evaluate the model formula at x = –7, since that corresponds to the year 1980 in the formula.

[latex]\begin{align}&y=0.085x+1.52\\&y=0.085\cdot(-7)+1.52\\&y=-0.595+1.52\\&y=0.925\text{ millimeters}\\\end{align}[/latex]

As with forward extrapolation, you have to use judgment about how far away from the data you can depend on backward extrapolation. Example 9’s 7-years-before extrapolation is reasonable, but using this model to estimate the rust thickness in 1950 would not be reasonable, since that would give a negative thickness (which implies that the building may have been built after 1950).
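The negative thickness can be checked directly. Since 1950 is 37 years before the first measurement in 1987, the input is x = –37, and the rust-thickness model as stated (y = 0.085x + 1.52) gives:

[latex]\begin{align}&y=0.085x+1.52\\&y=0.085\cdot(-37)+1.52\\&y=-3.145+1.52\\&y=-1.625\text{ millimeters}\\\end{align}[/latex]

A thickness below zero is physically impossible, which is the signal that the model has been pushed too far back.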

What if the same data points are measured again?

For the sediment-depth data, the prediction of the model for sediment depth at 40 days after cleaning is 83 mm, which is 5.6 mm less than the actual data value of 88.6 mm. Which of these values would be best to use if we wanted to predict the depth at 40 days after some subsequent cleaning?

Deviations between data values and model predictions can come from two different kinds of sources:

  • Noise: The deviations may just be random variations in the process, in which case we will do better to use the model, since next time the deviation is just as likely to be in the other direction. In a sense, the model is more accurate than the data in this case. The noise-suppressing smoothing effect that a model provides is an important benefit of the modeling approach.
  • Oversimplified models: The relationship between the input and output variables for the process may not quite be a straight line, in which case even the best linear model will have errors that overestimate the data in some input ranges and underestimate it in others. If the linear model is oversimplified in this way, we will do better to use previous measurements. Better yet, we should use a non-linear model that has sufficient flexibility to follow the data more closely.

We can distinguish between these two cases by examination of the graph. If noise is the source of the deviations, the graph will show random placement of data points above and below the model. The deviations in the sediment-depth data were random, as shown by the graph from the fitting.

But when a straight-line model is used for data whose underlying relationship is curved, the model will pass above most data points in some part of the range, and below most points in other parts. In making predictions for such poor-model situations you should either use the previous measurements directly or, preferably, fit a more suitable model to the data and then use that model.
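The visual check described above can be roughly imitated in code by looking at the signs of the deviations (data value minus model prediction). The following Python sketch is a hedged illustration only: both data sets and the line y = 2x + 5 are hypothetical, invented to show the two patterns, and counting sign changes is a crude stand-in for examining the graph, not a method given in this course.

```python
def residuals(xs, ys, slope, intercept):
    """Data value minus model prediction at each input."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def sign_changes(values):
    """Count how often consecutive nonzero deviations switch sign."""
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# Hypothetical data for illustration only; none of it comes from the text.
xs = [0, 10, 20, 30, 40, 50, 60, 70, 80]
noisy = [7, 22, 46, 67, 83, 108, 124, 143, 166]    # scattered around y = 2x + 5
curved = [0, 23, 46, 68, 91, 108, 126, 143, 160]   # bends away from y = 2x + 5

for name, ys in (("noisy", noisy), ("curved", curved)):
    res = residuals(xs, ys, slope=2, intercept=5)
    print(name, res, "sign changes:", sign_changes(res))
```

With these made-up numbers, the noisy set switches sign six times, while the curved set switches only twice: below the line at both ends and above it in the middle. Long runs of same-sign deviations like that suggest the straight-line model is oversimplified.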