Section 1: Choosing a numerical indicator of how well a model fits a dataset
In an earlier topic we computed a numerical indicator of the noise of a measurement process, the standard deviation around the average measurement. If we had a similar indicator of the scatter of the data around a model, we could use it to guide our search for the best set of model parameters. Even better, we might be able to get a computer program to help in the search since computers are good at working with numbers. Our original method of judging a model, visually comparing the data and model graphs, is easy for people but beyond the current capabilities of computers.
While scatter indicators of this kind as often referred to as measuring “goodness of fit”, keep in mind that they are showing the amount of deviation, so smaller is better (these are actually badness-of-fit indicators). We will find the best model by adjusting the parameters to make the indicator as close to zero as possible.
We want to minimize both positive and negative deviations, since here we care about how close the data is to the model, but not whether it is higher or lower. In this situation, the most productive way to treat positive and negative deviations the same is to square each deviation (i.e., multiply it by itself, a process that produces positive results for both positive and negative numbers). We will then minimize some indicator based on these squared deviations.
In spreadsheets that arranged similarly to Models.xls (input data in column A, output data in column B, model prediction in column C, and data-model deviation in column D), it is natural to put these squared deviations in column E. This can be done by putting the formula “=D3^2” into cell E3, then spreading that formula down all the data rows.
Example 1: Add a squared-deviations row to the US-population spreadsheet made in an earlier topic
A | B | C | D | E | F | G | H | I | |
1 | x | Y | y | Residual | Squared | Exponential model: y = a * (1+r)^x | |||
2 | Year-1780 | Population | Model | values | Deviations | y = 2.8 * (1.03^x) | |||
3 | 0 | 2.8 | 2.80 | 0.0000 | 0.000000 | 2.8 | a: Initial value at x=0 | ||
4 | 10 | 3.9 | 3.76 | 0.1370 | 0.018778 | 0.03 | r: Growth rate | ||
5 | 20 | 5.3 | 5.06 | 0.2429 | 0.058995 | ||||
6 | 30 | 7.2 | 6.80 | 0.4037 | 0.162945 | ||||
7 | 40 | 9.6 | 9.13 | 0.4663 | 0.217430 | ||||
8 | 50 | 12.9 | 12.27 | 0.6251 | 0.390704 | ||||
9 | 60 | 17.1 | 16.50 | 0.6035 | 0.364226 | ||||
10 | 70 | 23.2 | 22.17 | 1.0301 | 1.061103 | ||||
11 | 80 | 31.4 | 29.79 | 1.6055 | 2.577651 | ||||
12 | 90 | 39.8 | 40.04 | -0.2413 | 0.058230 |
Column E now contains the formulas
“=D3^2” in E3
“=D4^2” in E4
“=D5^2” in E5
et cetera, for each data row
What number will be used to represent the squared-deviation values?
If we are going to make decisions about parameter settings by looking at a numerical goodness-of-fit indicator, that indicator must be a single number so we can decide which settings are better by choosing those that give the lowest indicator value. What indicator should be used to summarize this column of squared deviations? One obvious possibility, which turns out to be the most generally useful, is the sum of the squared deviations. This can be computed by applying the spreadsheet SUM function to column E, as by putting the formula “=SUM(E3:E12)” into some empty cell such as H8.
An alternative indicator that is useful in a few situations is simply the largest squared deviation (which comes from the largest deviation, whether positive or negative). This can be computed with the spreadsheet MAX function, using the formula “=MAX(E3:E12)” in this case. Other indicators are sometimes constructed that take into account the economic cost in a particular situation of different deviation sizes and/or directions. But the sum of the squared deviations is by far the most commonly used indicator for fitting models to data, and we will use it in this course except where another indicator is specifically asked for.
Example 2: Improve the earlier population model by using the sum-of-squared-deviations indicator.
The exponential-model parameter settings a = 2.8 and r = 0.03 that were found earlier fit the data pretty well, but of course other nearby settings of the parameters, such as would change the graph so little we could not tell by looking whether one of these models is slightly better than the other. But the numerical sum-of-squared-deviations calculation in H8 provides a way to choose among nearby model settings to get the very best fit.
[a] Here is the worksheet result from the initial model, which was chosen by setting the first parameter of the model to 2.8 (the data value at x = 0), then adjusting the growth rate until the graphs were close.
A | B | C | D | E | F | G | H | I | |
1 | x | y | y | Residual | Squared | Exponential model: y = a * (1+r)^x | |||
2 | Year-1780 | Population | Model | values | deviations | y = 2.8 * (1.03^x) | |||
3 | 0 | 2.8 | 2.80 | 0.0000 | 0.000000 | 2.8 | a: Initial value at x=0 | ||
4 | 10 | 3.9 | 3.76 | 0.1370 | 0.018778 | 0.03 | r: Growth rate | ||
5 | 20 | 5.3 | 5.06 | 0.2429 | 0.058995 | ||||
6 | 30 | 7.2 | 6.80 | 0.4037 | 0.162945 | ||||
7 | 40 | 9.6 | 9.13 | 0.4663 | 0.217430 | Goodness of fit for these settings | |||
8 | 50 | 12.9 | 12.27 | 0.6251 | 0.390704 | Sum of sq. | 4.910062 | ||
9 | 60 | 17.1 | 16.50 | 0.6035 | 0.364226 | ||||
10 | 70 | 23.2 | 22.17 | 1.0301 | 1.061103 | ||||
11 | 80 | 31.4 | 29.79 | 1.6055 | 2.577651 | ||||
12 | 90 | 39.8 | 40.04 | -0.2413 | 0.058230 |
So for a model of this data, the sum of squared deviations equals about 4.91.
[b] If the initial value for the model is increased by 0.1 million (to 2.9 million), the sum of squared deviations drops from about 4.91 to 3.24, a noticeable improvement. So the model can be considered slightly better.
A | B | C | D | E | F | G | H | I | |
1 | x | y | y | Residual | Squared | Exponential model: y = a * (1+r)^x | |||
2 | Year-1780 | Population | Model | values | deviations | y = 2.9 * (1.03^x) | |||
3 | 0 | 2.8 | 2.90 | -0.10 | 0.0100 | 2.9 | a: Initial value at x=0 | ||
4 | 10 | 3.9 | 3.90 | 0.00 | 0.0000 | 0.03 | r: Growth rate | ||
5 | 20 | 5.3 | 5.24 | 0.06 | 0.0039 | ||||
6 | 30 | 7.2 | 7.04 | 0.16 | 0.0259 | ||||
7 | 40 | 9.6 | 9.46 | 0.14 | 0.0196 | Goodness of fit for these settings | |||
8 | 50 | 12.9 | 12.71 | 0.19 | 0.0348 | Sum of sq. | 3.237818 | ||
9 | 60 | 17.1 | 17.09 | 0.01 | 0.0002 | ||||
10 | 70 | 23.2 | 22.96 | 0.24 | 0.0568 | ||||
11 | 80 | 31.4 | 30.86 | 0.54 | 0.2931 | ||||
12 | 90 | 39.8 | 41.47 | -1.67 | 2.7934 |
Further small adjustments to G3 might reduce the standard deviation further, but it would be very tedious to try all the settings needed to find the very best fit, especially since we would need to explore values for the growth rate in G4 as well. Additionally, since a change in one of the parameters will change the effect of the other parameter, several rounds of adjustment would be needed. This process would be even more difficult for models that have three or more parameters.
But tedious tasks are what computers do best. Most spreadsheets have a built-in routine that will systematically adjust parameters to a result you ask for (in this case, the minimum value for the goodness-of-fit indicator). We will use such a built-in routine, Solver, as a way to automate the model-fitting process.