L1.02: Section 1

Section 1: Choosing a numerical indicator of how well a model fits a dataset

In an earlier topic we computed a numerical indicator of the noise of a measurement process, the standard deviation around the average measurement. If we had a similar indicator of the scatter of the data around a model, we could use it to guide our search for the best set of model parameters. Even better, we might be able to get a computer program to help in the search since computers are good at working with numbers. Our original method of judging a model, visually comparing the data and model graphs, is easy for people but beyond the current capabilities of computers.

While scatter indicators of this kind as often referred to as measuring “goodness of fit”, keep in mind that they are showing the amount of deviation, so smaller is better (these are actually badness-of-fit indicators). We will find the best model by adjusting the parameters to make the indicator as close to zero as possible.

We want to minimize both positive and negative deviations, since here we care about how close the data is to the model, but not whether it is higher or lower. In this situation, the most productive way to treat positive and negative deviations the same is to square each deviation (i.e., multiply it by itself, a process that produces positive results for both positive and negative numbers). We will then minimize some indicator based on these squared deviations.

In spreadsheets that arranged similarly to Models.xls (input data in column A, output data in column B, model prediction in column C, and data-model deviation in column D), it is natural to put these squared deviations in column E. This can be done by putting the formula “=D3^2” into cell E3, then spreading that formula down all the data rows.

Example 1: Add a squared-deviations row to the US-population spreadsheet made in an earlier topic

 

  A B C D E F G H I
1 x Y y Residual Squared Exponential model: y = a * (1+r)^x
2 Year-1780 Population Model values Deviations y = 2.8 * (1.03^x)
3   0 2.8 2.80 0.0000 0.000000 2.8 a: Initial value at x=0
4 10 3.9 3.76 0.1370 0.018778 0.03 r: Growth rate
5 20 5.3 5.06 0.2429 0.058995  
6 30 7.2 6.80 0.4037 0.162945
7 40 9.6 9.13 0.4663 0.217430  
8 50 12.9 12.27 0.6251 0.390704
9 60 17.1 16.50 0.6035 0.364226
10 70 23.2 22.17 1.0301 1.061103  
11 80 31.4 29.79 1.6055 2.577651  
12 90 39.8 40.04 -0.2413 0.058230  

Column E now contains the formulas

“=D3^2” in E3

“=D4^2” in E4

“=D5^2” in E5

et cetera, for each data row

What number will be used to represent the squared-deviation values?

If we are going to make decisions about parameter settings by looking at a numerical goodness-of-fit indicator, that indicator must be a single number so we can decide which settings are better by choosing those that give the lowest indicator value. What indicator should be used to summarize this column of squared deviations? One obvious possibility, which turns out to be the most generally useful, is the sum of the squared deviations. This can be computed by applying the spreadsheet SUM function to column E, as by putting the formula “=SUM(E3:E12)” into some empty cell such as H8.

An alternative indicator that is useful in a few situations is simply the largest squared deviation (which comes from the largest deviation, whether positive or negative). This can be computed with the spreadsheet MAX function, using the formula “=MAX(E3:E12)” in this case. Other indicators are sometimes constructed that take into account the economic cost in a particular situation of different deviation sizes and/or directions. But the sum of the squared deviations is by far the most commonly used indicator for fitting models to data, and we will use it in this course except where another indicator is specifically asked for.

Example 2: Improve the earlier population model by using the sum-of-squared-deviations indicator.

The exponential-model parameter settings a = 2.8 and r = 0.03 that were found earlier fit the data pretty well, but of course other nearby settings of the parameters, such as would change the graph so little we could not tell by looking whether one of these models is slightly better than the other. But the numerical sum-of-squared-deviations calculation in H8 provides a way to choose among nearby model settings to get the very best fit.

[a] Here is the worksheet result from the initial model, which was chosen by setting the first parameter of the model to 2.8 (the data value at x = 0), then adjusting the growth rate until the graphs were close.

  A B C D E F G H I
1 x y y Residual Squared Exponential model: y = a * (1+r)^x
2 Year-1780 Population Model values deviations y = 2.8 * (1.03^x)
3   0 2.8 2.80 0.0000 0.000000 2.8 a: Initial value at x=0
4 10 3.9 3.76 0.1370 0.018778 0.03 r: Growth rate
5 20 5.3 5.06 0.2429 0.058995    
6 30 7.2 6.80 0.4037 0.162945   
7 40 9.6 9.13 0.4663 0.217430 Goodness of fit for these settings
8 50 12.9 12.27 0.6251 0.390704 Sum of sq. 4.910062
9 60 17.1 16.50 0.6035 0.364226
10 70 23.2 22.17 1.0301 1.061103    
11 80 31.4 29.79 1.6055 2.577651
12 90 39.8 40.04 -0.2413 0.058230   

So for a model of this data, the sum of squared deviations equals about 4.91.

[b] If the initial value for the model is increased by 0.1 million (to 2.9 million), the sum of squared deviations drops from about 4.91 to 3.24, a noticeable improvement. So the model can be considered slightly better.

  A B C D E F G H I
1 x y y Residual Squared Exponential model: y = a * (1+r)^x
2 Year-1780 Population Model values deviations y = 2.9 * (1.03^x)
3   0 2.8 2.90 -0.10 0.0100 2.9 a: Initial value at x=0
4 10 3.9 3.90 0.00 0.0000 0.03 r: Growth rate
5 20 5.3 5.24 0.06 0.0039  
6 30 7.2 7.04 0.16 0.0259  
7 40 9.6 9.46 0.14 0.0196 Goodness of fit for these settings
8 50 12.9 12.71 0.19 0.0348 Sum of sq. 3.237818
9 60 17.1 17.09 0.01 0.0002
10 70 23.2 22.96 0.24 0.0568  
11 80 31.4 30.86 0.54 0.2931
12 90 39.8 41.47 -1.67 2.7934  

Further small adjustments to G3 might reduce the standard deviation further, but it would be very tedious to try all the settings needed to find the very best fit, especially since we would need to explore values for the growth rate in G4 as well. Additionally, since a change in one of the parameters will change the effect of the other parameter, several rounds of adjustment would be needed. This process would be even more difficult for models that have three or more parameters.

But tedious tasks are what computers do best. Most spreadsheets have a built-in routine that will systematically adjust parameters to a result you ask for (in this case, the minimum value for the goodness-of-fit indicator). We will use such a built-in routine, Solver, as a way to automate the model-fitting process.