L1.09: Section 8

Section 8: Using alternative goodness-of-fit indicators

The standard deviation (or the sum of squared deviations on which it is based) is an example of what is called a “cost” function, a computed value that an adjustment or decision process tries to make as small as possible. While standard deviation is by far the most commonly used cost function in modeling processes, there are other possible ways to measure a model’s quality that can be more appropriate in some circumstances. Usually these different cost functions are designed to reflect differences in economic costs of different errors (such as when an overestimate has more expensive consequences than an under­estimate), to limit the effect of big deviations on the fitting process (so that a single bad value will not change the model vary much), or to allow for cases where the importance of a deviation depends on the size of the deviation relative to the data value (which is often true for exponential models).

If you use an alternative goodness-of-fit indicator to find a “best-fit” model, make an explicit statement about what indicator was minimized. This is needed both so that people can decide how to use the model, and so that they would be able to reproduce your results from the same data.

Using the maximum deviation as an alternative goodness-of-fit indicator

Standard deviation is a good description of the size of the typical deviation between a model and its data. But in some situations you may be more concerned about the maximum difference, rather than about the size of the typical difference. This can happen when you want to be able to identify a deviation amount that people can confidently expect will not be exceeded (although in that case you will usually need to allow some extra margin based on the typical random noise).

If you want to minimize the greatest difference between the population model and any individual population value, you would use the indicator “=MAX(E3:E12)” (placed in a spare cell such as H9) rather than “=SUM(E3:E12)”. If Solver is asked to minimize a cell with this formula (e.g., H9), it will adjust G3 and G4 (and any other parameters) in ways that may make the standard deviation higher, but will make the maximum distance above or below the data as small as is possible for the chosen kind of model.

Example 11: Find the exponential model that minimizes the maximum deviation from the census data.

  A B C D E F G H I
1 x y data y model Residual Squared Exponential model: y = a * (1+r)^x
2 Year-1780 Population Prediction deviation deviations MODEL PARAMETERS
3   0   2.8 3.420962 -0.620962 0.385594 3.420962 a: Initial value at x=0
4 10   3.9 4.501465 -0.601465 0.361761 0.0278937 r: Growth rate
5 20   5.3 5.923243 -0.623243 0.388432  
6 30   7.2 7.794087 -0.594087 0.352939  
7 40   9.6 10.25583 -0.655832 0.430116 Goodness of fit for these settings
8 50 12.9 13.49511 -0.595114 0.354161 Sum of sq dev 3.592537
9 60 17.1 17.75752 -0.657517 0.432328 Max sq dev 0.432328
10 70 23.2 23.36619 -0.166189 0.027619  
11 80 31.4 30.74635 0.653651 0.427259
12 90 39.8 40.45752 -0.657517 0.432328  

Note that the best-fit parameters using this indicator are slightly different from those that arose from minimizing the sum of squared deviations: a is now 3.42 rather than 3.11, and r is 2.79% instead of 2.89%. This would not make much visible difference in the graph (since the slightly-higher starting value is mostly balanced by the slightly-lower growth rate), but the maximum squared deviation has been cut about in half, to 0.43 rather than 0.95. On the other hand, the sum of squared deviations is more than twice as large (3.59 rather than 1.72), resulting in a standard deviation that is 44% larger. Such trade-offs are typical with different cost functions—the model will be better for what you focus on, but not quite as good for other purposes.

Relative standard deviation

A relative deviation is the data-minus-model difference divided by the value of the model. This ratio can be expressed as a decimal fraction or a percentage – a few percent would be typical of most situations. Relative deviations can be useful when the model will not have zero values (in which case the division would not work) and there is a natural zero to the data scale (so that the ratio has corresponds to some real relationship).

If we wish to use relative deviations in a worksheet like Models.xls, we replace the formula “=B3-C3” in cell D3 (and similarly other column-D cells) by the formula “=(B3-C3)/C3”. Nothing else in the worksheet need be changed. Now Solver will minimize the sum of squared relative deviations, since that is now what is being computed in H8. The relative standard deviation can be computed from that sum in same way as before, with the formula “=SQRT(H8/(10-2))” in this case, which has 10 data points and 2 parameters.

Example 12: Fit an exponential model to the population data, minimizing relative standard deviation; state the best-fit formula and the relative standard deviation.

Solution approach:

  1. Make a copy of the Exponential Model worksheet in Models.xls.
  2. Set cell D3 to the formula “=(B3-C3)/C3” to make column D relative deviations
  3. Copy the data to the worksheet and find the best-fit exponential model using Solver.
  4. Set H12 (or some other cell) to “=SQRT(H8/(10-2))”, to compute standard deviation
  5. Put appropriate labels in cells G8 and G10 to explain the contents of H8 and H10.
  6. Report the resulting formula and relative standard deviation.

The worksheet for the above example should look similar to this:

  A B C D E F G H I
1 x y data y model Residual Squared Exponential model: y = a * (1+r)^x
2 Year-1780 Population Prediction deviation deviations MODEL PARAMETERS
3   0   2.8 2.907038 -0.03682 0.001356 2.907038 a: Initial value at x=0
4 10   3.9 3.906720 -0.00172 2.96E-06 0.029997 r: Growth rate
5 20   5.3 5.250177 0.00949 9.01E-05
6 30   7.2 7.055627 0.02046 0.000419
7 40   9.6 9.481941 0.01245 0.000155 Goodness of fit for these settings
8 50 12.9 12.74263 0.01235 0.000153 Sum of sq rel dev 0.004275
9 60 17.1 17.12461 -0.00144 2.06E-06  
10 70 23.2 23.01348 0.00811 6.57E-05 Relative std dev 0.023117
11 80 31.4 30.92743 0.01528 0.000233
12 90 39.8 41.56287 -0.04242 0.001799  

The reason that these model parameters are slightly different from those that Solver found when minimizing the absolute standard deviation is that using relative standard deviation means that deviations from the smaller values count more in the fitting process than they did before. This is because a deviation of 0.1 million is about 3.5% of the 2.8 million for 1780, while that same 0.1 deviation is only 0.25% of the 39.8 million population 90 years later.

Exponential models are good possible uses for relative standard deviation since such models are based on constant-percentage relative growth or decay, and never have a zero value that would make it impossible to compute the ratio. However, absolute standard deviation is still the most usual fitting method even for this kind of model, and should be used unless you have a specific reason to use some other cost function.

Procedure for using Solver with a template from the Models.xls spreadsheet:

  1. Set up an appropriate worksheet:
    1. Make a copy of the template for the desired model, and put the data into columns A and B.
    2. Spread the formulas in C3, D3, and E3 down to the same number of rows as the data.
    3. Make a graph with a scatter plot that shows both the data and the model.
    4. Set the model parameters to a rough estimate, so that the two graphs are nearby.
  2. Select Solver from the Tools menu and fill out the dialog box it displays:
    1. Enter the name of the cell containing the standard deviation in the “Set Target Cell” section at the top of the Solver dialog box (this the bottom computed cell in column G).
    2. Choose the “Min” (for Minimum) option in the second line, since we want Solver to find the smallest possible value for the standard deviation.
    3. Designate the parameter cells (“G3,G4” for models with two parameters, “G3:G5” for models with three parameters, etc.) in the “By Changing Cells” section on the fourth line.
    4. Press the “Solve” button at top right.
  3. Check the graph to see if the model now is close to the data, with some on each side. If it is not, you have made a mistake (probably in step [ii]) and must repeat the process correctly. If the model goes through the data but is the wrong shape, you are probably using an inappropriate model.
  4. If the graph is acceptable, press “OK” to accept the solution that is found. The parameters (and thus the “Model” values in the column C) are now set to the best-fit values to a very high accuracy.