L1.06: Section 5

Section 5: Identifying and Removing Outliers in Data

Some datasets include a few points that do not fit into the overall pattern. These points are called “outliers”, and require special handling if a model is to be fit to the dataset. In general, the approach to modeling data with outliers is to exclude them from the fitting process but report them separately along with the model formula.

Sometimes outliers are simply mistakes in the data. Measurement is not a perfect process, and errors in a measurement device or in writing down the measurement results can lead to bad values in a dataset. One benefit of looking at the overall pattern of the data is that it usually will reveal any substantial errors of this kind.

On the other hand, some outliers are accurate measurements but report anomalies, situations that are different from the typical situation in which the other measurements were made. Measurements of Sunday pedestrian traffic in downtown Austin, for example, would show an outlier each spring due to the Capital 10K race.

It is important to report outliers, so that people depending on measurements similar to those that produced the data are alerted to the possibility of large deviations from the general trend of the data. Even if the outlier is a mistake, it is an indication that users should watch out for similar mistakes. If the outlier is an anomaly, it is possible that it is the most important part of the data. For example, we would want the people who design bridges to design for the maximum load and not the typical load.

When using Models.xls to find best-fit models, we can exclude outliers from the fitting process by erasing the content of the column E cell for each row that contains outlier data. This means that the deviation associated with the outlier value is not counted in computing the standard deviation, and thus does not influence the fitting process.

Example 7: Using the dataset shown to the right:

  1. Fit and report on a linear model using all of the data
  2. Fit and report on a linear model with the outlier excluded.

Solution approach:

  1. Copy the data to a Linear Model worksheet in Models.xls.
  2. Spread columns C, D, and E down to match the data, as usual.
  3. Make a graph of the data and model together, as usual.
  4. Use Solver to find the best-fit parameters and standard deviation.
  5. Erase cell E6, since the graph shows the data in row 6 is an outlier.
  6. Use Solver again, to find results with the outlier excluded.
Remaining fuel in engine tank
minutes liters
5 174.4
10 160.1
15 149.7
20 35.3
25 123.8
30 105.4
35 96.7
40 87.0
45 72.1
50 57.6

graph1

Notice that removal of the outlier makes a great difference in the size of the standard deviation, since the way standard deviations are computed emphasizes any large deviations.

Is this particular outlier a mistake or an anomaly? You can’t tell from the numbers, since any type of outlier consists of big deviations. But in this case the caption for the data indicates a process that logically must change smoothly. Thus this outlier is a mistake. Examination of the data suggests that the “35.3” y value for the outlier should have been about 100 higher, so perhaps an actual measurement of “135.3” was copied incorrectly.

Example 8: For the U.S. airline-passenger data provided to the right below:

  1. Report the best-fit linear model formula and its standard deviation, using all the data points provided.
  2. Examine the graph and list of deviations in column D to identify any data points which are outliers from the data trend from 1990 to 2000.
  3. Report the best-fit line and σ when the outliers are excluded.
  4. In this case, are the outlier points mistakes or anomalies?
  5. How many 2006 passengers would there have been (to the nearest million) if US air traffic had continued its 1990-2000 trend?
  6. Has the airline industry fully recovered its 1990-2000 trend? Why?
US Airline Traffic

Years

since 1990

Passengers
(millions)
0 465.6
1 452.2
2 473.3
3 487.2
4 528.4
5 547.4
6 581.2
7 598.9
8 612.9
9 635.4
10 665.5
11 622.1
12 612.9
13 646.5
14 702.9
15 738.6
16 744.6

Worksheet showing the result of fitting the data:

  A B C D E F            G                         H
1 X y data y model Residual Squared Linear Model: y = m * x + b
2 Year-1990 Passengers prediction deviation deviation y = 22.06643 x + 439.4858
3 0 465.6 439.49 26.11 681.9503 439.4858 b: Intercept
4 1 452.2 461.55 -9.35 87.4647 22.06643 m: Slope
5 2 473.3 483.62 -10.32 106.4753
6 3 487.2 505.69 -18.49 341.6996  
7 4 528.4 527.75 0.65 0.4205 Goodness of fit for these settings
8 5 547.4 549.82 -2.42 5.8466 sum of sq. dev. 1380.672
9 6 581.2 571.88 9.32 86.7801
10 7 598.9 593.95 4.95 24.4941 # of dev. used 11
11 8 612.9 616.02 -3.12 9.7174 # of parameters 2
12 9 635.4 638.08 -2.68 7.2023 Standard deviation 12.3858
13 10 665.5 660.15 5.35 28.6210
14 11 622.1 682.22 -60.12  
15 12 612.9 704.28 -91.38  
16 13 646.5 726.35 -79.85  
17 14 702.9 748.42 -45.52
18 15 738.6 770.48 -31.88
19 16 744.6 792.55 -47.95
20

graph2