Section 5: Identifying and Removing Outliers in Data
Some datasets include a few points that do not fit the overall pattern. These points are called “outliers” and require special handling if a model is to be fit to the dataset. In general, the approach to modeling data with outliers is to exclude them from the fitting process but report them separately, along with the model formula.
Sometimes outliers are simply mistakes in the data. Measurement is not a perfect process, and errors in a measurement device or in writing down the measurement results can lead to bad values in a dataset. One benefit of looking at the overall pattern of the data is that it usually will reveal any substantial errors of this kind.
On the other hand, some outliers are accurate measurements that reflect anomalies: situations that differ from the typical conditions under which the other measurements were made. Measurements of Sunday pedestrian traffic in downtown Austin, for example, would show an outlier each spring due to the Capitol 10K race.
It is important to report outliers, so that people depending on measurements similar to those that produced the data are alerted to the possibility of large deviations from the general trend of the data. Even if the outlier is a mistake, it is an indication that users should watch out for similar mistakes. If the outlier is an anomaly, it is possible that it is the most important part of the data. For example, we would want the people who design bridges to design for the maximum load and not the typical load.
When using Models.xls to find best-fit models, we can exclude outliers from the fitting process by erasing the content of the column E cell for each row that contains outlier data. This means that the deviation associated with the outlier value is not counted in computing the standard deviation, and thus does not influence the fitting process.
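The same idea can be sketched outside the spreadsheet. The Python below is only an illustration, not the Models.xls workbook itself: the function name `fit_line`, the `use` flags, and the data values are all made up for this sketch. It also assumes that minimizing the standard deviation is equivalent to an ordinary least-squares fit, and that the standard deviation is the square root of the sum of squared deviations divided by (number of deviations used minus 2).

```python
import math

def fit_line(xs, ys, use=None):
    """Least-squares line y = m*x + b over the points flagged True in `use`.

    Returns (m, b, sigma). sigma = sqrt(SSE / (n - 2)) is an assumed
    definition, where n counts only the points actually used.
    """
    if use is None:
        use = [True] * len(xs)
    pts = [(x, y) for x, y, keep in zip(xs, ys, use) if keep]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    m = sxy / sxx
    b = mean_y - m * mean_x
    sse = sum((y - (m * x + b)) ** 2 for x, y in pts)
    return m, b, math.sqrt(sse / (n - 2))

# Made-up data (not from the text): the line y = 100 - 2x with one bad value.
xs = [0, 10, 20, 30, 40, 50]
ys = [100.0, 80.0, 10.0, 40.0, 20.0, 0.0]    # third value should be near 60

all_m, all_b, all_sigma = fit_line(xs, ys)    # fit with every point

# Flagging a point False plays the role of erasing its column E cell.
use = [True, True, False, True, True, True]
m, b, sigma = fit_line(xs, ys, use)           # fit with the outlier excluded

print(f"all data:        m={all_m:.2f}, b={all_b:.1f}, sigma={all_sigma:.1f}")
print(f"outlier removed: m={m:.2f}, b={b:.1f}, sigma={sigma:.1f}")
```

The excluded point stays in the data list, so it can still be reported alongside the model; it simply contributes nothing to the fit.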
Example 7: Using the dataset shown to the right:
- Fit and report on a linear model using all of the data.
- Fit and report on a linear model with the outlier excluded.
- Copy the data to a Linear Model worksheet in Models.xls.
- Spread columns C, D, and E down to match the data, as usual.
- Make a graph of the data and model together, as usual.
- Use Solver to find the best-fit parameters and standard deviation.
- Erase cell E6, since the graph shows the data in row 6 is an outlier.
- Use Solver again, to find results with the outlier excluded.
|Remaining fuel in engine tank|
- With all the data, the best-fit linear model is [latex]y=-2.20x+166.8[/latex], σ = 33.16
- Without the outlier at (x = 20, y = 35.3), the best-fit linear model is [latex]y=-2.56x+186.8[/latex], σ = 1.91
Notice that removal of the outlier makes a great difference in the size of the standard deviation, since the way standard deviations are computed emphasizes any large deviations.
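A small numerical sketch may make this emphasis concrete. The deviations below are made up for illustration (they are not the Example 7 values), and a plain root-mean-square is used as a stand-in for the standard deviation.

```python
import math

# Made-up deviations (not from the text): five small ones and one large one.
deviations = [2.0, -1.0, 1.5, -2.5, 1.0, -100.0]

squares = [d ** 2 for d in deviations]
share_of_largest = max(squares) / sum(squares)   # fraction of the total due to one point

def rms(values):
    """Root-mean-square of a list of deviations."""
    return math.sqrt(sum(v ** 2 for v in values) / len(values))

rms_all = rms(deviations)
rms_without_largest = rms(deviations[:-1])       # drop the large deviation

print(f"largest deviation's share of the squared total: {share_of_largest:.1%}")
print(f"RMS with it: {rms_all:.1f}, without it: {rms_without_largest:.1f}")
```

Because each deviation is squared before it is summed, a single deviation fifty times larger than the rest contributes thousands of times more to the total than any of the others.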
Is this particular outlier a mistake or an anomaly? You can’t tell from the numbers alone, since both kinds of outlier show up as large deviations. But in this case the caption for the data indicates a process that logically must change smoothly: the fuel remaining in a tank cannot drop sharply and then return to its earlier trend. Thus this outlier is a mistake. Examination of the data suggests that the “35.3” y value for the outlier should have been about 100 higher, so perhaps an actual measurement of “135.3” was copied incorrectly.
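The "about 100 higher" estimate can be checked by evaluating the outlier-excluded model at the outlier's x value. This short Python sketch uses only the numbers reported above; the variable names are chosen here for illustration.

```python
# Best-fit line with the outlier excluded, as reported in the text.
m, b = -2.56, 186.8

x_out, y_out = 20, 35.3           # the outlier point
predicted = m * x_out + b          # what the trend predicts at x = 20
shortfall = predicted - y_out      # how far the recorded value falls below the trend

print(f"predicted = {predicted:.1f}, shortfall = {shortfall:.1f}")
# prints "predicted = 135.6, shortfall = 100.3"
```

A shortfall of almost exactly 100 is what a dropped leading digit would produce, which supports the copying-error explanation.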
Example 8: For the U.S. airline-passenger data provided to the right below:
- Report the best-fit linear model formula and its standard deviation, using all the data points provided.
- Examine the graph and list of deviations in column D to identify any data points which are outliers from the data trend from 1990 to 2000.
- Report the best-fit line and σ when the outliers are excluded.
- In this case, are the outlier points mistakes or anomalies?
- How many 2006 passengers would there have been (to the nearest million) if US air traffic had continued its 1990-2000 trend?
- Has the airline industry fully recovered its 1990-2000 trend? Why?
|US Airline Traffic|
- The all-data model is [latex]y=17.6x+454.3[/latex], σ = 25.1 million passengers.
- The points for the years 2001, 2002, and 2003 are clearly outliers. The points for 2004, 2005, and 2006 are also low but are closer to the 1990-2000 trend; borderline cases of this kind are a matter of judgment, and either analysis is correct as long as you describe what decisions you have made.
- The best-fit linear model for the 1990-2000 data is [latex]y=22.1x+439.5[/latex], with σ = 12.4 million passengers.
- These outliers are anomalies, which reflect a real change in conditions.
- The 1990-2000 trend predicts about 793 million passengers for 2006 (cell C19 of the 1990-2000 model).
- The airline industry has not fully recovered, because actual 2006 air traffic of 745 million was 48 million passengers below the pre-2001 trend, and all actual values since 2001 have been lower than the pre-2001 trend.
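The 2006 projection and the shortfall can be reproduced with a few lines of arithmetic. This Python sketch uses the rounded 1990-2000 model coefficients and the 2006 actual value given in the answers; the variable names are illustrative.

```python
# 1990-2000 best-fit line: x is years since 1990, y is millions of passengers.
m, b = 22.1, 439.5

x_2006 = 2006 - 1990                 # 16 years after 1990
trend_2006 = m * x_2006 + b          # where the old trend would have put 2006
actual_2006 = 745                    # millions of passengers, from the text

gap = round(trend_2006) - actual_2006

print(f"trend: {round(trend_2006)} million, shortfall: {gap} million")
# prints "trend: 793 million, shortfall: 48 million"
```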
Worksheet showing the result of fitting the data:
|1|X|y data|y model|Residual|Squared|Linear Model: y = m * x + b| |
|2|Year-1990|Passengers|prediction|deviation|deviation|y = 22.06643 x + 439.4858| |
|7|4|528.4|527.75|0.65|0.4205|Goodness of fit for these settings| |
|8|5|547.4|549.82|-2.42|5.8466|sum of sq. dev.|1380.672|
|10|7|598.9|593.95|4.95|24.4941|# of dev. used|11|
|11|8|612.9|616.02|-3.12|9.7174|# of parameters|2|
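The goodness-of-fit entries on the worksheet are enough to recover the reported σ. The sketch below assumes that the standard deviation is the square root of the sum of squared deviations divided by (number of deviations used minus number of parameters); that formula is not stated in this section, but it reproduces the reported value of 12.4.

```python
import math

sum_sq_dev = 1380.672    # "sum of sq. dev." on the worksheet
n_dev = 11               # "# of dev. used" (the 1990-2000 points)
n_params = 2             # "# of parameters" (m and b)

# Assumed formula: sigma = sqrt(SSE / (n - p)).
sigma = math.sqrt(sum_sq_dev / (n_dev - n_params))

print(round(sigma, 1))   # prints 12.4
```

Only 11 deviations are counted because the points after 2000 had their column E cells erased, so they do not enter the sum.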