What to Know About Five Number Summary in Box Plots and Datasets: 4D – 26

Boxplots are helpful for visualizing the distribution of a quantitative variable. A boxplot clearly shows the median of the data and provides a summary at a glance of the bulk of the data and the presence of outliers. In the next activity, you will need to be able to identity and interpret the features of a boxplot, identify outliers in a dataset, and relate a boxplot of a quantitative variable to its distribution. In this section, you’ll learn to identify the key pieces of information needed to accomplish these tasks.

Features of a Boxplot

In order to interpret boxplots, you will need to identify the minimum, maximum, and median of a quantitative variable. You’ve done this in previous activities. If you need a refresher, take a look at the video below. A boxplot captures only the median of the dataset, not the mean, as a measure of center.

recall

Core skill:

Five-Number Summary

You will also need to know the following definitions:

  • the first quartile of a quantitative variable (sometimes denoted Q1) is the value below which one quarter of the data lies, and the first quartile is also equal to the 25th percentile;
  • the third quartile of a quantitative variable (sometimes denoted Q3) is the value below which three quarters of the data lay, and the third quartile is also equal to the 75th percentile; and
  • the interquartile range (sometimes denoted IQR) of a quantitative variable is the quantity Q3–Q1.

The collection of the minimum, first quartile, median, third quartile, and maximum form the five-number summary of the variable.

first and third quartiles

[Perspective video — a 3 instructor video showing how to understand Q1 and Q3 as percentiles and/or quarters of data. See below for the idea:]

  • the location of the Q1/25th percentile and Q3/75th percentile on a number line along with other percentile locations such as 10th and 98th along with three ways to think about it:
    • 1)  “if a student scores in the 10th percentile of a test like the SAT, they have scored higher than only 10% of all the test takers but if they score in the 98th percentile, then their score is higher than 98% of all the test takers.” and
    • 2) “percentiles divide data into two parts — the lower part (she scored higher than 98% of the test takers) and the higher part (2% of the test takers scored higher than she did)” and 3) “the 25th percentile (first quartile) splits the data into the lower 25% and the 75% above that; the 50th percentile (2nd quartile) splits the data in half (marked by the median); the 75th percentile (3rd quartile)splits the data into the lower 75% and the 25% of the data above that.”
    • 3) Subtracting the value of Q1 from the value of Q3 gives the IQR (the distance between the 25th percentile and the 75th percentile)
  • (critics may point out that students will have seen all of this before, which is true but doesn’t acknowledge that students also need a brief refresher at this point.)

Identifying the Features of a Boxplot

As we explore the features of boxplots, we will work with part of a dataset that reports information about whether drivers involved in a fatal crash were impaired by alcohol.[1] The dataset contains 51 entries corresponding to all 50 states, as well as Washington, DC.

The following table gives the five-number summary for the percentage of drivers involved in fatal collisions who were alcohol-impaired in all 50 states and Washington, DC.

Minimum First Quartile Median Third Quartile Maximum
16 28 30 33 44

One of the ways to visualize the data using the five-number summary is by creating a boxplot. For questions 1 – 4, refer to the following boxplot, which depicts data about the percentage of drivers involved in fatal collisions who were alcohol-impaired in all 50 states and Washington, DC. The boxplot is superimposed with the letters A – G labeling different features of the plot.

A vertical boxplot titled "Percentage of drivers involved in fatal collisions who were alcohol-impaired." The vertical axis is numbered by increments of 5 from 15 to 50. On the graph, there are points at16, 41, 42, and 44. The point at 44 is labeled "A." The high point of the box plot is at 38 and labeled "B," while the low point is at 23 and labeled "F." The high end of the box is at 33 and labeled "C" while the low end is at 28 and labeled "E." The middle line is at 30 and labeled "D."

question 1

Match the labeled feature on the above boxplot to the term that describes it.

Term Boxplot Feature
Minimum
First quartile (Q1)
Median
Third quartile (Q3)
Maximum

For questions 2 -4, complete each sentence using information from the boxplot above.

question 2

In about half of the states, fewer than _______ of drivers involved in a fatal crash were impaired by alcohol.

a) 23%

b) 28%

c) 30%

d) 33%

e) 38%

f) 44%

question 3

In about one quarter of the states, fewer than _______ of drivers involved in a fatal crash were impaired by alcohol.

a) 23%

b) 28%

c) 30%

d) 33%

e) 38%

f) 44%

question 4

_______ of the states had alcohol involved in 33% or more of their fatal crashes.

a) One-fourth

b) One half

c) Three-fourths

Interquartile Range and Outliers

Now, let’s define the idea of an outlier more precisely. Previously, we’ve seen that an outlier is a value that is unusual, given the other values in a dataset. But what does “unusual” mean? To be more precise, for data with only one variable, let’s define the define the following:

  • upper outlier as an observation that is greater than Q3 + 1.5 × (IQR); and
  • lower outlier as an observation that is less than Q1 – 1.5 × (IQR).

Use these definitions with the boxplot above question 1 to complete the sentences in questions 5 and 6.

identifying features of a boxplot

[Worked example – a 3-instructor video showing a worked example similar to Questions 5 – 7]

question 5

The interquartile range (IQR) of this dataset is ______.

 

question 6

Recall that we say that upper outliers lie above Q3 + 1.5 × (IQR), and lower outliers lie below Q1 – 1.5 × (IQR). Because of this, states with more than ________% of fatal crashes involving alcohol impairment are considered upper outliers, and states with fewer than ________% of fatal crashes involving alcohol impairment are considered lower outliers.

question 7

Again, referring to the boxplot above Question 1, we saw previously how some of the boxplot’s features relate to the five-number summary, but when outliers are present, the boxplot is modified as shown below. On the following table, match the labeled feature on the boxplot to the term that describes it.

A vertical boxplot titled "Percentage of drivers involved in fatal collisions who were alcohol-impaired." The vertical axis is numbered by increments of 5 from 15 to 50. On the graph, there is a point at 16 labeled "D." There are also points at 41, 42, and 44 collectively labeled "A." The high point of the box plot is at 38 and labeled "B," while the low point is at 23 and labeled "C." The high end of the box is at 33 while the low end is at 28. The middle line is at 30.

Term Boxplot Feature
Upper outlier(s)
Lower outlier(s)
Greatest value of an observation that is not an upper outlier
Lowest value of an observation that is not a lower outlier

It’s important to note that there are several good methods to use for determining an observation to be an outlier in the distribution. The IQR method commonly uses a distance 1.5 times IQR from Q1 or Q3, but certain applications use larger distances. The IQR method does work for skewed distributions, though. In the next section, you’ll learn about another method that doesn’t involve the IQR, and which works well for symmetrical distributions. Depending upon the application, it may be desirable to set the distance 2 or even 3 times IQR, but 1.5 times is commonly used and works well for our application, so we use it here.

Interpreting the Features of a Boxplot

The following table lists each state in the dataset, along with the corresponding percentages of drivers involved in fatal crashes who were impaired by alcohol, in order from lowest percentage to highest percentage. Use this table and the definition of outlier to answer questions 8 -9.

Drivers Involved in Fatal Crashes by State
State Percentage of Drivers Involved in Fatal Crashes and Impaired by Alcohol State Percentage of Drivers Involved in Fatal Crashes and Impaired by Alcohol
Utah 16 Maine 30
Kentucky 23 New Hampshire 30
Kansas 24 Vermont 30
Alaska 25 Mississippi 31
Georgia 25 North Carolina 31
Iowa 25 Pennsylvania 31
Arkansas 26 Maryland 32
Oregon 26 Nevada 32
District of Columbia 27 Wyoming 32
New Mexico 27 Louisiana 33
Virginia 27 South Dakota 33
Arizona 28 Washington 33
California 28 Wisconsin 33
Colorado 28 Illinois 34
Michigan 28 Missouri 34
New Jersey 28 Ohio 34
West Virginia 28 Massachusetts 35
Florida 29 Nebraska 35
Idaho 29 Connecticut 36
Indiana 29 Rhode Island 38
Minnesota 29 Texas 38
New York 29 Hawaii 41
Oklahoma 29 South Carolina 41
Tennessee 29 North Dakota 42
Alabama 30 Montana 44
Delaware 30

 

question 8

Which state(s) in the list below is a lower outlier? In other words, which has an unusually low percentage of drivers involved in fatal crashes who were impaired by alcohol? Choose all that apply.

a) Kentucky

b) Kansas

c) Utah

d) Alaska

question 9

Which of the following states have unusually high percentages of drivers involved in fatal crashes who were impaired by alcohol?

a) Texas, South Carolina, Montana

b) Montana, North Dakota, South Carolina, Hawaii

c) Montana, North Dakota, South Carolina

d) Texas, South Carolina, Montana, Rhode Island

question 10

Without computing the mean of the percentage of drivers involved in fatal crashes who were impaired by alcohol, make a prediction about whether the mean and median will be very different or fairly similar.

 

Now, let’s use technology to explore the dataset.

Go to the Describing and Exploring Quantitative Variables tool at https://dcmathpathways.shinyapps.io/EDA_quantitative/.

Step 1) Select the Single Group tab.

Step 2) Locate the dropdown under Enter Data and select From Textbook.

Step 3) Locate the drop-down menu under Dataset and select Bad Drivers (alcohol).

Step 4) Use the tool to compute the mean percentage of drivers involved in fatal collisions who were alcohol-impaired.

question 11

Which of the following describes your findings?

a) The mean is much higher than the median.

b) The median is much higher than the mean.

c) The mean and the median are about the same.

Identifying the Shape of a Distribution from a Boxplot

question 12

Just as histograms and dotplots can tell us about the distribution of a quantitative variable, so can a boxplot. For each boxplot below, choose the description that matches the shape of the data’s distribution. (Note that boxplots can be oriented vertically, as we saw previously, or horizontally, as we see below.)

 

Boxplot Distribution
A boxplot numbered in increments of 50 from 0 to 350. The low point of the plot is at 50 and the high point is at approximately 290. The low end of the box is at approximately 140, the high end is at approximately 210, and the middle line is at approximately 180. There is also a point at 0 and one at approximately 340. a) left skewed

b) symmetric

c) right skewed

include dropdown options similar to Question 10 in WTK 4C

A boxplot numbered in increments of 5 from 0 to 25. The low end of the plot is at 0 and the high end is at approximately 8. The low edge of the box is at approximately 1, while the high edge is at approximately 4 and the center line is at approximately 2. There are also points at approximately 9, 10, 12, 13, 14, 15, 16, 18, 20, 22, and 24. a) left skewed

b) symmetric

c) right skewed

include dropdown options similar to Question 10 in WTK 4C

A box plot labeled in increments of 5 from 35 to 60. The low point of the box plot is at 55 and the high point is at approximately 61. The low end of the box is at approximately 56, the high end is at approximately 60, and the middle line is at approximately 58. There are also points at approximately 35, 36, 37, and 38. a) left skewed

b) symmetric

c) right skewed

include dropdown options similar to Question 10 in WTK 4C

Summary

In this section, you’ve learned about boxplots: how to calculate the five-number summary, how to read these numbers from a boxplot, and how to identify outliers in a dataset using the interquartile range. Let’s summarize where these skills showed up in the material.

  • In Questions 1 and 7, you identified the features of a boxplot.
  • In Questions 2 – 4, you interpreted the features of a boxplot.
  • In Questions 5, 6, 8, and 9, you identified outliers in a dataset.
  • In Questions 10 – 12, you related the boxplot of a quantitative variable to its distribution.

Being able to calculate and identify features of a boxplot and relate the boxplot and distribution of a quantitative variable are necessary statistical skills and will be used in the next activity. If you feel comfortable with these skills, please move on!


  1. Chalabi, M. (2014, October 24). Dear Mona, which state has the worst driver? FiveThirtyEight. https://fivethirtyeight.com/features/which-state-has-the-worst-drivers/