What to Know About Five Number Summary in Box Plots and Datasets: 4D – 26

goals for this section

After completing this section, you should feel comfortable performing these skills.

Define the terms: first quartile, third quartile, interquartile range, and five-number summary.
Identify the features of a boxplot.
Calculate interquartile range for a dataset.
Calculate the range of observations characterized as upper outliers or lower outliers.
Interpret the features of a boxplot.
Use a boxplot of a dataset to identify whether the shape of its distribution is left-skewed, symmetric, or right-skewed.

Boxplots are helpful for visualizing the distribution of a quantitative variable. A boxplot clearly shows the median of the data and provides a summary at a glance of the bulk of the data and the presence of outliers. In the next activity, you will need to be able to identify and interpret the features of a boxplot, identify outliers in a dataset, and relate a boxplot of a quantitative variable to its distribution. In this section, you’ll learn to identify the key pieces of information needed to accomplish these tasks.

Features of a Boxplot

In order to interpret boxplots, you will need to identify the minimum, maximum, and median of a quantitative variable. You’ve done this in previous activities. If you need a refresher, take a look at the video below. A boxplot captures only the median of the dataset, not the mean, as a measure of center.

recall

Core skill:

Identify the minimum value, maximum value, and median of a quantitative variable.

Five-Number Summary

You will also need to know the following definitions:

the first quartile of a quantitative variable (sometimes denoted Q1) is the value below which one quarter of the data lies, and the first quartile is also equal to the 25th percentile;
the third quartile of a quantitative variable (sometimes denoted Q3) is the value below which three quarters of the data lie, and the third quartile is also equal to the 75th percentile; and
the interquartile range (sometimes denoted IQR) of a quantitative variable is the quantity Q3 – Q1.

The collection of the minimum, first quartile, median, third quartile, and maximum forms the five-number summary of the variable.
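As a minimal sketch of these definitions, the Python snippet below computes a five-number summary and the IQR for a small made-up dataset. The values and the function name `five_number_summary` are illustrative assumptions, not part of the course materials. Software packages use slightly different conventions for locating quartiles, so results for other datasets may differ a little from hand calculations or from other tools.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # With n=4, statistics.quantiles returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

# Hypothetical data, chosen only to illustrate the definitions.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15]

minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # interquartile range: Q3 - Q1

print("Five-number summary:", minimum, q1, median, q3, maximum)
print("IQR:", iqr)
```

For this made-up dataset, the first quartile is 4, the median is 8, and the third quartile is 12, so the IQR is 8.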

first and third quartiles

[Perspective video – a 3-instructor video showing how to understand Q1 and Q3 as percentiles and/or quarters of data. See below for the idea:]

The video shows the location of Q1 (the 25th percentile) and Q3 (the 75th percentile) on a number line, along with other percentile locations such as the 10th and 98th, and offers three ways to think about percentiles:
- 1) “If a student scores in the 10th percentile of a test like the SAT, they have scored higher than only 10% of all the test takers, but if they score in the 98th percentile, then their score is higher than 98% of all the test takers.”
- 2) “Percentiles divide data into two parts: the lower part (she scored higher than 98% of the test takers) and the higher part (2% of the test takers scored higher than she did).”
- 3) “The 25th percentile (first quartile) splits the data into the lower 25% and the 75% above that; the 50th percentile (second quartile) splits the data in half (marked by the median); the 75th percentile (third quartile) splits the data into the lower 75% and the 25% of the data above that.”
- Subtracting the value of Q1 from the value of Q3 gives the IQR (the distance between the 25th percentile and the 75th percentile).
(critics may point out that students will have seen all of this before, which is true but doesn’t acknowledge that students also need a brief refresher at this point.)
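To make the percentile interpretation concrete, here is a small Python sketch that reports the percentage of values falling below a given score. The scores and the helper name `percent_below` are hypothetical, and this is only one of several common ways to define a percentile rank.

```python
def percent_below(values, score):
    """Percentage of observations strictly less than `score`."""
    count_below = sum(1 for v in values if v < score)
    return 100 * count_below / len(values)

# Hypothetical test scores for 20 students (not real SAT data).
scores = [410, 450, 480, 500, 520, 540, 550, 560, 580, 600,
          610, 630, 650, 670, 690, 710, 730, 760, 780, 800]

print(percent_below(scores, 480))  # share of students scoring below 480
print(percent_below(scores, 800))  # share of students scoring below 800
```

In this made-up group, a score of 480 is higher than 10% of the scores, and a score of 800 is higher than 95% of them.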

Identifying the Features of a Boxplot

As we explore the features of boxplots, we will work with part of a dataset that reports information about whether drivers involved in a fatal crash were impaired by alcohol.^[1] The dataset contains 51 entries corresponding to all 50 states, as well as Washington, DC.

The following table gives the five-number summary for the percentage of drivers involved in fatal collisions who were alcohol-impaired in all 50 states and Washington, DC.

Minimum	First Quartile	Median	Third Quartile	Maximum
16	28	30	33	44

One of the ways to visualize the data using the five-number summary is by creating a boxplot. For questions 1 – 4, refer to the following boxplot, which depicts data about the percentage of drivers involved in fatal collisions who were alcohol-impaired in all 50 states and Washington, DC. The boxplot is superimposed with the letters A – G labeling different features of the plot.

question 1

Match the labeled feature on the above boxplot to the term that describes it.

Term	Boxplot Feature
Minimum
First quartile (Q1)
Median
Third quartile (Q3)
Maximum

For questions 2 – 4, complete each sentence using information from the boxplot above.

question 2

In about half of the states, fewer than _______ of drivers involved in a fatal crash were impaired by alcohol.

a) 23%

b) 28%

c) 30%

d) 33%

e) 38%

f) 44%

question 3

In about one quarter of the states, fewer than _______ of drivers involved in a fatal crash were impaired by alcohol.

a) 23%

b) 28%

c) 30%

d) 33%

e) 38%

f) 44%

question 4

_______ of the states had alcohol involved in 33% or more of their fatal crashes.

a) One-fourth

b) One-half

c) Three-fourths

Interquartile Range and Outliers

Now, let’s define the idea of an outlier more precisely. Previously, we’ve seen that an outlier is a value that is unusual, given the other values in a dataset. But what does “unusual” mean? To be more precise, for data with only one variable, let’s define the following:

upper outlier as an observation that is greater than Q3 + 1.5 × (IQR); and
lower outlier as an observation that is less than Q1 – 1.5 × (IQR).
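As a worked illustration with made-up quartiles (not the crash data), suppose a dataset has Q1 = 50 and Q3 = 62. The cutoffs then follow directly from the definitions above:

```latex
\begin{align*}
\text{IQR} &= Q_3 - Q_1 = 62 - 50 = 12 \\
\text{upper cutoff} &= Q_3 + 1.5 \times \text{IQR} = 62 + 1.5 \times 12 = 80 \\
\text{lower cutoff} &= Q_1 - 1.5 \times \text{IQR} = 50 - 1.5 \times 12 = 32
\end{align*}
```

In this hypothetical dataset, any observation greater than 80 would be an upper outlier, and any observation less than 32 would be a lower outlier.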

Use these definitions with the boxplot above question 1 to complete the sentences in questions 5 and 6.

identifying features of a boxplot

[Worked example – a 3-instructor video showing a worked example similar to Questions 5 – 7]

question 5

The interquartile range (IQR) of this dataset is ______.

question 6

Recall that we say that upper outliers lie above Q3 + 1.5 × (IQR), and lower outliers lie below Q1 – 1.5 × (IQR). Because of this, states with more than ________% of fatal crashes involving alcohol impairment are considered upper outliers, and states with fewer than ________% of fatal crashes involving alcohol impairment are considered lower outliers.

question 7

Refer again to the boxplot above Question 1. We saw previously how some of the boxplot’s features relate to the five-number summary, but when outliers are present, the boxplot is modified as shown below. In the following table, match the labeled feature on the boxplot to the term that describes it.

Term	Boxplot Feature
Upper outlier(s)
Lower outlier(s)
Greatest value of an observation that is not an upper outlier
Lowest value of an observation that is not a lower outlier

It’s important to note that there are several good methods for deciding whether an observation is an outlier in a distribution. The IQR method commonly uses a distance of 1.5 times the IQR from Q1 or Q3. Depending on the application, it may be desirable to set the distance at 2 or even 3 times the IQR, but 1.5 times is commonly used and works well for our application, so we use it here. The IQR method also works for skewed distributions. In the next section, you’ll learn about another method that doesn’t involve the IQR and that works well for symmetric distributions.
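The effect of the multiplier can be sketched in a few lines of Python. The quartiles and observations below are hypothetical, and `outlier_cutoffs` is an illustrative helper rather than part of any course tool. The point is only that a larger multiplier widens the cutoffs, so fewer observations are flagged.

```python
def outlier_cutoffs(q1, q3, multiplier=1.5):
    """Return (lower cutoff, upper cutoff) for the IQR method."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

# Hypothetical quartiles, for illustration only.
q1, q3 = 50, 62
# A few selected observations (not the full dataset) to check against the cutoffs.
observations = [20, 35, 48, 55, 60, 66, 84, 90, 101]

for multiplier in (1.5, 2, 3):
    low, high = outlier_cutoffs(q1, q3, multiplier)
    flagged = [x for x in observations if x < low or x > high]
    print(f"multiplier {multiplier}: cutoffs ({low}, {high}), flagged {flagged}")
```

With these made-up numbers, a multiplier of 1.5 flags 20, 84, 90, and 101; a multiplier of 2 flags 20, 90, and 101; and a multiplier of 3 flags only 101.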

Interpreting the Features of a Boxplot

The following table lists each state in the dataset, along with the corresponding percentages of drivers involved in fatal crashes who were impaired by alcohol, in order from lowest percentage to highest percentage. Use this table and the definition of outlier to answer questions 8 – 9.

Drivers Involved in Fatal Crashes by State
State	Percentage of Drivers Involved in Fatal Crashes and Impaired by Alcohol	State	Percentage of Drivers Involved in Fatal Crashes and Impaired by Alcohol
Utah	16	Maine	30
Kentucky	23	New Hampshire	30
Kansas	24	Vermont	30
Alaska	25	Mississippi	31
Georgia	25	North Carolina	31
Iowa	25	Pennsylvania	31
Arkansas	26	Maryland	32
Oregon	26	Nevada	32
District of Columbia	27	Wyoming	32
New Mexico	27	Louisiana	33
Virginia	27	South Dakota	33
Arizona	28	Washington	33
California	28	Wisconsin	33
Colorado	28	Illinois	34
Michigan	28	Missouri	34
New Jersey	28	Ohio	34
West Virginia	28	Massachusetts	35
Florida	29	Nebraska	35
Idaho	29	Connecticut	36
Indiana	29	Rhode Island	38
Minnesota	29	Texas	38
New York	29	Hawaii	41
Oklahoma	29	South Carolina	41
Tennessee	29	North Dakota	42
Alabama	30	Montana	44
Delaware	30
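
As a quick check, the five-number summary reported earlier can be reproduced from the 51 percentages in this table. The Python sketch below is not part of the course tool. It assumes one common convention in which, for 51 sorted values, the quartiles sit at positions (51 + 1)/4 = 13, 26, and 39. Other quartile conventions can give slightly different answers for other datasets.

```python
# Percentages from the table above, in ascending order (51 values).
percentages = [
    16, 23, 24, 25, 25, 25, 26, 26, 27, 27, 27, 28, 28,
    28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 29, 30, 30,
    30, 30, 30, 31, 31, 31, 32, 32, 32, 33, 33, 33, 33,
    34, 34, 34, 35, 35, 36, 38, 38, 41, 41, 42, 44,
]

values = sorted(percentages)
n = len(values)  # 51 entries: 50 states plus Washington, DC

# Positional convention (an assumption): with n = 51, (n + 1) / 4 = 13, so the
# quartiles are the 13th, 26th, and 39th sorted values (1-based positions).
q1 = values[13 - 1]
median = values[26 - 1]
q3 = values[39 - 1]

summary = (values[0], q1, median, q3, values[-1])
print(summary)  # (16, 28, 30, 33, 44), matching the five-number summary table
```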

question 8

Which state(s) in the list below are lower outliers? In other words, which have an unusually low percentage of drivers involved in fatal crashes who were impaired by alcohol? Choose all that apply.

a) Kentucky

b) Kansas

c) Utah

d) Alaska

question 9

Which of the following states have unusually high percentages of drivers involved in fatal crashes who were impaired by alcohol?

a) Texas, South Carolina, Montana

b) Montana, North Dakota, South Carolina, Hawaii

c) Montana, North Dakota, South Carolina

d) Texas, South Carolina, Montana, Rhode Island

question 10

Without computing the mean of the percentage of drivers involved in fatal crashes who were impaired by alcohol, make a prediction about whether the mean and median will be very different or fairly similar.

Now, let’s use technology to explore the dataset.

Go to the Describing and Exploring Quantitative Variables tool at https://dcmathpathways.shinyapps.io/EDA_quantitative/.

Step 1) Select the Single Group tab.

Step 2) Locate the drop-down menu under Enter Data and select From Textbook.

Step 3) Locate the drop-down menu under Dataset and select Bad Drivers (alcohol).

Step 4) Use the tool to compute the mean percentage of drivers involved in fatal collisions who were alcohol-impaired.

question 11

Which of the following describes your findings?

a) The mean is much higher than the median.

b) The median is much higher than the mean.

c) The mean and the median are about the same.

Identifying the Shape of a Distribution from a Boxplot

question 12

Just as histograms and dotplots can tell us about the distribution of a quantitative variable, so can a boxplot. For each boxplot below, choose the description that matches the shape of the data’s distribution. (Note that boxplots can be oriented vertically, as we saw previously, or horizontally, as we see below.)

Boxplot	Distribution
	a) left-skewed b) symmetric c) right-skewed
	a) left-skewed b) symmetric c) right-skewed
	a) left-skewed b) symmetric c) right-skewed
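
One rough way to reason about shape from a boxplot is to compare how far the plot extends on each side of the median. The Python sketch below encodes that rule of thumb using a five-number summary. The function name `describe_shape`, the tolerance value, and the example summaries are all hypothetical, and real data often call for judgment rather than a fixed rule.

```python
def describe_shape(minimum, q1, median, q3, maximum, tolerance=0.25):
    """Guess the skew direction from a five-number summary (a heuristic only)."""
    # Spread on each side of the median: half of the box plus the whisker.
    lower_spread = (median - q1) + (q1 - minimum)
    upper_spread = (q3 - median) + (maximum - q3)
    total = lower_spread + upper_spread
    if total == 0:
        return "symmetric"  # all five numbers are equal
    # Relative imbalance between the two sides, ranging from -1 to 1.
    imbalance = (upper_spread - lower_spread) / total
    if imbalance > tolerance:
        return "right-skewed"  # longer tail toward larger values
    if imbalance < -tolerance:
        return "left-skewed"   # longer tail toward smaller values
    return "symmetric"

# Hypothetical five-number summaries, for illustration only.
print(describe_shape(0, 4, 5, 6, 10))     # both sides equally long
print(describe_shape(1, 2, 3, 6, 15))     # upper side much longer
print(describe_shape(1, 10, 13, 14, 15))  # lower side much longer
```

Because the minimum and maximum are used directly, a single extreme outlier can dominate this comparison, which is one reason to treat the result as a starting point rather than a final answer.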

Summary

In this section, you’ve learned about boxplots: how to calculate the five-number summary, how to read these numbers from a boxplot, and how to identify outliers in a dataset using the interquartile range. Let’s summarize where these skills showed up in the material.

In Questions 1 and 7, you identified the features of a boxplot.
In Questions 2 – 4, you interpreted the features of a boxplot.
In Questions 5, 6, 8, and 9, you identified outliers in a dataset.
In Questions 10 – 12, you related the boxplot of a quantitative variable to its distribution.

Being able to calculate and identify the features of a boxplot, and to relate a boxplot to the distribution of a quantitative variable, are necessary statistical skills that will be used in the next activity. If you feel comfortable with these skills, please move on!

Chalabi, M. (2014, October 24). Dear Mona, which state has the worst drivers? FiveThirtyEight. https://fivethirtyeight.com/features/which-state-has-the-worst-drivers/
