objectives for this activity
During this activity, you will:
- Summarize the description of a distribution of a quantitative variable using the shape, center, spread, and presence of outliers.
- Determine the appropriate representation of the spread based on the shape of the distribution and presence of outliers.
In the previous section, What to Know About Applications of Histograms: 3D, you practiced using histograms to describe a quantitative dataset. You described the shape, estimated the center and spread, and identified any outliers in the distribution. Now it’s time to apply those skills to a dataset collected from student evaluations of their classes at a Texas university.
What Do Students Think?
Before you begin this activity, take a moment to think about a scenario in which only a low percentage (e.g., fewer than 10%) of students in your class completes the course evaluation at the end of the semester.

question 1
Would these evaluations be an accurate representation of the general experience and opinion of students in the course? Explain.
video placement
[Intro: “If you are a freshman in your first term in college, you may not have heard about course evaluations yet. These are surveys that students in a class fill out anonymously at the end of the course term to provide feedback about the course and instructor. If only a few students in the course complete the survey, it would be natural to question whether those students’ responses accurately represent the experience of the class in general or whether they come only from the students who had the strongest opinions (negative or positive) about the course. If so, the sample of responses would not be an accurate representation of the general experience. In this activity, we’ll use a dataset of course evaluations to investigate the percentage of students who tend to complete course evaluations and to learn how statistical language can be used to describe a distribution based on its graphical display. Our descriptions will include the shape, center, spread, and presence of any outliers in the distribution. We’ll also see that we can identify a representation of the spread of a distribution based on its shape and outliers.”]
In this activity, you’ll see common statistical language used to describe a distribution based on what is observed in a graphical display: its shape, center, spread, and any outliers present. You’ll also see that the range (the difference between the maximum and minimum values) of a distribution that contains outliers or is skewed can be a misleading representation of spread.
We will investigate the question:
In general, what percentage of students completes course evaluations?
To do so, we will use the evals dataset[1], which contains information collected from student evaluations for a sample of 463 courses taught by 94 professors at The University of Texas at Austin. Each row represents a different course, and the columns contain information about the professor and summaries from the evaluations. The first 10 observations of the selected variables within the “Teaching Evaluations” dataset are displayed in the following table.
Teaching Evaluations
| cls_did_eval | cls_perc_eval | age | cls_students | score |
| --- | --- | --- | --- | --- |
| 24 | 55.81395 | 36 | 43 | 4.7 |
| 86 | 68.8 | 36 | 125 | 4.1 |
| 76 | 60.8 | 36 | 125 | 3.9 |
| 77 | 62.60163 | 36 | 123 | 4.8 |
| 17 | 85 | 59 | 20 | 4.6 |
| 35 | 87.5 | 59 | 40 | 4.3 |
| 39 | 88.63636 | 59 | 44 | 2.8 |
| 55 | 100 | 51 | 55 | 4.1 |
| 111 | 56.92308 | 51 | 195 | 3.4 |
| 40 | 86.95652 | 40 | 46 | 4.5 |
The following variables are used in this analysis:
- cls_did_eval: Number of students who completed evaluations
- cls_perc_eval: Percentage of students who completed evaluations
- age: Age of professor in years
- cls_students: Total number of students in the course
- score: Average professor evaluation score (1 to 5, where 1 is the lowest and 5 is the highest)
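The rows shown in the table are consistent with cls_perc_eval being cls_did_eval divided by cls_students, expressed as a percentage. The short Python sketch below checks this for the first three rows; the function name `percent_completed` is made up for illustration and is not part of the dataset or the data analysis tool.

```python
# First three rows of the table above: (cls_did_eval, cls_students)
rows = [(24, 43), (86, 125), (76, 125)]

def percent_completed(did_eval, students):
    """Percentage of enrolled students who completed the evaluation."""
    return 100 * did_eval / students

# Round to five decimal places to match the precision shown in the table
percentages = [round(percent_completed(d, s), 5) for d, s in rows]
print(percentages)  # [55.81395, 68.8, 60.8]
```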
If we are interested in course evaluation completion, we are naturally curious about how many students completed the evaluation.
question 2
There are two variables in the dataset that capture evaluation completion: cls_did_eval and cls_perc_eval.
Which variable is more appropriate to help us understand whether the course evaluations generally represent the views of all the students in the course or just a few students? Explain.
Examine the data
Now, let’s create a graph to visualize the distribution of the variable of interest.
Use your histogram to summarize the description of the distribution by answering Questions 3, 4, and 5.
question 3
video placement
[insert sub-summary: Good job using the technology without a list of instructions! I’d like to point out a feature of this histogram that could be confusing. Did you wonder why there was a bin (a bar) stretching beyond 100 if the student completion percentages only went up to 100%? It seems strange when you think about it. But recall that each bin represents an interval of values, and only the left endpoint is included in that interval. For example, for the bin that spans 40% to 45%, we would write that interval as [40,45) to indicate that all the percentages from 40% up to, but not including, 45% are counted in that bin. So 44.999% would be counted in that bin, but 45% would not. The next bin will pick up any value from 45% up to but not including 50%. So, you can see now that the last bin must stretch beyond 100 in order to include exactly 100%. What is the only possible completion rate that would be counted in the last bin? Would it make sense for a completion rate to be greater than 100%? As it turns out, the last bin in this case is the only one for which you’ll know the exact count of a value.]
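The binning rule described above can be sketched in a few lines of Python. This is only an illustration: it assumes bins of width 5 that start at 0, and the helper name `bin_interval` is hypothetical. The data analysis tool may choose its bins differently.

```python
import math

def bin_interval(value, width=5, start=0):
    """Return the left-closed, right-open interval [left, right) containing value."""
    left = start + math.floor((value - start) / width) * width
    return (left, left + width)

print(bin_interval(44.999))  # (40, 45)
print(bin_interval(45))      # (45, 50)
print(bin_interval(100))     # (100, 105)
```

Under these assumptions, a completion rate of exactly 100% falls in a bin that extends past 100, even though no course can have a rate above 100%.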
question 4
In how many courses did all students complete the course evaluations?
question 5
About what proportion of courses had a completion rate between 58% (inclusive) and 82% (not inclusive)?
question 6
Based on these data, would it be more unusual for a course to have a completion rate less than 70% or greater than 70%? Explain. Remember, if you are using the app, you can hover over a bar to get the exact count.
So far, we’ve been able to use the histogram to answer questions about the distribution of cls_perc_eval. The answers to these questions give us some information about the data; however, they do not give us a broad view of the overall distribution of the variable. In addition to visualizing the distribution with a graphical display, we can use common statistical language to describe the distribution.
Before diving into the details, consider why we might want to use words to describe a distribution.
question 7
Why do you think it would be useful to include a verbal or written description of the features of a distribution in addition to the graphical display?
Describe the distribution
Now, in Questions 8 – 12, use statistical terms to describe the distribution of the variable cls_perc_eval. If necessary, refer to the Describing Distributions section at the end of this activity for details about how to describe a distribution.
question 8
Describe the shape of the distribution of cls_perc_eval, the percentage of students who completed the course evaluations.
question 9
What is the approximate center of the distribution of cls_perc_eval, the percentage of students who completed the course evaluations?
Recall that the spread is a measure of how much the values in a dataset tend to differ from one another. One way we can describe the spread is by finding the minimum and maximum values in the data and calculating the difference between them. This difference is called the range.
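As a small illustration of the calculation, the Python sketch below finds the range of the ten cls_perc_eval values displayed in the table earlier in this activity. These are only the first 10 of the 463 courses, so the result is not the range of the full dataset.

```python
# cls_perc_eval for the ten rows displayed in the table (not the full dataset)
perc_eval = [55.81395, 68.8, 60.8, 62.60163, 85, 87.5,
             88.63636, 100, 56.92308, 86.95652]

minimum = min(perc_eval)
maximum = max(perc_eval)
value_range = maximum - minimum  # range = maximum - minimum

print(minimum, maximum)       # 55.81395 100
print(round(value_range, 5))  # 44.18605
```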
question 10
Use the range to describe the spread of cls_perc_eval, the percentage of students who completed the course evaluations. Note that the data analysis tool provides values of the minimum and maximum.
question 11
Why might the range calculated in Question 10 be a potentially misleading measure of the spread of the distribution of cls_perc_eval?
question 12
Are there any outliers in the distribution of cls_perc_eval, the percentage of students who completed the course evaluations? If so, briefly describe the outliers.
video placement
[sub-summary: “You’ve used all the features of a quantitative display to describe the distribution of the percentage of students who completed the evaluation.” [voice over the distribution with a “pointer” to follow along this part –>] “You saw that the distribution was unimodal and left skewed. You can see the longer tail of smaller counts out to the left and the data sort of bunched up to the right. It looks like the center lies somewhere between about 75% and 80%. You noted that the range could be misleading because, while the range covers 90%, most of the data occur within the right-most 50% of the range. Finally, you were able to identify one outlier by the bin count of 1 in the left-most bin. There seems to be a course with a completion rate between 10% and 16%. In the next part of the activity, try describing the remaining quantitative variables in the dataset on your own. You’ll need to use the tool to create the distributions, then describe them by answering the questions below.”]
Summarize the description of a distribution
Good work. You’ve thoroughly described the variable cls_perc_eval using statistical language: shape, center, spread (range) and outliers. Now it’s your turn to try it on your own. Use the features you described in Questions 8 – 12 to describe the distribution of each of the following variables:
- age: Age of professor in years
- Dataset = “Teaching Evaluations – Age”
- cls_students: Total number of students in the course
- Dataset = “Teaching Evaluations – Students”
- score: Average professor evaluation score (1 to 5, where 1 is the lowest and 5 is the highest)
- Dataset = “Teaching Evaluations – Scores”
For each of the variables age, cls_students, and score,
- Use the appropriate data tool to make a histogram of the distribution using the following binwidths:
- age: 5
- cls_students: 50
- score: 0.2
- Describe the distribution, including the shape, center, spread, and presence of outliers, using words.
Record your results in Questions 13 – 19 below.
question 13
question 14
What is your description for the variable age?
question 15
question 16
What is your description for the variable cls_students?
question 17
question 18
What is your description for the variable score?
Determine the appropriate representation of the spread of a distribution
question 19
Look back on your distributions for the three variables age, cls_students, and score. For which distribution might the range be a misleading representation of the spread?
video placement
[Wrap-up: You’ve had the chance to describe several differently shaped distributions in this activity using the statistical language of shape, center, spread, and the presence of outliers. Some of them were harder to describe than others, especially when it came to spread, which we described in this activity using the range. We’ll see later that there are other measures of spread we can use as well. Let’s recap the distributions of the other variables we looked at today. [voice over images of the distributions, one-by-one]. You may have found the variable “age” hard to describe. Even though it looks unusual, we would call this unimodal and roughly symmetric with the center at about 50 years. To find the shape, we just want to roughly draw a pen along the overall shape without paying too much attention to little bumps and dips along the way. The values range from about 29 to 73, with a range of about 44. There are some outliers above 70. Removing them would drop the range down. [course-size next] The distribution of course size is unimodal and right skewed. We see the center around 25 and a very wide range, almost 600, but that includes outliers between 500 and 600. In fact there are only a few courses with enrollment larger than 200, so the spread of most of the data is about 200. [average eval score next] Lastly, the distribution for average evaluation score is unimodal and left skewed with a center around 4.25. The range is about 2.75 (between 2.25 and 5). We could consider the few scores at the far left to be outliers. Hopefully you feel comfortable describing distributions using shape, center, spread, and the presence of outliers. And you should have a good idea now of when range can be used to appropriately describe the spread, and when you should make a note that the range could be misleading.”]
Reference: Describing Distributions
The features used to describe the distribution of a quantitative variable are the shape, center, spread, and presence of outliers.
Shape: The overall pattern (left skewed, right skewed, symmetric) and the number of peaks (unimodal, bimodal, multimodal, uniform).
Center: A measure that describes where the middle of the distribution is. The center is a number that describes a typical value. For example, one way to think about center is that it could be the point in the distribution where about half of the observations are below it and half are above it.
Spread: A measure of how far apart the data are. In this lesson, the range is used to measure spread. The range is the difference between the maximum value and minimum value.
Outliers: Unusual observations that are outside the general pattern of the distribution.
The description of shape includes two parts: (1) the overall pattern (left skewed, right skewed, symmetric) and (2) the number of peaks (unimodal, bimodal, multimodal, uniform).
The overall pattern can be described as one of the following:
Symmetric: The left and right sides of the distribution (closely) mirror each other. If you drew a vertical line down the center of the distribution and folded the distribution in half, the left and right sides would closely match one another.
Left skewed: The distribution has a longer tail to the left.
Right skewed: The distribution has a longer tail to the right.
In addition to the overall pattern, the description of shape also includes the number of peaks. This is also known as the modality. The modality can be described as one of the following:
Unimodal: There is one prominent peak.
Bimodal: There are two prominent peaks.
Multimodal: There are three or more prominent peaks.
Uniform: There are no prominent peaks.
The next feature is the center. For now, we can use the histogram to get an approximate value of the center. (In a later activity, you will learn statistics used to describe the center more precisely.)
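One way to make the "half below, half above" idea concrete is sketched below in Python. It uses the median function from Python's standard statistics module, which is one possible choice and not necessarily the statistic introduced later in the course, and it uses only the ten cls_perc_eval values shown in the table, not the full dataset.

```python
import statistics

# cls_perc_eval for the ten rows displayed in the table (not the full dataset)
perc_eval = [55.81395, 68.8, 60.8, 62.60163, 85, 87.5,
             88.63636, 100, 56.92308, 86.95652]

center = statistics.median(perc_eval)

# Count how many observations fall on each side of the center
below = sum(1 for v in perc_eval if v < center)
above = sum(1 for v in perc_eval if v > center)

print(round(center, 1))  # 76.9
print(below, above)      # 5 5
```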
When describing the spread of a distribution that is left skewed, right skewed, or has outliers, it can be misleading to only rely on the range to measure spread, since it is influenced by skewness and outliers. In this case, the range may make the spread appear to be larger than it is for a vast majority of the data.
If this is the case, in addition to reporting the range, you can include information about the spread of most of the data. This gives the reader a more accurate and complete picture of the spread. For example, in addition to reporting the range for the distribution of cls_perc_eval, we can note that most of the data are between about 50% and 100%, a span of about 50 percentage points. (In later activities, you will learn additional statistics to describe the typical spread of the data.)
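The effect of a single unusual value on the range can be seen in a short Python sketch. The numbers below are made up for illustration and are not taken from the evals dataset.

```python
# Hypothetical completion percentages (made up for illustration, not from evals)
values = [12, 55, 60, 65, 70, 75, 80, 85, 90, 100]

full_range = max(values) - min(values)

# Drop the single unusually low value and recompute the range
without_low = sorted(values)[1:]
trimmed_range = max(without_low) - min(without_low)

print(full_range)     # 88
print(trimmed_range)  # 45
```

In this made-up example, one low value nearly doubles the range, even though nine of the ten values lie within 45 percentage points of one another.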
The last feature in the description is the presence of outliers. Outliers are observations in the data that are unusual and outside the general pattern of the rest of the observations in the distribution. When working with a univariate distribution for a quantitative variable, an outlier is an observation that has an unusually high or unusually low value. It is good practice to make note of outliers, as these observations can sometimes influence the statistical results (e.g., the range).
[1] Professor evaluations and beauty. (n.d.). OpenIntro. Retrieved from https://www.openintro.org/data/index.php?data=evals