Forming Connections in Applications of Histograms: 3D – 12

objectives for this activity

During this activity, you will:

Summarize the description of a distribution of a quantitative variable using the shape, center, spread, and presence of outliers.
Determine the appropriate representation of the spread based on the shape of the distribution and presence of outliers.


In the previous section, What to Know About Applications of Histograms: 3D, you practiced using histograms to describe a quantitative dataset. You described the shape, estimated the center and spread, and identified any outliers in the distribution. Now it’s time to apply those skills to a dataset of student course evaluations collected at a Texas university.

What Do Students Think?

Before you begin this activity, take a moment to think about a scenario in which only a low percentage (e.g., fewer than 10%) of students in your class completes the course evaluation at the end of the semester.

A woman working on a laptop and checking her phone. On the left, there are three check boxes displayed: a frowning face, a neutral face, and a smiling face. The smiling face check box is marked with a check, and the other two are marked with an X.

question 1

Would these evaluations be an accurate representation of the general experience and opinion of students in the course? Explain.

Hint

video placement

[Intro: “If you are a freshman in your first term in college, you may not have heard about course evaluations yet. These are surveys that students in a class fill out anonymously at the end of the course term to provide feedback about the course and instructor. If only a few students in the course complete the survey, it would be natural to question whether those students’ responses accurately represented the experience of the class in general or whether they were just the students who had the strongest opinions (negative or positive) about the course. If so, the sample of responses would not be an accurate representation of the general experience. In this activity, we’ll use a dataset of course evaluations to investigate the percentage of students who tend to complete course evaluations and to learn how statistical language can be used to describe a distribution based on its graphical display. Our descriptions will include the shape, center, spread, and presence of any outliers in the distribution. We’ll also see that we can identify a representation of the spread of a distribution based on its shape and outliers.”]

In this activity, you’ll see common statistical language used to describe a distribution based on what is observed from a graphical display. You’ll describe each distribution by identifying its shape, center, spread, and any outliers present. You’ll also see that the range (the difference between the maximum and minimum values) of a distribution that contains outliers or is skewed can be a misleading representation of spread.

We will investigate the question:

In general, what percentage of students completes course evaluations?

To do so, we will use the evals dataset^[1], which contains information collected from student evaluations for a sample of 463 courses taught by 94 professors at The University of Texas at Austin. Each row represents a different course, and the columns contain information about the professor and summaries from the evaluations. The first 10 observations of the selected variables within the “Teaching Evaluations” dataset are displayed in the following table.

Teaching Evaluations
*cls_did_eval*	*cls_perc_eval*	*age*	*cls_students*	*score*
24	55.81395	36	43	4.7
86	68.8	36	125	4.1
76	60.8	36	125	3.9
77	62.60163	36	123	4.8
17	85	59	20	4.6
35	87.5	59	40	4.3
39	88.63636	59	44	2.8
55	100	51	55	4.1
111	56.92308	51	195	3.4
40	86.95652	40	46	4.5

The following variables are used in this analysis:

cls_did_eval: Number of students who completed evaluations
cls_perc_eval: Percentage of students who completed evaluations
age: Age of professor in years
cls_students: Total number of students in the course
score: Average professor evaluation score (1 to 5, where 1 is the lowest and 5 is the highest)
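
The table suggests that cls_perc_eval is computed from the other two completion-related columns. The short Python sketch below checks that reading against the ten rows shown above; the function name percent_completed is made up for illustration, and the formula is an assumption inferred from the table rather than something stated in the dataset documentation.

```python
# First ten rows of the "Teaching Evaluations" table:
# (cls_did_eval, cls_students, cls_perc_eval as listed in the table)
rows = [
    (24, 43, 55.81395),
    (86, 125, 68.8),
    (76, 125, 60.8),
    (77, 123, 62.60163),
    (17, 20, 85.0),
    (35, 40, 87.5),
    (39, 44, 88.63636),
    (55, 55, 100.0),
    (111, 195, 56.92308),
    (40, 46, 86.95652),
]


def percent_completed(did_eval, students):
    """Assumed relationship: percentage = 100 * completed / enrolled."""
    return 100 * did_eval / students


for did_eval, students, listed in rows:
    computed = percent_completed(did_eval, students)
    print(f"{did_eval:>3} of {students:>3} students -> {computed:.5f}% (table: {listed}%)")
```

For these ten rows, the computed percentages agree with the listed values to the precision shown in the table.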

If we are interested in course evaluation completion, we are naturally curious about how many students completed the evaluation.

question 2

There are two variables in the dataset that capture evaluation completion: cls_did_eval and cls_perc_eval.

Which variable is more appropriate to help us understand whether the course evaluations generally represent the views of all the students in the course or just a few students? Explain.

Hint

Examine the data

Now, let’s create a graph to visualize the distribution of the variable of interest.

Go to the Describing and Exploring Quantitative Variables tool at https://dcmathpathways.shinyapps.io/EDA_quantitative/. Select the Single Group tab, and then the dataset Teaching Evaluations – Percent Complete and make a Histogram of the distribution of cls_perc_eval, the percentage of students who completed the course evaluations. Under Select Binwidth for Histogram use 6.

Use your histogram to summarize the description of the distribution by answering Questions 3, 4, and 5.

question 3

Hint

video placement

[insert sub-summary: Good job using the technology without a list of instructions! I’d like to point out a feature of this histogram that could be confusing. Did you wonder why there was a bin (a bar) stretching beyond 100 if the student completion rates only went up to 100%? It seems strange when you think about it. But recall that each bin represents an interval of values, and of its two endpoints, only the left-most value is included in that interval. For example, for a bin that spans 40% to 45%, we would write that interval as [40,45) to indicate that all the percentages from 40% up to, but not including, 45% are counted in that bin. In fact, we’d count 44.999% in that bin, but not 45%. The next bin will pick up any value from 45% up to but not including 50%. So, you can see now that the last bin must stretch beyond 100 in order to include exactly 100%. What is the only possible completion rate that would be counted in the last bin? Would it make sense for a completion rate to be greater than 100%? As it turns out, the last bin in this case is the only one for which you’ll know the exact count of a value.]
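
The left-inclusive bin rule described above can be sketched in a few lines of Python. The function bin_interval is hypothetical, and the starting value of 10 is an assumption (it matches a left-most bin of 10% to 16% with a binwidth of 6); the tool itself may choose its bin edges differently.

```python
import math


def bin_interval(value, start=10, width=6):
    """Return the half-open interval [left, right) that contains value.

    start and width are assumptions about how the histogram was drawn.
    """
    index = math.floor((value - start) / width)
    left = start + index * width
    return left, left + width


print(bin_interval(81.99))  # (76, 82): just below an edge stays in the lower bin
print(bin_interval(82))     # (82, 88): a value on an edge starts the next bin
print(bin_interval(100))    # (100, 106): 100% needs a bin that extends past 100
```

Because the right endpoint is never included, a value that falls exactly on a bin edge is always counted in the bin that starts at that edge.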

question 4

In how many courses did all students complete the course evaluations?

Hint

question 5

About what proportion of courses had a completion rate between 58% (inclusive) and 82% (not inclusive)?

Hint

question 6

Based on these data, would it be more unusual for a course to have a completion rate less than 70% or greater than 70%? Explain. Remember, if you are using the app, you can hover over a bar to get the exact count.

Hint

So far, we’ve been able to use the histogram to answer questions about the distribution of cls_perc_eval. The answers to these questions give us some information about the data; however, they do not give us a broad view of the overall distribution of the variable. In addition to visualizing the distribution with a graphical display, we can use common statistical language to describe the distribution.

Before diving into the details, consider why we might want to use words to describe a distribution.

question 7

Why do you think it would be useful to include a verbal or written description of the features of a distribution in addition to the graphical display?

Hint

Describe the distribution

Now, in Questions 8 – 12, use statistical terms to describe the distribution of the variable cls_perc_eval. If necessary, refer to the Describing Distributions section at the end of this activity for details about how to describe a distribution.

question 8

Describe the shape of the distribution of cls_perc_eval, the percentage of students who completed the course evaluations.

Hint

question 9

What is the approximate center of the distribution of cls_perc_eval, the percentage of students who completed the course evaluations?

Hint

Recall that the spread is a measure of how much the values in a dataset tend to differ from one another. One way we can describe the spread is by finding the minimum and maximum values in the data and calculating the difference between them. This difference is called the range.
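
As a small illustration of the arithmetic, the sketch below computes the range for only the ten cls_perc_eval values shown in the table earlier in this activity. It is not the range of the full dataset of 463 courses, which you will find with the data analysis tool.

```python
# cls_perc_eval for the first ten courses in the table (not the full dataset)
perc_eval_sample = [
    55.81395, 68.8, 60.8, 62.60163, 85,
    87.5, 88.63636, 100, 56.92308, 86.95652,
]

minimum = min(perc_eval_sample)
maximum = max(perc_eval_sample)
value_range = maximum - minimum  # range = maximum - minimum

print(f"min = {minimum}, max = {maximum}, range = {value_range:.5f}")
```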

question 10

Use the range to describe the spread of cls_perc_eval, the percentage of students who completed the course evaluations. Note that the data analysis tool provides values of the minimum and maximum.

Hint

question 11

Why might the range calculated in Question 10 be a potentially misleading measure of the spread of the distribution of cls_perc_eval?

Hint

question 12

Are there any outliers in the distribution of cls_perc_eval, the percentage of students who completed the course evaluations? If so, briefly describe the outliers.

Hint

video placement

[sub-summary: “You’ve used all the features of a quantitative display to describe the distribution of the percentage of students who completed the evaluation.” [voice over the distribution with a “pointer” to follow along this part –>] “You saw that the distribution was unimodal and left skewed. You can see the longer tail of smaller counts out to the left and the data sort of bunched up to the right. It looks like the center lies somewhere between about 75% and 80%. You noted that the range could be misleading because, while the range covers 90%, most of the data occur within the right-most 50% of the range. Finally, you were able to identify one outlier by the bin count of 1 in the left-most bin. There seems to be a course with a completion rate between 10% and 16%. In the next part of the activity, try describing the remaining quantitative variables in the dataset on your own. You’ll need to use the tool to create the distributions, then describe them by answering the questions below.”]

Summarize the description of a distribution

Good work. You’ve thoroughly described the variable cls_perc_eval using statistical language: shape, center, spread (range) and outliers. Now it’s your turn to try it on your own. Use the features you described in Questions 8 – 12 to describe the distribution of each of the following variables:

age: Age of professor in years
- Dataset = “Teaching Evaluations – Age”
cls_students: Total number of students in the course
- Dataset = “Teaching Evaluations – Students”
score: Average professor evaluation score (1 to 5, where 1 is the lowest and 5 is the highest)
- Dataset = “Teaching Evaluations – Scores”

For each of the variables age, cls_students, and score,

Use the appropriate data tool to make a histogram of the distribution using the following binwidths:
- age: 5
- cls_students: 50
- score: 0.2
Describe the distribution, including the shape, center, spread, and presence of outliers, using words.

Record your results in Questions 13 – 18 below.

question 13

question 14

What is your description for the variable age?

Hint

question 15

question 16

What is your description for the variable cls_students?

Hint

question 17

question 18

What is your description for the variable score?

Hint

Determine the appropriate representation of the spread of a distribution

We’ve seen that sometimes the range of a distribution can be a misleading representation of the spread. For example, recall your description of the variable cls_perc_eval earlier in this activity. The range of the distribution of that variable covered 90% of the horizontal axis, but the data were mostly bunched up within the highest 50%. Including this information in the description of a distribution is helpful for understanding whether the range is an appropriate representation of the spread of a distribution.

question 19

Look back on your distributions for the three variables age, cls_students, and score. For which distribution might the range be a misleading representation of the spread?

Hint

video placement

[Wrap-up: “You’ve had the chance to describe several differently shaped distributions in this activity using the statistical language of shape, center, spread, and the presence of outliers. Some of them were harder to describe than others, especially when it came to spread, which we described in this activity using the range. We’ll see later that there are other measures of spread we can use as well. Let’s recap the distributions of the other variables we looked at today. [voice over images of the distributions, one-by-one]. You may have found the variable “age” hard to describe. Even though it looks unusual, we would call this unimodal and roughly symmetric with the center at about 50 years. To find the shape, we just want to roughly draw a pen along the overall shape without paying too much attention to little bumps and dips along the way. The values range from about 29 to 73, with a range of about 44. There are some outliers above 70. Removing them would drop the range down. [course-size next] The distribution of course size is unimodal and right skewed. We see the center around 25 and a very wide range, almost 600, but that includes outliers between 500 and 600. In fact there are only a few courses with enrollment larger than 200, so the spread of most of the data is about 200. [average eval score next] Lastly, the distribution for average evaluation score is unimodal and left skewed with a center around 4.25. The range is about 2.75 (between 2.25 and 5). We could consider the few scores at the far left to be outliers. Hopefully you feel comfortable describing distributions using shape, center, spread, and the presence of outliers. And you should have a good idea now of when range can be used to appropriately describe the spread, and when you should make a note that the range could be misleading.”]

Reference: Describing Distributions

The features used to describe the distribution of a quantitative variable are the shape, center, spread, and presence of outliers.

Shape: The overall pattern (left skewed, right skewed, symmetric) and the number of peaks (unimodal, bimodal, multimodal, uniform).

Center: A measure that describes where the middle of the distribution is. The center is a number that describes a typical value. For example, one way to think about center is that it could be the point in the distribution where about half of the observations are below it and half are above it.

Spread: A measure of how far apart the data are. In this lesson, the range is used to measure spread. The range is the difference between the maximum value and minimum value.

Outliers: Unusual observations that are outside the general pattern of the distribution.

The description of shape includes two parts: (1) the overall pattern (left skewed, right skewed, symmetric) and (2) the number of peaks (unimodal, bimodal, multimodal, uniform).

The overall pattern can be described as one of the following:

Symmetric: The left and right sides of the distribution (closely) mirror each other. If you drew a vertical line down the center of the distribution and folded the distribution in half, the left and right sides would closely match one another.

Left skewed: The distribution has a longer tail to the left.

Right skewed: The distribution has a longer tail to the right.

In addition to the overall pattern, the description of shape also includes the number of peaks. This is also known as the modality. The modality can be described as one of the following:

Unimodal: There is one prominent peak.

Bimodal: There are two prominent peaks.

Multimodal: There are three or more prominent peaks.

Uniform: There are no prominent peaks.

The next feature is the center. For now, we can use the histogram to get an approximate value of the center. (In a later activity, you will learn statistics used to describe the center more precisely.)
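
One way to check a guess for the center is to count how many observations fall below it and how many fall above it. The Python sketch below does this for the ten cls_perc_eval values from the table in this activity; the function share_below_and_above and the two candidate values are made up for illustration, and ten rows are only a small slice of the full dataset.

```python
# cls_perc_eval for the first ten courses in the table (not the full dataset)
perc_eval_sample = [
    55.81395, 68.8, 60.8, 62.60163, 85,
    87.5, 88.63636, 100, 56.92308, 86.95652,
]


def share_below_and_above(values, candidate):
    """Return the proportions of values below and above a candidate center."""
    below = sum(v < candidate for v in values)
    above = sum(v > candidate for v in values)
    return below / len(values), above / len(values)


print(share_below_and_above(perc_eval_sample, 77))  # (0.5, 0.5)
print(share_below_and_above(perc_eval_sample, 60))  # (0.2, 0.8)
```

A candidate that splits the observations roughly in half, as 77 does for these ten values, is a reasonable description of the center; a candidate such as 60 leaves most of the observations on one side.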

When describing the spread of a distribution that is left skewed, right skewed, or has outliers, it can be misleading to only rely on the range to measure spread, since it is influenced by skewness and outliers. In this case, the range may make the spread appear to be larger than it is for a vast majority of the data.

If this is the case, in addition to reporting the range, you can include additional information about the spread of most of the data as well. This will give the reader a more accurate and complete picture of the true spread of the data. For example, in addition to reporting the range for the distribution of cls_perc_eval, we can also include information that most of the data are between about 50% and 100%, or within 50%. (In later activities, you will learn additional statistics to describe the typical spread of the data.)
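
To see how a single unusual value can inflate the range, consider the Python sketch below. The twelve completion rates are hypothetical values invented for this illustration (they are not taken from the evals dataset), and the helper value_range is likewise made up.

```python
# Hypothetical completion rates (percent) for 12 courses; NOT from the evals dataset
rates = [12, 55, 61, 68, 72, 75, 78, 81, 85, 90, 96, 100]


def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


full_range = value_range(rates)                      # includes the unusually low 12
range_without_lowest = value_range(sorted(rates)[1:])  # drops only the lowest value

print(f"range of all values:          {full_range}")
print(f"range without the lowest one: {range_without_lowest}")
```

Here the range of all values is 88, but eleven of the twelve values lie within 45 percentage points of one another. Reporting both numbers gives a more complete picture of the spread than the range alone.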

The last feature in the description is the presence of outliers. Outliers are observations in the data that are unusual and outside the general pattern of the rest of the observations in the distribution. When working with a univariate distribution for a quantitative variable, an outlier is an observation that has an unusually high or unusually low value. It is good practice to make note of outliers, as these observations can sometimes influence the statistical results (e.g., the range).

[1] Professor evaluations and beauty. (n.d.). OpenIntro. Retrieved from https://www.openintro.org/data/indes.php?data=evals

Alpha Module 2: Describing and Summarizing Data