What to Know About 1D: Datasets and Statistical Questions

Learning Goals

At the end of this page, you should feel comfortable performing these skills:

  • Identify the observational units in data needed to answer a statistical question.
  • Identify qualities of a good statistical question.
  • Identify variables used to answer a statistical question.

In the next activity, you will need to understand the qualities exhibited by a good statistical question, and you will need to have an understanding of observational units and variables so that you can think about whether a given dataset is appropriate to answer a statistical question.

At its heart, statistics is an investigative process that allows us to answer questions about our world. In the upcoming activity, we’re going to consider what makes a good statistical question and how to match appropriate data with those questions. Prepare for that activity by completing this page to build your understanding of the observational units and variables present in the data used to answer statistical questions.

The Qualities of Statistical Questions

[insert image of frustrated driver]

You’ve probably heard people complain about drivers from various areas of the country. In a moment, you’ll consider the question, “Which U.S. state has the worst drivers?” as a statistical question. Before you tackle that question, review what you learned about statistical investigative questions in What to Know [1C] by reading the example below. Then, extend that knowledge by working through the questions on this page to prepare for the upcoming activity.

example

In What to Know [1C], you learned that a statistical investigative question is one that can be used as the starting point for an investigation. To answer the question, data will need to be collected and analyzed using statistical tools, and the results of that analysis will need to be interpreted. You also learned that the first quality of a statistical question is that it doesn’t have an exact answer.

Which of the following questions satisfies the definition of a statistical investigative question?

  1. Which country (or countries) had the highest maximum speed limit for any of its roadways in 2020?
  2. Which country has the best drivers?

You learned previously that statistical questions always anticipate variability, and could lead to data collection and analysis. Let’s extend that understanding now by looking at some further qualities a question will have if it is a good statistical question.

Question 1

A good statistical question will exhibit many and possibly all of the following qualities:

  • The question is relevant and interesting.
  • The question deals with some natural variation in the world; in other words, the question anticipates variability.
  • The question asks us to generalize or find a general tendency among many individuals.
  • The question is one that we need data in order to answer.
  • The question does not have an exact and precise answer that is easy to find; there is some analysis that must be done to answer the question, or the question may be answered in multiple ways.
  • There can be multiple factors that affect the answer to the question.

Which of the above qualities do you find present in the question, “Which U.S. state has the worst drivers?”

Observational Units

In Forming Connections [1C], you learned that the observational units are the individuals we ask a question of, or the entities about which we want to measure some characteristic. Observational units can be humans, but they can also be any other individual or item of interest. We may wish to collect data about humans, animals, or even non-living items like books, math tests, or U.S. states.
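If it helps to see this idea as data, here is a minimal sketch in Python. The book records, their values, and the names `books`, `observational_units`, and `characteristics` are all made up for illustration; they do not come from any real dataset. The sketch assumes a simple layout in which each record describes one observational unit.

```python
# Hypothetical, made-up records used only to illustrate the vocabulary.
books = [
    {"title": "Book A", "pages": 320, "genre": "mystery"},
    {"title": "Book B", "pages": 150, "genre": "poetry"},
    {"title": "Book C", "pages": 410, "genre": "mystery"},
]

# Each record (row) describes one observational unit: a single book.
observational_units = [book["title"] for book in books]

# Every key other than the identifier is a characteristic measured on each unit.
characteristics = [key for key in books[0] if key != "title"]

print("Observational units:", observational_units)
print("Characteristics measured:", characteristics)
```

In this layout, the observational units are the books themselves, not the page counts or genres recorded about them.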

observational units

[perspective video — a 3 instructor video offering a couple of examples of good statistical questions along with a short list of possible observational units associated with the situation to choose from to help answer the question. Use a human example, a non-human living example, and a non-living example <– emphasize that in a situation like, for example, “does active learning tend to increase student success?” that the obs. units would be “test scores” or “course grades” and not “students.” This can lead into a brief mention of ethics and the practice of analyzing anonymous or de-identified data.]

Note: this video could reference questions of similar technical style as those in the example below. 

See the example below before answering Question 2.

example

This example could be a shorter version of the pick-your-dataset examples we used in Module 2. For instance, one option could be 3 questions surrounding a particular issue of social justice, another option could be 3 questions surrounding a particular issue of inclusion, and a third option could be 3 questions surrounding identity. But they should each be in the technical style of the 3 questions below: brief and simple.

Identify the observational units in the data used to answer each of the questions below.

  1. Which species of fish tends to contain more mercury?
    • Are the observational units a) species of fish, b) mercury, or c) waterways?
  2. Which city has the longest commute time for workers per year?
    • Are the observational units a) workers, b) commute times, or c) cities?
  3. What climate tends to attract more tourists?
    • Are the observational units a) tourists, b) modes of transportation, or c) climates?

Now it’s your turn. Return to the question, “Which U.S. state has the worst drivers?” to answer Question 2.

Question 2

Suppose we wanted to try to answer the question, “Which U.S. state has the worst drivers?” Since we’re asking a question that begins “Which U.S. state…,” which of the following should be the observational units in the data we use to answer this question?

  1. Drivers
  2. Vehicles
  3. Car accidents
  4. U.S. states

Variables

Anticipating variability is key in a good statistical question; in other words, a good statistical question anticipates that there will be variability in the data collected to answer the question. That means that the variables (or characteristics) we measure about our observational units are expected to have different values among the different observational units. Understanding which kinds of questions anticipate variability can help us to understand what kind of variables can be used to explore a statistical question. See the video below for a demonstration of how to identify variables in data, then answer the remaining questions.

Qualities of good statistical questions

[Worked example — a 3 instructor video that follows the themes used in the perspective video above but provides a worked example for Question 3 and Question 4 below.]
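One way to picture what it means for a variable to vary is a minimal sketch like the one below. The math test records and their values are made up for illustration, and `distinct_values` is a hypothetical helper written for this sketch, not part of any library. It assumes each record is one observational unit and checks whether a variable takes more than one value across the units.

```python
# Hypothetical, made-up values; the observational units here are math tests.
tests = [
    {"test": "Test 1", "questions": 20, "average_score": 78.5, "subject": "math"},
    {"test": "Test 2", "questions": 25, "average_score": 82.0, "subject": "math"},
    {"test": "Test 3", "questions": 20, "average_score": 69.5, "subject": "math"},
]


def distinct_values(records, variable):
    """Return the set of distinct values a variable takes across the units."""
    return {record[variable] for record in records}


for variable in ["questions", "average_score", "subject"]:
    values = distinct_values(tests, variable)
    # A variable varies if it takes more than one value across the units.
    varies = len(values) > 1
    print(f"{variable}: {sorted(values)} -> varies: {varies}")
```

In this made-up data, the number of questions and the average score differ from test to test, while the subject is the same for every test, so the subject could not help us compare these tests.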

Hopefully you are feeling more confident about identifying questions that anticipate variability and variables present in the data. Now it’s your turn to assess your understanding by answering the remaining questions.

Question 3

Which of the following questions anticipate variability in the data required to answer them? Select all that apply. There may be more than one correct answer.

  a) Which states have the most automobile accidents per year?
  b) Which states tend to have stricter cell phone laws for drivers?
  c) Does New York have a state-wide hands-free cell phone law?
  d) Which state has the fewest drivers on the road per day?
  e) How many speeding tickets were given in the United States in 2019?
  f) What time of day has the most traffic in Alabama?

The final question requires you to pull information from an article in which statistical data is used to answer a relevant and interesting question (one of the qualities of a good statistical question). Don’t skip the article when trying to answer the question!

Question 4

In the FiveThirtyEight article “Dear Mona, Which State Has The Worst Drivers?”, the author, Mona Chalabi, attempts to answer the title question. Read the article and identify the variables that the author uses to explore this question.

https://fivethirtyeight.com/features/which-state-has-the-worst-drivers/

Which of the following variables does the author use to answer the question? Select all that apply. There may be more than one correct answer.

  a) Number of registered vehicles per state
  b) Number of drivers on the state’s roads per day
  c) Number of drivers involved in fatal collisions per billion miles traveled
  d) Number of fatalities due to automobile wrecks per year
  e) Percentage of drivers involved in fatal collisions who were not distracted
  f) Whether or not the state has a hands-free cell phone law
  g) Percentage of drivers involved in fatal collisions who were not involved in previous accidents
  h) Percentage of fatal collisions where a driver was speeding
  i) Number of speeding tickets given per year
  j) Percentage of fatal collisions that occurred on a road with a speed limit over 60 miles per hour
  k) Percentage of fatal collisions in which alcohol impairment was involved
  l) Average combined car insurance premium

Summary

In this What to Know page, you explored the qualities of a good statistical question and you learned to identify the observational units and variables present in data collected to answer a question. Let’s summarize where these skills and tasks showed up on the page.

  • In Question 1, you learned the many qualities of a good statistical question.
  • In Question 2, you identified the observational units present in data needed to answer a statistical question.
  • In Question 3, you identified questions that anticipate variability in the data.
  • In Question 4, you identified variables present in data used to answer a statistical question.

Good work! If you feel comfortable with these ideas, it’s time to move on to Forming Connections in the next activity!