Forming Connections in 1D: Datasets and Statistical Questions

Objectives for the activity

During this activity, you will:

Determine whether a question is a good statistical question.
Determine whether a question can be answered with a given set of data.
Construct statistical questions that can be answered with a given set of data.

In What to Know [1D], you learned about the qualities of a good statistical question and how to identify the observational units and variables present in the data used to answer a statistical question. In this activity, you’ll make determinations about statistical questions and you’ll also see that data must be collected with purpose so that the data are appropriate to answer the question of interest.

Bad Drivers and Good Questions

In the article you read in the preview assignment, Mona Chalabi used data to try to answer the question “Which U.S. state has the worst drivers?” We’ll explore the dataset Chalabi used to help us recognize the qualities of a good statistical question and how appropriately chosen data support its investigation.

Image: A man using his phone while driving

Guidance

[Intro: “In the WTK preview, you read a list of some of the qualities of a good statistical question. Do you remember what they were? If you haven’t yet, list some of these in your notebook now. How did you do? If you need to see the list again, it appears above Question 1 below. Did you remember most of them? If you didn’t, that’s okay; you have another chance today to record the list. Hopefully though, you do remember the two key qualities of a good statistical question: it won’t have an exact answer and it will always anticipate variability. In a moment, you’ll use the list of qualities of a good statistical question to answer Question 1 in this activity.

But first, let’s turn our attention to the data. What makes a data set appropriate to use to answer a question? What kind of data did Mona Chalabi use to answer the question, “Which U.S. state has the worst drivers?” in the article? Write some ideas down in your notebook. If you don’t remember, you can review the study again at https://fivethirtyeight.com/features/which-state-has-the-worst-drivers/ .

Let’s frame our activity today with the following statement. Two important starting points for a statistical study are making sure that you have a good statistical question and making sure the data that you plan to collect are the appropriate data to answer that question.”]

Good Questions

A meaningful discussion of a statistical question necessarily includes a discussion of the data used to investigate it. In the questions that follow, we’ll do that with the dataset from Mona Chalabi’s study by first exploring whether her question exhibits the qualities of a good statistical question. Then, we’ll look at the dataset itself while asking how it supported Chalabi’s question and how it might support other good statistical questions as well. Here’s the list you read on the What to Know page.

Qualities of a Good Statistical Question

The question is relevant and interesting.
The question deals with some natural variation in the world; in other words, the question anticipates variability.
The question asks us to generalize or find a general tendency among many individuals.
The question is one that needs data in order to be answered.
The question does not have an exact and precise answer that is easy to find; there is some analysis that must be done to answer the question, or the question may be answered in multiple ways.
There can be multiple factors that affect the answer to the question.

Work on your own to answer Question 1, using the qualities of a good statistical question.

Question 1

What made Chalabi’s question, “Which U.S. state has the worst drivers?”, a good statistical question? Do you think she answered the question well? In particular, were the data she used the right data to answer the question?

Appropriate Data

Now let’s turn to the data table. Here are the first 20 rows of the data table for the bad drivers data that were used in the FiveThirtyEight article you read. Remember that the observational units are listed in the rows (state) and the variables in the columns. See the variable descriptions given below the table.

	state	num_drivers	perc_speeding	perc_alcohol	perc_not_distracted	perc_no_previous	insurance_premiums	losses
1	Alabama	18.8	39	30	96	80	784.55	145.08
2	Alaska	18.1	41	25	90	94	1053.48	133.93
3	Arizona	18.6	35	28	84	96	899.47	110.35
4	Arkansas	22.4	18	26	94	95	827.34	142.39
5	California	12.0	35	28	91	89	878.41	165.63
6	Colorado	13.6	37	28	79	95	835.50	139.91
7	Connecticut	10.8	46	36	87	82	1068.73	167.02
8	Delaware	16.2	38	30	87	99	1137.87	151.48
9	District of Columbia	5.9	34	27	100	100	1273.89	136.05
10	Florida	17.9	21	29	92	94	1160.13	144.18
11	Georgia	15.6	19	25	95	93	913.15	142.80
12	Hawaii	17.5	54	41	82	87	861.18	120.92
13	Idaho	15.3	36	29	85	98	641.96	82.75
14	Illinois	12.8	36	34	94	96	803.11	139.15
15	Indiana	14.5	25	29	95	95	720.46	108.92
16	Iowa	15.7	17	25	97	87	649.06	114.47
17	Kansas	17.8	27	24	77	85	780.45	133.80
18	Kentucky	21.4	19	23	78	76	872.51	137.13
19	Louisiana	20.5	35	33	73	98	1281.55	194.78
20	Maine	15.1	38	30	87	84	661.88	96.57

The dataset includes the following eight columns: state, which identifies the observational unit, and seven variables recorded for each state. This format for displaying and describing the variables is often referred to as a data dictionary. Each variable name is followed by a brief description.

state: All 50 states, plus the District of Columbia
num_drivers: Number of drivers involved in fatal collisions per billion miles
perc_speeding: Percentage of drivers involved in fatal collisions who were speeding
perc_alcohol: Percentage of drivers involved in fatal collisions who were alcohol-impaired
perc_not_distracted: Percentage of drivers involved in fatal collisions who were not distracted
perc_no_previous: Percentage of drivers involved in fatal collisions who had not been involved in any previous accidents
insurance_premiums: Average combined car insurance premiums ($)
losses: Losses incurred by insurance companies for collisions per insured driver ($)
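
To make the rows-and-columns structure concrete, here is a minimal Python sketch that reads the first five rows of the table above. The names RAW, rows, units, variables, and spread are hypothetical choices for illustration; they are not part of the activity or the FiveThirtyEight materials. The sketch lists the observational units and variables, then measures how much one variable differs across states, which is the variability a good statistical question anticipates.

```python
import csv
import io
import statistics

# First five rows of the table above, copied as-is and written as CSV.
# Only a small excerpt is used; the full dataset has 51 rows.
RAW = """state,num_drivers,perc_speeding,perc_alcohol,perc_not_distracted,perc_no_previous,insurance_premiums,losses
Alabama,18.8,39,30,96,80,784.55,145.08
Alaska,18.1,41,25,90,94,1053.48,133.93
Arizona,18.6,35,28,84,96,899.47,110.35
Arkansas,22.4,18,26,94,95,827.34,142.39
California,12.0,35,28,91,89,878.41,165.63
"""

# Each row is one observational unit (a state).
rows = list(csv.DictReader(io.StringIO(RAW)))
units = [row["state"] for row in rows]

# Every column other than the state identifier is a variable.
variables = [name for name in rows[0] if name != "state"]

# Variability in one variable: drivers in fatal collisions per billion miles.
num_drivers = [float(row["num_drivers"]) for row in rows]
spread = max(num_drivers) - min(num_drivers)
center = statistics.mean(num_drivers)

print("Observational units:", units)
print("Variables:", variables)
print("Range of num_drivers:", round(spread, 1))
print("Mean of num_drivers:", round(center, 2))
```

Because the values differ from state to state, a question about num_drivers cannot be answered by looking up a single number; some summary or comparison is needed.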

Work individually to answer Question 2 by writing a different statistical question that could be answered with these data.

Question 2

Give an example of another statistical question you could answer given the data that were used in the “Worst Drivers” article.

Note: Your answer may involve some or all of the variables listed, and you can also consider questions that try to determine whether the variables are related to one another.
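
As one hedged illustration of what checking whether variables are related could look like, the sketch below computes a Pearson correlation coefficient between insurance_premiums and losses for the first five states in the table. The choice of these two variables, the helper name pearson_r, and the use of only five rows are assumptions made for illustration. This is not the analysis from the article, and five states are far too few to support a conclusion.

```python
import math

# (insurance_premiums, losses) for the first five states in the table above.
pairs = [
    (784.55, 145.08),   # Alabama
    (1053.48, 133.93),  # Alaska
    (899.47, 110.35),   # Arizona
    (827.34, 142.39),   # Arkansas
    (878.41, 165.63),   # California
]


def pearson_r(points):
    """Return the Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(pairs)
print("Correlation between premiums and losses (5 states):", round(r, 3))
```

A value near 1 or -1 would suggest a strong linear relationship and a value near 0 a weak one, but the full dataset would be needed before drawing any conclusion.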

Once you have written your question, come together with others in pairs or small groups to discuss your questions. Answer Question 3 parts A and B about your own question, but use the points as a guide for your discussion.

Question 3

With your group or partner, discuss your sample questions. Consider the class discussion on good questions and appropriate data. Make sure to answer the following about your own question.

Part A: Is your question a good statistical question? Why or why not?

Part B: Are the bad driver data the appropriate data to use to answer your question? Why or why not?

Now, choose one of the questions in your group that you wouldn’t mind sharing with the class. Go into detail about why you chose the question to answer Question 4. If your group chose a question other than yours, include the answers to Question 3 parts A and B for the new question in your answer to Question 4 as well.

Question 4

With your group or partner, choose one of your questions to share out with the class. Write the question you chose below, and explain why you chose it. If the question is not your question, make sure to explain what makes the question a good statistical question and why the bad driver data are appropriate to answer it.

Guidance

[Wrap-Up: “I hope that you had an opportunity to critique questions your classmates had written during this activity. It’s challenging to work together to select only one question from your group, especially when there is more than one good question to choose from! You certainly had good practice at collaborative learning in this activity. You’ll be ready for the activity in the following section, in which you’ll receive some tools for forming effective study groups outside of class. For now, though, let’s turn back to today’s experience. Take a look back at the objectives at the start of this page to see if you can recognize where each of them appeared in the activity. You should recognize now that analyzing statistics is an investigative process that begins with a good question, followed by identification of what data are needed to answer that question, followed by the actual collection of the data. And data collection is where we’re headed soon! Before closing this activity, take a few minutes to reflect on writing good statistical questions by answering Question 5.”]

Question 5

After hearing the questions shared by the other groups in the class, summarize what you learned about writing good statistical questions and making sure the data available are appropriate to answer those questions.

Alpha Module 1: Collecting Data Sensibly and with Purpose