Forming Connections in 1C: Data Collection and Organization

Objectives for the activity

During this activity, you will:

  • Organize data in a spreadsheet.
  • Distinguish between observational units and variables in a dataset.
  • Distinguish between categorical and quantitative variables.
  • Distinguish between quantitative variables that are discrete or continuous.
  • Identify variables that can be used to collect data.

In What to Know [1C], you learned to distinguish between statistical investigative questions and survey questions. You also began to see that some data could be numerical or non-numerical. In this activity, we’ll extend your understanding of statistical problem-solving by learning some key terms and organizational strategies associated with data collection.

Recall the four steps of the statistical problem-solving process: (1) forming a statistical question, (2) collecting data, (3) analyzing the data, and (4) interpreting the results. Today we’ll consider the connection between the first two steps. That is, how do we get from the statistical investigative question to a data collection plan? Along the way, you’ll see that multiple data collection and organization strategies may be considered for a single statistical question. You’ll also consider ethical obligations related to data collection and storage.

Data Collection and Organization

In practice, there are often multiple data collection options to consider. For example, if we were interested in the relationship between phone use in class and grades, there are many ways to define the relevant variables and collect and organize the information.

[Image: Several people sitting in a row, all using their smartphones.]

Consider Question 1 below individually, then compare your answer with a partner and discuss the similarities and differences in your answers.

Question 1

Do you think there is a relationship between a student’s phone use in class and their grades? Are there any details about “phone use” that are important to consider?

Data Organization

A dataset contains information about a group of individuals or observational units. The characteristics of these observational units are recorded as variables. For example, a researcher collecting data on student phone use might ask individual students to report the number of times they checked their messages during class. In this case, the variable is the number of times messages were checked during class, and the observational unit is an individual student. Before the data can be analyzed, it needs to be organized into the rows and columns of a spreadsheet. See the example below for a demonstration.

Example

Picture yourself as the researcher collecting responses for many survey questions (variables) from each individual (observational unit) you survey. The data will be organized into a spreadsheet, which consists of rows and columns. Naturally, there are only two possibilities for arranging the variable responses for each individual surveyed.

Which of the following two options do you think represents the way observational units and variables are usually organized in a spreadsheet?

Option A: Each row is a variable and each column is an observational unit

Variables | Individual 1 | Individual 2
Variable 1 | response 1 | response 1
Variable 2 | response 2 | response 2
Variable 3 | response 3 | response 3

Option B: Each row is an observational unit and each column is a variable

Individual surveyed | Variable 1 | Variable 2 | Variable 3
Individual 1 | response 1 | response 2 | response 3
Individual 2 | response 1 | response 2 | response 3

For further reading, see the open access article “Data Organization in Spreadsheets,” published in The American Statistician and available at Taylor & Francis Online.
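The two layouts above can also be compared in code. The sketch below is only an illustration, with made-up placeholder responses (`a1`, `b1`, and so on) that are not part of the activity. It builds the Option A layout, transposes it into the Option B layout, and then reads the result with Python's standard `csv.DictReader`, which treats each row after the header as one record.

```python
import csv
import io

# Option A layout: each row is a variable, each column an observational unit.
# The responses (a1, b1, ...) are placeholders invented for this sketch.
option_a = [
    ["Variables", "Individual 1", "Individual 2"],
    ["Variable 1", "a1", "b1"],
    ["Variable 2", "a2", "b2"],
    ["Variable 3", "a3", "b3"],
]

# Transposing swaps rows and columns, which gives the Option B layout:
# each row is an observational unit, each column a variable.
option_b = [list(row) for row in zip(*option_a)]
option_b[0][0] = "Individual surveyed"

# Write Option B as CSV text, then read it back one record per row.
buffer = io.StringIO()
csv.writer(buffer).writerows(option_b)
buffer.seek(0)
records = list(csv.DictReader(buffer))

for record in records:
    print(record["Individual surveyed"], record["Variable 2"])
```

With one row per observational unit, each record carries all of one individual's responses, and each variable can be looked up by its column name.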

Are you beginning to develop an image of how data can be organized in a spreadsheet? Answer Question 2 below to check your understanding.

Question 2

A dataset contains information about a group of individuals or observational units. The characteristics of these observational units are recorded as variables. How are observational units and variables usually organized in a spreadsheet?

Types of Variables

A variable is classified as categorical if it places an individual into one of several groups; it is classified as quantitative if it takes numerical values that can be used in arithmetic.

There are two types of quantitative variables. A discrete variable takes values from a set of separate, countable values (often whole numbers), and it is not possible to get any value in between. In contrast, a continuous variable can take any value within an interval, so its range of outcomes includes infinitely many possible values. The discussion below provides a demonstration and examples of these types of variables. Try the question given in the Example before moving to Question 3.

Example

Categorical Variables

These variables place an individual into one of several groups. Categorical survey questions are often encountered when completing forms that ask for information such as gender and race.

Quantitative Variables

Quantitative variables may be discrete or continuous.

Discrete variables often require non-negative whole numbers as responses. For example, an automobile insurance applicant may be asked how many accidents they were found to be at fault for. Responses would necessarily be a whole number like [latex]0[/latex], [latex]1[/latex], or [latex]2[/latex].

Continuous variables take any number or fraction of a number as a response, such as weight in pounds ([latex]155[/latex], [latex]187.2[/latex], or [latex]221.9[/latex]).
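The definitions above can be sketched in code. The records, the variable names (`eye_color`, `at_fault_accidents`, `weight_lb`), and the `classify` helper below are all hypothetical and invented for this illustration. The helper uses a rough rule of thumb based on how values are stored: text is treated as categorical, whole numbers as discrete, and other numbers as continuous.

```python
from statistics import mean

# Hypothetical records for three people (made-up values for illustration).
people = [
    {"eye_color": "brown", "at_fault_accidents": 0, "weight_lb": 155.0},
    {"eye_color": "blue", "at_fault_accidents": 2, "weight_lb": 187.2},
    {"eye_color": "brown", "at_fault_accidents": 1, "weight_lb": 221.9},
]


def classify(values):
    """Rough heuristic: text -> categorical, whole numbers -> discrete,
    any other numbers -> continuous."""
    if all(isinstance(v, str) for v in values):
        return "categorical"
    if all(isinstance(v, int) for v in values):
        return "discrete quantitative"
    return "continuous quantitative"


# Classify each variable (column) from the values recorded for it.
variable_types = {
    name: classify([person[name] for person in people])
    for name in people[0]
}

# Arithmetic is meaningful for the quantitative variables.
average_accidents = mean(p["at_fault_accidents"] for p in people)
```

A storage-based rule like this is only a starting point. A number that merely labels a group is still categorical, so the deciding question remains whether arithmetic on the values is meaningful.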

Ex. Imagine that you have been selected as a statistics intern in a veterinary clinic. The veterinarian wants to collect data about the dogs seen in her office. You’ve been asked to record information from the patient files to answer the survey questions listed below. For each, state whether the associated variable is categorical, discrete quantitative, or continuous quantitative and explain how you know.

  1. What zip code does the dog’s owner live in?
  2. What is the dog’s weight in pounds?
  3. How many times has the dog been seen in the office?
  4. Does the owner have an outstanding balance due?
  5. How many pets are in the household in addition to the dog? ([latex]0[/latex], [latex]1-2[/latex], [latex]3-5[/latex], more than [latex]5[/latex])

Now you try identifying the types of variables present in survey questions with a partner. Work in pairs to discuss the list of survey questions given in Question 3.

Question 3

Consider the survey questions below. If you used these questions to collect data, would the resulting variables be categorical or quantitative? For variables that are quantitative, classify them as discrete or continuous.

 

  1. What type of mobile phone do you have? (iPhone, Android, other)
  2. What is your area code?
  3. How many devices capable of connecting to the Internet do you bring with you to class on a typical day?
  4. How much time did you spend on your phone yesterday? (less than 2 hours, 2–5 hours, more than 5 hours)
  5. Approximately how much time do you spend on your phone in a typical day?
  6. Do you usually spend more time on your phone on weekdays or on weekends?

Data Collection

Different survey questions offer different advantages and disadvantages for data collection. For example, it may be easier to remember how much time you spent on your phone yesterday compared to questions about your general habits, but a single day of phone use may not be representative of your phone use in general. The next few questions ask you to consider a statistical question that allows for many different options for collecting data, each with its own advantages and disadvantages.

Work in groups of two or three to complete the remainder of this activity.

Question 4

Suppose you want to investigate whether there is a relationship between a student’s phone use in class and their grades. Write survey questions or state variable names to answer the first two questions below.

 

Part A: List three ways that you could measure phone use. Make sure your list involves at least one categorical variable and at least one quantitative variable.

 

Part B: List three ways that you could measure grades. Make sure your list involves at least one categorical variable and at least one quantitative variable.

Part C: Revisit the lists that you made in Parts A and B. Which of the approaches do you like the best?

When developing the variables to collect data in the question above, you considered individuals as the observational units. How might your variable selections change if the observational unit shifts from individual students to class sections? Keep in mind that variables should be characteristics of the observational units as you answer Question 5 next.

Question 5

Suppose you want to investigate the relationship between phone use in class and grades using class sections as the observational units instead of individual students. Write survey questions or state variable names to answer both of the questions below.

 

Part A: Name one way you could measure phone use in a class section.

Part B: Name one way you could measure grades in a class section.
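To see how a change of observational unit reshapes a dataset, consider the sketch below. It shows one possible approach among many, not the expected answer: student-level rows are grouped by class section and summarized with an average. The section labels, variable names, and values are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical student-level rows (made-up values): one row per student.
students = [
    {"section": "A", "checks_per_class": 4, "course_grade": 82.0},
    {"section": "A", "checks_per_class": 0, "course_grade": 91.0},
    {"section": "B", "checks_per_class": 7, "course_grade": 74.0},
    {"section": "B", "checks_per_class": 3, "course_grade": 80.0},
]

# Group the student rows by the class section they belong to.
by_section = defaultdict(list)
for student in students:
    by_section[student["section"]].append(student)

# Build section-level rows: one row per class section, where each
# variable is now a characteristic of the section, not of a student.
sections = [
    {
        "section": name,
        "mean_checks_per_class": mean(r["checks_per_class"] for r in rows),
        "mean_course_grade": mean(r["course_grade"] for r in rows),
    }
    for name, rows in sorted(by_section.items())
]
```

The four student rows become two section rows. Averages are only one choice of summary; a proportion, such as the share of students who checked their phone at least once, would also describe a section.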

Ethical Issues

Think back on the survey questions and variables given in this activity as well as the ones you wrote to answer Questions 4 and 5. Do you have any ethical concerns about the data collection plans proposed? Some examples to consider include:

  • In what ways could the data collection process or the information revealed cause some students to be treated differently than others?
  • In what ways could the questions asked or methods of collection cause the data to imply associations that are not representative of the true situation?
  • Are there any ethical considerations surrounding how the collected data will be stored?

Example

Privacy concerns in data collection are paramount. You may be familiar with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.[2] For individuals at least 18 years old, this rule prevents the individual’s medical information from being revealed to anyone the individual has not identified as eligible to receive it.

A similar protection exists for college students at least 18 years old. A federal law called the Family Educational Rights and Privacy Act (FERPA)[3][4] protects the privacy of student records.

When studying phone use and grades, how could data be collected and stored in a way that protects the privacy of student information?
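One measure sometimes used to reduce privacy risk is to separate names from responses before the data is analyzed. The sketch below is an illustration under stated assumptions, not a recipe for legal compliance. The `pseudonymize` helper, the field names, and the responses are hypothetical. Each student receives a random study ID generated with Python's standard `secrets` module, and the table linking IDs to names is kept apart from the analysis data.

```python
import secrets

# Hypothetical raw survey responses (made-up names and values).
raw_responses = [
    {"name": "Student One", "checks_per_class": 4, "course_grade": 82.0},
    {"name": "Student Two", "checks_per_class": 0, "course_grade": 91.0},
]


def pseudonymize(responses):
    """Split responses into an identity key and a de-identified table."""
    key_table = {}      # study_id -> name; to be stored separately
    analysis_rows = []  # contains no names; used for analysis
    for response in responses:
        study_id = secrets.token_hex(4)  # random 8-character hex ID
        while study_id in key_table:     # regenerate on a rare collision
            study_id = secrets.token_hex(4)
        key_table[study_id] = response["name"]
        row = {k: v for k, v in response.items() if k != "name"}
        row["study_id"] = study_id
        analysis_rows.append(row)
    return key_table, analysis_rows


key_table, analysis_rows = pseudonymize(raw_responses)
```

Removing names does not by itself guarantee privacy. In a small class, a combination of other variables could still identify a student, so who can access each table also matters.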

Work in pairs or groups to summarize your understanding of the ethical concerns associated with data collection and storage as you answer Question 6.

Question 6

Are there any ethical concerns associated with a study of phone use and grades?
