5.1.1: Data Sources and Representations

Learning Objectives

  • Identify populations and samples
  • Determine whether a variable is independent or dependent
  • Read the information in a table
  • Write a cell in a table as a data pair
  • Identify categorical data
  • Identify quantitative data and class it as discrete or continuous
  • Read the information in a frequency table, a grouped frequency table, and a relative frequency table.

Key Words

  • Population: every member of a set
  • Sample: a subset of the population chosen using a defined procedure
  • Table: rows and columns of data
  • Cell: a specific row and column in a table
  • Data pair: an independent and dependent value inside parentheses
  • Independent variable: a variable that stands on its own with no outside influence
  • Dependent variable: a variable that is determined by the corresponding independent variable
  • Categorical variable: a variable whose values are qualitative and can take on one of a limited, and usually fixed, number of possible values. The categories have no obvious order to them.
  • Discrete quantitative variable: a numerical variable whose values can be counted. There is a clear order to the variable values but the values cannot be divided.
  • Continuous quantitative variable: a numerical variable that represents measurements and where the variable values cannot be counted. There are an infinite number of values in a given interval.
  • Frequency table: a table that shows the number of times a given value occurs.
  • Grouped frequency table: a data representation in which grouped data is displayed along with the corresponding frequencies.
  • Class: a grouped interval
  • Boundary Value: the value halfway between classes
  • Percent: a number out of 100
  • Frequency: the number of observations of a data point
  • Percent frequency: the frequency of a given group divided by the total number of observations, written as a percent (also called relative frequency)

Data

Data is information. It can be facts or figures; numbers or words; measurements or observations; personal or transactional. Data is essential for businesses of all types because it helps them understand and improve their processes, which can reduce wasted money and time. It helps to identify issues; it empowers us to make informed decisions; it can be used to support an argument; it allows theories to emerge; it allows options to be assessed. Good data, if well analyzed, can improve the quality of our lives.

Data can be categorized as qualitative or quantitative. Data is said to be qualitative if it measures the quality or type of something rather than its quantity. If a data variable has no obvious numerical order, it is said to be a categorical variable (sometimes called a nominal variable). An example of a categorical variable would be the yes/no answer to a question; there are two categories, yes and no. Other categorical variables are hair color; sex; material type; payment method; brand; educational level; etc.

When data is quantitative, the values are derived from measuring or counting something, and the data can be either discrete or continuous. Discrete variables are similar to categorical variables, but there is a clear order to them and they can be counted. Discrete data can take on only specific, separated values (usually whole numbers) and cannot be divided into smaller parts. Examples include the number of customer complaints; a whole number age; shoe sizes; the number of students in a class. In comparison, continuous variables can take on any value within an interval (there are an infinite number of possible values within any given interval). Examples of continuous variables are any form of measurement: time; length; speed; volume; weight; etc.

In statistics, a population is a set of similar items or events which is of interest. A population contains every possible member of that set. For example, the US Census Bureau is interested in the population of every person living in the United States. On the other hand, a sample is a subset of a population where individuals or objects are collected or selected from a population by a defined procedure. An example of such a procedure is random sampling where individuals from the population are selected at random. When a sample is truly representative of the population, insights gleaned from sample data can be applied to the population. This is what happens in drug trials where a new drug is tested on a sample of patients. If the drug is effective for the sample, it is made available to the general population.
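As a rough illustration of random sampling, the following Python sketch draws a sample from a small made-up population. The population of 1,000 ID numbers, the sample size of 50, and the seed are assumptions chosen only for illustration; they do not come from any real study.

```python
import random

# Hypothetical population: ID numbers 1 through 1000 (illustrative only).
population = list(range(1, 1001))

# Fix the seed so the sketch selects the same sample every time it runs.
rng = random.Random(2021)

# Random sampling: every member has the same chance of being selected,
# and no member is selected more than once.
sample = rng.sample(population, k=50)

print(len(population))                  # size of the population
print(len(sample))                      # size of the sample
print(set(sample) <= set(population))   # the sample is a subset of the population
```

Whether a sample like this is representative still depends on how the population itself was defined and listed.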

Examples

Categorize the data variables as qualitative or quantitative. Determine if the data values are categorical, discrete or continuous.

1. The height of UVU basketball players.

2. The types of material used to make a computer.

3. The weight of packages being shipped.

4. The highest degree earned by graduates.

5. The time to run a marathon.

6. The number of cars crossing an intersection.

Solution

1. Height is a numerical measurement so it is quantitative and continuous.

2. Material is not a numerical value; it is qualitative data; categorical.

3. Weight is a numerical measurement so it is quantitative and continuous.

4. Degree is qualitative and categorical.

5. Time is a numerical measurement so it is quantitative and continuous.

6. Number is quantitative but it takes on only whole number values, so the data is discrete.

Try It

Categorize the data variables as qualitative or quantitative. Determine if the data values are categorical, discrete or continuous.

1. The time to drive from Orem to Salt Lake City.

2. The number of full time students attending UVU by academic year.

3. The breed of dog owned by people in a neighborhood.

4. The number of credit hours taken by students at UVU.

 

Tables

Tables are often used to communicate data. Tables can be filled with numerical data or written data. They can be used to compare data or when a single item has several data points associated with it.

A table is made up of a series of rows and columns and has a title that describes what is shown in the table. Rows run horizontally (left to right), while columns run vertically (top to bottom). Each row and column has a title that describes what is contained in each cell of the table.

The title of Table 1 is “The weight of certain dogs according to their breed”; this lets us know what information is contained in the table. The row titled “Weight (lb.)” indicates the weight of each dog in pounds. The columns are titled with the name of the dog and its breed. This is a simple table that can be read quite easily. For example, Lincoln is a Dachshund that weighs [latex]18[/latex] lb., while Bailey is a Shetland Sheepdog that weighs [latex]25[/latex] lb.

Table 1. The weight of certain dogs according to their breed.

| | Willy: Basset Hound | Bogart: Border Collie | Lincoln: Dachshund | Bailey: Shetland Sheepdog | Kinsey: Labrador Retriever |
|---|---|---|---|---|---|
| Weight (lb.) | 52 | 37 | 18 | 25 | 68 |

The information in the table could be written as a set of data pairs in the form (breed, weight in pounds). So the same information could be written as {(Basset Hound, [latex]52[/latex]), (Border Collie, [latex]37[/latex]), (Dachshund, [latex]18[/latex]), (Shetland Sheepdog, [latex]25[/latex]), (Labrador Retriever, [latex]68[/latex])}. Each data pair contains two pieces of information: the breed of the dog and its weight. The weight of the dog is dependent upon the breed. Consequently, “breed” is called the independent variable, while “weight” is called the dependent variable.
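The data pairs from Table 1 can also be sketched in Python as a list of tuples, with the independent value first and the dependent value second. The variable names are illustrative assumptions.

```python
# Table 1 as a list of data pairs in the form (breed, weight in pounds).
data_pairs = [
    ("Basset Hound", 52),
    ("Border Collie", 37),
    ("Dachshund", 18),
    ("Shetland Sheepdog", 25),
    ("Labrador Retriever", 68),
]

# Each pair holds the independent value (breed) first
# and the dependent value (weight) second.
for breed, weight in data_pairs:
    print(f"{breed}: {weight} lb.")

# Look up the dependent value from the independent value.
weight_by_breed = dict(data_pairs)
print(weight_by_breed["Dachshund"])  # 18
```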

Table 2 is a bit more complex, and because of that, the data is not quite as accessible to the reader. However, the data itself is complex, and a table is a reasonable way to display it for ease of review. Table 2 shows the results of a survey in which participants were asked to score eight speakers (a male and a female for each of four English language dialects) according to their professionalism, intelligence, education, friendliness, and sociability on a scale of 1 – 10. A score of 1 indicates the lowest rating and a score of 10 the highest rating. An average score was calculated for each speaker from the responses.

Table 2. Average Perceptions of English Speakers*

| Dialect | Sex | Professional | Intelligent | Educated | Friendly | Extroverted |
|---|---|---|---|---|---|---|
| Standard American English | Female Speaker | 5.83 | 5.83 | 5.75 | 5.42 | 4.92 |
| Standard American English | Male Speaker | 6.92 | 6.67 | 6.75 | 6.42 | 6.33 |
| Southern American English | Female Speaker | 5.75 | 5.17 | 5.00 | 7.25 | 7.00 |
| Southern American English | Male Speaker | 4.33 | 4.17 | 3.75 | 5.92 | 6.42 |
| British English | Female Speaker | 7.50 | 7.33 | 7.33 | 5.50 | 5.25 |
| British English | Male Speaker | 6.50 | 6.25 | 6.17 | 5.17 | 4.92 |
| Australian English | Female Speaker | 7.00 | 6.92 | 7.08 | 6.25 | 6.42 |
| Australian English | Male Speaker | 6.92 | 6.92 | 6.75 | 6.17 | 6.00 |

*Participants in this survey were asked to rate speakers on a scale of 1–10.

To read this table, we have to be aware of exactly which row and column any given cell is in. For example, a female speaking standard American English is rated [latex]5.83[/latex] for sounding professional. This compares to a male speaking standard American English, who is rated [latex]6.92[/latex]. Since [latex]6.92 > 5.83[/latex], the male speaking standard American English is perceived as being more professional than the female speaking standard American English.

Another example from Table 2 is that a female speaking British English is rated [latex]7.33[/latex] on the perception of being educated. This compares to a female speaking southern American English, who is rated [latex]5.00[/latex] on the perception of being educated. Since [latex]7.33 > 5.00[/latex], the female speaking British English is seen as being more educated than the female speaking southern American English.

Example

Using the data in Table 2,

  1. Is the variable “sex” categorical, discrete, or continuous?
  2. Is the calculated variable “average score” categorical, discrete, or continuous?
  3. Compare the rating of a male speaking Australian English with a male speaking Southern American English on being extroverted.
  4. Compare the rating of a female speaking standard American English with a male speaking standard American English on friendliness.
  5. Which cell in the table has the highest rating score and what does it measure?
  6. Which cell in the table has the lowest rating score and what does it measure?

Solution

  1. Sex is categorical.
  2. Average score is continuous.
  3. The rating of a male speaking Australian English on being extroverted is [latex]6.00[/latex]. The rating of a male speaking Southern American English on being extroverted is [latex]6.42[/latex]. The male speaking Southern American English is perceived as being more extroverted than the male speaking Australian English.
  4. The rating of a female speaking standard American English on friendliness is [latex]5.42[/latex]. The rating of a male speaking standard American English on friendliness is [latex]6.42[/latex]. The male speaking standard American English is perceived as being friendlier than the female speaking standard American English.
  5. The highest rating score in the table is [latex]7.50[/latex] and rates a female, British English speaker on professionalism. Female, British English speakers are perceived as being the most professional.
  6. The lowest rating score in the table is [latex]3.75[/latex] and rates a male speaking Southern American English on being educated. Male, Southern American English speakers are perceived as being the least educated.
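Reading a cell by its row and column, and scanning every cell for the highest and lowest scores, can be sketched in Python. The dictionary layout and names below are illustrative assumptions; the scores are copied from Table 2.

```python
# Column titles of Table 2, in order.
TRAITS = ["Professional", "Intelligent", "Educated", "Friendly", "Extroverted"]

# Each row of Table 2, keyed by (dialect, sex of speaker).
ROWS = {
    ("Standard American", "Female"): [5.83, 5.83, 5.75, 5.42, 4.92],
    ("Standard American", "Male"): [6.92, 6.67, 6.75, 6.42, 6.33],
    ("Southern American", "Female"): [5.75, 5.17, 5.00, 7.25, 7.00],
    ("Southern American", "Male"): [4.33, 4.17, 3.75, 5.92, 6.42],
    ("British", "Female"): [7.50, 7.33, 7.33, 5.50, 5.25],
    ("British", "Male"): [6.50, 6.25, 6.17, 5.17, 4.92],
    ("Australian", "Female"): [7.00, 6.92, 7.08, 6.25, 6.42],
    ("Australian", "Male"): [6.92, 6.92, 6.75, 6.17, 6.00],
}

# Attach the column titles so a cell is found by row, then column.
table2 = {key: dict(zip(TRAITS, scores)) for key, scores in ROWS.items()}

# Reading one cell: female speaker, standard American English, "Professional".
print(table2[("Standard American", "Female")]["Professional"])  # 5.83

# Flatten the table into (score, dialect, sex, trait) entries to scan all cells.
cells = [
    (score, dialect, sex, trait)
    for (dialect, sex), row in table2.items()
    for trait, score in row.items()
]
highest = max(cells)  # tuples compare by score first
lowest = min(cells)
print(highest)
print(lowest)
```

Scanning every cell this way is a useful check on answers found by eye, since a small value in the middle of a row is easy to overlook.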

Try It

The Utah Valley Express bus has an online timetable.

  1. If you have a 9 am class to attend at UVU and are catching the bus at the Center Street Station, what time will you catch the bus to be on time to your class? Assume it takes 10 minutes to walk from the UVU bus stop to your class.
  2. If you have an 11 am class to attend at UVU and are catching the bus at BYU Stadium, what time will you catch the bus to be on time to your class? Assume it takes 10 minutes to walk from the UVU bus stop to your class.
  3. Assuming the buses run on time, how long does it take to travel to UVU if you catch the 11:55 am bus from Provo Station?
  4. Is the variable “bus stop” categorical, discrete, or continuous?
  5. Is the variable “time” categorical, discrete, or continuous?

Tables help us manage more complex sets of data. A table can be used if you’re looking to display individual values, if values are being compared, or if data is going to be shown and then summarized. Tables can convey a large amount of information in an easy-to-understand way.

Frequency Tables

A table that shows the frequency of each data value in a sample is called a frequency table. It is used when values in the data set repeat, so that the frequency of each value can be shown.

Table 3 shows the grades of thirty students after the first test in a math class. If a C or better is a passing grade, then [latex]5+9+11=25[/latex] of the 30 students were passing after the first test, and [latex]3+2=5[/latex] students were not passing. A frequency table gives an idea of how the grades are distributed. More students earned a C than any other grade. Five students earned an A, the same number as those who did not pass. A table like this is useful for teachers to get a sense of how much the students are learning.

Table 3. Grades after first test.

| Grade | A | B | C | D | E |
|---|---|---|---|---|---|
| Number of Students | 5 | 9 | 11 | 3 | 2 |

Notice that Table 3 does not give individual scores on the exam. Grade is a categorical variable, so this table only tells us the number of students in each grade category.
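A frequency table is a tally of how often each value occurs. As a minimal sketch, the Python below tallies a list of letter grades with `collections.Counter`. The list of individual grades is hypothetical: it is rebuilt from the counts in Table 3 purely so there is something to tally.

```python
from collections import Counter

# Hypothetical raw data: one letter grade per student,
# rebuilt from the counts in Table 3 (the real gradebook is not given).
grades = ["A"] * 5 + ["B"] * 9 + ["C"] * 11 + ["D"] * 3 + ["E"] * 2

# Tally how many times each grade occurs.
frequency_table = Counter(grades)
for grade in "ABCDE":
    print(grade, frequency_table[grade])

# A C or better is a passing grade.
passing = sum(frequency_table[g] for g in "ABC")
not_passing = sum(frequency_table[g] for g in "DE")
most_common_grade, most_common_count = frequency_table.most_common(1)[0]

print(passing, not_passing)                   # 25 5
print(most_common_grade, most_common_count)   # C 11
```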

Example

Use the table to answer the questions:

| Shoe size | 7 | 7.5 | 8 | 8.5 | 9 | 9.5 | 10 | 10.5 | 11 |
|---|---|---|---|---|---|---|---|---|---|
| Number | 1 | 2 | 5 | 7 | 10 | 9 | 6 | 4 | 1 |

Shoe sizes of 12th grade male students
  1. What does the table show?
  2. What is the sample size?
  3. What type of variable is shoe size?
  4. Which shoe size is most common?
  5. Which shoe size is least common?
  6. How many students wear a size 8 or smaller?
  7. How many students wear a size 10 or larger?

Solution

  1. The number of 12th grade male students who wear each shoe size.
  2. [latex]1+2+5+7+10+9+6+4+1=45[/latex]
  3. Discrete
  4. Size [latex]9[/latex] has the highest frequency of [latex]10[/latex] students.
  5. Sizes [latex]7[/latex] and [latex]11[/latex] are each worn by only [latex]1[/latex] student.
  6. [latex]1+2+5=8[/latex]
  7. [latex]6+4+1=11[/latex]
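The answers above can be checked with a short Python sketch that stores the frequency table as a dictionary. The variable names are illustrative assumptions.

```python
# Frequency table: shoe size -> number of 12th grade male students.
shoe_sizes = {7: 1, 7.5: 2, 8: 5, 8.5: 7, 9: 10, 9.5: 9, 10: 6, 10.5: 4, 11: 1}

# Sample size: the sum of all the frequencies.
sample_size = sum(shoe_sizes.values())

# Most common size: the size with the highest frequency.
most_common = max(shoe_sizes, key=shoe_sizes.get)

# Least common sizes: every size that shares the lowest frequency.
lowest_frequency = min(shoe_sizes.values())
least_common = [size for size, count in shoe_sizes.items() if count == lowest_frequency]

# Add the frequencies for the sizes that meet each condition.
size_8_or_smaller = sum(count for size, count in shoe_sizes.items() if size <= 8)
size_10_or_larger = sum(count for size, count in shoe_sizes.items() if size >= 10)

print(sample_size)         # 45
print(most_common)         # 9
print(least_common)        # [7, 11]
print(size_8_or_smaller)   # 8
print(size_10_or_larger)   # 11
```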

Try It

A survey was conducted in the Science Building at UVU. The table shows the results.

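| Major | Biology | Chemistry | Data Science | Education | Geology | Mathematics | Physics | Sociology | Undeclared |
|---|---|---|---|---|---|---|---|---|---|
| Female | 7 | 2 | 4 | 2 | 1 | 2 | 0 | 2 | 4 |
| Male | 5 | 3 | 1 | 1 | 2 | 4 | 2 | 0 | 4 |

Survey results of UVU students' majors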
  1. What type of variable is major?
  2. How many students were surveyed?
  3. How many female students were surveyed?
  4. How many male students were surveyed?
  5. What is the most common major for female students?
  6. What is the most common major for male students?
  7. How many students had undeclared majors?
  8. Is this sample representative of all UVU students? Why or why not?

 

Grouped Frequency Tables

A table that shows the frequency of grouped data values in a sample is called a grouped frequency table. It is used when individual values in the data set are unknown but the frequency of the grouped values can be shown.

Table 4 shows the results from a question on the Spring 2021 UVU Student Survey that asked, “How many hours per week are you currently employed?” To find the total number of students who answered this question, we can add up the number of responses in each group: [latex]155+60+196+269+206+143=1029[/latex]. We can also see that working 21 to 30 hours per week is most common, as that is the group with the highest frequency.

Table 4. Hours per week at current employment.

| Hours per week | None. I am currently not employed. | 1 – 10 | 11 – 20 | 21 – 30 | 31 – 40 | 41 or more |
|---|---|---|---|---|---|---|
| Number of responses | 155 | 60 | 196 | 269 | 206 | 143 |

Notice that the number of hours worked is a continuous quantitative variable. The hours per week are grouped, and no group intersects with any other group. This is very important for a grouped frequency table. Each interval of hours worked (e.g., 11-20) is called a class. No matter the endpoints of the classes, there will always be a gap between one class and the next. For example, there is a gap between 10 and 11. One question that often comes up is, “What if someone works 10.4 hours per week or 30.7 hours per week? Where does that fit into the table?”

The halfway point between adjacent class endpoints is called a boundary value. For example, halfway between 10 and 11 is 10.5, so 10.5 is a boundary value. Likewise, 0.5, 20.5, 30.5, and 40.5 are all boundary values. A boundary value belongs to the class following it. So, 10.5 belongs to the 11-20 class, and 30.5 belongs to the 31-40 class. This means that someone who works 10.4 hours per week, which is less than the boundary value 10.5, would be counted in the 1-10 class. Similarly, 30.7 hours per week is greater than the boundary value 30.5, so it would be counted in the 31-40 class. This is equivalent to rounding the number of hours worked to the nearest whole number: 10.4 rounds to 10; 30.7 rounds to 31. In summary, if a value falls between classes, we round it to the number of decimal places used in the class endpoints.
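The rule for placing a value that falls between classes can be sketched in Python. The function names are hypothetical, and treating anything that rounds below 1 as "Not employed" is an assumption made for this sketch. Note that Python's built-in `round` sends 10.5 to 10 (it rounds halves to the nearest even number), so the sketch uses `math.floor(hours + 0.5)` to send boundary values to the following class.

```python
import math

# Classes from Table 4 as (lower endpoint, upper endpoint).
# None as the upper endpoint marks the open-ended "41 or more" class.
CLASSES = [(1, 10), (11, 20), (21, 30), (31, 40), (41, None)]


def round_half_up(hours):
    """Round to the nearest whole number, sending boundary values such as 10.5 upward."""
    return math.floor(hours + 0.5)


def find_class(hours):
    """Return the label of the Table 4 class that counts this number of hours."""
    rounded = round_half_up(hours)
    if rounded < 1:
        # Assumption for this sketch: values below 0.5 count as not employed.
        return "Not employed"
    for lower, upper in CLASSES:
        if upper is None:
            return f"{lower} or more"
        if rounded <= upper:
            return f"{lower}-{upper}"


for hours in (10.4, 10.5, 30.5, 30.7):
    print(hours, "->", find_class(hours))
```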

Example

The grouped frequency table displays the time to complete the New York Marathon in November 2021 for participants using handcycles.

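| Time in hours | 1.5 – 1.9 | 2.0 – 2.4 | 2.5 – 2.9 | 3.0 – 3.4 | 3.5 – 3.9 | 4.0 – 4.4 | 4.5 – 4.9 | 5.0 – 5.4 | 5.5 – 5.9 | 6.0 – 6.4 | 6.5 – 6.9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of athletes | 4 | 16 | 9 | 5 | 0 | 1 | 0 | 2 | 0 | 0 | 1 |
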
  1. How many hand cyclists completed the marathon?
  2. What was the most common window of time for them to finish?
  3. What was the fastest window of time to finish?
  4. What was the slowest window of time to finish?
  5. What is the boundary value between the 2.5-2.9 class and the 3.0-3.4 class?

Solution

  1. [latex]4+16+9+5+1+2+1=38[/latex]
  2. The highest frequency is 16 finishers, who finished in 2.0 – 2.4 hours.
  3. The fastest window was 1.5 – 1.9 hours.
  4. The slowest window was 6.5 – 6.9 hours.
  5. The boundary value is exactly halfway between 2.9 and 3.0: [latex]2.95[/latex].

Try It

| Ticket Price ($) | 70-79 | 80-89 | 90-99 | 100-109 | 110-119 | 120-129 | 130-139 | 140-149 | 150-159 |
|---|---|---|---|---|---|---|---|---|---|
| Number of Teams | 3 | 5 | 5 | 6 | 7 | 3 | 1 | 1 | 1 |

National Football League (NFL) average ticket price in 2020
  1. What is shown on this table?
  2. Which class of ticket prices do the most teams fall into?
  3. Which is the class that is most expensive and how many teams charge that price?
  4. What is the boundary value between the 110-119 class and the 120-129 class?
  5. In which class would a ticket price of $129.76 be counted?

 

Relative Frequency Tables

Often, seeing raw frequencies like those in Table 4 is not as helpful as seeing the percent of students in each group. By finding the percent of students in each group, we are determining the number of students out of 100 that would fall in each group. To find the percent of students in each group, we divide the number of students in the group by the total number of students surveyed. For example, [latex]60[/latex] students work between 1 and 10 hours per week. This translates to [latex]\frac{60}{1029}=0.058[/latex] when divided and rounded to 3 decimal places. To turn any number into a percent, we multiply by 100 and add a percent sign, so [latex]0.058[/latex] is equivalent to [latex]5.8\text{%}[/latex]. This means that for every 100 students, 5.8 work between 1 and 10 hours per week. Obviously we can’t have 0.8 of a student! Percents are often decimals, so we usually round to a number that makes the most sense. Since we are talking about whole numbers of students, we can round [latex]5.8\text{%}[/latex] to [latex]6\text{%}[/latex].

Turning a Number into a Percent

For all real numbers [latex]x[/latex],  [latex]x=100x\large\text{%}[/latex].

To turn any number into a percent, multiply the number by 100%.
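The conversion can be sketched in Python. The helper name `to_percent` and its `places` parameter are assumptions made for illustration; the example reuses the 60 students out of 1029 who work 1 to 10 hours per week.

```python
def to_percent(x, places=1):
    """Multiply a number by 100 and attach a percent sign, rounded to `places` decimals."""
    return f"{100 * x:.{places}f}%"


# 60 of the 1029 students surveyed work 1 to 10 hours per week.
proportion = 60 / 1029

print(round(proportion, 3))       # 0.058
print(to_percent(proportion))     # 5.8%
print(to_percent(proportion, 0))  # 6%
```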

Table 5 shows the same data as Table 4 with the percent of total students shown for each group. When we give the percent of the total in a frequency table, it is referred to as a relative frequency table. Any time data is displayed in a relative frequency table, the total number of data points must be given.

Table 5. The relative frequency of the number of hours worked per week at current employment. [latex]n=1029[/latex]

| Hours per week | None. I am currently not employed. | 1 – 10 | 11 – 20 | 21 – 30 | 31 – 40 | 41 or more |
|---|---|---|---|---|---|---|
| Relative frequency | 15% | 6% | 19% | 26% | 20% | 14% |

Notice that if we add up all of the relative frequencies, we get 100%. This makes perfect sense, since 100% (=1) of the total frequencies is the total number of students.

Relative Frequency

Relative frequency = [latex]\large\frac{\text{frequency in a group}}{\text{total number of observations}}\cdot 100\text{%}[/latex]
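As a check on Table 5, the formula can be applied to every group in Table 4 with a short Python sketch. The dictionary labels and variable names are illustrative assumptions.

```python
# Table 4: hours worked per week -> number of responses.
frequencies = {
    "Not employed": 155,
    "1-10": 60,
    "11-20": 196,
    "21-30": 269,
    "31-40": 206,
    "41 or more": 143,
}

# Total number of observations (n).
total = sum(frequencies.values())

# Relative frequency = frequency in a group / total number of observations * 100%.
relative = {group: 100 * count / total for group, count in frequencies.items()}

# Round each relative frequency to a whole percent, as in Table 5.
rounded = {group: round(percent) for group, percent in relative.items()}

for group in frequencies:
    print(f"{group}: {relative[group]:.1f}% (rounds to {rounded[group]}%)")

print(total)                  # 1029
print(sum(rounded.values()))  # 100
```

Rounded percents do not always add up to exactly 100%; small differences caused by rounding are normal.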

Example

The results of a survey held in the United States in May 2021 are shown in the table. The study asked adults 18 years or older how often they used cable news as a source of news.

| Cable News | Daily | A few times per week | Once per week | A few times per month | Once per month | Less often than once per month | Never |
|---|---|---|---|---|---|---|---|
| Relative Frequency | 22% | 17% | 7% | 8% | 3% | 7% | 35% |

[latex]n=2200[/latex]
  1. What percent of adults used cable networks as a source of news on a daily basis?
  2. What percent of adults never used cable networks as a source of news?
  3. How many more people answered that they never used cable networks as a source of news than those that used cable networks as a source of news on a daily basis?

Solution

  1. 22% of adults
  2. 35% of adults
  3. There is a difference of [latex]35\text{%}-22\text{%}=13\text{%}[/latex]. This amounts to [latex]13\text{% of }2200\text{ respondents }=0.13(2200)\text{ respondents}=286\text{ respondents}[/latex].

Remember that [latex]13\text{%}=\frac{13}{100}=0.13[/latex].
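Going from a relative frequency back to a number of respondents can be sketched in Python. The variable names are assumptions; the percents and the value of n come from the table above. Multiplying before dividing by 100 keeps the arithmetic exact for these whole-number percents.

```python
n = 2200  # total number of respondents

never_percent = 35   # never used cable news
daily_percent = 22   # used cable news daily

# Difference between the two relative frequencies, in percentage points.
difference_percent = never_percent - daily_percent

# Percent of n: multiply by the percent, then divide by 100.
never_count = never_percent * n / 100
daily_count = daily_percent * n / 100
difference_count = difference_percent * n / 100

print(difference_percent)        # 13
print(never_count, daily_count)  # 770.0 484.0
print(difference_count)          # 286.0
```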

Try It

Use the table to answer the questions.

| Income ($) | Under 15,000 | 15,000-24,999 | 25,000-34,999 | 35,000-49,999 | 50,000-74,999 | 75,000-99,999 | 100,000-149,999 | 150,000-199,999 | 200,000 and over |
|---|---|---|---|---|---|---|---|---|---|
| Relative Frequency | 9.4% | 8.7% | 8.1% | 11.6% | 16.5% | 12.2% | 15.3% | 8% | 10.3% |

Percentage distribution of household income in the U.S. in 2020. [latex]n=129,931,000[/latex]
  1. What data is displayed in the table?
  2. How many households were there in the US in 2020?
  3. What percent of US households earn $100,000 or more?
  4. What percent of US households earn less than $35,000?
  5. How many US households earn $200,000 or more?
  6. How many more US households earn under $15,000 than earn $25,000-$34,999?