15D Preview

Preparing for the next class

In the next in-class activity, you will need to understand the difference between the chi-square test of homogeneity and the chi-square test of independence, as well as understand what it means for two variables to be independent. You will also need to be able to identify the null and alternative hypotheses for a chi-square test of independence and find expected counts for the cells of the contingency table in a chi-square test of independence.

The Pew Research Center is a non-partisan fact tank that conducts polls and social science research. One survey that they conduct periodically is called the Core Trends Survey, which measures a wide variety of variables for a representative sample of American adults, including demographic information and information on Internet and social media use. Two of the variables included in the survey are Education level and Income level. The observed counts from the 2019 Core Trends Survey for these two variables are displayed in the following two-way table. [1]

We’ve seen two-way tables (also called contingency tables) before in a couple of contexts. In the previous lesson, we saw contingency tables that displayed values for one categorical variable for samples from multiple populations. In this situation, the two-way table classifies counts for a sample of individuals from one population on two categorical variables.
Count | Income level
Education level | < $30,000 | $30,000-$74,999 | $75,000 and up | Total
Post-Grad Degree | 2 | 8 | 46 | 56
College Degree | 39 | 113 | 202 | 354
Some College | 131 | 138 | 120 | 389
HS Grad | 175 | 129 | 65 | 369
No HS Degree | 78 | 32 | 8 | 118
Total | 425 | 420 | 441 | 1,286
Since we have two categorical variables measured for the same sample of individuals, the natural question to ask is, “Are these two variables independent?” In other words, “Is income level independent of education level?” We address this question using the chi-square test of independence.

Recall from In-Class Activity 7.C that two events, A and B, are independent if 𝑃(𝐴) = 𝑃(𝐴|𝐵) (i.e., knowing whether event B happens has no effect on how likely event A is to occur). If the two variables Income level and Education level are independent, knowing one’s education level should not change the probability that they will have a particular income level, so the distribution of Income level should be the same for every education level. Similarly, the distribution of Education level should be the same for every income level.

This should be feeling fairly reminiscent of the chi-square test of homogeneity, but it is different in a couple of important ways. The homogeneity test considered one categorical variable measured for samples from different populations and asked whether the distribution of that one variable was the same among the populations. In this case, we have one sample from one population of individuals for which two categorical variables are measured, and we’re asking whether those two variables are independent.
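To make the definition of independence concrete, here is a minimal sketch that compares P(A) with P(A|B) in the sample, taking A to be "income under $30,000" and B to be "has a post-graduate degree." The counts come from the two-way table above; the variable names are only illustrative.

```python
# Observed counts from the two-way table above.
# Rows: education level. Columns: < $30,000, $30,000-$74,999, $75,000 and up.
observed = {
    "Post-Grad Degree": [2, 8, 46],
    "College Degree": [39, 113, 202],
    "Some College": [131, 138, 120],
    "HS Grad": [175, 129, 65],
    "No HS Degree": [78, 32, 8],
}

grand_total = sum(sum(row) for row in observed.values())
under_30k_total = sum(row[0] for row in observed.values())

# P(A): overall proportion of the sample with income under $30,000.
p_under_30k = under_30k_total / grand_total

# P(A | B): proportion with income under $30,000 among post-grad degree holders.
post_grad = observed["Post-Grad Degree"]
p_under_30k_given_post_grad = post_grad[0] / sum(post_grad)

print(f"P(under $30k)             = {p_under_30k:.4f}")
print(f"P(under $30k | Post-Grad) = {p_under_30k_given_post_grad:.4f}")
```

In a sample, these two proportions will rarely be exactly equal even when the variables are independent in the population. The chi-square test of independence is what tells us whether the differences are larger than we would expect from sampling variability alone.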

Question 1

1) For each of the following statements, select whether it applies to the chi-square test of homogeneity, the chi-square test of independence, or the chi-square goodness of fit test.
Part A: The question we ask is, “Are the variables independent?”
a) Chi-square test of homogeneity
b) Chi-square test of independence
c) Chi-square goodness of fit test
Part B: The question we ask is, “Does the distribution of this variable match a particular theoretical model?”
a) Chi-square test of homogeneity
b) Chi-square test of independence
c) Chi-square goodness of fit test
Part C: There is one categorical variable measured for distinct populations.
a) Chi-square test of homogeneity
b) Chi-square test of independence
c) Chi-square goodness of fit test
Part D: The question we ask is, “Are the distributions the same among the populations?”
a) Chi-square test of homogeneity
b) Chi-square test of independence
c) Chi-square goodness of fit test
Part E: There are two categorical variables measured for each individual in the sample.
a) Chi-square test of homogeneity
b) Chi-square test of independence
c) Chi-square goodness of fit test
Part F: The individuals of interest come from multiple, distinct populations that are sampled separately.
a) Chi-square test of homogeneity
b) Chi-square test of independence
c) Chi-square goodness of fit test
Part G: There is one sample drawn from one population, and one categorical variable is measured for each individual in the sample.
a) Chi-square test of homogeneity
b) Chi-square test of independence
c) Chi-square goodness of fit test
Since we are addressing a different question with the chi-square test of independence, the null and alternative hypotheses are different:
[latex]H_0[/latex]: The two variables of interest are independent.
[latex]H_A[/latex]: The two variables of interest are not independent.
As usual, the null hypothesis is a statement of no change: if the two variables are independent in the population, knowing the value of one variable does not change the likelihood that the second variable will have a particular value.
Sometimes the null and alternative hypotheses are written with slightly different wording that is equivalent to the previous wording:
[latex]H_0[/latex]: The two variables of interest are not associated.
[latex]H_A[/latex]: The two variables of interest are associated.

Question 2

2) Write down the null and alternative hypotheses for a chi-square test of independence based on the example from the Pew Research Center survey.

Questions 3–6: The mechanics of performing a chi-square test of independence are the same as those for the chi-square test of homogeneity. Since we are dealing with two variables here instead of just one, we can find the expected counts for each cell by focusing on the marginal distribution of either variable.

The marginal distribution of a variable gives the distribution of one of the variables with no regard to the other variable whatsoever. In the table, this will be either the total row or the total column. One way to remember this is that the “margins” are on the outsides of a piece of paper (sides, top, and bottom), and the total row and column are the outside row and column of the table (on the side and bottom).

As in the previous in-class activity, we will use several decimal places for our proportions in order to avoid rounding errors. For example, if Income level and Education level are independent, the proportion of people with incomes under $30,000 should be the same regardless of education level, so it should match the overall proportion of individuals with incomes under $30,000:
[latex]\frac{\text{Total individuals with incomes under }\$30{,}000}{\text{Total individuals in the sample}}=\frac{425}{1286}\approx 0.33048212\text{ or }33.048212\%[/latex]

Question 3

3) Complete the following table for the marginal distribution of Income level. Enter your answers in the yellow highlighted cells below the observed count.
Count | Income level
Education level | < $30,000 | $30,000-$74,999 | $75,000 and up | Total | Proportion
Post-Grad Degree | 2 | 8 | 46 | 56 | 0.04354588
College Degree | 39 | 113 | 202 | 354 |
Some College | 131 | 138 | 120 | 389 |
HS Grad | 175 | 129 | 65 | 369 |
No HS Degree | 78 | 32 | 8 | 118 |
Total | 425 | 420 | 441 | 1,286 |
Proportion | | | | |

If the two variables are independent, the proportions you found in the previous table should be the proportions of income level for every value of the variable Education level. For example, about 33.05% of the 56 people with post-grad degrees should have an income level under $30,000:
[latex]33.048212\%\text{ of }56=0.33048212\times 56\approx 18.507[/latex]

Question 4

4) Complete the following table for the expected counts of income level for those with post-graduate degrees. (Remember that you can use subtraction to find the last one!)

Education level | < $30,000 | $30,000-$74,999 | $75,000 and up | Total
Post-Grad Degree | 18.507 | | | 56
We could also find the expected counts by using the marginal distribution for Education level. Note that 56 of the 1,286 individuals sampled have post-graduate degrees, so overall, [latex]\frac{56}{1286}\approx 4.354588\%[/latex] of individuals sampled have post-graduate degrees.

Question 5

5) Complete the table in Question 3 by adding the marginal distribution of Education level to the cells in green.

If the two variables are independent, the conditional distribution of Education level for every income level should match the proportions in the previous table. For example, of the 425 individuals sampled with an income level under $30,000, about 4.35% of them should have post-graduate degrees, so there is an expected count of 4.354588% of 425 = 0.04354588 × 425 ≈ 18.507 individuals with post-graduate degrees who make under $30,000 a year. Notice that this count matches the expected count we found in Question 4.
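The two routes agree because both reduce to the same arithmetic: (row total × column total) ÷ grand total. The short sketch below checks this for the one cell already worked out in the text (post-graduate degree, income under $30,000). The helper name expected_count is hypothetical and chosen only for this illustration.

```python
def expected_count(row_total, column_total, grand_total):
    """Expected cell count if the two variables are independent (illustrative helper)."""
    return row_total * column_total / grand_total


grand_total = 1286
post_grad_total = 56     # row total for Post-Grad Degree
under_30k_total = 425    # column total for income under $30,000

# Route 1: marginal proportion of Income level applied to the post-grad row.
via_income = (under_30k_total / grand_total) * post_grad_total

# Route 2: marginal proportion of Education level applied to the under-$30,000 column.
via_education = (post_grad_total / grand_total) * under_30k_total

# Combined form: row total times column total, divided by the grand total.
combined = expected_count(post_grad_total, under_30k_total, grand_total)

print(round(via_income, 3), round(via_education, 3), round(combined, 3))
```

All three values round to 18.507, the expected count from Question 4.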

Question 6

6) Complete the following table for the expected counts of those with an income level under $30,000.

Education level | < $30,000
Post-Grad Degree | 18.507
College Degree |
Some College |
High School Degree |
No High School Degree |
Total | 425

Looking ahead

As we did in the previous in-class activity, we will be using technology to conduct this type of chi-square test. In fact, since the tests are so similar and have the same mechanics, we’ll be using the same data analysis tool as last time. Feel free to try it out before class! https://dcmathpathways.shinyapps.io/ChiSquaredTest/
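If you are curious about what such a tool does behind the scenes, the following is a rough sketch in plain Python. It is not the code behind the data analysis tool linked above. It assumes the usual chi-square statistic (the sum over all cells of (observed − expected)² ÷ expected) and the usual degrees of freedom for a two-way table, (rows − 1) × (columns − 1).

```python
# Observed counts from the Pew two-way table.
# Rows: Post-Grad, College, Some College, HS Grad, No HS Degree.
# Columns: < $30,000, $30,000-$74,999, $75,000 and up.
observed = [
    [2, 8, 46],
    [39, 113, 202],
    [131, 138, 120],
    [175, 129, 65],
    [78, 32, 8],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Add up (observed - expected)^2 / expected over every cell,
# where expected = row total * column total / grand total.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, observed_count in enumerate(row):
        expected = row_totals[i] * column_totals[j] / grand_total
        chi_square += (observed_count - expected) ** 2 / expected

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi-square statistic = {chi_square:.2f}, df = {degrees_of_freedom}")
```

The tool reports a P-value along with the statistic. This sketch stops at the statistic and degrees of freedom so that it needs nothing beyond standard Python.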


  1. Pew Research Center. (2019). Core trends survey-Mobile technology and home broadband 2019. https://www.pewresearch.org/internet/dataset/core-trends-survey/