Preparing for the next class

In the next in-class activity, you will need to understand the purpose of a chi-square test of homogeneity, identify the null and alternative hypotheses for a chi-square test of homogeneity, and determine the expected count for each cell of a two-way table for a chi-square test of homogeneity.

Suppose we wanted to compare some of the top commercial airlines in the United States to determine whether their distributions of on-time, delayed, canceled, and diverted flights were the same. We could compare as many airlines as we like, but let’s look at the top three airlines (by number of passengers) and compare their flight status distributions. We can look at this information in a contingency table (i.e., a two-way table), where each row represents the flight status distribution of an airline.

The following table gives the data for the flights of each airline in March 2021.² Notice that in this table, we are displaying the counts for the categories of one categorical variable (flight status) for three different populations (each population is all the flights for a single airline).

| Airline | On-Time Flights | Delayed Flights | Canceled Flights | Diverted Flights | Total |
|---|---|---|---|---|---|
| American Airlines | 42,600 | 4,657 | 296 | 95 | 47,648 |
| Delta Airlines | 51,620 | 4,030 | 150 | 56 | 55,856 |
| Southwest Airlines | 69,384 | 9,280 | 1,782 | 128 | 80,574 |
| Total | 163,604 | 17,967 | 2,228 | 279 | 184,078 |

Notice that the different airlines have different numbers of flights, so it can be useful to look at the relative frequency distribution for each airline as well (i.e., the proportions of flights that have each status for each airline).
For example, since there were 42,600 American Airlines flights that were on time and 47,648 American Airlines flights total, the relative frequency (or proportion) of American Airlines flights that were on time is:

$$\frac{42{,}600}{47{,}648} = 0.894 = 89.4\%$$

Finding a similar proportion for each flight status gives the relative frequency distribution of flight status for American Airlines, displayed in the following table.

| Airline | Percentage On-Time Flights | Percentage Delayed Flights | Percentage Canceled Flights | Percentage Diverted Flights | Total |
|---|---|---|---|---|---|
| American Airlines | 42,600/47,648 = 0.894 = 89.4% | 4,657/47,648 = 0.098 = 9.8% | 296/47,648 = 0.006 = 0.6% | 95/47,648 = 0.002 = 0.2% | 100% |
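The proportion calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the activity; the function name `relative_frequencies` and the variable names are assumptions made for this example.

```python
# Observed counts for American Airlines in March 2021 (from the two-way table).
american = {"On-Time": 42_600, "Delayed": 4_657, "Canceled": 296, "Diverted": 95}

def relative_frequencies(counts):
    """Return each category's share of the row total, rounded to the nearest thousandth."""
    total = sum(counts.values())
    return {status: round(n / total, 3) for status, n in counts.items()}

american_props = relative_frequencies(american)

# Print each proportion as a decimal and as a percentage to the nearest tenth.
for status, p in american_props.items():
    print(f"{status}: {p:.3f} = {p:.1%}")
```

The same function can be applied to the counts for any other row of the two-way table, or to the column totals to get the overall percentages.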
Question 1
1) Complete the following table giving the relative frequency distribution (distribution of proportions rather than counts) for each airline, as well as the overall percentages for each flight status (for all three airlines combined).

| Airline | Percentage On-Time Flights | Percentage Delayed Flights | Percentage Canceled Flights | Percentage Diverted Flights | Total |
|---|---|---|---|---|---|
| American Airlines | 89.4% | 9.8% | 0.6% | 0.2% | 100% |
| Delta Airlines | | | | | |
| Southwest Airlines | | | | | |
| Overall | | | | | |

Hint: Round your calculations to the nearest thousandth before converting to percentages. Notice that because our percentages are rounded to the nearest tenth, the totals will sometimes not add up to exactly 100%. When we conduct the actual statistical test, we will use technology, and that technology will keep track of all the decimal places to reduce rounding errors.
In comparing the flight status distributions for these airlines, we’ll build on two ideas we’ve seen before. First, we’ve already seen a test for determining whether two population proportions are equal: the two-proportion z-test. For example, we could think of the March flights as a sample of flights for each airline and consider whether the proportion of on-time flights for all American Airlines flights is the same as the proportion of on-time flights for all Delta Airlines flights. In this case, however, we’re generalizing that idea by considering more than two populations and looking at the entire distribution of flight status for all values of the categorical variable.
Second, we’ll be building on the previous in-class activity by using a chi-square test, but instead of comparing a distribution of counts to a theoretical model, we’re comparing distributions of a categorical variable (in this case, flight status) among different populations (in this case, there are three populations: all flights for three different airlines). The word “homogeneous” means the same or similar, so the chi-square test of homogeneity asks whether two or more distributions of a categorical variable are the same. In short, a chi-square test of homogeneity compares distributions of one categorical variable for multiple populations.
Question 2
2) Consider the following scenarios and select the ones that can be tested with a chi-square test of homogeneity. There may be more than one correct answer.
a) Suppose we want to compare the types of majors selected by students at public universities and private universities. We take a random sample of students who attend public universities, and we take a random sample of students who attend private universities. We record the major of each respondent.
b) Suppose we want to compare the tendency to vote in local elections in the United States among different age groups. We take a random, representative sample of American adults and record their age groups and whether they voted in a local election in the last year.
c) Suppose we want to compare the fuel economy for pickup trucks produced by GMC and Dodge. We take a random sample of trucks produced by GMC and a random sample of trucks produced by Dodge. For each truck, we record the fuel economy in miles per gallon to one decimal place.
d) Suppose we want to compare the types of cars made by GMC and Dodge. We take a random sample of GMC car models and a random sample of Dodge car models. For each model, we record the category (sedan, SUV, pickup truck, etc.).
Hint: Things to ask yourself when considering the scenarios: Is there one categorical variable? Is that variable measured for different populations (notice that this requires a different sample from each population)?
As usual, the null hypothesis is a statement of no difference or no change, so for a chi-square test of homogeneity, the null hypothesis is always that the distribution of the categorical variable is the same among all the populations. The alternative hypothesis is that the distributions are not the same among all the populations. The test therefore looks for evidence that the differences among the samples are larger than those you would expect to see from sampling variation alone if there really were no difference in the distributions for the populations.
Question 3
3) Write the null and alternative hypotheses for the test of homogeneity for our airline flight status example.

As with the goodness of fit test, we need to find the expected count for each group. In this case, our null hypothesis is that the distribution of proportions of flight status for each airline is the same. We estimate those proportions by looking at the overall proportions for each flight status among all three airlines. For example, there were 163,604 on-time flights out of the total 184,078 flights for all three airlines we’re considering. So, the estimated on-time proportion for all three airlines is:

$$\frac{163{,}604}{184{,}078} = 0.88877541 \approx 88.9\%$$

If the distributions are the same, or homogeneous (i.e., if the null hypothesis is true), we would expect each airline to have about 88.9% of its flights be on time. American Airlines, for example, would have an expected count of on-time flights that is about 88.9% of the total 47,648 flights that American Airlines flew in March 2021. In this case, we will use more decimal places in our calculation to avoid rounding errors:

$$0.88877541 \times 47{,}648 \approx 42{,}348.4$$

Note that since the expected counts are theoretical values, they do not need to be whole numbers.
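As a rough sketch of the expected-count calculation just described, the following Python snippet pools the on-time flights across the three airlines and applies the pooled proportion to American Airlines. The variable names are assumptions chosen for this illustration; the counts come from the two-way table at the start of this preview.

```python
# Observed counts per airline, in the order:
# on-time, delayed, canceled, diverted (March 2021).
observed = {
    "American":  [42_600, 4_657,   296,  95],
    "Delta":     [51_620, 4_030,   150,  56],
    "Southwest": [69_384, 9_280, 1_782, 128],
}

# Pooled on-time proportion across all three airlines.
on_time_total = sum(row[0] for row in observed.values())   # 163,604
grand_total = sum(sum(row) for row in observed.values())   # 184,078
pooled_on_time = on_time_total / grand_total

# Expected on-time count for American Airlines if the null hypothesis is true:
# the pooled proportion times the airline's total number of flights.
american_total = sum(observed["American"])                 # 47,648
expected_american_on_time = pooled_on_time * american_total

print(f"Pooled on-time proportion: {pooled_on_time:.8f}")
print(f"Expected American on-time flights: {expected_american_on_time:.1f}")
```

Keeping the unrounded proportion in `pooled_on_time` mirrors the advice above about carrying extra decimal places to avoid rounding errors.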
Question 4
4) Complete the following table for the expected counts of on-time flights for Delta Airlines and Southwest Airlines, assuming that the distributions of flight status are the same. Notice that the total should be the overall total of on-time flights from our original two-way table.

| Airline | Expected Count of On-Time Flights |
|---|---|
| American Airlines | 42,348.4 |
| Delta Airlines | |
| Southwest Airlines | |
| Total | |

Notice that once we had the expected on-time flight counts for American Airlines and Delta Airlines, we could have just subtracted those from the total number of on-time flights in order to find the expected count for Southwest Airlines. (Try this out to check your answer!) The same goes for each column: once we have two of the expected counts, we can find the third by subtracting. Similarly, in each row, once we have three of the expected counts, we can find the fourth by subtracting. This gives us 2 × 3 = 6 degrees of freedom for our chi-square test of homogeneity. (In other words, once we have filled in six cells in the table of expected counts, we can fill in the others by subtracting.)
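The subtraction shortcut can be checked numerically. The sketch below, with variable names invented for this illustration, computes the expected on-time count for each airline directly and then confirms that the Southwest value matches what is left after subtracting the American and Delta values from the on-time total.

```python
# Row totals (flights per airline) and the on-time column total,
# taken from the original two-way table.
row_totals = {"American": 47_648, "Delta": 55_856, "Southwest": 80_574}
on_time_total = 163_604
grand_total = 184_078

# Direct calculation: (row total) x (column total) / (grand total).
expected_on_time = {
    airline: total * on_time_total / grand_total
    for airline, total in row_totals.items()
}

# Shortcut: subtract the first two expected counts from the column total.
by_subtraction = (
    on_time_total - expected_on_time["American"] - expected_on_time["Delta"]
)

# The two routes agree up to floating-point rounding.
matches = abs(by_subtraction - expected_on_time["Southwest"]) < 1e-6
print("Subtraction matches direct calculation:", matches)
```

The check works because the expected counts in a column always add up to that column's observed total.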
Question 5
5) Use the given expected flight counts for American Airlines to find the expected count of diverted flights for American Airlines by subtracting the given expected counts from the total number of American Airlines flights.

| Airline | Expected On-Time Flights | Expected Delayed Flights | Expected Canceled Flights | Expected Diverted Flights | Total Flights |
|---|---|---|---|---|---|
| American Airlines | 42,348.4 | | | | 47,648 |

In general, if the two-way table for a homogeneity test has $C$ columns and $R$ rows, then there are $(R-1)(C-1)$ degrees of freedom. Notice that in our example, there are three rows (representing airlines) and four columns (representing flight status), which give $(3-1)(4-1) = (2)(3) = 6$ degrees of freedom, as we saw previously.
As with the chi-square goodness of fit test, we compute the value of $\frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$ for each cell in the table, and then we obtain the value of the test statistic

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

As in the previous in-class activity, we will use technology to compute the chi-square value and the associated P-value. Feel free to check out the DCMP Chi-Square Test tool at https://dcmathpathways.shinyapps.io/ChiSquaredTest/ before class.

Instructions for using the DCMP Chi-Square Test tool:
• Click the Test of Independence/Homogeneity tab at the top.
• In the drop-down menu under “Enter Data,” select “Contingency Table.”
• Under “Row Variable Name,” enter “Airlines.” The rows represent the different populations you are testing for homogeneity.
• Under “Category Labels,” enter “American, Delta, Southwest” (separated by commas).
• Under “Column Variable Name,” enter “Flight Status.” The columns are the values of the categorical variable whose distributions we are considering.
• Under “Category Labels,” enter “On-Time, Delayed, Canceled, Diverted” (separated by commas).
• Enter the values given at the beginning of this preview activity into the appropriate cells of the two-way table (counts).
• Explore checking the boxes below the table to see what kinds of information the data analysis can give you.
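For readers curious about the arithmetic behind the tool, here is a minimal Python sketch of the test statistic and degrees of freedom for the airline table. It is not the DCMP tool's implementation, only a plain-Python illustration of the formulas in this preview; the variable names are assumptions, and the P-value is left to the technology.

```python
# Observed two-way table: rows are airlines, columns are flight statuses
# (on-time, delayed, canceled, diverted) for March 2021.
observed = [
    [42_600, 4_657,   296,  95],   # American
    [51_620, 4_030,   150,  56],   # Delta
    [69_384, 9_280, 1_782, 128],   # Southwest
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell under the null hypothesis:
# (row total) x (column total) / (grand total).
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Chi-square statistic: sum of (Observed - Expected)^2 / Expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (R - 1)(C - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi-square = {chi_square:.2f}, degrees of freedom = {df}")
```

The same statistic should appear in the tool's output once the counts are entered, up to rounding.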