15A Preview

Preparing for the next class
In the next in-class activity, you will need to be able to calculate expected counts for a single categorical variable, calculate a chi-square test statistic and explain the purpose behind each step in the calculation, and use technology to find the probability associated with a chi-square test statistic value.

In the next activity, we will test hypotheses about the frequency distribution of a categorical variable and consider hypotheses that compare the proportion of a population that falls into two or more possible categories. Recall that a frequency table organizes data by listing different possible categories and the number of times each category occurs in the dataset. Thus, the frequency table displays the frequency or relative frequency distribution of a categorical variable. In this activity, you will compare the observed distribution of a categorical variable to its hypothesized distribution. In doing so, you will perform a chi-square test for goodness of fit. The “chi-square” part of the name represents the underlying statistical distribution (the chi-square distribution), which you will explore in this assignment. The “goodness of fit” part of the name describes the main task at hand: comparing how well the observed values “fit” with the hypothesized values or baseline distribution.

Example: Italian Soccer

Italian youth soccer leagues create cohorts of children based on year of birth. For example, children born in 2005 only played other children born in that same year. If a child was born on December 31, 2004, they played with the 2004 cohort (rather than the younger 2005 cohort). So, children born earlier in the year (e.g., January or February) tend to be the eldest players in their leagues. Children born later in the year (e.g., November or December) tend to be the youngest players in their leagues.

Could this seemingly unimportant practice, grouping by year of birth, have an effect on players’ later soccer careers?
We’re going to explore this question using data¹ compiled by researchers on professional Italian soccer players.

Question 1

1) A calendar year can be defined in quarters (as shown in the following table). Birth rates differ between quarters of the year. Some quarters are longer (have more days), and different cultures have different preferences for times of birth. Researchers measured birth rates in Italy and found the following results:

Quarter                         Quarter 1        Quarter 2        Quarter 3        Quarter 4
                                (Jan. – March)   (April – June)   (July – Sept.)   (Oct. – Dec.)
Proportion of births in Italy   0.2248           0.2498           0.2574           0.2680

¹ Fumarco, L. & Rossi, G. (2018, August 8). The relative age effect on labour market outcomes - Evidence from Italian football. European Sport Management Quarterly, 18(4), 501–516. DOI: 10.1080/16184742.2018.1424225
Part A: Researchers collected data on 1,703 professional Italian soccer players. Assume their birthdates are distributed across the four quarters in the same way as the birthdates in the general Italian population presented in the previous table. If the birth quarters of the professional Italian soccer players are like those of the general Italian population, how many of the players would be born in each quarter? Find the expected counts for each quarter. Show your results and work in the following table.
Quarter                                               Quarter 1                Quarter 2        Quarter 3        Quarter 4
                                                      (Jan. – March)           (April – June)   (July – Sept.)   (Oct. – Dec.)
Expected number of soccer players born this quarter   0.2248 × 1703 = 382.83

Hint: Since the expected counts are theoretical values, they do not need to be whole numbers.

The observed birthdates of the actual sample of 1,703 professional Italian soccer players were classified by quarter, and the results are summarized in the following table.
Quarter                             Quarter 1        Quarter 2        Quarter 3        Quarter 4
                                    (Jan. – March)   (April – June)   (July – Sept.)   (Oct. – Dec.)
Observed number of soccer players   507              534              389              273
Part B: Compare the expected counts in Part A to the actual observed counts. Do the observed counts have a different pattern than the expected counts? If so, describe the difference.
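
The expected-count calculation in Part A can be cross-checked with a short script. This is a minimal sketch in Python, not part of the original activity; the variable names are illustrative, and the proportions and sample size are taken from the tables above.

```python
# Proportion of births in Italy by quarter (from the first table).
birth_proportions = {
    "Quarter 1": 0.2248,
    "Quarter 2": 0.2498,
    "Quarter 3": 0.2574,
    "Quarter 4": 0.2680,
}

# Number of professional Italian soccer players in the sample.
sample_size = 1703

# Expected count = hypothesized proportion x sample size.
expected_counts = {
    quarter: proportion * sample_size
    for quarter, proportion in birth_proportions.items()
}

for quarter, count in expected_counts.items():
    # Expected counts are theoretical, so they need not be whole numbers.
    print(f"{quarter}: {count:.2f}")
```

Because the four proportions add up to 1, the expected counts should add up to the sample size, which is a quick way to catch an arithmetic slip.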

Question 2

2) Let’s quantify the differences between the expected counts and the observed counts.

Quarter   Observed (O)   Expected (E)   (O – E)   $(O - E)^2$   $(O - E)^2 / E$
1         507            382.83
2         534            425.41
3         389            438.35
4         273            456.40

Part A: For each quarter, find the difference between the observed and the expected numbers of soccer players with birthdates in the quarter (Observed – Expected, or O – E). Fill in the (O – E) column in the previous table.
Part B: To get a sense of the overall difference between the observed and expected counts, we may be tempted to add up all of the differences we calculated. Why might this be counterproductive?
Part C: One way to get a full set of positive differences is to square the differences. This is similar to the way we calculate standard deviations. Write your answers in the $(O - E)^2$ column of the previous table.
Part D: To further understand the intuition behind the next step in the previous scenario, let’s briefly explore a simple example. Imagine you’re a senior in high school in these two scenarios:
i. You are buying donuts for three of your friends. When your friends reach into the bag, they find only two donuts; the store has shorted you by one donut.
ii. You are buying donuts for your entire school. There are 600 students at your school. The donut shop mistakenly gives you only 599 donuts in the boxes you bring to school.
In which situation is the donut shortage more severe? Explain.

Part E: To get a sense of the difference between our observed and expected values on the scale of what we expected, divide the values in the $(O - E)^2$ column by the expected counts. Put your results in the final $(O - E)^2 / E$ column in the previous table.
Part F: Finally, add up the final values in the $(O - E)^2 / E$ column and report your result.
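
The table-filling steps in Parts A, C, E, and F can be cross-checked with a short script. This is a minimal sketch in Python, not part of the original activity; the variable names are illustrative. It uses unrounded expected counts, so results may differ slightly in the last decimal place from hand calculations based on the rounded values in the table.

```python
# Observed counts and hypothesized proportions for Quarters 1 to 4.
observed = [507, 534, 389, 273]
proportions = [0.2248, 0.2498, 0.2574, 0.2680]

# The sample size is the total of the observed counts.
sample_size = sum(observed)
expected = [p * sample_size for p in proportions]

# Part A: raw differences (Observed - Expected).
differences = [o - e for o, e in zip(observed, expected)]

# Part C: squared differences, which cannot be negative.
squared = [d ** 2 for d in differences]

# Part E: squared differences scaled by the expected counts.
scaled = [s / e for s, e in zip(squared, expected)]

# Part F: the total of the scaled squared differences.
total = sum(scaled)

for quarter, (d, s, r) in enumerate(zip(differences, squared, scaled), start=1):
    print(f"Quarter {quarter}: O-E = {d:.2f}, (O-E)^2 = {s:.2f}, (O-E)^2/E = {r:.3f}")
print(f"Total: {total:.2f}")
```

Printing every intermediate column, rather than only the total, makes it easier to match each value against the corresponding cell of the table.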
You just calculated the value of the chi-square (pronounced “kai-square”) test statistic for this problem. It measures the overall distance between observed and expected counts. The greater the chi-square test statistic, the further the observed counts are from what we expected. Here is the formula for the chi-square test statistic:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

This formula shows what we did in Part F: we added up (the large sigma Σ represents summation) the $(O - E)^2 / E$ values for each quarter of the year (each category). It’s important to remember the intuition behind this formula: we get the differences, square them to get rid of the negative values, and then scale them by dividing the squared differences by the expected counts. In this way, we get a robust measure of the overall difference between the observed and expected counts for a categorical variable.

So, what does our chi-square test statistic value mean? To assess what our chi-square value tells us about the distance between the expected and observed counts, we’ll turn to the chi-square distribution. Go to the DCMP Chi-Square Distribution tool at https://dcmathpathways.shinyapps.io/ChisqDist/.

Question 3

3) Let’s assume that the distribution of birthdates among all professional Italian soccer players is the same as the distribution of birthdates in the general Italian population. Under this assumption, and if certain conditions are met (we will talk about these conditions in the next in-class activity), the statistic we calculated should follow a chi-square distribution with three degrees of freedom (df = 3). The degrees of freedom is one fewer than the number of possible categories for our categorical variable. In our case, we categorized birthdates into four quarters, so one fewer makes three degrees of freedom.
Part A: Set the degrees of freedom slider to three and observe the shape of the chi-square distribution. Does the distribution have any negative values? Does this make sense? Explain.
Hint: Think about how we dealt with negatives when calculating the chi-square test statistic.
Part B: In the chi-square distribution with three degrees of freedom, are small values (close to 0) more common or are large values (six or higher) more common if our assumption about the distribution of the categorical variable is true? Does this make sense? Explain.
Hint: Think about the assumption stated in the setup to Question 3.
Part C: Select the Find Probability tab at the top of the data analysis tool. Keep the same degrees of freedom (df = 3) and select the “Upper Tail” probability option. Enter your calculated chi-square statistic value where it says “Value of x.” Record the probability the data analysis tool shows, and comment on what this probability says about the assumption that the distribution of birthdates among all professional Italian soccer players is the same as the distribution of birthdates in the general Italian population.
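
The DCMP tool is the intended way to find this probability. As an independent cross-check, the sketch below uses only Python’s standard library and assumes the standard closed-form expression for the upper-tail probability of a chi-square distribution with exactly three degrees of freedom, P(X > x) = erfc(√(x/2)) + √(2x/π)·e^(−x/2). The function name `upper_tail_df3` is hypothetical, and the formula does not carry over to other degrees of freedom.

```python
import math


def upper_tail_df3(x: float) -> float:
    """Upper-tail probability P(X > x) for a chi-square distribution with df = 3.

    Assumption: relies on the closed form that holds only for df = 3.
    """
    if x <= 0:
        # A chi-square variable is never negative, so the whole
        # distribution lies above any non-positive value.
        return 1.0
    half = x / 2
    return math.erfc(math.sqrt(half)) + math.sqrt(2 * x / math.pi) * math.exp(-half)


# Sanity check against a commonly tabulated value: 7.815 is roughly the
# point with 5% of the df = 3 distribution above it.
print(upper_tail_df3(7.815))

# Replace 7.815 with your own chi-square statistic from Question 2, Part F.
```

The upper-tail probability shrinks as x grows, so a very large test statistic corresponds to a very small probability, which should agree with what the tool reports.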