16D InClass

Question 1

1) Choose one of the following options and explain how you made your choice. Data on incomes are usually…

a) Skewed right

b) Skewed left

c) Symmetric

Figurines of golfers on tall stacks of coins next to shorter stacks of coins with figurines of construction workers on them.

The data for this in-class activity are from the Gapminder site on global development, which shows 2018 data on countries’ income per person^[1] (in standardized dollar amounts) and life expectancy.^[2] Each data point represents a different nation.

Question 2

2) The following is a dotplot showing the income per person among all the nations in the dataset.

A dot plot labeled “Income per Person (in 2011, international dollars),” with the x-axis labeled in increments of 10000. The dots are heavily concentrated at the low end of the axis, with a long tail extending to the right. Albania is marked at approximately 12000, the USA is marked at approximately 56000, Singapore is marked at approximately 91000, and Qatar is marked at approximately 113000.

Part A: Do any of the labeled countries have higher or lower income values than you expected? Explain.

Part B: Describe the shape of the distribution.

Question 3

3) The following is a scatterplot (created using the DCMP Linear Regression tool) that visualizes each nation’s life expectancy as predicted by its income per person. A linear model is fit to the data, with the least-squares regression equation shown. The fit has an R² value of 46.6%. The following scatterplot is a residual plot from this fit.

A scatterplot titled “Life Expectancy & Nations’ Incomes.” It is labeled “Income Per Person (international, 2011 dollars)” on the x-axis and “Life Expectancy” on the y-axis. There is a line whose equation is given as “Regression Line: y = 68.5 + 0.000245x.”

A residual plot labeled “Income Per Person (international, 2011 dollars)” on the x-axis and “Residual” on the y-axis. There is a horizontal line at y = 0. There are more points at lower x-values, and points at both the lowest and the highest x-values are more likely to fall below the horizontal line.

Part A: Is the linear model a good fit for these data? Would you trust predictions or inferences made from a linear model? Justify your answer using the R² value, residual plot, and scatterplot.

Part B: Why do you think the data take this shape in the scatterplot? Explain while referencing the dotplot in Question 2.
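For readers who want to reproduce this kind of fit outside the DCMP tool, the following is a minimal sketch of how a least-squares line, its residuals, and its R² value can be computed. The data are hypothetical synthetic values generated for illustration, not the Gapminder values, and the helper name `fit_line` is an assumption of this sketch.

```python
import numpy as np

def fit_line(x, y):
    """Fit y = a + b*x by least squares; return a, b, residuals, and R^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b, a = np.polyfit(x, y, deg=1)          # coefficients come back highest power first
    fitted = a + b * x
    residuals = y - fitted                  # observed minus predicted
    ss_res = np.sum(residuals ** 2)         # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation in y
    r_squared = 1.0 - ss_res / ss_tot
    return a, b, residuals, r_squared

# Hypothetical synthetic data (NOT the worksheet's dataset).
rng = np.random.default_rng(0)
income = rng.lognormal(mean=9.0, sigma=1.0, size=150)
life_exp = 30.0 + 10.0 * np.log10(income) + rng.normal(0.0, 4.0, size=150)

a, b, residuals, r_squared = fit_line(income, life_exp)
print(f"y = {a:.1f} + {b:.6f}x, R^2 = {r_squared:.1%}")
```

Plotting `residuals` against `income` gives a residual plot of the kind described above.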

Question 4

4) To handle the right skew in the income data, let’s try different transformations.

Part A: The following is a dotplot of incomes per person after taking the square root of all the values. Is the right skew reduced in severity? Is it still present? Explain.

A dot plot titled “Square Root of Incomes per Person.” There are more dots on the left than on the right, although it is less skewed than the data were in the previous dot plot.

Part B: The following is a dotplot of incomes per person after taking the base 10 logarithms of all the values. Is the right skew reduced in severity? Is it still present? Explain.

A dot plot titled “Log base ten of Nations’ Incomes.” The dots are spread relatively evenly across the dot plot.

Part C: Which transformation makes the data more symmetric? Explain.
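Besides judging the dotplots by eye, one way to put a number on skew is the moment-based sample skewness (the mean of the cubed z-scores), which is near 0 for a symmetric sample and positive for a right-skewed one. The sketch below applies it to hypothetical synthetic incomes, not the Gapminder values, so results on the real data may differ. The function name `sample_skewness` is an assumption of this sketch.

```python
import numpy as np

def sample_skewness(values):
    """Mean of cubed z-scores: about 0 if symmetric, positive if right-skewed."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 3))

# Hypothetical right-skewed "incomes" (NOT the worksheet's dataset).
rng = np.random.default_rng(1)
incomes = rng.lognormal(mean=9.0, sigma=1.0, size=2000)

skew_raw = sample_skewness(incomes)
skew_sqrt = sample_skewness(np.sqrt(incomes))     # square-root transformation
skew_log10 = sample_skewness(np.log10(incomes))   # base-10 log transformation

print(f"raw:   {skew_raw:.2f}")
print(f"sqrt:  {skew_sqrt:.2f}")
print(f"log10: {skew_log10:.2f}")
```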

Question 5

5) The following is a scatterplot after taking the base 10 log transformations of the income values. A linear model is fit to the data (using log income in place of income on the x-axis), with the least-squares regression equation shown. The fit has an R² value of 71.1%. The following scatterplot is a residual plot from this fit.

Dotplots created at stapplet.com.

A scatterplot titled “Life Expectancy & Nations’ Incomes.” It is labeled “Base-10 Logarithm of Income Per Person” on the x-axis and “Life Expectancy” on the y-axis. There is a line whose equation is given as “Regression Line: y = 27.7 + 11.3x.”

A residual plot labeled “Base-10 Logarithm of Income Per Person” on the x-axis and “Residual” on the y-axis. There is a horizontal line at y = 0. There does not appear to be a pattern to the points.

Part A: Did the transformation result in data for which a linear model provides a better fit for the data? Explain your answer using the R² value, residual plot, and scatterplot.

Part B: Using the previous scatterplot and the linear regression model, a statistician claims that, “A nation with an income of about $4 per person has a predicted national life expectancy of about 73 years.” Explain what’s wrong with their statement and correct it.

Part C: Imagine that we instead looked at the national budget balance per person in every nation. In some nations, the budget balance is negative (more debts than revenue). In such a case, we can no longer use the log transformation. Explain.
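Because the model in this question is fit with log income on the x-axis, evaluating it for a given nation takes two steps: transform the income, then apply the line. The following is a minimal sketch using the intercept and slope from the regression line shown above. The function name and the example incomes are hypothetical choices made for illustration.

```python
import math

INTERCEPT = 27.7   # from the regression line shown in Question 5
SLOPE = 11.3       # change in predicted life expectancy per unit of log10(income)

def predicted_life_expectancy(income_dollars):
    """Apply y = 27.7 + 11.3x, where x is the base-10 log of income per person."""
    x = math.log10(income_dollars)   # step 1: transform the income
    return INTERCEPT + SLOPE * x     # step 2: apply the fitted line

# Hypothetical example incomes, chosen only to illustrate the calculation.
for income in (1_000, 100_000):
    print(income, round(predicted_life_expectancy(income), 1))
```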

Question 6

6) In addition to transformations that make right-skewed distributions more symmetric, there are transformations that can make left-skewed distributions more symmetric. High school GPAs tend to be distributed in a left-skewed shape; most students get A’s, B’s, and C’s in their classes, while fewer students consistently get lower grades (left tail). The following is a dataset^[3] of 1,000 high school student GPAs visualized as a histogram:

A histogram labeled “High School GPAs of 1000 Students” on the x-axis and “Count” on the y-axis. For 1.8-2, the count is approximately 5. For 2-2.2, the count is approximately 25. For 2.2-2.4, the count is approximately 45. For 2.4-2.6, the count is approximately 75. For 2.6-2.8, the count is approximately 100. For 2.8-3, the count is approximately 90. For 3-3.2, the count is approximately 135. For 3.2-3.4, the count is approximately 115. For 3.4-3.6, the count is approximately 130. For 3.6-3.8, the count is approximately 110. For 3.8-4, the count is approximately 55. For 4-4.2, the count is approximately 120. For 4.4-4.6, the count is approximately 5.

Part A: To try to make this distribution more symmetric, let’s first try to square all the values. The following graph shows the squared GPAs. Is the left skew reduced in severity? Is it still present? Explain.

A histogram labeled “Squared High School GPAs of 1000 Students” on the x-axis and “Count” on the y-axis. For 2-4, the count is approximately 5. For 4-6, the count is approximately 80. For 6-8, the count is approximately 220. For 8-10, the count is approximately 170. For 10-12, the count is approximately 150. For 12-14, the count is approximately 160. For 14-16, the count is approximately 100. For 16-18, the count is approximately 120. For 20-22, the count is approximately 5.

Part B: The following graph shows the GPAs cubed. Is the left skew reduced in severity? Is it still present? Explain.

A histogram labeled “Cubed High School GPAs of 1000 Students” on the x-axis and “Count” on the y-axis. For 0-9, the count is approximately 20. For 9-18, the count is approximately 150. For 18-27, the count is approximately 170. For 27-36, the count is approximately 250. For 36-45, the count is approximately 140. For 45-54, the count is approximately 120. For 54-63, the count is approximately 60. For 63-72, the count is approximately 125. For 90-99, the count is approximately 5.

Part C: If your goal is to make the distribution symmetric, would you use the square or cube transformation of GPA values? Explain.

Part D: In some countries, GPA values can be negative. In such cases, a transformation that squares every data value wouldn’t be appropriate. Explain.
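To experiment with power transformations on your own, the sketch below prints rough text histograms of a set of values, their squares, and their cubes. The values are hypothetical left-skewed numbers on a 0 to 4 scale generated for illustration, not the GPA dataset used in this question, and the helper name `text_histogram` is an assumption of this sketch.

```python
import numpy as np

# Hypothetical left-skewed values on a 0-4 scale (NOT the worksheet's GPA data).
rng = np.random.default_rng(2)
gpas = 4.0 * rng.beta(a=5.0, b=2.0, size=1000)

def text_histogram(values, bins=8, width=40):
    """Return bin counts and one text line per bin, with bars scaled to `width`."""
    counts, edges = np.histogram(values, bins=bins)
    scale = width / counts.max()
    lines = []
    for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
        bar = "#" * int(round(count * scale))
        lines.append(f"{lo:6.2f}-{hi:6.2f} | {bar} {count}")
    return counts, lines

for label, power in (("GPA", 1), ("GPA squared", 2), ("GPA cubed", 3)):
    counts, lines = text_histogram(gpas ** power)
    print(label)
    print("\n".join(lines))
    print()
```

Comparing where the tallest bars sit in each printout gives a quick visual check of how each power changes the shape.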