Preparing for the next class
In the next in-class activity, you will need to be able to apply the steps for a hypothesis test to compare two population means and compare results of a hypothesis test to the corresponding confidence interval using appropriate notation.
The Maternal Smoking Study
Many studies link maternal smoking to lower birth weights, premature births, and miscarriages. In 1961 and 1962, researchers collected birth weights, dates, and gestational periods as part of the Child Health and Development Studies. Information about the babies’ parents (age, education, height, weight, and whether the mother smoked) was also recorded. The variables included in the dataset are:
gestation: Length of gestation (in days)
wt: Birth weight (in ounces)
age: Mother’s age in years at termination of pregnancy
smoke: Does mother smoke? (never; smokes now; until current pregnancy; once did, not now)
smoke_now: Does mother currently smoke?
“Yes” includes responses of “smokes now.”
“No” includes responses of “until current pregnancy,” “once did, not now,” and “never.”
Ten observations are presented in the following table. The full dataset is found in spreadsheet DCMP_STAT_13C_Maternal_Smoke.
| gestation | smoke | smoke_now | age | wt |
| 284 | never | No | 27 | 120 |
| 282 | never | No | 33 | 113 |
| 279 | now | Yes | 28 | 128 |
| 282 | now | Yes | 23 | 108 |
| 286 | until current pregnancy | No | 25 | 136 |
| 244 | never | No | 33 | 138 |
| 245 | never | No | 23 | 132 |
| 289 | never | No | 25 | 120 |
| 299 | now | Yes | 30 | 143 |
| 351 | once did, not now | No | 27 | 140 |
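The relationship between smoke and smoke_now can be sketched in a few lines of Python. This is only an illustration using the ten observations shown in the table, not the full spreadsheet. The function name `recode_smoke_now` is hypothetical, and the sketch assumes that the value “now” in the table corresponds to the response “smokes now.”

```python
# The ten observations from the table above: (gestation, smoke, age, wt).
# This is NOT the full dataset, only the rows displayed in the table.
rows = [
    (284, "never", 27, 120),
    (282, "never", 33, 113),
    (279, "now", 28, 128),
    (282, "now", 23, 108),
    (286, "until current pregnancy", 25, 136),
    (244, "never", 33, 138),
    (245, "never", 23, 132),
    (289, "never", 25, 120),
    (299, "now", 30, 143),
    (351, "once did, not now", 27, 140),
]


def recode_smoke_now(smoke):
    """Hypothetical helper: "Yes" only for mothers who smoke now, "No" otherwise.

    Assumes the table value "now" stands for the response "smokes now".
    """
    return "Yes" if smoke == "now" else "No"


# Count how many of the ten mothers fall into each smoke_now group.
counts = {"Yes": 0, "No": 0}
for gestation, smoke, age, wt in rows:
    counts[recode_smoke_now(smoke)] += 1

print(counts)
```

Every response other than “now” is placed in the “No” group, which matches the smoke_now column in the table.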
Question 1
1) Suppose we wanted to study the difference in birth weight of babies born to mothers who smoked during pregnancy (smoke_now = Yes) and mothers who did not smoke during pregnancy (smoke_now = No).
a) Clearly define the two populations of interest.
Recall that when we are interested in estimating a difference in population means, we usually start with data from a sample from each of the populations of interest.
There are two different strategies for selecting the two samples. One strategy is to select a sample from one population and then independently select a sample from the second population. Using this strategy results in two samples where the individuals selected for the first sample do not influence the individuals selected for the second sample.
This would be the case if you take a random sample from each population. Samples selected in this way are said to be independent samples.
b) Can the samples defined in this study be considered independent? Explain.
Question 2
2) Using the DCMP Describing and Exploring Quantitative Variables – Several Groups tool at https://dcmathpathways.shinyapps.io/EDA_quantitative/, describe the mean, standard deviation, and sample sizes for each group defined in Question 1.
Complete the following table, which represents notation that we can use to help us distinguish the sample mean, standard deviation, and sample size for the subjects in Group 1 vs. the subjects in Group 2.
| | Group 1: smoke_now = Yes (mothers who smoked during pregnancy) | Group 2: smoke_now = No (mothers who did not smoke during pregnancy) |
| Sample Mean | [latex]\bar{x}_{1}=[/latex] _____ | [latex]\bar{x}_{2}=[/latex] _____ |
| Sample Standard Deviation | [latex]s_{1}=[/latex] ______ | [latex]s_{2}=[/latex] ______ |
| Sample Size | [latex]n_{1}=[/latex] _____ | [latex]n_{2}=[/latex] _____ |
Hint: Use spreadsheet DCMP_STAT_13C_Maternal_Smoke!
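If you would like to see how these summary statistics are computed, here is a minimal sketch in Python. It uses only the ten observations shown in the table at the start of this assignment, so its results are for illustration and will not match the values you get from the full spreadsheet. The variable names (`weights`, `summary`) are assumptions made for this example.

```python
import statistics

# Birth weights (ounces) from the ten-row table only, grouped by smoke_now.
# These ten rows are an illustration; the full dataset will give other values.
weights = {
    "Yes": [128, 108, 143],                      # Group 1: smoke_now = Yes
    "No": [120, 113, 136, 138, 132, 120, 140],   # Group 2: smoke_now = No
}

summary = {}
for group, values in weights.items():
    summary[group] = {
        "mean": statistics.mean(values),   # sample mean, x-bar
        "sd": statistics.stdev(values),    # sample standard deviation, s (divides by n - 1)
        "n": len(values),                  # sample size, n
    }

for group, stats in summary.items():
    print(group, round(stats["mean"], 2), round(stats["sd"], 2), stats["n"])
```

Note that `statistics.stdev` computes the sample standard deviation (dividing by n − 1), which is the quantity denoted by s in the table.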
Question 3
3) One way to compare the means of two groups is by looking at the difference of the means.
a) Write an expression that represents the difference between the two sample means using the notation in the table you completed in Question 2.
b) Write an expression to represent the difference between the population means.
c) What would be the value of the difference between the population means if there was no difference between the groups?
Hint: If there was no difference between the population means, [latex]\mu_{1}=\mu_{2}[/latex]. Think about the value you would get if you subtracted [latex]\mu_{2}[/latex] from [latex]\mu_{1}[/latex].
d) Describe, in the context of the study, what it means if there was no difference between the two groups.
When we are interested in estimating a difference in population means using data from independent samples, we will use a two-sample t confidence interval (In-Class Activity 12.D) or a two-sample t-test.
The conditions that you need to check for the two-sample t-test are the same as those for a two-sample t confidence interval, presented in Preview Assignment 12.D:
- The samples are independent.
- Each sample is a random sample from the corresponding population of interest or it is reasonable to regard the sample as random. It is reasonable to regard the sample as a random sample if it was selected in a way that should result in a sample that is representative of the population. If the data are from an experiment, we just need to check that there was random assignment to experimental groups—this substitutes for the random sample condition and also results in independent samples.
- For each population, the distribution of the variable that was measured is approximately normal, or the sample size for the sample from that population is large. Usually, a sample of size 30 or more is considered to be “large.” If a sample size is less than 30, you should look at a plot of the data from that sample (a dotplot, a boxplot, or, if the sample size isn’t really small, a histogram) to make sure that the distribution looks approximately symmetric and that there are no outliers.
Question 4
4) Does the maternal smoking study satisfy the conditions for a two-sample t-test?
Question 5
5) We can use a hypothesis test to determine if the observed difference in sample means is consistent with a hypothesized difference in population means.
To do this, we use what we know about the sampling distribution of [latex]\bar{x}_{1}-\bar{x}_{2}[/latex] and, in particular, its estimated standard deviation (the standard error). Recall from In-Class Activity 12.D that the difference in the sample means, [latex]\bar{x}_{1}-\bar{x}_{2}[/latex], has an approximately normal distribution, centered at the difference of the population means, [latex]\mu_{1}-\mu_{2}[/latex]. The standard deviation of this sampling distribution is given by the following formula:
[latex]\sqrt{\frac{\sigma^{2}_{1}}{n_{1}}+\frac{\sigma^{2}_{2}}{n_{2}}}[/latex]
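One way to build intuition for this formula is a small simulation. The sketch below draws many pairs of independent samples from two normal populations and compares the standard deviation of the simulated differences in sample means with the value from the formula. All population values and sample sizes here are hypothetical and chosen only for illustration; they are not estimates from the maternal smoking study.

```python
import math
import random
import statistics

# Hypothetical population parameters and sample sizes (not from the study).
MU_1, SIGMA_1, N_1 = 114.0, 18.0, 10
MU_2, SIGMA_2, N_2 = 123.0, 17.0, 15


def theoretical_sd(sigma_1, n_1, sigma_2, n_2):
    """Standard deviation of x-bar_1 - x-bar_2 for independent samples."""
    return math.sqrt(sigma_1**2 / n_1 + sigma_2**2 / n_2)


def simulate_differences(reps, seed=0):
    """Draw `reps` pairs of independent samples; return each difference in sample means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    diffs = []
    for _ in range(reps):
        sample_1 = [rng.gauss(MU_1, SIGMA_1) for _ in range(N_1)]
        sample_2 = [rng.gauss(MU_2, SIGMA_2) for _ in range(N_2)]
        diffs.append(statistics.fmean(sample_1) - statistics.fmean(sample_2))
    return diffs


diffs = simulate_differences(reps=5000)
simulated_center = statistics.fmean(diffs)
simulated_sd = statistics.stdev(diffs)
expected_sd = theoretical_sd(SIGMA_1, N_1, SIGMA_2, N_2)

print("center of simulated differences:", round(simulated_center, 2))
print("simulated sd:", round(simulated_sd, 2), "formula sd:", round(expected_sd, 2))
```

With enough repetitions, the simulated center should be close to the difference of the hypothetical population means, and the simulated standard deviation should be close to the value from the formula.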
In practice, we will have to estimate the standard deviation because it depends on the unknown population standard deviations. Replacing [latex]\sigma_{1}[/latex] and [latex]\sigma_{2}[/latex] by the sample standard deviations [latex]s_{1}[/latex] and [latex]s_{2}[/latex], we get the standard error of the difference:
[latex]\text{standard error of }\bar{x}_{1}-\bar{x}_{2}=\sqrt{\frac{s^{2}_{1}}{n_{1}}+\frac{s^{2}_{2}}{n_{2}}}[/latex]
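The standard error formula translates directly into a short function. The sketch below is one possible way to write it in Python; the function name `standard_error` and the example values are hypothetical and are not the statistics from the maternal smoking data.

```python
import math


def standard_error(s1, n1, s2, n2):
    """Standard error of x-bar_1 - x-bar_2 for two independent samples.

    s1, s2: sample standard deviations; n1, n2: sample sizes.
    """
    return math.sqrt(s1**2 / n1 + s2**2 / n2)


# Hypothetical values chosen so the arithmetic is easy to follow by hand:
# 3^2 / 9 = 1 and 4^2 / 16 = 1, so the standard error is sqrt(1 + 1).
se = standard_error(s1=3.0, n1=9, s2=4.0, n2=16)
print(round(se, 2))
```

Notice that each sample standard deviation is squared and divided by its own sample size before the two terms are added and the square root is taken.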
a) Calculate the estimated difference in the means, [latex]\bar{x}_{1}-\bar{x}_{2}[/latex], using the sample means from Question 2.
b) Calculate the standard error of the difference in sample means using the statistics from Question 2. Round your answer to the nearest hundredth.
Hint: Use the formula [latex]SE=\sqrt{\frac{s^{2}_{1}}{n_{1}}+\frac{s^{2}_{2}}{n_{2}}}[/latex]
c) Interpret the meaning of this value.
Question 6
6) Use the DCMP Describing and Exploring Quantitative Variables – Several Groups tool at https://dcmathpathways.shinyapps.io/EDA_quantitative/ to visualize the difference in means between the two groups defined in Question 2 using histograms.
Question 7
7) Briefly describe the difference (or lack thereof) between the two groups. Do you think there is a significant difference between the birth weights of babies born to mothers who smoked during pregnancy versus those who did not? Be prepared to share your conclusions in class.
This analysis uses descriptive statistics only. How can we make an inference about the difference when the population refers to all pregnant women? We will answer this question in the next in-class activity.