## Repeated Measures Design

Repeated measures analysis of variance (rANOVA) is one of the most commonly used statistical approaches to repeated measures designs.

### Learning Objectives

Evaluate the significance of repeated measures design given its advantages and disadvantages

### Key Takeaways

#### Key Points

• Repeated measures design, also known as within-subjects design, uses the same subjects with every condition of the research, including the control.
• Repeated measures design can be used to conduct an experiment when few participants are available, conduct an experiment more efficiently, or to study changes in participants’ behavior over time.
• The primary strengths of the repeated measures design are that it makes an experiment more efficient and helps keep variability low.
• A disadvantage of the repeated measures design is that it may not be possible for each participant to be in all conditions of the experiment (due to time constraints, location of experiment, etc.).
• One of the greatest advantages to using the rANOVA, as is the case with repeated measures designs in general, is that you are able to partition out variability due to individual differences.
• The rANOVA is still highly vulnerable to effects from missing values, imputation, unequal time points between subjects, and violations of sphericity, all of which can lead to sampling bias and inflated rates of type I error.

#### Key Terms

• longitudinal study: A correlational research study that involves repeated observations of the same variables over long periods of time.
• sphericity: A statistical assumption requiring that the variances for each set of difference scores are equal.
• order effect: An effect that occurs when a participant’s response in a later condition is influenced by having already performed the task in an earlier condition (for example, through practice or fatigue).

Repeated measures design (also known as “within-subjects design”) uses the same subjects with every condition of the research, including the control. For instance, repeated measures are collected in a longitudinal study in which change over time is assessed. Other studies compare the same measure under two or more different conditions. For instance, to test the effects of caffeine on cognitive function, a subject’s math ability might be tested once after they consume a caffeinated cup of coffee, and again when they consume a placebo.

Repeated Measures Design: An example of a test using a repeated measures design to test the effects of caffeine on cognitive function.

Repeated measures design can be used to:

• Conduct an experiment when few participants are available: The repeated measures design reduces the variance of estimates of treatment-effects, allowing statistical inference to be made with fewer subjects.
• Conduct an experiment more efficiently: Repeated measures designs allow many experiments to be completed more quickly, as only a few groups need to be trained to complete an entire experiment.
• Study changes in participants’ behavior over time: Repeated measures designs allow researchers to monitor how the participants change over the passage of time, both in the case of long-term situations like longitudinal studies and in the much shorter-term case of order effects.

The primary strengths of the repeated measures design are that it makes an experiment more efficient and helps keep variability low. This helps keep the validity of the results higher, while still allowing for smaller than usual subject groups.

A disadvantage of the repeated measures design is that it may not be possible for each participant to be in all conditions of the experiment (due to time constraints, location of experiment, etc.). There are also several threats to the internal validity of this design: a regression threat (when subjects are tested several times, their scores tend to regress towards the mean), a maturation threat (subjects may change during the course of the experiment), and a history threat (events outside the experiment may change the response of subjects between the repeated measures).

### Repeated Measures ANOVA

Repeated measures analysis of variance (rANOVA) is one of the most commonly used statistical approaches to repeated measures designs.

### Partitioning of Error

One of the greatest advantages to using the rANOVA, as is the case with repeated measures designs in general, is that you are able to partition out variability due to individual differences. Consider the general structure of the $\text{F}$-statistic:

$\text{F} = \dfrac{\text{MS}_{\text{treatment}}}{\text{MS}_{\text{error}}} = \dfrac{\text{SS}_{\text{treatment}} / \text{df}_{\text{treatment}}}{\text{SS}_{\text{error}} / \text{df}_{\text{error}}}$

In a between-subjects design, the variance due to individual differences cannot be separated out and remains combined with the treatment and error terms (here $\text{n}$ is the total number of observations):

$\text{SS}_{\text{total}} = \text{SS}_{\text{treatment}} + \text{SS}_{\text{error}}$
$\text{df}_{\text{total}} = \text{n}-1$

In a repeated measures design it is possible to account for these differences, and partition them out from the treatment and error terms. In such a case, the variability can be broken down into between-treatments variability (or within-subjects effects, excluding individual differences) and within-treatments variability. The within-treatments variability can be further partitioned into between-subjects variability (individual differences) and error (excluding the individual differences).

$\text{SS}_{\text{total}} = \text{SS}_{\text{treatment}} + \text{SS}_{\text{subjects}} + \text{SS}_{\text{error}}$

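\begin{align} \text{df}_{\text{total}} &= \text{df}_{\text{treatment}} + \text{df}_{\text{between subjects}} + \text{df}_{\text{error}}\\ &= (\text{k}-1) + (\text{n}-1) + (\text{n}-1)(\text{k}-1) = \text{n}\text{k}-1 \end{align}

Here $\text{n}$ denotes the number of subjects (not the total number of observations) and $\text{k}$ the number of treatment conditions, so there are $\text{n}\text{k}$ observations in total.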

In reference to the general structure of the $\text{F}$-statistic, partitioning out the between-subjects variability shrinks the error sum of squares, which tends to produce a smaller $\text{MS}_{\text{error}}$ and therefore a larger $\text{F}$-value. However, partitioning variability also removes degrees of freedom from the error term of the $\text{F}$-test, so the between-subjects variability must be large enough to offset this loss. If between-subjects variability is small, this process may actually reduce the $\text{F}$-value.
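The partition described above can be sketched in a few lines of plain Python. The scores below are hypothetical (rows are subjects, columns are treatment conditions), and the variable names are illustrative only. The sketch computes each sum of squares directly, then compares the repeated measures $\text{F}$ with the $\text{F}$ that results when subject variability is left in the error term.

```python
# Hypothetical data: each row is one subject, each column one condition.
scores = [
    [45, 50, 55],
    [42, 42, 45],
    [36, 41, 43],
    [39, 35, 40],
    [51, 55, 59],
    [44, 49, 56],
]

n = len(scores)      # number of subjects
k = len(scores[0])   # number of treatment conditions

grand_mean = sum(sum(row) for row in scores) / (n * k)
cond_means = [sum(row[j] for row in scores) / n for j in range(k)]
subj_means = [sum(row) / k for row in scores]

# Sums of squares, each computed directly from its definition.
ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
ss_treatment = n * sum((m - grand_mean) ** 2 for m in cond_means)
ss_subjects = k * sum((m - grand_mean) ** 2 for m in subj_means)
ss_error = sum(
    (scores[i][j] - cond_means[j] - subj_means[i] + grand_mean) ** 2
    for i in range(n)
    for j in range(k)
)

# Degrees of freedom for the repeated measures partition.
df_treatment = k - 1
df_subjects = n - 1
df_error = (n - 1) * (k - 1)

# F with individual differences partitioned out of the error term.
f_repeated = (ss_treatment / df_treatment) / (ss_error / df_error)

# F if the same numbers are treated as independent groups, so that
# subject variability stays inside the error term.
ss_within = ss_subjects + ss_error
df_within = k * (n - 1)
f_between = (ss_treatment / df_treatment) / (ss_within / df_within)

print(f"SS total={ss_total:.2f}, treatment={ss_treatment:.2f}, "
      f"subjects={ss_subjects:.2f}, error={ss_error:.2f}")
print(f"F (repeated measures) = {f_repeated:.2f}")
print(f"F (subjects left in error) = {f_between:.2f}")
```

In this made-up data set the subjects differ considerably from one another, so removing their variability from the error term raises the $\text{F}$-value substantially. With more homogeneous subjects, the gain could be small or even negative because of the lost error degrees of freedom.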

### Assumptions

As with all statistical analyses, there are a number of assumptions that should be met to justify the use of this test. Violations of these assumptions can moderately to severely affect results, and often lead to an inflation of type I error. Univariate assumptions include:

1. Normality: For each level of the within-subjects factor, the dependent variable must have a normal distribution.
2. Sphericity: Difference scores computed between two levels of a within-subjects factor must have the same variance for the comparison of any two levels.
3. Randomness: Cases should be derived from a random sample, and the scores between participants should be independent from each other.

The rANOVA also requires that certain multivariate assumptions are met because a multivariate test is conducted on difference scores. These include:

1. Multivariate normality: The difference scores are multivariately normally distributed in the population.
2. Randomness: Individual cases should be derived from a random sample, and the difference scores for each participant are independent from those of another participant.
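As an informal illustration of the sphericity assumption listed above, the following sketch computes the variance of the difference scores for every pair of levels. The data and names are hypothetical, and eyeballing these variances is only a descriptive check, not a formal test of sphericity.

```python
from itertools import combinations
from statistics import variance

# Hypothetical data: rows are subjects, columns are levels of the
# within-subjects factor.
scores = [
    [45, 50, 55],
    [42, 42, 45],
    [36, 41, 43],
    [39, 35, 40],
    [51, 55, 59],
    [44, 49, 56],
]

k = len(scores[0])

# Sample variance of the difference scores for each pair of levels.
diff_variances = {}
for a, b in combinations(range(k), 2):
    diffs = [row[a] - row[b] for row in scores]
    diff_variances[(a, b)] = variance(diffs)

for pair, var in diff_variances.items():
    print(f"levels {pair}: variance of differences = {var:.2f}")

# Under sphericity these variances are equal in the population, so a
# ratio far above 1 in the sample is a warning sign.
ratio = max(diff_variances.values()) / min(diff_variances.values())
print(f"largest / smallest variance = {ratio:.2f}")
```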

### $\text{F}$-Test

Depending on the number of levels of the within-subjects factor and on which assumptions are violated, it is necessary to select the most appropriate of three tests:

1. Standard Univariate ANOVA $\text{F}$-test: This test is commonly used when there are only two levels of the within-subjects factor. This test is not recommended for use when there are more than 2 levels of the within-subjects factor because the assumption of sphericity is commonly violated in such cases.
2. Alternative Univariate test: These tests account for violations of the assumption of sphericity, and can be used when the within-subjects factor exceeds 2 levels. The $\text{F}$ statistic will be the same as in the Standard Univariate ANOVA $\text{F}$-test, but is associated with a more accurate $\text{p}$-value. This correction is done by adjusting the $\text{df}$ downward for determining the critical $\text{F}$ value.
3. Multivariate Test: This test does not assume sphericity, but is also highly conservative.
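The text above does not name a particular alternative univariate correction. One widely used example is the Greenhouse-Geisser correction, which estimates a factor $\varepsilon$ between $1/(\text{k}-1)$ and $1$ and multiplies both degrees of freedom by it. The following is a minimal sketch under that assumption, using hypothetical data and an illustrative function name, not a substitute for a statistics package.

```python
def greenhouse_geisser_epsilon(scores):
    """Estimate the Greenhouse-Geisser epsilon.

    scores: list of rows, one per subject, one column per level.
    """
    n = len(scores)
    k = len(scores[0])
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sample covariance matrix of the k levels.
    cov = [
        [
            sum((row[i] - col_means[i]) * (row[j] - col_means[j])
                for row in scores) / (n - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]

    # Double-center the covariance matrix (remove row and column means).
    row_means = [sum(cov[i]) / k for i in range(k)]
    grand = sum(row_means) / k
    centered = [
        [cov[i][j] - row_means[i] - row_means[j] + grand for j in range(k)]
        for i in range(k)
    ]

    trace = sum(centered[i][i] for i in range(k))
    sum_sq = sum(centered[i][j] ** 2 for i in range(k) for j in range(k))
    return trace ** 2 / ((k - 1) * sum_sq)


# Hypothetical data: rows are subjects, columns are levels.
scores = [
    [45, 50, 55],
    [42, 42, 45],
    [36, 41, 43],
    [39, 35, 40],
    [51, 55, 59],
    [44, 49, 56],
]
n, k = len(scores), len(scores[0])

eps = greenhouse_geisser_epsilon(scores)

# The F statistic is unchanged; only the degrees of freedom used to
# look up the critical value are scaled downward.
df_treatment_adj = eps * (k - 1)
df_error_adj = eps * (k - 1) * (n - 1)
print(f"epsilon = {eps:.3f}")
print(f"adjusted df = ({df_treatment_adj:.2f}, {df_error_adj:.2f})")
```

An $\varepsilon$ of 1 corresponds to no adjustment, while smaller values indicate a stronger departure from sphericity and a more conservative test.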

While there are many advantages to repeated measures design, the repeated measures ANOVA is not always the best statistical analysis to conduct. The rANOVA is still highly vulnerable to effects from missing values, imputation, unequal time points between subjects, and violations of sphericity. These issues can result in sampling bias and inflated rates of type I error.

## Further Discussion of ANOVA

Due to the iterative nature of experimentation, preparatory and follow-up analyses are often necessary in ANOVA.

### Learning Objectives

Contrast preparatory and follow-up analysis in constructing an experiment

### Key Takeaways

#### Key Points

• Experimentation is often sequential, with early experiments often being designed to provide a mean-unbiased estimate of treatment effects and of experimental error, and later experiments often being designed to test a hypothesis that a treatment effect has an important magnitude.
• Power analysis is often applied in the context of ANOVA in order to assess the probability of successfully rejecting the null hypothesis if we assume a certain ANOVA design, effect size in the population, sample size and significance level.
• Effect size estimates facilitate the comparison of findings in studies and across disciplines.
• A statistically significant effect in ANOVA is often followed up with one or more different follow-up tests, in order to assess which groups are different from which other groups or to test various other focused hypotheses.

#### Key Terms

• iterative: Of a procedure that involves repetition of steps (iteration) to achieve the desired outcome.
• homoscedasticity: A property of a set of random variables where each variable has the same finite variance.

Some analysis is required in support of the design of the experiment, while other analysis is performed after changes in the factors are formally found to produce statistically significant changes in the responses. Because experimentation is iterative, the results of one experiment alter plans for following experiments.

### The Number of Experimental Units

In the design of an experiment, the number of experimental units is planned to satisfy the goals of the experiment. Most often, the number of experimental units is chosen so that the experiment is within budget and has adequate power, among other goals.

Experimentation is often sequential, with early experiments often being designed to provide a mean-unbiased estimate of treatment effects and of experimental error, and later experiments often being designed to test a hypothesis that a treatment effect has an important magnitude.

Less formal methods for selecting the number of experimental units include graphical methods based on limiting the probability of false negative errors, graphical methods based on an expected variation increase (above the residuals), and methods based on achieving a desired confidence interval.

### Power Analysis

Power analysis is often applied in the context of ANOVA in order to assess the probability of successfully rejecting the null hypothesis if we assume a certain ANOVA design, effect size in the population, sample size and significance level. Power analysis can assist in study design by determining what sample size would be required in order to have a reasonable chance of rejecting the null hypothesis when the alternative hypothesis is true.
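As a rough sketch of such a calculation, the function below estimates power for a one-way between-subjects ANOVA with equal group sizes. It assumes SciPy is available, expresses the effect size as Cohen's $\text{f}$, and uses the noncentral $\text{F}$ distribution with noncentrality parameter $\text{f}^2 \cdot \text{N}$. The function name and the example inputs are illustrative, not part of the original text.

```python
from scipy.stats import f as f_dist, ncf


def anova_power(effect_f, n_total, k_groups, alpha=0.05):
    """Approximate power of a one-way ANOVA F-test.

    effect_f : Cohen's f in the population (assumed)
    n_total  : total number of observations across all groups
    k_groups : number of groups
    alpha    : significance level
    """
    df_treatment = k_groups - 1
    df_error = n_total - k_groups
    # Critical value of the central F distribution under the null.
    f_crit = f_dist.ppf(1 - alpha, df_treatment, df_error)
    # Noncentrality parameter under the alternative hypothesis.
    noncentrality = effect_f ** 2 * n_total
    # Power is the probability of exceeding the critical value.
    return float(ncf.sf(f_crit, df_treatment, df_error, noncentrality))


# Hypothetical planning question: three groups, medium effect (f = 0.25).
for n_total in (60, 159, 300):
    power = anova_power(0.25, n_total, 3)
    print(f"N = {n_total}: power = {power:.2f}")
```

Running the function over a range of sample sizes, as in the loop above, is one simple way to find the smallest design that reaches a target power.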

### Effect Size

Effect size estimates facilitate the comparison of findings in studies and across disciplines. Therefore, several standardized measures of effect gauge the strength of the association between a predictor (or set of predictors) and the dependent variable.

Eta-squared ($\eta^2$) describes the proportion of variance in the dependent variable that is explained by a predictor, while controlling for other predictors. Eta-squared is a biased estimator of the variance explained by the model in the population (it estimates only the effect size in the sample). On average, it overestimates the variance explained in the population, although the bias shrinks as the sample size grows. It is computed as:

$\eta^2 = \dfrac{\text{SS}_{\text{treatment}}}{\text{SS}_{\text{total}}}$

Jacob Cohen, an American statistician and psychologist, suggested effect sizes for various indexes, including $\text{f}$ (where $0.1$ is a small effect, $0.25$ is a medium effect and $0.4$ is a large effect). He also offers a conversion table for eta-squared ($\eta^2$) where $0.0099$ constitutes a small effect, $0.0588$ a medium effect and $0.1379$ a large effect.
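The two sets of benchmarks quoted above are consistent with the standard relation $\eta^2 = \text{f}^2 / (1 + \text{f}^2)$. The short sketch below, with illustrative function names and made-up sums of squares, computes eta-squared from its definition and converts Cohen's $\text{f}$ benchmarks to the eta-squared scale.

```python
def eta_squared(ss_treatment, ss_total):
    """Proportion of total variability attributed to the treatment."""
    return ss_treatment / ss_total


def f_to_eta_squared(f):
    """Convert Cohen's f to eta-squared via eta^2 = f^2 / (1 + f^2)."""
    return f ** 2 / (1 + f ** 2)


# Convert the small, medium, and large benchmarks for f.
benchmarks = {"small": 0.1, "medium": 0.25, "large": 0.4}
converted = {label: round(f_to_eta_squared(f), 4)
             for label, f in benchmarks.items()}
print(converted)

# Hypothetical sums of squares from an ANOVA table.
example = eta_squared(ss_treatment=30.0, ss_total=200.0)
print(f"eta-squared = {example:.2f}")
```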

### Model Confirmation

It is prudent to verify that the assumptions of ANOVA have been met. Residuals are examined or analyzed to confirm homoscedasticity and gross normality. Residuals should have the appearance of (zero mean normal distribution) noise when plotted as a function of anything including time and modeled data values. Trends hint at interactions among factors or among observations. One rule of thumb is: if the largest standard deviation is less than twice the smallest standard deviation, we can use methods based on the assumption of equal standard deviations, and our results will still be approximately correct.
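The rule of thumb above is easy to check mechanically. The sketch below uses hypothetical group data and an illustrative function name. It compares the largest and smallest sample standard deviations across groups.

```python
from statistics import stdev


def equal_sd_rule_ok(groups):
    """Return True if the largest group SD is less than twice the smallest."""
    sds = [stdev(values) for values in groups.values()]
    return max(sds) < 2 * min(sds)


# Hypothetical groups with broadly similar spread.
similar = {
    "A": [4, 5, 6, 5, 4],
    "B": [7, 8, 8, 9, 7],
    "C": [12, 11, 13, 12, 14],
}

# Hypothetical groups where one is far more variable than the other.
unequal = {
    "A": [5, 5, 6, 5, 5],
    "B": [1, 9, 3, 12, 6],
}

print(equal_sd_rule_ok(similar))
print(equal_sd_rule_ok(unequal))
```

A check like this complements, rather than replaces, plotting the residuals, since it says nothing about trends or non-normality.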

### Follow-Up Tests

A statistically significant effect in ANOVA is often followed up with one or more different follow-up tests. This can be performed in order to assess which groups are different from which other groups, or to test various other focused hypotheses. Follow-up tests are often distinguished in terms of whether they are planned (a priori) or post hoc. Planned tests are determined before looking at the data, and post hoc tests are performed after looking at the data.

Post hoc tests, such as Tukey’s range test, most commonly compare every group mean with every other group mean and typically incorporate some method of controlling for type I errors. Comparisons, which are most commonly planned, can be either simple or compound. Simple comparisons compare one group mean with one other group mean. Compound comparisons typically compare two sets of group means where one set has two or more groups (e.g., compare the average of the group means of groups $\text{A}$, $\text{B}$, and $\text{C}$ with that of group $\text{D}$). Comparisons can also look at tests of trend, such as linear and quadratic relationships, when the independent variable involves ordered levels.
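A compound comparison such as the one in the example above can be written as a contrast: a set of weights that sum to zero. The sketch below uses hypothetical data for four independent groups and assumes the usual pooled within-group error term of a between-subjects ANOVA. It computes the contrast estimate, its standard error, and the resulting $\text{t}$ statistic.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical scores for four independent groups.
groups = {
    "A": [12, 14, 11, 13, 15],
    "B": [13, 15, 14, 12, 16],
    "C": [11, 13, 12, 14, 10],
    "D": [17, 19, 18, 16, 20],
}

# Contrast weights: average of A, B, and C compared with D.
weights = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3, "D": -1.0}

# Contrast estimate: weighted sum of the group means.
estimate = sum(weights[g] * mean(groups[g]) for g in groups)

# Pooled within-group variance (the ANOVA mean square error).
df_error = sum(len(v) - 1 for v in groups.values())
ms_error = sum((len(v) - 1) * variance(v) for v in groups.values()) / df_error

# Standard error of the contrast and the corresponding t statistic.
std_error = sqrt(ms_error * sum(weights[g] ** 2 / len(groups[g])
                                for g in groups))
t_stat = estimate / std_error

print(f"contrast estimate = {estimate:.2f}")
print(f"t({df_error}) = {t_stat:.2f}")
```

A simple comparison is the special case in which only two weights are nonzero, for example $+1$ for one group and $-1$ for another.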