## When to Use These Tests

“Ranking” refers to the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted.

### Learning Objectives

Indicate why and how data transformation is performed and how this relates to ranked data.

### Key Takeaways

#### Key Points

• Data transforms are usually applied so that the data appear to more closely meet the assumptions of a statistical inference procedure that is to be applied, or to improve the interpretability or appearance of graphs.
• Guidance for how data should be transformed, or whether a transform should be applied at all, should come from the particular statistical analysis to be performed.
• When there is evidence of substantial skew in the data, it is common to transform the data to a symmetric distribution before constructing a confidence interval.
• Data can also be transformed to make it easier to visualize them.
• A final reason that data can be transformed is to improve interpretability, even if no formal statistical analysis or visualization is to be performed.

#### Key Terms

• confidence interval: A type of interval estimate of a population parameter used to indicate the reliability of an estimate.
• data transformation: The application of a deterministic mathematical function to each point in a data set.
• central limit theorem: The theorem that states: the sum (or mean) of a large number of independent, identically distributed random variables with finite variance is approximately normally distributed.

In statistics, “ranking” refers to the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted. If, for example, the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would be 2, 3, 1 and 4 respectively. In another example, the ordinal data hot, cold, warm would be replaced by 3, 1, 2. In these examples, the ranks are assigned to values in ascending order. (In some other cases, descending ranks are used.) Ranks are related to the indexed list of order statistics, which consists of the original dataset rearranged into ascending order.

Some kinds of statistical tests employ calculations based on ranks. Examples include:

• Friedman test
• Kruskal-Wallis test
• Rank products
• Spearman’s rank correlation coefficient
• Wilcoxon rank-sum test
• Wilcoxon signed-rank test

Tied data values are usually assigned the average of the ranks they would otherwise occupy, so some ranks can have non-integer values. For example, when there is an even number of copies of the same data value, the fractional rank shared by the tied data ends in $\frac{1}{2}$.
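
The ranking transformation can be sketched in a few lines of Python. The function name `average_ranks` is an assumption made for this illustration, and the second data set is made up to show a tie.

```python
def average_ranks(values):
    """Return ascending 1-based ranks; tied values share the average of the ranks they span."""
    ranks = []
    for v in values:
        smaller = sum(1 for w in values if w < v)  # values strictly below v
        equal = sum(1 for w in values if w == v)   # copies of v, including itself
        # The tied copies occupy ranks smaller+1 .. smaller+equal; take their average.
        ranks.append(smaller + (equal + 1) / 2)
    return ranks


print(average_ranks([3.4, 5.1, 2.6, 7.3]))  # [2.0, 3.0, 1.0, 4.0]
print(average_ranks([10, 20, 20, 30]))      # [1.0, 2.5, 2.5, 4.0]
```

The first call reproduces the ranks 2, 3, 1, 4 from the example above. In the second call the two copies of 20 would occupy ranks 2 and 3, so each receives 2.5.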

### Data Transformation

Data transformation refers to the application of a deterministic mathematical function to each point in a data set—that is, each data point $\text{z}_\text{i}$ is replaced with the transformed value $\text{y}_\text{i} = \text{f}(\text{z}_\text{i})$, where $\text{f}$ is a function. Transforms are usually applied so that the data appear to more closely meet the assumptions of a statistical inference procedure that is to be applied, or to improve the interpretability or appearance of graphs.

Nearly always, the function that is used to transform the data is invertible and, generally, is continuous. The transformation is usually applied to a collection of comparable measurements. For example, if we are working with data on people’s incomes in some currency unit, it would be common to transform each person’s income value by the logarithm function.

### Reasons for Transforming Data

Guidance for how data should be transformed, or whether a transform should be applied at all, should come from the particular statistical analysis to be performed. For example, a simple way to construct an approximate 95% confidence interval for the population mean is to take the sample mean plus or minus two standard error units. However, the constant factor 2 used here is particular to the normal distribution and is only applicable if the sample mean varies approximately normally. The central limit theorem states that in many situations, the sample mean does vary normally if the sample size is reasonably large.

However, if the population is substantially skewed and the sample size is at most moderate, the approximation provided by the central limit theorem can be poor, and the resulting confidence interval will likely have the wrong coverage probability. Thus, when there is evidence of substantial skew in the data, it is common to transform the data to a symmetric distribution before constructing a confidence interval. If desired, the confidence interval can then be transformed back to the original scale using the inverse of the transformation that was applied to the data.
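
As a rough sketch of this transform-then-back-transform idea, the following Python snippet builds the "plus or minus two standard errors" interval on the log scale and maps it back with the exponential function. The income values are made up for illustration, and the choice of the logarithm is an assumption that suits right-skewed, positive data.

```python
import math
import statistics

# Hypothetical right-skewed sample (made-up values, for illustration only).
incomes = [18, 22, 25, 27, 30, 34, 41, 55, 90, 240]

# Transform each data point with the logarithm.
logs = [math.log(x) for x in incomes]
n = len(logs)
mean_log = statistics.mean(logs)
se_log = statistics.stdev(logs) / math.sqrt(n)  # standard error on the log scale

# Approximate 95% interval on the log scale: mean plus or minus two standard errors.
low_log = mean_log - 2 * se_log
high_log = mean_log + 2 * se_log

# Back-transform the endpoints with the inverse function (exp).
low, high = math.exp(low_log), math.exp(high_log)
print(f"interval on the original scale: ({low:.1f}, {high:.1f})")
```

Note that the back-transformed interval describes the geometric mean of the original data rather than the arithmetic mean, so it answers a slightly different question than an interval computed directly on the raw values.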

Data can also be transformed to make it easier to visualize them. For example, suppose we have a scatterplot in which the points are the countries of the world, and the data values being plotted are the land area and population of each country. If the plot is made using untransformed data (e.g., square kilometers for area and the number of people for population), most of the countries would be plotted in a tight cluster of points in the lower left corner of the graph. The few countries with very large areas and/or populations would be spread thinly around most of the graph’s area. Simply rescaling units (e.g., to thousand square kilometers, or to millions of people) will not change this. However, following logarithmic transformations of both area and population, the points will be spread more uniformly in the graph.

Population Versus Area Scatterplots: A scatterplot in which the areas of the sovereign states and dependent territories in the world are plotted on the vertical axis against their populations on the horizontal axis. The upper plot uses raw data. In the lower plot, both the area and population data have been transformed using the logarithm function.

A final reason that data can be transformed is to improve interpretability, even if no formal statistical analysis or visualization is to be performed. For example, suppose we are comparing cars in terms of their fuel economy. These data are usually presented as “kilometers per liter” or “miles per gallon.” However, if the goal is to assess how much additional fuel a person would use in one year when driving one car compared to another, it is more natural to work with the data transformed by the reciprocal function, yielding liters per kilometer, or gallons per mile.
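
A small Python sketch shows why the reciprocal scale is more natural for this question. The annual distance and the fuel-economy figures are assumed values chosen only for illustration.

```python
def gallons_used(miles_per_gallon, miles_driven):
    """Fuel used: distance multiplied by the reciprocal of fuel economy (gallons per mile)."""
    return miles_driven / miles_per_gallon


annual_miles = 10_000  # assumed annual distance

# Two hypothetical comparisons, each with a 10 mpg difference in fuel economy.
saving_low = gallons_used(10, annual_miles) - gallons_used(20, annual_miles)
saving_high = gallons_used(30, annual_miles) - gallons_used(40, annual_miles)

print(round(saving_low, 1))   # 500.0
print(round(saving_high, 1))  # 83.3
```

The same 10 mpg difference corresponds to very different amounts of fuel, which is obvious on the gallons-per-mile scale but hidden on the miles-per-gallon scale.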

## Mann-Whitney U-Test

The Mann–Whitney $\text{U}$-test is a non-parametric test of the null hypothesis that two populations are the same against the alternative that one population tends to have larger values than the other.

### Learning Objectives

Compare the Mann-Whitney $\text{U}$-test to Student’s $\text{t}$-test

### Key Takeaways

#### Key Points

• Mann-Whitney has greater efficiency than the $\text{t}$-test on non-normal distributions, such as a mixture of normal distributions, and it is nearly as efficient as the $\text{t}$-test on normal distributions.
• The test involves the calculation of a statistic, usually called $\text{U}$, whose distribution under the null hypothesis is known.
• The first method to calculate $\text{U}$ involves choosing the sample which has the smaller ranks, then counting the number of ranks in the other sample that are smaller than the ranks in the first, then summing these counts.
• The second method involves adding up the ranks for the observations which came from sample 1. The sum of ranks in sample 2 is now determinate, since the sum of all the ranks equals $\frac{\text{N}(\text{N}+1)}{2}$, where $\text{N}$ is the total number of observations.

#### Key Terms

• tie: One or more equal values or sets of equal values in the data set.
• ordinal data: A statistical data type consisting of numerical scores that exist on an ordinal scale, i.e. an arbitrary numerical scale where the exact numerical quantity of a particular value has no significance beyond its ability to establish a ranking over a set of data points.

The Mann–Whitney $\text{U}$-test is a non-parametric test of the null hypothesis that two populations are the same against an alternative hypothesis, especially that a particular population tends to have larger values than the other. It has greater efficiency than the $\text{t}$-test on non-normal distributions, such as a mixture of normal distributions, and it is nearly as efficient as the $\text{t}$-test on normal distributions.

### Assumptions and Formal Statement of Hypotheses

Although Mann and Whitney developed the test under the assumption of continuous responses with the alternative hypothesis being that one distribution is stochastically greater than the other, there are many other ways to formulate the null and alternative hypotheses such that the test remains valid. A very general formulation is to assume that:

1. All the observations from both groups are independent of each other.
2. The responses are ordinal (i.e., one can at least say of any two observations which is the greater).
3. The distributions of both groups are equal under the null hypothesis, so that the probability of an observation from one population ($\text{X}$) exceeding an observation from the second population ($\text{Y}$) equals the probability of an observation from $\text{Y}$ exceeding an observation from $\text{X}$. That is, there is a symmetry between populations with respect to probability of random drawing of a larger observation.
4. Under the alternative hypothesis, the probability of an observation from one population ($\text{X}$) exceeding an observation from the second population ($\text{Y}$) (after exclusion of ties) is not equal to $0.5$. The alternative may also be stated in terms of a one-sided test, for example: $\text{P}(\text{X} > \text{Y}) + 0.5 \cdot \text{P}(\text{X} = \text{Y}) > 0.5$.

### Calculations

The test involves the calculation of a statistic, usually called $\text{U}$, whose distribution under the null hypothesis is known. In the case of small samples, the distribution is tabulated, but for sample sizes above about 20, approximation using the normal distribution is fairly good.

There are two ways of calculating $\text{U}$ by hand. For either method, we must first arrange all the observations into a single ranked series. That is, rank all the observations without regard to which sample they are in.

### Method One

For small samples a direct method is recommended. It is very quick, and gives an insight into the meaning of the $\text{U}$ statistic.

1. Choose the sample for which the ranks seem to be smaller (the only reason to do this is to make computation easier). Call this “sample 1,” and call the other sample “sample 2.”
2. For each observation in sample 1, count the number of observations in sample 2 that have a smaller rank (count a half for any that are equal to it). The sum of these counts is $\text{U}$.
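
The direct count can be sketched in Python as follows. The function name `u_direct` and the two samples are made up for illustration. Comparing the raw values is equivalent to comparing their ranks in the pooled series, because a smaller value always has a smaller rank.

```python
def u_direct(sample_1, sample_2):
    """Method One: for each value in sample 1, count the smaller values in sample 2.

    Values in sample 2 that are equal to it count one half.
    """
    u = 0.0
    for x in sample_1:
        for y in sample_2:
            if y < x:
                u += 1.0
            elif y == x:
                u += 0.5
    return u


# Made-up data for illustration.
sample_1 = [1.1, 2.3, 2.9, 4.0]
sample_2 = [2.5, 3.1, 4.0, 5.2, 6.8]

u1 = u_direct(sample_1, sample_2)
u2 = u_direct(sample_2, sample_1)
print(u1, u2)  # 3.5 16.5
```

Swapping the roles of the two samples gives the complementary statistic, and the two values always add up to the product of the sample sizes (here 4 × 5 = 20).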

### Method Two

For larger samples, a formula can be used.

First, add up the ranks for the observations that came from sample 1. The sum of ranks in sample 2 is now determinate, since the sum of all the ranks equals:

$\dfrac{\text{N}(\text{N} + 1)}{2}$

where $\text{N}$ is the total number of observations. $\text{U}$ is then given by:

$\text{U}_1=\text{R}_1 - \dfrac{\text{n}_1(\text{n}_1+1)}{2}$

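where $\text{n}_1$ is the sample size for sample 1, and $\text{R}_1$ is the sum of the ranks in sample 1. Note that it doesn’t matter which of the two samples is considered sample 1. The analogous statistic for sample 2 is $\text{U}_2=\text{R}_2 - \dfrac{\text{n}_2(\text{n}_2+1)}{2}$, and the two always satisfy $\text{U}_1 + \text{U}_2 = \text{n}_1 \text{n}_2$. The smaller value of $\text{U}_1$ and $\text{U}_2$ is the one used when consulting significance tables.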
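
A minimal Python sketch of the rank-sum formula follows. The helper names and the data are assumptions made for this illustration, and tied values receive the average of the ranks they span.

```python
def average_ranks(values):
    """Ascending 1-based ranks; tied values share the average of the ranks they span."""
    ranks = []
    for v in values:
        smaller = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)
        ranks.append(smaller + (equal + 1) / 2)
    return ranks


def u_from_rank_sum(sample_1, sample_2):
    """Method Two: U_1 = R_1 - n_1 (n_1 + 1) / 2, with ranks taken in the pooled series."""
    pooled = list(sample_1) + list(sample_2)
    ranks = average_ranks(pooled)
    n1 = len(sample_1)
    r1 = sum(ranks[:n1])  # rank sum of the observations from sample 1
    return r1 - n1 * (n1 + 1) / 2


# Made-up data for illustration.
sample_1 = [1.1, 2.3, 2.9, 4.0]
sample_2 = [2.5, 3.1, 4.0, 5.2, 6.8]

u1 = u_from_rank_sum(sample_1, sample_2)
u2 = u_from_rank_sum(sample_2, sample_1)
print(u1, u2)       # 3.5 16.5
print(min(u1, u2))  # 3.5
```

Here the pooled ranks of sample 1 are 1, 2, 4 and 6.5, so $\text{R}_1 = 13.5$ and $\text{U}_1 = 13.5 - 10 = 3.5$. The smaller of the two values, 3.5, is the one that would be looked up in a table.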

### Example of Statement Results

In reporting the results of a Mann–Whitney test, it is important to state:

• a measure of the central tendencies of the two groups (means or medians; since the Mann–Whitney is an ordinal test, medians are usually recommended)
• the value of $\text{U}$
• the sample sizes
• the significance level

In practice some of this information may already have been supplied and common sense should be used in deciding whether to repeat it. A typical report might run:

“Median latencies in groups $\text{E}$ and $\text{C}$ were $153$ and $247$ ms; the distributions in the two groups differed significantly (Mann–Whitney $\text{U}=10.5$, $\text{n}_1=\text{n}_2=8$, $\text{P} < 0.05\text{, two-tailed}$).”

### Comparison to Student’s $\text{t}$-Test

The $\text{U}$-test is more widely applicable than the independent-samples Student’s $\text{t}$-test, and the question arises of which should be preferred.

### Ordinal Data

$\text{U}$ remains the logical choice when the data are ordinal but not interval scaled, so that the spacing between adjacent values cannot be assumed to be constant.

### Robustness

As it compares the sums of ranks, the Mann–Whitney test is less likely than the $\text{t}$-test to spuriously indicate significance because of the presence of outliers (i.e., Mann–Whitney is more robust).

### Efficiency

For distributions sufficiently far from normal and for sufficiently large sample sizes, the Mann-Whitney test is considerably more efficient than the $\text{t}$-test. Overall, the robustness makes Mann-Whitney more widely applicable than the $\text{t}$-test. For large samples from the normal distribution, the efficiency loss compared to the $\text{t}$-test is only about 5%, so one can recommend Mann-Whitney as the default test for comparing interval or ordinal measurements with similar distributions.

## Wilcoxon t-Test

The Wilcoxon $\text{t}$-test assesses whether population mean ranks differ for two related samples, matched samples, or repeated measurements on a single sample.

### Learning Objectives

Break down the procedure for the Wilcoxon signed-rank t-test.

### Key Takeaways

#### Key Points

• The Wilcoxon $\text{t}$-test can be used as an alternative to the paired Student’s $\text{t}$-test (also called the $\text{t}$-test for matched pairs or the $\text{t}$-test for dependent samples) when the population cannot be assumed to be normally distributed.
• The test is named for Frank Wilcoxon who (in a single paper) proposed both the signed-rank test and the rank-sum test for two independent samples.
• The test assumes that the data are paired and come from the same population, that each pair is chosen randomly and independently, and that the data are measured at least on an ordinal scale, but need not be normal.

#### Key Terms

• Wilcoxon t-test: A non-parametric statistical hypothesis test used when comparing two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ (i.e., it is a paired-difference test).
• tie: One or more equal values or sets of equal values in the data set.

The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used when comparing two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ (i.e., it is a paired difference test). It can be used as an alternative to the paired Student’s $\text{t}$-test (also called the $\text{t}$-test for matched pairs or the $\text{t}$-test for dependent samples) when the population cannot be assumed to be normally distributed.

The test is named for Frank Wilcoxon who (in a single paper) proposed both the signed-rank test and the rank-sum test for two independent samples. The test was popularized by Siegel in his influential textbook on non-parametric statistics. Siegel used the symbol $\text{T}$ for the value defined below as $\text{W}$. In consequence, the test is sometimes referred to as the Wilcoxon $\text{T}$-test, and the test statistic is reported as a value of $\text{T}$. The names “$\text{t}$-test for matched pairs” and “$\text{t}$-test for dependent samples” refer to the paired Student’s $\text{t}$-test, the parametric counterpart of this test.

### Assumptions

1. Data are paired and come from the same population.
2. Each pair is chosen randomly and independently.
3. The data are measured at least on an ordinal scale, but need not be normal.

### Test Procedure

Let $\text{N}$ be the sample size, the number of pairs. Thus, there are a total of $2\text{N}$ data points. For $\text{i}=1,\cdots,\text{N}$, let $\text{x}_{1,\text{i}}$ and $\text{x}_{2,\text{i}}$ denote the measurements.

$\text{H}_0$: The median difference between the pairs is zero.

$\text{H}_1$: The median difference is not zero.

1. For $\text{i}=1,\cdots,\text{N}$, calculate $\left| { \text{x} }_{ 2,\text{i} }-{ \text{x} }_{ 1,\text{i} } \right|$ and $\text{sgn}\left( { \text{x} }_{ 2,\text{i} }-{ \text{x} }_{ 1,\text{i} } \right)$, where $\text{sgn}$ is the sign function.

2. Exclude pairs with $\left|{ \text{x} }_{ 2,\text{i} }-{ \text{x} }_{ 1,\text{i} } \right|=0$. Let $\text{N}_\text{r}$ be the reduced sample size.

3. Order the remaining pairs from smallest absolute difference to largest absolute difference, $\left| { \text{x} }_{ 2,\text{i} }-{ \text{x} }_{ 1,\text{i} } \right|$.

4. Rank the pairs, starting with the smallest as 1. Ties receive a rank equal to the average of the ranks they span. Let $\text{R}_\text{i}$ denote the rank.

5. Calculate the test statistic $\text{W}$, the absolute value of the sum of the signed ranks:

$\text{W}= \left| \sum \left(\text{sgn}(\text{x}_{2,\text{i}}-\text{x}_{1,\text{i}}) \cdot \text{R}_\text{i} \right) \right|$

6. As $\text{N}_\text{r}$ increases, the sampling distribution of $\text{W}$ converges to a normal distribution. Thus, for $\text{N}_\text{r} \geq 10$, a $\text{z}$-score can be calculated as follows:

$\text{z}=\dfrac{\text{W}-0.5}{\sigma_\text{W}}$

where

$\displaystyle{\sigma_\text{W} = \sqrt{\frac{\text{N}_\text{r}(\text{N}_\text{r}+1)(2\text{N}_\text{r}+1)}{6}}}$

If $\text{z} > \text{z}_{\text{critical}}$ then reject $\text{H}_0$.

For $\text{N}_\text{r} < 10$, $\text{W}$ is compared to a critical value from a reference table. If $\text{W}\ge { \text{W} }_{ \text{critical,}{ \text{N} }_{ \text{r} } }$ then reject $\text{H}_0$.

Alternatively, a $\text{p}$-value can be calculated from enumeration of all possible combinations of $\text{W}$ given $\text{N}_\text{r}$.
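
The procedure above can be sketched in Python. This is a minimal illustration that follows the definitions of $\text{W}$, $\sigma_\text{W}$ and $\text{z}$ given in this section. The function names and the paired measurements are made up for the example.

```python
import math


def average_ranks(values):
    """Ascending 1-based ranks; tied values share the average of the ranks they span."""
    ranks = []
    for v in values:
        smaller = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)
        ranks.append(smaller + (equal + 1) / 2)
    return ranks


def wilcoxon_signed_rank(x1, x2):
    """Return (W, N_r, z) for paired measurements x1 and x2."""
    # Steps 1-2: signed differences, excluding pairs with zero difference.
    diffs = [b - a for a, b in zip(x1, x2) if b != a]
    n_r = len(diffs)
    # Steps 3-4: rank the absolute differences, averaging ranks for ties.
    ranks = average_ranks([abs(d) for d in diffs])
    # Step 5: W is the absolute value of the sum of the signed ranks.
    w = abs(sum(math.copysign(r, d) for r, d in zip(ranks, diffs)))
    # Step 6: normal approximation, as defined above.
    sigma_w = math.sqrt(n_r * (n_r + 1) * (2 * n_r + 1) / 6)
    z = (w - 0.5) / sigma_w
    return w, n_r, z


# Made-up paired measurements (one pair has zero difference and is dropped).
x1 = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135, 128]
x2 = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145, 134]

w, n_r, z = wilcoxon_signed_rank(x1, x2)
print(w, n_r)  # 5.0 10
print(round(z, 3))
```

For these data the positive ranks sum to 25 and the negative ranks to 30, so $\text{W} = 5$ with $\text{N}_\text{r} = 10$. The resulting $\text{z}$ would then be compared with the critical value for the chosen significance level.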

## Kruskal-Wallis H-Test

The Kruskal–Wallis one-way analysis of variance by ranks is a non-parametric method for testing whether samples originate from the same distribution.

### Learning Objectives

Summarize the Kruskal-Wallis one-way analysis of variance and outline its methodology

### Key Takeaways

#### Key Points

• The Kruskal-Wallis test is used for comparing more than two samples that are independent, or not related.
• When the Kruskal-Wallis test leads to significant results, then at least one of the samples is different from the other samples.
• The test does not identify where the differences occur or how many differences actually occur.
• Since it is a non-parametric method, the Kruskal–Wallis test does not assume a normal distribution, unlike the analogous one-way analysis of variance.
• The test does assume an identically shaped and scaled distribution for each group, except for any difference in medians.
• Kruskal–Wallis is also used when the examined groups are of unequal size (different number of participants).

#### Key Terms

• chi-squared distribution: A distribution with $\text{k}$ degrees of freedom is the distribution of a sum of the squares of $\text{k}$ independent standard normal random variables.
• Kruskal-Wallis test: A non-parametric method for testing whether samples originate from the same distribution.
• Type I error: An error occurring when the null hypothesis ($\text{H}_\text{0}$) is true, but is rejected.

The Kruskal–Wallis one-way analysis of variance by ranks (named after William Kruskal and W. Allen Wallis) is a non-parametric method for testing whether samples originate from the same distribution. It is used for comparing more than two samples that are independent, or not related. The parametric equivalent of the Kruskal-Wallis test is the one-way analysis of variance (ANOVA). When the Kruskal-Wallis test leads to significant results, then at least one of the samples is different from the other samples. The test does not identify where the differences occur, nor how many differences actually occur. It is an extension of the Mann–Whitney $\text{U}$ test to 3 or more groups. The Mann-Whitney would help analyze the specific sample pairs for significant differences.

Since it is a non-parametric method, the Kruskal–Wallis test does not assume a normal distribution, unlike the analogous one-way analysis of variance. However, the test does assume an identically shaped and scaled distribution for each group, except for any difference in medians.

Kruskal–Wallis is also used when the examined groups are of unequal size (different number of participants).

### Method

1. Rank all data from all groups together; i.e., rank the data from $1$ to $\text{N}$ ignoring group membership. Assign any tied values the average of the ranks they would have received had they not been tied.

2. The test statistic is given by:

$\displaystyle{\text{K}=(\text{N}-1) \frac{\displaystyle{\sum_{\text{i}=1}^\text{g}\text{n}_\text{i}(\bar{\text{r}}_{\text{i}\cdot} - \bar{\text{r}})^2}}{\displaystyle{\sum_{\text{i}=1}^\text{g} \sum_{\text{j}=1}^{\text{n}_\text{i}} (\text{r}_{\text{ij}}-\bar{\text{r}})^2}}}$

where

$\displaystyle{\bar{\text{r}}_{\text{i}\cdot}= \frac{\sum_{\text{j}=1}^{\text{n}_\text{i}}\text{r}_{\text{ij}}}{\text{n}_\text{i}}}$

and where $\bar{\text{r}} = \frac{1}{2} (\text{N}+1)$ is the average of all the $\text{r}_{\text{ij}}$, $\text{n}_\text{i}$ is the number of observations in group $\text{i}$, $\text{r}_{\text{ij}}$ is the rank (among all observations) of observation $\text{j}$ from group $\text{i}$, and $\text{N}$ is the total number of observations across all groups.

3. If the data contain no ties, the denominator of the expression for $\text{K}$ is exactly

$\dfrac{(\text{N}-1)\text{N}(\text{N}+1)}{12}$

and

$\bar{\text{r}}=\dfrac{\text{N}+1}{2}$

Therefore:

\begin{align} \text{K} &= \frac{12}{\text{N}(\text{N}+1)} \cdot \sum_{\text{i}=1}^\text{g} \text{n}_\text{i} \left( \bar{\text{r}}_{\text{i} \cdot} - \dfrac{\text{N}+1}{2}\right)^2 \\ &= \frac{12}{\text{N}(\text{N}+1)} \cdot \sum_{\text{i}=1}^\text{g} \text{n}_\text{i} \bar{\text{r}}_{\text{i}\cdot}^2 - 3 (\text{N}+1) \end{align}

Note that the second line contains only the squares of the average ranks.

4. A correction for ties if using the shortcut formula described in the previous point can be made by dividing $\text{K}$ by the following:

$1-\frac{\displaystyle{\sum_{\text{i}=1}^\text{G} (\text{t}_\text{i}^3 - \text{t}_\text{i})}}{\displaystyle{\text{N}^3-\text{N}}}$

where $\text{G}$ is the number of groupings of different tied ranks, and $\text{t}_\text{i}$ is the number of tied values within group $\text{i}$ that are tied at a particular value. This correction usually makes little difference in the value of $\text{K}$ unless there are a large number of ties.

5. Finally, the p-value is approximated by:

$\text{Pr}\left( { \chi }_{ \text{g}-1 }^{ 2 }\ge \text{K} \right)$

If some $\text{n}_\text{i}$ values are small (i.e., less than 5), the probability distribution of $\text{K}$ can be quite different from this chi-squared distribution. If a table of the chi-squared probability distribution is available, the critical value of chi-squared, ${ \chi }_{ \alpha,\text{g}-1 }^{ 2 }$, can be found by entering the table at $\text{g} - 1$ degrees of freedom and looking under the desired significance or alpha level. The null hypothesis of equal population medians would then be rejected if $\text{K}\ge { \chi }_{ \alpha,\text{g}-1 }^{ 2 }$. Appropriate multiple comparisons would then be performed on the group medians.

6. If the statistic is not significant, then there is no evidence of differences between the samples. However, if the test is significant then a difference exists between at least two of the samples. Therefore, a researcher might use sample contrasts between individual sample pairs, or post hoc tests, to determine which of the sample pairs are significantly different. When performing multiple sample contrasts, the type I error rate tends to become inflated.
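
The computation of $\text{K}$ in steps 1 to 3 can be sketched in Python. The function names and the three groups of measurements are assumptions made for illustration. The data contain no ties, so the general formula and the shortcut formula should agree.

```python
def average_ranks(values):
    """Ascending 1-based ranks; tied values share the average of the ranks they span."""
    ranks = []
    for v in values:
        smaller = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)
        ranks.append(smaller + (equal + 1) / 2)
    return ranks


def group_rank_means(groups):
    """Rank all observations together; return (all ranks, list of (n_i, mean rank of group i))."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    summary, start = [], 0
    for g in groups:
        group_ranks = ranks[start:start + len(g)]
        start += len(g)
        summary.append((len(g), sum(group_ranks) / len(g)))
    return ranks, summary


def kruskal_wallis_k(groups):
    """General formula for K (step 2)."""
    ranks, summary = group_rank_means(groups)
    n_total = len(ranks)
    r_bar = (n_total + 1) / 2
    numerator = sum(n_i * (mean_i - r_bar) ** 2 for n_i, mean_i in summary)
    denominator = sum((r - r_bar) ** 2 for r in ranks)
    return (n_total - 1) * numerator / denominator


def kruskal_wallis_shortcut(groups):
    """Shortcut formula for K (step 3); only valid when the data contain no ties."""
    ranks, summary = group_rank_means(groups)
    n_total = len(ranks)
    total = sum(n_i * mean_i ** 2 for n_i, mean_i in summary)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# Made-up measurements for three groups of unequal size.
groups = [
    [2.9, 3.0, 2.5, 2.6, 3.2],
    [3.8, 2.7, 4.0, 2.4],
    [2.8, 3.4, 3.7, 2.2, 2.0],
]

k_general = kruskal_wallis_k(groups)
k_shortcut = kruskal_wallis_shortcut(groups)
print(round(k_general, 3), round(k_shortcut, 3))  # 0.771 0.771
```

With three groups, $\text{K}$ would be compared against a chi-squared distribution with $\text{g} - 1 = 2$ degrees of freedom, either from a table or, if SciPy is available, with a call such as `scipy.stats.chi2.sf(k_general, 2)`. Because the general formula works on the averaged ranks directly, it can also be applied when ties are present.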