## Tests of Significance

Tests of significance are statistical methods for assessing the likelihood of empirical data and, from there, for inferring a real effect.

### Learning Objectives

Examine the idea of statistical significance and the fundamentals behind the corresponding tests.

### Key Takeaways

#### Key Points

• In relation to Fisher, statistical significance is a statistical assessment of whether observations reflect a pattern rather than just chance.
• In statistical testing, a result is deemed statistically significant if it is so extreme that such a result would be expected to arise simply by chance only in rare circumstances.
• Statistical significance refers to two separate notions: the $\text{p}$-value and the Type I error rate $\alpha$.
• A typical test of significance comprises two related elements: the calculation of the probability of the data and an assessment of the statistical significance of that probability.

#### Key Terms

• statistical significance: A measure of how unlikely it is that a result has occurred by chance.
• null hypothesis: A hypothesis set up to be refuted in order to support an alternative hypothesis; presumed true until statistical evidence in the form of a hypothesis test indicates otherwise.

Tests of significance are statistical methods for assessing the likelihood of empirical data and, from there, for inferring a real effect, such as a correlation between variables or the effectiveness of a new treatment. Beginning circa 1925, Sir Ronald Fisher—an English statistician, evolutionary biologist, geneticist, and eugenicist (shown in the figure below)—standardized the interpretation of statistical significance and was the main driving force behind the popularity of tests of significance in empirical research, especially in the social and behavioral sciences.

Sir Ronald Fisher: Sir Ronald Fisher was an English statistician, evolutionary biologist, geneticist, and eugenicist who standardized the interpretation of statistical significance (starting around 1925), and was the main driving force behind the popularity of tests of significance in empirical research, especially in the social and behavioral sciences.

Statistical significance refers to two separate notions:

1. the $\text{p}$-value: the probability of obtaining data at least as extreme as those observed, assuming the null hypothesis is true; or
2. the Type I error rate $\alpha$ (false positive rate) of a statistical hypothesis test (the probability of incorrectly rejecting a given null hypothesis in favor of a second alternative hypothesis).

In relation to Fisher, statistical significance is a statistical assessment of whether observations reflect a pattern rather than just chance. The fundamental challenge is that any partial picture of a given hypothesis, poll, or question is subject to random error. In statistical testing, a result is deemed statistically significant if it is so extreme (absent external variables that would influence the results of the test) that such a result would be expected to arise by chance only in rare circumstances. Hence the result provides enough evidence to reject the hypothesis of “no effect.”

A typical test of significance comprises two related elements:

1. the calculation of the probability of the data, and
2. an assessment of the statistical significance of that probability.

### Probability of the Data

The probability of the data is normally reported using two related statistics:

1. a test statistic ($\text{z}$, $\text{t}$, $\text{F}$…), and
2. an associated probability ($\text{p}$, $^*$).

The information provided by the test statistic is of little immediate use and can often be set aside. The associated probability, on the other hand, tells how probable such results are under the null hypothesis and forms the basis for assessing statistical significance.

### Statistical Significance

The statistical significance of the results depends on criteria set up by the researcher beforehand. A result is deemed statistically significant if the probability of the data is small enough, conventionally if it is smaller than 5% ($\text{sig} \leq 0.05$). However, conventional thresholds for significance vary across disciplines and researchers. For example, the health sciences commonly settle for 10% ($\text{sig} \leq 0.10$), while particular researchers may adopt more stringent levels, such as 1% ($\text{sig} \leq 0.01$). In any case, $\text{p}$-values ($\text{p}$, $^*$) larger than the selected threshold are considered non-significant and are typically excluded from further discussion. $\text{P}$-values smaller than, or equal to, the threshold are considered statistically significant and are interpreted accordingly. A statistically significant result normally leads to an inference of real effects, unless there are suspicions that such results may be anomalous. Note that the criteria used for assessing statistical significance may not be made explicit in a research article when the researcher is using conventional assessment criteria.
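The decision rule just described can be sketched in Python; `assess_significance` is a hypothetical helper, and the threshold `sig` stands for whatever conventional level the researcher has chosen beforehand:

```python
def assess_significance(p_value, sig=0.05):
    """Compare a p-value against a pre-chosen significance threshold."""
    return "statistically significant" if p_value <= sig else "non-significant"

print(assess_significance(0.025))             # below the conventional 5% cutoff
print(assess_significance(0.08))              # non-significant at the 5% level
print(assess_significance(0.08, sig=0.10))    # the same p-value passes a 10% threshold
```

Note that the same $\text{p}$-value can be significant or not depending on the threshold, which is why the criterion must be fixed before looking at the data.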

As an example, consider the following test statistics:

$[\text{z}=1.96, \text{p}=0.025]$

$[\text{F} = 13.140, \text{p}<0.01]$

$[\text{r} = 0.60^*]$

In this example, the test statistics are $\text{z}$ (normality test), $\text{F}$ (equality of variance test), and $\text{r}$ (correlation). Each $\text{p}$-value indicates, with more or less precision, the probability of its test statistic under the corresponding null hypothesis. Assuming a conventional 5% level of significance ($\text{sig} \leq 0.05$), all tests are, thus, statistically significant. We can thus infer that we have measured a real effect rather than a random fluctuation in the data. When interpreting the results, the correlation statistic provides information which is directly usable. We could thus infer a medium-to-high correlation between two variables. The test statistics $\text{z}$ and $\text{F}$, on the other hand, do not provide immediately useful information, and any further interpretation requires descriptive statistics. For example, skewness and kurtosis are needed to interpret non-normality ($\text{z}$), and group means and variances are needed to describe group differences ($\text{F}$).
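The link between a test statistic and its associated probability can be reproduced with only the standard library; a minimal sketch for the one-tailed normal case, matching the $\text{z} = 1.96$ example above:

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

z = 1.96
p_one_tailed = 1.0 - normal_cdf(z)   # probability of a result at least this extreme
print(round(p_one_tailed, 3))        # ≈ 0.025, matching the reported p-value
```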

## Elements of a Hypothesis Test

A statistical hypothesis test is a method of making decisions using data from a scientific study.

### Learning Objectives

Outline the steps of a standard hypothesis test.

### Key Takeaways

#### Key Points

• Statistical hypothesis tests define a procedure that controls (fixes) the probability of incorrectly deciding that a default position (null hypothesis) is incorrect based on how likely it would be for a set of observations to occur if the null hypothesis were true.
• The first step in a hypothesis test is to state the relevant null and alternative hypotheses; the second is to consider the statistical assumptions being made about the sample in doing the test.
• Next, the relevant test statistic is stated, and its distribution is derived under the null hypothesis from the assumptions.
• After that, the relevant significance level and critical region are determined.
• Finally, values of the test statistic are observed and the decision is made whether to either reject the null hypothesis in favor of the alternative or not reject it.

#### Key Terms

• significance level: A measure of how likely it is to draw a false conclusion in a statistical test, when the results are really just random variations.
• null hypothesis: A hypothesis set up to be refuted in order to support an alternative hypothesis; presumed true until statistical evidence in the form of a hypothesis test indicates otherwise.

A statistical hypothesis test is a method of making decisions using data from a scientific study. In statistics, a result is called statistically significant if it has been predicted as unlikely to have occurred by chance alone, according to a pre-determined threshold probability—the significance level. Statistical hypothesis testing is sometimes called confirmatory data analysis, in contrast to exploratory data analysis, which may not have pre-specified hypotheses. Statistical hypothesis testing is a key technique of frequentist inference.

Statistical hypothesis tests define a procedure that controls (fixes) the probability of incorrectly deciding that a default position (null hypothesis) is incorrect based on how likely it would be for a set of observations to occur if the null hypothesis were true. Note that this probability of making an incorrect decision is not the probability that the null hypothesis is true, nor whether any specific alternative hypothesis is true. This contrasts with other possible techniques of decision theory in which the null and alternative hypothesis are treated on a more equal basis.

### The Testing Process

The typical line of reasoning in a hypothesis test is as follows:

1. There is an initial research hypothesis of which the truth is unknown.
2. The first step is to state the relevant null and alternative hypotheses. This is important as mis-stating the hypotheses will muddy the rest of the process.
3. The second step is to consider the statistical assumptions being made about the sample in doing the test—for example, assumptions about the statistical independence or about the form of the distributions of the observations. This is important because invalid assumptions will mean that the results of the test are invalid.
4. Decide which test is appropriate, and state the relevant test statistic $\text{T}$.
5. Derive the distribution of the test statistic under the null hypothesis from the assumptions.
6. Select a significance level ($\alpha$), a probability threshold below which the null hypothesis will be rejected. Common values are 5% and 1%.
7. The distribution of the test statistic under the null hypothesis partitions the possible values of $\text{T}$ into those for which the null hypothesis is rejected, the so-called critical region, and those for which it is not. The probability of the critical region is $\alpha$.
8. Compute from the observations the observed value $\text{t}_\text{obs}$ of the test statistic $\text{T}$.
9. Decide to either reject the null hypothesis in favor of the alternative or not reject it. The decision rule is to reject the null hypothesis $\text{H}_0$ if the observed value $\text{t}_\text{obs}$ is in the critical region, and to accept or “fail to reject” the hypothesis otherwise.

An alternative process is commonly used:

7. Compute from the observations the observed value $\text{t}_\text{obs}$ of the test statistic $\text{T}$.

8. From the statistic calculate a probability of the observation under the null hypothesis (the $\text{p}$-value).

9. Reject the null hypothesis in favor of the alternative or not reject it. The decision rule is to reject the null hypothesis if and only if the $\text{p}$-value is less than the significance level (the selected probability) threshold.

The two processes are equivalent. The former process was advantageous in the past when only tables of test statistics at common probability thresholds were available. It allowed a decision to be made without the calculation of a probability. It was adequate for classwork and for operational use, but it was deficient for reporting results. The latter process relied on extensive tables or on computational support not always available. The calculations are now trivially performed with appropriate software.
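Both routes can be followed side by side for a one-sided z-test; a sketch assuming a known population standard deviation, where the null value, sample figures, and $\alpha$ are all hypothetical choices:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Hypothetical setup: H0: mu = 100 versus Ha: mu > 100, with known sigma = 15.
mu0, sigma, n = 100.0, 15.0, 36
x_bar = 105.0          # observed sample mean
alpha = 0.05

t_obs = (x_bar - mu0) / (sigma / math.sqrt(n))   # observed value of the test statistic

# Former process: critical region.  Reject H0 if t_obs exceeds the critical
# value z_alpha, tabulated so that P(Z > z_alpha) = alpha (z_0.05 ≈ 1.645).
z_alpha = 1.6449
reject_by_region = t_obs > z_alpha

# Latter process: p-value.  Reject H0 if P(Z >= t_obs) is below alpha.
p_value = 1.0 - normal_cdf(t_obs)
reject_by_p = p_value < alpha

# The two decision rules give the same answer.
assert reject_by_region == reject_by_p
print(t_obs, round(p_value, 4), reject_by_region)
```

The tabulated critical value plays the role of the old statistical tables; the $\text{p}$-value route needs the CDF itself, which is why it once required computational support.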

Tea Tasting Distribution: This table shows the distribution of permutations in our tea tasting example.

## The Null and the Alternative

The alternative hypothesis and the null hypothesis are the two rival hypotheses that are compared by a statistical hypothesis test.

### Learning Objectives

Differentiate between the null and alternative hypotheses and understand their implications in hypothesis testing.

### Key Takeaways

#### Key Points

• The null hypothesis refers to a general or default position: that there is no relationship between two measured phenomena, or that a potential medical treatment has no effect.
• In the testing approach of Ronald Fisher, a null hypothesis is potentially rejected or disproved, but never accepted or proved.
• In the hypothesis testing approach of Jerzy Neyman and Egon Pearson, a null hypothesis is contrasted with an alternative hypothesis, and these are decided between on the basis of data, with certain error rates.
• The four principal types of alternative hypotheses are: point, one-tailed directional, two-tailed directional, and non-directional.

#### Key Terms

• alternative hypothesis: a rival hypothesis to the null hypothesis, whose likelihoods are compared by a statistical hypothesis test
• null hypothesis: A hypothesis set up to be refuted in order to support an alternative hypothesis; presumed true until statistical evidence in the form of a hypothesis test indicates otherwise.

In statistical hypothesis testing, the alternative hypothesis and the null hypothesis are the two rival hypotheses which are compared by a statistical hypothesis test. An example might be where water quality in a stream has been observed over many years. A test can be made of the null hypothesis (that there is no change in quality between the first and second halves of the data) against the alternative hypothesis (that the quality is poorer in the second half of the record).

### The Null Hypothesis

The null hypothesis refers to a general or default position: that there is no relationship between two measured phenomena, or that a potential medical treatment has no effect. Rejecting or disproving the null hypothesis (and thus concluding that there are grounds for believing that there is a relationship between two phenomena or that a potential treatment has a measurable effect) is a central task in the modern practice of science and gives a precise sense in which a claim is capable of being proven false.

The concept of a null hypothesis is used differently in two approaches to statistical inference, though the same term is used, a problem shared with statistical significance. In the significance testing approach of Ronald Fisher, a null hypothesis is potentially rejected or disproved on the basis of data that would be significantly unlikely under its assumption, but it is never accepted or proved. In the hypothesis testing approach of Jerzy Neyman and Egon Pearson, a null hypothesis is contrasted with an alternative hypothesis, and the two are decided between on the basis of data, with certain error rates.

Sir Ronald Fisher: Sir Ronald Fisher, pictured here, coined the term null hypothesis.

### The Alternative Hypothesis

In the case of a scalar parameter, there are four principal types of alternative hypothesis:

1. Point. Point alternative hypotheses occur when the hypothesis test is framed so that the population distribution under the alternative hypothesis is a fully defined distribution, with no unknown parameters. Such hypotheses are usually of no practical interest but are fundamental to theoretical considerations of statistical inference.
2. One-tailed directional. A one-tailed directional alternative hypothesis is concerned with the region of rejection for only one tail of the sampling distribution.
3. Two-tailed directional. A two-tailed directional alternative hypothesis is concerned with both regions of rejection of the sampling distribution.
4. Non-directional. A non-directional alternative hypothesis is not concerned with either region of rejection, but, rather, only that the null hypothesis is not true.

The concept of an alternative hypothesis forms a major component in modern statistical hypothesis testing; however, it was not part of Ronald Fisher’s formulation of statistical hypothesis testing. In Fisher’s approach to testing, the central idea is to assess whether the observed dataset could have resulted from chance if the null hypothesis were assumed to hold, notionally without preconceptions about what other model might hold. Modern statistical hypothesis testing accommodates this type of test, since the alternative hypothesis can be just the negation of the null hypothesis.

### The Test

A hypothesis test begins by considering the null and alternate hypotheses, each containing an opposing viewpoint.

$\text{H}_0$: The null hypothesis: It is a statement about the population that will be assumed to be true unless it can be shown to be incorrect beyond a reasonable doubt.

$\text{H}_\text{a}$: The alternate hypothesis: It is a claim about the population that is contradictory to $\text{H}_0$ and what we conclude when we reject $\text{H}_0$.

Since the null and alternate hypotheses are contradictory, we must examine evidence, in the form of sample data, to decide whether it is sufficient to reject the null hypothesis.

After determining which hypothesis the sample supports, we can make one of two decisions: “reject $\text{H}_0$” if the sample information favors the alternate hypothesis, or “do not reject $\text{H}_0$” (equivalently, “fail to reject $\text{H}_0$”) if the sample information is insufficient to reject the null hypothesis.

### Examples

#### Example 1

$\text{H}_0$: No more than 30% of the registered voters in Santa Clara County voted in the primary election.

$\text{H}_\text{a}$: More than 30% of the registered voters in Santa Clara County voted in the primary election.

#### Example 2

We want to test whether the mean grade point average in American colleges is different from 2.0 (out of 4.0).

$\text{H}_0: \mu = 2.0$

$\text{H}_\text{a}: \mu \neq 2.0$

#### Example 3

We want to test if college students take less than five years to graduate from college, on the average.

$\text{H}_0: \mu \geq 5$

$\text{H}_\text{a}: \mu < 5$
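Example 3 can be carried one step further in code; a sketch using a large-sample z approximation, where the sample size, mean, and standard deviation are invented figures for illustration, not data from a real survey:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# H0: mu >= 5 versus Ha: mu < 5 (mean years to graduate).
mu0 = 5.0
n, x_bar, s = 100, 4.5, 1.8      # hypothetical sample size, mean, std deviation

z = (x_bar - mu0) / (s / math.sqrt(n))   # large-sample z statistic
p_value = normal_cdf(z)                  # left tail, since Ha points to small values

print(round(z, 3), round(p_value, 5))
# A small p-value favors rejecting H0 in favor of mu < 5.
```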

## Type I and Type II Errors

If the result of a hypothesis test does not correspond with reality, then an error has occurred.

### Learning Objectives

Distinguish between Type I and Type II error and discuss the consequences of each.

### Key Takeaways

#### Key Points

• A type I error occurs when the null hypothesis ($\text{H}_0$) is true but is rejected.
• The rate of the type I error is called the size of the test and denoted by the Greek letter $\alpha$ (alpha).
• A type II error occurs when the null hypothesis is false but erroneously fails to be rejected.
• The rate of the type II error is denoted by the Greek letter $\beta$ (beta) and related to the power of a test (which equals $1-\beta$).

#### Key Terms

• type II error: Accepting the null hypothesis when the null hypothesis is false.
• Type I error: Rejecting the null hypothesis when the null hypothesis is true.

The notion of statistical error is an integral part of hypothesis testing. The test requires an unambiguous statement of a null hypothesis, which usually corresponds to a default “state of nature” — for example, “this person is healthy,” “this accused is not guilty,” or “this product is not broken.” An alternative hypothesis is the negation of the null hypothesis (for example, “this person is not healthy,” “this accused is guilty,” or “this product is broken”). The result of the test may be negative relative to the null hypothesis (not healthy, guilty, broken) or positive (healthy, not guilty, not broken).

If the result of the test corresponds with reality, then a correct decision has been made. However, if the result of the test does not correspond with reality, then an error has occurred. Due to the statistical nature of a test, the result is never, except in very rare cases, free of error. The two types of error are distinguished as type I error and type II error. What we actually call type I or type II error depends directly on the null hypothesis, and negation of the null hypothesis causes type I and type II errors to switch roles.

### Type I Error

A type I error occurs when the null hypothesis ($\text{H}_0$) is true but is rejected. It is asserting something that is absent, a false hit. A type I error may be compared with a so-called false positive (a result that indicates that a given condition is present when it actually is not present) in tests where a single condition is tested for. A type I error can also be said to occur when we believe a falsehood. In terms of folk tales, an investigator may be “crying wolf” without a wolf in sight (raising a false alarm). $\text{H}_0$: no wolf.

The rate of the type I error is called the size of the test and denoted by the Greek letter $\alpha$ (alpha). It usually equals the significance level of a test. In the case of a simple null hypothesis, $\alpha$ is the probability of a type I error. If the null hypothesis is composite, $\alpha$ is the maximum of the possible probabilities of a type I error.

### False Positive Error

A false positive error, commonly called a “false alarm,” is a result that indicates a given condition has been fulfilled when it actually has not been fulfilled. In the case of “crying wolf,” the condition tested for was “is there a wolf near the herd?” The actual result was that there had not been a wolf near the herd. The shepherd wrongly indicated there was one, by crying wolf.

A false positive error is a type I error where the test is checking a single condition and results in an affirmative or negative decision, usually designated as “true or false.”

### Type II Error

A type II error occurs when the null hypothesis is false but erroneously fails to be rejected. It is failing to assert what is present, a miss. A type II error may be compared with a so-called false negative (where an actual “hit” was disregarded by the test and seen as a “miss”) in a test checking for a single condition with a definitive result of true or false. A type II error is committed when we fail to believe a truth. In terms of folk tales, an investigator may fail to see the wolf (“failing to raise an alarm”). Again, $\text{H}_0$: no wolf.

The rate of the type II error is denoted by the Greek letter $\beta$ (beta) and related to the power of a test (which equals $1-\beta$).

### False Negative Error

A false negative error occurs when a test result indicates that a condition failed, while it actually succeeded. A common example is a guilty prisoner freed from jail. The condition “Is the prisoner guilty?” actually had a positive result (yes, he is guilty), but the test failed to realize this and wrongly decided the prisoner was not guilty.

A false negative error is a type II error occurring in test steps where a single condition is checked for and the result can either be positive or negative.

### Consequences of Type I and Type II Errors

Both types of errors are problems for individuals, corporations, and data analysis. A false positive (with null hypothesis of health) in medicine causes unnecessary worry or treatment, while a false negative gives the patient the dangerous illusion of good health and the patient might not get an available treatment. A false positive in manufacturing quality control (with a null hypothesis of a product being well made) discards a product that is actually well made, while a false negative stamps a broken product as operational. A false positive (with null hypothesis of no effect) in scientific research suggests an effect that is not actually there, while a false negative fails to detect an effect that is there.

Based on the real-life consequences of an error, one type may be more serious than the other. For example, NASA engineers would prefer to waste some money and throw out an electronic circuit that is really fine (null hypothesis: not broken; reality: not broken; test finding: broken; action: thrown out; error: type I, false positive) than to use one on a spacecraft that is actually broken. On the other hand, criminal courts set a high bar for proof and procedure and sometimes acquit someone who is guilty (null hypothesis: innocent; reality: guilty; test finding: not guilty; action: acquit; error: type II, false negative) rather than convict someone who is innocent.

Minimizing errors of decision is not a simple issue. For any given sample size the effort to reduce one type of error generally results in increasing the other type of error. The only way to minimize both types of error, without just improving the test, is to increase the sample size, and this may not be feasible. An example of acceptable type I error is discussed below.
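Both error rates can be estimated by simulation; a sketch for a one-sided z-test, in which the nominal level, the true effect size, and the sample size are all illustrative choices:

```python
import math
import random

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def one_sided_z_test(sample, mu0, sigma, alpha=0.05):
    """Reject H0: mu = mu0 in favor of mu > mu0 when the p-value falls below alpha."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return (1.0 - normal_cdf(z)) < alpha

random.seed(42)
trials, mu0, sigma, n = 2000, 0.0, 1.0, 25

# Type I error rate: generate data under a TRUE null and count rejections.
type1 = sum(one_sided_z_test([random.gauss(mu0, sigma) for _ in range(n)],
                             mu0, sigma) for _ in range(trials)) / trials

# Type II error rate: generate data under a FALSE null (true mean 0.5)
# and count the failures to reject.
type2 = sum(not one_sided_z_test([random.gauss(0.5, sigma) for _ in range(n)],
                                 mu0, sigma) for _ in range(trials)) / trials

print(f"estimated alpha ≈ {type1:.3f}")                       # near the nominal 0.05
print(f"estimated beta  ≈ {type2:.3f}, power ≈ {1 - type2:.3f}")
```

Rerunning with a larger `n` shrinks the estimated $\beta$ while $\alpha$ stays pinned at its nominal level, which is the sense in which only a larger sample reduces both errors at once.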

Type I Error: NASA engineers would prefer to waste some money and throw out an electronic circuit that is really fine than to use one on a spacecraft that is actually broken. This is an example of type I error that is acceptable.

## Significance Levels

If a test of significance gives a $\text{p}$-value lower than or equal to the significance level, the null hypothesis is rejected at that level.

### Learning Objectives

Outline the process for calculating a $\text{p}$-value and recognize its role in measuring the significance of a hypothesis test.

### Key Takeaways

#### Key Points

• Significance levels may be used either as a cutoff mark for a $\text{p}$-value or as a desired parameter in the test design.
• To compute a $\text{p}$-value from the test statistic, one must simply sum (or integrate over) the probabilities of more extreme events occurring.
• In some situations, it is convenient to express the complementary statistical significance (so 0.95 instead of 0.05), which corresponds to a quantile of the test statistic.
• Popular levels of significance are 10% (0.1), 5% (0.05), 1% (0.01), 0.5% (0.005), and 0.1% (0.001).
• The lower the significance level chosen, the stronger the evidence required.

#### Key Terms

• Student’s t-test: Any statistical hypothesis test in which the test statistic follows a Student’s $t$ distribution if the null hypothesis is supported.
• p-value: The probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.

A fixed number, most often 0.05, is referred to as a significance level or level of significance. Such a number may be used either as a cutoff mark for a $\text{p}$-value or as a desired parameter in the test design.

### $\text{p}$-Value

In brief, the (left-tailed) $\text{p}$-value is the quantile of the value of the test statistic, with respect to the sampling distribution under the null hypothesis. The right-tailed $\text{p}$-value is one minus the quantile, while the two-tailed $\text{p}$-value is twice whichever of these is smaller. Computing a $\text{p}$-value requires a null hypothesis, a test statistic (together with a decision about whether one is doing a one-tailed or a two-tailed test), and data. The key preparatory computation is the cumulative distribution function (CDF) of the sampling distribution of the test statistic under the null hypothesis, which may depend on parameters in the null distribution and on the number of samples in the data. The test statistic is then computed for the actual data, and its quantile is obtained by inputting it into the CDF. An example of a $\text{p}$-value graph is shown below.

$\text{p}$-Value Graph: Example of a $\text{p}$-value computation. The vertical coordinate is the probability density of each outcome, computed under the null hypothesis. The $\text{p}$-value is the area under the curve past the observed data point.

Hypothesis tests, such as Student’s $\text{t}$-test, typically produce test statistics whose sampling distributions under the null hypothesis are known. For instance, in the example of flipping a coin, the test statistic is the number of heads produced. This number follows a known binomial distribution if the coin is fair, and so the probability of any particular combination of heads and tails can be computed. To compute a $\text{p}$-value from the test statistic, one must simply sum (or integrate over) the probabilities of more extreme events occurring. For commonly used statistical tests, test statistics and their corresponding $\text{p}$-values are often tabulated in textbooks and reference works.
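The coin-flipping case can be computed directly by summing binomial tail probabilities; a sketch using only the standard library, where the choice of 8 heads in 10 flips is illustrative:

```python
from math import comb

def binomial_p_value(heads, flips, p_fair=0.5):
    """One-tailed p-value: the probability of at least `heads` heads in
    `flips` tosses, under H0: the coin is fair (head probability p_fair)."""
    return sum(comb(flips, k) * p_fair**k * (1 - p_fair)**(flips - k)
               for k in range(heads, flips + 1))

p = binomial_p_value(8, 10)
print(p)   # 56/1024 ≈ 0.0547: just misses significance at the 5% level
```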

### Using Significance Levels

Popular levels of significance are 10% (0.1), 5% (0.05), 1% (0.01), 0.5% (0.005), and 0.1% (0.001). If a test of significance gives a $\text{p}$-value lower than or equal to the significance level, the null hypothesis is rejected at that level. Such results are informally referred to as statistically significant (at the $\text{p}=0.05$ level, etc.). For example, if someone argues that “there’s only one chance in a thousand this could have happened by coincidence”, a 0.001 level of statistical significance is being stated. The lower the significance level chosen, the stronger the evidence required. The choice of significance level is somewhat arbitrary, but for many applications, a level of 5% is chosen by convention.

In some situations, it is convenient to express the complementary statistical significance (so 0.95 instead of 0.05), which corresponds to a quantile of the test statistic. In general, when interpreting a stated significance, one must be careful to make precise note of what is being tested statistically.

Different levels of cutoff trade off countervailing effects. Lower levels – such as 0.01 instead of 0.05 – are stricter and increase confidence in the determination of significance, but they run an increased risk of failing to reject a false null hypothesis. Evaluation of a given $\text{p}$-value of data requires a degree of judgment; and rather than a strict cutoff, one may instead simply consider lower $\text{p}$-values as more significant.

## Directional Hypotheses and One-Tailed Tests

A one-tailed hypothesis is one in which the value of a parameter is either above or equal to a certain value or below or equal to a certain value.

### Learning Objectives

Differentiate a one-tailed from a two-tailed hypothesis test.

### Key Takeaways

#### Key Points

• One-tailed and two-tailed tests are alternative ways of computing the statistical significance of a data set in terms of a test statistic, depending on whether only one direction is considered extreme (and unlikely) or both directions are considered extreme.
• The terminology “tail” is used because the extremes of distributions are often small, as in the normal distribution or “bell curve”.
• If the test statistic is always positive (or zero), only the one-tailed test is generally applicable, while if the test statistic can assume positive and negative values, both one-tailed and two-tailed tests are of use.
• Formulating the hypothesis as a “better than” comparison is said to give the hypothesis directionality.
• One-tailed tests are used for asymmetric distributions that have a single tail (such as the chi-squared distribution, which is common in measuring goodness-of-fit) or for one side of a distribution that has two tails (such as the normal distribution, which is common in estimating location).

#### Key Terms

• one-tailed hypothesis: a hypothesis in which the value of a parameter is specified as being either above or equal to a certain value or below or equal to a certain value
• null hypothesis: A hypothesis set up to be refuted in order to support an alternative hypothesis; presumed true until statistical evidence in the form of a hypothesis test indicates otherwise.

When putting together a hypothesis test, consideration of directionality is critical. The vast majority of hypothesis tests involve either a point hypothesis, a two-tailed hypothesis, or a one-tailed hypothesis. One-tailed and two-tailed tests are alternative ways of computing the statistical significance of a data set in terms of a test statistic, depending on whether only one direction is considered extreme (and unlikely) or both directions are considered extreme. The terminology “tail” is used because the extremes of distributions are often small, as in the normal distribution or “bell curve.” If the test statistic is always positive (or zero), only the one-tailed test is generally applicable, while if the test statistic can assume positive and negative values, both one-tailed and two-tailed tests are of use.

Two-Tailed Test: A two-tailed test corresponds to both extreme negative and extreme positive directions of the test statistic, here the normal distribution.

A one-tailed hypothesis is a hypothesis in which the value of a parameter is specified as being either:

• above or equal to a certain value, or
• below or equal to a certain value.

One-Tailed Test: A one-tailed test, showing the $\text{p}$-value as the size of one tail.

An example of a one-tailed null hypothesis, in the medical context, would be that an existing treatment, $\text{A}$, is no worse than a new treatment, $\text{B}$. The corresponding alternative hypothesis would be that $\text{B}$ is better than $\text{A}$. Here, if the null hypothesis is not rejected (i.e., there is no reason to reject the hypothesis that $\text{A}$ is at least as good as $\text{B}$), the conclusion would be that treatment $\text{A}$ should continue to be used. If the null hypothesis were rejected (i.e., there is evidence that $\text{B}$ is better than $\text{A}$), the result would be that treatment $\text{B}$ would be used in future. An appropriate hypothesis test would look for evidence that $\text{B}$ is better than $\text{A}$, not for evidence that the outcomes of treatments $\text{A}$ and $\text{B}$ are different. Formulating the hypothesis as a “better than” comparison is said to give the hypothesis directionality.

### Applications of One-Tailed Tests

One-tailed tests are used for asymmetric distributions that have a single tail (such as the chi-squared distribution, which is common in measuring goodness-of-fit) or for one side of a distribution that has two tails (such as the normal distribution, which is common in estimating location). This corresponds to specifying a direction. Two-tailed tests are only applicable when there are two tails, such as in the normal distribution, and correspond to considering either direction significant.

In the approach of Ronald Fisher, the null hypothesis $\text{H}_0$ will be rejected when the $\text{p}$-value of the test statistic is sufficiently extreme (in its sampling distribution) and thus judged unlikely to be the result of chance. In a one-tailed test, “extreme” is decided beforehand as meaning either “sufficiently small” or “sufficiently large” – values in the other direction are considered insignificant. In a two-tailed test, “extreme” means “either sufficiently small or sufficiently large”, and values in either direction are considered significant. For a given test statistic there is a single two-tailed test and two one-tailed tests (one for each direction). For data significant at a given level in a two-tailed test, the corresponding one-tailed test for the same test statistic will judge the result either twice as significant (half the $\text{p}$-value) if the data lie in the direction specified by the test, or not significant at all ($\text{p}$-value above 0.5) if the data lie in the opposite direction.

For example, if flipping a coin, testing whether it is biased towards heads is a one-tailed test. Getting data of “all heads” would be seen as highly significant, while getting data of “all tails” would not be significant at all ($\text{p}=1$). By contrast, testing whether it is biased in either direction is a two-tailed test, and either “all heads” or “all tails” would both be seen as highly significant data.
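The coin example can be made concrete with a short calculation using only Python's standard library. This is an illustrative sketch, not part of the original text; doubling the smaller tail for the two-tailed $\text{p}$-value assumes a symmetric null distribution, as with a fair coin.

```python
import math

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips of a fair coin."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, heads = 10, 10   # observed: all heads in 10 flips

# One-tailed test (biased towards heads): P(at least this many heads)
p_upper = sum(binom_pmf(k, n) for k in range(heads, n + 1))  # 1/1024

# Opposite one-tailed test (biased towards tails): P(at most this many heads)
p_lower = sum(binom_pmf(k, n) for k in range(0, heads + 1))  # equals 1 here

# Two-tailed test (biased in either direction): double the smaller tail
p_two_tailed = min(1.0, 2 * min(p_upper, p_lower))

print(p_upper, p_two_tailed)
```

With all heads, the one-tailed test toward heads gives a tiny $\text{p}$-value, the opposite one-tailed test gives $\text{p}=1$, and the two-tailed $\text{p}$-value is twice the one-tailed one.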

## Creating a Hypothesis Test

Creating a hypothesis test generally follows a five-step procedure.

### Learning Objectives

Design a hypothesis test utilizing the five steps listed in this text.

### Key Takeaways

#### Key Points

• The first step is to set up or assume a null hypothesis.
• The second step is to decide on an appropriate level of significance for assessing results.
• The third step is to decide between a one-tailed or a two-tailed statistical test.
• The fourth step is to interpret your results — namely, your $\text{p}$-value and observed test statistics.
• The final step is to write a report summarizing the statistical significance of your results.

#### Key Terms

• null hypothesis: A hypothesis set up to be refuted in order to support an alternative hypothesis; presumed true until statistical evidence in the form of a hypothesis test indicates otherwise.

The creation of a hypothesis test generally follows a five-step procedure as detailed below:

1. Set up or assume a statistical null hypothesis ($\text{H}_0$). Setting up a null hypothesis helps clarify the aim of the research. Such a hypothesis can also be assumed, given that null hypotheses, in general, are nil hypotheses and can be easily “reconstructed.” Examples of null hypotheses include:

• $\text{H}_0$: Given our sample results, we will be unable to infer a significant correlation between the dependent and independent research variables.
• $\text{H}_0$: It will not be possible to infer any statistically significant mean differences between the treatment and the control groups.
• $\text{H}_0$: We will not be able to infer that this variable’s distribution significantly departs from normality.

2. Decide on an appropriate level of significance for assessing results. Conventional levels are 5% ($\text{sig}<0.05$, meaning that results have a probability under the null hypothesis of less than 1 time in 20) or 1% ($\text{sig}<0.01$, meaning that results have a probability under the null hypothesis of less than 1 time in 100). However, the level of significance can be any “threshold” the researcher considers appropriate for the intended research (thus, it could be 0.02, 0.001, 0.0001, etc.). If required, label such a level of significance as “significance” or “sig” (i.e., $\text{sig}<0.05$). Avoid labeling it as “$\text{p}$” (so as not to confuse it with $\text{p}$-values) or as “alpha” or “$\alpha$” (so as not to confuse it with alpha tolerance errors).

3. Decide between a one-tailed or a two-tailed statistical test. A one-tailed test assesses whether the observed results are either significantly higher or smaller than the null hypothesis, but not both. Thus, one-tailed tests are appropriate when testing that results will only be higher or smaller than null results, or when the only interest is in interventions that will result in higher or smaller outputs. A two-tailed test, on the other hand, assesses both possibilities at once. It does so by dividing the total level of significance between both tails, which also implies that it is more difficult to get significant results than with a one-tailed test. Thus, two-tailed tests are appropriate when the direction of the results is not known, or when the researcher wants to check both possibilities in order to prevent making mistakes.

Two-Tailed Statistical Test: This image shows a graph representation of a two-tailed hypothesis test.

4. Interpret results:

• Obtain and report the probability of the data. It is recommended to use the exact probability of the data, that is the ‘$\text{p}$-value’ (e.g., $\text{p}=0.011$, or $\text{p}=0.51$). This exact probability is normally provided together with the pertinent statistic test ($\text{z}$, $\text{t}$, $\text{F}$…).
• $\text{p}$-values can be interpreted as the probability of getting the observed or more extreme results under the null hypothesis (e.g., $\text{p}=0.033$ means that 3.3 times in 100, or about 1 time in 30, we would obtain results as extreme as the observed ones through normal [or random] fluctuation under the null).
• $\text{p}$-values are considered statistically significant if they are equal to or smaller than the chosen significance level. This is the actual test of significance, as it interprets those $\text{p}$-values falling beyond the threshold as “rare” enough as to deserve attention.
• If results are accepted as statistically significant, it can be inferred that the null hypothesis is not explanatory enough for the observed data.
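The significance check in this step reduces to a single comparison, sketched below with illustrative values (`sig` and `p_value` here are hypothetical placeholders, not quantities from the text):

```python
sig = 0.05       # significance level chosen in step 2
p_value = 0.033  # exact probability of the data under the null (illustrative)

# The test of significance: p-values at or below the threshold are "rare"
# enough to deserve attention.
if p_value <= sig:
    decision = "statistically significant: reject the null hypothesis"
else:
    decision = "not significant: fail to reject the null hypothesis"

print(decision)
```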

5. Write Up the Report:

• All test statistics and associated exact $\text{p}$-values can be reported as descriptive statistics, independently of whether they are statistically significant or not.
• Significant results can be reported in the line of “either an exceptionally rare chance has occurred, or the theory of random distribution is not true.”
• Significant results can also be reported in the line of “without the treatment I administered, experimental results as extreme as the ones I obtained would occur only about 3 times in 1000. Therefore, I conclude that my treatment has a definite effect.” Further, “this correlation is so extreme that it would only occur about 1 time in 100 ($\text{p}=0.01$). Thus, it can be inferred that there is a significant correlation between these variables.”

## Testing a Single Proportion

Here we will evaluate an example of hypothesis testing for a single proportion.

### Learning Objectives

Construct and evaluate a hypothesis test for a single proportion.

### Key Takeaways

#### Key Points

• Our hypothesis test involves the following steps: stating the question, planning the test, stating the hypotheses, determining whether the test criteria are met, and computing the test statistic.
• We continue the test by: determining the critical region, sketching the test statistic and critical region, determining the $\text{p}$-value, stating whether we reject or fail to reject the null hypothesis, and making meaningful conclusions.
• Our example revolves around Michele, a statistics student who replicates a study conducted by Cell Phone Market Research Company in 2010 that found that 30% of households in the United States own at least three cell phones.
• Michele tests to see if the proportion of households owning at least three cell phones in her home town is higher than the national average.
• The sample data does not show sufficient evidence that the percentage of households in Michele’s city that have at least three cell phones is more than 30%; therefore, we do not have strong evidence against the null hypothesis.

#### Key Terms

• null hypothesis: A hypothesis set up to be refuted in order to support an alternative hypothesis; presumed true until statistical evidence in the form of a hypothesis test indicates otherwise.

### Hypothesis Test for a Single Proportion

For an example of a hypothesis test for a single proportion, consider the following. Cell Phone Market Research Company conducted a national survey in 2010 and found that 30% of households in the United States owned at least three cell phones. Michele, a statistics student, decides to replicate this study where she lives. She conducts a random survey of 150 households in her town and finds that 53 own at least three cell phones. Is this strong evidence that the proportion of households in Michele’s town that own at least three cell phones is more than the national percentage? Test at a 5% significance level.

1. State the question: State what we want to determine and what level of confidence is important in our decision.

We are asked to test the hypothesis that the proportion of households that own at least three cell phones is more than 30%. The parameter of interest, $\text{p}$, is the proportion of households that own at least three cell phones.

2. Plan: Based on the above question(s) and the answers to the following questions, decide which test you will be performing. Is the problem about numerical or categorical data? If the data is numerical, is the population standard deviation known? Do you have one group or two groups?

We have univariate, categorical data. Therefore, we can perform a one proportion $\text{z}$-test to test this belief. Our model will be:

$\displaystyle \text{N}\left( { \text{p} }_{ 0 },\sqrt { \frac { { \text{p} }_{ 0 }(1-{ \text{p} }_{ 0 }) }{ \text{n} } } \right) =\text{N}\left( 0.3,\sqrt { \frac { 0.3(1-0.3) }{ 150 } } \right)$

3. Hypotheses: State the null and alternative hypotheses in words then in symbolic form:

• Express the hypothesis to be tested in symbolic form.
• Write a symbolic expression that must be true when the original claim is false.
• The null hypothesis is the statement which includes the equality.
• The alternative hypothesis is the statement without the equality.

Null Hypothesis in words: The null hypothesis is that the true population proportion of households that own at least three cell phones is equal to 30%.

Null Hypothesis symbolically: $\text{H}_0: \text{p}=30\%$

Alternative Hypothesis in words: The alternative hypothesis is that the population proportion of households that own at least three cell phones is more than 30%.

Alternative Hypothesis symbolically: $\text{H}_\text{a}: \text{p}>30\%$

4. The criteria for the inferential test stated above: Think about the assumptions and check the conditions.

Randomization Condition: The problem tells us Michele uses a random sample.

Independence Assumption: When we know we have a random sample, it is likely that outcomes are independent. There is no reason to think how many cell phones one household owns has any bearing on the next household.

10% Condition: We will assume that the city in which Michele lives is large and that 150 households is less than 10% of all households in her community.

Success/Failure: $\text{n}{ \text{p} }_{ 0 } > 10$ and $\text{n}(1-{ \text{p} }_{ 0 })>10$

To meet this condition, both the success and failure products must be larger than 10 (${ \text{p} }_{ 0 }$ is the value of the null hypothesis in decimal form.)

$0.3(150) = 45>10$ and $(1-0.3)(150) = 105>10$

5. Compute the test statistic:

The conditions are satisfied, so we will use a hypothesis test for a single proportion to test the null hypothesis. For this calculation we need the sample proportion, $\hat{\text{p}}$:

$\displaystyle \hat { \text{p} } =\frac { 53 }{ 150 } =0.3533$,

$\displaystyle \text{z}=\frac { \hat { \text{p} } -{ \text{p} }_{ 0 } }{ \sqrt { \dfrac { { \text{p} }_{ 0 }(1-{ \text{p} }_{ 0 }) }{ \text{n} } } } =\frac { 0.3533-0.3 }{ \sqrt { \dfrac { 0.3(1-0.3) }{ 150 } } } =\frac { 0.0533 }{ 0.0374 } =1.425$.

6. Determine the Critical Region(s): Based on our hypotheses are we performing a left-tailed, right tailed or two-tailed test?

We will perform a right-tailed test, since we are only concerned with the proportion being more than 30% of households.

7. Sketch the test statistic and critical region: Look up the probability in the table; the critical region is shown below.

Critical Region: This image shows a graph of the critical region for the test statistic in our example.

8. Determine the $\text{p}$-value:

\begin{align} \text{p}\text{-value} &= \text{P}(\text{z}>1.425) \\ &= 1-\text{P}(\text{z}<1.425)\\ &= 1-0.923 \\ &= 0.077 \end{align}
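Steps 5–8 can be reproduced with Python's standard library. This is an illustrative sketch; `normal_cdf` is a small helper defined here via `math.erf`, not a function from the original text.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, successes, p0 = 150, 53, 0.30    # Michele's sample and the null proportion

p_hat = successes / n                # sample proportion, ~0.3533
se = math.sqrt(p0 * (1 - p0) / n)    # standard error under the null
z = (p_hat - p0) / se                # test statistic, ~1.425
p_value = 1 - normal_cdf(z)          # right-tailed p-value, ~0.077

print(round(z, 3), round(p_value, 3))
```

Since the resulting $\text{p}$-value of about 0.077 exceeds 0.05, the calculation agrees with the conclusion of the worked example: fail to reject the null hypothesis.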

9. State whether you reject or fail to reject the null hypothesis:

Since the probability is greater than the critical value of 5%, we will fail to reject the null hypothesis.

10. Conclusion: Interpret your result in the proper context, and relate it to the original question.

Since the probability is greater than 5%, this is not considered a rare event and the large probability tells us not to reject the null hypothesis. The $\text{p}$-value tells us that there is a 7.7% chance of obtaining our sample percentage of 35.33% if the null hypothesis is true. The sample data do not show sufficient evidence that the percentage of households in Michele’s city that have at least three cell phones is more than 30%. We do not have strong evidence against the null hypothesis.

Note that if evidence exists in support of rejecting the null hypothesis, the following steps are then required:

11. Calculate and display your confidence interval for the alternative hypothesis.

## Testing a Single Mean

In this section we will evaluate an example of hypothesis testing for a single mean.

### Learning Objectives

Construct and evaluate a hypothesis test for a single mean.

### Key Takeaways

#### Key Points

• Our hypothesis test involves the following steps: stating the question, planning the test, stating the hypotheses, determining whether the test criteria are met, and computing the test statistic.
• We continue the test by: determining the critical region, sketching the test statistic and critical region, determining the $\text{p}$-value, stating whether we reject or fail to reject the null hypothesis, and making meaningful conclusions.
• Our example revolves around statistics students who believe that the mean score on the first statistics test is 65, while a statistics instructor thinks the mean score is lower than 65.
• Since the resulting probability is greater than the critical value of 5%, we will fail to reject the null hypothesis.

#### Key Terms

• null hypothesis: A hypothesis set up to be refuted in order to support an alternative hypothesis; presumed true until statistical evidence in the form of a hypothesis test indicates otherwise.

### A Hypothesis Test for a Single Mean—Standard Deviation Unknown

As an example of a hypothesis test for a single mean, consider the following. Statistics students believe that the mean score on the first statistics test is 65. A statistics instructor thinks the mean score is lower than 65. He randomly samples 10 statistics student scores and obtains the scores [62, 54, 64, 58, 70, 67, 63, 59, 69, 64]. He performs a hypothesis test using a 5% level of significance.

1. State the question: State what we want to determine and what level of significance is important in your decision.

We are asked to test the hypothesis that the mean statistics score, $\mu$, is less than 65. We do not know the population standard deviation. The significance level is 5%.

2. Plan: Based on the above question(s) and the answers to the following questions, decide which test you will be performing. Is the problem about numerical or categorical data? If the data is numerical, is the population standard deviation known? Do you have one group or two groups? What type of model is this?

We have univariate, quantitative data. We have a sample of 10 scores. We do not know the population standard deviation. Therefore, we can perform a Student’s $\text{t}$-test with $\text{n}-1=9$ degrees of freedom. Our model will be:

$\displaystyle \overline { \text{X} } \sim \text{T}\left( \mu,\frac { \text{s} }{ \sqrt { \text{n} } } \right) =\text{T}\left( 65,\frac { 5.0111 }{ \sqrt { 10 } } \right)$

3. Hypotheses: State the null and alternative hypotheses in words and then in symbolic form. Express the hypothesis to be tested in symbolic form. Write a symbolic expression that must be true when the original claim is false. The null hypothesis is the statement which includes the equality. The alternative hypothesis is the statement without the equality.

Null hypothesis in words: The null hypothesis is that the true mean of the statistics exam is equal to 65.

Null hypothesis symbolically: $\text{H}_0: \mu =65$

Alternative hypothesis in words: The alternative is that the true mean statistics score on average is less than 65.

Alternative hypothesis symbolically: $\text{H}_\text{a}: \mu <65$

4. The criteria for the inferential test stated above: Think about the assumptions and check the conditions. If your assumptions include the need for particular types of data distribution, construct appropriate graphs or charts.

Randomization Condition: The sample is a random sample.

Independence Assumption: It is reasonable to think that the scores of students are independent in a random sample. There is no reason to think the score of one exam has any bearing on the score of another exam.

10% Condition: We assume the number of statistics students is more than 100, so 10 scores is less than 10% of the population.

Nearly Normal Condition: We should look at a histogram and boxplot for this, both shown below.

Histogram: This figure shows a histogram for the dataset in our example.

Boxplot: This figure shows a boxplot for the dataset in our example.

Since there are no outliers and the histogram is bell shaped, the condition is satisfied.

Sample Size Condition: Since the distribution of the scores is normal, our sample of 10 scores is large enough.

5. Compute the test statistic:

The conditions are satisfied and $\sigma$ is unknown, so we will use a hypothesis test for a mean with unknown standard deviation. We need the sample mean, sample standard deviation, and standard error (SE).

$\overline { \text{x} } =63;\text{s}=5.0111;\text{n}=10;$

$\displaystyle { \text{SE} =\left( \frac { \text{s} }{ \sqrt { \text{n} } } \right) =\left( \frac { 5.0111 }{ \sqrt { 10 } } \right) =1.585}$

$\overline { \text{x} } =63;\text{df}=10-1=9;$

$\displaystyle \text{t}=\frac { \overline { \text{x} } -\mu }{ \frac { \text{s} }{ \sqrt { \text{n} } } } =\frac { 63-65 }{ 1.585 } =-1.2618$.
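The sample statistics and the $\text{t}$ statistic above can be checked with Python's standard library. This is an illustrative sketch; carrying full precision gives $\text{t}\approx -1.262$, matching the rounded value in the text.

```python
import math
import statistics

scores = [62, 54, 64, 58, 70, 67, 63, 59, 69, 64]
mu0 = 65                              # hypothesized mean under the null

n = len(scores)
x_bar = statistics.mean(scores)       # sample mean, 63
s = statistics.stdev(scores)          # sample standard deviation, ~5.0111
se = s / math.sqrt(n)                 # standard error, ~1.585
t = (x_bar - mu0) / se                # test statistic, ~-1.26, df = n - 1 = 9

print(x_bar, round(s, 4), round(t, 4))
```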

6. Determine the Critical Region(s): Based on your hypotheses, should we perform a left-tailed, right-tailed, or two-sided test?

We will perform a left-tailed test, since we are only concerned with the score being less than 65.

7. Sketch the test statistic and critical region: Look up the probability in the table; the critical region is shown below.

Critical Region: This graph shows the critical region for the test statistic in our example.

8. Determine the $\text{P}$-value:

$\text{P}(\text{t}<-1.2618) > 0.10$

9. State whether you reject or fail to reject the null hypothesis:

Since the probability is greater than the critical value of 5%, we will fail to reject the null hypothesis.

10. Conclusion: Interpret your result in the proper context, and relate it to the original question.

Since the probability is greater than 5%, this is not considered a rare event and the large probability tells us not to reject the null hypothesis. It is likely that the average statistics score is 65. The $\text{p}$-value tells us that there is more than a 10% chance of obtaining our sample mean of 63 if the null hypothesis is true. This is not a rare event. We conclude that the sample data do not show sufficient evidence that the mean score is less than 65. We do not have strong evidence against the null hypothesis.

## Testing a Single Variance

In this section we will evaluate an example of hypothesis testing for a single variance.

### Learning Objectives

Construct and evaluate a hypothesis test for a single variance.

### Key Takeaways

#### Key Points

• A test of a single variance assumes that the underlying distribution is normal.
• The null and alternate hypotheses are stated in terms of the population variance (or population standard deviation ).
• A test of a single variance may be right-tailed, left-tailed, or two-tailed.

#### Key Terms

• variance: a measure of how far a set of numbers is spread out
• null hypothesis: A hypothesis set up to be refuted in order to support an alternative hypothesis; presumed true until statistical evidence in the form of a hypothesis test indicates otherwise.

A test of a single variance assumes that the underlying distribution is normal. The null and alternate hypotheses are stated in terms of the population variance (or population standard deviation). The test statistic is:

$\dfrac { (\text{n}-1)\cdot { \text{s} }^{ 2 } }{ { \sigma }^{ 2 } }$

where:

$\text{n}$ is the total number of data,

${ \text{s} }^{ 2 }$ is the sample variance, and

${ \sigma }^{ 2 }$ is the population variance.

We may think of $\text{s}$ as the random variable in this test. The degrees of freedom are $\text{df}=\text{n}-1$.

A test of a single variance may be right-tailed, left-tailed, or two-tailed.

The following example shows how to set up the null hypothesis and alternate hypothesis. The null and alternate hypotheses contain statements about the population variance.

### Examples

#### Example 1

Math instructors are not only interested in how their students do on exams, on average, but how the exam scores vary. To many instructors, the variance (or standard deviation) may be more important than the average.

Suppose a math instructor believes that the standard deviation for his final exam is 5 points. One of his best students thinks otherwise. The student claims that the standard deviation is more than 5 points. If the student were to conduct a hypothesis test, what would the null and alternate hypotheses be?

#### Solution

Even though we are given the population standard deviation, we can set the test up using the population variance as follows.

${ \text{H} }_{ 0 }:{ \sigma }^{ 2 }={ 5 }^{ 2 }$

${ \text{H} }_{ \text{a} }:{ \sigma }^{ 2 }>{ 5 }^{ 2 }$

#### Example 2

With individual lines at its various windows, a post office finds that the standard deviation for normally distributed waiting times for customers on Friday afternoon is 7.2 minutes. The post office experiments with a single main waiting line and finds that for a random sample of 25 customers, the waiting times for customers have a standard deviation of 3.5 minutes.

With a significance level of 5%, test the claim that a single line causes lower variation among waiting times (shorter waiting times) for customers.

#### Solution

Since the claim is that a single line causes lower variation, this is a test of a single variance. The parameter is the population variance, $\sigma^2$, or the population standard deviation, $\sigma$.

Random Variable: The sample standard deviation, $\text{s}$, is the random variable. Let $\text{s}$ be the standard deviation for the waiting times.

• ${ \text{H} }_{ 0 }:{ \sigma }^{ 2 }={ 7.2 }^{ 2 }$
• ${ \text{H} }_{ \text{a} }:{ \sigma }^{ 2 }<{ 7.2 }^{ 2 }$

The word “lower” tells you this is a left-tailed test.

Distribution for the test: ${ \chi }_{ 24 }^{ 2 }$, where:

• $\text{n}$ is the number of customers sampled
• $\text{df} = \text{n}-1 = 25-1 = 24$

Calculate the test statistic:

$\displaystyle { \chi }^{ 2 }=\frac { (\text{n}-1)\cdot { \text{s} }^{ 2 } }{ { \sigma }^{ 2 } } =\frac { (25-1)\cdot { 3.5 }^{ 2 } }{ 7.2^{ 2 } } =5.67$

where $\text{n}=25$, $\text{s}=3.5$, and $\sigma = 7.2$.
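As a quick check, the statistic can be computed directly in a few lines (an illustrative sketch of the same arithmetic):

```python
# Test of a single variance: chi-square statistic (n-1)*s^2 / sigma^2, df = n-1.
n, s, sigma0 = 25, 3.5, 7.2          # sample size, sample and hypothesized std devs

chi_sq = (n - 1) * s**2 / sigma0**2  # ~5.67 with df = 24
print(round(chi_sq, 2))
```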

Graph:

Critical Region: This image shows the graph of the critical region in our example.

Probability statement: $\text{p}\text{-value} = \text{P}({ \chi }^{ 2 } < 5.67) = 0.000042$

Compare $\alpha$ and the $\text{p}$-value: $\alpha =0.05; \ \text{p}\text{-value} = 0.000042; \ \alpha > \text{p}\text{-value}$

Make a decision: Since $\alpha > \text{p}\text{-value}$, reject $\text{H}_0$. This means that we reject $\sigma^2 = 7.2^2$. In other words, we do not think the standard deviation of waiting times is 7.2 minutes; rather, we think it is lower.

Conclusion: At a 5% level of significance, from the data, there is sufficient evidence to conclude that a single line causes a lower variation among the waiting times; or, with a single line, the customer waiting times vary less than 7.2 minutes.