Further Considerations for Data

The Sample Average

The sample average/mean can be calculated by taking the sum of every piece of data and dividing that sum by the total number of data points.

Learning Objectives

Distinguish the sample mean from the population mean.

Key Takeaways

Key Points

  • The sample mean makes a good estimator of the population mean, as its expected value is equal to the population mean. The law of large numbers dictates that the larger the size of the sample, the more likely it is that the sample mean will be close to the population mean.
  • The sample mean is itself a random variable, not a constant, and consequently it has its own distribution.
  • The mean is the arithmetic average of a set of values, or distribution; however, for skewed distributions, the mean is not necessarily the same as the middle value ( median ), or the most likely ( mode ).

Key Terms

  • finite: limited, constrained by bounds, having an end
  • random variable: a quantity whose value is random and to which a probability distribution is assigned, such as the possible outcome of a roll of a die

Sample Average vs. Population Average

The sample average (also called the sample mean) is often referred to as the arithmetic mean of a sample, or simply [latex]\bar{\text{x}}[/latex] (pronounced “x bar”). The mean of a population is denoted [latex]\mu[/latex], known as the population mean. The sample mean makes a good estimator of the population mean, as its expected value is equal to the population mean. The sample mean is itself a random variable, not a constant, and consequently it has its own distribution. For a random sample of [latex]\text{n}[/latex] observations from a normally distributed population, the sample mean is distributed as:

[latex]\displaystyle \bar{\text{x}}\sim \text{N}\left( \mu,\frac{\sigma ^2}{\text{n}}\right)[/latex]

For a finite population, the population mean of a property is equal to the arithmetic mean of the given property over every member of the population. For example, the population mean height is equal to the sum of the heights of every individual divided by the total number of individuals. The sample mean may differ from the population mean, especially for small samples. The law of large numbers dictates that the larger the size of the sample, the more likely it is that the sample mean will be close to the population mean.
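The law of large numbers can be illustrated with a short simulation. This is only a sketch; the uniform population on [0, 100] (whose mean is 50) and the two sample sizes are arbitrary choices for illustration:

```python
# Sketch: sample means approach the population mean as the sample
# size grows (law of large numbers). The uniform population is an
# arbitrary illustrative choice.
import random

random.seed(0)
population_mean = 50.0  # mean of a uniform distribution on [0, 100]

def sample_mean(n):
    # Draw n observations from the uniform population and average them.
    return sum(random.uniform(0, 100) for _ in range(n)) / n

small = abs(sample_mean(10) - population_mean)
large = abs(sample_mean(100_000) - population_mean)
print(small, large)  # the large-sample error is typically far smaller
```

With 100,000 observations the sample mean lands within a small fraction of a unit of the population mean, while a sample of 10 can miss by several units.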

Calculation of the Sample Mean

The arithmetic mean is the “standard” average, often simply called the “mean”. It can be calculated by taking the sum of every piece of data and dividing that sum by the total number of data points:

[latex]\displaystyle \bar{\text{x}}= \frac{1}{\text{n}}\cdot \sum_{\text{i}=1}^{\text{n}}\text{x}_{\text{i}}[/latex]

For example, the arithmetic mean of five values: 4, 36, 45, 50, 75 is:

[latex]\displaystyle \frac{4+36+45+50+75}{5}=\frac{210}{5}= 42[/latex]
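The same calculation can be checked in a couple of lines of Python:

```python
# The arithmetic mean of the five values from the text's example.
values = [4, 36, 45, 50, 75]
mean = sum(values) / len(values)
print(mean)  # 42.0
```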

The mean may often be confused with the median, mode or range. The mean is the arithmetic average of a set of values, or distribution; however, for skewed distributions, the mean is not necessarily the same as the middle value (median), or the most likely (mode). For example, mean income is skewed upwards by a small number of people with very large incomes, so that the majority have an income lower than the mean. By contrast, the median income is the level at which half the population is below and half is above. The mode income is the most likely income, and favors the larger number of people with lower incomes. The median or mode are often more intuitive measures of such data.
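A small sketch with invented income figures shows how a few large values pull the mean above the median and mode, so that most individuals fall below the mean:

```python
# Sketch with made-up incomes: one very large value pulls the mean
# above the median and mode, so most people earn less than the mean.
import statistics

incomes = [20_000, 25_000, 25_000, 30_000, 35_000, 40_000, 500_000]
mean = statistics.mean(incomes)
median = statistics.median(incomes)
mode = statistics.mode(incomes)
below_mean = sum(1 for x in incomes if x < mean)
print(mean, median, mode, below_mean)
```

Here the mean is roughly 96,000 while the median is 30,000 and the mode 25,000; six of the seven incomes lie below the mean.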


Measures of Central Tendency: This graph shows where the mean, median, and mode fall in two different distributions (one is slightly skewed left and one is highly skewed right).

Which Standard Deviation (SE)?

Although they are often used interchangeably, the standard deviation and the standard error are slightly different.

Learning Objectives

Differentiate between standard deviation and standard error.

Key Takeaways

Key Points

  • Standard error is an estimate of how close to the population mean your sample mean is likely to be, whereas standard deviation is the degree to which individuals within the sample differ from the sample mean.
  • Standard deviation (represented by the symbol sigma, σ) shows how much variation or dispersion exists from the average (mean), or expected value.
  • The standard error is the standard deviation of the sampling distribution of a statistic, such as the mean.
  • Standard error should decrease with larger sample sizes, as the estimate of the population mean improves. Standard deviation will be unaffected by sample size.

Key Terms

  • sample mean: the mean of a sample of random variables taken from the entire population of those variables
  • central limit theorem: The theorem stating that, given a large number of independent, identically distributed random variables with finite variance, their sum will be (approximately) normally distributed.
  • standard error: The standard deviation of the sampling distribution of a statistic, such as the sample mean.

The standard error is the standard deviation of the sampling distribution of a statistic. The term may also be used to refer to an estimate of that standard deviation, derived from a particular sample used to compute the estimate.

For example, the sample mean is the usual estimator of a population mean. However, different samples drawn from that same population would in general have different values of the sample mean. The standard error of the mean (i.e., of using the sample mean as a method of estimating the population mean) is the standard deviation of those sample means over all possible samples (of a given size) drawn from the population. Secondly, the standard error of the mean can refer to an estimate of that standard deviation, computed from the sample of data being analyzed at the time.

In scientific and technical literature, experimental data are often summarized using either the mean and standard deviation or the mean and standard error. This often leads to confusion about their interchangeability. However, the mean and standard deviation are descriptive statistics, whereas the mean and standard error describe bounds on a random sampling process. Despite the small difference in the equations for the standard deviation and the standard error, this small difference changes what is being reported from a description of the variation in measurements to a probabilistic statement about how larger samples provide a better bound on estimates of the population mean, in light of the central limit theorem. Put simply, standard error is an estimate of how close to the population mean your sample mean is likely to be, whereas standard deviation is the degree to which individuals within the sample differ from the sample mean. Standard error should decrease with larger sample sizes, as the estimate of the population mean improves. Standard deviation will be unaffected by sample size.
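A short simulation illustrates the contrast. The normal population with mean 100 and SD 10 is an arbitrary illustrative choice: the sample standard deviation stabilizes near 10 as the sample grows, while the standard error of the mean keeps shrinking:

```python
# Sketch: standard deviation stabilizes as n grows, while the
# standard error of the mean (s / sqrt(n)) shrinks.
import math
import random
import statistics

random.seed(1)

def sd_and_se(n):
    sample = [random.gauss(100, 10) for _ in range(n)]
    s = statistics.stdev(sample)   # sample standard deviation
    se = s / math.sqrt(n)          # standard error of the mean
    return s, se

sd_small, se_small = sd_and_se(25)
sd_large, se_large = sd_and_se(2_500)
print(sd_small, se_small)
print(sd_large, se_large)
```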


Standard Deviation: This is an example of two sample populations with the same mean and different standard deviations. The red population has mean 100 and SD 10; the blue population has mean 100 and SD 50.

Estimating the Accuracy of an Average

The standard error of the mean is the standard deviation of the sample mean’s estimate of a population mean.

Learning Objectives

Evaluate the accuracy of an average by finding the standard error of the mean.

Key Takeaways

Key Points

  • Any measurement is subject to error by chance, which means that if the measurement was taken again it could possibly show a different value.
  • In general terms, the standard error is the standard deviation of the sampling distribution of a statistic.
  • The standard error of the mean is usually estimated by the sample estimate of the population standard deviation (sample standard deviation) divided by the square root of the sample size.
  • The standard error and standard deviation of small samples tend to systematically underestimate the population standard error and deviation, because the sample standard deviation is a biased estimator of the population standard deviation.
  • The standard error is an estimate of how close the population mean will be to the sample mean, whereas standard deviation is the degree to which individuals within the sample differ from the sample mean.

Key Terms

  • confidence interval: A type of interval estimate of a population parameter used to indicate the reliability of an estimate.
  • central limit theorem: The theorem stating that, given a large number of independent, identically distributed random variables with finite variance, their sum will be (approximately) normally distributed.
  • standard error: The standard deviation of the sampling distribution of a statistic, such as the sample mean.

Any measurement is subject to error by chance, meaning that if the measurement was taken again, it could possibly show a different value. We calculate the standard deviation in order to estimate the chance error for a single measurement. Taken further, we can calculate the chance error of the sample mean to estimate its accuracy in relation to the overall population mean.

Standard Error

In general terms, the standard error is the standard deviation of the sampling distribution of a statistic. The term may also be used to refer to an estimate of that standard deviation, derived from a particular sample used to compute the estimate. For example, the sample mean is the standard estimator of a population mean. However, different samples drawn from that same population would, in general, have different values of the sample mean.


Standard Deviation as Standard Error: For a value that is sampled with an unbiased normally distributed error, the graph depicts the proportion of samples that would fall between 0, 1, 2, and 3 standard deviations above and below the actual value.

The standard error of the mean (i.e., standard error of using the sample mean as a method of estimating the population mean) is the standard deviation of those sample means over all possible samples (of a given size) drawn from the population. Secondly, the standard error of the mean can refer to an estimate of that standard deviation, computed from the sample of data being analyzed at the time.

In practical applications, the true value of the standard deviation (of the error) is usually unknown. As a result, the term standard error is often used to refer to an estimate of this unknown quantity. In such cases, it is important to clarify one’s calculations, and take proper account of the fact that the standard error is only an estimate.

Standard Error of the Mean

As mentioned, the standard error of the mean (SEM) is the standard deviation of the sample-mean’s estimate of a population mean. It can also be viewed as the standard deviation of the error in the sample mean relative to the true mean, since the sample mean is an unbiased estimator. Generally, the SEM is the sample estimate of the population standard deviation (sample standard deviation) divided by the square root of the sample size:

[latex]\displaystyle \text{S}{ \text{E} }_{ \bar { \text{x} } }=\frac { \text{s} }{ \sqrt { \text{n} } }[/latex]

Where s is the sample standard deviation (i.e., the sample-based estimate of the standard deviation of the population), and [latex]\text{n}[/latex] is the size (number of observations) of the sample. This estimate may be compared with the formula for the true standard deviation of the sample mean:

[latex]\displaystyle \text{S}{ \text{D} }_{ \bar { \text{x} } }=\frac { \sigma }{ \sqrt { \text{n} } }[/latex]

Where [latex]\sigma[/latex] is the standard deviation of the population. Note that the standard error and the standard deviation of small samples tend to systematically underestimate the population standard error and deviation, because the sample standard deviation is a biased estimator of the population standard deviation. For example, with [latex]\text{n}=2[/latex], the underestimate is about 25%, but for [latex]\text{n}=6[/latex], the underestimate is only 5%. As a practical result, decreasing the uncertainty in a mean value estimate by a factor of two requires acquiring four times as many observations in the sample. Decreasing standard error by a factor of ten requires a hundred times as many observations.
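The SEM formula and the square-root cost of extra precision can be sketched directly. The sample standard deviation of 12 below is just an illustrative value:

```python
# Sketch: the SEM formula s / sqrt(n). Quadrupling n halves the SEM
# (for the same sample standard deviation).
import math

def sem(s, n):
    return s / math.sqrt(n)

s = 12.0  # illustrative sample standard deviation
print(sem(s, 25))   # 2.4
print(sem(s, 100))  # 1.2: four times the observations, half the SEM
```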

Assumptions and Usage

If the data are assumed to be normally distributed, quantiles of the normal distribution and the sample mean and standard error can be used to calculate approximate confidence intervals for the mean. In particular, the standard error of a sample statistic (such as sample mean) is the estimated standard deviation of the error in the process by which it was generated. In other words, it is the standard deviation of the sampling distribution of the sample statistic.

Standard errors provide simple measures of uncertainty in a value and are often used for the following reasons:

  • If the standard error of several individual quantities is known, then the standard error of some function of the quantities can be easily calculated in many cases.
  • Where the probability distribution of the value is known, it can be used to calculate a good approximation to an exact confidence interval.
  • Where the probability distribution is unknown, relationships of inequality can be used to calculate a conservative confidence interval.
  • As the sample size tends to infinity, the central limit theorem guarantees that the sampling distribution of the mean is asymptotically normal.
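Assuming normality, the first two points above can be sketched as an approximate 95% confidence interval built from the sample mean and standard error. The data are invented, and the normal quantile 1.96 is used without a small-sample t correction:

```python
# Sketch: approximate 95% confidence interval for a mean, assuming
# normality and using the normal quantile 1.96. Data are invented.
import math
import statistics

data = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4]
mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(len(data))
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(mean, ci)
```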

Chance Models

A stochastic model is used to estimate probability distributions of potential outcomes by allowing for random variation in one or more inputs over time.

Learning Objectives

Support the idea that stochastic modeling provides a better representation of real life by building randomness into a simulation.

Key Takeaways

Key Points

  • Accurately determining the standard error of the mean depends on the presence of chance.
  • Stochastic modeling builds volatility and variability (randomness) into a simulation and, therefore, provides a better representation of real life from more angles.
  • Stochastic models help to assess the interactions between variables and are useful tools to numerically evaluate quantities.

Key Terms

  • stochastic: random; randomly determined
  • Monte Carlo simulation: a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results, i.e., running simulations many times over in order to estimate probabilities and other quantities

The calculation of the standard error of the mean for repeated measurements is easily carried out on a data set; however, this method for determining error is only viable when the data varies as if drawing a name out of a hat. In other words, the data should be completely random, and should not show a trend or pattern over time. Therefore, accurately determining the standard error of the mean depends on the presence of chance.

Stochastic Modeling

“Stochastic” means being or having a random variable. A stochastic model is a tool for estimating probability distributions of potential outcomes by allowing for random variation in one or more inputs over time. The random variation is usually based on fluctuations observed in historical data for a selected period using standard time-series techniques. Distributions of potential outcomes are derived from a large number of simulations (stochastic projections) which reflect the random variation in the input(s).

In order to understand stochastic modeling, consider the example of an insurance company projecting potential claims. Like any other company, an insurer has to show that its assets exceed its liabilities to be solvent. In the insurance industry, however, assets and liabilities are not known entities. They depend on how many policies result in claims, inflation from now until the claim, investment returns during that period, and so on. So the valuation of an insurer involves a set of projections, looking at what is expected to happen, and thus coming up with the best estimate for assets and liabilities.

In the case of the insurance company, a stochastic model would set up a projection model which looks at a single policy, an entire portfolio, or an entire company. But rather than setting investment returns according to their most likely estimate, for example, the model uses random variations to look at what investment conditions might be like. Based on a set of random outcomes, the experience of the policy/portfolio/company is projected, and the outcome is noted. This is done again with a new set of random variables. In fact, this process is repeated thousands of times.

At the end, a distribution of outcomes is available which shows not only the most likely estimate but what ranges are reasonable, too. The most likely estimate is given by the center of mass of the distribution curve (formally known as the probability density function), which is typically also the mode of the curve. Stochastic modeling builds volatility and variability (randomness) into a simulation and, therefore, provides a better representation of real life from more angles.
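This simulate-and-repeat loop can be sketched minimally in Python. All figures, the 5% mean return and 10% volatility in particular, are invented assumptions, not industry values:

```python
# Minimal sketch of a stochastic projection: a portfolio value is
# projected under random annual investment returns, repeated many
# times to build a distribution of outcomes. All figures are invented.
import random

random.seed(42)

def project_portfolio(start=1_000_000, years=10):
    value = start
    for _ in range(years):
        # Random annual return: mean 5%, SD 10% (an assumption).
        value *= 1 + random.gauss(0.05, 0.10)
    return value

outcomes = sorted(project_portfolio() for _ in range(10_000))
median_outcome = outcomes[len(outcomes) // 2]
worst_5pct = outcomes[len(outcomes) // 20]  # 5th percentile of outcomes
print(median_outcome, worst_5pct)
```

The sorted outcomes give not only a central estimate (the median) but also the tails, such as the 5th-percentile "bad year" scenario.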

Numerical Evaluations of Quantities

Stochastic models help to assess the interactions between variables and are useful tools to numerically evaluate quantities, as they are usually implemented using Monte Carlo simulation techniques.


Monte Carlo Simulation: Monte Carlo simulation (10,000 points) of the distribution of the sample mean of a circular normal distribution for 3 measurements.

While there is an advantage here, in estimating quantities that would otherwise be difficult to obtain using analytical methods, a disadvantage is that such methods are limited by computing resources as well as simulation error. Below are some examples:

Means

Using statistical notation, it is a well-known result that the mean of a function, [latex]\text{f}[/latex], of a random variable, [latex]\text{x}[/latex], is not necessarily the function of the mean of [latex]\text{x}[/latex]. For example, in finance, applying the best estimate (defined as the mean) of investment returns to discount a set of cash flows will not necessarily give the same result as taking the best estimate of the discounted cash flows themselves. A stochastic model is able to assess this latter quantity with simulations.
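This can be checked by simulation. The 5% mean rate, 2% volatility, and 10-year horizon below are invented for illustration; because discounting is convex in the rate, the mean of the discounted values exceeds the value discounted at the mean rate:

```python
# Sketch: the mean of a function of X is generally not the function
# of the mean of X. Here f discounts a cash flow at a random rate.
# All parameters are invented for illustration.
import random
import statistics

random.seed(7)

def discount(rate, cashflow=100.0, years=10):
    return cashflow / (1 + rate) ** years

rates = [random.gauss(0.05, 0.02) for _ in range(100_000)]
mean_of_f = statistics.mean(discount(r) for r in rates)
f_of_mean = discount(statistics.mean(rates))
print(mean_of_f, f_of_mean)  # mean_of_f exceeds f_of_mean (convexity)
```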

Percentiles

This idea is seen again when one considers percentiles. When assessing risks at specific percentiles, the factors that contribute to these levels are rarely at these percentiles themselves. Stochastic models can be simulated to assess the percentiles of the aggregated distributions.
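A brief sketch with two invented independent risks makes the point: the 95th percentile of their sum is not the sum of their individual 95th percentiles, so the aggregate must be simulated:

```python
# Sketch: the 95th percentile of a sum of two independent risks is
# not the sum of the two 95th percentiles. Data are invented.
import random

random.seed(3)
n = 100_000
a = [random.gauss(100, 20) for _ in range(n)]
b = [random.gauss(100, 20) for _ in range(n)]

def pct(values, q):
    return sorted(values)[int(q * len(values))]

p95_of_sum = pct([x + y for x, y in zip(a, b)], 0.95)
sum_of_p95 = pct(a, 0.95) + pct(b, 0.95)
print(p95_of_sum, sum_of_p95)  # the aggregate percentile is smaller
```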

Truncations and Censors

Truncating and censoring of data can also be estimated using stochastic models. For instance, applying a non-proportional reinsurance layer to the best estimate losses will not necessarily give us the best estimate of the losses after the reinsurance layer. In a simulated stochastic model, the simulated losses can be made to “pass through” the layer and the resulting losses are assessed appropriately.
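This pass-through can be sketched with an invented exponential loss model and a hypothetical layer attaching at 50 with a limit of 100. Applying the layer to the mean loss differs from the mean of the layered losses:

```python
# Sketch: simulated losses "pass through" a non-proportional
# reinsurance layer (the part of each loss above `attach`, capped at
# `limit`). All numbers are invented for illustration.
import random
import statistics

random.seed(5)

def layer(loss, attach=50.0, limit=100.0):
    # Reinsured amount for a single loss.
    return min(max(loss - attach, 0.0), limit)

losses = [random.expovariate(1 / 60.0) for _ in range(100_000)]
mean_of_layered = statistics.mean(layer(x) for x in losses)
layered_mean = layer(statistics.mean(losses))
print(mean_of_layered, layered_mean)  # applying the layer first differs
```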

The Gauss Model

The normal (Gaussian) distribution is a commonly used distribution that can be used to model data in many real-life scenarios.

Learning Objectives

Explain the importance of the Gauss model in terms of the central limit theorem.

Key Takeaways

Key Points

  • If [latex]\mu = 0[/latex] and [latex]\sigma = 1[/latex], the distribution is called the standard normal distribution or the unit normal distribution, and a random variable with that distribution is a standard normal deviate.
  • It is symmetric around the point [latex]\text{x}=\mu[/latex], which is at the same time the mode, the median and the mean of the distribution.
  • The Gaussian distribution is sometimes informally called the bell curve. However, there are many other distributions that are bell-shaped as well.
  • About 68% of values drawn from a normal distribution are within one standard deviation σ away from the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations. This fact is known as the 68-95-99.7 (empirical) rule, or the 3-sigma rule.

Key Terms

  • central limit theorem: The theorem that states: If the sum of independent identically distributed random variables has a finite variance, then it will be (approximately) normally distributed.

The Normal (Gaussian) Distribution

In probability theory, the normal (or Gaussian) distribution is a continuous probability distribution, defined by the formula:

[latex]\displaystyle \text{f}(\text{x})= \frac{1}{\sigma \sqrt{2\pi }}\text{e}^{-\frac{(\text{x}-\mu )^{2}}{2\sigma ^{2}}}[/latex]

The parameter [latex]\mu[/latex] in this formula is the mean or expectation of the distribution (and also its median and mode). The parameter [latex]\sigma[/latex] is its standard deviation; its variance is therefore [latex]\sigma^2[/latex]. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.

If [latex]\mu = 0[/latex] and [latex]\sigma = 1[/latex], the distribution is called the standard normal distribution or the unit normal distribution, and a random variable with that distribution is a standard normal deviate.
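The density formula can be written directly in code; with the defaults mu = 0 and sigma = 1 it gives the standard normal density:

```python
# The normal density f(x) = exp(-(x - mu)^2 / (2 sigma^2)) /
# (sigma * sqrt(2 pi)), written directly.
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    coef = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# With the defaults this is the standard normal density.
print(normal_pdf(0.0))  # peak of the standard normal, about 0.3989
print(normal_pdf(100, mu=100, sigma=15))
```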

Importance of the Normal Distribution

Normal distributions are extremely important in statistics, and are often used in the natural and social sciences for real-valued random variables whose distributions are not known. One reason for their popularity is the central limit theorem, which states that, under mild conditions, the mean of a large number of random variables independently drawn from the same distribution is distributed approximately normally, irrespective of the form of the original distribution. Thus, physical quantities that are expected to be the sum of many independent processes (such as measurement errors) often have a distribution very close to normal. Another reason is that a large number of results and methods (such as propagation of uncertainty and least squares parameter fitting) can be derived analytically, in explicit form, when the relevant variables are normally distributed.

The normal distribution is symmetric about its mean, and is non-zero over the entire real line. As such it may not be a suitable model for variables that are inherently positive or strongly skewed, such as the weight of a person or the price of a share. Such variables may be better described by other distributions, such as the log-normal distribution or the Pareto distribution.

The normal distribution is also practically zero once the value [latex]\text{x}[/latex] lies more than a few standard deviations away from the mean. Therefore, it may not be appropriate when one expects a significant fraction of outliers, values that lie many standard deviations away from the mean. Least-squares and other statistical inference methods which are optimal for normally distributed variables often become highly unreliable. In those cases, one assumes a more heavy-tailed distribution, and the appropriate robust statistical inference methods.

The Gaussian distribution is sometimes informally called the bell curve. However, there are many other distributions that are bell-shaped (such as Cauchy’s, Student’s, and the logistic). The terms Gaussian function and Gaussian bell curve are also ambiguous since they sometimes refer to multiples of the normal distribution whose integral is not 1; that is, to functions of the form [latex]\text{a}\cdot \text{e}^{-\frac{(\text{x}-\text{b})^{2}}{2\text{c}^{2}}}[/latex] for arbitrary positive constants [latex]\text{a}[/latex], [latex]\text{b}[/latex] and [latex]\text{c}[/latex].

Properties of the Normal Distribution

The normal distribution [latex]\text{f}(\text{x})[/latex], with any mean [latex]\mu[/latex] and any positive deviation [latex]\sigma[/latex], has the following properties:

  • It is symmetric around the point [latex]\text{x} = \mu[/latex], which is at the same time the mode, the median and the mean of the distribution.
  • It is unimodal: its first derivative is positive for [latex]\text{x}<\mu[/latex], negative for [latex]\text{x}>\mu[/latex], and zero only at [latex]\text{x}=\mu[/latex].
  • It has two inflection points (where the second derivative of [latex]\text{f}[/latex] is zero), located one standard deviation away from the mean, namely at [latex]\text{x} = \mu - \sigma[/latex] and [latex]\text{x} = \mu + \sigma[/latex].
  • About 68% of values drawn from a normal distribution are within one standard deviation [latex]\sigma[/latex] away from the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations. This fact is known as the 68-95-99.7 (empirical) rule, or the 3-sigma rule.
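The 68-95-99.7 percentages can be checked against the normal CDF, which is available in closed form via the error function:

```python
# Checking the 68-95-99.7 rule with the normal CDF (via math.erf).
import math

def within(k):
    # P(|X - mu| < k * sigma) for a normally distributed X
    return math.erf(k / math.sqrt(2))

print(within(1))  # about 0.6827
print(within(2))  # about 0.9545
print(within(3))  # about 0.9973
```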

Notation

The normal distribution is also often denoted by [latex]\text{N}(\mu, \sigma^2)[/latex]. Thus, when a random variable [latex]\text{X}[/latex] is distributed normally with mean [latex]\mu[/latex] and variance [latex]\sigma^2[/latex], we write [latex]\text{X}\sim \text{N}\left ( \mu,\sigma ^{2} \right )[/latex].

Comparing Two Sample Averages

Student’s t-test is used in order to compare two independent sample means.

Learning Objectives

Contrast two sample means by standardizing their difference to find a t-score test statistic.

Key Takeaways

Key Points

  • Very different sample means can occur by chance if there is great variation among the individual samples.
  • In order to account for the variation, we take the difference of the sample means and divide by the standard error in order to standardize the difference, resulting in a t-score test statistic.
  • The independent samples t-test is used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared.
  • Paired samples t-tests typically consist of a sample of matched pairs of similar units or one group of units that has been tested twice (a “repeated measures” t-test).
  • An overlapping samples t-test is used when there are paired samples with data missing in one or the other samples.

Key Terms

  • null hypothesis: A hypothesis set up to be refuted in order to support an alternative hypothesis; presumed true until statistical evidence in the form of a hypothesis test indicates otherwise.
  • Student’s t-distribution: A distribution that arises when the population standard deviation is unknown and has to be estimated from the data; originally derived by William Sealy Gosset (who wrote under the pseudonym “Student”).

The comparison of two sample means is very common. The difference between the two samples depends on both the means and the standard deviations. Very different means can occur by chance if there is great variation among the individual samples. In order to account for the variation, we take the difference of the sample means,

[latex]\bar { { \text{X} }_{ 1 } } -\bar { { \text{X} }_{ 2 } }[/latex],

and divide by the standard error in order to standardize the difference. The result is a t-score test statistic.
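The standardized difference can be sketched numerically. The two groups below are invented, and the unequal-variance (Welch) form of the standard error is used as one common choice, since the text does not specify a particular form:

```python
# Sketch: standardize the difference of two sample means by the
# standard error of the difference (Welch form). Data are invented.
import math
import statistics

group1 = [5.1, 4.9, 5.4, 5.0, 5.3, 5.2]
group2 = [4.6, 4.8, 4.5, 4.9, 4.7, 4.4]

def t_score(a, b):
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

print(t_score(group1, group2))  # a large |t| suggests a real difference
```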

t-Test for Two Means

Although the t-test will be explained in great detail later in this textbook, it is important for the reader to have a basic understanding of its function in regard to comparing two sample means. A t-test is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution, shown in the plot below, if the null hypothesis is supported. It can be used to determine whether two sets of data are significantly different from each other.


Student t Distribution: This is a plot of the Student t Distribution for various degrees of freedom.

In the t-test comparing the means of two independent samples, the following assumptions should be met:

  1. Each of the two populations being compared should follow a normal distribution.
  2. If using Student’s original definition of the t-test, the two populations being compared should have the same variance. If the sample sizes in the two groups being compared are equal, Student’s original t-test is highly robust to the presence of unequal variances.
  3. The data used to carry out the test should be sampled independently from the populations being compared. This is, in general, not testable from the data, but if the data are known to be dependently sampled (i.e., if they were sampled in clusters), then the classical t-tests discussed here may give misleading results.

Two-sample t-tests for a difference in mean involve independent samples, paired samples and overlapping samples. The independent samples t-test is used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared. For example, suppose we are evaluating the effects of a medical treatment. We enroll 100 subjects into our study, then randomize 50 subjects to the treatment group and 50 subjects to the control group. In this case, we have two independent samples and would use the unpaired form of the t-test.

Paired sample t-tests typically consist of a sample of matched pairs of similar units or one group of units that has been tested twice (a “repeated measures” t-test). A typical example of the repeated measures t-test would be where subjects are tested prior to a treatment (say, for high blood pressure) and the same subjects are tested again after treatment with a blood-pressure lowering medication. By comparing the same patient’s numbers before and after treatment, we are effectively using each patient as their own control.

An overlapping sample t-test is used when there are paired samples with data missing in one or the other samples. These tests are widely used in commercial survey research (e.g., by polling companies) and are available in many standard crosstab software packages.

Odds Ratios

The odds of an outcome are the ratio of the expected number of times the event will occur to the expected number of times the event will not occur.

Learning Objectives

Define the odds ratio and demonstrate its computation.

Key Takeaways

Key Points

  • The odds ratio is one way to quantify how strongly having or not having the property [latex]\text{A}[/latex] is associated with having or not having the property [latex]\text{B}[/latex] in a population.
  • The odds ratio is a measure of effect size, describing the strength of association or non-independence between two binary data values.
  • To compute the odds ratio, we 1) compute the odds that an individual in the population has [latex]\text{A}[/latex] given that he or she has [latex]\text{B}[/latex], 2) compute the odds that an individual in the population has [latex]\text{A}[/latex] given that he or she does not have [latex]\text{B}[/latex] and 3) divide the first odds by the second odds.
  • If the odds ratio is greater than one, then having [latex]\text{A}[/latex] is associated with having [latex]\text{B}[/latex] in the sense that having [latex]\text{B}[/latex] raises the odds of having [latex]\text{A}[/latex].

Key Terms

  • logarithm: for a number [latex]\text{x}[/latex], the power to which a given base number must be raised in order to obtain [latex]\text{x}[/latex]
  • odds: the ratio of the probabilities of an event happening to that of it not happening

The odds of an outcome are the ratio of the expected number of times the event will occur to the expected number of times the event will not occur. Put simply, the odds are the ratio of the probability of an event occurring to the probability of it not occurring.

An odds ratio is the ratio of two odds. Imagine each individual in a population either does or does not have a property [latex]\text{A}[/latex], and also either does or does not have a property [latex]\text{B}[/latex]. For example, [latex]\text{A}[/latex] might be “has high blood pressure,” and [latex]\text{B}[/latex] might be “drinks more than one alcoholic drink a day.” The odds ratio is one way to quantify how strongly having or not having the property [latex]\text{A}[/latex] is associated with having or not having the property [latex]\text{B}[/latex] in a population. In order to compute the odds ratio, one follows three steps:

  1. Compute the odds that an individual in the population has [latex]\text{A}[/latex] given that he or she has [latex]\text{B}[/latex] (probability of [latex]\text{A}[/latex] given [latex]\text{B}[/latex] divided by the probability of not-[latex]\text{A}[/latex] given [latex]\text{B}[/latex]).
  2. Compute the odds that an individual in the population has [latex]\text{A}[/latex] given that he or she does not have [latex]\text{B}[/latex].
  3. Divide the first odds by the second odds to obtain the odds ratio.
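The three steps above can be sketched in a few lines of Python. The conditional probabilities used here are hypothetical inputs for illustration, not values taken from the text:

```python
# A minimal sketch of the three-step odds-ratio computation, assuming we
# already know the conditional probabilities P(A|B) and P(A|not B).

def odds(p):
    """Convert a probability p into odds p / (1 - p)."""
    return p / (1.0 - p)

def odds_ratio(p_a_given_b, p_a_given_not_b):
    """Steps 1-3: odds of A given B, odds of A given not-B, their ratio."""
    return odds(p_a_given_b) / odds(p_a_given_not_b)

# Hypothetical inputs: P(A|B) = 0.9, P(A|not B) = 0.2
print(odds_ratio(0.9, 0.2))  # approximately 36
```

An odds ratio above 1 here would indicate that having B raises the odds of having A, exactly as described in the text.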

If the odds ratio is greater than one, then having [latex]\text{A}[/latex] is associated with having [latex]\text{B}[/latex] in the sense that having [latex]\text{B}[/latex] raises (relative to not having [latex]\text{B}[/latex]) the odds of having [latex]\text{A}[/latex]. Note that this is not enough to establish that [latex]\text{B}[/latex] is a contributing cause of [latex]\text{A}[/latex]. It could be that the association is due to a third property, [latex]\text{C}[/latex], which is a contributing cause of both [latex]\text{A}[/latex] and [latex]\text{B}[/latex].

In more technical language, the odds ratio is a measure of effect size, describing the strength of association or non-independence between two binary data values. It is used as a descriptive statistic and plays an important role in logistic regression.

Example

Suppose that in a sample of [latex]100[/latex] men [latex]90[/latex] drank wine in the previous week, while in a sample of [latex]100[/latex] women only [latex]20[/latex] drank wine in the same period. The odds of a man drinking wine are [latex]90[/latex] to [latex]10[/latex] (or [latex]9:1[/latex]) while the odds of a woman drinking wine are only [latex]20[/latex] to [latex]80[/latex] (or [latex]1:4=0.25:1[/latex]). The odds ratio is thus [latex]\frac{9}{0.25}[/latex] (or [latex]36[/latex]) showing that men are much more likely to drink wine than women. The detailed calculation is:

[latex]\dfrac { 0.9/0.1 }{ 0.2/0.8 } =\dfrac { 0.9\cdot 0.8 }{ 0.1\cdot 0.2 } =\dfrac { 0.72 }{ 0.02 } =36[/latex]

This example also shows how sensitive odds ratios can be as a way of stating relative positions. In this sample men are [latex]\frac{90}{20} = 4.5[/latex] times more likely to have drunk wine than women, but have [latex]36[/latex] times the odds. The logarithm of the odds ratio (the difference of the logits of the probabilities) tempers this effect and also makes the measure symmetric with respect to the ordering of the groups. For example, using natural logarithms, an odds ratio of [latex]\frac{36}{1}[/latex] maps to [latex]3.584[/latex], and an odds ratio of [latex]\frac{1}{36}[/latex] maps to [latex]-3.584[/latex].
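The wine example and the symmetry of the log odds ratio can be checked numerically. This sketch uses only the numbers given above and Python's standard `math` module:

```python
import math

# Reproduce the wine example: odds for men are 0.9/0.1, for women 0.2/0.8.
odds_men = 0.9 / 0.1    # 9 : 1
odds_women = 0.2 / 0.8  # 1 : 4, i.e. 0.25 : 1
ratio = odds_men / odds_women

# Relative probability vs. odds ratio: 4.5x the probability, 36x the odds.
relative_prob = 0.9 / 0.2

# The log odds ratio is symmetric under swapping the two groups:
log_or = math.log(ratio)              # ln(36)   is approximately  3.584
log_or_swapped = math.log(1 / ratio)  # ln(1/36) is approximately -3.584

print(round(ratio, 6), round(relative_prob, 2))
print(round(log_or, 3), round(log_or_swapped, 3))
```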

Odds Ratios: A graph showing how the log odds ratio relates to the underlying probabilities of the outcome [latex]\text{X}[/latex] occurring in two groups, denoted [latex]\text{A}[/latex] and [latex]\text{B}[/latex]. The log odds ratio shown here is based on the odds for the event occurring in group [latex]\text{B}[/latex] relative to the odds for the event occurring in group [latex]\text{A}[/latex]. Thus, when the probability of [latex]\text{X}[/latex] occurring in group [latex]\text{B}[/latex] is greater than the probability of [latex]\text{X}[/latex] occurring in group [latex]\text{A}[/latex], the odds ratio is greater than [latex]1[/latex], and the log odds ratio is greater than [latex]0[/latex].

When Does the Z-Test Apply?

A [latex]\text{z}[/latex]-test is a test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution.

Learning Objectives

Identify how sample size contributes to the appropriateness and accuracy of a [latex]\text{z}[/latex]-test.

Key Takeaways

Key Points

  • The term [latex]\text{z}[/latex]-test is often used to refer specifically to the one-sample location test comparing the mean of a set of measurements to a given constant.
  • To calculate the standardized statistic [latex]\text{Z} = \frac{\bar{\text{X}} - \mu_0}{\text{s}}[/latex], we need to either know or have an approximate value for [latex]\sigma^2[/latex], from which we can calculate [latex]\text{s}^2 = \frac{\sigma^2}{\text{n}}[/latex].
  • For a [latex]\text{z}[/latex]-test to be applicable, nuisance parameters should be known, or estimated with high accuracy.
  • For a [latex]\text{z}[/latex]-test to be applicable, the test statistic should follow a normal distribution.

Key Terms

  • nuisance parameters: any parameter that is not of immediate interest but which must be accounted for in the analysis of those parameters which are of interest; the classic example of a nuisance parameter is the variance, $\sigma^2$, of a normal distribution when the mean, $\mu$, is of primary interest
  • null hypothesis: A hypothesis set up to be refuted in order to support an alternative hypothesis; presumed true until statistical evidence in the form of a hypothesis test indicates otherwise.

[latex]\text{Z}[/latex]-test

A [latex]\text{Z}[/latex]-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. Because of the central limit theorem, many test statistics are approximately normally distributed for large samples. For each significance level, the [latex]\text{Z}[/latex]-test has a single critical value (for example, [latex]1.96[/latex] for a 5% two-tailed test), which makes it more convenient than the Student's [latex]\text{t}[/latex]-test, which has separate critical values for each sample size. Therefore, many statistical tests can be conveniently performed as approximate [latex]\text{Z}[/latex]-tests if the sample size is large or the population variance is known. If the population variance is unknown (and therefore has to be estimated from the sample itself) and the sample size is not large ([latex]\text{n}<30[/latex]), the Student's [latex]\text{t}[/latex]-test may be more appropriate.

If [latex]\text{T}[/latex] is a statistic that is approximately normally distributed under the null hypothesis, the next step in performing a [latex]\text{Z}[/latex]-test is to estimate the expected value [latex]\theta[/latex] of [latex]\text{T}[/latex] under the null hypothesis, and then obtain an estimate [latex]\text{s}[/latex] of the standard deviation of [latex]\text{T}[/latex]. We then calculate the standard score [latex]\text{Z} = \frac{(\text{T}-\theta)}{\text{s}}[/latex], from which one-tailed and two-tailed [latex]\text{p}[/latex]-values can be calculated as [latex]\varphi(-\text{Z})[/latex] (for upper-tailed tests), [latex]\varphi(\text{Z})[/latex] (for lower-tailed tests) and [latex]2\varphi(-\left|\text{Z}\right|)[/latex] (for two-tailed tests), where [latex]\varphi[/latex] is the standard normal cumulative distribution function.
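These p-value formulas can be sketched with Python's standard-library `statistics.NormalDist`. The values of [latex]\text{T}[/latex], [latex]\theta[/latex], and [latex]\text{s}[/latex] below are hypothetical placeholders:

```python
from statistics import NormalDist

# Final step of a Z-test: given a statistic T, its expected value theta
# under the null hypothesis, and an estimated standard deviation s
# (all hypothetical numbers), compute Z and the three p-values.
T, theta, s = 103.0, 100.0, 1.5
Z = (T - theta) / s

phi = NormalDist().cdf          # standard normal cumulative distribution
p_upper = phi(-Z)               # upper-tailed p-value
p_lower = phi(Z)                # lower-tailed p-value
p_two_sided = 2 * phi(-abs(Z))  # two-tailed p-value

print(Z, p_two_sided)
```

Note that the upper- and lower-tailed p-values always sum to 1, and the two-tailed p-value doubles the smaller of the two.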

Use in Location Testing

The term [latex]\text{Z}[/latex]-test is often used to refer specifically to the one-sample location test comparing the mean of a set of measurements to a given constant. If the observed data [latex]\text{X}_1, \cdots, \text{X}_\text{n}[/latex] are uncorrelated, have a common mean [latex]\mu[/latex], and have a common variance [latex]\sigma^2[/latex], then the sample average [latex]\bar{\text{X}}[/latex] has mean [latex]\mu[/latex] and variance [latex]\frac{\sigma^2}{\text{n}}[/latex]. If our null hypothesis is that the mean value of the population is a given number [latex]\mu_0[/latex], we can use [latex]\bar{\text{X}} - \mu_0[/latex] as a test-statistic, rejecting the null hypothesis if [latex]\bar{\text{X}}-\mu_0[/latex] is large.

To calculate the standardized statistic [latex]\text{Z} = \frac{\bar{\text{X}} - \mu_0}{\text{s}}[/latex], we need to either know or have an approximate value for [latex]\sigma^2[/latex], from which we can calculate [latex]\text{s}^2 = \frac{\sigma^2}{\text{n}}[/latex]. In some applications, [latex]\sigma^2[/latex] is known, but this is uncommon. If the sample size is moderate or large, we can substitute the sample variance for [latex]\sigma^2[/latex], giving a plug-in test. The resulting test will not be an exact [latex]\text{Z}[/latex]-test, since the uncertainty in the sample variance is not accounted for; however, it will be a good approximation unless the sample size is small. A [latex]\text{t}[/latex]-test can be used to account for the uncertainty in the sample variance when the sample size is small and the data are exactly normal. There is no universal constant at which the sample size is generally considered large enough to justify use of the plug-in test; typical rules of thumb range from 20 to 50 samples. For larger sample sizes, the [latex]\text{t}[/latex]-test procedure gives almost identical [latex]\text{p}[/latex]-values as the [latex]\text{Z}[/latex]-test procedure. The following formula converts a random variable [latex]\text{X}[/latex] to the standard [latex]\text{Z}[/latex]:

[latex]\text{Z} = \dfrac{\text{X}-\mu}{\sigma}[/latex]
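Putting the pieces together, here is a minimal one-sample location test on made-up data, under the assumption that the population standard deviation is known. The data values, null mean, and sigma are all hypothetical:

```python
from math import sqrt
from statistics import NormalDist, mean

# One-sample location Z-test sketch with made-up data, assuming the
# population standard deviation sigma is known.
data = [102.1, 99.8, 104.3, 100.9, 103.2, 101.7, 98.5, 105.0]
mu_0 = 100.0   # null-hypothesis mean
sigma = 2.5    # assumed known population standard deviation

n = len(data)
x_bar = mean(data)
s = sigma / sqrt(n)         # standard deviation of the sample mean
Z = (x_bar - mu_0) / s      # standardized statistic

# Two-tailed p-value from the standard normal CDF.
p_two_sided = 2 * NormalDist().cdf(-abs(Z))
print(round(Z, 3), round(p_two_sided, 4))
```

With these particular numbers the two-tailed p-value falls below 0.05, so at the 5% level the null hypothesis would be rejected; with different data the conclusion could of course differ.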

Conditions

For the [latex]\text{Z}[/latex]-test to be applicable, certain conditions must be met:

  • Nuisance parameters should be known, or estimated with high accuracy (an example of a nuisance parameter would be the standard deviation in a one-sample location test). [latex]\text{Z}[/latex]-tests focus on a single parameter, and treat all other unknown parameters as being fixed at their true values. In practice, due to Slutsky’s theorem, “plugging in” consistent estimates of nuisance parameters can be justified. However, if the sample size is not large enough for these estimates to be reasonably accurate, the [latex]\text{Z}[/latex]-test may not perform well.
  • The test statistic should follow a normal distribution. Generally, one appeals to the central limit theorem to justify assuming that a test statistic varies normally. There is a great deal of statistical research on the question of when a test statistic varies approximately normally. If the variation of the test statistic is strongly non-normal, a [latex]\text{Z}[/latex]-test should not be used.