Essential Concepts
- The median of a data set can be computed by ordering the data values and identifying the value in “the middle”.
- The mean represents the balance point of the data, and the median represents the 50th percentile, or the value that splits the data in half.
- The following steps can be applied to calculate a standard deviation by hand:
- Calculate the mean of the population or sample
- Take the difference between each data value and the mean, then square each difference
- Add up all the squared differences
- Divide by either the total number of observations in the case of a population, or by 1 fewer than the total in the case of a sample
- Take the square root of the result from the previous step
- Variability can be judged from a histogram by examining the distance of the bars from the statistical center (mean or median) of the graph. If the variability is high, equally sized or taller bars will appear away from the center of the graph. If the variability is low, the data will appear clustered around the center.
- Larger values of range indicate more variability in the data, but the range value only utilizes two observations in the entire data set to measure variability. This is not an ideal measure of spread, but when used in combination with other measures of spread, it can help you gain a clearer understanding of the spread of a distribution.
- The median stays relatively fixed in a data set if one value changes by a large amount, but the mean does not. This indicates that the mean is sensitive to the presence of extreme values in the data set.
- When a distribution is symmetric, the mean and median are approximately equal. Under a skew, the mean is “pulled” in the direction of the outliers:
- Right-skewed: the mean is greater than the median
- Left-skewed: the mean is less than the median
- The mean, under certain conditions, can be a misleading indicator of a “typical” observation value.
- A boxplot shows the median of the data set, not the mean, as its measure of center. It displays the five-number summary at a glance, which supports comparisons based on the median, skew, outliers, and quartiles.
- The interquartile range [latex](IQR)[/latex] is commonly used to determine whether an observation is an outlier in the distribution. An observation is flagged as an outlier if it lies more than [latex]1.5(IQR)[/latex] below [latex]Q1[/latex] or more than [latex]1.5(IQR)[/latex] above [latex]Q3[/latex].
- Standardizing a value involves finding the difference between the given value and the mean, and dividing that difference by the standard deviation. The resulting value is a number of standard deviations, and has no units associated with it.
- Standardized scores can be positive or negative. A negative score indicates a value that lies to the left of (below) the mean, and a positive score indicates a value that lies to the right of (above) the mean.
- An estimate of how many observations are within a certain number of standard deviations can be made if a distribution is bell shaped, unimodal, and symmetric.
- The Empirical Rule states that:
- about 68% of observations in a data set will be within one standard deviation of the mean
- about 95% of observations in a data set will be within two standard deviations of the mean
- about 99.7% of the observations in a data set will be within three standard deviations of the mean
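The percentages in the Empirical Rule can be checked against an exactly normal distribution. The sketch below assumes Python's standard-library `statistics.NormalDist`; the helper name `share_within` is hypothetical and introduced only for illustration. Real data that is only roughly bell shaped will match these figures approximately, not exactly.

```python
from statistics import NormalDist

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    # Prints roughly 68.3%, 95.4%, and 99.7%.
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```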
Key Equations
Converting values into standardized scores
[latex]z=\dfrac{x-\mu}{\sigma}[/latex], where [latex]x[/latex] represents the value of the observation, [latex]\mu[/latex] represents the population mean, [latex]\sigma[/latex] represents the population standard deviation, and [latex]z[/latex] represents the standardized value, or z-score.
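As a minimal sketch of the z-score formula, the Python function below standardizes a single value. The function name `z_score` and the exam-score numbers are made up for illustration.

```python
def z_score(x, mu, sigma):
    """Standardize x: its signed distance from the mean in standard deviations."""
    if sigma <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mu) / sigma

# Hypothetical example: exam scores with mean 70 and standard deviation 8.
print(z_score(82, mu=70, sigma=8))  # 1.5, so 82 is 1.5 standard deviations above the mean
print(z_score(62, mu=70, sigma=8))  # -1.0, so 62 is 1 standard deviation below the mean
```

The sign of the result shows which side of the mean the value falls on, and the result carries no units.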
Deviation from the mean
[latex]\left(x-\bar{x}\right)[/latex]
where [latex]x[/latex] is an observation in the data set, and [latex]\bar{x}[/latex] is the sample mean.
Interquartile range ([latex]IQR[/latex])
[latex]Q3-Q1[/latex]
Lower outlier
[latex]Q1-1.5(IQR)[/latex]. An observation less than this value is a lower outlier. Remember to multiply [latex]1.5[/latex] by [latex]IQR[/latex] first, then subtract the result from [latex]Q1[/latex].
Mean
[latex]\dfrac{\text{sum of data values}}{\text{total number of data values}}[/latex] or [latex]\bar{x}=\dfrac{\sum{x}}{n}[/latex]
where [latex]\bar{x}[/latex] is the mean, [latex]\sum[/latex] is the symbol for “sum of”, [latex]{x}[/latex] represents the data values, and [latex]{n}[/latex] is the total number of data values.
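The mean formula, and the way the mean reacts to an extreme value while the median does not, can be illustrated with a short Python sketch. Both data sets are made up, and `median` comes from Python's standard `statistics` module.

```python
from statistics import median

def mean(values):
    """Sum of the data values divided by the number of data values."""
    return sum(values) / len(values)

# Made-up data: the second list replaces the largest value with an extreme one.
original = [2, 4, 6, 8, 10]
with_extreme = [2, 4, 6, 8, 100]

print(mean(original), median(original))          # 6.0 6
print(mean(with_extreme), median(with_extreme))  # 24.0 6
```

Changing one value moves the mean from 6 to 24, while the median stays at 6.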
Standard deviation of a population
[latex]\sigma = \sqrt{\dfrac{\sum \left(x-\mu\right)^2}{n}}[/latex]
where [latex]\sum[/latex] denotes the sum of [latex]{\left(x-\mu\right)^2}[/latex] over all observations, [latex]x[/latex] is an observation in the data set, [latex]\mu[/latex] is the population mean, and [latex]n[/latex] is the number of observations in the population.
Standard deviation of a sample
[latex]s=\sqrt{\dfrac{\sum \left(x-\bar{x}\right)^2}{n-1}}[/latex]
where [latex]\sum[/latex] denotes the sum of [latex]{\left(x-\bar{x}\right)^2}[/latex] over all observations, [latex]x[/latex] is an observation in the data set, [latex]\bar{x}[/latex] is the sample mean, and [latex]n[/latex] is the number of observations in the sample.
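The five by-hand steps listed under Essential Concepts map directly onto the population and sample formulas. The Python sketch below follows them one at a time; the function name `standard_deviation`, its `sample` flag, and the data values are assumptions made for illustration.

```python
from math import sqrt

def standard_deviation(values, sample=True):
    """Standard deviation computed step by step.

    sample=True divides by n - 1; sample=False divides by n (population).
    """
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough observations")
    # Step 1: calculate the mean.
    mean = sum(values) / n
    # Step 2: square each difference from the mean.
    squared_diffs = [(x - mean) ** 2 for x in values]
    # Step 3: add up the squared differences.
    total = sum(squared_diffs)
    # Step 4: divide by n (population) or n - 1 (sample).
    variance = total / (n - 1 if sample else n)
    # Step 5: take the square root.
    return sqrt(variance)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5
print(standard_deviation(data, sample=False))  # 2.0
print(standard_deviation(data, sample=True))   # slightly larger, since it divides by n - 1
```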
Upper outlier
[latex]Q3+1.5(IQR)[/latex]. An observation greater than this value is an upper outlier. Remember to multiply [latex]1.5[/latex] by [latex]IQR[/latex] first, then add the result to [latex]Q3[/latex].
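The lower and upper outlier cutoffs can be computed together once [latex]Q1[/latex] and [latex]Q3[/latex] are known. The Python sketch below is one possible approach: it assumes the "inclusive" quartile method of the standard-library `statistics.quantiles`, and other quartile conventions (including a hand calculation) may give slightly different [latex]Q1[/latex] and [latex]Q3[/latex] values. The function name `outlier_fences` and the data are made up for illustration.

```python
from statistics import quantiles

def outlier_fences(values):
    """Return (lower_cutoff, upper_cutoff) using the 1.5(IQR) rule."""
    # Assumption: quartiles use the "inclusive" method; other conventions exist.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Multiply 1.5 by the IQR first, then subtract from Q1 / add to Q3.
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up data
lower, upper = outlier_fences(data)
outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)  # -2.0 14.0 [30]
```

Here [latex]Q1=4[/latex] and [latex]Q3=8[/latex], so [latex]IQR=4[/latex] and only the value 30 falls outside the cutoffs.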
Variance of a population
[latex]\sigma^{2}=\dfrac{\sum\left(x-\mu\right)^{2}}{n}[/latex]
where [latex]\sum[/latex] denotes the sum of [latex]{\left(x-\mu\right)^2}[/latex] over all observations, [latex]x[/latex] is an observation in the data set, [latex]\mu[/latex] is the population mean, and [latex]n[/latex] is the number of observations in the population.
Variance of a sample
[latex]s^{2}=\dfrac{\sum\left(x-\bar{x}\right)^{2}}{n-1}[/latex]
where [latex]\sum[/latex] denotes the sum of [latex]{\left(x-\bar{x}\right)^2}[/latex] over all observations, [latex]x[/latex] is an observation in the data set, [latex]\bar{x}[/latex] is the sample mean, and [latex]n[/latex] is the number of observations in the sample.
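Hand calculations of variance can be cross-checked with software. As one option, Python's standard `statistics` module provides `pvariance` and `pstdev` for the population formulas and `variance` and `stdev` for the sample formulas. The data values below are made up.

```python
from math import isclose
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

# Population versions divide by n; sample versions divide by n - 1.
print(pvariance(data), pstdev(data))
print(variance(data), stdev(data))

# In both cases the variance is the standard deviation squared.
assert isclose(pstdev(data) ** 2, pvariance(data))
assert isclose(stdev(data) ** 2, variance(data))
```

The sample variance comes out larger than the population variance for the same data because the same sum of squared deviations is divided by a smaller number.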
Glossary
- [latex]s[/latex]
- the standard deviation of a sample of observations.
- [latex]\sigma[/latex]
- the standard deviation of a population of observations.
- [latex]s^{2}[/latex]
- the variance of a sample of observations.
- [latex]\sigma^{2}[/latex]
- the variance of a population of observations.
- deviation from the mean
- the difference between an observation ([latex]{x}[/latex]) in a data set and the mean [latex]\left(\bar{x}\right)[/latex] of the data set. It is negative when the observation is below the mean.
- Empirical Rule
- a guideline that predicts the percentage of observations within a certain number of standard deviations of the mean. Also known as the [latex]\textbf{68-95-99.7}[/latex] Rule, which states that in a bell-shaped, unimodal, symmetric distribution, almost all of the observed data values, [latex]x[/latex], lie within three standard deviations, [latex]\sigma[/latex], to either side of the mean, [latex]\mu[/latex]. More specifically, about [latex]68\%[/latex] of observations in a data set will be within one standard deviation of the mean [latex]\left(\mu\pm\sigma\right)[/latex], about [latex]95\%[/latex] of the observations in a data set will be within two standard deviations of the mean [latex]\left(\mu\pm2\sigma\right)[/latex], and about [latex]99.7\%[/latex] of the observations in a data set will be within three standard deviations of the mean [latex]\left(\mu\pm3\sigma\right)[/latex].
- first quartile
- the value below which one quarter of the data lies, also equal to the [latex]25[/latex]th percentile. Sometimes denoted [latex]Q1[/latex].
- five-number summary
- the collection of the minimum, first quartile, median, third quartile, and maximum of the variable.
- interquartile range
- the quantity [latex]Q3-Q1[/latex]. Sometimes denoted [latex]IQR[/latex].
- left-skewed (negative skew)
- most of the data is bunched up to the right of the graph with a “tail” of infrequent values on the left (lower) end of the distribution.
- lower outlier
- an observation that is less than [latex]Q1-1.5(IQR)[/latex].
- mean
- an average of a set of values calculated by adding the values and then dividing the total by the number of values in the data set.
- median
- the “middlemost” value of a set of values listed in numerical order.
- outlier
- an unusual or extreme value, given the other values in the data set.
- range
- the maximum (or largest) value minus the minimum (or smallest) value.
- resistant
- not strongly affected by skewness or extreme values in the data. The median is resistant, while the mean is not.
- right-skewed (positive skew)
- most of the data is bunched up to the left of the graph with a “tail” of infrequent values on the right (upper) end of the distribution.
- standard deviation
- a measure of how spread out observations are from the mean.
- standardized value
- the number of standard deviations an observation is away from the mean. Also referred to as a z-score.
- symmetric
- the left and right sides of the distribution (closely) mirror each other. If you drew a vertical line down the center of the distribution and folded the distribution in half, the left and right sides would closely match one another.
- third quartile
- the value below which three quarters of the data lies, also equal to the [latex]75[/latex]th percentile. Sometimes denoted [latex]Q3[/latex].
- upper outlier
- an observation that is greater than [latex]Q3+1.5(IQR)[/latex].
- variability
- a measure of how dispersed (spread out) the data are. It is often referred to as the spread, or dispersion, of a data set.
- variance
- the standard deviation squared.