Glossary







[latex]s[/latex]: the standard deviation of a sample of observations.

[latex]\sigma[/latex]: the standard deviation of a population of observations.

[latex]s^{2}[/latex]: the variance of a sample of observations.

[latex]\sigma^{2}[/latex]: the variance of a population of observations.
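
As an illustration of how the four symbols above differ, the following minimal sketch uses Python's standard `statistics` module on a small, hypothetical data set. It relies on the usual convention (not stated in this glossary) that sample measures divide by n − 1 while population measures divide by n.

```python
import statistics

# Hypothetical data, used only for illustration.
observations = [2, 4, 4, 4, 5, 5, 7, 9]

# Treating the values as a sample: squared deviations are divided by n - 1.
sample_variance = statistics.variance(observations)   # s^2
sample_sd = statistics.stdev(observations)            # s

# Treating the values as an entire population: squared deviations are divided by n.
population_variance = statistics.pvariance(observations)  # sigma^2
population_sd = statistics.pstdev(observations)           # sigma

print(sample_variance, sample_sd)
print(population_variance, population_sd)
```

For the same values, the sample versions come out slightly larger than the population versions because of the smaller divisor.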

absolute difference: the change between two percentages, expressed as a number of percentage points.

acceptance sampling: sampling where a random sample is drawn from each lot of a product, the items in the sample are tested, and if the number of nonconforming items is above a pre-determined threshold, then the whole lot of the product is rejected.

alternative hypothesis: what we consider to be plausible if the null hypothesis is false.


bar graph: a graph in which the categories are represented by bars that are separated from each other.

base (1): the horizontal side of a right triangle.

base (2): the number that is multiplied by itself in an exponential expression.

Bernoulli trial: a chance experiment with exactly two possible outcomes, the same probability of success for every trial, and trials that are independent from one another.

biased: tending to produce samples that are not representative of the population.

bimodal: two prominent peaks in the distribution.

bin: a range of values that the quantitative variable can take.

binomial experiment: an experiment consisting of a fixed number, 𝑛, of independent Bernoulli trials that counts the number of successes out of 𝑛 trials.
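
A short sketch of the probability calculation that goes with a binomial experiment may help. The function name `binomial_probability` and the example numbers are hypothetical; the formula is the standard binomial probability formula, which this glossary does not itself state.

```python
from math import comb

def binomial_probability(n: int, k: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Example: probability of exactly 3 heads in 10 flips of a fair coin.
prob_three_heads = binomial_probability(10, 3, 0.5)
print(prob_three_heads)
```

Summing the function over every possible count from 0 to n gives 1, as expected for a probability distribution.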

bivariate data: two quantitative variables.

blinding: nondisclosure of the treatment an experimental unit is receiving.


block: a group of subjects that are similar to one another; different blocks differ in ways that might affect the outcome of the experiment.

blocking: grouping together of homogeneous (similar) experimental units followed by the random assignment of the experimental units within each group to a treatment.

bootstrap sample: a sample that is selected from the values in the original sample.
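
The sketch below shows one common way to draw a bootstrap sample. It assumes the usual convention that values are drawn with replacement and that the bootstrap sample has the same size as the original; the data and the name `bootstrap_sample` are hypothetical.

```python
import random
import statistics

def bootstrap_sample(original, rng):
    # Draw len(original) values from the original sample, with replacement,
    # so a value may appear more than once or not at all.
    return rng.choices(original, k=len(original))

original_sample = [12, 15, 9, 20, 14, 11]   # hypothetical data
rng = random.Random(0)                      # fixed seed so runs are repeatable

resample = bootstrap_sample(original_sample, rng)

# Repeating the process many times shows how the sample mean varies.
bootstrap_means = [
    statistics.mean(bootstrap_sample(original_sample, rng)) for _ in range(1000)
]
print(resample)
print(min(bootstrap_means), max(bootstrap_means))
```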


𝑪% prediction interval for an individual response: a range of plausible values of the response when an individual observation has a value of the explanatory variable equal to 𝑥₀.

categorical variable: a variable that places an individual into one of several groups.

center: a measure that describes where the middle of the distribution is. The center is a number that describes a typical value. For example, one way to think about center is that it could be the point in the distribution where about half of the observations are below it and half are above it.

Central Limit Theorem: as the sample size gets larger, the sampling distribution of the sample proportion (and, more generally, of the sample mean) becomes closer to a normal distribution.

chance experiment: an experiment involving making an observation in a situation where there is uncertainty about which of two or more possible outcomes will result.

coefficient of determination: the proportion of the variation in the response variable that can be explained by its linear relationship with the explanatory variable; denoted 𝑅² and pronounced “R squared.”
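
To make "proportion of the variation explained" concrete, here is a minimal sketch that fits a least squares line and then computes 𝑅² as 1 minus the ratio of leftover (residual) variation to total variation. The helper names and the data are hypothetical.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def r_squared(x, y):
    """Proportion of the variation in y explained by the line fitted on x."""
    intercept, slope = least_squares_line(x, y)
    y_bar = sum(y) / len(y)
    # Variation left over after using the line to predict y.
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variation of y around its mean.
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_residual / ss_total

# Hypothetical data, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(r_squared(x, y))
```

For this hypothetical data the result is about 0.6, meaning roughly 60% of the variation in the response is explained by the linear relationship.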

complements: two events, exactly one of which must occur, so that their probabilities sum to 1.

completely randomized block design: a design where the experimental units are divided into homogeneous groups called blocks and within each block, the experimental units are randomly assigned to treatments.

complex fraction: a fraction in which the numerator and/or the denominator include fractions.

conditional: assuming a certain condition has to be true.

conditional distribution: the counts or the relative frequencies of one variable restricted to only that value of a second variable.

conditional probability: the probability that an event occurs given that another event has occurred.
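
As a sketch of how a conditional probability can be estimated from counts, the example below restricts a small two-way table to one row and divides by that row's total. The table, its category labels, and the function name are all hypothetical.

```python
# Hypothetical two-way table of counts: (class year, way of commuting).
counts = {
    ("first-year", "bus"): 30,
    ("first-year", "car"): 20,
    ("senior", "bus"): 10,
    ("senior", "car"): 40,
}

def conditional_probability(counts, row, column):
    """Estimate P(column value | row value) from a table of counts."""
    # Restrict attention to the individuals in the given row.
    row_total = sum(c for (r, _), c in counts.items() if r == row)
    return counts[(row, column)] / row_total

print(conditional_probability(counts, "first-year", "bus"))  # 30 out of 50
print(conditional_probability(counts, "senior", "bus"))      # 10 out of 50
```

Comparing the two results is one way to see whether commuting method appears to depend on class year in this made-up table.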

confidence interval for a population proportion: a reasonable range of values within which we expect the population proportion to fall, with a chosen degree of confidence.
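
One common way to compute such an interval is the large-sample formula: sample proportion plus or minus a critical value times the standard error. The sketch below assumes that formula (the glossary does not state which method is used); the function name and the counts are hypothetical, and 1.96 is the approximate critical value for 95% confidence.

```python
from math import sqrt

def proportion_confidence_interval(successes, n, z_star):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n                            # sample proportion
    standard_error = sqrt(p_hat * (1 - p_hat) / n)   # estimated SD of p_hat
    margin_of_error = z_star * standard_error        # distance from p_hat to either endpoint
    return p_hat - margin_of_error, p_hat + margin_of_error

# Hypothetical survey: 120 of 400 respondents answered "yes".
lower, upper = proportion_confidence_interval(successes=120, n=400, z_star=1.96)
print(lower, upper)
```

This formula is usually considered reasonable only when the sample is large enough to contain a fair number of both successes and failures.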

confidence level, 𝑪: how much confidence we have in the method used to construct the interval.

confidence interval for the mean response: a range of plausible values for the mean of the response variable when 𝑥 = 𝑥₀.

confounding variable: a variable that was not accounted for in a study and may actually influence other variables in a study.

contingency table (two-way table): a table that displays the results of two categorical variables simultaneously.

continuity correction: an adjustment that is made when a discrete distribution is approximated by a continuous distribution.

continuous: including an infinite number of possible values.

continuous variable: a variable that includes an infinite number of possible values.

convenience sampling: sampling where the individuals are those who are most accessible to the researcher. A convenience sample is usually not random or representative of the population.

counterexample: an example that contradicts or disproves a general statement.

cubing: multiplying a number by itself twice.

data dictionary: the format for displaying and describing the variables in a data set.

data snooping: showing only the comparisons you want to show based on the boxplot, also called data fishing.

data set: a collection of data.

dependent sample: a sample where the same variable is recorded for each sample, and there is a logical way to pair the observations from one sample with the observations in the other sample.

design: use of attributes such as color, symbols, or lines/curves to encourage comparisons that create a clear purpose for the graphical display.

deviation from the mean: the distance between an observation ([latex]{x}[/latex]) in a data set and the mean [latex]\left(\bar{x}\right)[/latex] of the data set.

discrete: taking a fixed set of possible numerical values where it is not possible to get any value in between.

discrete random variable: a variable that takes a fixed set of possible numerical values and it is not possible to get any value in between.

discrete variable: a variable that takes a fixed set of possible values, and it is not possible to get any value in between.

dotplot: a graphical display of the distribution of a quantitative variable showing the variable’s possible values and the frequency of each value.

double-blind: when neither the subject nor those having contact with the subject know the treatment assignment.

efficiency: requiring minimal effort for the reader to understand the purpose of the graphical display.

empirical probability: the probability estimated from a chance experiment.

Empirical Rule: a guideline that predicts the percentage of observations within a certain number of standard deviations of the mean. Also known as the 68-95-99.7 Rule, which states that in a bell-shaped, unimodal distribution, almost all of the observed data values, [latex]x[/latex], lie within three standard deviations, [latex]\sigma[/latex], to either side of the mean, [latex]\mu[/latex]. More specifically, about 68% of observations in a data set will be within one standard deviation of the mean [latex]\left(\mu\pm\sigma\right)[/latex], about 95% of the observations in a data set will be within two standard deviations of the mean [latex]\left(\mu\pm2\sigma\right)[/latex], and about 99.7% of the observations in a data set will be within three standard deviations of the mean [latex]\left(\mu\pm3\sigma\right)[/latex].
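
The percentages in the Empirical Rule can be checked against an exact normal curve. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `proportion_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(proportion_within(k), 4))
```

The three printed areas are close to 0.68, 0.95, and 0.997. Real data that are only roughly bell-shaped will match these figures only approximately.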

endpoints: the smallest and largest values of the quantitative variable represented in the bin.

error sum of squares (SSError): the statistic measuring the total variation within the groups of interest.

event: an outcome or collection of outcomes for a chance experiment.

expected count: the count we would expect to see in a category or table cell if the null hypothesis were true.

experimental study: a statistical study that is based on data collected from designed experiments and is useful for determining cause and effect.

explanatory variable: the variable that is of interest to the researcher and is controlled by the researcher, also referred to as the independent variable or factor of interest.

exponent: the number of times to multiply the base by itself.

extrapolation: using the model to predict for values of the explanatory variable far outside the range in our data.

family-wise error rate: the probability of rejecting at least one of the true null hypotheses.

first quartile: the value below which one quarter of the data lies, also equal to the 25th percentile. Sometimes denoted Q1.

Fisher’s Exact Test: a test that can be done on a 2×2 contingency table when the expected frequencies do not meet the conditions for the chi-square test, requiring a simple random sample from the population and two categorical variables, each with two possible values.

five-number summary: the collection of the minimum, first quartile, median, third quartile, and maximum of the variable.
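
A five-number summary can be sketched with Python's standard `statistics` module as shown below. Textbooks and software use slightly different rules for locating quartiles, so the `method="inclusive"` choice here is an assumption and may not match the convention used elsewhere in this text; the function name and data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points that split the
    # data into four equal parts: Q1, the median, and Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical data, used only for illustration.
print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```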

frequency: the number of times an event or a value occurs. It is commonly referred to as the count.


frequency table: a table that lists the number of observations (the frequency or count) of each unique value of a categorical variable.

generalize: to extend the conclusions from our analysis of a sample to the population, which is justified when the sample is representative of the population.

grand mean: the mean of all the data values.

group of interest: the group from which data is collected. This can sometimes be referred to as the sample.

group sum of squares (SSGroup): the total variation between the groups of interest.

heat map: a representation of data in the form of a map or diagram where data values are grouped into different colors.

histogram: a graphical display that groups observations into bins rather than having a single dot for each observation.

independent: when one event has no effect on the probability of another event occurring.

independent samples: samples where one is selected from one population and another is independently selected from the second population.

indicator variable: a binary variable with only two values: 0 and 1.

influential point: an observation that does not fit the trend of the data and that, if removed, would noticeably change the fitted line.

interaction term: a variable that represents an interaction between two variables.

interquartile range: the quantity Q3–Q1. Sometimes denoted IQR.

Least Squares Regression (LSR) analysis: determining the equation of a line of best fit to make predictions based on an existing data set; also described as linear modeling.


left-skewed (negative skew): most of the data is bunched up to the right of the graph with a “tail” of infrequent values on the left (lower) end of the distribution.

linear: resembling a straight line.

lower outlier: an observation that is less than Q1 – 1.5 × (IQR).

lurking variable: a third variable not included in the study that impacts the values of both of the variables being considered.

margin of error, E: half the width of the confidence interval; the distance from the point estimate to either endpoint of the interval.

marginal distribution: the distribution of one of the variables with no regard to the other variable whatsoever.

maximum: the largest observation or value.

mean: an average of the values calculated by adding the values and then dividing the total by the number of values in the data set.

mean square: the sum of squares divided by the degrees of freedom associated with the respective source (i.e., Group or Error).
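
The sketch below ties together the grand mean, the group and error sums of squares, and the mean squares for a one-way layout. It assumes the standard degrees of freedom (number of groups minus 1 for Group, total observations minus number of groups for Error), which this glossary does not spell out; the function name and data are hypothetical.

```python
def one_way_anova_table(groups):
    """Compute sums of squares and mean squares for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation between the groups: group means around the grand mean.
    ss_group = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation within the groups: observations around their own group mean.
    ss_error = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_group = len(groups) - 1                 # assumed: k - 1
    df_error = len(all_values) - len(groups)   # assumed: N - k

    return {
        "grand_mean": grand_mean,
        "ss_group": ss_group,
        "ss_error": ss_error,
        "ms_group": ss_group / df_group,
        "ms_error": ss_error / df_error,
    }

# Hypothetical data: three groups of three observations each.
table = one_way_anova_table([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(table)
```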

median: the “middlemost” number when the values are arranged in order.

minimum: the smallest observation or value.

modality: the number of peaks in the distribution of a data set.

multimodal: three or more prominent peaks in the distribution.

multivariate: displaying more than one variable on a graphical display to encourage the reader to make comparisons.

mutually-exclusive events: events that cannot occur at the same time.

negative trend: the response variable tends to decrease as the explanatory variable increases.

non-response bias: when an individual chosen for a sample cannot be contacted or decides to not participate in the study or research. This type of bias occurs after the sample has been selected and can create potential bias in the data collected.

normal distribution: a distribution where 𝑥 is a continuous random variable, the distribution is symmetrical, and there is a single peak around the mean.

nuisance factors: factors that are kept the same across all levels of the factor or are explicitly controlled in the experimental design. These factors are not of interest in the study but may affect a change in the response variable.

null hypothesis: a baseline assumption about a population parameter of interest; what we assume to be true to begin with.

observational study: a study where a researcher will observe an outcome without changing who is and who is not exposed to some sort of treatment.


observational units: individuals or items whose characteristics we are interested in.

one-sample t interval: a confidence interval for a population mean, computed from a single sample using the t distribution.

one-way ANOVA: a statistical test for comparing and making inferences about means associated with two or more groups, also called the one-factor ANOVA.

outlier: an unusual or extreme value, given the other values in the data set.

P-value: the probability of obtaining a test statistic at least as extreme (in the direction of the alternative hypothesis) as the one that is actually seen if the null hypothesis is true.

pair-wise comparisons: comparisons between two things.

paired samples: samples chosen in a way that results in the observations in one sample being paired with the observations in the other sample; also called dependent samples.

paired t-test: a test comparing the mean of the differences, 𝜇𝑑, to a hypothesized value (often 0); also called a dependent t-test.
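
A minimal sketch of the test statistic for a paired t-test follows. It uses the standard formula (mean difference minus hypothesized value, divided by the standard error of the differences), which is not written out in this glossary; the function name and the before/after measurements are hypothetical. Turning the statistic into a P-value requires a t distribution, which the Python standard library does not provide, so that step is left out.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(before, after, hypothesized_mean_difference=0.0):
    """t statistic for paired samples, based on the differences after - before."""
    differences = [a - b for a, b in zip(after, before)]
    n = len(differences)
    # Standard error of the mean difference: s_d / sqrt(n).
    standard_error = stdev(differences) / sqrt(n)
    return (mean(differences) - hypothesized_mean_difference) / standard_error

# Hypothetical paired measurements on four subjects.
before = [10, 12, 9, 11]
after = [12, 15, 10, 14]
print(paired_t_statistic(before, after))
```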

parameter: a numerical summary measure of a population.

partial slopes: the regression coefficients for explanatory variables in multiple linear regression.

percent: out of one hundred.

percentile: the value below which a certain percentage of the observations fall.

pie chart: a chart in which categories are represented by wedges in a circle and are proportional in size to the percentage of individuals/items in each category.

placebo: a harmless version of the treatment that does not contain any active ingredients (e.g., a sugar pill).

placebo effect: a positive response that people who believe they are receiving treatment for a condition have, even if what they are actually receiving is a placebo.

point estimate: a single value based on representative sample data that is a plausible estimate of the population parameter.

population: the entire collection of individuals or objects that you want to learn about.

population distribution: the distribution showing how individuals vary in a population.

positive trend: the response variable tends to increase as the explanatory variable increases.

practical significance: having results that are large enough to be meaningful in the real-world context.

precision: use of appropriate statistical transformations for the type of visualization.

probability: a numeric measure of how likely the event is to happen.

probability distribution: a distribution that includes all possible values of a random variable and the probabilities associated with those values.

probability model: a model that includes all possible outcomes of a chance experiment and the probabilities associated with those outcomes.


quantitative variable: a variable that takes numerical values that can be used in arithmetic.

randomization test: simulating many randomizations under the null hypothesis and calculating the proportion of randomizations that produce results at least as extreme as the result actually observed.
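
The following sketch shows one way a randomization test for a difference in two group means might be carried out. It is a two-sided version that counts a shuffled result as "extreme" when its absolute difference in means is at least as large as the observed one; that choice, the function name, and the data are assumptions made for illustration.

```python
import random

def randomization_test(group_a, group_b, num_randomizations=10_000, seed=0):
    """Estimate a two-sided P-value for a difference in means by reshuffling."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(num_randomizations):
        # Under the null hypothesis the group labels are interchangeable,
        # so reassign the pooled values to the two groups at random.
        rng.shuffle(pooled)
        new_a = pooled[: len(group_a)]
        new_b = pooled[len(group_a):]
        difference = sum(new_a) / len(new_a) - sum(new_b) / len(new_b)
        if abs(difference) >= abs(observed):
            extreme += 1
    return extreme / num_randomizations

# Hypothetical measurements for two treatment groups.
treatment = [23, 27, 31, 29, 35]
control = [20, 22, 25, 21, 26]
print(randomization_test(treatment, control))
```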

range: the maximum (or largest) value – the minimum (or smallest) value.


relative difference: indicates the amount of something that has changed by some percent relative to its original amount, expressed with the % symbol.
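
The distinction between an absolute difference and a relative difference is easiest to see with numbers. In this sketch the function names and the example rates are hypothetical.

```python
def absolute_difference(old_percent, new_percent):
    """Change between two percentages, measured in percentage points."""
    return new_percent - old_percent

def relative_difference(old_value, new_value):
    """Change as a percent of the original amount."""
    return (new_value - old_value) / old_value * 100

# A rate that rises from 4% to 5%:
print(absolute_difference(4, 5))   # change in percentage points
print(relative_difference(4, 5))   # change as a percent of the original 4%
```

Here the rate rises by 1 percentage point, but by 25% relative to where it started, which is why reports should say which kind of difference they mean.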

relative frequency: the proportion of observations that are in a particular category and can be expressed as a decimal or a percentage.

representative: when the characteristics of a sample tend to match the characteristics of the population.

residual: a representation of how far off a prediction calculated from the line is compared to the actual, observed 𝑦 value, illustrated by a vertical line; also called vertical error.

residual standard error, 𝒔𝒆: a measure of the variability in the residuals.

resistant: not strongly affected by skewness or extreme values in the data.

response bias: a systematic pattern of inaccurate responses to questions. This type of bias can occur when a person does not understand a question or feels influenced to respond to a question in a certain way. Response bias can also occur as a result of the wording of questions that are of a sensitive nature.

response variable: the variable that allows the researcher to objectively compare the differences in the levels of the factor of interest, also referred to as the dependent variable.


right-skewed (positive skew): most of the data is bunched up to the left of the graph with a “tail” of infrequent values on the right (upper) end of the distribution.

right triangle: a triangle that contains one right angle (90 degrees).

sample: a part of the population that is selected for study.

sample mean: the mean of a random sample.

sample space: the list of all possible outcomes of a chance experiment.

sampling distribution: the probability distribution of a sample statistic, such as a sample mean or sample proportion, as it varies from sample to sample.


sampling with replacement: sampling where after an individual is selected for the sample and data are recorded for that individual, they are “replaced” (put back into the population) before the next selection is made.

sampling without replacement: sampling where once an individual from the population is selected for the sample and data are recorded for that individual, they are not considered again when making additional selections from the population for that sample.

sampling frame: a numbered list of all the items in the population.

sampling variability: the tendency of samples to have different statistics (means, proportions) than the population as a whole due to randomness.

scatterplot: a graph used to visualize the relationship between bivariate data.

sensitivity: the probability that a person with the condition is correctly identified as having it.

shape: the overall pattern (left skewed, right skewed, symmetric) and the number of peaks (unimodal, bimodal, multimodal, uniform).

side-by-side bar chart: a chart in which two different categorical variables are compared in bars that are placed beside one another.

stacked bar chart: a chart in which two different categorical variables are compared in bars that are stacked on top of one another.

sign: the indicator of whether a number is positive or negative.

significance level: the cut-off for P-values at which we have enough evidence to reject the null hypothesis.

simple random sample: a sample chosen by a random mechanism, without replacement, from the population so that every sample of the given size is equally likely to be chosen.

simple random sampling: sampling where every sample of a given size has the same chance of being selected.

simplified fraction: a fraction with all the common factors removed from the numerator and denominator, also called a reduced fraction.

skew/skewness: a visual difference from symmetry in a data set.

specificity: the probability that a person without the condition is correctly identified as not having it.
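
Sensitivity and specificity can both be estimated from the counts of correct and incorrect test results. The sketch below uses hypothetical function names and hypothetical screening counts.

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of people WITH the condition who test positive."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Proportion of people WITHOUT the condition who test negative."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical screening results:
# 100 people have the condition: 90 test positive, 10 test negative.
# 200 people do not have it: 160 test negative, 40 test positive.
print(sensitivity(true_positives=90, false_negatives=10))
print(specificity(true_negatives=160, false_positives=40))
```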

spread: a measure of how far apart the data are. In this lesson, the range is used to measure spread. The range is the difference between the maximum value and minimum value.

squaring: multiplying a number by itself once.

standard deviation: a measure of how spread out observations are from the mean.

standard error: the estimated standard deviation of a sample statistic, such as the sample proportion.

standard normal distribution: a normal distribution with a mean of 0 and a standard deviation of 1.

standardized residuals: values that standardize the residuals so that if the null hypothesis is assumed to be true, they can be interpreted as normal z-scores; also sometimes called Standardized Pearson Residuals.

standardized value: the number of standard deviations an observation is away from the mean. Also referred to as a z-score.
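
As a small worked illustration, a standardized value is found by subtracting the mean from the observation and dividing by the standard deviation. The function name and the exam-score numbers below are hypothetical.

```python
def z_score(observation, mean, standard_deviation):
    """Number of standard deviations an observation lies from the mean."""
    return (observation - mean) / standard_deviation

# Hypothetical exam: mean 70, standard deviation 10.
print(z_score(85, mean=70, standard_deviation=10))   # above the mean, so positive
print(z_score(60, mean=70, standard_deviation=10))   # below the mean, so negative
```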

statistic: a numerical summary measure of a sample.

statistical investigative question: a question that can be used as the starting point for an investigation that involves data collection and data analysis.

statistical significance: having enough evidence against the null hypothesis to convince us to reject the null hypothesis.

stratified sampling: sampling where a population is divided into two or more groups (called strata) according to some criterion and a sample is selected from each stratum using simple random sampling or systematic sampling.

study group: a group of people joining in the study of a particular topic and usually meeting at scheduled intervals to discuss individual observations, reading, and research.

survey question: a question answered with a single numerical value.

symmetric: the left and right sides of the distribution (closely) mirror each other. If you drew a vertical line down the center of the distribution and folded the distribution in half, the left and right sides would closely match one another.

systematic sampling: sampling where every individual in the population is given a number and individuals are chosen at regular intervals, with a random starting point (usually among the first several).
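
A systematic sample can be sketched in a few lines. This version assumes the random starting point is chosen from the first `interval` individuals and that every `interval`-th individual is taken after that; the function name and the numbered population are hypothetical.

```python
import random

def systematic_sample(population, interval, rng):
    """Select every `interval`-th individual after a random starting point."""
    # Random starting position among the first `interval` individuals.
    start = rng.randrange(interval)
    return population[start::interval]

# Hypothetical population of 100 numbered individuals.
population = list(range(1, 101))
rng = random.Random(5)   # fixed seed so runs are repeatable
print(systematic_sample(population, interval=10, rng=rng))
```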

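t-score: the number of estimated standard errors a sample statistic is away from a hypothesized value; the counterpart of a z-score when the t distribution is used.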

test statistic: a measure of the distance between the sample statistic and the null hypothesis value in terms of the standard error of the statistic.

The Law of Large Numbers: the more times we repeat a chance experiment, the closer we can expect the empirical probability calculated from our chance experiment to be to the true probability.
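
A quick simulation can illustrate the idea. The sketch below simulates flips of a fair coin (true probability of heads 0.5) and reports the empirical probability for increasing numbers of flips; the function name and the seed are arbitrary choices made for illustration.

```python
import random

def empirical_probability_of_heads(num_flips, rng):
    """Proportion of simulated fair-coin flips that come up heads."""
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

rng = random.Random(42)   # fixed seed so runs are repeatable
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, empirical_probability_of_heads(n, rng))
```

The printed proportions tend to settle near 0.5 as the number of flips grows, although any single run can wander, especially for small numbers of flips.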

third quartile: the value below which three quarters of the data lies, also equal to the 75th percentile. Sometimes denoted Q3.

treatments: the different levels of the factor of interest you are changing.

Tukey method: a method that adjusts the length of the confidence interval (to ensure an overall level of confidence) and the P-value (to ensure an overall significance level for all pair-wise comparisons).

two-sample t-test: a hypothesis test for comparing two population means.

two-sample test of proportions: a test that tests a claim about two population proportions.

two-way table: a table giving the counts for each value of the distributions of a categorical variable for multiple populations, also called a contingency table.

type I error: rejection of a correct null hypothesis.

type II error: not rejecting a null hypothesis that is actually incorrect.

unbiased: resulting in a representative sample of the population.

undercoverage: when some groups of the population are left out of the sampling process and the individuals in these groups do not have an equal chance of being selected for the sample.

uniform: no prominent peaks in the distribution.

unimodal: one prominent peak in the distribution.

unit fraction: a fraction whose numerator is 1 and whose denominator is a positive integer.

upper outlier: an observation that is greater than Q3 + 1.5 × (IQR).

variability: a measure of how dispersed (spread out) the data are. It is often referred to as the spread, or dispersion, of a data set.

variable of interest: a measurable variable that changes in an experimental observation.

variables: the characteristics we record on the observational units. These may be quantitative or categorical variables.

variance: the standard deviation squared.

vary: differ.

voluntary response bias: a form of bias that arises when individuals choose for themselves whether to take part, so the sample is not random or representative of the population. The people who volunteer for a study or survey may be more inclined to respond to questions or report certain behaviors.

width: the difference between the values of the endpoints of a bin.

z critical value (𝒛^∗): the point on the standard normal distribution such that the proportion of area under the curve between −𝑧^∗ and +𝑧^∗ is 𝐶, the confidence level.
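
The critical value for a given confidence level can be found from the standard normal distribution: if the middle area is 𝐶, then each tail holds (1 − 𝐶)/2, so the area to the left of +𝑧^∗ is (1 + 𝐶)/2. The sketch below uses `NormalDist` from Python's standard `statistics` module; the function name is hypothetical.

```python
from statistics import NormalDist

def z_critical_value(confidence_level):
    """Return z* such that the area between -z* and +z* equals the confidence level."""
    # The area to the left of +z* is the middle area plus one tail.
    return NormalDist().inv_cdf((1 + confidence_level) / 2)

for c in (0.90, 0.95, 0.99):
    print(c, round(z_critical_value(c), 3))
```

For 95% confidence this gives a value close to 1.96.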

z-score: a measure of a value’s distance from the mean in units of standard deviation, also called standardized score.

Student Resources