Glossary

[latex]s[/latex]
the standard deviation of a sample of observations.
[latex]\sigma[/latex]
the standard deviation of a population of observations.
[latex]s^{2}[/latex]
the variance of a sample of observations.
[latex]\sigma^{2}[/latex]
the variance of a population of observations.
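The difference between the sample and population versions of these four symbols can be checked numerically. The sketch below uses Python's standard `statistics` module; the data values are invented for illustration and do not come from this text.

```python
import statistics

# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Treating the data as a sample: the sum of squared deviations is divided by n - 1.
s_squared = statistics.variance(data)   # sample variance, s^2
s = statistics.stdev(data)              # sample standard deviation, s

# Treating the data as a whole population: the sum is divided by n.
sigma_squared = statistics.pvariance(data)  # population variance, sigma^2
sigma = statistics.pstdev(data)             # population standard deviation, sigma

print(s_squared, s, sigma_squared, sigma)
```

For the same data, the sample versions come out slightly larger than the population versions because of the smaller divisor.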
absolute difference
the change between two percentages, expressed as a number of percentage points.

acceptance sampling
sampling where a random sample is drawn from each lot of a product, the items in the sample are tested, and if the number of nonconforming items is above a pre-determined threshold, then the whole lot of the product is rejected.

alternative hypothesis
what we consider to be plausible if the null hypothesis is false.
bar graph
a graph in which the categories are represented by bars that are separated from each other.

base (1)
the horizontal side of a right triangle.

base (2)
the number that is repeatedly multiplied in an exponential expression.
Bernoulli trial
a chance experiment with exactly two possible outcomes, the same probability of success for every trial, and trials that are independent from one another.
biased
tending to produce samples that are not representative of the population.
bimodal
two prominent peaks in the distribution.
bin
a range of values that the quantitative variable can take.
binomial experiment
an experiment consisting of a fixed number, [latex]n[/latex], of independent Bernoulli trials that counts the number of successes out of [latex]n[/latex] trials.
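As an illustration, the probability of a given number of successes in a binomial experiment can be computed with the standard binomial formula. The formula is not stated in this glossary, and the function name `binomial_pmf` is a hypothetical helper, so treat this as a sketch.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical example: 4 flips of a fair coin, exactly 2 heads.
prob_two_heads = binomial_pmf(2, 4, 0.5)
print(prob_two_heads)
```

Summing `binomial_pmf(k, n, p)` over every k from 0 to n gives 1, since exactly one of those counts must occur.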
bivariate data
two quantitative variables.
blinding
nondisclosure of the treatment an experimental unit is receiving.
block
a group of subjects that are similar, but blocks differ in ways that might affect the outcome of the experiment.
blocking
grouping together of homogeneous (similar) experimental units followed by the random assignment of the experimental units within each group to a treatment.
bootstrap sample
a sample that is selected from the values in the original sample.
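A minimal sketch of drawing one bootstrap sample follows. It assumes the usual convention that values are drawn with replacement and that the bootstrap sample has the same size as the original sample; the data and the helper name `bootstrap_sample` are hypothetical.

```python
import random

def bootstrap_sample(sample, rng):
    """Draw len(sample) values from sample, with replacement (assumed convention)."""
    return rng.choices(sample, k=len(sample))

original = [12, 15, 9, 20, 17]   # hypothetical observations
rng = random.Random(0)           # seeded so the run is repeatable
resample = bootstrap_sample(original, rng)
print(resample)
```

Because the draws are made with replacement, some original values may appear more than once in the bootstrap sample and others not at all.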
š‘Ŗ% prediction interval for an individual response
a range of plausible values of the response when an individual observation has a value of the explanatory variable equal to š‘„0.

categorical variable
a variable that places an individual into one of several groups.

center
a measure that describes where the middle of the distribution is. The center is a number that describes a typical value. For example, one way to think about center is that it could be the point in the distribution where about half of the observations are below it and half are above it.
Central Limit Theorem
as the sample size gets larger, the distribution of the sample proportion will become closer to a normal distribution.
chance experiment
an experiment involving making an observation in a situation where there is uncertainty about which of two or more possible outcomes will result.
coefficient of determination
the proportion of the variation in the response variable that can be explained by its linear relationship with the explanatory variable; denoted [latex]R^{2}[/latex] and pronounced “R squared.”
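One way to see what "proportion of the variation explained" means is to compute it directly as 1 minus the ratio of unexplained variation to total variation. The sketch below fits a least squares line by hand; the function name `r_squared` and the data are assumptions made for illustration.

```python
def r_squared(x, y):
    """Coefficient of determination for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope and intercept of the least squares line.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Variation left unexplained by the line, and total variation in y.
    ss_error = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_error / ss_total

# Hypothetical data, invented for illustration only.
print(r_squared([1, 2, 3], [1, 3, 2]))
```

When every point lies exactly on the line, the unexplained variation is 0 and the result is 1.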
complements
two events such that exactly one of them must occur, so their probabilities sum to 1.
completely randomized block design
a design where the experimental units are divided into homogeneous groups called blocks and within each block, the experimental units are randomly assigned to treatments.
complex fraction
a fraction in which the numerator and/or the denominator include fractions.
conditional
assuming a certain condition has to be true.
conditional distribution
the counts or the relative frequencies of one variable restricted to a single value of a second variable.
conditional probability
the probability that an event occurs, given that another event has occurred.
confidence interval for a population proportion
a range of plausible values within which we expect the population proportion to fall, with a chosen degree of confidence.
confidence level, [latex]C[/latex]
how much confidence we have in the method used to construct the interval.
confidence interval for the mean response
a range of plausible values for the mean of the response variable when [latex]x = x_0[/latex].
confounding variable
a variable that was not accounted for in a study and may actually influence other variables in a study.
contingency table (two-way table)
a table that displays the results of two categorical variables simultaneously.
continuity correction
an adjustment that is made when a discrete distribution is approximated by a continuous distribution.
continuous
including an infinite number of possible values.
continuous variable
a variable that includes an infinite number of possible values.
convenience sampling
sampling where the individuals are those who are most accessible to the researcher. A convenience sample is usually not random or representative of the population.
counterexample
an example that contradicts or disproves a general statement.
cubing
raising a number to the third power, that is, multiplying three copies of the number together.
data dictionary
the format for displaying and describing the variables in a data set.
data snooping
showing only the comparisons you want to show based on the boxplot; also called data fishing.
data set
a collection of data.
dependent sample
a sample where the same variable is recorded for each sample, and there is a logical way to pair the observations from one sample with the observations in the other sample.
design
use of attributes such as color, symbols, or lines/curves to encourage comparisons that create a clear purpose for the graphical display.
deviation from the mean
the distance between an observation ([latex]{x}[/latex]) in a data set and the mean [latex]\left(\bar{x}\right)[/latex] of the data set.
discrete
taking a fixed set of possible numerical values where it is not possible to get any value in between.
discrete random variable
a variable that takes a fixed set of possible numerical values and it is not possible to get any value in between.
discrete variable
a variable that takes a fixed set of possible values, and it is not possible to get any value in between.
dotplot
a graphical display of the distribution of a quantitative variable showing the variable’s possible values and the frequency of each value.
double-blind
when neither the subject nor those having contact with the subject know the treatment assignment.
efficiency
requiring minimal effort for the reader to understand the purpose of the graphical display.
empirical probability
the probability estimated from a chance experiment.
Empirical Rule
a guideline that predicts the percentage of observations within a certain number of standard deviations. Also known as the 68-95-99.7 Rule, which states that in a bell-shaped, unimodal distribution, almost all of the observed data values, [latex]x[/latex], lie within three standard deviations, [latex]\sigma[/latex], to either side of the mean, [latex]\mu[/latex]. More specifically, about 68% of observations in a data set will be within one standard deviation of the mean [latex]\left(\mu\pm\sigma\right)[/latex], about 95% of the observations in a data set will be within two standard deviations of the mean [latex]\left(\mu\pm2\sigma\right)[/latex], and about 99.7% of the observations in a data set will be within three standard deviations of the mean [latex]\left(\mu\pm3\sigma\right)[/latex].
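The three percentages in the Empirical Rule can be reproduced from the normal curve itself. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `proportion_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k: float) -> float:
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Print the proportion within 1, 2, and 3 standard deviations.
for k in (1, 2, 3):
    print(k, round(proportion_within(k), 4))
```

The printed proportions round to about 68%, 95%, and 99.7%, which is where the rule gets its alternative name.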
endpoints
the smallest and largest values of the quantitative variable represented in the bin.
error sum of squares
the statistic measuring the variation within the groups.
error sum of squares (SSError)
the total variation within the groups of interest.
event
an outcome or collection of outcomes for a chance experiment.
expected count
the number of the variable that we expect.
experimental study
a statistical study based on data collected from designed experiments; it is useful for determining cause and effect.
explanatory variable
the variable that is of interest to the researcher and is controlled by the researcher; also referred to as the independent variable or factor of interest.
exponent
the number of times to multiply the base by itself.
extrapolation
using the model to predict for values of the explanatory variable far outside the range in our data.
family-wise error rate
the probability of rejecting at least one of the true null hypotheses.
first quartile
the value below which one quarter of the data lies, also equal to the 25th percentile. Sometimes denoted Q1.
Fisher’s Exact Test
a test that can be done on a 2×2 contingency table when the expected frequencies do not meet the conditions for the chi-square test, requiring a simple random sample from the population and two categorical variables, each with two possible values.
five-number summary
the collection of the minimum, first quartile, median, third quartile, and maximum of the variable.
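A five-number summary can be sketched in a few lines of Python. Textbooks and software use slightly different conventions for locating quartiles; the `method="inclusive"` setting below is one such convention and is an assumption, as are the data and the helper name `five_number_summary`.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    # method="inclusive" is one of several quartile conventions (assumed here).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical observations
print(five_number_summary(data))
```

A different quartile convention may give slightly different values for Q1 and Q3 on the same data.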
frequency
the number of times an event or a value occurs. It is commonly referred to as the count.
frequency table
a table that lists the number of observations (the frequency or count) of each unique value of a categorical variable.
generalize
to extend the conclusions of our analysis from the sample to the population, which is justified when the sample is representative of the population.
grand mean
the mean of all the data values.
group of interest
the group from which data is collected. This can sometimes be referred to as the sample.
group sum of squares (SSGroup)
the total variation between the groups of interest.
heat map
a representation of data in the form of a map or diagram where data values are grouped into different colors.
histogram
a graphical display that groups observations into bins rather than having a single dot for each observation.
independent
when one event has no effect on the probability of another event occurring.
independent samples
samples where one is selected from one population and another is independently selected from the second population.
indicator variable
a binary variable with only two values: 0 and 1.
influential point
an observation that does not fit the trend of the data and whose presence noticeably changes the fitted line.
interaction term
a variable that represents an interaction between two variables.
interquartile range
the quantity Q3–Q1. Sometimes denoted IQR.
Least Squares Regression (LSR) analysis
determining the equation of a line of best fit to make predictions based on an existing data set; also described as linear modeling.
left-skewed (negative skew)
most of the data is bunched up to the right of the graph with a “tail” of infrequent values on the left (lower) end of the distribution.
linear
resembling a straight line.
lower outlier
an observation that is less than Q1 – 1.5 × (IQR).
lurking variable
a third variable not included in the study that impacts the values of both of the variables being considered.
margin of error, E
half the width of the confidence interval; the distance from the point estimate to either endpoint of the interval.
marginal distribution
the distribution of one of the variables with no regard to the other variable whatsoever.
maximum
the largest observation or value.
mean
an average of the values calculated by adding the values and then dividing the total by the number of values in the data set.
mean square
the sum of square values divided by the degrees of freedom associated with the respective source (i.e., Group or Error).
median
the “middlemost” number.
minimum
the smallest observation or value.
modality
the number of peaks in the distribution of a data set.
multimodal
three or more prominent peaks in the distribution.
multivariate
displaying more than one variable on a graphical display to encourage the reader to make comparisons.
mutually-exclusive events
events that cannot occur at the same time.
negative trend
the response variable tends to decrease as the explanatory variable increases.
non-response bias
when an individual chosen for a sample cannot be contacted or decides to not participate in the study or research. This type of bias occurs after the sample has been selected and can create potential bias in the data collected.
normal distribution
a distribution where [latex]x[/latex] is a continuous random variable, the distribution is symmetrical, and there is a single peak around the mean.
nuisance factors
factors that are kept the same across all levels of the factor or are explicitly controlled in the experimental design. These factors are not of interest in the study but may affect a change in the response variable.
null hypothesis
a baseline assumption about a population parameter of interest; what we assume to be true to begin with.
observational study
a study where a researcher will observe an outcome without changing who is and who is not exposed to some sort of treatment.
observational units
individuals or items whose characteristics we are interested in.
one-sample t interval
no description
one-way ANOVA
a statistical test for comparing and making inferences about means associated with two or more groups; also called the one-factor ANOVA.
outlier
an unusual or extreme value, given the other values in the data set.
P-value
the probability of obtaining a test statistic at least as extreme (in the direction of the alternative hypothesis) as the one that is actually seen if the null hypothesis is true.
pair-wise comparisons
comparisons between two things.
paired samples
samples chosen in a way that results in the observations in one sample being paired with the observations in the other sample; also called dependent samples.
paired t-test
a test comparing the mean of the differences, [latex]\mu_d[/latex], to a hypothesized value, which is often 0; also called a dependent t-test.
parameter
a numerical summary measure of a population.
partial slopes
the regression coefficients for explanatory variables in multiple linear regression.
percent
out of one hundred.
percentile
the value below which a given percentage of the observations fall.
pie chart
a chart in which categories are represented by wedges in a circle and are proportional in size to the percentage of individuals/items in each category.
placebo
a harmless version of the treatment that does not contain any active ingredients (e.g., a sugar pill).
placebo effect
a positive response that people who believe they are receiving treatment for a condition have, even if what they are actually receiving is a placebo.
point estimate
a single value based on representative sample data that is a plausible estimate of the population parameter.
population
the entire collection of individuals or objects that you want to learn about.
population distribution
the distribution showing how individuals vary in a population.
positive trend
the response variable tends to increase as the explanatory variable increases.
practical significance
having results that are meaningful.
precision
use of appropriate statistical transformations for the type of visualization.
probability
a numeric measure of how likely the event is to happen.
probability distribution
a distribution that includes all possible values of a random variable and the probabilities associated with those values.
probability model
a model that includes all possible outcomes of a chance experiment and the probabilities associated with those outcomes.
quantitative variable
a variable that takes numerical values that can be used in arithmetic.
randomization test
simulating many randomizations under the null hypothesis and calculating the proportion of randomizations that produce results at least as extreme as the observed result.
range
the maximum (or largest) value – the minimum (or smallest) value.
relative difference
indicates the amount of something that has changed by some percent relative to its original amount, expressed with the % symbol.
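The contrast between an absolute difference (in percentage points) and a relative difference (in percent) is easiest to see with numbers. The sketch below uses a made-up example of a rate rising from 4% to 5%; both function names are hypothetical.

```python
def absolute_difference(old_pct: float, new_pct: float) -> float:
    """Change between two percentages, in percentage points."""
    return new_pct - old_pct

def relative_difference(old_value: float, new_value: float) -> float:
    """Change relative to the original amount, expressed as a percent."""
    return (new_value - old_value) / old_value * 100

# Hypothetical example: a rate rises from 4% to 5%.
points = absolute_difference(4.0, 5.0)    # change in percentage points
percent = relative_difference(4.0, 5.0)   # change as a percent of the original
print(points, percent)
```

Here the same change is 1 percentage point in absolute terms but a 25% increase in relative terms, which is why the two should not be used interchangeably.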

relative frequency
the proportion of observations that are in a particular category and can be expressed as a decimal or a percentage.

representative
when the characteristics of a sample tend to match the characteristics of the population.
residual
a representation of how far off a prediction calculated from the line is compared to the actual, observed [latex]y[/latex] value, illustrated by a vertical line; also called vertical error.
residual standard error
š’”š’†, is a measure of the variability in the residuals.
resistant
not strongly affected by skewness or extreme values in the data.
response bias
a systematic pattern of inaccurate responses to questions. This type of bias can occur when a person does not understand a question or feels influenced to respond to a question in a certain way. Response bias can also occur as a result of the wording of questions that are of a sensitive nature.
response variable
the variable that allows the researcher to objectively compare the differences in the levels of the factor of interest; also referred to as the dependent variable.
right-skewed (positive skew)
most of the data is bunched up to the left of the graph with a “tail” of infrequent values on the right (upper) end of the distribution.
right triangle
a triangle that contains one right angle (90 degrees).
sample
a part of the population that is selected for study.
sample mean
the mean of a random sample.
sample space
the list of all possible outcomes of a chance experiment.
sampling distribution
the probability distribution of a sample statistic, such as a sample mean or sample proportion, as it varies from sample to sample.
sampling with replacement
sampling where after an individual is selected for the sample and data are recorded for that individual, they are “replaced” (put back into the population) before the next selection is made.

sampling without replacement
sampling where once an individual from the population is selected for the sample and data are recorded for that individual, they are not considered again when making additional selections from the population for that sample.

sampling frame
a numbered list of all the items in the population.
sampling variability
the tendency of samples to have different statistics (means, proportions) than the population as a whole due to randomness.
scatterplot
a graph used to visualize the relationship between bivariate data.
sensitivity
the probability that a person with the condition is correctly identified as having it.
shape
the overall pattern (left skewed, right skewed, symmetric) and the number of peaks (unimodal, bimodal, multimodal, uniform).
side-by-side bar chart
a chart in which two different categorical variables are compared in bars that are placed beside one another.
stacked bar chart
a chart in which two different categorical variables are compared in bars that are stacked on top of one another.
sign
the indicator of whether a number is positive or negative.
significance level
the cut-off for P-values at which we have enough evidence to reject the null hypothesis.
simple random sample
a sample chosen by a random mechanism, without replacement, from the population so that every sample of the given size is equally likely to be chosen.
simple random sampling
sampling where every sample of a given size has the same chance of being selected.
simplified fraction
a fraction with all the common factors removed from the numerator and denominator, also called a reduced fraction.
skew/skewness
a visual difference from symmetry in a data set.
specificity
the probability that a person without the condition is correctly identified as not having it.
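Sensitivity and specificity can both be estimated from the four counts of a screening test's results. The sketch below uses invented counts and hypothetical function names.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Proportion of people WITH the condition who are correctly identified."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives: int, false_positives: int) -> float:
    """Proportion of people WITHOUT the condition who are correctly identified."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical results: 100 people with the condition, 200 without it.
sens = sensitivity(true_positives=90, false_negatives=10)
spec = specificity(true_negatives=160, false_positives=40)
print(sens, spec)
```

Note that each measure uses a different group as its denominator: sensitivity looks only at people who have the condition, specificity only at people who do not.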
spread
a measure of how far apart the data are. In this lesson, the range is used to measure spread. The range is the difference between the maximum value and minimum value.
squaring
raising a number to the second power, that is, multiplying the number by itself.
standard deviation
a measure of how spread out observations are from the mean.
standard error
the estimated standard deviation of sample proportions.
standard normal distribution
a normal distribution with a mean of 0 and a standard deviation of 1.
standardized residuals
values that standardize the residuals so that if the null hypothesis is assumed to be true, they can be interpreted as normal z-scores; also sometimes called Standardized Pearson Residuals.
standardized value
the number of standard deviations an observation is away from the mean. Also referred to as a z-score.
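As a small worked example, a standardized value is the deviation from the mean divided by the standard deviation. The numbers and the function name `z_score` below are hypothetical.

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Hypothetical example: a score of 85 when the mean is 70 and the standard deviation is 10.
z = z_score(85, 70, 10)
print(z)
```

A positive result means the observation is above the mean; a negative result means it is below.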
statistic
a numerical summary measure of a sample.
statistical investigative question
a question that can be used as the starting point for an investigation that involves data collection and data analysis.
statistical significance
having enough evidence against the null hypothesis to convince us to reject the null hypothesis.
stratified sampling
sampling where a population is divided into two or more groups (called strata) according to some criterion and a sample is selected from each stratum using simple random sampling or systematic sampling.
study group
a group of people joining in the study of a particular topic and usually meeting at scheduled intervals to discuss individual observations, reading, and research.
survey question
a question answered with a single numerical value.
symmetric
the left and right sides of the distribution (closely) mirror each other. If you drew a vertical line down the center of the distribution and folded the distribution in half, the left and right sides would closely match one another.
systematic sampling
sampling where every individual in the population is given a number and individuals are chosen at regular intervals, with a random starting point (usually among the first several).
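The procedure for systematic sampling can be sketched as follows. The sketch assumes the random starting point is chosen among the first `step` individuals; the population and the helper name `systematic_sample` are hypothetical.

```python
import random

def systematic_sample(population, step, rng):
    """Pick a random start among the first `step` individuals, then every step-th one."""
    start = rng.randrange(step)   # random starting position: 0, 1, ..., step - 1
    return population[start::step]

numbered_population = list(range(1, 101))   # individuals numbered 1 to 100
rng = random.Random(1)                      # seeded so the run is repeatable
chosen = systematic_sample(numbered_population, 10, rng)
print(chosen)
```

With 100 individuals and a step of 10, this selects 10 individuals whose numbers are exactly 10 apart.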
t-score
no description
test statistic
a measure of the distance between the sample statistic and the null hypothesis value in terms of the standard error of the statistic.
The Law of Large Numbers
as we increase the number of times we repeat a chance experiment, the closer we can expect the empirical probability calculated from our chance experiment to be to the true probability.
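A short simulation can illustrate the Law of Large Numbers. The sketch below simulates flips of a fair coin (true probability of heads 0.5, an assumption of the example) and reports the empirical probability for increasing numbers of trials; the function name is hypothetical.

```python
import random

def empirical_probability_of_heads(num_trials: int, rng) -> float:
    """Simulate num_trials fair coin flips and return the proportion that are heads."""
    heads = sum(rng.random() < 0.5 for _ in range(num_trials))
    return heads / num_trials

rng = random.Random(42)   # seeded so the run is repeatable
for num_trials in (10, 1_000, 100_000):
    print(num_trials, empirical_probability_of_heads(num_trials, rng))
```

The empirical probability for a small number of flips can be far from 0.5, while the value for 100,000 flips is expected to be very close to it.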
third quartile
the value below which three quarters of the data lie, also equal to the 75th percentile. Sometimes denoted as Q3.
treatments
the different levels of the factor of interest you are changing.
Tukey method
a method that adjusts the length of the confidence interval (to ensure an overall level of confidence) and the P-value (to ensure an overall significance level for all pair-wise comparisons).
two-sample t-test
a hypothesis test for comparing two population means.
two-sample test of proportions
a test that tests a claim about two population proportions.
two-way table
a table giving the counts for each value of the distributions of a categorical variable for multiple populations; also called a contingency table.
type I error
rejection of a correct null hypothesis.
type II error
not rejecting a null hypothesis that is actually incorrect.
unbiased
resulting in a representative sample of the population.
undercoverage
when some groups of the population are left out of the sampling process and the individuals in these groups do not have an equal chance of being selected for the sample.
uniform
no prominent peaks in the distribution.
unimodal
one prominent peak in the distribution.
unit fraction
a fraction whose numerator is 1 and whose denominator is a positive integer.
upper outlier
an observation that is greater than Q3 + 1.5 × (IQR).
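The 1.5 × (IQR) rule for lower and upper outliers can be sketched as below. The quartile convention (`method="inclusive"`), the data, and the helper name `outlier_fences` are assumptions made for illustration; other quartile conventions may give slightly different cut-offs.

```python
import statistics

def outlier_fences(values):
    """Return the cut-offs (Q1 - 1.5*IQR, Q3 + 1.5*IQR) used to flag outliers."""
    # method="inclusive" is one of several quartile conventions (assumed here).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]   # hypothetical observations
lower_fence, upper_fence = outlier_fences(data)
lower_outliers = [v for v in data if v < lower_fence]
upper_outliers = [v for v in data if v > upper_fence]
print(lower_fence, upper_fence, lower_outliers, upper_outliers)
```

In this made-up data set only the value 30 falls beyond a cut-off, so it is the single upper outlier and there are no lower outliers.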
variability
a measure of how dispersed (spread out) the data are. It is often referred to as the spread, or dispersion, of a data set.
variable of interest
a measurable variable that changes in an experimental observation.
variables
the characteristics we record on the observational units. These may be quantitative or categorical variables.
variance
the standard deviation squared.
vary
differ.
voluntary response bias
a form of bias because the sample is not random or representative of the population. The people who volunteer for a study or survey may be more inclined to respond to questions or report certain behaviors.
width
the difference between the upper and lower endpoints of a bin.
z critical value ([latex]z^{*}[/latex])
the point on the standard normal distribution such that the proportion of area under the curve between [latex]-z^{*}[/latex] and [latex]+z^{*}[/latex] is [latex]C[/latex], the confidence level.
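The z critical value for a given confidence level can be found with the inverse of the standard normal cumulative distribution. If the central area is C, the area to the left of the upper cut-off is (1 + C) / 2. The sketch below uses `NormalDist.inv_cdf` from Python's standard `statistics` module; the function name `z_critical` is hypothetical.

```python
from statistics import NormalDist

def z_critical(confidence_level: float) -> float:
    """Positive cut-off such that the central area under the standard normal
    curve between -z and +z equals confidence_level."""
    area_to_the_left = (1 + confidence_level) / 2
    return NormalDist().inv_cdf(area_to_the_left)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_critical(level), 3))
```

Higher confidence levels give larger critical values, which in turn give wider confidence intervals.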
z-score
a measure of a value’s distance from the mean in units of standard deviation; also called standardized score.