Errors in Sampling

Expected Value and Standard Error

Expected value and standard error can provide useful information about the data recorded in an experiment.

Learning Objectives

Solve for the standard error of a sum and the expected value of a random variable

Key Takeaways

Key Points

  • The expected value (or expectation, mathematical expectation, EV, mean, or first moment) of a random variable is the weighted average of all possible values that this random variable can take on.
  • The expected value may be intuitively understood by the law of large numbers: the expected value, when it exists, is almost surely the limit of the sample mean as sample size grows to infinity.
  • The standard error is the standard deviation of the sampling distribution of a statistic.
  • The standard error of the sum represents how much one can expect the actual value of a repeated experiment to vary from the expected value of that experiment.

Key Terms

  • discrete random variable: obtained by counting values for which there are no in-between values, such as the integers 0, 1, 2, ….
  • standard deviation: a measure of how spread out data values are around the mean, defined as the square root of the variance
  • continuous random variable: obtained from data that can take infinitely many values

Expected Value

In probability theory, the expected value (or expectation, mathematical expectation, EV, mean, or first moment) of a random variable is the weighted average of all possible values that this random variable can take on. The weights used in computing this average are probabilities in the case of a discrete random variable, or values of a probability density function in the case of a continuous random variable.

The expected value may be intuitively understood by the law of large numbers: the expected value, when it exists, is almost surely the limit of the sample mean as sample size grows to infinity. More informally, it can be interpreted as the long-run average of the results of many independent repetitions of an experiment (e.g. a dice roll). The value may not be expected in the ordinary sense—the “expected value” itself may be unlikely or even impossible (such as having 2.5 children), as is also the case with the sample mean.

The expected value of a random variable can be calculated by summing together all the possible values with their weights (probabilities):

[latex]\text{E}\left [ \text{X} \right ]= \text{x}_{1}\text{p}_{1}+\text{x}_{2}\text{p}_{2}+\dots+\text{x}_{\text{k}}\text{p}_{\text{k}}[/latex]

where [latex]\text{x}[/latex] represents a possible value and [latex]\text{p}[/latex] represents the probability of that possible value.
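As a quick check, this weighted-average formula can be computed directly. The sketch below (a hypothetical fair six-sided die, not an example from the text) simply sums each possible value times its probability:

```python
# Expected value of a discrete random variable as a weighted sum.
# Hypothetical example: a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# E[X] = x1*p1 + x2*p2 + ... + xk*pk
expected_value = sum(x * p for x, p in zip(values, probs))
print(round(expected_value, 10))  # 3.5
```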

Standard Error

The standard error is the standard deviation of the sampling distribution of a statistic. For example, the sample mean is the usual estimator of a population mean. However, different samples drawn from that same population would in general have different values of the sample mean. The standard error of the mean (i.e., of using the sample mean as a method of estimating the population mean) is the standard deviation of those sample means over all possible samples of a given size drawn from the population.


Standard Deviation: This is a normal distribution curve that illustrates standard deviations. The likelihood of being further away from the mean diminishes quickly on both ends.

Expected Value and Standard Error of a Sum

Suppose there are five numbers in a box: 1, 1, 2, 3, and 4. If we were to select one number from the box, the expected value would be:

[latex]\displaystyle \text{E}\left [ \text{X} \right ]= 1\cdot \frac{1}{5}+1\cdot \frac{1}{5}+2\cdot \frac{1}{5}+3\cdot \frac{1}{5}+4\cdot \frac{1}{5}=2.2[/latex]

Now, let’s say we draw a number from the box 25 times (with replacement). The new expected value of the sum of the numbers can be calculated as the number of draws multiplied by the expected value of the box: [latex]25\cdot 2.2 = 55[/latex]. The standard error of the sum can be calculated as the square root of the number of draws multiplied by the standard deviation of the box: [latex]\sqrt{25} \cdot \text{SD of box} = 5\cdot 1.17 \approx 5.8[/latex]. This means that if this experiment were repeated many times, we could expect the sum of the 25 numbers chosen to be within about 5.8 of the expected value of 55, either higher or lower.
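The box calculation above can be verified in a few lines. A minimal Python sketch, using the same five numbers, computes the mean and standard deviation of the box and then the expected value and standard error of the sum of 25 draws:

```python
import math

# The five numbers in the box from the text.
box = [1, 1, 2, 3, 4]
n_draws = 25

mean = sum(box) / len(box)                               # 2.2
variance = sum((x - mean) ** 2 for x in box) / len(box)  # 1.36
sd = math.sqrt(variance)                                 # ~1.17

expected_sum = n_draws * mean          # 25 * 2.2 = 55
se_sum = math.sqrt(n_draws) * sd       # 5 * 1.17 ~ 5.83
print(round(expected_sum, 2), round(se_sum, 2))  # 55.0 5.83
```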

Using the Normal Curve

The normal curve is used to find the probability that a value falls within a certain standard deviation away from the mean.

Learning Objectives

Calculate the probability that a variable is within a certain range by finding its z-value and using the Normal curve

Key Takeaways

Key Points

  • In order to use the normal curve to find probabilities, the observed value must first be standardized using the following formula: [latex]\text{z}=\frac{\text{x}-\mu }{\sigma }[/latex].
  • To calculate the probability that a variable is within a range, we have to find the area under the curve. Luckily, we have tables to make this process fairly easy.
  • When reading the table, we must note that the leftmost column tells you how many sigmas above the mean the value is to one decimal place (the tenths place), the top row gives the second decimal place (the hundredths), and the intersection of a row and column gives the probability.
  • It is important to remember that the table only gives the probabilities to the left of the [latex]\text{z}[/latex]-value and that the normal curve is symmetrical.
  • In a normal distribution, approximately 68% of values fall within one standard deviation of the mean, approximately 95% of values fall within two standard deviations of the mean, and approximately 99.7% of values fall within three standard deviations of the mean.

Key Terms

  • z-value: the standardized value of an observation found by subtracting the mean from the observed value, and then dividing that value by the standard deviation; also called [latex]\text{z}[/latex]-score
  • standard deviation: a measure of how spread out data values are around the mean, defined as the square root of the variance

[latex]\text{z}[/latex]-Value

The functional form of a normal distribution is a bit complicated. It can also be difficult to compare two variables whose means and/or standard deviations differ (for example, heights in centimeters and weights in kilograms), even if both variables can be described by a normal distribution. To get around both of these problems, we can define a new variable:

[latex]\displaystyle \text{z}=\frac{\text{x}-\mu }{\sigma }[/latex]


Standard Normal Table: This table can be used to find the cumulative probability up to the standardized normal value [latex]\text{z}[/latex].

This variable gives a measure of how far the variable is from the mean ([latex]\text{x}-\mu[/latex]), then “normalizes” it by dividing by the standard deviation ([latex]\sigma[/latex]). This new variable gives us a way of comparing different variables. The [latex]\text{z}[/latex]-value tells us how many standard deviations, or “how many sigmas”, the variable is from its respective mean.
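The standardization step is easy to sketch in code. The numbers below are hypothetical (a 180 cm height in a population with mean 170 cm and standard deviation 10 cm), and the helper name `z_value` is illustrative:

```python
# Standardizing an observation: z = (x - mu) / sigma.
def z_value(x, mu, sigma):
    """How many standard deviations x lies above (or below) the mean."""
    return (x - mu) / sigma

# Hypothetical example: height of 180 cm, population mean 170 cm, SD 10 cm.
print(z_value(180, 170, 10))  # 1.0 -> one sigma above the mean
print(z_value(165, 170, 10))  # -0.5 -> half a sigma below the mean
```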

Areas Under the Curve

To calculate the probability that a variable is within a range, we have to find the area under the curve. Normally, this would mean we’d need to use calculus. However, statisticians have figured out an easier method, using tables, that can typically be found in your textbook or even on your calculator.

These tables can be a bit intimidating, but you simply need to know how to read them. The leftmost column tells you how many sigmas above the mean to one decimal place (the tenths place). The top row gives the second decimal place (the hundredths). The intersection of a row and column gives the probability.

For example, if we want to know the probability that a variable is no more than 0.51 sigmas above the mean, [latex]\text{P}(\text{z}<0.51)[/latex], we look at the 6th row down (corresponding to 0.5) and the 2nd column (corresponding to 0.01). The intersection of the 6th row and 2nd column is 0.6950, which tells us that there is a 69.50% chance that a variable is less than 0.51 sigmas (or standard deviations) above the mean.

A common mistake is to look up a [latex]\text{z}[/latex]-value in the table and simply report the corresponding entry, regardless of whether the problem asks for the area to the left or to the right of the [latex]\text{z}[/latex]-value. The table only gives the probabilities to the left of the [latex]\text{z}[/latex]-value. Since the total area under the curve is 1, all we need to do is subtract the value found in the table from 1. For example, if we wanted to find out the probability that a variable is more than 0.51 sigmas above the mean, [latex]\text{P}(\text{z}>0.51)[/latex], we just need to calculate [latex]1-\text{P}(\text{z}<0.51) = 1-0.6950 = 0.3050[/latex], or 30.5%.

There is another note of caution to take into consideration when using the table: The table provided only gives values for positive [latex]\text{z}[/latex]-values, which correspond to values above the mean. What if we wished instead to find out the probability that a value falls below a [latex]\text{z}[/latex]-value of [latex]-0.51[/latex], or 0.51 standard deviations below the mean? We must remember that the standard normal curve is symmetrical, meaning that [latex]\text{P}(\text{z}<-0.51) = \text{P}(\text{z}>0.51)[/latex], which we calculated above to be 30.5%.
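The table lookups above can also be reproduced without a printed table, since the standard normal cumulative probability can be computed from the error function, [latex]\Phi(\text{z})=\tfrac{1}{2}\left(1+\operatorname{erf}\left(\text{z}/\sqrt{2}\right)\right)[/latex]. A Python sketch of the three cases discussed (left tail, right tail, and symmetry); the helper name `phi` is illustrative:

```python
import math

def phi(z):
    """Standard normal CDF: the area to the left of z (what the table gives)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

left = phi(0.51)        # area to the left of z = 0.51, ~0.6950
right = 1 - phi(0.51)   # area to the right, ~0.3050
symmetric = phi(-0.51)  # by symmetry, equals the right-tail area
print(round(left, 4), round(right, 4), round(symmetric, 4))
```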


Symmetrical Normal Curve: This image shows the symmetry of the normal curve. In this case, [latex]\text{P}(\text{z}<-2.01)=\text{P}(\text{z}>2.01)[/latex].

We may even wish to find the probability that a variable is between two [latex]\text{z}[/latex]-values, such as between 0.50 and 1.50, or [latex]\text{P}(0.50<\text{z}<1.50)[/latex]. To do so, we subtract the two left-tail areas from the table: [latex]\text{P}(0.50<\text{z}<1.50)=\text{P}(\text{z}<1.50)-\text{P}(\text{z}<0.50)=0.9332-0.6915=0.2417[/latex], or about 24.2%.

68-95-99.7 Rule

Although we can always use the [latex]\text{z}[/latex]-score table to find probabilities, the 68-95-99.7 rule helps for quick calculations. In a normal distribution, approximately 68% of values fall within one standard deviation of the mean, approximately 95% of values fall within two standard deviations of the mean, and approximately 99.7% of values fall within three standard deviations of the mean.
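The rule can be checked numerically using the standard normal cumulative probability, computed here from the error function; a Python sketch (the helper name `phi` is illustrative):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Probability of falling within k standard deviations of the mean.
within = {k: phi(k) - phi(-k) for k in (1, 2, 3)}
for k, p in within.items():
    print(k, round(p, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```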


68-95-99.7 Rule: Dark blue is less than one standard deviation away from the mean. For the normal distribution, this accounts for about 68% of the set, while two standard deviations from the mean (medium and dark blue) account for about 95%, and three standard deviations (light, medium, and dark blue) account for about 99.7%.

The Correction Factor

The expected value is a weighted average of all possible values in a data set.

Learning Objectives

Recognize when the correction factor should be utilized when sampling

Key Takeaways

Key Points

  • The expected value refers, intuitively, to the value of a random variable one would “expect” to find if one could repeat the random variable process an infinite number of times and take the average of the values obtained.
  • The intuitive explanation of the expected value above is a consequence of the law of large numbers: the expected value, when it exists, is almost surely the limit of the sample mean as the sample size grows to infinity.
  • From a rigorous theoretical standpoint, the expected value of a continuous variable is the integral of the random variable with respect to its probability measure.

Key Terms

  • random variable: a quantity whose value is random and to which a probability distribution is assigned, such as the possible outcome of a roll of a die
  • weighted average: an arithmetic mean of values biased according to agreed weightings
  • integral: the limit of the sums computed in a process in which the domain of a function is divided into small subsets and a possibly nominal value of the function on each subset is multiplied by the measure of that subset, all these products then being summed

In probability theory, the expected value refers, intuitively, to the value of a random variable one would “expect” to find if one could repeat the random variable process an infinite number of times and take the average of the values obtained. More formally, the expected value is a weighted average of all possible values. In other words, each possible value the random variable can assume is multiplied by its assigned weight, and the resulting products are then added together to find the expected value.

The weights used in computing this average are the probabilities in the case of a discrete random variable (that is, a random variable that can take on only a finite or countably infinite number of values, such as a roll of a pair of dice), or the values of a probability density function in the case of a continuous random variable (that is, a random variable that can assume a continuum of values, such as the height of a person).

From a rigorous theoretical standpoint, the expected value of a continuous variable is the integral of the random variable with respect to its probability measure. Since probability can never be negative (although it can be zero), one can intuitively understand this as the area under the curve of the graph of the values of a random variable multiplied by the probability of that value. Thus, for a continuous random variable the expected value is the limit of the weighted sum, i.e. the integral.
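In the notation used here, with [latex]\text{f}(\text{x})[/latex] denoting the probability density function, this integral is:

[latex]\displaystyle \text{E}\left [ \text{X} \right ]=\int_{-\infty }^{\infty }\text{x}\,\text{f}(\text{x})\,\text{dx}[/latex]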

Simple Example

Suppose we have a random variable X, which represents the number of girls in a family of three children. Without too much effort, you can compute the following probabilities:

[latex]\begin{matrix} \text{P}[\text{X}=0]=0.125 \\ \text{P}[\text{X}=1]=0.375 \\ \text{P}[\text{X}=2]=0.375 \\ \text{P}[\text{X}=3]=0.125 \end{matrix}[/latex]

The expected value of X, E[X], is computed as:

[latex]\displaystyle{ \text{E}[\text{X}] =\sum_{\text{x}=0}^{3}\text{xP}[\text{X}=\text{x}]}[/latex]

[latex]\displaystyle{ =0 \cdot 0.125 + 1 \cdot 0.375 + 2 \cdot 0.375 + 3 \cdot 0.125}[/latex]

[latex]\displaystyle{= 1.5}[/latex]

This calculation can be easily generalized to more complicated situations. Suppose that a rich uncle plans to give you a bonus of $1,000, plus an extra $500 for each girl in your family. The formula for the bonus is:

[latex]\text{Y} = 1,000 + 500\text{X}[/latex]

What is your expected bonus?

[latex]\displaystyle{\text{E}[1000+500\text{X}] = \sum_{\text{x}=0}^{3}(1000+500\text{x})\text{P}[\text{X}=\text{x}]}[/latex]

[latex]\displaystyle{=1000 \cdot 0.125 + 1500 \cdot 0.375 + 2000 \cdot 0.375 + 2500 \cdot 0.125}[/latex]

[latex]\displaystyle{=1750}[/latex]

We could have calculated the same value by taking the expected number of children and plugging it into the equation:

[latex]\text{E}[1,000 + 500\text{X}] = 1,000 + 500\text{E}[\text{X}][/latex]
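Both computations can be sketched in a few lines of Python, confirming that the direct weighted sum and the linearity shortcut agree:

```python
# X = number of girls among three children; probabilities from the text.
pmf = {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}

# E[X] as a weighted sum.
e_x = sum(x * p for x, p in pmf.items())

# E[1000 + 500X], computed directly and via linearity of expectation.
e_y_direct = sum((1000 + 500 * x) * p for x, p in pmf.items())
e_y_linear = 1000 + 500 * e_x

print(e_x, e_y_direct, e_y_linear)  # 1.5 1750.0 1750.0
```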

Expected Value and the Law of Large Numbers

The intuitive explanation of the expected value above is a consequence of the law of large numbers: the expected value, when it exists, is almost surely the limit of the sample mean as the sample size grows to infinity. More informally, it can be interpreted as the long-run average of the results of many independent repetitions of an experiment (e.g. a dice roll). The value may not be expected in the ordinary sense—the “expected value” itself may be unlikely or even impossible (such as having 2.5 children), as is also the case with the sample mean.

Uses and Applications

To empirically estimate the expected value of a random variable, one repeatedly measures observations of the variable and computes the arithmetic mean of the results. If the expected value exists, this procedure estimates the true expected value in an unbiased manner and has the property of minimizing the sum of the squares of the residuals (the sum of the squared differences between the observations and the estimate). The law of large numbers demonstrates (under fairly mild conditions) that, as the size of the sample gets larger, the variance of this estimate gets smaller.
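A small simulation illustrates this estimation procedure. The Python sketch below uses a hypothetical fair-die example (true expected value 3.5) and computes the sample mean of increasingly large samples; the helper name `sample_mean` is illustrative:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def sample_mean(n):
    """Arithmetic mean of n simulated fair-die rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# As the sample grows, the estimates settle toward the true EV of 3.5.
for n in (100, 10_000, 1_000_000):
    print(n, sample_mean(n))
```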

This property is often exploited in a wide variety of applications, including general problems of statistical estimation and machine learning, to estimate (probabilistic) quantities of interest via Monte Carlo methods.

The expected value plays important roles in a variety of contexts. In regression analysis, one desires a formula in terms of observed data that will give a “good” estimate of the parameter giving the effect of some explanatory variable upon a dependent variable. The formula will give different estimates using different samples of data, so the estimate it gives is itself a random variable. A formula is typically considered good in this context if it is an unbiased estimator—that is, if the expected value of the estimate (the average value it would give over an arbitrarily large number of separate samples) can be shown to equal the true value of the desired parameter.

In decision theory, and in particular in choice under uncertainty, an agent is described as making an optimal choice in the context of incomplete information. For risk neutral agents, the choice involves using the expected values of uncertain quantities, while for risk averse agents it involves maximizing the expected value of some objective function such as a von Neumann-Morgenstern utility function.

A Closer Look at the Gallup Poll

The Gallup Poll is an opinion poll that uses probability samples to try to accurately represent the attitudes and beliefs of a population.

Learning Objectives

Examine the errors that can still arise in the probability samples chosen by Gallup

Key Takeaways

Key Points

  • The Gallup Poll has transitioned over the years from polling people in their residences to using phone calls. Today, both landlines and cell phones are called, and are selected randomly using a technique called random digit dialing.
  • Opinion polls like Gallup face problems such as nonresponse bias, response bias, undercoverage, and poor wording of questions.
  • Contrary to popular belief, sample sizes as small as 1,000 can accurately represent the views of the general population within 4 percentage points, if chosen properly.
  • To make sure that the sample is representative of the whole population, each respondent is assigned a weight so that demographic characteristics of the weighted sample match those of the entire population. Gallup weights by gender, race, age, education, and region.

Key Terms

  • nonresponse: the absence of a response
  • undercoverage: occurs when a survey fails to reach a certain portion of the population
  • probability sample: a sample in which every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined

Overview of the Gallup Poll

The Gallup Poll is the division of Gallup, Inc. that regularly conducts public opinion polls in more than 140 countries around the world. Historically, the Gallup Poll has measured and tracked the public’s attitudes concerning virtually every political, social, and economic issue of the day, including highly sensitive or controversial subjects. It is very well known when it comes to presidential election polls and is often referenced in the mass media as a reliable and objective audience measurement of public opinion. Its results, analyses, and videos are published daily on Gallup.com in the form of data-driven news. The poll has been around since 1935.

How Does Gallup Choose its Samples?

The Gallup Poll is an opinion poll that uses probability sampling. In a probability sample, each individual has an equal opportunity of being selected. This helps generate a sample that can represent the attitudes, opinions, and behaviors of the entire population.

In the United States, from 1935 to the mid-1980s, Gallup typically selected its sample by selecting residences from all geographic locations. Interviewers would go to the selected houses and ask whatever questions were included in that poll, such as who the interviewee was planning to vote for in an upcoming election.


Voter Polling Questionnaire: This questionnaire asks voters about their gender, income, religion, age, and political beliefs.

There were a number of problems associated with this method. First of all, it was expensive and inefficient. Over time, Gallup realized that it needed to come up with a more effective way to collect data rapidly. In addition, there was the problem of nonresponse. Certain people did not wish to answer the door to a stranger, or simply declined to answer the questions the interviewer asked.

In 1986, Gallup shifted most of its polling to the telephone. This provided a much quicker way to poll many people. In addition, it was less expensive because interviewers no longer had to travel all over the nation to go to someone’s house. They simply had to make phone calls. To make sure that every person had an equal opportunity of being selected, Gallup used a technique called random digit dialing. A computer would randomly generate phone numbers found from telephone exchanges for the sample. This method prevented problems such as undercoverage, which could occur if Gallup had chosen to select numbers from a phone book (since not all numbers are listed). When a house was called, the person over eighteen with the most recent birthday would be the one to respond to the questions.

A major problem with this method arose in the mid-to-late 2000s, when the use of cell phones spiked. More and more people in the United States were switching to using only cell phones rather than landline telephones. Now, Gallup polls people using a mix of landlines and cell phones. Some people claim that the ratio it uses is incorrect, which could result in a larger margin of error.

Sample Size and Error

A lot of people incorrectly assume that in order for a poll to be accurate, the sample size must be huge. In actuality, a small sample that is chosen well can accurately represent the entire population, with, of course, a margin of error. Gallup typically uses a sample size of 1,000 people for its polls, which results in a margin of error of about ±4 percentage points. To make sure that the sample is representative of the whole population, each respondent is assigned a weight so that demographic characteristics of the weighted sample match those of the entire population (based on information from the US Census Bureau). Gallup weights by gender, race, age, education, and region.
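The roughly 4-point figure can be compared with the textbook large-sample formula for the 95% margin of error of a proportion, [latex]\text{z}\sqrt{\text{p}(1-\text{p})/\text{n}}[/latex], using the conservative choice p = 0.5. Note that this simple simple-random-sample formula is an illustrative assumption, not Gallup's exact methodology; design effects from weighting push the reported margin somewhat higher:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a
    simple random sample of size n (conservative default p = 0.5)."""
    return z * math.sqrt(p * (1 - p) / n)

# For n = 1,000 the simple formula gives about 3.1 percentage points;
# Gallup's reported ~4 points also accounts for weighting/design effects.
print(round(margin_of_error(1000) * 100, 1))  # 3.1
```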

Potential for Inaccuracy

Despite all the work done to make sure a poll is accurate, there is room for error. Gallup still has to deal with the effects of nonresponse bias, because people may not answer their cell phones. Because of this selection bias, the characteristics of those who agree to be interviewed may be markedly different from those who decline. Response bias may also be a problem, which occurs when the answers given by respondents do not reflect their true beliefs. In addition, it is well established that the wording of the questions, the order in which they are asked, and the number and form of alternative answers offered can influence results of polls. Finally, there is still the problem of coverage bias. Although most people in the United States either own a home phone or a cell phone, some people do not (such as the homeless). These people can still vote, but their opinions would not be taken into account in the polls.