O.02: Section 1 Part 1

Section 1: Compound models—adding two model formulas to model combined effects

Compound-model application 1: Modeling mixed populations

In an earlier topic we discussed normal-distribution models, whose graph is a “bell-shaped” curve with a single peak. This model is appropriate in many situations where some characteristic of a group of generally-similar objects is distributed around an average value. But when two groups with different averages are mixed, the resulting distribution may have two partially-overlapping peaks, as illustrated below by a dataset on the height of a combined population of male and female students. A single normal distribution will not do a good job of modeling this data.

Example 1:

Height data from a co-educational population

            The data that produced the above graph is shown in a table to the right.   A plausible model for data of this kind is the sum of two normal distributions. This will entail six parameters: total, average, and width for each normal. The formula to be placed into C3 will thus be “=$G$3*NORMDIST(A3,$G$4,$G$5,FALSE)+
$G$6*NORMDIST(A3,$G$7,$G$8,FALSE)”;   as always, it should be spread down column C beside all the data rows.As usual, we will set the initial values for the parameters to reasonable approximations before using Solver. In this case, we can do that by setting each “total” parameter to 5,000 (half the overall total), setting the “average” parameters to the approximately the x positions of the two peaks (64 and 70 are close enough), and setting both “width” parameters to a value, such as 2 or 3, that gives a model that is reasonably similar to the data. Then use Solver to minimize the sum of squared deviations, answering “G3:G8” in the “By changing cells” entry field so that all six parameters will be used. As shown below, a good fit is found.

Parameters (two-normal sum)
5529.128 Total 1
63.72487 Average 1
2.685645 Width 1
4569.312 Total 2
70.38119 Average 2
2.728443 Width 2

 

Goodness of fit of this model
1710.994 Sum of squared deviations
7.429222 Standard deviation

This best-fit model does more than give a formula for describing the height distribution of the combined male-female population. Because it fits the data so well, we can confidently deduce the characteristics of the male and female distributions from the parameters for the two normal distributions, even though we do not have any data that identifies gender, and even though both males and females contribute to each height total. Clearly the first normal describes the height distribution of females and the second normal describes the height distribution of males. Thus we can use these results to see that 45.7% is the answer to the question “How much of this data came from males?”, and that 63.7 inches is the answer to the question “What was the average height of the female part of this sample?”.

Height Students
48 0
49 0
50 0
51 0
52 0
53 0
54 0
55 2
56 12
57 29
58 79
59 180
60 326
61 481
62 673
63 813
64 864
65 825
66 751
67 717
68 677
69 700
70 735
71 658
72 568
73 424
74 280
75 167
76 75
77 25
78 13
79 7
80 0
81 1
82 1
83 0
84 0

If we individually graph the two normal distributions found here along with their sum we can see the components that combined to form the data distribution:

This same sum-of-two-normal-distributions approach can be used in a variety of situations. Sometimes the component distributions have about the same average but greatly different widths, so that the resultant distribution has “fat tails”. Other times the distributions have substantially different totals but nearby averages, so that the result looks like the larger distribution with a bump added on one side. It is often the case that an investigator is interested in only one of the component distributions, with the other values being ignored after the fitting process. An example of each of these kinds of two-normal situations are given below, with the components shown (thin lines) as well as the sum (dots) that reflects the data that would be observed.

 
Examples of the sum of two differing-parameter normal-distribution functions