4. Test Interpretation

       Definition: Placing measurement data in a context, or making sense of test scores.

       Description. Interpretation of test scores depends upon all the steps that came before it. That is, the test construction process must have produced a valid test if the interpretation is to be valid; the test must have been administered and scored with a minimum of error during those processes. Since tests are never perfectly valid, interpretation should include statements about the limits of the test as influenced by demonstrated and likely sources of error. Without such statements of limitations, you may misinterpret the scores of the measurement methods you employ.

Test interpretation, particularly in educational settings, traditionally has focused on norms. In norm-referenced tests a test score is interpreted by comparing it to a group of scores. I can say, for example, that a 3rd grade student’s score on an achievement test places her or him at the 90th percentile of performance. Norm-referenced interpretations are typically contrasted with criterion-referenced test interpretations (i.e., comparison to a standard instead of to other persons). That same 3rd grade student may have correctly answered 35 of 40 test items that assessed previously taught material; the teacher may have set a criterion of 30 correct answers for students to pass the course.

Other types of interpretations are also useful. With formative tests, interpretation focuses on an individual’s performance on the components of a course. In a mathematics class, for example, a formative test might provide information about the particular types of addition or subtraction problems a particular student answered correctly and incorrectly. During a course, formative tests provide feedback to the teacher and student that reveal progress and guide adjustment of the teaching. In education, Cross and Angelo (1988) described this process as a loop “from teaching technique to feedback on student learning to revision of the technique” (p. 2).

Summative tests provide an overall evaluation of an individual’s performance in a course (e.g., a course grade). Summative tests provide data convenient for administrative decision-making. Summative tests can suggest initial hypotheses relevant to teaching: a standardized achievement test, for example, can describe a student’s strengths and weaknesses (compared to other students) across subject areas. This information might be relevant to inclusion in or exclusion from an educational program (e.g., a remedial course or repeating a grade). More sensitive measures will be needed to develop and test those hypotheses, however, and it is here that formative tests can be useful (Bloom, Hastings, & Madaus, 1971; Cross & Angelo, 1988). The interpretation of summative tests focuses on an aggregate score (of items and components), while administrators of formative tests tend to examine item response patterns (Bloom et al., 1971).

Much more attention has been paid in the literature to how the test administrator or researcher interprets test scores than to how test-takers make sense of them. One exception is research on the Barnum effect. The Barnum effect occurs when individuals take a test and receive test interpretations based not on their test data, but on generic statements that might apply to anyone, such as the statements that appear in horoscopes (“Work hard today and your efforts will pay off”). Test-takers usually find such bogus feedback as accurate as actual test interpretations; gauging the validity of a particular test interpretation thus depends upon comparing it with other types of test interpretation. Guastello and Rieke (1990) compared the rated accuracy of real computer-based test interpretations (CBTIs) based on 16PF scores (a personality inventory) with that of bogus reports. A sample of 54 college students rated the real reports as 76% accurate and the bogus reports as 71% accurate. Computer-based reports are likely to increase the Barnum effect because many people ascribe increased credibility to computer operations.

4.1 Norms

       Definition: Data about a distribution of scores for a particular test.

       Description. As described previously, in norm-referenced interpretations the purpose of testing is to compare scores among individuals. Thus, the test is intended to detect individual differences on a construct of interest. Gronlund (1988) indicated that developers of norm-referenced tests seek items with the greatest possible variability. With achievement tests, such items are pursued through a selection process that retains items of average difficulty; easy and difficult items that everyone passes or fails are likely to be discarded. Aggregation of such items increases the possibility of making valid distinctions among individuals.

Norm-referenced testing has been the predominant approach in selection testing (Murphy & Davidshofer, 1994). Besides their lower cost, norm-referenced tests also seem more applicable when the test administrator desires to select some portion of a group (e.g., the top 10% of applicants) rather than all applicants who could successfully perform a function. Thus, norm-referenced tests are useful in selection situations where individuals are chosen partially on the basis of scarce resources. Suppose you conduct a research study and find that 95% of all graduate students who score 600 or above on the GRE Verbal scale are able to pass all required graduate school courses. From the perspective of criterion-referenced testing, everyone scoring 600 or above should be admitted. In many graduate departments, however, that would mean admitting more students than available courses, instructors, or financial support could accommodate. Such a situation certainly occurs in other educational, occupational, and clinical settings with fixed quotas. Norm-referenced testing, then, provides a solution: admit the top scorers, up to the number that the available resources allow.

If a test is intended to function as a selection device, its items should be developed on a sample representative of the population for whom the test is intended. Thus, the selection of a norm group for test development has potentially serious consequences for the interpretation of future scores compared to that group. Much controversy has occurred over the widespread use of intelligence tests or vocational interest inventories, for example, that were developed and normed on predominantly white, middle class individuals.

4.2 Measurement-related statistics

       Definition: Statistics employed to facilitate the interpretation of test scores.

       Description. Making sense of test scores often depends at least partially on understanding a number of statistical indices normally computed with tests. For example, test developers usually examine (and present information about) the frequency distribution of all test scores to determine if it is normally distributed. Similarly, developers may present information about the range and standard deviation of scores to examine whether sufficient individual differences exist. Below I describe statistics commonly used during the test interpretation process.

A mean or average is a measure of central tendency; that is, in a group of scores, where is the middle or most representative value? The mean is found by summing the scores in a group and dividing by the number of scores. Other measures of central tendency are the median and the mode. These measures provide a typical score that characterizes the performance of the entire sample. The mean, along with the other measures of central tendency, is particularly useful for comparing different groups (such as children of different ages) who take the same test, as well as for describing individuals in relation to a group’s set of scores (where does one individual score on a course quiz in relation to the whole class?).

Besides knowing the central tendency in a group of scores, it is often useful to know how dispersed the scores are. One such index of dispersion, the standard deviation, refers to the average deviation of scores from the mean. The larger the standard deviation, the more widely spread the distribution of scores.
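To make these indices concrete, here is a minimal sketch in Python showing how the mean, median, mode, and standard deviation of a group of scores can be computed with the standard library; the quiz scores are invented for illustration.

```python
import statistics

# Hypothetical quiz scores for a class of ten students (illustrative only)
scores = [72, 85, 85, 90, 78, 88, 95, 70, 85, 92]

mean = statistics.mean(scores)      # sum of scores divided by the number of scores
median = statistics.median(scores)  # middle value of the ordered scores
mode = statistics.mode(scores)      # most frequently occurring score
sd = statistics.stdev(scores)       # index of how widely scores spread around the mean

print(f"mean={mean:.2f}, median={median}, mode={mode}, sd={sd:.2f}")
```

Note that statistics.stdev computes the sample standard deviation; statistics.pstdev would treat the scores as an entire population.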

A correlation refers to the extent to which two variables covary. A correlation coefficient expresses the degree of relationship between two sets of scores. For example, if the highest scoring individual on Test 1 has also obtained the top score on Test 2, and the second-best individual on Test 1 is also second-best on Test 2, and so on down to the lowest scoring individual on each test, a perfect positive correlation would exist (+1.00). If there is a complete reversal of scores, so that the highest scoring individual on variable 1 obtains the lowest score on variable 2 and so forth, there would be a perfect negative correlation (-1.00). A zero correlation indicates the absence of a relationship between two variables, such as might occur by chance. Thus, correlation coefficients fall within the range of -1.00 to +1.00.
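As a rough illustration of how such a coefficient is computed, the sketch below implements the Pearson correlation for two lists of scores; the function name and the score lists are mine, invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists (-1.00 to +1.00)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

test1 = [10, 20, 30, 40, 50]
print(pearson_r(test1, [1, 2, 3, 4, 5]))  # identical ordering: +1.0
print(pearson_r(test1, [5, 4, 3, 2, 1]))  # complete reversal: -1.0
```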

The data that form the basis of a correlation coefficient can also be graphed. The graph below shows the relation between the number of quiz questions students answered incorrectly in relation to the order in which they turned in their quiz:

Figure 12

Scatterplot Between Number of Incorrect Responses and Order of Quiz Completion

As the scatterplot shows (Figure 12), students who completed the quiz sooner generally had fewer incorrect answers. However, the relationship is not perfect; for example, the second student to turn in a quiz had 3 incorrect answers. The correlation computed for these data is .51; the number of incorrect responses has a mean of 1.66 and a standard deviation of 2.29. Although the reason to present these (actual) data is to explain the idea of correlation, do you have any substantive idea about why this relation should exist? In other words, how would you explain why students who finished a quiz faster generally had better grades?

A standard score or z score is a transformation of a raw score to show how many standard deviations from the mean that score lies. The formula is:

z = (Raw score – Mean) / Standard deviation

Thus z equals the person’s raw score minus the mean of the group of scores, divided by the standard deviation of the group of scores. Frequently the best information that a test score can give us is the degree to which a person scores in the high or low portion of the distribution of scores. The z score is a quick summary of the person’s standing: positive z scores indicate that the person was above the mean, while negative scores indicate the person scored below the mean.
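A minimal sketch of the computation (the function name and the example values are mine for illustration):

```python
def z_score(raw, mean, sd):
    """Number of standard deviations a raw score lies above (+) or below (-) the mean."""
    return (raw - mean) / sd

print(z_score(115, 100, 15))  # 1.0: one standard deviation above the mean
print(z_score(85, 100, 15))   # -1.0: one standard deviation below the mean
```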

Other types of standard scores have also been developed, including stanines, deviation IQs, sten scores, and T-scores. T-scores, for example, allow us to translate scores on a test to a distribution of scores of our choice. T-scores use arbitrarily fixed means and standard deviations and eliminate decimal points and signs. The formula is:

T = (SD * z) + M

where SD is the chosen standard deviation, M is the chosen mean, and z is the standard score for a person’s score on a test. For example, I might find it simpler to give feedback using a distribution of scores whose mean is 50 and whose standard deviation is 10. If a person had a score on a test whose z equaled -.5, the T-score would be:

(10 * -.5) + 50 = 45

Tests such as the Analysis of Learning Potential use a fixed mean of 50 and a standard deviation of 20, while the Scholastic Aptitude Test (SAT) and Graduate Record Exam (GRE) historically have employed 500 as the mean and 100 as the standard deviation. Again, the T-score provides a convenient translation of scores so that they might be more understandable during test interpretation.
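The T-score translation is easy to express in code; here is a sketch, assuming the z score has already been computed (the function name is mine):

```python
def t_score(z, mean=50.0, sd=10.0):
    """Rescale a z score onto a distribution with a chosen mean and standard deviation."""
    return sd * z + mean

print(t_score(-0.5))            # 45.0, matching the worked example above
print(t_score(-0.5, 500, 100))  # 450.0 on a GRE/SAT-style scale
```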

Acknowledging that error influences any particular testing occasion, the standard error of measurement (SEM) is the standard deviation that would be obtained for a series of measurements of the same individual if the individual did not change on the measured construct over that time period. For example, assume that I administer a test measuring a stable trait 10 times to a particular person. If that person received the same score for each test occasion, there would be no error of measurement. In reality, however, the test score would vary for each testing, and SEM is a statistic designed to summarize the amount of variation. If you have an estimate of a test’s reliability, SEM can be calculated as follows:

SEM = Standard deviation * SqRt (1 – r)

Thus, SEM equals the standard deviation of the group of scores times the square root of 1 minus the reliability estimate. SEMs help us know the extent to which an individual’s particular test score can be trusted as indicative of the person’s true score on the test.
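A sketch of the SEM computation (the function name is mine; the standard deviation and reliability values are illustrative):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability estimate)."""
    return sd * math.sqrt(1 - reliability)

print(round(sem(15, 0.90), 2))  # 4.74: high reliability, narrow error band
print(round(sem(15, 0.70), 2))  # 8.22: lower reliability, wider error band
```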

Finally, the standard error of estimate (SEE) helps us know how much trust to place in a test score’s prediction of a criterion of some sort. Just as no test produces the same score when administered repeatedly to a person, no single test score is associated with exactly one criterion score. Thus, the SEE refers to the spread of scores around a criterion, or more precisely, the standard deviation of criterion scores for individuals who all have the same score on the predictor test. The formula for SEE is:

SEE = Standard deviation * SqRt (1 – v^2)

SEE equals the standard deviation for the group of criterion scores times the square root of 1 minus the squared validity coefficient (v). The validity coefficient is simply the correlation between the predictor test and the criterion one is attempting to predict. For example, graduate schools frequently screen candidates on the basis of their GRE scores because GRE scores (the predictor test) have been shown to have a modest correlation with first-year GPA (the criterion). SEE helps us gain a sense of how large the variation around the criterion is likely to be given an individual’s particular test score.
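And a parallel sketch for SEE (again, the function name and values are mine for illustration):

```python
import math

def see(sd_criterion, validity):
    """Standard error of estimate: criterion SD * sqrt(1 - squared validity coefficient)."""
    return sd_criterion * math.sqrt(1 - validity ** 2)

print(round(see(100, 0.61)))  # 79: moderate validity narrows the spread
print(round(see(100, 0.30)))  # 95: low validity gains little over the criterion SD of 100
```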

Let’s walk through simple computations of the standard score, SEM, and SEE. Start with the z or standard score. Assume that the following represents a group of test scores. To compute a z score, I need the mean (which equals 87.95) and standard deviation (6.82) for this group of scores.

Figure 13

Sample Dataset of Test Scores

If your score on this test was 90, your z score would be:

(90 – 87.95) / 6.82 = .30

A z of .30 indicates you scored slightly above the mean in this group of scores.

On the other hand, if your score was 70, your z score would be:

(70 – 87.95) / 6.82 = -2.63

This z indicates your score was well below the mean.

SEM depends upon the standard deviation and the reliability of the particular test. If I have a test with a reliability estimate of .90 (high) and a standard deviation of 15, then SEM equals:

15 * SqRt (1-.9) = 4.7

Thus, 4.7 represents 1 standard deviation unit for the distribution of scores around the individual’s true score. However, if the test’s reliability estimate was .7, SEM increases:

15 * SqRt (1-.7) = 8.21

Thus, the lower the reliability of the test, the less confidence I have that an individual’s true score is close to the actual score obtained.

Finally, with SEE, I need the correlation between the test and criterion as well as the standard deviation for the group of criterion scores. If the correlation between test and criterion equaled .61, and the standard deviation for the criterion scores equaled 100, then SEE would be:

100 * SqRt (1-[.61*.61]) = 79

Thus, 79 represents 1 standard deviation unit around the criterion score. However, if the correlation between predictor and criterion dropped to .30, the SEE would increase:

100 * SqRt (1-[.30*.30]) = 95

Thus, the lower the correlation, the less confidence I have that the predicted criterion score is the true score the individual would actually obtain.
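If you would like to check these walkthrough computations yourself, the arithmetic can be reproduced directly; the values are taken from the examples above.

```python
import math

print(round((90 - 87.95) / 6.82, 2))          # 0.30  (z for a score of 90)
print(round((70 - 87.95) / 6.82, 2))          # -2.63 (z for a score of 70)
print(round(15 * math.sqrt(1 - 0.9), 1))      # 4.7   (SEM, reliability .90)
print(round(15 * math.sqrt(1 - 0.7), 2))      # 8.22  (SEM, reliability .70; shown truncated as 8.21 above)
print(round(100 * math.sqrt(1 - 0.61 ** 2)))  # 79    (SEE, validity .61)
print(round(100 * math.sqrt(1 - 0.30 ** 2)))  # 95    (SEE, validity .30)
```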

4.3 Criterion-referenced interpretations

       Definition: Interpreting a test score in relation to a criterion or pre-established level instead of other persons.

       Description. Suppose an individual received a score of 95% on a classroom test. What does that mean? In a norm-referenced interpretation, that score would be compared with those of the rest of the class and might indicate, for example, that the student scored higher than 94% of her or his classmates. A criterion-referenced statement would be “correctly completed 95 of 100 questions.” Criterion-referenced interpretations simply describe performance in relation to a standard other than other persons.

With criterion-referenced tests, items are retained during test development because of their relation to a criterion, regardless of the frequencies of correct or incorrect responses. However, criterion-referenced tests cost more than norm-referenced tests because they (a) require considerable effort in the analysis and definition of the performance criteria to be measured and (b) may necessitate special facilities and equipment beyond self-report materials. If one is interested in predicting performance on a criterion (the major purpose of selection testing), then criterion-referenced approaches would seem a logical choice. If one is interested in knowing whether a person can shoot a basketball, it usually makes more sense to give her or him 20 shots than a test of eye-hand coordination.

Swezey (1981) emphasized the importance of precisely specifying test objectives during item development for criterion-referenced tests. Criteria can be described in terms of variables such as product or process, quality, quantity, time to complete, number of errors, precision, and rate (Gronlund, 1988). A criterion may be a product, such as “student correctly completes 10 mathematics problems”; a process criterion would be “student completes division problems in the proper sequence.” Process measurement is useful when diagnostic information is required, when the product always follows from the process, and when product data are difficult to obtain.

Criterion-referenced tests should be reliable and valid to the extent that performances, testing conditions, and standards are precisely specified in relation to the criteria. Swezey (1981) preferred “within 5 minutes” to “under normal time conditions” as a precise testing standard. In some respects, the criterion-referenced approach represents a move away from a search for general laws and toward a specification of the meaning of test scores in terms of important measurement facets. Discussing test validity, Wiley (1991) presented a similar theme when he wrote that the labeling of a test ought to be “sufficiently precise to allow the separation of components of invalidity from valid variations in performance” (p. 86). Swezey’s and Wiley’s statements indicate the field’s increasing emphasis on construct explication.

Gentile and Murnyack (1989) described a set of criteria for grading students’ performance on art criticism assignments. They noted that art criticism is a complex analytic skill requiring students to evaluate and interpret their own and others’ artwork. Gentile and Murnyack suggested a 50-point rating system for evaluating students’ assignments:

1. Applies critical thinking criteria (0-10).

2. Employs technical vocabulary (0-10).

3. Provides feedback according to criteria (0-10).

4. Presents the criticism (0-10).

Gentile and Murnyack (1989) suggested a possible passing grade of 35 points. Students who scored lower would revise and resubmit their paper based on the instructor’s feedback on these criteria.