3. Test Scoring

       Definition: Method by which test data are assigned to produce a score or category.

       Description. Aggregating or summing individual test responses or items is the predominant method of scoring tests. Luzzo (1995) summed college students’ answers to the 50-item attitude scale of the Career Maturity Inventory (CMI) and found an average score of 36.84 for the 401 persons who completed it. This means that on average this group answered about 37 of the 50 items in a manner indicating a mature career attitude.
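
As a minimal sketch of sum scoring, the following Python fragment counts how many of a respondent's answers match a scoring key. The five-item key, the answers, and the function name `sum_score` are hypothetical illustrations, not actual CMI content.

```python
def sum_score(responses, key):
    """Return the number of responses that match the keyed answer."""
    if len(responses) != len(key):
        raise ValueError("responses and key must have the same length")
    return sum(1 for given, keyed in zip(responses, key) if given == keyed)


# Hypothetical five-item key and one respondent's answers (not actual CMI content).
key = ["T", "F", "T", "T", "F"]
responses = ["T", "F", "F", "T", "F"]

score = sum_score(responses, key)
print(score)  # 4 of the 5 answers match the key
```

Averaging such scores across all respondents would give a group mean comparable in kind to the 36.84 reported above.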

Items, tasks, and ratings can also be weighted (i.e., counted more or less in relation to other items) prior to aggregation. If you were creating a measure of aggression in children, for example, you might possess a theoretical reason for assigning more weight to physical acts of violence (e.g., hitting, kicking) than verbal acts (e.g., insults, threats).
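
A weighted aggregate for the aggression example might be sketched as follows. The weights (physical acts counted twice as heavily as verbal acts) and the observed counts are assumptions chosen only for illustration; an actual measure would need a theoretical or empirical basis for its weights.

```python
def weighted_score(counts, weights):
    """Multiply each observed count by its weight, then sum the products."""
    return sum(weights[act] * count for act, count in counts.items())


# Hypothetical weights: physical acts count twice as much as verbal acts.
weights = {"hitting": 2.0, "kicking": 2.0, "insults": 1.0, "threats": 1.0}

# Hypothetical observation record for one child (number of acts per category).
observed = {"hitting": 1, "kicking": 0, "insults": 3, "threats": 2}

unweighted = sum(observed.values())           # every act counts equally
weighted = weighted_score(observed, weights)  # physical acts count double
print(unweighted, weighted)
```

Here the single physical act raises the child's total from 6 to 7.0 once weights are applied.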

Some test items are not scored per se but employed as decision trees where answers direct the tester toward some final decision, typically about diagnosis. Versions of the Diagnostic and Statistical Manual have presented decision trees where diagnosticians can follow a set of branching questions that leads to a specific diagnosis. For example, the tree of differential diagnosis of Organic Brain Syndromes begins with the question “Disturbance of attention, memory and orientation developing over a short period of time and fluctuating over time?” A Yes answer leads to a possible diagnosis of Delirium, while No branches to the next question, and so forth through the set of possible related diagnoses.
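
The branching logic can be sketched as a chain of yes/no questions. In the Python fragment below, only the first question and its Delirium outcome come from the example above; the second node is a labeled placeholder, and the names `Node` and `follow` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    question: str
    if_yes: str                      # possible diagnosis suggested by a Yes answer
    if_no: Optional["Node"] = None   # next question to ask after a No answer


def follow(node, answers):
    """Walk the tree with a sequence of yes/no answers.

    Returns the first possible diagnosis reached, or None if the answers
    or the tree run out before any question is answered Yes.
    """
    for answer in answers:
        if answer:
            return node.if_yes
        if node.if_no is None:
            return None
        node = node.if_no
    return None


# The second node is a placeholder, not an actual DSM question or diagnosis.
placeholder = Node("(placeholder for the next question)", "(placeholder diagnosis)")
tree = Node(
    "Disturbance of attention, memory and orientation developing over a "
    "short period of time and fluctuating over time?",
    "Delirium",
    if_no=placeholder,
)

print(follow(tree, [True]))          # Delirium
print(follow(tree, [False, True]))   # (placeholder diagnosis)
```

No score is computed at any point; each answer only determines which question comes next.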

Computer scoring of tests generally eliminates errors. However, some procedures require the participant or experimenter to score a test, and here research suggests that a surprisingly high percentage of mistakes can be made. Ryan, Prifitera, and Powers (1983) asked 19 psychologists and 20 graduate students to score WAIS-R (Wechsler Adult Intelligence Scale-Revised) information that had been administered to two vocational counseling clients. They found that regardless of professional experience, participants’ scoring of the identical materials produced scores that varied by as much as 4 to 18 IQ points. Other examples of scoring errors with seemingly straightforward procedures abound (Worthen et al., 1993). Scoring becomes even more problematic when human judgment is introduced into the scoring procedures, as with many projective tests and diagnostic tasks.

3.1 Factor analysis

       Definition: A statistical method for understanding the number and type of constructs influencing a test’s score, frequently used for item selection.

       Description. Factor analysis is a method for analysis of test data, typically for the purpose of item selection. Factor analysis has been such an important technique in the development of scoring procedures for tests that I discuss it here. Test developers assume that any large number of items or tests reflect a smaller number of more basic factors or traits. These factors consist of a group of highly intercorrelated variables (Vogt, 1993). Factor analysis refers to a set of statistical procedures used to examine the relations among items or tests and produce an estimate of the smaller number of factors that accounts for those relations.

Two basic types of factor analysis are commonly employed: exploratory and confirmatory. In exploratory factor analysis, little or no knowledge is available about the number and type of factors underlying a set of data. Test developers employ exploratory factor analysis when evaluating a new set of items. In confirmatory factor analysis, knowledge of expected factors is available (e.g., from theory or a previous exploratory factor analysis) and is compared with the factors found in a new dataset. A good way to begin learning about factor analytic techniques and their output is through the statistical user’s manuals provided by companies like SPSSx and SAS.

Golden, Sawicki, and Franzen (1984) maintained that test developers must understand the theory employed to select items in a factor analysis “since the resulting factors can only be interpreted accurately within the context of a theoretical base” (p. 27). Nevertheless, many, if not most, test developers base their item selection only loosely on theory. Gould (1981) similarly criticized the use of factor analysis in the creation of intelligence tests. Gould argued that many social scientists have reified intelligence, treating it as a physical entity instead of as a construct. Regarding the claim that a factor corresponds to a real entity, Gould maintained that “such a claim can never arise from the mathematics alone” (p. 250) and that no supporting evidence exists in the case of intelligence.

One decision that test developers must make during the course of a factor analysis is whether to rotate the factor loadings. If test developers desire their factors to be independent of one another (i.e., orthogonal), the analysis includes a rotation (but see Pedhazur & Schmelkin, 1991, for a different perspective). Another issue is deciding how many factors should be extracted during an analysis. One approach is to examine the eigenvalues of the extracted factors; an eigenvalue is obtained by summing the squared loadings on a factor and roughly corresponds to the amount of variance that factor explains. A general rule of thumb is that factors with eigenvalues of 1 or more be considered useful.
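
The eigenvalue rule of thumb can be illustrated with a short Python sketch. The loading matrix is made up for illustration (six items, three factors), and the sketch assumes unrotated loadings from standardized items, so that total variance equals the number of items. It does not perform the factor extraction itself, which statistical packages handle.

```python
def eigenvalues_from_loadings(loadings):
    """Sum the squared loadings in each column (one column per factor)."""
    n_factors = len(loadings[0])
    return [sum(row[j] ** 2 for row in loadings) for j in range(n_factors)]


# Hypothetical loadings: rows are items, columns are factors.
loadings = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.0],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.2],
    [0.1, 0.6, 0.3],
]

eigenvalues = eigenvalues_from_loadings(loadings)

# With standardized items, each item contributes one unit of variance.
n_items = len(loadings)
proportions = [value / n_items for value in eigenvalues]

# Rule of thumb: keep factors whose eigenvalue is 1 or more.
retained = [j for j, value in enumerate(eigenvalues) if value >= 1.0]

for j, (value, share) in enumerate(zip(eigenvalues, proportions)):
    print(f"factor {j}: eigenvalue {value:.2f}, proportion of variance {share:.2f}")
print("retained factors:", retained)
```

With these made-up loadings, the first two factors have eigenvalues of about 1.55 and 1.63 and are retained, while the third, at about 0.16, is not.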


Blaha and Wallbrown (1996) conducted factor analyses on the Wechsler Intelligence Scale for Children (WISC-III) subtest intercorrelations. Subtests include arithmetic, vocabulary, picture completion, and mazes. Blaha and Wallbrown obtained two- and four-factor solutions for four age levels (6-7, 8-10, 11-13, and 14-16 years old). The two-factor results supported a general g factor (defined as an overlap among different assessments of intelligence) as well as two major group factors of verbal-numerical-educational ability and spatial-mechanical-practical ability. The four-factor solution suggested factors of perceptual organization, verbal comprehension, freedom from distractibility, and perceptual speed. Blaha and Wallbrown concluded that these results support the construct validity of the Full Scale IQ of the WISC-III as a measure of general intelligence.


3.2 Scale types

         Definition: The type and amount of information contained in test scores.

         Description. As noted in Chapter 1, four types of measurement scales are commonly described: (a) nominal scales that possess qualitative categories; (b) ordinal scales with rank information; (c) interval scales containing rank information with equal intervals; and (d) ratio scales that contain equal intervals with a meaningful zero point (e.g., number of cigarettes smoked in a day). The picture of a child’s toy in Figure 11 has objects that exemplify nominal, ordinal, and interval scales. Can you identify each?

Figure 11. Scale Types

Each successive type contains more information than the previous. Ratio scales, for example, provide more information about a construct than interval, ordinal, or nominal scales. Ratio scales should be the most precise if they reflect the actual values present in a phenomenon. In the photo example, each row of logs shows color (nominal), rank (the height of each log increases), and interval (the height increases by the same amount). No zero point exists on these scales of height, so no ratio scale is present.


Diagnostic categories typically contain nominal information, that is, they distinguish between different types of phenomena but provide no information about differences within a particular phenomenon. Dihoff et al. (1993) developed ordinal diagnostic criteria with 20 autistic children aged 2-3 years. Subgroupings of the children were identified and found to differ on behavioral measures, standardized tests, and school achievement. Dihoff et al. (1993) reported that use of the ordinal criteria promoted diagnostic agreement among therapists.