1. Test Construction

           Definition: Procedures employed to create a test.

           Description. The rules employed to create a test have serious implications for the interpretation of its scores. Given the importance of test construction, it is surprising that little consensus has developed concerning the best procedures for it. In this and the following sections I describe several sets of construction guidelines.

Gregory (1992) described five steps in test construction: (a) defining the test (e.g., purpose, content), (b) selecting a scaling method (i.e., rules by which numbers or categories are assigned to responses), (c) constructing the items (e.g., developing a table of specifications that describes the specific method employed to measure the test's content areas), (d) testing the items (i.e., administering the items and then conducting an item analysis), and (e) revising the test (e.g., cross-validating it with another sample because validity shrinkage almost always occurs). A researcher evaluating a new mathematics curriculum, for example, might (a) desire a test that could show changes over time in mathematics skills, (b) assign a score of 1 to each math item answered correctly, (c) create a table of specifications indicating what kinds of skills students would be expected to acquire, (d) run a study to determine which items were sensitive to change, and (e) repeat the process with the selected items and a new group of students.

Similarly, Burisch (1984) described three approaches to personality test construction representative of many domains:

  1. External approaches that rely on criteria or empirical data to distinguish useful items. The content of the item is less important than its ability to meet pre-established criteria. The Minnesota Multiphasic Personality Inventory (MMPI) items, for example, were chosen on the basis of their ability to distinguish between normal persons and those with a diagnosed psychopathology.
  2. Inductive approaches that require the generation of a large item pool, which is then completed by a large number of subjects, with the resulting data subjected to a statistical procedure (such as factor analysis) designed to reveal an underlying structure (see the sketch following this list). Many aptitude tests, such as the General Aptitude Test Battery (GATB), were constructed in this fashion.
  3. Deductive approaches that rely on a theory to generate items. Items that clearly convey the meaning of the trait to be measured and that measure specific (as opposed to global) traits are more likely to be useful. Items for the Myers-Briggs Type Indicator, for example, were originally derived from Jung's (1923) theory of types.
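
To make the inductive approach concrete, the sketch below submits hypothetical Likert-type responses from a large item pool to an exploratory factor analysis and flags items with sizable loadings. The simulated data, the two-factor solution, and the .40 loading cutoff are illustrative assumptions, not the procedure used to construct any particular test such as the GATB.

```python
# A minimal sketch of the inductive approach: collect responses to a large
# item pool, then use factor analysis to suggest an underlying structure.
# The data, number of factors, and loading cutoff are hypothetical.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_respondents, n_items = 500, 40
responses = rng.integers(1, 6, size=(n_respondents, n_items)).astype(float)  # 1-5 ratings

fa = FactorAnalysis(n_components=2).fit(responses)   # extract two latent factors
loadings = fa.components_.T                          # item-by-factor loading matrix

# Items loading strongly on a factor become candidates for that factor's scale.
for item, row in enumerate(loadings):
    if np.max(np.abs(row)) > 0.40:
        print(f"Item {item}: loadings = {np.round(row, 2)}")
```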

Burisch’s (1984) review of the literature found no superiority for any of these approaches in producing reliable and valid tests. In fact, he suggested that it is often better to simply ask individuals to rate themselves on a trait, provided they understand the trait and are motivated to complete the rating task.

Educational and psychological researchers frequently wrestle with the question of whether they need to create a new scale for a study. In the psychological arena alone, estimates are that 20,000 new psychological, behavioral, and cognitive measures are developed each year (American Psychological Association, 1992). It is thus quite likely that a self-report scale, interview, or other operation has already been developed in your practice or research area.

The question then becomes one of finding that operation. Most disciplines have books or databases that are good places to start. In education and psychology, for example, sources of information about published tests include Tests in Print (Buros Institute for Mental Measurements) and Mental Measurements Yearbook (Buros Institute for Mental Measurements). Sources of unpublished tests include the Directory of Unpublished Experimental Mental Measures (Wm. C. Brown) and Measures for Psychological Assessment: A Guide to 3,000 Original Sources and Their Application (Institute for Social Research, University of Michigan). Additionally, the Health and Psychosocial Instruments (HAPI) database lists over 7,000 instruments.

1.1 Item analysis

               Definition: Methods for evaluating the usefulness of test items in relation to test purpose.

                Description. Test developers typically perform item analyses during test construction to determine which items should be retained or dropped. Although test items are usually questions or statements, here I use item to mean any distinct measurement operation, including an observation or a behavioral performance.

A story about how Thomas Edison invented the light bulb illustrates the item analysis and test construction process. Edison reportedly evaluated thousands of materials in the search for a filament that could conduct electricity, emit light while producing minimal heat, and endure for a long period of time. Similarly, test developers sort through dozens or hundreds of items in an attempt to find those that exhibit the characteristics desired for a particular test.

Regarding guidelines for item selection, Jackson (1970) proposed four general criteria that remain relevant. He suggested that scales (a) be grounded in theory, (b) suppress response style variance, (c) demonstrate reliability, homogeneity, and generalizability, and (d) demonstrate convergent and discriminant validity. Criterion (a) can be evaluated by noting the degree to which the initial item pool was rationally constructed. The degree of response style or response set variance (b) can be assessed by correlating items with a measure of social desirability. Criterion (c) can be assessed by examining item-total correlations and by checking for ceiling and floor effects (i.e., participants’ responses to an item clustering near the top or bottom of the possible range of scores). Convergent and discriminant validity (d) can be assessed by correlating scale items with measures of related and of different constructs.
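
The sketch below shows one way criteria (b) and (c) might be screened in practice, assuming a respondents-by-items response matrix and a social desirability score for each respondent; the variable names and flagging cutoffs are hypothetical illustrations rather than Jackson's own values.

```python
# Simple screens related to Jackson's criteria (b) and (c): correlate each item
# with a social desirability measure, compute item-total correlations, and flag
# possible ceiling or floor effects. Data and cutoffs are hypothetical.
import numpy as np

def screen_items(items, social_desirability, scale_min=1, scale_max=5):
    total = items.sum(axis=1)
    results = []
    for j in range(items.shape[1]):
        rest = total - items[:, j]                      # total score excluding this item
        item_total_r = np.corrcoef(items[:, j], rest)[0, 1]
        desirability_r = np.corrcoef(items[:, j], social_desirability)[0, 1]
        mean = items[:, j].mean()
        results.append({
            "item": j,
            "item_total_r": round(item_total_r, 2),      # criterion (c): homogeneity
            "desirability_r": round(desirability_r, 2),  # criterion (b): response style
            "ceiling": mean > scale_max - 0.5,           # responses cluster near the top
            "floor": mean < scale_min + 0.5,             # responses cluster near the bottom
        })
    return results
```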


Musser and Malkus (1994) employed an item analysis to develop the Children’s Attitudes Toward the Environment Scale (CATES), a measure designed to assess children’s knowledge about the natural environment. First, they administered a pool of 90 items to 232 fourth and fifth grade students; next, they evaluated those items in terms of their internal consistency (seeking items with high item-total correlations), mean level (dropping items showing ceiling or floor effects), and variability (dropping items showing low variability). The 25 selected items were then administered to a new sample of 90 third, fourth, and fifth grade students; coefficient alpha for this administration was .70. Finally, the 25 items were administered twice, from 4 to 8 weeks apart, to 171 third, fourth, and fifth grade students. Test-retest reliability was calculated at .68; coefficient alphas for the two administrations were .80 and .85. These repeated waves of item administration, analysis, and item selection typify most item analyses. Also notice that the analyses Musser and Malkus employed, although standard, are best used to select items that measure stable constructs. The resulting items are likely to be less useful for studying constructs that change.
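
Coefficient alpha and test-retest correlations of the kind Musser and Malkus reported can be computed with standard formulas; the sketch below assumes hypothetical respondents-by-items score matrices and does not reproduce the CATES data.

```python
# Coefficient (Cronbach's) alpha and test-retest reliability for item scores.
# 'scores', 'time1', and 'time2' are hypothetical respondents-by-items arrays.
import numpy as np

def coefficient_alpha(scores):
    k = scores.shape[1]                               # number of items
    item_variances = scores.var(axis=0, ddof=1).sum() # sum of item variances
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_variances / total_variance)

def test_retest_reliability(time1, time2):
    # Correlation between total scores from two administrations of the same items.
    return np.corrcoef(time1.sum(axis=1), time2.sum(axis=1))[0, 1]
```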


1.2 Selection Tests

            Historically, tests have been created for the purposes of selection. The success of specific selection tests, such as IQ measures, led to the adoption of their test construction methods across much of the social sciences. Here I review some of the major ideas in this area, such as traits and latent traits, and one of the controversies they spawned, cross-situational consistency.

1.2.1 Traits

            Definition: Persistent personal characteristics often assumed to be of biological origin and resistant to environmental influences.

            Description. Individual differences refers to the idea that individuals may behave differently on the same tasks or in the same situations (Dawis, 1992). Stable individual differences are traits. Theorists usually assume that traits are normally distributed in the population; that is, a frequency distribution of any trait should resemble a bell-shaped curve.

Selection testers typically treat measurement as nomothetic. That is, they measure traits, such as neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness (McCrae & Costa, 1987), presumed to be present in every person. In contrast, idiographic methods assume that individuals are unique and that traits may or may not be present in different individuals. In addition, many test theorists believe that traits are latent, that is, unobservable characteristics that may be indicated by clusters of behaviors. If no single behavior can define a construct (i.e., no single operational definition exists), then clusters of behaviors may be able to do so. For example, no single behavior is assumed to be indicative of intelligence.

The Big Five refers to the consensus reached by personality researchers that five traits constitute the basic domains of personality. These traits are neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness. Although this research remains open to alternative explanations (cf. Almagor, Tellegen, & Waller, 1995; Block, 1995), support for the Big Five interpretation (John, Angleitner, & Ostendorf, 1988; McCrae & Costa, 1987) has been bolstered by factor analyses of trait descriptions produced by different methods (such as ratings by others and self-report) and different samples (e.g., cross-cultural samples).

1.2.2 Cross-situational consistency

         Definition: Tendency of a person to behave consistently across situations or contexts.

        Description. If traits are the dominant psychological phenomena, individuals should behave consistently across situations. In contrast, situational specificity refers to the tendency of individuals to behave according to the specific situation in which they find themselves.

Traits are assumed to be stable across situations. Thus, persons described as honest are expected to display honest behavior regardless of the situations in which they find themselves. For example, individuals who score low on a test of honesty may behave dishonestly in classrooms and stores, while more honest individuals behave honestly in those settings. In religious situations, however, both high and low honesty individuals may behave honestly. Honest behavior in this case is situation specific.

Use of the term trait implies that enough cross-situational stability occurs so that “useful statements about individual behavior can be made without having to specify the eliciting situations” (Epstein, 1979, p. 1122). Similarly, Campbell and Fiske (1959) stated that “any conceptual formulation of trait will usually include implicitly the proposition that this trait is a response tendency that can be observed under more than one experimental condition” (p. 100). Magnusson and Endler (1977) discussed coherence, a type of consistency that results from the interaction between individuals’ perception of a situation and their disposition to react consistently in situations so perceived. The factors that influence this interaction, such as intelligence, skills, learning history, interests, attitudes, needs, and values, may be quite stable within individuals.


Lyytinen (1995) studied the effects of two different situations on children’s pretend play. She placed 81 children aged 2 to 6 years in either a play-alone condition or a condition with a same-gender, same-age peer. Children playing with the familiar peer displayed a significantly higher proportion of pretend play acts than when playing by themselves. Children playing with another child, however, displayed fewer play acts overall because of the time they spent looking at and talking about each other’s play. Thus, situational specificity appears to be at work in the pretend play of children.


1.3 Tests to Measure Change

             While trait-based tests have been the dominant paradigm in much of social science testing, many researchers have been interested in studying the effects of development and interventions. A new set of ideas is needed to lay the foundation for measurement and assessment here. I discuss states, aptitude-by-treatment interactions, and change-based measurement.

1.3.1 States

             Definition: Transitory psychological phenomena that change because of psychological, developmental, or situational causes.

             Description. States are internal or external psychological characteristics that vary. Even theorists interested in measuring traits acknowledge the presence of state effects in psychological testing. For example, many cognitive abilities such as reading and mathematics skills may have a genetic component, but some aspects of those skills may still change as a result of development (e.g., improvement with age) and interventions (e.g., education).

Collins (1991) described a test construction method appropriate for measuring development. Collins was interested in predicting and measuring patterns of change in grade school students’ acquisition of mathematical skills. She proposed that children first learned addition, then subtraction, multiplication, and division, in that order. Such a sequence can be characterized as cumulative (i.e., abilities are retained even as new abilities are gained), unitary (i.e., all individuals learn in the same sequence), and irreversible (i.e., development is always in one direction) (Collins, 1991). This sequence can be employed to search for items and tasks that do and do not display the expected sequence of mathematics performance over time.
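
As a rough illustration, the sketch below checks whether a child's longitudinal mastery pattern is consistent with a cumulative, unitary, irreversible sequence. The data format and skill labels are assumptions for illustration; this is not Collins' own analytic procedure.

```python
# Sketch: check whether longitudinal mastery data are consistent with a
# cumulative, unitary, irreversible skill sequence (addition before subtraction,
# subtraction before multiplication, and so on). Data format is hypothetical.
SKILL_ORDER = ["addition", "subtraction", "multiplication", "division"]

def follows_sequence(waves):
    """waves: list of sets, one per measurement occasion, of skills mastered."""
    previous = set()
    for current in waves:
        if not previous <= current:                   # irreversible: mastered skills persist
            return False
        ranks = sorted(SKILL_ORDER.index(skill) for skill in current)
        if ranks != list(range(len(ranks))):          # unitary/cumulative: no skipped steps
            return False
        previous = current
    return True

# Example: one child's mastery across three waves follows the expected sequence.
print(follows_sequence([{"addition"},
                        {"addition", "subtraction"},
                        {"addition", "subtraction", "multiplication"}]))  # True
```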


The State-Trait Anxiety Inventory (STAI; Spielberger, Gorsuch, & Lushene, 1970) is one of the most widely used state-trait measures. The STAI consists of two 20-item Likert scales that measure state anxiety (i.e., situation-specific, temporary feelings of worry and tension) and trait anxiety (i.e., a more permanent and generalized feeling). Both scales contain items with similar and overlapping content: state scale items include “I am tense,” “I feel upset,” and “I feel content,” while trait scale items include “I feel nervous and restless,” “I feel secure,” and “I am content.” However, the state scale asks test-takers to rate the items according to how they feel “at this moment,” while the trait scale asks for ratings of how they “generally” feel.

The instructions do seem to produce the desired difference: test-retest reliability estimates for the state scale are considerably lower than those for the trait scale (Spielberger, Gorsuch, & Lushene, 1970). The STAI typically correlates at moderate to high levels with other measures of anxiety. For example, Bond et al. (1995) asked patients with anxiety disorders and normal controls to complete the STAI and a visual analogue scale rating of anxiety. In this approach participants mark a point along a 100-mm line to indicate their level of anxiety; such visual measures are useful when frequent mood ratings are needed or when reading ability is a problem. Bond et al. (1995) found correlations in the .50s and .60s between the two scales, suggesting a modest degree of overlap.


1.3.2 Aptitude-by-treatment interactions (ATIs)

       Definition: Interaction of individuals’ characteristics with interventions.

       Description. Treatments and interventions such as counseling and psychotherapy can be conceptualized as special types of situations or environments (Cronbach, 1975). In a study where an experimental group is contrasted with a control group, the two groups experience different types of situations. Persons can also be conceptualized as having aptitudes, that is, individual characteristics that affect response to treatments (Cronbach, 1975). In an ATI study researchers attempt to identify important individual characteristics or differences that facilitate or hinder the usefulness of various treatments. A computer-based mathematics course or any type of distance learning course, for example, would probably be most beneficial to students who are comfortable with and knowledgeable about technology.

From a common sense perspective, ATIs should be plentiful in the real world. It seems reasonable to assume that persons with certain characteristics should benefit more from some treatments than others. From the perspective of selection, intervention, and theoretical research, finding ATIs would seem to be of the utmost importance. ATIs offer the possibility of increased efficiency in these applied areas.


Domino (1971) investigated the interaction between learning environment and student learning style. Domino hypothesized that independent learners, students who learn best by setting their own assignments and tasks, might show the best outcomes in a class when paired with teachers who provided considerable independence. Similarly, conforming students who learn best when provided with assignments by the teacher might perform better when paired with instructors who stressed their own requirements. Domino did find empirical support for this interaction.


1.3.3 Change-Sensitive Tests

       Definition: Tests designed to detect change from interventions, development, and/or situational influences.

       Description. As described in Chapter 2, these measures are intended to detect not stable traits, but states and other conditions, such as moods or skills, that change over time and across situations. As noted previously, testing traditionally has focused on measuring traits such as intelligence that were assumed to be largely a function of heredity and immune to situational, developmental, and intervention influences. Attempts to measure traits affected how tests were constructed; reliability and validity became the central criteria for evaluating a test's quality. Efforts to develop tests whose purpose is to be sensitive to intervention and developmental effects are relatively new.

Drawing on concepts from criterion-referenced and longitudinal testing (Gronlund, 1988; Tryon, 1991), Meier (2004) developed test construction rules designed to select test items and tasks sensitive to intervention effects. Intervention items, like traditional items, should be theoretically based, unrelated to systematic error sources, and free of ceiling and floor effects. Because empirically derived items may capitalize on sample-specific variance, items should be cross-validated on new samples drawn from the same population. Intervention-sensitive items, however, should possess several unique properties, foremost of which is that they change in response to an intervention and remain stable over time when no intervention is present.
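
A minimal sketch of this kind of item screen follows, assuming pre- and post-test score matrices for an intervention group and a no-intervention control group. The cutoffs (in standard deviation units) and the array layout are illustrative assumptions, not Meier's published rules.

```python
# Sketch of a change-sensitive item screen: keep items that shift from pre- to
# post-test for intervention participants but remain stable for controls.
# Arrays are respondents-by-items; cutoff values are illustrative only.
import numpy as np

def change_sensitive_items(pre_tx, post_tx, pre_ctl, post_ctl,
                           min_change=0.5, max_control_change=0.2):
    keep = []
    for j in range(pre_tx.shape[1]):
        pooled_sd = np.sqrt((pre_tx[:, j].var(ddof=1) + pre_ctl[:, j].var(ddof=1)) / 2)
        if pooled_sd == 0:
            continue                                   # no variability; cannot standardize
        tx_shift = (post_tx[:, j].mean() - pre_tx[:, j].mean()) / pooled_sd
        ctl_shift = (post_ctl[:, j].mean() - pre_ctl[:, j].mean()) / pooled_sd
        if abs(tx_shift) >= min_change and abs(ctl_shift) <= max_control_change:
            keep.append(j)
    return keep
```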


Meier (1998) compared traditional and change-sensitive item selection rules with an alcohol attitudes scale completed by college students in an alcohol education group and a control group. The intervention and traditional item selection guidelines produced two different sets of items with differing psychometric properties. The intervention-sensitive items did detect pre-post change; these items also possessed lower test-retest reliability in intervention participants while demonstrating stability when completed by controls. In contrast, items evaluated with traditional criteria demonstrated greater internal consistency and variability, characteristics that enhance measurement of stable individual differences. In a study of a symptom checklist completed at intake and termination by students at a college counseling center, Weinstock and Meier (2003) found similar differences between intervention-sensitive and traditionally selected items.


One of the basic analyses when assessing change in test scores involves the calculation of effect size (ES). ES refers to the magnitude of an effect, such as a correlation between two variables or the difference between a treatment group and a control group. ES is important with change-sensitive tests because items and scales should show greater-than-zero effects when administered to groups receiving interventions that produce change over time. A common way to calculate ES between groups is:

Figure 8

Calculation of Effect Size Between Two Groups

ES = (M_intervention − M_comparison) / SD_pooled

where the mean of the comparison group is subtracted from the mean of the intervention group, and this difference is divided by the pooled (combined) standard deviation of both groups. With this calculation, ESs greater than .70 are generally considered large effects.
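
The sketch below computes this between-groups ES with hypothetical scores; the pooled standard deviation here weights each group by its degrees of freedom, one common convention.

```python
# Between-groups effect size: intervention mean minus comparison mean,
# divided by the pooled standard deviation of the two groups.
import numpy as np

def effect_size(intervention, comparison):
    n1, n2 = len(intervention), len(comparison)
    pooled_sd = np.sqrt(((n1 - 1) * np.var(intervention, ddof=1) +
                         (n2 - 1) * np.var(comparison, ddof=1)) / (n1 + n2 - 2))
    return (np.mean(intervention) - np.mean(comparison)) / pooled_sd

# Hypothetical post-test scores for an intervention and a comparison group.
print(round(effect_size([14, 16, 15, 18, 17, 15], [12, 13, 14, 12, 13, 15]), 2))
```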