2. Test Administration

       Definition: Process by which a test-taker completes a test.

       Description. In most testing situations, the administrator’s primary job is to ensure standardization (i.e., the establishment of similar test procedures; see below) of the testing environment. Worthen et al. (1993) suggested several guidelines for administering tests, including (a) checking the physical setting for appropriateness (e.g., adequate lighting, temperature); (b) ensuring that participants know what they are supposed to do; (c) monitoring the test administration; and (d) following any standardized instructions carefully (e.g., as provided with a published test). Test-takers, on the other hand, bring their unique individual differences to the testing situation, and some of these differences complicate the standardization effort. Kahn (2001) found that how individuals defined the construct they were asked to report (in that study, power in a family) influenced the scores they actually reported.

Recall that I employ test generically, that is, to mean any type of measurement and assessment device. How a test is administered at least partially distinguishes among measurement and assessment types. In self-reports the participants themselves read and respond to items. In interviews the assessor reads items or questions to participants. Self-reports have the advantage of requiring fewer resources (i.e., you do not need an interviewer), while interviews offer greater depth of understanding (i.e., you can ask respondents to elaborate and they can ask you to clarify).


Blais, Norman, Quintar, and Herzog (1995) compared two methods of administering the Rorschach projective test (i.e., the Rapaport and Exner systems). The Rorschach consists of 10 inkblots designed to provide ambiguous stimuli. The two administration systems, which differ mainly in examiner-examinee seating arrangements and questioning instructions, were presented in randomly assigned order to 20 women with bulimia. Significant differences were found between the two systems, with the Exner system producing more color and shading responses. Interestingly, system differences were most prominent on the first of the two administrations. Other research has also shown that Rorschach scores can change because of administrators’ differing instructions (Exner, 1986).


2.1 Standardization

       Definition: Establishment of identical or similar test procedures for each respondent.

       Description. Standardization is designed to reduce error by making the test conditions and environment as similar as possible for everyone who takes the test. These conditions include procedures such as the time allowed to complete the test, the legibility of the test materials, and the order in which subscales or tests are administered.

When students take the GRE or LSAT, for example, no differences should exist in the testing environment. Lighting should be adequate, the temperature should be comfortable, the room should be quiet, and so forth. The use of computers with such tests, discussed further in a subsequent section, raises an interesting issue about standardization. While most test-takers are likely to be familiar with paper-and-pencil media, the introduction of computers into such testing may represent a significant change in testing conditions for the subgroup of students unfamiliar with computers.

2.2 Response strategies

       Definition: Processes individuals use to complete test items, problems, and tasks.

       Description. Test-takers use two major strategies to respond to test items: retrieval and generative. Retrieval strategies involve the recall and reconstruction of information. For example, when you go to your physician for a physical examination, you may be asked whether or not you take any medications, have had previous surgeries, and so forth; to answer these questions, you must recall your past experiences.

When individuals cannot or will not employ retrieval strategies, they use generative strategies, which involve the creation of information. Examples of generative strategies include random responding, dissimulation, malingering, and social desirability. When individuals respond randomly to a measurement device, they select answers by chance rather than by item content. With malingering, respondents simulate or exaggerate negative psychological conditions (e.g., anxiety, psychopathology). Respondents who dissimulate attempt to fake good or bad on tests. Socially desirable responses are those that are socially acceptable or present the respondent in a favorable light.

Response sets and response styles represent similar concepts that focus more on motivational than cognitive factors (Lanyon & Goodstein, 1982). With response sets the test-taker distorts answers in an attempt to generate a specific impression (e.g., “I have good work habits for this job”).

Response Set:  Distort toward --->  SPECIFIC IMPRESSION

Social desirability is an example of a response set because the respondent is attempting to answer items in a way that leaves a positive impression. In contrast, with response styles the distortion is in a particular direction regardless of item content.

Response Style:  Distort toward --->  PARTICULAR RESPONSE DIRECTION

Examples of response styles are acquiescence errors (i.e., the tendency to agree regardless of content) and criticalness errors (i.e., the tendency to disagree regardless of content). Acquiescence can be an issue with a test such as the Career Maturity Inventory (CMI), where false responses to 43 of the 50 attitude items are scored as indicating career maturity; a respondent who simply agreed with every item could therefore receive credit on at most 7 of the 50 items, regardless of his or her actual career maturity, as the sketch below illustrates.
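To make the arithmetic concrete, here is a minimal sketch, in Python, of how an acquiescent response style caps the score on a scale keyed in this way. The item counts follow the CMI example above, but the scoring function, the keying, and the response patterns are invented for illustration; this is not the actual CMI scoring key.

```python
# Illustrative sketch only: a hypothetical 50-item attitude scale keyed like the
# CMI example above (43 items keyed False, 7 keyed True). The key and the response
# patterns are invented for demonstration and are not the actual CMI.

def score(responses, key):
    """Award one point for each response that matches the keyed answer."""
    return sum(r == k for r, k in zip(responses, key))

key = [False] * 43 + [True] * 7            # 43 of the 50 items keyed False

careful = [False] * 40 + [True] * 10       # hypothetical respondent answering by content
acquiescent = [True] * 50                  # agrees with every item, regardless of content

print(score(careful, key))      # 47: the score reflects the respondent's answers
print(score(acquiescent, key))  # 7: the ceiling for a pure "agree with everything" style
```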

As shown in Figure 9, it is possible for multiple sources of error, such as acquiescence and social desirability, to influence a single measurement or assessment method. If the method is a test, for example, summing items that contain systematic error produces a total score that reflects both the construct and the error sources (i.e., invalidities):

Figure 9

Differential Effects of Error Sources by Item
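In classical true-score terms, one way to sketch the idea behind Figure 9 is to write each item score as the sum of a construct component, systematic error components, and random error; the notation below is illustrative and is not taken from the figure itself.

```latex
% Illustrative decomposition (notation assumed for this sketch, not from Figure 9):
% each item score X_i carries the construct T_i, acquiescence A_i, social
% desirability D_i, and random error E_i, so the total score sums all four.
\[
X_i = T_i + A_i + D_i + E_i
\qquad\Rightarrow\qquad
X_{\mathrm{total}} = \sum_{i=1}^{k} X_i
                   = \sum_{i=1}^{k} T_i + \sum_{i=1}^{k} A_i
                   + \sum_{i=1}^{k} D_i + \sum_{i=1}^{k} E_i .
\]
```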

Response sets partially result from the clarity of item content: the more transparent the item, the more likely that a response set such as social desirability will occur (Martin, 1988). Murphy and Davidshofer (1994) suggest that a question like “I hate my mother” is very clear and invites a response based on its content. If the item is ambiguous, however, then the probability of a response style such as acquiescence increases. Martin (1988) noted that projective tests were partially constructed on the assumption that more ambiguous stimuli would lead to less faking and socially desirable responding. This assumption, however, has not received much empirical support. Similarly, test experts have debated the usefulness of more subtle but ambiguous items, whose intent may be less transparent to test-takers, but that may also invite acquiescence or criticalness because individuals have little basis on which to respond.

A question like “I think Lincoln was a greater President than Washington” is less transparent, but a respondent who must generate a response may simply agree because of the positive item wording. Such a respondent might also agree with the statement that “I think Washington was a greater President than Lincoln.” Research tends to support the validity of obvious items over subtle ones. Consequently, the use of subtle items to diminish response sets may increase the likelihood of a response style and thereby diminish test validity.

Generative responses would seem more likely with reactive or transparent tests. Reactivity refers to the possible distortion that may arise from individuals’ awareness that they are being observed or are self-disclosing.


Wetter and Deitsch (1996) investigated the consistency of responses to the MMPI-2 by persons faking post-traumatic stress disorder (PTSD), persons faking closed-head injury (CHI), and controls. The researchers asked 118 undergraduate students to imagine they were part of a lawsuit in which faking psychological symptoms would increase the chances of a large financial award. After reading descriptions of the disorder they were told to fake, participants completed the MMPI-2 twice (at a 2-week interval). Significantly lower reliability coefficients were found for scales completed by individuals faking CHI than for controls or persons faking PTSD.


2.3 Rater errors

       Definition: Judgments produced by raters that are irrelevant to the purpose of the assessment.

       Description. Given the prevalence of ratings in counseling, occupational, and educational settings, it is no surprise that investigators have studied a number of different types of rater errors. I summarize the most important types below.

Murphy and Davidshofer (1994) described (a) halo errors, in which a rater’s overall impression of the ratee influences ratings of specific aspects of the person; (b) leniency errors, overestimates of ratee performance; and (c) criticalness errors, underestimates of ratee performance. To illustrate the latter two errors, suppose you are an employee who has two supervisors. Figure 10 below displays a frequency count of your actual performance, that is, it summarizes the quality of a large number of your performances. You can see that you have relatively few low- or high-quality performances, and that most of your work would be rated as of moderate quality. In contrast, Supervisor A’s ratings fall below your actual performances, while all of Supervisor B’s ratings fall above your actual work quality. Your supervisors are displaying criticalness and leniency errors, respectively.

Figure 10

Leniency and Criticalness Errors
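A minimal simulation can make the pattern in Figure 10 concrete. The sketch below uses invented numbers rather than data from the figure: one supervisor shifts every rating downward (criticalness) and the other shifts every rating upward (leniency), so both rating distributions keep their shape but sit on either side of the actual performance distribution.

```python
# Invented numbers for illustration only; this simulates the pattern Figure 10 depicts.
import random

random.seed(1)
actual = [random.gauss(5.0, 1.0) for _ in range(200)]     # actual quality, mostly moderate, 1-9 scale
supervisor_a = [max(1.0, x - 2.0) for x in actual]        # criticalness error: rates below actual
supervisor_b = [min(9.0, x + 2.0) for x in actual]        # leniency error: rates above actual

mean = lambda xs: sum(xs) / len(xs)
print(round(mean(actual), 2),
      round(mean(supervisor_a), 2),
      round(mean(supervisor_b), 2))   # roughly 5, 3, and 7: same shape, shifted down and up
```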

Hypothesis confirmation bias is a special type of error committed by test users, researchers, counselors, educators, and laypersons (in other words, everyone). It refers to the tendency to crystallize on early impressions and to ignore later information that contradicts the initial hypothesis (Klayman, 1995). Overshadowing occurs when a rater focuses on a particularly salient aspect of a person or situation (e.g., mental retardation) while ignoring other aspects that may also be important (e.g., mental illness).

Among the solutions to rater errors are providing thorough training, calculating interrater reliability (and redoing the ratings if reliability is low), and rechecking raters’ reliability at random intervals.
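As a sketch of the "calculate interrater reliability" step, the example below correlates two raters' scores and flags low agreement. The ratings and the 0.70 cutoff are assumptions for illustration; in practice the appropriate index (e.g., Cohen's kappa or an intraclass correlation) and cutoff depend on the rating scale and the purpose of the ratings.

```python
# Sketch: interrater reliability as a Pearson correlation between two raters' scores.
# The ratings and the 0.70 cutoff below are invented for illustration.
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

rater_1 = [4, 5, 3, 4, 2, 5, 4, 3]
rater_2 = [4, 4, 3, 5, 2, 5, 3, 3]

r = pearson(rater_1, rater_2)
print(round(r, 2))                   # about 0.81 with these invented ratings
if r < 0.70:                         # assumed cutoff; acceptable values vary by purpose
    print("Low interrater reliability: retrain the raters and redo the ratings.")
```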


Haverkamp’s (1993) research provides an example of the hypothesis confirmation bias in a counseling context. She asked 65 counseling students to view a videotape of a counselor-client interaction. Students were provided with problem descriptions generated by the client and were also asked to generate their own hypotheses about the client’s problem. After viewing the videotape, students were presented with a series of tasks (e.g., what further questions would you ask?) designed to determine what type of information they sought (e.g., confirmatory, disconfirmatory, neutral, other). Haverkamp found that student counselors did not seek to confirm the hypotheses provided by the clients, but did attempt to confirm their own hypotheses about the client. Such an approach, Haverkamp maintained, means that counselors may ignore information that could support an equally plausible explanation and intervention for the client’s problem.


2.4 Administrator-respondent relationship

        Definition: The degree of rapport and trust established between the test administrator/interviewer and the person taking the test.

       Description. Traditionally, the relationship between the administrator and the test-taker has been placed in the background. Test developers and publishers do urge administrators to establish rapport with test-takers, but the presence of this rapport is seldom assessed or monitored (cf. Worthen et al., 1993). While research has examined the effects of administrator characteristics on respondents, relatively little attention has been paid to the administrator-respondent relationship because test theorists and developers usually do not consider the relationship an important factor.

In qualitative assessment, the relationship is assumed to influence the honesty and accuracy of the information shared by the test-taker (Strauss & Corbin, 1990). That is, to the extent that the test-taker trusts the administrator, the test-taker is more likely to make the effort to produce reliable and valid information.

One way of approaching the issue of administrator and interviewer effects is to compare traditional test administration to situations where little or no administrator-test-taker interaction occurs. For example, are tests administered or introduced by humans equivalent to computer-administered tests and interviews? In other words, does the automation of test procedures affect the method’s reliability and validity? Some researchers have found no differences between traditional and computer-administered versions of tests. However, some individuals who take computer-administered tests show more anxiety (cf. George, Lankford, & Wilson, 1990), alter their rate of omitting items (Mazzeo & Harvey, 1988), or increase their faking-good responses (Davis & Cowles, 1989). Students who have recently taken the computer-administered version of the GRE or similar tests should compare their experiences to other testing situations. Given the equivocal research findings, the equivalence issue currently must be considered on a test-by-test, sample-by-sample basis.