3. What Diminishes Reliability, Validity, and Power?

           3.1 Measurement error

           Definition: Phenomena that influence test scores in unintended ways.

           Description. In classical test theory, an observed test score is a combination of a true score and error. Traditionally this has been represented by the following formula,

                                Y = X + e

where Y is the test-taker’s observed score, X is the test-taker’s true or error-free score, and e is error that increases or decreases the observed score relative to the true score.
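
To make the formula concrete, the following minimal Python sketch (simulated data; all values are hypothetical) generates true scores and random errors and shows that reliability can be read as the share of observed-score variance attributable to true scores:

    import numpy as np

    rng = np.random.default_rng(0)

    n = 10_000                     # hypothetical number of test-takers
    X = rng.normal(50, 10, n)      # true scores (unobservable in practice)
    e = rng.normal(0, 5, n)        # random error; can raise or lower a score
    Y = X + e                      # observed scores

    # Because X and e are independent, var(Y) = var(X) + var(e), so
    # reliability is the share of observed variance due to true scores.
    reliability = X.var() / Y.var()
    print(f"estimated reliability: {reliability:.2f}")  # about 100/125 = 0.80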

The term bias is sometimes used to refer to systematic errors associated with membership in a group. Socioeconomic status or race/ethnicity of test-takers, for example, may interact with test items to over- or underestimate their true performance. One of the central controversies with intelligence tests, for example, is whether intelligence tests underestimate the ability levels of persons of color. Such bias can be checked by investigating whether:

           1. The content of the test is more familiar to certain groups than others (Helms, 1992). First, select test-takers from different groups who have similar total scores on the test. Next, determine whether any individual items are passed or failed by different proportions of individuals in each group. If so, that item may be biased.

            2. The test does a better or worse job of predicting a criterion for different groups. The relation between the test and the criterion can be expressed with a regression line. If the slopes of the groups' regression lines differ, then bias is present: equal scores on the test do not indicate equal expected performance on the criterion. Both checks are sketched in the example following this list.
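
The following Python sketch illustrates both checks with simulated data; the group sizes, score ranges, and the item bias built into the simulation are all hypothetical:

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical data: 500 test-takers in two groups, with total test
    # scores, one item response, and a criterion measure.
    n = 500
    group = rng.integers(0, 2, n)           # group membership (0 or 1)
    total = rng.integers(10, 41, n)         # total score on the test
    # This simulated item is easier for group 1 even at the same total
    # score -- the pattern the matched-groups check is meant to catch.
    p_pass = np.clip((total - 10) / 30 + 0.15 * group, 0, 1)
    item = rng.random(n) < p_pass
    criterion = 2.0 * total + rng.normal(0, 5, n)  # same relation per group

    # Check 1 (item bias): among test-takers matched on total score,
    # compare each group's pass rate on the item.
    matched = (total >= 24) & (total <= 26)
    for g in (0, 1):
        rate = item[matched & (group == g)].mean()
        print(f"group {g}: item pass rate at matched totals = {rate:.2f}")

    # Check 2 (predictive bias): fit a regression line per group and
    # compare slopes; differing slopes mean equal test scores do not
    # imply equal expected performance on the criterion.
    for g in (0, 1):
        m = group == g
        slope, intercept = np.polyfit(total[m], criterion[m], 1)
        print(f"group {g}: slope = {slope:.2f}, intercept = {intercept:.2f}")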


Stone et al. (1990) noted that test researchers rarely study the ability of test-takers to understand test instructions, item content, or response alternatives. If test-takers cannot adequately read and understand such content, they may respond to tests in unintended ways–that is, error is introduced. Stone et al. proposed that if respondents lack the cognitive ability to read and interpret questionnaires, their motivation and ability to complete the questionnaire will be impaired and that such effects could be detected by comparing the psychometric properties of questionnaires completed by groups with different levels of cognitive ability.

Stone et al. (1990) used the Wonderlic Personnel Test to classify 347 Army Reserve members into low, medium, and high cognitive ability groups. Subjects also completed an additional 203 items in a test battery of 27 measures that included the Job Diagnostic Survey, which measures such constructs as job satisfaction and organizational commitment. Stone et al. found significant differences in coefficient alpha for 14 of the 27 constructs. In 12 of those cases, the alphas were ordered as predicted: the lower the group's cognitive ability, the lower the scale's reliability estimate. Stone et al. also found a significant negative correlation (r = -.23) between cognitive ability and the number of missing questionnaire responses; that is, persons with lower cognitive ability left more items unanswered. Thus, it appears that respondents' cognitive ability can introduce error with some tests.
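
Coefficient alpha itself is straightforward to compute. The Python sketch below (simulated data; the noise levels standing in for random responding are hypothetical) shows how extra noise in a lower-ability group depresses alpha, in the spirit of Stone et al.'s comparison:

    import numpy as np

    def cronbach_alpha(items):
        """Coefficient alpha for an (n_respondents, n_items) score matrix."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars / total_var)

    rng = np.random.default_rng(2)

    # Hypothetical 10-item scale; extra random responding in the
    # low-ability group is simulated as added noise.
    n, k = 200, 10
    trait = rng.normal(0, 1, n)

    def simulate(noise_sd):
        return trait[:, None] + rng.normal(0, noise_sd, (n, k))

    print(f"high-ability group alpha: {cronbach_alpha(simulate(0.8)):.2f}")
    print(f"low-ability group alpha:  {cronbach_alpha(simulate(1.6)):.2f}")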


 

          3.2 Random and systematic errors

           Definition: Random errors are effects that influence measurement unpredictably, while systematic errors display a pattern or order in their relationship with the measured construct.

            Description. Two specific types of measurement error exist: random and systematic. All measurements are presumed to be influenced by both error sources. Random errors reflect sources that are unrepeatable, haphazard, or do not exhibit a pattern; the degree of random error influences the reliability estimate of a test. In contrast, systematic errors display a pattern or order. Table 4 displays a partial list of systematic error sources that have been studied in the educational and psychological literature, and a simple contrast of the two error types is sketched after the table. In essence, a potentially infinite number of error sources exists, with one ethical implication being that interpreters of educational and psychological tests should be cautious and humble about the meaning and predictive power of test scores.

Table 4

A Partial List of Measurement and Assessment Error Sources

Test-taking skills | Carelessness | Motivation
Ability to comprehend instructions and items | Fatigue, boredom | Valence
Response sets, styles | Emotional strain | Timing
Health | Attention shifts | Schedule of self-monitoring
Motivation | Equipment failure | Type of recording device (computer, paper)
Stress | Variations in lighting and room temperature | Examiner characteristics
External distractions | Number of behaviors concurrently monitored | Information overload

Sources: Nelson (1977); Paul, Mariotto, and Redfield (1986); and Thorndike (1949).
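
Here is the promised contrast of the two error types, as a minimal Python simulation (the constant +3 point "coaching" bump is a hypothetical example of one very simple systematic error):

    import numpy as np

    rng = np.random.default_rng(3)

    n = 10_000
    true = rng.normal(50, 10, n)             # true scores

    # Random error: unpredictable, averages out across test-takers,
    # but inflates variance and therefore lowers reliability.
    obs_random = true + rng.normal(0, 5, n)

    # Systematic error: patterned -- here, a constant +3 point bump
    # (e.g., coaching) for every test-taker.
    obs_system = true + 3.0

    print(f"random:     mean {obs_random.mean():.1f}, "
          f"reliability {true.var() / obs_random.var():.2f}")
    print(f"systematic: mean {obs_system.mean():.1f}, "
          f"reliability {true.var() / obs_system.var():.2f}")
    # Random error leaves the mean near 50 but cuts reliability to
    # about .80; the constant systematic error shifts the mean to 53
    # while leaving variance (and this reliability index) untouched.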

One way to think about systematic error in measurement is to consider instances in which the measurement method and the participant are mismatched, that is, interact in an undesired fashion. Such mismatches can be characterized as cognitive, affective, or behavioral.

Cognitive mismatches occur when (a) the language and cultural assumptions of the test differ from those of the test-taker, (b) the test-taker has no experience in the content area, or (c) the test-taker lacks sufficient cognitive skills (e.g., reading ability), memory skills, or education to understand and complete the test. An interviewer may read complex questions in English, for example, to a person who is not a native speaker of English. When such mismatches occur, the test-taker may respond randomly to items. Such problems may be prevented by pilot testing methods with a small group of persons and subsequently rewriting items and tasks to enhance clarity and understanding.

Affective mismatches occur when test-takers (a) become fatigued or bored during the testing process, (b) have strong concerns about the consequences of testing, or (c) experience anxiety or other emotional problems that interfere with test-taking. Research participants who do not believe that their answers will be treated confidentially may answer in a socially desirable manner; such an instance might arise when teachers administer, collect, and score course evaluation forms from their students. It may be possible to check for and minimize such mismatches by (a) examining item response characteristics between the first and second half of the test to detect fatigue and boredom effects (see the sketch below), (b) developing rapport with test-takers and exploring their testing concerns, and (c) asking sensitive questions at the end of the test.
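
A rough version of the first check, as a Python sketch (simulated responses; the fatigue-related noise added to the second half is hypothetical):

    import numpy as np

    rng = np.random.default_rng(4)

    # Hypothetical 40-item test: fatigue is simulated by adding extra
    # noise to responses in the second half of the items.
    n, k = 300, 40
    trait = rng.normal(0, 1, n)
    first_half = trait[:, None] + rng.normal(0, 0.8, (n, k // 2))
    second_half = trait[:, None] + rng.normal(0, 1.5, (n, k // 2))

    def mean_item_total_r(items):
        # Average correlation of each item with the half's total score;
        # a marked drop late in the test is one possible fatigue signal.
        total = items.sum(axis=1)
        return np.mean([np.corrcoef(items[:, j], total)[0, 1]
                        for j in range(items.shape[1])])

    print(f"first half:  mean item-total r = {mean_item_total_r(first_half):.2f}")
    print(f"second half: mean item-total r = {mean_item_total_r(second_half):.2f}")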

Mismatches resulting from behavioral and environmental factors may occur when (a) observers are present, (b) test-takers have insufficient time to adapt to testing conditions (particularly when special apparatus is required), or (c) test-takers must use inappropriate testing apparatus (e.g., extensive computer keyboard use is required of persons with no computer experience or with physical disabilities that interfere with such activity). To minimize these factors, make observers as unobtrusive as possible and provide sufficient time to adapt and practice responding to tests, tasks, and special apparatus.


In practice, it may be difficult to separate random errors from systematic errors. For example, individuals who are uninterested in completing a test may begin to respond randomly to items or tasks. Berry et al. (1992) investigated such random responding in a series of studies with the MMPI-2. In a study of college students, they found that 60% gave one or more random responses to the 567 items. Seven percent reported random responding to many or most of the items; students who acknowledged some random responding averaged 36 such responses. In a second study, Berry et al. found that most subjects who admitted to random responding reported having done so at the end of the test, although another sizeable group scattered responses throughout. Finally, a study of 32 applicants to a police training program found that 53% indicated that they had randomly responded to some items.


 

          3.3 Mono-method and mono-operation errors

           Definition: Mono-operation error refers to the collection of data through a single operation, while mono-method error occurs when only a single method is used.

           Description. As noted previously, an operation is a specific, single activity designed for measurement. In contrast, method refers to a group of similar measurement operations. For example, you might have two operations (e.g., the State Anxiety Inventory and the Trait Anxiety Inventory) that share a single method (i.e., self-report). Resource constraints frequently create mono-method and mono-operation errors: researchers often find themselves in a situation where they must conduct a study as quickly and efficiently as possible.

Mono-method and mono-operation errors result from the fact that how data are collected strongly influences the resulting data (Campbell & Fiske, 1959). You might avoid a mono-operation bias, for example, by using two separate self-report instruments. You would still, however, have a mono-method problem because you employed only a single method, self-report. Employing multiple operations and multiple methods, in general, increases the chance that the resulting data will reflect the constructs of interest more than the measurement methods.

It is an axiom of educational and psychological measurement that no single operation totally reflects any single construct. Contemporary researchers generally embrace a philosophy of multiple operationalism, that is, they employ multiple measures or methods to measure a single construct. This approach, however, creates additional problems. When operations are measured via different methods, the methods themselves will influence scores. For example, observation and self-report of any single construct will yield at least somewhat divergent scores. Which one is more valid? The default solution in many cases is to aggregate across operations and methods, hoping that construct-relevant variance will accumulate while the influence of irrelevant factors (such as method) is balanced or cancelled; a sketch of this aggregation approach follows below. The other alternative is to conduct a thorough literature review to find and apply a relevant model to derive or select the most appropriate operations for the construct in question.
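
The aggregation idea can be sketched in a few lines of Python (simulated data; the two self-report scales, the observer rating, and their error levels are all hypothetical):

    import numpy as np

    rng = np.random.default_rng(5)

    # Hypothetical scores for one construct (anxiety) from two
    # operations sharing one method (self-report) plus a second
    # method (observer rating).
    n = 250
    anxiety = rng.normal(0, 1, n)                  # the target construct
    self_report_1 = anxiety + rng.normal(0, 0.7, n)
    self_report_2 = anxiety + rng.normal(0, 0.7, n)
    observer = anxiety + rng.normal(0, 0.9, n)

    def z(x):
        return (x - x.mean()) / x.std()

    # Aggregate: standardize each measure, then average. Influences
    # specific to a single measure or method tend to cancel, so the
    # composite tracks the construct better than any one operation.
    composite = (z(self_report_1) + z(self_report_2) + z(observer)) / 3

    for name, score in [("self-report 1", self_report_1),
                        ("observer", observer),
                        ("composite", composite)]:
        r = np.corrcoef(score, anxiety)[0, 1]
        print(f"{name}: correlation with construct = {r:.2f}")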


Meier (1986) presented 31 college undergraduates with a self-report alcohol attitudes scale before and after they viewed a computer-assisted instruction (CAI) program on alcohol education. A statistical comparison of pre- and posttest scores found a significant difference, indicating that students reported more responsible attitudes toward alcohol after the intervention. This study exemplifies both types of biases: (a) mono-operation bias is present because only one measurement device was employed to detect changes from the intervention, and (b) mono-method bias is evidenced by the use of self-report only. Any study with a single measurement device displays both mono-operation and mono-method biases. More typical in the literature are studies that employ multiple measurement devices (thus avoiding mono-operation bias) but only one method, such as self-report (i.e., mono-method bias).


          3.4 Testing and instrumentation errors

           Definition: Changes in scores that result from the use of a particular measurement operation or method.

           Description. Repeatedly administering any type of measurement device or assessment procedure can itself produce changes in scores on the construct of interest. For example, participants who take a test more than once may evidence practice effects, improving their scores upon repeated administrations without the presence of any intervention. They may better their performance on a classroom test, for example, by learning how the instructor writes questions and answers on multiple-choice exams.

Pretest sensitization refers to instances where pretesting influences participants’ behavior during and after an intervention. Both pretest sensitization and practice effects can be grouped as testing effects: Something changes simply because the person completed a test. Instrumentation effects refer to pretest-posttest differences that result from subtle changes in the measure being used. Response-shift error occurs when respondents’ understanding or awareness of the measured construct changes as a result of an intervention or other experiences; essentially, respondents experience a change in their frame of reference. Note that these terms are sometimes used interchangeably in the testing literature and that some overlap in meaning is present. All of these effects take place in the context of repeated administrations of a test, usually with an intervening treatment.