4. What Enhances Reliability, Validity, And Power?

As described in the previous section, events, test-taker characteristics, and situations can diminish test reliability and validity. Nevertheless, test developers and users can create and employ tests in ways that enhance reliability, validity, and, ultimately, the test’s power to detect effects of interest.

4.1 Purpose

Definition: The intended use of scores from a test.

Description. Tests are employed for many purposes, but most of these can be classified under one of three types: theory-building, selection, or detecting change. Tests designed for theory-building provide information to test, evaluate, and modify the hypotheses and models derived from a theory. Historically, selection has been the dominant use: test data inform a decision about whether the test-taker is selected for a job, school, service in the armed forces, and so forth. Tests designed to detect change typically attempt to find effects resulting from interventions of some type or from developmental processes.

The key issue is that problems with power can arise when tests are employed for purposes for which they were not explicitly intended. Selection tests, for example, are constructed with items designed to measure presumably stable individual traits (e.g., intelligence). Many researchers and practitioners, however, then employ these tests in an attempt to gauge the effects of interventions and developmental processes. Scores on standardized achievement tests employed in schools, for instance, may partially reflect such constructs as socioeconomic status and general cognitive ability. They are therefore less likely to show the effects of what is learned in the classroom than are mastery or criterion-referenced tests specially created for evaluating the effects of classroom learning.

An analogy would be a meteorologist who wants to study the effect of temperature on plant growth but uses a barometer to measure temperature. Now, measurements using a barometer for some periods might actually correlate roughly with temperature; during the summer, high barometric pressure is more likely to be associated with warmer temperatures. Consequently, the meteorologist might even find some weak relation between barometric pressure and plant growth. That relation, however, will be weaker than the one found with an instrument whose primary purpose is to measure temperature, the thermometer.
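The analogy can be illustrated with a small simulation. The sketch below is hypothetical: the function name compare_instruments, the effect sizes, and the error levels are all assumptions chosen for illustration, not measured values. It generates plant growth that depends on temperature, then correlates growth with readings from an instrument built to measure temperature and with readings from one that is only loosely related to it.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def compare_instruments(n_obs=5000, seed=0):
    """Return (r_thermometer, r_barometer) against simulated plant growth."""
    rng = random.Random(seed)
    temperature = [rng.gauss(0, 1) for _ in range(n_obs)]
    # Assumed: growth depends on temperature plus unrelated influences.
    growth = [0.5 * t + rng.gauss(0, 1) for t in temperature]
    # Assumed: the thermometer tracks temperature with little error.
    thermometer = [t + rng.gauss(0, 0.2) for t in temperature]
    # Assumed: the barometer is only loosely tied to temperature.
    barometer = [0.3 * t + rng.gauss(0, 1) for t in temperature]
    return pearson(growth, thermometer), pearson(growth, barometer)


r_thermometer, r_barometer = compare_instruments()
print(f"growth vs thermometer: r = {r_thermometer:.2f}")
print(f"growth vs barometer:   r = {r_barometer:.2f}")
```

Under these assumed values, the same underlying effect of temperature on growth appears weaker when measured with the barometer, which is the sense in which a mismatch between instrument and purpose costs power.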


4.2 Aggregation

Definition: Summing or averaging of measurements.

Description. Aggregation often improves the reliability and validity of measurements because random measurement errors cancel or balance each other. Even systematic errors, if they are of sufficiently different types, may offset each other. In most instances, then, an aggregated score should reflect the construct of interest better than any one item.
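The cancellation of random errors can be made explicit under the classical test theory model. That model is an assumption added here for illustration: each of k item scores X_i is taken to be the sum of a true score T and an error E_i, where the errors have mean zero, share a common variance, and are uncorrelated with each other and with T.

```latex
\bar{X} = \frac{1}{k}\sum_{i=1}^{k} X_i = T + \frac{1}{k}\sum_{i=1}^{k} E_i,
\qquad
\mathrm{Var}\!\left(\frac{1}{k}\sum_{i=1}^{k} E_i\right) = \frac{\sigma_E^2}{k},
\qquad
\rho_{\bar{X}} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2 / k}
```

Under these assumptions the error variance of the average shrinks in proportion to 1/k, so the reliability of the aggregate rises toward 1 as items are added. An error component shared by every item would not shrink in this way, which is why systematic errors offset each other only when they differ in type.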

One problem with aggregation, however, occurs when you sum incompatible sources. For example, you may be interested in studying parents’ ratings of their children’s behavior. It may be that mothers, compared to fathers, have more experience with their children and thus can provide more valid data. Adding the fathers’ data to the mothers’ may then introduce a source of error rather than valid data.

Epstein (1979, 1980) provided examples of the benefits of aggregation. Epstein asked 45 undergraduates to keep daily records, for 14 consecutive days, of such behaviors as number of social phone calls made, social contacts, headaches, hours of sleep, and similar constructs. Epstein found that the average correlation between data for any 1 day and data provided for the 13 other days was quite low (e.g., .09 for hours slept). That is, little relationship existed between behavior on any 1 day and behavior exhibited on the other 13 days. To demonstrate the effects of aggregation, Epstein summed scores separately for the even and odd days and correlated the two sums. For every behavior measured, the aggregated correlations exceeded the 1-day correlations. For example, the correlation between even and odd days for hours of sleep was .84.
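The pattern Epstein reported can be reproduced in a simulation. The sketch below does not use Epstein's data: the function name simulate_diary, the stable person-level tendency, and the size of the day-to-day noise are all assumptions chosen for illustration. It compares the average correlation between pairs of single days with the correlation between odd-day and even-day sums.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_diary(n_people=45, n_days=14, trait_sd=1.0, noise_sd=3.0, seed=0):
    """Return (mean single-day r, odd-even aggregated r) for simulated diaries."""
    rng = random.Random(seed)
    # Assumed: each person has a stable tendency (e.g., typical hours slept).
    traits = [rng.gauss(0, trait_sd) for _ in range(n_people)]
    # Assumed: each daily record is that tendency plus large random noise.
    diary = [[t + rng.gauss(0, noise_sd) for _ in range(n_days)] for t in traits]

    # Average correlation across all pairs of single days.
    pair_rs = []
    for i in range(n_days):
        day_i = [person[i] for person in diary]
        for j in range(i + 1, n_days):
            day_j = [person[j] for person in diary]
            pair_rs.append(pearson(day_i, day_j))
    single_day_r = sum(pair_rs) / len(pair_rs)

    # Aggregate: sum days 1, 3, 5, ... and days 2, 4, 6, ... for each person.
    odd_sum = [sum(person[0::2]) for person in diary]
    even_sum = [sum(person[1::2]) for person in diary]
    return single_day_r, pearson(odd_sum, even_sum)


single_day_r, aggregated_r = simulate_diary()
print(f"mean single-day correlation: {single_day_r:.2f}")
print(f"odd-even aggregated correlation: {aggregated_r:.2f}")
```

With only 45 simulated people the two values vary noticeably from one seed to the next, so a larger n_people gives a steadier picture of the gap between single-day and aggregated correlations.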

4.3 Precision

Definition: The ability to detect small differences in a phenomenon; the ability of a test to produce data closely reflecting the natural ordering and range of a phenomenon.

Description. A number of terms and definitions similar to precision have been offered. Boyce et al. (1994) defined (a) resolution as the finest interval of an instrument’s measurement scale that can be distinguished by an observer (e.g., degrees on a thermometer), (b) accuracy as comparing values from a measurement process with measurements from other processes (e.g., comparing newly made thermometers with one known to be valid), and (c) calibration as checking a new instrument against a known standard (i.e., the process of making a particular instrument accurate).

The validity of a test depends upon naming its construct well and demonstrating adequate precision for its intended purpose. The naming aspect refers to the extent to which the test developer and test user understand the multiple constructs that influence test scores. For example, a mathematics test should reflect addition and subtraction abilities rather than reading ability. For many purposes (e.g., grading), test developers want scores on that mathematics test to reflect the full range of ability levels rather than a simple high or low classification.
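The cost of collapsing scores into a simple high or low classification can be sketched in a simulation. Everything in the example is hypothetical: the function name compare_precision, the error levels, and the use of a median split are assumptions chosen for illustration. It correlates an outside criterion with the full-range test score and with the same score reduced to two categories.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def compare_precision(n_people=5000, seed=0):
    """Return (r with full-range score, r with high/low classification)."""
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n_people)]
    # Assumed: test score and criterion both reflect ability plus random error.
    score = [a + rng.gauss(0, 0.5) for a in ability]
    criterion = [a + rng.gauss(0, 0.5) for a in ability]
    # Reduce the score to two categories by splitting at the median.
    cutoff = sorted(score)[n_people // 2]
    high_low = [1.0 if s >= cutoff else 0.0 for s in score]
    return pearson(score, criterion), pearson(high_low, criterion)


r_full, r_coarse = compare_precision()
print(f"full-range score vs criterion: r = {r_full:.2f}")
print(f"high/low split vs criterion:   r = {r_coarse:.2f}")
```

The two-category version discards the ordering of test-takers within each half, so under these assumptions it relates less strongly to the criterion than the full-range score does.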
