2. Traditional Psychometric Properties of Tests

           When test developers discuss the evaluation of tests and other measurement and assessment devices, they use the term psychometric properties: the criteria employed to judge the usefulness of tests for a particular purpose. Reliability and validity are typically considered the two most important psychometric properties of tests. Although these criteria can be applied to any test, their historical importance stems from the field's emphasis on selection tests.

          2.1 Reliability

           Definition: The consistency of measurement.

           Description. The usual definition of reliability refers to a measurement's ability to produce consistent scores. Reliability estimates are quantitative estimates of a measure's reliability with a particular sample of individuals who complete the test. One might check the reliability of a measure of a trait by administering it to the same group of individuals one week apart and then correlating the two sets of scores. If the correlation is high (e.g., above .80), the measure has a good test-retest reliability estimate (cf. Meier & Davis, 1990). Depending upon the theoretical basis of the test, a low estimate (e.g., below .70) presents a problem for subsequent interpretation of the meaning of these scores.

You can evaluate a measurement method's reliability in several ways. As summarized in Table 1, you could calculate split-half reliability (the extent to which two halves of the same test correlate), internal consistency (the average correlation between any single item and the sum of all items), alternate-form reliability (the correlation between two forms of the same test), test-retest reliability (the correlation between two administrations of the same test given to the same persons), or interrater reliability (the correlation between two raters who observe the same phenomenon). Coefficient alpha, a measure of internal consistency, currently is the most frequently used method for quantitative data because it requires only a single administration of the measurement method and is easily computed with statistical programs (see the sketch following Table 1).

Table 1

Types of Reliability and Their Advantages

Type Advantage
Internal Consistency (alpha) Requires only a single administration of a test; easily computed via data analysis programs
Test-Retest Provides evidence of stability over time, a major issue with trait-based tests
Alternate Form Once established empirically to be equivalent, provides two forms that can be employed at different intervals with minimal practice effects
Inter-rater Provides evidence of stability across observers, a major issue with social science constructs
Split-half Requires only a single administration of a test
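
To make Table 1's single-administration estimates concrete, here is a minimal Python sketch that computes coefficient alpha and an odd-even split-half estimate from a persons-by-items score matrix. The function names and the simulated item responses are illustrative assumptions, not part of any published scale; the split-half value is stepped up to full test length with the standard Spearman-Brown correction.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_people x n_items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def split_half(items: np.ndarray) -> float:
    """Odd-even split-half r, stepped up to full length via Spearman-Brown."""
    odd = items[:, 0::2].sum(axis=1)
    even = items[:, 1::2].sum(axis=1)
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)

# Simulated responses: 200 people, 10 items driven by one underlying trait.
rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))
items = trait + rng.normal(size=(200, 10))  # item = trait + unique noise
print(f"alpha = {cronbach_alpha(items):.2f}")
print(f"split-half = {split_half(items):.2f}")
```

Both estimates require only a single administration, the practical advantage Table 1 highlights; test-retest and alternate-form estimates instead correlate scores across separate administrations.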

Your theoretical understanding of the construct and its measurement method–not the ease of calculation–should guide the selection of the reliability analysis. You might consider, for example, to what extent the construct resembles a trait or a state. A trait is a phenomenon assumed to be relatively stable, enduring, and unresponsive to environmental influences. A state is a transitory psychological phenomenon that changes because of situational, developmental, or psychological causes. With test-retest reliability estimates, the interval between testing can be important; typically, the longer the interval, the lower the correlation. If the construct you are measuring has significant state components, then you would expect test-retest reliability to be relatively low. It would make more sense to evaluate tests of states with a measure of internal consistency such as coefficient alpha.
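
The trait-state distinction and the effect of the test-retest interval can be illustrated with a small simulation, offered only as a sketch: each score below combines a stable trait component with a transitory state component that drifts between administrations, so correlations with the first administration fall as the interval grows. All weights and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
trait = rng.normal(size=n)   # stable, enduring component
state = rng.normal(size=n)   # transitory component
scores = []
for week in (1, 8, 15):      # three administrations
    scores.append(0.7 * trait + 0.7 * state + 0.3 * rng.normal(size=n))
    # Between administrations the state drifts; the trait does not.
    state = 0.5 * state + np.sqrt(1 - 0.5**2) * rng.normal(size=n)

r12 = np.corrcoef(scores[0], scores[1])[0, 1]
r13 = np.corrcoef(scores[0], scores[2])[0, 1]
print(f"week 1-week 8  r = {r12:.2f}")
print(f"week 1-week 15 r = {r13:.2f}  # longer interval, lower correlation")
```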

Because reliability depends partially on the sample of persons who complete a test under certain conditions, use the term reliability estimate when referring to the results of a reliability analysis with a set of test scores. For example, scores on Test A, when completed by a group of college students, may result in a coefficient alpha of .95. However, alpha may be considerably reduced when Test A is completed by fifth-grade students who experience difficulty comprehending Test A's items. You should not assume that a test that has previously exhibited high reliability estimates will do so under your research or practice conditions. You should also not assume that a test you have devised will be reliable. Any such homemade tests should be evaluated for reliability and validity, at least during pilot testing.


Meier and Lambert (1991) compared three scales developed to measure individuals' comfort with computer use. They administered the Attitudes Toward Computers scale (ATC), the Computer Aversion Scale (CAVS), and the Computer Anxiety Rating Scale (CARS) to 1,234 college students during weeks 1, 8, and 15 (Time 1, 2, and 3, respectively) of a semester. Table 2 summarizes the reliability results and Figure 5 displays them graphically:

Table 2

Coefficient Alpha and Test-Retest Reliability Estimates for Three Computer Anxiety Measures

Scale Time 1 alpha Time 1-Time 2 correlation Time 1-Time 3 correlation Time 2-Time 3 correlation
Attitudes Toward Computers .92 .50 .39 .51
Computer Aversion Scale .88 .77 .74 .79
Computer Anxiety Rating Scale .87 .51 .47 .50

In general, the longer the interval between test administrations, the smaller the correlation between measures.  For which scale is this most evident in Table 2?

Figure 5

Visual Display of Reliability Values from Table 2

As Figure 5 shows, the scale alphas are all higher than the test-retest estimates. And while the three Time 1 alphas are approximately equal, the CAVS evidences higher test-retest reliability. If you sought a more stable measure of computer comfort, the CAVS would be the clear choice. If you were interested in a measure more likely to be responsive to intervention effects (such as training to increase comfort with computers), however, the ATC and CARS would be preferable.


          2.2 Validity

            Definition: What a test measures, or what inferences can be drawn from test scores.

            Description. Validity is the second major criterion for evaluating tests. Although many different types of validity have been described, little consensus exists in the measurement literature about which validity analyses are most useful. When assessing validity, test developers typically have no objective standard against which to compare test scores. It is for that reason that measurement theorists seldom use the term accuracy (i.e., comparing a score to an objective standard) when discussing tests. Instead, test developers gather evidence from a variety of sources to demonstrate validity. A universal, usually undesired source of variance on all tests is method variance, the portion of a test score attributable to the method of obtaining the data.

Table 3 lists the major types of validity along with their advantages. These criteria include:

  1.  A test has face validity when its item content appears to match the purpose of the test. A cynical synonym is cash validity: the more a test appears to measure what it is supposed to measure, from the perspective of test purchasers, the more cash the test accrues for its publishing company. While professionals sometimes choose tests on the basis of their face validity, test content does not ensure construct validity.
  2.  Content validity refers to whether the content of a test is representative of the universe of relevant content. A test may or may not tap into all of a construct’s important domains or characteristics.
  3.  Criterion validity refers to the correlation of test scores with relevant criteria. Similarly, predictive validity refers to the degree to which a test can predict future performance on a criterion. Concurrent validation occurs by correlating a test and a criterion administered at the same time point.
  4.  Incremental validity refers to a test's ability to improve prediction beyond that provided by existing information. For example, if undergraduate GPA correlates .3 with graduate school grades, can scores on a test like the Graduate Record Exam (GRE) improve the prediction of graduate school performance above .3? (A worked sketch of this comparison follows this list.)
  5.  Tests of convergent validity (i.e., high correlation between two similar tests) and discriminant validity (i.e., low correlation between two tests of related, but dissimilar constructs) are conducted to assess construct validity, whether a test measures the construct it is intended to measure. All types of validity evidence ultimately relate to the construct validity of a test. For example, predictive validity depends upon construct validity because it is the phenomenon that both test and criterion measure that determines the relation between the two.
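
To make the incremental validity example in item 4 concrete, the sketch below simulates hypothetical data in which undergraduate GPA correlates about .3 with graduate grades, then compares the squared multiple correlation with and without a second predictor. The variable names, weights, and sample size are invented; only the logic, whether R² rises when the new test is added, reflects the incremental-validity question.

```python
import numpy as np

def r_squared(predictors: np.ndarray, criterion: np.ndarray) -> float:
    """R^2 from an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(criterion)), predictors])
    beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
    residuals = criterion - X @ beta
    return 1 - residuals.var() / criterion.var()

rng = np.random.default_rng(2)
n = 1000
gpa = rng.normal(size=n)
gre = rng.normal(size=n)
# Hypothetical criterion built so GPA alone correlates about .3 with grades.
grades = 0.3 * gpa + 0.3 * gre + rng.normal(scale=0.9, size=n)

r2_old = r_squared(gpa[:, None], grades)
r2_new = r_squared(np.column_stack([gpa, gre]), grades)
print(f"R^2, GPA alone: {r2_old:.3f}")
print(f"R^2, GPA + GRE: {r2_new:.3f}  (increment = {r2_new - r2_old:.3f})")
```

In practice this comparison is typically run as a hierarchical regression with a significance test on the R² increment; the sketch shows only the bookkeeping.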

Table 3

Types of Validity and Their Advantages

Type Advantage
Face The content of tests with high face validity makes sense to test-takers, usually increasing their motivation to cooperate
Content High content validity ensures practical relevance because the test samples more domains relevant to the measured construct
Criterion / predictive / concurrent Examines prediction of behaviors/events of interest, at different time points
Incremental Enhances prediction of important behaviors/events beyond what existing measures provide
Construct Assists in understanding of a construct and the relations among constructs of interest; for example, do convergent correlations exceed discriminant correlations?
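
The last row of Table 3 asks whether convergent correlations exceed discriminant correlations. A minimal sketch of that comparison, using simulated scores rather than real scales (all names and weights below are hypothetical), might look like this:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
anxiety = rng.normal(size=n)  # target construct
# A related but distinct construct, correlated .4 with the target.
related = 0.4 * anxiety + np.sqrt(1 - 0.4**2) * rng.normal(size=n)

test_a = anxiety + 0.6 * rng.normal(size=n)  # two tests of the same construct
test_b = anxiety + 0.6 * rng.normal(size=n)
test_c = related + 0.6 * rng.normal(size=n)  # test of the related construct

convergent = np.corrcoef(test_a, test_b)[0, 1]
discriminant = np.corrcoef(test_a, test_c)[0, 1]
print(f"convergent r   = {convergent:.2f}")   # expected to be high
print(f"discriminant r = {discriminant:.2f}") # expected to be lower
```

With real data this comparison is elaborated in designs such as the multitrait-multimethod matrix, which can also help separate method variance (noted above) from construct variance.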

 


In the initial decades of educational and psychological measurement, predictive or criterion validity was viewed as the most important type of validity. That is, test administrators gave tests for the purpose of selecting individuals into and out of settings such as schools and jobs. Ceci (1991, p. 703) summarized the predictive validity of the first major type of test, the IQ test:

Although it takes little more than 90 min to administer, an IQ test is alleged to capture much of what is important and stable about an individual’s academic, social, and occupational behavior. In addition to their well-documented prediction of school grades (r = .55, on average; Anastasi, 1968; Matarazzo, 1970), IQ scores have been reported to have impressive validity coefficients for predicting everything from mental health and criminality to marital dissolution rates and job performance (Gordon, 1976, 1980, 1987; Gottfredson, 1986; Hunter, 1983, 1986). For example, IQ scores have been shown to predict postal workers’ speed and accuracy of sorting mail by zip code, military recruits’ ability to steer a Bradley tank through an obstacle course, mechanics’ ability to repair engines, and many other real-world endeavors (see Hunter & Schmidt, 1982; Hunter, Schmidt, & Rauschenberg, 1984). Moreover, IQ has been touted as a better predictor of such accomplishments than any other measure that has been studied thus far.

More recent work associates intelligence with such seemingly disparate constructs as associative matching and health indicators. An important paradox of intelligence, however, remains: although cognitive ability tests have good predictive validity, their construct validity remains in question. That is, little consensus exists about what such tests actually measure, a source of continuing controversy.


While these descriptions of validity have been accepted as central to test development and evaluation, an expanded perspective is needed for tests whose purpose is to measure aspects of constructs that change. Measuring change is important for educational and psychological domains that focus on development and the effects of interventions. A test intended to evaluate the effectiveness of an educational or psychosocial intervention, for example, should contain items and scales that demonstrate relative stability in the absence of an intervention and sensitivity to change over time when an intervention is implemented. Test users can employ a test for purposes other than that for which it was designed, but test scores are likely to be less sensitive for detecting the construct of interest. A measure of trait anxiety, such as the Trait form of the State-Trait Anxiety Inventory, may evidence change during a psychosocial intervention aimed at decreasing test anxiety in students, but the observed effect size is likely to be less than that detected by the state form of that inventory.
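
As a closing sketch of the stability-versus-sensitivity distinction, the simulation below contrasts a state-like and a trait-like measure before and after a hypothetical intervention, using the standardized mean change (mean pre-to-post change divided by the pretest standard deviation) as the effect-size index. Every number is invented for illustration; under these assumptions the state form shows the larger standardized change, the pattern predicted above for the state and trait forms of an anxiety inventory.

```python
import numpy as np

def standardized_change(pre: np.ndarray, post: np.ndarray) -> float:
    """Mean pre-to-post change divided by the pretest SD (negative = decline)."""
    return (post - pre).mean() / pre.std(ddof=1)

rng = np.random.default_rng(4)
n = 100
trait_pre = rng.normal(50, 10, size=n)                 # stable trait-form scores
state_pre = trait_pre + rng.normal(0, 5, size=n)       # state form adds situational variance
state_post = state_pre - 6 + rng.normal(0, 5, size=n)  # intervention shifts state scores
trait_post = trait_pre - 2 + rng.normal(0, 3, size=n)  # trait form moves far less

print(f"state-form effect size: {standardized_change(state_pre, state_post):.2f}")
print(f"trait-form effect size: {standardized_change(trait_pre, trait_post):.2f}")
```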