The terms described in this chapter are relevant to all educational and psychological tests. This list starts with the most basic components:
1.1. Construct
Definition: Abstract summary of natural regularities indicated by observable events.
A construct refers to the basic element, the atom, that is the focus of any test. A construct is a social construction, a name we assign to a phenomenon of interest. Typically, a construct does not have a single physical manifestation; humans do not express intelligence, for example, by any single act or characteristic. If you are skeptical about the idea of a construct, ask yourself: Does love exist? Love is an example of a construct most people have experienced, but we cannot point to one thought, feeling, behavior, or event as indicative of “love” for every individual.
Construct explication is the process by which a construct is connected to observable events (Torgerson, 1958). Construct explication is important because most social science constructs cannot be sufficiently defined through a single operation; anxiety, for example, can be explicated through self-reported worry, observable avoidance behaviors, and physiological arousal. Many test users believe their choice of method for measuring a construct is unimportant, that is, that the same data will be produced regardless of the operation and method chosen. To explicate constructs, however, test developers should understand the theoretical domains of interest as well as the methodological choices available for measurement. When constructing a test, decisions must be made regarding who will be measured (sampling), how the data will be observed (test characteristics), and how the data will be employed (test purpose). Unfortunately, test developers often default to convenience sampling and self-report data collection. These choices can limit the power of the resulting test, that is, its ability to detect effects of interest.
1.2. Measurement
Definition: The process of assigning numbers or categories to phenomena according to agreed-upon rules.
Description. Measurement is a more specific term than test and begins to move us toward a discussion of what constitutes a better or worse test. Krantz et al. (1971) defined measurement as assigning numbers to objects “in such a way the properties of the attributes are faithfully represented as numerical properties” (p. 1). In other words, data that result from the measurement process should reflect the characteristics present in the phenomenon we are interested in measuring.
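To make the idea of rule-governed number assignment concrete, the sketch below is my own illustration, not drawn from the chapter; the items, response options, and scoring values are invented. It assigns numbers to responses according to an explicit rule so that higher totals represent more of the attribute.

```python
# Hypothetical example: assigning numbers to observations according to an agreed-upon rule.
# The response options and scoring values below are invented for illustration.
SCORING_RULE = {"never": 0, "sometimes": 1, "often": 2, "always": 3}

def score_responses(responses):
    """Apply the scoring rule so that larger totals represent more of the attribute."""
    return sum(SCORING_RULE[r] for r in responses)

# One respondent's answers to three hypothetical mood items
print(score_responses(["sometimes", "often", "never"]))  # 3
```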
One of the complications involved with assigning numbers is the scale type (Stevens, 1951). Nominal scales contain qualitative categories (e.g., red, blue, or green) but convey no information about differences in amount. Ordinal scales describe objects that differ from each other by some amount and that may be ranked in terms of that amount. Interval scales describe objects whose differences are marked by equal intervals, while ratio scales are interval scales that also possess a true zero point. Ordinal scales provide more precise information than nominal scales, interval more than ordinal, and ratio more than interval.
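The short sketch below is my own illustration of Stevens’s scale types; the variables and values are invented. Each comment notes the kind of comparison that scale supports.

```python
# Invented variables illustrating the four scale types and the comparisons each supports.
eye_color = ["blue", "brown", "green"]    # nominal: categories only; equality is meaningful
finish_order = [1, 2, 3]                  # ordinal: ranking is meaningful, differences are not
temperature_f = [32.0, 64.0]              # interval: equal intervals, but no true zero point
reaction_time_ms = [250.0, 500.0]         # ratio: true zero, so ratios are meaningful

print(eye_color[0] == eye_color[1])               # nominal: same or different
print(finish_order[0] < finish_order[1])          # ordinal: more or less
print(temperature_f[1] - temperature_f[0])        # interval: difference in degrees
print(reaction_time_ms[1] / reaction_time_ms[0])  # ratio: 500 ms is twice 250 ms
```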
1.3. Tests
Definition: Tools or systematic procedures for observing some aspect of human behavior and describing it with a numerical scale or category system.
Description. Tests produce a description of characteristics of individuals or groups. Historically, most tests have been developed for the purpose of selection, that is, to classify or make decisions about large groups of individuals. Tests such as the SAT or GRE, for example, are intended to help administrators decide whom to admit to school. Many intelligence tests, similarly, were intended to measure g, a general factor of intelligence presumed to influence a person’s performance on a wide variety of tasks.
Throughout the remainder of the book I employ the term test to refer to any type of measurement or assessment method. Similarly, researchers sometimes use the term operation to refer to any single method of gathering data. Synonyms for tests include scales, inventories, questionnaires, checklists, and rating scales.
1.4. Assessment
Definition: Human judge’s conclusion resulting from a combination of data from tests, interviews, observation, and other sources.
Description. Assessment is a broader term than measurement and includes any measurement method that involves human judgment. A reading specialist, for example, might make an assessment of a child’s reading problems based on data from test scores, observation of the child during class time as well as during test-taking, and interviews with the child, parents, and teachers. In a clinical context, assessment information can include a history of the presenting problems, family background and social supports, current medical problems, education, current employment, and financial resources. Assessments can also be more specialized, focused on (a) particular samples, such as children, adolescents, or adults, and (b) particular domains, including education, psychology, and mental health.
Ideally, test developers design assessments to control unwanted factors in the information-gathering process. To cope with potential error (i.e., unwanted influences on test scores), assessors may compare data from multiple settings (e.g., home and school), sources (e.g., parents and teachers), and methods (e.g., observations, tests, and interviews). Assessors may also pay attention to test-taker characteristics that influence scores, including motivation to respond accurately, age, race/ethnicity, ability to read, socioeconomic status, cultural and family values, and motor ability.
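As a rough illustration of comparing data across sources, the sketch below uses hypothetical parent and teacher ratings and an arbitrary discrepancy cutoff; large cross-informant gaps can signal unwanted source or method influences worth investigating.

```python
# Hypothetical parent and teacher ratings (1-5) of the same child across three domains.
ratings = {
    "parent":  {"attention": 4, "reading": 2, "anxiety": 3},
    "teacher": {"attention": 2, "reading": 2, "anxiety": 4},
}

for domain in ratings["parent"]:
    gap = abs(ratings["parent"][domain] - ratings["teacher"][domain])
    if gap >= 2:  # an arbitrary cutoff: large cross-informant gaps merit follow-up
        print(f"{domain}: parent={ratings['parent'][domain]}, "
              f"teacher={ratings['teacher'][domain]} -> investigate further")
```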
1.5. Method Effects
Definition: The contribution of the measurement method, as distinct from the construct itself, to the test score.
Description. Method effects, also known as method variance, arise because scores on every quantitative variable, index, and measure reflect one or more aspects of the methodology employed to collect the data. Test developers have long recognized that methodology can substantially influence scores independently of theoretically identified effects; in fact, test scores may reflect the effects of method as much as an expected theoretical effect. Although method effects are always present in testing, researchers historically have approached them as bias or error because such effects represent alternative explanations that interfere with a study’s capacity to test a hypothesis (Cook & Campbell, 1979).
Incomplete or limited construct explication is a major source of method effects. As shown in Figure 4, method effects are likely to be associated with lower reliability and validity estimates, resulting in less power to detect effects. That is, to the extent that test scores reflect aspects of method instead of an intended theoretical construct, those scores provide less power to detect effects of interest. Rather than viewing method effects simply as indicators of flaws in the research process, test developers can treat them as clues about where construct explication should be examined to increase the power of measurement and design procedures.
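A small simulation can make the link between method variance and power concrete. The sketch below is my own illustration; the sample size, correlations, and noise levels are arbitrary assumptions. When a test score mixes construct variance with method variance, its observed correlation with a criterion shrinks, leaving less power to detect the effect.

```python
# Illustrative simulation: method variance attenuates an observed correlation.
# All numeric values are arbitrary assumptions, not estimates from the literature.
import numpy as np

rng = np.random.default_rng(0)
n = 300
true_r = 0.5  # assumed construct-criterion correlation

construct = rng.normal(size=n)
criterion = true_r * construct + np.sqrt(1 - true_r**2) * rng.normal(size=n)

for method_sd in (0.0, 1.0):  # 0 = no method variance, 1 = substantial method variance
    test_score = construct + method_sd * rng.normal(size=n)  # score = construct + method effect
    r = np.corrcoef(test_score, criterion)[0, 1]
    print(f"method sd = {method_sd}: observed correlation with criterion = {r:.2f}")
# More method variance -> smaller observed correlation -> less power to detect the effect.
```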
Figure 4
Relations Between Construct Explication and Method Effects, With Consequences for Power
One indicator of method variance is that two tests of a single construct (such as anxiety) will evidence higher correlations when both employ the same method. Thus, two self-report tests of anxiety will usually correlate more highly with each other than scores on a self-report anxiety measure will correlate with an other-rated (e.g., clinician-rated) measure of anxiety. A related problem is that test users who always employ the same method of measurement tend to lose sight of the fact that associations among their measures are inflated by the common method. This is, in fact, common in both research and practice domains. Reviews of the occupational stress literature, for example, find that the self-report Maslach Burnout Inventory was employed in approximately 80% of studies. In the literature examining the working alliance in psychotherapy, about 70% of studies employed a version of the Working Alliance Inventory.
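The simulation below is my own sketch; the construct, loadings, and sample size are invented. It illustrates why two self-report measures of the same construct correlate more highly with each other than either does with an other-rated measure: the self-report scores share a method factor in addition to the construct.

```python
# Illustrative simulation of inflated mono-method correlations; all values are invented.
import numpy as np

rng = np.random.default_rng(1)
n = 1000

anxiety = rng.normal(size=n)             # the shared construct
self_report_method = rng.normal(size=n)  # method factor common to the self-report measures

self_report_a = anxiety + 0.7 * self_report_method + rng.normal(scale=0.5, size=n)
self_report_b = anxiety + 0.7 * self_report_method + rng.normal(scale=0.5, size=n)
clinician_rating = anxiety + rng.normal(scale=0.5, size=n)

print(np.corrcoef(self_report_a, self_report_b)[0, 1])     # inflated by the shared method
print(np.corrcoef(self_report_a, clinician_rating)[0, 1])   # closer to the construct-level relation
```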