Practice effects refer to gains in scores on cognitive tests that occur when a person is retested on the same instrument, or tested more than once on very similar ones. These gains are due to the experience of having taken the test previously; they occur without the examinee being given specific or general feedback on test items, and they do not reflect growth or other improvement on the skills being assessed. Such practice effects denote an aspect of the test itself, a kind of systematic, built-in error that is associated with the specific skills the test measures. These effects relate to the test’s psychometric properties, and must therefore be understood well by the test user as a specific aspect of the test’s reliability. Retesting occurs fairly commonly in real circumstances for reasons such as mandatory school reevaluations, longitudinal research investigations, unwitting or deliberate duplication by different professionals who are evaluating the same individual, a parent’s or teacher’s insistence that a child be retested because the test scores imply that the child was not trying, and so forth. A keen understanding of differential practice effects facilitates competent interpretation of test score profiles in those instances in which people are retested on the same or a similar instrument, perhaps several times.
No specific length of time between tests is required to study practice effects; it depends on the generalization sought or needed. If the interval is very short—for example, a few hours, or a couple of days—then examinees are likely to remember many specific items that were administered. They are likely to retain specific picture puzzles, arithmetic problems, or block designs, and recall the strategies that proved most successful; the result is an inflated estimate of the practice effect—that is, relative to an inference about established (learned) effects. In contrast, intervals that are long, perhaps six months or a year or two, are confounded by variables other than the test’s psychometric properties and practice as such. Long intervals allow forgetting of the test’s content, and therefore reduce the magnitude of the practice effects; at the same time, in lengthy intervals there can be real growth or decline of the abilities measured. When change has occurred, it becomes difficult to separate the test’s practice effects, as such, from the person’s improvement or decay on the skills. For preschool children, who experience rapid development, even three or four months may be too long an interval for studying a test’s practice effects.
Table 1. Practice effects on Wechsler's Verbal (V), Performance (P), and Full Scale (FS) IQs for different age groups. [The table's data columns, including gain on Full Scale IQ, are not reproduced in this copy; the averages are summarized in the text below.]
The most commonly useful intervals for investigating a test’s practice effects are between one week and about two months, with one month or so representing a reasonable midpoint. Intervals of that approximate magnitude are typical of the test-retest reliability investigations reported in the test manuals of popular individually administered intelligence and achievement tests. Table 1 provides data on the practice effects for Wechsler’s popular series of intelligence scales. The studies from which the table figures were obtained were based on samples of normal individuals who were retested during the standardization programs of each scale. The data are taken from the test manuals of the 1967 Wechsler Preschool and Primary Scale of Intelligence (WPPSI) for ages 4 to 6.5 years and its 1989 revision (WPPSI-R) for ages 3 to 7 years; the 1974 Wechsler Intelligence Scale for Children-Revised (WISC-R) for ages 6 to 16 years and its 1991 revision (WISC-III), covering the same age range; and the Wechsler Adult Intelligence Scale-Revised (WAIS-R) for ages 16 to 74 years. Intervals averaged about one month, except for the 11-week interval used for the WPPSI; all studies were well designed.
Practice effects are shown in Table 1 for Wechsler’s Verbal (V) IQ, Performance (P) IQ, and Full Scale (FS) IQ. The verbal subtests that yield the V-IQ include factual, language-oriented items that require good verbal comprehension and expression for success; most items are reminiscent of the kinds of questions asked in school. In contrast, the performance subtests that contribute to the P-IQ require visual-perceptual-spatial skills and manipulation of concrete materials for success, and measure a person’s visual-motor coordination and nonverbal reasoning abilities. These tasks are not similar to school-related tests and activities. FS-IQ reflects a combination of the V and P scales; all three IQs are normed to have a mean equal to 100 and standard deviation equal to 15. Sample sizes for the ten groups in the table ranged from 48 to 175, with an overall total of exactly 1,000 individuals.
Practice effects on the FS-IQ averaged about 7 points across instruments and age groups, although an age trend was evident. Increases on the full scale from the first to second testing averaged about 4.5 points for preschool children, 7.5 points for elementary through high school students, and 6 points for adults. Regardless of the age of the sample, practice effects were considerably larger for P-IQ (9 points) than for V-IQ (3 points). The number of points gained on the V-IQ was a fairly constant 3 points for all age groups, but gains on P-IQ averaged 6.5 points for preschoolers, 11 points for elementary and high school students, and 8.5 points for adults.
These results for the Wechsler scales have generally been replicated for other intelligence tests. The overall gains on global IQ (about 7 points) are of the same approximate magnitude as: (a) the 5- to 6-point gains on the KAUFMAN ASSESSMENT BATTERY FOR CHILDREN (K-ABC), MCCARTHY SCALES OF CHILDREN’S ABILITIES, DIFFERENTIAL ABILITY SCALES (DAS), and Kaufman Adolescent and Adult Intelligence Test (KAIT), and (b) the 7- to 7.5-point gains on the Stanford-Binet (Fourth Edition).
As can be seen in Table 1, the gains are substantially larger for Wechsler’s P-IQ than for the V-IQ. This finding is seen also on similar scales of other tests, although the differences are not as extreme. In the K-ABC, the Simultaneous Processing Scale resembles the P-Scale, and the Achievement Scale is similar to the V-Scale. Gains on simultaneous processing averaged about 6.5 points, compared with about 2.5-point gains on achievement. In the Binet, the Abstract/Visual Reasoning Scale is similar to P, and the Verbal Reasoning Scale is similar to V. Abstract/visual gains averaged 7.5 to 8 points, whereas verbal gains were 5 points. Gains on the DAS Special Nonverbal Scale (similar to P) averaged 7 points, compared with 4-point verbal ability (similar to V) gains for school-age children; at the preschool level, practice effects were 4 points on nonverbal ability and 1 to 2 points on verbal ability. In the KAIT, gains on measures of Fluid IQ (similar to P) were generally higher (7 points) than gains on Crystallized IQ (4.5 points), which is similar to V.
A number of factors seem to contribute to the practice effects that have been noted: familiarity with the kinds of tasks that compose an intelligence test, experience solving these tasks, and the development of effective strategies for solving different kinds of problems. Although an occasional specific item may be remembered (e.g., a puzzle of a horse or a car on the WISC-III), the gains in test scores are not due simply to recall of specific facts. Verbal tasks produce the smallest gains because children and adults have had much experience prior to the testing session in answering general information questions, solving arithmetic problems, or defining words. There is still a small practice effect because even school-like verbal tasks have some unique aspects to them, but the pattern of gains on Wechsler’s verbal tasks supports an “experience” hypothesis, that experience with erstwhile novel tests produces improvement. On the WISC-III, for example, gains are smallest across the age range, indeed, almost nonexistent, on tests of defining words, solving arithmetic problems, and answering “why” questions (e.g., “Why do cars have seatbelts?”); they are largest (nearly one-third SD) on those verbal tasks that are least like school tests (e.g., telling how two things are alike, repeating digits backward), tests that initially are novel. Very similar results occurred for the WPPSI-R and the WAIS-R.
The magnitude of gains on tests of verbal intelligence, incidentally, is commensurate with the practice effects observed for conventional tests of academic achievement. On the Kaufman Test of Educational Achievement (K-TEA) Brief and Comprehensive Forms, for example, gain scores averaged 3.3 points for mathematics, 2.3 points for reading, and 2.4 points for spelling over a one-week interval.
Experience also helps explain the finding of larger P-IQ than V-IQ practice effects. The P tasks tend to be novel tasks not tried before. As they are administered, they become less novel. Each time they are given, if the interval is not too long, individuals will recall trying to solve the same kinds of problems, and they may recall, too, the strategies that worked best the first time. And even if one is not able to solve many more items correctly on the retest, one is likely to respond more quickly to the items the second time around. On Wechsler’s P-Scale, quicker response times translate to higher scores, because several subtests allot bonus points for quick, perfect performance. Indeed, the increase in speed may largely account for the practice effect.
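The speed mechanism described above can be made concrete with a small sketch. The scoring function and time thresholds below are invented for illustration; they are not the actual scoring rules of any Wechsler subtest, but they mimic the general pattern of awarding bonus points for quick, perfect solutions:

```python
# Hypothetical illustration of why faster retest performance raises
# P-Scale scores: several performance subtests award bonus points for
# quick, perfect solutions. The thresholds below are invented.

def item_score(correct, seconds, base_points=4,
               bonus_thresholds=(20, 35, 50)):  # seconds -> +3, +2, +1 bonus
    """Score one timed item: base points if correct, plus a speed bonus."""
    if not correct:
        return 0
    for bonus, limit in zip((3, 2, 1), bonus_thresholds):
        if seconds <= limit:
            return base_points + bonus
    return base_points

# Same item solved correctly both times, but 15 seconds faster on retest:
first = item_score(True, 40)   # within 50s -> 4 + 1 = 5 points
retest = item_score(True, 25)  # within 35s -> 4 + 2 = 6 points
print(retest - first)          # -> 1 extra point from speed alone
```

Summed across many items and subtests, small speed-driven increments of this kind can raise the P-IQ even when no additional items are solved correctly on the retest.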
The generally heavy emphasis on visual-motor speed that characterizes P-IQ may also explain the age and test differences seen in Table 1. The largest P-IQ gains were on the WISC-R and WISC-III, followed by the WAIS-R and the preschool scales. Not surprisingly, the WISC-R and the WISC-III allot by far the most speed bonus points in the P-Scales. The WAIS-R is next, followed by the WPPSI and WPPSI-R, which place the least emphasis on motor speed. The WAIS-R, for example, does not give bonus points for any items on the Picture Arrangement subtest, whereas the WISC-III allots three bonus points for most items. On the WISC-III, Picture Arrangement (putting pictures in the right order to tell a story) had the largest practice effect of any subtest, a gain of about one standard deviation from test to retest.
The motor speed hypothesis may partially explain why the nonverbal versus verbal distinction was not as pronounced on other intelligence tests as it was on the Wechsler scales. Gains on nonverbal or fluid intelligence scales averaged about 7 to 8 points on the K-ABC, Stanford-Binet IV, DAS, and KAIT, in contrast to gains of about 2.5 to 5 points on these tests’ verbal/crystallized scales. The K-ABC, Binet IV, DAS, and KAIT nonverbal/fluid subtests place more emphasis on correct problem solving and less emphasis on motor speed than do Wechsler’s P subtests; the outcome may be less exaggerated practice effects for the nonverbal and novel tasks on these “other” tests. Research, however, has not pinpointed the precise explanations for different practice effects. Much of this discussion is therefore speculative.
Catron and Thompson (1979) investigated the role of the test interval on the size of practice effects by retesting five different samples of college students on the WAIS over five intervals: no interval (immediate retest), 1 week, 1 month, 2 months, and 4 months. Gains on V-IQ were 3 to 5 points for the immediate retest and 1-week retest, 2 points after 1 to 2 months, and 1 point after 4 months. Gains on P-IQ averaged 14 points for the immediate retest, and decreased steadily from 11 points after 1 week to 8 points after 4 months. Thus, after a 4-month interval, P-IQ was still elevated 3 points, but V-IQ was elevated only 1 point (i.e., there was virtually no gain).
In a review of 11 test-retest studies of the WAIS, Matarazzo and his colleagues found that, regardless of large differences in samples and some long intervals of time, the results were consistent in indicating about 2 IQ points of gain on the V-Scale and 7 to 8 points on the P-Scale. The intervals ranged from 1 week to 13 years; mean ages ranged from 19 to 70 years; and the samples included groups as diverse as brain-damaged elderly, mentally retarded, chronic epileptics, and college students.
In addition to novelty, motor speed, and interval, at least two other variables seem to relate to different practice effects for different tests: the nature of the task, and subtest reliability. When tests of verbal and visual memory are used in test-retest studies, for example, the pattern of different practice effects observed for cognitive problem-solving tasks no longer holds; in fact, the opposite pattern may emerge. The Wechsler Memory Scale-Revised (WMS-R) includes measures of verbal memory (retelling stories that are read aloud by the examiner, learning eight verbal word pairs) and visual memory (recognizing and recalling abstract designs that are exposed briefly, learning six pairs of visual stimuli). Results indicate that gains on the Verbal Memory Scale for three age groups average about 13 points, in contrast to an 8-point gain for the Visual Memory Scale. The visual memory practice effect was commensurate with the P-IQ gain on the WAIS-R, but the verbal memory gain was much larger than V-IQ gains. With verbal memory subtests, adults probably remember specific facts, story lines, and word associations, which greatly facilitate recall when these adults are retested more than a month later. On the KAIT, the largest practice effect for any of its ten subtests over a one-month interval was for Auditory Delayed Recall, which measures a person’s ability to remember verbal information (mock news stories) presented by cassette about a half-hour earlier.
The reliability of a subtest, particularly test-retest stability (see RELIABILITY), also relates to the size of its practice effect. Wechsler’s P subtests tend to be less reliable than the V subtests. Thus some of the change from one testing to another is unreliable change. Vocabulary typically produces the smallest test-retest gain, and it is usually the most reliable Wechsler subtest. Picture Arrangement and Object Assembly tend to produce large practice effects, and these tasks are consistently among the least reliable Wechsler subtests. Block Design, easily the most reliable Wechsler P subtest on the WISC-III and the WAIS-R, shows the smallest practice effect among P subtests, despite the novel nature of the task and its reliance on bonus points for motor speed. On the KAIT, the least reliable task (Auditory Delayed Recall) has the largest practice effect, and the most reliable (Definitions) has the smallest.
Thus, practice effects do occur, they are different for verbal and nonverbal tasks, and they are of considerable practical importance. Any research study that depends on pre- and posttests should take into account gains due to practice; such gains should not be interpreted as evidence of true growth or change. In the absence of a control group, the average verbal and nonverbal gains known to occur based on routine retesting should be subtracted from any gains demonstrated for experimental groups. Failure to consider such gains or use appropriate control groups has led some researchers to infer, erroneously, gains in IQ following the surgical removal of plaque from the carotid artery in endarterectomy patients; and the inappropriate application of the practice effect data has led to specious conclusions regarding epileptic patients.
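The correction described above amounts to simple arithmetic: subtract the average gain expected from practice alone before interpreting an observed retest gain. A minimal sketch, using the average Wechsler gains reported earlier in this article (the function name and dictionary are hypothetical, and these averages are illustrative, not normative corrections for any specific test or age group):

```python
# Average retest gains attributable to practice alone, as summarized
# in the text (Verbal ~3, Performance ~9, Full Scale ~7 IQ points).
AVERAGE_PRACTICE_GAIN = {"verbal": 3.0, "performance": 9.0, "full_scale": 7.0}

def practice_adjusted_gain(pretest_iq, posttest_iq, scale="full_scale"):
    """Observed retest gain minus the average gain expected from practice."""
    observed_gain = posttest_iq - pretest_iq
    return observed_gain - AVERAGE_PRACTICE_GAIN[scale]

# A 10-point Full Scale gain shrinks to about 3 points once the
# average 7-point practice effect is removed:
print(practice_adjusted_gain(100, 110))  # -> 3.0
```

A proper control group remains preferable to this subtraction, since practice effects vary with the test, the interval, and the sample; the sketch only makes the logic of the adjustment explicit.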
Any longitudinal study of changes in intelligence across the life span should take into account the evidence of practice effects. When the same individuals are tested every year or two on a Wechsler battery, the P-Scale, especially, can yield spuriously high IQs as a result of practice effects. Test a person over and over, and the kinds of “novel” tasks that characterize the P-Scale become as familiar as a test of vocabulary or general information. The V-IQ may continue to provide a reasonable estimate of true score over time, but the repeated use of P-IQ will not detect decrements in fluid or visual-spatial intelligence that accompany aging in adulthood. The repeated use of the same instrument in aging studies contributes to “progressive error” in longitudinal research, and has led to a confounding of data interpretation in several studies—including the well-known Duke longitudinal studies, in which the same adults were tested eleven times on the WAIS in the course of twenty-one years. This type of practice effect also makes it difficult to interpret IQs on tests that are administered every two or three years during the mandatory reevaluations of special education students.
Clinicians should understand the average practice effect gains in intelligence scores for children, adolescents, and adults. The expected increase of about 5 to 8 points in global IQ makes any score obtained on a retest a likely overestimate of the person’s true level of functioning – especially if the retest is given within about six months of the original test, or if the person has been administered a Wechsler scale (any Wechsler scale) several times in the course of a few years. These inflated IQs, if not interpreted as overestimates resulting from the practice effect, may imply cognitive growth when none has occurred; may suggest that the earlier test yielded an invalidly low score when it was indeed valid; may suggest that a bright individual is gifted or that a retarded person is low-average; and so forth. Although the average gain is about 5 to 8 points for various tests, the variability of gain scores around those averages makes it feasible for some individuals to gain as much as 15 IQ points due to practice alone.
And the different practice effects for verbal versus nonverbal tasks can influence the interpretation of profile results. On the WISC-III, for example, the average V–IQ gain is 2.3 points and the average P–IQ gain is 12.3 points, which translates to a net gain of 10 points on P–IQ due to the practice effect. Clinicians commonly use V–P IQ discrepancies as part of a diagnostic process. Other things being equal, V–P discrepancies will shift by an average of 10 points in favor of P–IQ. An initial P > V difference of 12 points will become about 22 points on a retest; a significant V > P difference of 15 points will become a trivial V > P difference of about 5 points. Inappropriate clinical decisions are likely for professionals who do not understand the predictable and substantial practice effects associated with verbal and nonverbal cognitive tests.
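The discrepancy arithmetic described above is easy to verify. A brief sketch (the function is hypothetical; the gain figures of 2.3 and 12.3 points are the WISC-III averages quoted in the text):

```python
# WISC-III average retest gains from the text: V-IQ +2.3, P-IQ +12.3,
# a net shift of 10 points in favor of P-IQ on a retest.
V_GAIN, P_GAIN = 2.3, 12.3

def expected_retest_discrepancy(v_iq, p_iq):
    """Expected V-minus-P discrepancy on a retest, given first-test IQs."""
    return (v_iq + V_GAIN) - (p_iq + P_GAIN)

# An initial V > P difference of 15 points shrinks to about 5 points:
print(round(expected_retest_discrepancy(115, 100)))  # -> 5
# An initial P > V difference of 12 points grows to about 22 points:
print(round(expected_retest_discrepancy(100, 112)))  # -> -22 (P exceeds V by 22)
```

The 10-point shift applies on average; individual examinees will deviate from it, which is one more reason not to treat a retest V–P discrepancy as interchangeable with a first-test discrepancy.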
CATRON, D.W., & THOMPSON, C.C. (1979). Test-retest gains in WAIS scores after four retest intervals. Journal of Clinical Psychology, 35, 352-357. Evaluates practice effects on WAIS with college students, using four different intervals ranging from 1 week to 4 months; see also Catron’s 1978 study of WAIS practice effects on an immediate retest in Psychological Reports, 43, 279-280.
ELLIOTT, C.D. (1990). Differential Ability Scales: Introductory and technical handbook. San Antonio, TX: The Psychological Corporation. Chapter 8 includes studies of practice effects on a diversity of cognitive and achievement tasks for four groups ages 3.5 to 13 years.
KAUFMAN, A.S. (1990). Assessing adolescent and adult intelligence. Boston: Allyn & Bacon. Chapter 4 includes a comprehensive discussion of practice effects on the WAIS and WAIS-R, and their implications for clinical practice and research; chapter 7 relates practice effects to the interpretation of longitudinal data on aging and intelligence; chapter 9 discusses the role of practice effects on the interpretation of Verbal-Performance IQ differences.
KAUFMAN, A.S., & KAUFMAN, N.L. (1983, 1985, 1993). Manuals for the Kaufman Assessment Battery for Children (K-ABC, 1983), Kaufman Test of Educational Achievement (K-TEA, 1985), and Kaufman Adolescent and Adult Intelligence Test (KAIT, 1993). Circle Pines, MN: American Guidance Service. Each test manual includes thorough data on practice effects for the global scores and subtests for each Kaufman battery.
MATARAZZO, J.D., CARMODY, T.P., & JACOBS, L.D. (1980). Test-retest reliability and stability of the WAIS: A literature review with implications for clinical practice. Journal of Clinical Neuropsychology, 2, 89-105. Summarizes and interprets the results of eleven stability studies with the WAIS with a heterogeneous set of samples and widely varied intervals.
MATARAZZO, J.D., & HERMAN, D.O. (1984). Base rate data for the WAIS-R: Test-retest stability and VIQ-PIQ differences. Journal of Clinical Neuropsychology, 6, 351-366. Includes thorough analysis of practice effects on the WAIS-R, and discussion of their clinical and practical implications.
MATARAZZO, R.G., MATARAZZO, J.D., GALLO, A.E., JR., & WIENS, A.N. (1979). IQ and neuropsychological changes following carotid endarterectomy. Journal of Clinical Neuropsychology, 1, 97-116. Reevaluates conclusions about IQ gains following carotid artery surgery based on data on practice effects.
SEIDENBERG, M., O’LEARY, D.S., GIORDANI, B., BERENT, S., & BOLL, T.J. (1981). Test-retest IQ changes of epilepsy patients: Assessing the influence of practice effects. Journal of Clinical Neuropsychology, 3, 237-255. Investigates the relationship between practice effects and functional changes in epileptic patients.
SHATZ, M.W. (1981). WAIS practice effects in clinical neuropsychology. Journal of Clinical Neuropsychology, 3, 171-179. Disputes Matarazzo’s claims that practice effects in normal individuals are applicable to neuropsychological patients.
WECHSLER, D. (1981, 1987, 1989, 1991). Manuals for the Wechsler Adult Intelligence Scale—Revised (WAIS-R, 1981), Wechsler Memory Scale—Revised (WMS-R, 1987), Wechsler Preschool and Primary Scale of Intelligence—Revised (WPPSI-R, 1989), and Wechsler Intelligence Scale for Children—Third Edition (WISC-III, 1991). San Antonio, TX: The Psychological Corporation. Each test manual includes thorough data on practice effects for the global scores and subtests for the most recent, updated version of each Wechsler battery.
From ENCYCLOPEDIA OF HUMAN INTELLIGENCE (2 volumes),* by Alan S. Kaufman. © 1994, Gale Group. Reprinted by permission of The Gale Group.
* The article originally appeared as a reference entry in the Encyclopedia of Human Intelligence (Vol. 2, pp. 828-833), edited by Robert J. Sternberg. New York: Macmillan Publishing Company.