Our Blogs

Share in practical tips and insights, inside information, stories and recollections, and expert advice.


Reporting Test or Subtest Data . . . What’s the Difference?

Clinical Café by Tina Eichstadt, M.S., CCC-SLP

“I only gave three of the subtests on Bobby’s language test because I’m really short on time these days. I got what I needed for information so I stopped testing. I also like to pick and choose the subtests that I prefer.” How many of us have said something similar to these statements during our careers? Maybe it was years ago, maybe it was yesterday. No matter. After a behind-the-scenes look at tests vs. subtests in this article, we will all be able to collectively exclaim, “Eeek! What was I thinking?!?”

Simply Stated

For the purposes of test scoring, interpretation, and reporting, a test can stand alone, but a subtest cannot. You may report and interpret scores on a single test or composite of specific subtests (based on the test manual), but not on a single subtest. Why? Because good psychometricians say so. Just kidding . . . read on!

The term subtest implies rather clearly that its content is not complete on its own; it is a “sub” of something else. A subtest measures only a portion of the content area you are trying to assess, and measures that content in a particular way. Student performance on a subtest is only part of the picture of a student’s performance in a content area, and therefore must be linked to the rest of the data that fill in that picture and make the information reliable, valid, and able to be communicated quantitatively. A single subtest score (other than the raw score) is generally not reported because it is not reliable enough to stand alone. The subtest must be given in conjunction with other subtests to form a composite or total test score.

The term test refers to a set of data that is strong enough in validity and reliability to report a separate set of scores for the items and call it a test. Tests are collections of subtests and/or a set of items with statistically valid and reliable item data. Multiple tests linked and published together are often called a test battery. The CASL is a good example of a test battery with 15 highly reliable tests, not subtests, all independently rigorous and able to stand alone or together in scoring, interpretation, and reporting.

Decisions YOU Make in Testing

Here’s an example: Basic Concepts and Synonyms are two individual tests within the CASL. You could administer one or both of these tests, score each, interpret each individually with respect to its broad content area (Lexical/Semantic), or use the Syntax Construction and Pragmatic Judgment tests with the Basic Concepts test to establish a Core Composite for a 3-year-old. If you were particularly concerned about an additional area of language, you could then give a single Supplementary Test, and the scores are valid and reliable on their own, outside the Core Tests but psychometrically linked. The benefits of using independent tests in a battery include: (1) the ability to tease out each content area individually with data that is valid and reliable; (2) a data-based “profile” of student performance that is linked to a nationally representative normative sample; and (3) an opportunity to “kill two birds with one stone” by documenting norm-referenced data while pointing directly to intervention needs. There are likely even more benefits beyond these. Very seldom would anyone need to administer an entire battery of tests. The clinician is free to give one test or several tests and can report the score(s) with confidence. You do not have this flexibility when working with subtests.

A clarifying caveat: If you choose to give only a few subtests randomly on a test, you may not score, average, or report normative data on those subtests. You can, however, use the test information qualitatively or in a criterion-referenced format: use a description of the student’s performance, not numeric data. Because the standardization procedure was not followed, the norms will be invalid and unusable. If you need a standardized score for eligibility or release from services, make sure you follow the standardization procedure, give the whole test (or COMPLETE set of SUBtests), and report accurately.

So the moral of the story is . . . know your test! The manual should tell you in the first few sentences if you are working with tests or subtests and how they are combined. Follow the rules, and get scores you can use. It’s that simple.

Send us your “What I’d like to learn about tests this year” list

As your partner in testing, we’d like you to know what we do, how we do it, and why. In turn, we’d like to know what other information we could provide to help you in your jobs. So send us your “What I’d like to learn about tests this year” list to webmaster@agsnet.com. And we’ll try to fulfill your wishes.

Practice Effects

Practice effects refer to gains in scores on cognitive tests that occur when a person is retested on the same instrument, or tested more than once on very similar ones. These gains are due to the experience of having taken the test previously; they occur without the examinee being given specific or general feedback on test items, and they do not reflect growth or other improvement on the skills being assessed. Such practice effects denote an aspect of the test itself, a kind of systematic, built-in error that is associated with the specific skills the test measures. These effects relate to the test’s psychometric properties, and must therefore be understood well by the test user as a specific aspect of the test’s reliability. Retesting occurs fairly commonly in real circumstances for reasons such as mandatory school reevaluations, longitudinal research investigations, unwitting or deliberate duplication by different professionals who are evaluating the same individual, a parent’s or teacher’s insistence that a child be retested because the test scores imply that the child was not trying, and so forth. A keen understanding of differential practice effects facilitates competent interpretation of test score profiles in those instances in which people are retested on the same or a similar instrument, perhaps several times.

No specific length of time between tests is required to study practice effects; it depends on the generalization sought or needed. If the interval is very short—for example, a few hours, or a couple of days—then examinees are likely to remember many specific items that were administered. They are likely to retain specific picture puzzles, arithmetic problems, or block designs, and recall the strategies that proved most successful; the result is an inflated estimate of the practice effect—that is, relative to an inference about established (learned) effects. In contrast, intervals that are long, perhaps six months or a year or two, are confounded by variables other than the test’s psychometric properties and practice as such. Long intervals allow forgetting of the test’s content, and therefore reduce the magnitude of the practice effects; at the same time, in lengthy intervals there can be real growth or decline of the abilities measured. When change has occurred, it becomes difficult to separate the test’s practice effects, as such, from the person’s improvement or decay on the skills. For preschool children, who experience rapid development, even three or four months may be too long an interval for studying a test’s practice effects.

Table 1. Practice effects on Wechsler’s Verbal, Performance, and Full Scale IQs for different age groups

Age (yrs.)   Interval   Test       Gain on V-IQ   Gain on P-IQ   Gain on FS-IQ
5            4 wks.     WPPSI-R        +2.8           +6.3           +5.1
5.5          11 wks.    WPPSI          +3.0           +6.6           +3.6
6.5          3 wks.     WISC-III       +1.7          +11.5           +7.6
7            4 wks.     WISC-R         +3.9           +8.6           +6.6
10.5         3 wks.     WISC-III       +1.9          +13.0           +7.7
11           4 wks.     WISC-R         +3.4          +10.8           +7.6
14.5         3 wks.     WISC-III       +3.3          +12.5           +8.4
15           4 wks.     WISC-R         +3.2           +9.2           +6.9
30           4.5 wks.   WAIS-R         +3.3           +8.9           +6.6
50           3.5 wks.   WAIS-R         +3.1           +7.7           +5.7
Median                                  +3.2           +9.0           +6.8
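As a quick arithmetic check, the median gains in the last row of Table 1 can be reproduced from the ten rows above it. A minimal Python sketch, with the values transcribed from the table:

```python
import statistics

# Gains from the ten rows of Table 1, in order (ages 5 through 50).
v_iq_gains = [2.8, 3.0, 1.7, 3.9, 1.9, 3.4, 3.3, 3.2, 3.3, 3.1]
p_iq_gains = [6.3, 6.6, 11.5, 8.6, 13.0, 10.8, 12.5, 9.2, 8.9, 7.7]
fs_iq_gains = [5.1, 3.6, 7.6, 6.6, 7.7, 7.6, 8.4, 6.9, 6.6, 5.7]

# With ten values, the median is the mean of the 5th and 6th after sorting.
# These work out to 3.15, 9.05, and 6.75, which round to the +3.2, +9.0,
# and +6.8 reported in the table.
medians = {name: statistics.median(gains)
           for name, gains in [("V-IQ", v_iq_gains),
                               ("P-IQ", p_iq_gains),
                               ("FS-IQ", fs_iq_gains)]}
```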

The most commonly useful intervals for investigating a test’s practice effects are between one week and about two months, with one month or so representing a reasonable midpoint. Intervals of that approximate magnitude are typical of the test-retest reliability investigations reported in the test manuals of popular individually administered intelligence and achievement tests. Table 1 provides data on the practice effects for Wechsler’s popular series of intelligence scales. The studies from which the table figures were obtained were based on samples of normal individuals who were retested during the standardization programs of each scale. The data are taken from the test manuals of the 1967 Wechsler Preschool and Primary Scale of Intelligence (WPPSI) for ages 4 to 6.5 years and its 1989 revision (WPPSI-R) for ages 3 to 7 years; the 1974 Wechsler Intelligence Scale for Children-Revised (WISC-R) for ages 6 to 16 years and its 1991 revision (WISC-III), covering the same age range; and the Wechsler Adult Intelligence Scale-Revised (WAIS-R) for ages 16 to 74 years. Intervals averaged about one month, except for the 11-week interval used for the WPPSI; all studies were well designed.

Practice effects are shown in this table for Wechsler’s Verbal (V) IQ, Performance (P) IQ, and Full Scale (FS) IQ. The verbal subtests that yield the V-IQ include factual, language-oriented items that require good verbal comprehension and expression for success; most items are reminiscent of the kinds of questions asked in school. In contrast, the performance subtests that contribute to the P-IQ require visual-perceptual-spatial skills and manipulation of concrete materials for success, and measure a person’s visual-motor coordination and nonverbal reasoning abilities. These tasks are not similar to school-related tests and activities. FS-IQ reflects a combination of the V and P scales; all three IQs are normed to have a mean equal to 100 and standard deviation equal to 15. Sample sizes for the ten groups in the table ranged from 48 to 175, with an overall total of exactly 1,000 individuals.

Practice effects on the FS-IQ averaged about 7 points across instruments and age groups, although an age trend was evident. Increases on the full scale from the first to second testing averaged about 4.5 points for preschool children, 7.5 points for elementary through high school students, and 6 points for adults. Regardless of the age of the sample, practice effects were considerably larger for P-IQ (9 points) than for V-IQ (3 points). The number of points gained on the V-IQ was a fairly constant 3 points for all age groups, but gains on P-IQ averaged 6.5 points for preschoolers, 11 points for elementary and high school students, and 8.5 points for adults.

These results for the Wechsler scales have generally been replicated for other intelligence tests. The overall gains on global IQ (about 7 points) are of the same approximate magnitude as: (a) the 5- to 6-point gains on the KAUFMAN ASSESSMENT BATTERY FOR CHILDREN (K-ABC), MCCARTHY SCALES OF CHILDREN’S ABILITIES, DIFFERENTIAL ABILITY SCALES (DAS), and Kaufman Adolescent and Adult Intelligence Test (KAIT), and (b) the 7- to 7.5-point gains on the Stanford-Binet (Fourth Edition).

As can be seen in Table 1, the gains are substantially larger for Wechsler’s P-IQ than for the V-IQ. This finding is seen also on similar scales of other tests, although the differences are not as extreme. In the K-ABC, the Simultaneous Processing Scale resembles the P-Scale, and the Achievement Scale is similar to the V-Scale. Gains on simultaneous processing averaged about 6.5 points, compared with about 2.5-point gains on achievement. In the Binet, the Abstract/Visual Reasoning Scale is similar to P, and the Verbal Reasoning Scale is similar to V. Abstract/visual gains averaged 7.5 to 8 points, whereas verbal gains were 5 points. Gains on the DAS Special Nonverbal Scale (similar to P) averaged 7 points, compared with 4-point verbal ability (similar to V) gains for school-age children; at the preschool level, practice effects were 4 points on nonverbal ability and 1 to 2 points on verbal ability. In the KAIT, gains on measures of Fluid IQ (similar to P) were generally higher (7 points) than gains on Crystallized IQ (4.5 points), which is similar to V.

A number of factors seem to contribute to the practice effects that have been noted: familiarity with the kinds of tasks that compose an intelligence test, experience solving these tasks, and the development of effective strategies for solving different kinds of problems. Although an occasional specific item may be remembered (e.g., a puzzle of a horse or a car on the WISC-III), the gains in test scores are not due simply to recall of specific facts. Verbal tasks produce the smallest gains because children and adults have had much experience prior to the testing session in answering general information questions, solving arithmetic problems, or defining words. There is still a small practice effect because even school-like verbal tasks have some unique aspects to them, but the pattern of gains on Wechsler’s verbal tasks supports an “experience” hypothesis, that experience with erstwhile novel tests produces improvement. On the WISC-III, for example, gains are smallest across the age range, indeed, almost nonexistent, on tests of defining words, solving arithmetic problems, and answering “why” questions (e.g., “Why do cars have seatbelts?”); they are largest (nearly one-third SD) on those verbal tasks that are least like school tests (e.g., telling how two things are alike, repeating digits backward), tests that initially are novel. Very similar results occurred for the WPPSI-R and the WAIS-R.

The magnitude of gains on tests of verbal intelligence, incidentally, is commensurate with the practice effects observed for conventional tests of academic achievement. On the Kaufman Test of Educational Achievement (K-TEA) Brief and Comprehensive Forms, for example, gain scores averaged 3.3 points for mathematics, 2.3 points for reading, and 2.4 points for spelling over a one-week interval.

Experience also helps explain the finding of larger P-IQ than V-IQ practice effects. The P tasks tend to be novel tasks not tried before. As they are administered, they become less novel. Each time they are given, if the interval is not too long, individuals will recall trying to solve the same kinds of problems, and they may recall, too, the strategies that worked best the first time. And even if one is not able to solve many more items correctly on the retest, one is likely to respond more quickly to the items the second time around. On Wechsler’s P-Scale, quicker response times translate to higher scores, because several subtests allot bonus points for quick, perfect performance. Indeed, the increase in speed may largely account for the practice effect.

The generally heavy emphasis on visual-motor speed that characterizes P-IQ may also explain the age and test differences seen in Table 1. The largest P-IQ gains were on the WISC-R and WISC-III, followed by the WAIS-R and the preschool scales. Not surprisingly, the WISC-R and the WISC-III allot by far the most speed bonus points in the P-Scales. The WAIS-R is next, followed by the WPPSI and WPPSI-R, which place the least emphasis on motor speed. The WAIS-R, for example, does not give bonus points for any items on the Picture Arrangement subtest, whereas the WISC-III allots three bonus points for most items. On the WISC-III, Picture Arrangement (putting pictures in the right order to tell a story) had the largest practice effect of any subtest, a gain of about one standard deviation from test to retest.

The motor speed hypothesis may partially explain why the nonverbal versus verbal distinction was not as pronounced on other intelligence tests as it was on the Wechsler scales. Gains on nonverbal or fluid intelligence scales averaged about 7 to 8 points on the K-ABC, Stanford-Binet IV, DAS, and KAIT, in contrast to gains of about 2.5 to 5 points on these tests’ verbal/crystallized scales. The K-ABC, Binet IV, DAS, and KAIT nonverbal/fluid subtests place more emphasis on correct problem solving and less emphasis on motor speed than do Wechsler’s P subtests; the outcome may be less exaggerated practice effects for the nonverbal and novel tasks on these “other” tests. Research, however, has not pinpointed the precise explanations for different practice effects. Much of this discussion is therefore speculative.

Catron and Thompson (1979) investigated the role of the test interval on the size of practice effects by retesting five different samples of college students on the WAIS over five intervals: no interval (immediate retest), 1 week, 1 month, 2 months, and 4 months. Gains on V-IQ were 3 to 5 points for the immediate retest and 1-week retest, 2 points after 1 to 2 months, and 1 point after 4 months. Gains on P-IQ averaged 14 points for the immediate retest, and decreased steadily from 11 points after 1 week to 8 points after 4 months. Thus, after a 4-month interval, P-IQ was still elevated 3 points, but V-IQ was elevated only 1 point (i.e., there was virtually no gain).

In a review of 11 test-retest studies of the WAIS, Matarazzo and his colleagues found that, regardless of large differences in samples and some long intervals of time, the results were consistent in indicating about 2 IQ points of gain on the V-Scale and 7 to 8 points on the P-Scale. The intervals ranged from 1 week to 13 years; mean ages ranged from 19 to 70 years; and the samples included groups as diverse as brain-damaged elderly, mentally retarded, chronic epileptics, and college students.

In addition to novelty, motor speed, and interval, at least two other variables seem to relate to different practice effects for different tests: the nature of the task, and subtest reliability. When tests of verbal and visual memory are used in test-retest studies, for example, the pattern of different practice effects observed for cognitive problem-solving tasks no longer holds; in fact, the opposite pattern may emerge. The Wechsler Memory Scale-Revised (WMS-R) includes measures of verbal memory (retelling stories that are read aloud by the examiner, learning eight verbal word pairs) and visual memory (recognizing and recalling abstract designs that are exposed briefly, learning six pairs of visual stimuli). Results indicate that gains on the Verbal Memory Scale for three age groups average about 13 points, in contrast to an 8-point gain for the Visual Memory Scale. The visual memory practice effect was commensurate with the P-IQ gain on the WAIS-R, but the verbal memory gain was much larger than V-IQ gains. With verbal memory subtests, adults probably remember specific facts, story lines, and word associations, which greatly facilitate recall when these adults are retested more than a month later. On the KAIT, the largest practice effect for any of its ten subtests over a one-month interval was for Auditory Delayed Recall, which measures a person’s ability to remember verbal information (mock news stories) presented by cassette about a half-hour earlier.

The reliability of a subtest, particularly test-retest stability (see RELIABILITY), also relates to the size of its practice effect. Wechsler’s P subtests tend to be less reliable than the V subtests. Thus some of the change from one testing to the next is unreliable change. Vocabulary typically produces the smallest test-retest gain, and it is usually the most reliable Wechsler subtest. Picture Arrangement and Object Assembly tend to produce large practice effects, and these tasks are consistently among the least reliable Wechsler subtests. Block Design, easily the most reliable Wechsler P subtest on the WISC-III and the WAIS-R, shows the smallest practice effect among P subtests – despite the novel nature of the task and its reliance on bonus points for motor speed. On the KAIT, the least reliable task (Auditory Delayed Recall) has the largest practice effect, and the most reliable (Definitions) has the smallest.

Thus, practice effects do occur, they are different for verbal and nonverbal tasks, and they are of considerable practical importance. Any research study that depends on pre- and posttests should take into account gains due to practice; such gains should not be interpreted as evidence of true growth or change. In the absence of a control group, the average verbal and nonverbal gains known to occur based on routine retesting should be subtracted from any gains demonstrated for experimental groups. Failure to consider such gains or use appropriate control groups has led some researchers to infer, erroneously, gains in IQ following the surgical removal of plaque from the carotid artery in endarterectomy patients; and the inappropriate application of the practice effect data has led to specious conclusions regarding epileptic patients.
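The correction described here, subtracting the typical practice gain from an observed retest gain before interpreting it, can be sketched as follows. The expected-gain values are the approximate Wechsler medians from Table 1 and are purely illustrative; a real study would use control-group data or the manual-reported figures for the specific instrument and interval:

```python
# Approximate median Wechsler practice gains from Table 1 (illustrative).
EXPECTED_PRACTICE_GAIN = {"V-IQ": 3.2, "P-IQ": 9.0, "FS-IQ": 6.8}

def practice_corrected_gain(scale, first_score, retest_score):
    """Observed retest gain minus the gain expected from practice alone."""
    observed = retest_score - first_score
    return observed - EXPECTED_PRACTICE_GAIN[scale]

# A 10-point FS-IQ "improvement" shrinks to roughly 3 points once the
# typical practice effect is removed.
corrected = practice_corrected_gain("FS-IQ", 100, 110)
```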

Any longitudinal study of changes in intelligence across the life span should take into account the evidence of practice effects. When the same individuals are tested every year or two on a Wechsler battery, the P-Scale, especially, can yield spuriously high IQs as a result of practice effects. Test a person over and over, and the kinds of “novel” tasks that characterize the P-Scale become as familiar as a test of vocabulary or general information. The V-IQ may continue to provide a reasonable estimate of true score over time, but the repeated use of P-IQ will not detect decrements in fluid or visual-spatial intelligence that accompany aging in adulthood. The repeated use of the same instrument in aging studies contributes to “progressive error” in longitudinal research, and has led to a confounding of data interpretation in several studies—including the well-known Duke longitudinal studies, in which the same adults were tested eleven times on the WAIS in the course of twenty-one years. This type of practice effect also makes it difficult to interpret IQs on tests that are administered every two or three years during the mandatory reevaluations of special education students.

Clinicians should understand the average practice effect gains in intelligence scores for children, adolescents, and adults. The expected increase of about 5 to 8 points in global IQ renders any score obtained on a retest a likely overestimate of the person’s true level of functioning – especially if the retest is given within about six months of the original test, or if the person has been administered a Wechsler scale (any Wechsler scale) several times in the course of a few years. These inflated IQs, if not interpreted as overestimates resulting from the practice effect, may imply cognitive growth when none has occurred; may suggest that the earlier test yielded an invalidly low score when it was indeed valid; may suggest that a bright individual is gifted or that a retarded person is low-average; and so forth. Even though the average gain is about 5 to 8 points for various tests, the range of gain scores makes it possible for some individuals to gain as much as 15 IQ points due to practice alone.

And the different practice effects for verbal versus nonverbal tasks can influence the interpretation of profile results. On the WISC-III, for example, the average V–IQ gain is 2.3 points and the average P–IQ gain is 12.3 points, which translates to a net gain of 10 points on P–IQ due to the practice effect. Clinicians commonly use V–P IQ discrepancies as part of a diagnostic process. Other things being equal, V–P discrepancies will shift by an average of 10 points in favor of P–IQ. An initial P > V difference of 12 points will become about 22 points on a retest; a significant V > P difference of 15 points will become a trivial V > P difference of about 5 points. Inappropriate clinical decisions are likely for professionals who do not understand the predictable and substantial practice effects associated with verbal and nonverbal cognitive tests.
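The discrepancy arithmetic in this paragraph can be written out directly. A small sketch using the WISC-III average gains quoted above (the V-minus-P sign convention is an assumption made here for illustration):

```python
# WISC-III average retest gains quoted above.
V_GAIN, P_GAIN = 2.3, 12.3
NET_SHIFT = P_GAIN - V_GAIN  # about 10 points in favor of P-IQ

def expected_retest_v_minus_p(initial_v_minus_p):
    """Expected V-minus-P discrepancy on retest, other things being equal."""
    return initial_v_minus_p - NET_SHIFT

# An initial P > V difference of 12 points (V - P = -12) becomes about -22
# on retest; a V > P difference of 15 points shrinks to about 5.
retest_p_gt_v = expected_retest_v_minus_p(-12)
retest_v_gt_p = expected_retest_v_minus_p(15)
```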


CATRON, D.W., & THOMPSON, C.C. (1979). Test-retest gains in WAIS scores after four retest intervals. Journal of Clinical Psychology, 35, 352-357. Evaluates practice effects on WAIS with college students, using four different intervals ranging from 1 week to 4 months; see also Catron’s 1978 study of WAIS practice effects on an immediate retest in Psychological Reports, 43, 279-280.

ELLIOTT, C.D. (1990). Differential Ability Scales: Introductory and technical handbook. San Antonio, TX: The Psychological Corporation. Chapter 8 includes studies of practice effects on a diversity of cognitive and achievement tasks for four groups ages 3.5 to 13 years.

KAUFMAN, A.S. (1990). Assessing adolescent and adult intelligence. Boston: Allyn & Bacon. Chapter 4 includes a comprehensive discussion of practice effects on the WAIS and WAIS-R, and their implications for clinical practice and research; chapter 7 relates practice effects to the interpretation of longitudinal data on aging and intelligence; chapter 9 discusses the role of practice effects on the interpretation of Verbal-Performance IQ differences.

KAUFMAN, A.S., & KAUFMAN, N.L. (1983, 1985, 1993). Manuals for the Kaufman Assessment Battery for Children (K-ABC, 1983), Kaufman Test of Educational Achievement (K-TEA, 1985), and Kaufman Adolescent and Adult Intelligence Test (KAIT, 1993). Circle Pines, MN: American Guidance Service. Each test manual includes thorough data on practice effects for the global scores and subtests for each Kaufman battery.

MATARAZZO, J.D., CARMODY, T.P., & JACOBS, L.D. (1980). Test-retest reliability and stability of the WAIS: A literature review with implications for clinical practice. Journal of Clinical Neuropsychology, 2, 89–105. Summarizes and interprets the results of eleven stability studies with the WAIS with a heterogeneous set of samples and widely varied intervals.

MATARAZZO, J.D., & HERMAN, D.O. (1984). Base rate data for the WAIS-R: Test-retest stability and VIQ-PIQ differences. Journal of Clinical Neuropsychology, 6, 351-366. Includes thorough analysis of practice effects on the WAIS-R, and discussion of their clinical and practical implications.

MATARAZZO, R.G., MATARAZZO, J.D., GALLO, A.E., JR., & WIENS, A.N. (1979). IQ and neuropsychological changes following carotid endarterectomy. Journal of Clinical Neuropsychology, 1, 97-116. Reevaluates conclusions about IQ gains following carotid artery surgery based on data on practice effects.

SEIDENBERG, M., O’LEARY, D.S., GIORDANI, B., BERENT, S., & BOLL, T.J. (1981). Test–retest IQ changes of epilepsy patients: Assessing the influence of practice effects. Journal of Clinical Neuropsychology, 3, 237-255. Investigates the relationship between practice effects and functional changes in epileptic patients.

SHATZ, M.W. (1981). WAIS practice effects in clinical neuropsychology. Journal of Clinical Neuropsychology, 3, 171-179. Disputes Matarazzo’s claims that practice effects in normal individuals are applicable to neuropsychological patients.

WECHSLER, D. (1981, 1987, 1989, 1991). Manuals for the Wechsler Adult Intelligence Scale—Revised (WAIS-R, 1981), Wechsler Memory Scale—Revised (WMS-R, 1987), Wechsler Preschool and Primary Scale of Intelligence—Revised (WPPSI-R, 1989), and Wechsler Intelligence Scale for Children—Third Edition (WISC-III, 1991). San Antonio, TX: The Psychological Corporation. Each test manual includes thorough data on practice effects for the global scores and subtests for the most recent, updated version of each Wechsler battery.

From ENCYCLOPEDIA OF HUMAN INTELLIGENCE (2 VOLUMES)*  By Alan S. Kaufman, Gale Group, © 1994, Gale Group. Reprinted by permission of The Gale Group.
* The article originally appeared as a reference entry in the Encyclopedia of Human Intelligence (Vol. 2, pp. 828-833). Edited by Robert J. Sternberg. New York: Macmillan Publishing Company

Fall is back to the books time. For you, this means test manuals. Happy reading! There’s more good stuff to discover in our manuals…really, there is.

Seize your September!

Norms Development Details You Should Know

Clinical Café by Tina Radichel, M.S., CCC-SLP

Can you believe that it’s almost September? Where did the summer go? A hearty “welcome back” to all our school-based colleagues!

One of the first things that I know you’ll be doing upon returning to school is making decisions about students on your caseload. You need to determine what testing, if any, you’ll need to complete in your assessment process. You might be picking up a freshly sharpened pencil, a cup of coffee, and the new test that you ordered months ago, attempting to recall, “What is this one all about?”

When you make the decision to use a test to evaluate a student’s performance or progress, it’s wise to understand the background of the test and ensure that you can reconcile that background with your theory of speech/language development and your goals for testing. Knowing more about test development also helps you analyze and interpret the results you get from testing and understand why results that differ from your expectations might have occurred.

So, to help you in this regard, we offer another “look under the hood” of test development. The information below is adapted from an article by Dr. Mark H. Daniel, AGS Publishing’s Executive Director/Senior Scientist. For a quick review of test terminology, you might want to brush up with the Pearson Assessment Group’s Glossary of Common Test Terms.

Test Norming

Norms can be constructed once a test has been administered to a nationally representative sample. This process involves judgment as well as analysis. For example, decisions need to be made about how to group the data and whether or not to normalize standard scores. In most cases, the process begins with estimating percentiles.

How are percentile norms computed?

A primary objective of norms development is to smooth out irregularities while preserving the true shapes of the within-year distributions and the age trends. [Remember the bell curve? Like many things in life, it’s not always perfect!] These irregularities are introduced by sampling error, which occurs even though the norm sample has been carefully selected so the demographics of each subgroup (e.g., age or grade) match those of the U.S. population.

In the most common norming method, the overall sample is first divided into subgroups. For example, groups for grade norms might correspond to the spring and fall of each grade, whereas those for age norms might be formed from 3-, 6-, or 12-month ranges. Within each subgroup, the raw scores corresponding to selected percentile points are identified. Then, for each percentile point, the across-age or across-grade progression is smoothed to remove random deviations from a regular pattern of growth.

Next, each within-subgroup distribution of scores (as adjusted by the previous step) is smoothed to eliminate lumps or gaps. These two smoothing operations—across and within subgroups—are alternated (by computer) until the results stabilize. The result is a set of percentiles that progress steadily across age or grade and are smoothly distributed within each age or grade. Percentiles for any point on the age or grade range can then be read from the smoothed growth curves.
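The alternating smoothing procedure can be illustrated with a toy sketch. Here rows are selected percentile points, columns are age subgroups, and cell values are the raw scores at each percentile in each subgroup. The three-point moving average stands in for whatever curve-smoothing method a real norming program would use; the method and data below are illustrative assumptions, not the actual norming algorithm:

```python
def smooth_series(values):
    """Three-point moving average; endpoints are left as-is."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3.0
    return out

def alternate_smoothing(table, tolerance=1e-6, max_rounds=100):
    """Alternate across-subgroup and within-subgroup smoothing until stable."""
    rows, cols = len(table), len(table[0])
    for _ in range(max_rounds):
        previous = [row[:] for row in table]
        # Smooth each percentile point across age subgroups (along rows).
        table = [smooth_series(row) for row in table]
        # Smooth each subgroup's distribution across percentile points (down columns).
        for c in range(cols):
            column = smooth_series([table[r][c] for r in range(rows)])
            for r in range(rows):
                table[r][c] = column[r]
        change = max(abs(table[r][c] - previous[r][c])
                     for r in range(rows) for c in range(cols))
        if change < tolerance:
            break
    return table

# Raw scores at the 25th/50th/75th percentiles for four age subgroups,
# with a sampling-error "bump" in the second subgroup.
noisy = [[10.0, 14.0, 12.0, 13.0],
         [15.0, 20.0, 17.0, 18.0],
         [20.0, 26.0, 22.0, 23.0]]
smoothed = alternate_smoothing(noisy)
```

After smoothing, the bump is damped and each percentile progresses steadily across the subgroups, while the anchor values at the extremes are preserved.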

Continuous norming. If a continuous norming approach is used, many overlapping subgroups are created instead, centered on each individual month of the age or grade range. For example, one subgroup might be centered on age 6:4, the next on age 6:5, and so on. Each subgroup has enough cases to permit calculation of percentile points for each month. These values can then be smoothed across and within subgroups as described above.
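A minimal sketch of forming those overlapping monthly subgroups (the window width and data below are assumptions for illustration; a real norming program would choose window sizes to guarantee an adequate number of cases per month):

```python
def monthly_subgroups(cases, half_width_months=3):
    """cases: list of (age_in_months, raw_score) pairs.
    Returns {center_month: sorted scores within the window}."""
    ages = [age for age, _ in cases]
    subgroups = {}
    for center in range(min(ages), max(ages) + 1):
        window = [score for age, score in cases
                  if abs(age - center) <= half_width_months]
        if window:
            subgroups[center] = sorted(window)
    return subgroups

# Ages 6:0 through 6:11 (72-83 months), one case per month.
cases = [(72 + m, 50 + m) for m in range(12)]
groups = monthly_subgroups(cases)
# The subgroup centered on 76 months spans ages 73-79 months, the one
# centered on 77 months spans 74-80 months, and so on.
```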

How are standard scores computed?

Usually, raw scores at any specific age or grade are not normally distributed, even after smoothing. The distribution may be skewed (stretched out farther in one direction than another). Or it may be flatter or taller than a normal distribution.

There is often a theoretical reason (particularly with ability tests) to expect the true distribution to be normal and, thus, to assume that any non-normality is an artifact. If this assumption is made, then normalized standard scores are constructed. Each percentile point is converted into the standard score that would correspond to that percentile in a normal distribution. Because normalized standard scores are derived from percentiles, all tests using such scores show the same relationship between standard scores and percentiles.

If the underlying distribution is not assumed to be normal (e.g., with GFTA-2 and KLPA-2), then standard scores generally are constructed directly from raw scores. This might be the case with a behavior-problems scale on which most individuals score in a normal range and a few have extreme scores in one direction. Here, one would likely compute linear standard scores, which reflect the distance of each raw-score value from the mean in standard-deviation units. The relationship between linear standard scores and percentiles will vary from test to test.
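The contrast between the two score types can be sketched in a few lines of Python. The raw-score data below are invented, and `statistics.NormalDist` from the standard library supplies the inverse normal CDF; on a skewed distribution like this one, the two approaches yield different standard scores for the same raw score.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical, positively skewed raw-score distribution for one age group.
raw_scores = [3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 10, 13, 18, 25, 31]

def percentile_rank(scores, x):
    """Percent of the group scoring at or below x."""
    return 100 * sum(s <= x for s in scores) / len(scores)

def normalized_ss(scores, x):
    """Raw -> percentile -> normal-curve z -> standard score."""
    p = percentile_rank(scores, x) / 100
    p = min(max(p, 0.001), 0.999)          # keep inv_cdf finite at the tails
    return round(100 + 15 * NormalDist().inv_cdf(p))

def linear_ss(scores, x):
    """Raw score's distance from the mean in SD units, rescaled."""
    z = (x - mean(scores)) / stdev(scores)
    return round(100 + 15 * z)

x = 18
print("normalized:", normalized_ss(raw_scores, x))
print("linear:", linear_ss(raw_scores, x))
```

Because the normalized score is derived from the percentile, its relationship to percentiles is fixed across tests; the linear score's relationship to percentiles depends on the shape of each test's raw-score distribution.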

Send us your “What I’d like to learn about tests this year” list

As your partner in testing, we’d like you to know what we do, how we do it, and why. In turn, we’d like to know what other information we could provide to help you in your jobs. So send us your “What I’d like to learn about tests this year” list to webmaster@agsnet.com. And we’ll try to fulfill your wishes.

Fall is back to the books time. For you, this means test manuals. Happy reading! There’s more good stuff to discover in our manuals…really, there is.

Seize your September!

Time for “Types of Tests” Terms!

Clinical Café by Tina Radichel, M.S., CCC-SLP

Lazy, Hazy, and Crazy Summer Days

Want to keep the wheels turning during the summer months? Want to simplify your life? Look no further! In an attempt to ease your burden in having to repeatedly clarify terminology regarding types of tests, this month’s Clinical Café focuses on some often-confused terms. Spread the word! Place copies of this issue by the napkin holders in the staff lounge. Slip a copy under the door of your scanning department or your technology office. Give a prize to the person who uses them correctly in a report or a presentation. You have all summer to brainstorm.

Norm- vs. Criterion-Referenced Tests

Norm-referenced tests compare an examinee’s performance against that of a representative group. This group is gathered carefully and tested in a standardized way so that it represents the entire population for which the test is intended.

Criterion-referenced tests use a set of benchmarks, or criteria, which carry specific expectations of mastery. An examinee’s performance is then compared to these expectations of content mastery or performance—that is, to the criteria themselves, not to any reference group.

Diagnostic/Formative/Summative Tests

Pre-learning: Diagnostic tests are the ones with which we as SLPs are usually most familiar. These tests measure the knowledge and skills of an examinee “left to his/her own devices.” We complete diagnostic testing first to accurately place students in the intervention program most suitable to their needs. The OWLS and the CASL would be considered diagnostic tests because they point to a specific direction in intervention.

During learning: Formative tests offer information about learning in the middle or throughout the learning process. Formative testing can take the shape of learning self-assessment, quizzes, practice tests, or observations.

Post-learning: Summative tests make a final, end-of-course judgment on the intervention or learning and its outcome—the examinee’s relative success or failure. National certification exams, like our NESPA exam, are summative tests. The ACT and SAT also fit into this category. Not to confuse anyone, but SLPs obviously also use our diagnostic tests post-learning. Still, these tests are better classified as diagnostic tests, which are given repeatedly to track progress (for medium stakes purposes—oops, I’m getting ahead of myself. Read on!).

Low/Medium/High Stakes Tests

We’ve been hearing quite a bit about these different types of tests in recent months. Dividing tests along this continuum is descriptive because “low, medium, or high stakes” captures the level of impact a test can have on examinees. The category a test falls into reflects how rigorously its items are created and reviewed, how much item and administration security it requires, whether the examinee’s identity must be verified, and the number and scope of consequences of decisions made based on the results (the “stakes”). For example, the ACT is a high stakes test. It may impact college entrance. The items and test are ultimately secure. And I’m sure you all remember the “what you need to bring to this test” form, which includes proofs of identity and signatures of the stoic proctor. Most of our speech and language tests are medium stakes tests. The items are secure due to the investment in the norms development, the test is administered by an examiner or proctor, and it generates reports that point to key placement decisions and intervention programs. Low stakes test examples include student self-assessments, like our Career Decision Making (CDM) system, which offers direction and planning for examinees and an opportunity to develop motivation and thinking skills.

Speed vs. Power Tests

Timed tests assess how fast examinees can go as much as how much they really know. Certainly, there are elements of both speed and power in timed tests. However, when you remove the speed demand on the examinee, the test can truly become one of power—that is, a test that measures an examinee’s ability and knowledge (remember, “Knowledge is Power!”). Examples of speed tests are the ACT and SAT. Tests of power include the PPVT-III and other untimed tests. At Pearson, all of our speech and language tests are built and standardized for an untimed administration—we want to know about speech and language power!

Putting It All Together

Here are three examples of how these test types might all work together:

  1. Your school develops an end-of-year (summative) curriculum-based assessment program (criterion-referenced) for measuring progress and determining success of the students against the curriculum content (medium stakes). The tests are completed over the course of the last week of school in the classroom and are untimed (power).
  2. The local school district has purchased a set of items from an outside vendor. These items have been developed by teachers or content experts but do not have supporting, nationally developed norms and, therefore, cannot be used to compare students to their peers (which would be norm-referenced). The items sit on the school district’s network and are open to all teachers who want to create and deliver a test to their students throughout the year via self-directed computer time (low stakes) to check their learning against the state standards (formative). The tests give the students 10 minutes to answer 20 questions (speed), and provide feedback to the students in their strong and weak skill areas.
  3. You give the PPVT-III to five students on your caseload at the beginning of the year (diagnostic). You compare their results to the norms tables in the published manual (norm-referenced) and, with the rest of your assessment, make decisions about curriculum planning and/or interventions for the year (medium stakes).

Is It Time to Retire? No, Not You—Your Tests!!

Clinical Café by Tina Radichel, M.S., CCC-SLP

Is It Time to Retire? No, Not You—Your Tests!!

“I can’t wait until retirement!” screamed Orna.
“Why?” Cassie responded. “You seem so young; it can’t possibly be time yet.”
“Well, everything around me has changed, and I just don’t seem to fit quite right anymore. ‘New’ and younger ones are cropping up every day. How do I compare?”
“Maybe you just need a little facelift,” Cassie says.
“No, I think I need to be entered into the Veteran Hall of Fame!” declared Orna emphatically.
“You might be right,” agreed Cassie, “but you should be consulted often for your history and expertise—you’ve served your profession well!”

Do you think this could be you and a colleague talking? Perhaps, but it also could be the imaginary conversation of two of your tests. What happens to tests as they age? How do test publishers decide when and how much to update a test? How should you decide when to use a new version of a test? Read on!

A Changing Landscape—Tests AND People

It’s no secret that things change. We expect it, we see it, and sometimes we even welcome it! Change is also a constant when it comes to test revisions. There are at least two reasons for this: the test content and the population.

Test content is all about what you are measuring and how you measure it. For example, the GFTA-2 measures articulation ability. Articulation is a relatively stable content area. What needs to be measured isn’t likely to change much. However, some of the items that the original GFTA used to measure articulation needed to be updated (e.g., the gun for initial /g/). Vocabulary is a less stable content area because the “common,” and therefore easier, words sometimes go “out of fashion” and become less common/more difficult (e.g., “casserole” on the PPVT-R). Language ability can be measured in many ways and new methods may be unveiled over the years, making test changes necessary.

As you know, the population is also constantly shifting, both in demographics and in skill/ability areas. Education trends in both pedagogy and practice influence what students learn and how they approach testing and test content. As norms are developed, part of the reliability and validity of testing is based on the most accurate representation of the national population in both demographics and skills/abilities. As the nation goes, so does the test publisher in norms development.

To Revise: When and How? That is the Question

How often should a standardized test be updated? You’ll love this answer—It depends. The “depends” relates to the stability of the various domains (e.g., articulation and IQ are logically more stable over the years than are vocabulary or reading achievement), and test developers always take the content into consideration. Additionally, some states don’t allow professionals to use a test that is more than 10 years old, so clearly this is a factor to consider as well from a business/customer standpoint. According to AGS Publishing Vice President of Product Development, Dr. Kathleen Williams (also an SLP), and Director of Research, Marshall Dahl, 10 years is simply a “good rule of thumb” given all the contextual information on content, professional/customer needs, state requirements, development time, dollars, and the like.

When a test is revised, sometimes the content can be preserved and just the normative data updated (e.g., Normative Updates for K-TEA, WRMT-R, PIAT-R, and KeyMath-R). This certainly translates to a cost savings to everyone, including the field. The decision about what and how to update has to do with the domain being measured and the content to measure it. Achievement tests need to be updated at least every 10 years because the domain is not as stable. Articulation, as a domain, is stable, but the content may or may not be. Hence the longer time between the GFTA and the GFTA-2. Feedback from professionals in the field regarding such matters as content and layout also enters into the decision on necessary changes.

Making the Change…For You, For Your Students

Obviously, a test publisher promotes the most current version of each test. The latest revision is the most up-to-date in both content and norms and, therefore, offers the most authentic “apples to apples” comparison within the current educational environment. If you have a favorite test, like the PPVT, following the publisher’s revision schedule lets you purchase each new edition when it is published. At the same time, previous versions of a test often remain in use in schools and clinical settings because of large research studies, budget crunches, or other circumstances. In some cases (e.g., the PPVT), equating studies conducted during the revision process produce tables for converting scores on the previous edition to scores on the new, revised edition. Such score conversion tables can be of great benefit to you, the clinician, for comparing test scores from different editions of the same test.

The bottom line is that everyone needs valid and reliable tests to use. Making decisions about test revision plans has everything to do with that outcome. The actual development of a norm-referenced test revision is more than just an “interesting process of lots of examinees and confusing statistics” when done correctly. Everyone has a role to play in the development of human potential. We take our role as seriously as you take yours, and we are committed to providing tests and materials that help you and your students on a daily basis.

As always, we’d like to thank you for your ongoing service to people with communication needs and to remind you that we are here to support you with that effort. If you’d like to discuss this topic further, please feel free to use the SLP Discussion Center as the vehicle for an ongoing discussion with your colleagues. Should you have questions regarding these or other Pearson Speech and Language products, we welcome your phone calls at 800-627-7271 or use our web site at http://ags.pearsonassessments.com.

Bring on the warmth of spring!

Scores of Scores—The Art of Keeping Them Straight!

Clinical Café by Tina Radichel, M.S., CCC-SLP

How do you describe a client’s performance on a test? “Judy was at the 5th percentile.” “Antonio’s results show a standard score of 88.” “Mai performed at the 4th stanine.” With so many ways to report test performance, it’s no wonder people are confused about score interpretation! This month’s Clinical Café is a quick tutorial on normative scores. Print this article and tack it to your desk; give it to your favorite school psychologist, teacher, or principal; line your materials cupboards with it; or put it wherever it will be handy!

The following is a list of typically used normative score types. With each is a definition, a bit of perspective about the score type, and an example of an interpretation of a score. Notice that a raw score is not included in this list because raw scores are simply a frequency count of correct or incorrect answers and should not be used to describe student performance. They cannot be interpreted directly or compared between subtests. A raw score’s only purpose is to be used to derive normative scores. With that in mind, on with the list.

Standard Score

A standard score indicates the distance of a person’s raw score from average, taking into account the variability of scores among examinees of that age or grade. Standard scores are expressed in whole numbers with a mean of 100 and a standard deviation of 15. Standard scores are usually reported with a confidence interval, expressed first as a percent (e.g., a 90% confidence level) and then as a standard score range. The percent indicates the degree of certainty that the range captures the examinee’s true score (the score reflecting his or her actual ability, rather than just performance on a particular administration), and the score range indicates the standard scores within which that true score is likely to fall.
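As a rough illustration of one common way such a band is derived (the reliability value below is assumed for the example, not any particular test’s published figure): the standard error of measurement is SD × √(1 − reliability), and the interval is the observed score plus or minus a z multiplier times the SEM.

```python
import math

def confidence_interval(standard_score, reliability, confidence_z=1.64, sd=15):
    """Approximate 90% CI (z ~= 1.64) around an observed standard score."""
    sem = sd * math.sqrt(1 - reliability)   # standard error of measurement
    lo = round(standard_score - confidence_z * sem)
    hi = round(standard_score + confidence_z * sem)
    return lo, hi

# Hypothetical: an observed standard score of 88 on a test with reliability .90.
print(confidence_interval(88, 0.90))
```

Note that some test manuals center the band on an estimated true score rather than the observed score; the manual for each test spells out its own procedure.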

As an equal-interval measure, a standard score is one of the most common and useful metrics because it can be compared across subtests and across other tests and also can be arithmetically manipulated (added, subtracted, multiplied, divided, averaged, etc.). One important point needs to be made about standard scores. Often, we receive calls from customers who believe that if a client receives the same standard score from year to year, they are showing no growth. This is not accurate. Because the reference point for a standard score is the client’s own age group at the time of testing, which changes and grows in skill from year to year, the exact same standard score from one September to the next indicates one full year of growth.

Example: “Rob’s raw score converts to a standard score of 89, which falls within one standard deviation of the mean and is in the average range. Last year on this test, Rob’s standard score was 86. Receiving approximately the same standard score as a year ago indicates that Rob has demonstrated one year’s growth in this skill area.”

Percentiles or Percentile Rank

A percentile indicates the percentage of people in the reference group who performed at or below the examinee’s score. Despite its popularity, this score type is easily misunderstood and widely misused. Percentiles are an ordinal, or rank-order, scale of measurement rather than an equal-interval scale. That means you cannot subtract or average percentile scores to represent growth or change.

Example: “Kirby’s raw score converts to a percentile rank of 10. This means that 10 percent of the reference group performed at or below Kirby’s performance.”


Stanine

The term stanine is a contraction of “standard nines.” Stanines provide a single-digit scoring metric with a range from 1 to 9, a mean of 5, and a standard deviation of 2. Each stanine score represents a specific range of percentile scores in the normal curve. Stanines are useful when you are interested in providing a “band” interpretation rather than a single score cutoff. Stanines 1 and 2 represent the bottom 11 percent of the distribution, indicating a need in the tested skill area. Stanines 8 and 9 indicate a performance within the top 11 percent and a strength in a skill area. Stanines 4, 5, and 6 represent the average range.

Example: “Yasmine scored in the 5th stanine, which represents the middle 20 percent of examinees. This is in the average range.”
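The percentile bands behind stanines can be expressed as a simple lookup. The cut points below are the conventional 4-7-12-17-20-17-12-7-4 percent bands of the normal curve; handling of scores that land exactly on a cut point varies by convention, so this sketch picks one.

```python
from bisect import bisect_right

# Upper percentile bound of stanines 1 through 8 (stanine 9 is everything above).
CUTS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine(percentile):
    """Map a percentile rank to its stanine band."""
    return bisect_right(CUTS, percentile) + 1

print(stanine(50))   # middle of the distribution
print(stanine(3))    # bottom 4 percent
print(stanine(98))   # top 4 percent
```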

Normal Curve Equivalent

Normal curve equivalents (NCEs) are based on percentiles but are statistically converted to an equal-interval scale of measurement. NCEs range from 1 to 99, with a mean of 50 and a standard deviation of 21.06. NCEs of 1, 50, and 99 correspond to percentiles of 1, 50, and 99. However, other NCE values do not have a direct relationship to percentiles.

NCEs are used in many federal and state programs, such as Title I, as a method of reporting results. Since they can be arithmetically manipulated (i.e., they can be averaged), they are particularly helpful for reporting group data.

Example: “On this test, Gabriella’s normal curve equivalent was 54. This represents an average performance, with about half of the examinees her same age (or grade) performing better on this skill and about half performing more poorly.”
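The percentile-to-NCE relationship can be sketched directly: convert the percentile to a normal-curve z value, then rescale to a mean of 50 and an SD of 21.06 (that SD is chosen precisely so that percentiles 1 and 99 map to NCEs of 1 and 99). A minimal illustration using the standard library:

```python
from statistics import NormalDist

def nce_from_percentile(p):
    """Convert a percentile rank (1-99) to a normal curve equivalent."""
    z = NormalDist().inv_cdf(p / 100)   # percentile -> normal-curve z
    return round(50 + 21.06 * z, 1)     # rescale to mean 50, SD 21.06

print(nce_from_percentile(50))
print(nce_from_percentile(90))
print(nce_from_percentile(1))
```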

Grade Equivalent

A grade equivalent (GE) is the grade at which a person’s raw score is the median (or at the 50th percentile) score. Grade equivalents are expressed in tenths of a grade (1.2 = the second month of first grade).

Keep in mind that a GE has nothing to do with how the examinee performs against the local school curriculum or standards for a particular grade, nor does it take into account the person’s life experiences. Again, the reference for this score is the standardization sample of the test. Grade equivalents are also a rank-order scale; they place an examinee on a growth continuum, which may or may not increase at regular intervals. The same grade equivalent on two different subtests may not mean the same thing. Therefore, GEs are not the best option for making diagnostic and placement decisions.

Example: “Jamal’s grade equivalent for this test is 4.2. This means that Jamal’s score is the middle or median score for typically developing students in the second month of fourth grade.”
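A grade equivalent is essentially read off a norms table of median raw scores by grade. The sketch below uses a tiny, invented table and straight-line interpolation between tabled medians; actual test manuals supply their own tabled values and interpolation rules, so this only illustrates the lookup.

```python
# Hypothetical norms table: (grade, median raw score at that grade).
GRADE_MEDIANS = [(3.0, 21), (4.0, 27), (5.0, 32)]

def grade_equivalent(raw):
    """Interpolate the grade at which `raw` would be the median score."""
    for (g1, m1), (g2, m2) in zip(GRADE_MEDIANS, GRADE_MEDIANS[1:]):
        if m1 <= raw <= m2:
            return round(g1 + (raw - m1) / (m2 - m1) * (g2 - g1), 1)
    return None  # raw score outside the tabled range

print(grade_equivalent(22))
```

Notice that equal raw-score gains do not produce equal GE gains across the range, which is one reason GEs are a rank-order scale.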

Test-Age Equivalent or Age Equivalent

Similar to a grade equivalent, a test-age equivalent represents the age in years and months at which a particular raw score is the median score. Like GEs, test-age equivalents are a rank-order scale; they place an examinee on a growth continuum, which may or may not increase at regular intervals. The same test-age equivalent on two different subtests may not mean the same thing. Thus, test-age equivalents are not the best option for making diagnostic and placement decisions.

Example: “Lisa’s raw score of 17 converts to a test-age equivalent of 10-6. This means that Lisa’s raw score is the same as the middle or median raw score for children aged 10 years 6 months. However, Lisa’s raw score of 17 may represent very different specific skills than those of children who are chronologically 10 1/2, since she is 14. Further investigation of specific strengths and needs is warranted.”


By now you’re probably thinking, “I was told there would be no math in speech-language pathology!” Sorry, this is not totally true. (I know, I’m not thrilled about this either!)

It’s not so bad if you keep in mind the following: Normative scores that are an equal-interval scale of measurement (standard scores, NCEs) can be arithmetically manipulated (i.e., added, subtracted, multiplied, or divided). Those that are rank-order scales (percentiles, stanines, grade and test-age equivalents) cannot be used this way. This is very important to remember when working with individuals or groups of students and their data. For example, you can average a class’s standard scores on a particular test, but you may not average the students’ percentile ranks.
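A quick demonstration of why the rank-order rule matters. Using two invented standard scores and assuming a normal distribution, averaging the standard scores and then converting to a percentile gives a different (and correct) answer than naively averaging the two percentiles:

```python
from statistics import NormalDist

nd = NormalDist(mu=100, sigma=15)    # the standard-score scale described above

ss = [70, 115]                       # two hypothetical standard scores
pct = [100 * nd.cdf(s) for s in ss]  # their percentile equivalents

avg_ss = sum(ss) / 2                 # averaging standard scores is legitimate
true_pct = 100 * nd.cdf(avg_ss)      # the percentile that average actually sits at
naive_pct = sum(pct) / 2             # averaging percentiles directly: misleading

print(round(true_pct, 1))   # about 30.9
print(round(naive_pct, 1))  # about 43.2 -- well off the mark
```

Because percentiles bunch up near the middle of the bell curve and stretch out at the tails, the naive average lands more than 12 percentile points away from the correct value in this example.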

The different score types offer different information, but they are also elements of mathematics and have their own rules! Use your calculators sparingly!

As always, we’d like to thank you for your ongoing service to people with communication needs and to remind you that we are here to support you with that effort. If you’d like to discuss this topic further, please feel free to use the SLP Discussion Center as the vehicle for an ongoing discussion with your colleagues. Should you have questions regarding these or other Pearson Speech and Language products, we welcome your phone calls at 800-627-7271 or through our website at ags.pearsonassessments.com.

Happy New Year!

Tests Don’t Diagnose . . . You Do! The Difference Between Testing and Assessment

Clinical Café by Tina Radichel, M.S., CCC-SLP

It appears that fall is upon us again. Welcome back to a new season and/or a new school year! After taking a month off from the Clinical Café, this month’s topic addresses a very specific issue—terminology—to reawaken your desire to use the right words with the right people at the right place and in the right time. What exactly do test, assessment, and diagnosis mean? And what impact do the use and understanding of these terms have on clinical decision-making?

Sitting around the AGS Publishing National Speech-Language Advisory Board discussion table in early August, the message was clear—you give a test; you complete an assessment. Who cares, you say? According to our team of colleagues from around the country, we all should! Imprecision in the use of testing language muddies clear communication. The concepts of testing, assessment, and diagnosis continue to be treated as interchangeable by many, although each carries a distinctly different definition and educational value (Mitchell, 1993). If we expect clarity of performance in our clients’ speech-language intervention, then we ought to be just as clear in our own word choices.

A Test is a thing.

Most people are clear on what a test is—it is the “thing” or “product” that measures a particular behavior or set of objectives. The Standards for Educational and Psychological Testing (1999) define test as “an evaluative device or procedure in which a sample of an examinee’s behavior in a specified domain is obtained and subsequently evaluated and scored using a standardized process.” When you give a test, you are taking a “snapshot in time” and making an observation of an individual’s or group’s performance. Usually, a test gives only scores; however, when the test is considered diagnostic, it offers information related to the examinee’s strengths and weaknesses based on the test performance. For example, the PPVT-III is an example of a receptive vocabulary test, while the CASL and the GRADE are diagnostic tests that offer an analysis/profile of the examinee’s strengths and weaknesses in oral language and literacy, respectively.

The problem with the word test is that it has somewhat of a negative connotation in the public arena. No parents want their children to have to be “tested,” and many of us may remember negative or stressful experiences with tests in the past. Based on the definition above, taking a test is simply gathering information in a standard way, and we certainly want to gather the best and most accurate information available. The testing experience is an important consideration, however, especially in this high-stakes arena, which continues to escalate for educational accountability. Tests are key players in this arena.

An Assessment is a process.

An assessment is a more general process of gathering data to evaluate an examinee. You take the information from test data, interviews, and other measures, and pull it all together. An assessment process begins to shape the answer to the question “why did the person/people perform this way?” The Standards (1999) define assessment as “any systematic method of obtaining information from tests and other sources, used to draw inferences about characteristics of people, objects, or programs.” Assessment can also refer to the outcome of that process (e.g., “What is your assessment of Susie’s difficulty?”). You can’t point to, or hold, an assessment (just a report from an assessment process). For example, you might use the GFTA-2 and the KLPA-2 as tests in your assessment process. You might also interview the parent(s) and the teacher. Then you make some overall intelligibility judgments. You watch the student in class or at play. These are all important steps in the assessment process.

The practical problem is that out in the world, test and assessment are sometimes used as synonyms. During a focus group we conducted a few years ago, the moderator asked the question “What assessments do you use?” The attendees were puzzled at first and then responded with the overall assessment processes they use. Had the moderator asked “What tests do you use?” or even “What assessment instruments do you use?” the confusion may have been less. Precision lowers confusion!

A Diagnosis is a decision.

After all the testing is done and you’ve gathered all the information you need and uncovered all the available data, compared it, held it up to the light, put it under a microscope and considered it in context, it is time to make a clinical judgment. “In my professional opinion, based on all the data, the history, and my clinical experience, I believe that the issue is X.” You’ve made a diagnosis—a statement or conclusion about the testing and other information-gathering that you’ve done in the overall assessment process. For example, after you complete the assessment using the GFTA-2 and KLPA-2 tests and other assessment instruments and procedures, you may conclude that the child has a phonological process disorder. You support that diagnosis with test scores, medical history data, interviews, observation, and the like. But the diagnosis is your decision, for which you must use your clinical judgment—and no test or assessment can do that for you.

Why all the fuss over terminology? Are we just splitting semantic hairs? Maybe not. Again, while the word test may not have a great reputation, it is simply one piece of the larger assessment process. A test cannot make a diagnosis; humans do that. Likewise, an assessment is not a diagnosis either. A diagnosis is the result of the assessment process; it explains and defines the “why” of performance data. Both testing and diagnosis are really steps in the larger general assessment process: gathering background information, planning, testing, interviewing, observing, analyzing, interpreting, diagnosing, and recommending. The overarching umbrella to this process is clearly our clinical minds!

Clear terminology in any setting!

It is important to be clear in our communication not only among professionals in the same field, but across fields and, most importantly, with the consumers and clients we serve. We all tend to drift toward smaller, more general vocabularies and imprecise or nonspecific words—the very thing we criticize in our students and clients. Good modeling is key, and the opportunity to use clear communication to educate others about who we are and what we do is always there to be seized!

As always, we’d like to thank you for your ongoing service to people with communication needs and remind you that we are here to support you with that effort. If you’d like to discuss this topic further, please feel free to use the SLPForum as the vehicle for an ongoing discussion with your colleagues. Should you have questions regarding these or other Pearson Speech and Language products, we welcome your phone calls at 800-627-7271 or through our website at http://ags.pearsonassessments.com.

Have a good month!


Mitchell, R. (1993). Verbal confusion. The Council Chronicle, 2(3).

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, D.C.: American Educational Research Association.

Speech Assessment—How Deep Can You Go (in no time at all)?

Clinical Café by Tina Radichel, M.S., CCC-SLP

If speech development were easy, children wouldn’t need speech-language pathologists. But easy it is not. Speech production uses a set of arbitrary sounds and sound combinations that are based on an equally arbitrary set of rules (Kent, 1998, in Bernthal & Bankson, 1998). Unfortunately, children don’t always master these rules in the same way. Enter the need for speech assessment and time-consuming analysis and interpretation.

Looking at speech assessment on a continuum means knowing that each child may require a different level of sound analysis. At a basic level, simply counting errors on a set of single words and comparing that number to a set of national norms may be sufficient in a particular case. At the most complex level, a generative analysis of a child’s sound production in conversation offers a depth and breadth of data that can offer a rich description of the child’s individual sound system. More often than not, however, assessment needs will fall somewhere between these extremes.

How do you determine the place on the continuum that matches a particular child’s needs? Simply stated, it depends (sorry, no easy answers here). You may initially need a quick error count to get a general idea of severity. But then you want more information, so you choose to analyze those errors by type (substitution, deletion, distortion, or addition). Or you want to look at distinctive features (place, manner, and voicing changes, or labials vs. stridents, etc.). Then you decide that information isn’t enough either; you would like to organize the child’s errors by phonological process (cluster simplification, velar fronting, stopping, etc.). And so it goes.
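The first two stops on that continuum—a raw error count and a tally by error type—can be sketched with a few invented target/production pairs. The `classify` function and trial data below are hypothetical, and a full analysis would also handle distortions, additions, and phonological-process coding:

```python
from collections import Counter

# Hypothetical (target, produced) pairs for single target phones.
TRIALS = [
    ("k", "t"),   # e.g., "cat" -> "tat": substitution (velar fronting)
    ("s", ""),    # final /s/ omitted: deletion
    ("r", "r"),   # produced correctly
    ("g", "d"),   # substitution
]

def classify(target, produced):
    """Assign one coarse error type to a single trial."""
    if produced == target:
        return "correct"
    if produced == "":
        return "deletion"
    return "substitution"  # distortion/addition checks would extend this

tally = Counter(classify(t, p) for t, p in TRIALS)
print(dict(tally))
```

Each deeper level of the continuum is essentially a finer-grained `classify`, regrouping the same productions by features or processes instead of coarse error types.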

The ability to move easily between levels of information is key to effective assessment. For example, one of the reasons that the GFTA-2/KLPA-2 combination of tests is so powerful is that the continuum is integrated with one well-controlled, representative normative group. The combination of these two tools increases the validity and ease of moving through the continuum to deeper analysis without jeopardizing the reliability and validity of the data. What’s more, you can stop whenever you determine you have the information you need. It is this philosophical premise of a continuum of assessment and the necessity of flexibility that serves as the foundation of the tests, as well as the newly released, newly designed GFTA-2/KLPA-2 ASSIST scoring and reporting software.

Analysis and Interpretation of Formal Testing

Clinical decision-making in speech-language pathology has always been both an art and a science. Cliché, yes, but true. A formal test or a criterion-referenced checklist can provide you with a wealth of data, but you must then engage your “gray matter” and insert the data into the context of a child’s history, experience, life, and environment. At some point, the test data can tell you no more than a number or set of numbers. You must decide how those numbers fit together and “where to go from here.”

In short, you must use your clinical judgment!
“Egads!” you say. “Think? No, it’s summer! I can’t do that!”

Knowing that test scores tell you only a piece of what the child knows and can do, further dynamic procedures (e.g., the GFTA-2 Sounds-in-Sentences and Stimulability sections) can easily help you broaden the picture of the child’s sound system. These two sections of the GFTA-2 use consonant sounds in an authentic, dynamic way and provide information for analysis and interpretation that is not possible through formal testing alone. Best practices in speech-language pathology and educational psychology have long supported the use of a full range of assessment tools and information-gathering methods to complete an assessment that is valid and leads to intervention (Feuerstein, 1979; Lidz, 1991; Moore-Brown & Montgomery, 2001; Paul, 1995; Schraeder, Quinn, Stockman, & Miller, 1999).

The bottom line of interpretation is simple: while each child’s speech system is unique, there are also a number of very common ways to talk about it. When making interpretive judgments about test scores, test manuals are invaluable resources for clinical decision-making and report-writing. In addition, software that can generate standard wording for describing test scores accurately can do much of the initial report-writing work for you! None of us have a lot of time these days, so efficiency is key to getting your professional interpretations down on paper. Both the GFTA-2 and KLPA-2 manuals offer excellent assistance in the analysis and interpretation of test scores, including special cases and considerations.

Planning Intervention

The rubber meets the road in clinical intervention. No assessment will by itself produce a good speech and language outcome, but excellent assessment tools can give you the necessary foundation for sound thinking in clinical practice. You make the difference in bridging the gap and making the data work for you clinically. Logic indicates that the deeper you go on the continuum of assessment, the more information you have for planning intervention. For example, knowing that 80 percent of a child’s errors are substitutions may help qualify the child for services and describe the test scores. Then you as a clinician must make the leap to determine what targets to pursue in therapy. On the other hand, if you know that those substitution errors are largely errors in the phonological process of liquid simplification, you can determine whether the errors are age-appropriate and which targets to focus on. The more depth of information you have up front, the easier and more effective intervention planning is after assessment. In this outcome-based world, there is no better reason for an integrated continuum of assessment than better, more effective intervention!

While we can’t tell you what specific intervention activities will work with each individual child or group of children, we do want you to be able to spend more time on planning than on “crunching” the data and writing lengthy repetitive reports. Our new GFTA-2/KLPA-2 ASSIST software (brand new design too!) integrates scoring and reporting for both the GFTA-2 and KLPA-2. Check it out at http://ags.pearsonassessments.com/static/a11750.asp.

A Big Thanks!

As always, we’d like to thank you for your ongoing service to people with communication needs and remind you that we at AGS Publishing are here to support you with that effort. If you’d like to discuss this topic further, please feel free to use the SLP Forum Discussion Center as the vehicle for an ongoing discussion with your colleagues. Should you have questions regarding these or other Pearson Speech and Language products, we welcome your phone calls at 1-800-627-7271, or through our website contact form.

Enjoy the complexity of speech assessment!


Bernthal, J. E., & Bankson, N. R. (1998). Articulation and phonological disorders (4th ed.). Needham Heights, MA: Allyn and Bacon.

Feuerstein, R. (1979). Dynamic assessment of retarded performers: The learning potential assessment device, theory, instruments, and techniques. Baltimore: University Park Press.

Lidz, C. S. (1991). Practitioner’s guide to dynamic assessment. New York: Guilford Press.

Moore-Brown, B. J., & Montgomery, J. K. (2001). Making a difference for America’s children: Speech-language pathologists in public schools. Eau Claire, WI: Thinking Publications.

Paul, R. (1995). Language disorders from infancy through adolescence. St. Louis, MO: Mosby.

Schraeder, T., Quinn, M., Stockman, I. J., & Miller, J. (1999). Authentic assessment as an approach to preschool speech-language screening. American Journal of Speech-Language Pathology, 6, 195-200.

Test-Retest Reliability: The Good, the Bad, and the Judgment Calls!

    Clinical Café by Tina Radichel, M.S., CCC-SLP

    An Ode to Retesting

    Test, test, and test again.
    Wait, wait a minute . . . does it matter WHEN?

    Is it weeks or months, days or years?
    Do I have enough info to tell all my peers?

    What will happen to the students’ scores?
    Will kids remember the answers or run out the doors?

    Will the scores be valid, reliable, and true?
    Will I interpret correctly so no one will sue?

    Oh, test publisher, hear my refrain;
    Give me some guidance, it’s taxing my brain!!

One should begin any dry but important topic with a bit of levity…hence the original poem above (inspired by Dr. Seuss, of course)! Welcome to the second issue of the Clinical Café, with today’s “espresso shot” topic of test-retest reliability. We, in Development for Pearson’s Assessment group, get numerous calls each month from professionals around the country regarding rules and strategies for retesting students. So, here’s another “insight” to read with your morning (or afternoon or evening) coffee and share with your colleagues.

Overall, test-retest reliability is an index of temporal stability. It tells how much the individual’s normative score might possibly change on retesting if a period of time has elapsed between test administrations. Change could reflect the person’s growth or fluctuation in the ability being measured, random differences in performance, or the individual’s recollection of the earlier administration. A test-retest coefficient is a statistical measure that is obtained by administering the same test twice, with a certain amount of time between administrations, and then correlating the two score sets. Reliability between two parallel forms with different content is known as an alternate-form coefficient (as in the PPVT-III).
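As a concrete sketch of the statistic itself, a test-retest coefficient is simply the Pearson correlation between the scores from the two administrations. The paired scores below are hypothetical, invented purely for illustration (they are not from any published norming sample):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two paired lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical standard scores for ten examinees, tested twice
first  = [85, 92, 100, 78, 110, 95, 88, 102, 97, 90]
second = [88, 90, 104, 80, 108, 93, 85, 105, 99, 94]

print(round(pearson_r(first, second), 2))
```

The closer the coefficient is to 1.00, the more stable the scores are over the retest interval, which is exactly what the coefficients in the table below summarize for each test.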

When making a decision on retesting, follow the steps below.

  1. Determine why you are conducting a retest: Did the examinee’s performance fall below your expectations due to illness, a bad day, test anxiety, behavior, etc.? If so, you should be able to retest as soon as the examinee is up to it, especially if you use a parallel form. Are you involved in a pre-/post-test situation where you are attempting to measure gain? If so, you’d want to schedule the second administration after the completion of instruction or therapy to determine the effectiveness of the treatment. Has the student recently transferred from a different school? If the original administration was done in another school or setting by a clinician you do not know, and you question the reported results, you may choose to retest.
  2. Locate the section in each of the Pearson test manuals that discusses test-retest reliability:
    Test | Manual Page(s) | Test-Retest Time Interval | Median Interval | Reliability Coefficients
    CASL | 121-124 | 7-109 days | 6 weeks | .65-.96
    EVT | 67-69 | 8-203 days | 42 days | .77-.90
    GFTA-2* | 52-54 | 0-34 days | 14 days | .79-1.00
    KLPA-2* | 66-67 | 0-34 days | 14 days | .79-1.00
    OWLS LC/OE | 123-125 | 20-165 days | 8 weeks | .73-.89
    OWLS WE | 146-147 | 18-165 days | 9 weeks | .87-.90
    PPVT-III Form A | 48-51 | One month | 30 days | .91-.93
    PPVT-III Form B | 48-51 | One month | 30 days | .91-.94
    PPVT-III A & B** | 48-51 | 0-7 days | Same day | .88-.95

    *Read further for an additional issue related to test content.

    **The PPVT-III is unique among Pearson speech-language tests in that it has parallel forms (Forms A & B). As noted in the Alternate-Forms Reliability Coefficients section on page 48 of the PPVT-III manual, most examinees were given both forms of the test on the same day.

  3. Make a clinical decision based on all the information: While some tests provide clinicians with an exact test-retest waiting period, some do not. Much depends on the reason for the retesting. After reviewing the information provided in the manual and following the above steps, you may rely on your professional clinical judgment to determine when to confidently retest. You can be confident of the test’s reliability if your retesting falls within the respective interval in the table above, since you will be matching standardization procedures.
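To make that last check concrete, here is a minimal sketch of testing whether an elapsed retest interval falls within a test’s standardization interval. The dictionary transcribes a few rows of the table above (in days); the function itself is a hypothetical helper, not part of any Pearson product:

```python
# Test-retest time intervals (in days) from a few rows of the table above
retest_windows = {
    "CASL": (7, 109),
    "EVT": (8, 203),
    "GFTA-2": (0, 34),
    "KLPA-2": (0, 34),
}

def within_standardization_interval(test, days_elapsed):
    """True if the retest interval matches the standardization procedures."""
    lo, hi = retest_windows[test]
    return lo <= days_elapsed <= hi

print(within_standardization_interval("GFTA-2", 14))  # → True
print(within_standardization_interval("CASL", 3))     # → False
```

A retest outside the window isn’t forbidden, of course; it simply means the published coefficients no longer describe your situation, and clinical judgment carries more of the load.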

    Test Content and Test-Retest Reliability: A Related Consideration

    Test content also should be considered when determining the need for retesting. For example, articulation is a developmental skill area; you would expect more change in performance earlier on the growth curve (i.e., ages 2-5), which then flattens and stabilizes around 8 years of age. Retesting a young student at a given time interval on the GFTA-2 and/or the KLPA-2 would likely show greater change attributable to growth than retesting a student at 10 years of age. This is one reason the test-retest time interval is narrower for these two tests. In language, skills that are “closed set” (e.g., syntax) should be considered differently than those that are “open set” (e.g., vocabulary, nonliteral language). Vocabulary has a steep growth curve similar to articulation, but it does not flatten at a particular age; it grows throughout the life cycle. Conversely, syntax forms are a closed set; you learn them early, and they are generally static. Taking test content into account makes test-retest decision-making and data analysis that much clearer.

    Dancing the Line of Clinical Judgment and Test Standardization

    Yes, it is sometimes difficult to know what you can and can’t make decisions on. Test standardization is a rigorous process, and we appreciate professionals being concerned about “following the rules.” At the same time, no test can anticipate all the situations and nuances of the clinical arena, so there is a point at which the rules end and your clinical judgment begins. We’re happy to continue to help you clarify the line on which to “dance.” If you want more information on standardized test development, the Development Team at Pearson’s Assessment group is in the process of creating an ASHA-approved CE presentation (available at the end of summer) on the basics of assessment. If you are interested in scheduling a continuing education activity for your school, district, or your state speech-language-hearing association, please contact:

    As always, we’d like to thank you for your ongoing service to people with communication needs, and we are here to support you with that effort. If you’d like to discuss this topic further, please feel free to use the SLP Forum as the vehicle for an ongoing discussion with your colleagues. Should you have questions regarding these or other Pearson Speech and Language products, we welcome your phone calls at 800-627-7271 or use our web site at http://psychcorp.pearsonassessments.com. Oh, yes, and if you’d like to copy the poem at the top, feel free (but don’t forget to cite the Clinical Café)!

    Enjoy the summer!

Articulation and Phonology Are Not Normally Distributed: Who Cares? You Do!

Clinical Café by Tina Radichel, M.S., CCC-SLP

Periodically we see a trend in calls and e-mail questions from customers. Because you are a distinguished member of the SLPForum, we’d like to supply you with a bit of continuing education that may help you and your colleagues in your day-to-day clinical practice. We would also like to offer you a FREE gift in this e-mail, so read on!

Our particular insight this issue relates to the more accurate scores available from the GFTA-2 and KLPA-2 results. Grab a cup of coffee and sit back for a two-minute read that will save you hours of thinking time later.

Since the publication of the GFTA-2, questions have been raised about the difference between scores on the 1986 edition of GFTA and the GFTA-2. The 1986 GFTA norms were percentiles extrapolated from two different databases: the National Speech and Hearing Survey (Hull, Mielke, Willeford, & Timmons, 1976) and the Khan and Lewis work of 1986. The use of two unrelated databases collected at two different points in time is part of the reason for score differences. However, the key reason for the difference in scores lies in how the normative scores were developed.

The psychometrician who worked on the original GFTA norms applied the methods of normative score development based on a “normal” distribution of data. This method did not result in scores that appropriately represented the extremes of the distributions of articulation errors for each age. For example, according to the 1986 norms, a female who was aged 6 years 6 months and made no errors would have a percentile rank of 99. This would mean that only 1 percent of girls that age made no errors. Of course, this is not true. According to the GFTA-2 norms, the percentile rank for girls at 6-6 making no errors is appropriately listed at >65. This means that, at this age, 65 percent make one or more errors and 35 percent make no errors. Speech-language pathologists know, and research on normal articulation development tells us, that this is a more accurate representation.
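The distinction matters because an empirical percentile rank is computed directly from the norming sample, with no normal-curve assumption. A minimal sketch in Python; the error counts below are invented to mirror the article’s 65/35 split, and are not actual GFTA-2 norming data:

```python
# Hypothetical raw error counts for a norming sample at one age band
# (7 of the 20 children make no errors; 13 make one or more)
sample_errors = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 5, 8, 9, 12, 15, 20, 30]

def empirical_percentile_rank(score, sample):
    """Percent of the sample performing worse (more errors) than `score`."""
    worse = sum(1 for s in sample if s > score)
    return 100 * worse / len(sample)

# A child with zero errors outperforms only the peers who made errors
print(empirical_percentile_rank(0, sample_errors))  # → 65.0
```

With this sample, a child who makes no errors gets a percentile rank of 65, not 99: a percentile rank is capped by how many peers actually perform worse, which is exactly the correction the GFTA-2 norms made.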

As stated in the GFTA-2 manual, articulation ability is not normally distributed in the general population in the way many other abilities are. The expectation that children master all sound production by age 8 makes the distribution of articulation scores inherently skewed. Yet many state and district qualification criteria for special services are based on percentiles derived by forcing articulation data into a normalized distribution. Forcing the data in this manner is not appropriate given the course of articulation development.

Here’s another example: If a boy who is 4-6 has a percentile rank of 2 on the 1986 GFTA or on another articulation test with scores developed by forcing the data into a normalized distribution, this would equate to a standard score of 70. This score is two standard deviations (SDs) below the mean and represents a significant difference or distance from average. Alternatively, if this same boy gets a standard score of 70 on the GFTA-2, he would have a percentile rank of 6. This rank of 6 is equivalent to the percentile rank of 2 on a “normalized” distribution or on a test developed by those means. In either case, this child’s articulation is significantly different from normal or average and is in need of remediation.

So you are now saying, “Help! Now what?”

If you need to incorporate GFTA-2 non-normalized distribution results into a qualification system that is based on a normalized distribution system, here’s what you do:

  • Determine the cut-off percentile for services in your state/district (for example, the 10th percentile).
  • Look at the “Percentile Rank to Standard Score Table” in the norms section of one of your favorite tests that is based on the bell curve:
      PPVT-III Norms Booklet, page 44
      EVT Manual, page 172
      CASL Norms Book, page 121
      OWLS LC/OE Manual, page 183
  • Using this table based on a normalized distribution, determine the standard score that equates to your district/state cut-off (the 10th percentile = a standard score of 81).
  • Use this standard score as your qualification criterion instead of the percentile rank on the GFTA-2. In a system based on normalized-distribution criteria, any child who receives a standard score of 81 or below on the GFTA-2 would qualify for services.


As stated in this example, if your school district/state uses a specific percentile (10th), this is equivalent to a standard score of 81 in a normal distribution. The standard score of 81 represents a specific variance from average (regardless of the distribution). Because articulation is not normally distributed, using the standard score of 81 allows you to keep the same reference point (as different from average). The percentiles vary depending on the age of the child, but his or her reference to average does not. Keeping the metric of 81 as your cut-off means that you are serving the children who are similarly discrepant from average regardless of age.
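The steps above rest on the normal-curve relationship between percentile ranks and standard scores (mean 100, SD 15). A minimal sketch using only Python’s standard library, reproducing the article’s example values:

```python
from statistics import NormalDist

# Standard-score scale used by most norm-referenced tests: mean 100, SD 15
norms = NormalDist(mu=100, sigma=15)

# A district cut-off at the 10th percentile maps to a standard score of ~81
cutoff_ss = norms.inv_cdf(0.10)
print(round(cutoff_ss))                # → 81

# Going the other way: a standard score of 70 (2 SDs below the mean)
# sits near the 2nd percentile on a normal curve
print(round(norms.cdf(70) * 100, 1))   # → 2.3
```

Note that this conversion is only valid on the normalized side of the comparison; on the GFTA-2 itself you would read the standard score directly from its (non-normalized) norms tables.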

Compare your test results using this FREE booklet!

Call AGS Publishing Customer Service at 1-800-328-2560 or submit the online contact form to get your complimentary GFTA-2 Supplemental/Developmental Norms Booklet (Ask for item number 11754). Using the information presented in this booklet, you can check, for example, that a male child aged 4-6 should have mastered the articulation of 29 of the 77 sounds possible on the GFTA-2 (using 90% as the acquisition level cut-off). Compare the GFTA-2 test results to the developmental normative data to determine which sounds are developmentally appropriate and which are not. Then you can base your therapy strategy on this information.

Here’s something else . . .

We recently published the Khan-Lewis Phonological Analysis, Second Edition (KLPA-2). The KLPA-2 (item number 11820) provides a norm-referenced, in-depth analysis of overall phonological process usage. A companion tool for the GFTA-2 articulation test, it was designed to provide further diagnostic information on the 53 target words elicited by the GFTA-2 Sounds-in-Words section. This tool will help you deepen your analysis of speech sound patterns.

In closing, we’d like to thank you for your ongoing service to people with communication needs and we are here to support you with that effort. Assessment analysis and interpretation is an important topic for our field; if you’d like to discuss this topic further, please feel free to use the SLPForum Discussion Center as the vehicle for an ongoing discussion with your colleagues.


Should you have questions regarding these or other Pearson Speech and Language products, we welcome your phone calls at 800-627-7271 or use our web site at http://ags.pearsonassessments.com.

Enjoy the spring!