Rubin, J. (2011). Organizing and evaluating results from multiple reading assessments. The Reading Teacher, 64(8), 606–611.
Rubin presents a scheme for aggregating students’ scores on four very different reading assessments to get a single score based on the notion of independent, instructional, and frustration levels. The four assessments used here probably represent those most commonly used by classroom teachers and reading specialists: standardized test scores, cloze tests, informal reading inventories, and running records. Rubin suggests converting each result to a value indicating whether it falls at the independent, instructional, or frustration level for that particular student, and then averaging those values to obtain a single numerical score. Although I honor any attempt to make sense of assessment data, there are too many problems with Rubin’s approach, and with the way it is presented here, for me to recommend doing what he suggests. To make the mechanics concrete, I have sketched the scheme as I read it below; after the sketch is a list of the things that raised red flags for me.
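The sketch assumes the standardized-test percentile cutoffs Rubin reports (85–100, 70–84, below 70); the level values for the other three assessments are placeholders of my own, since the article does not tie its criteria to any cited source.

```python
# A minimal sketch of the conversion-and-averaging scheme as I read it.
# Only the standardized-test percentile cutoffs come from the article;
# everything else is an illustrative placeholder.

def percentile_to_level(percentile):
    """Rubin's reported mapping for standardized-test percentiles."""
    if percentile >= 85:
        return 3  # "independent"
    if percentile >= 70:
        return 2  # "instructional"
    return 1      # "frustration"

# Hypothetical student: the cloze, inventory, and running-record results are
# assumed to have already been converted to 1/2/3 by whatever (uncited)
# criteria Rubin has in mind.
level_scores = {
    "standardized_test": percentile_to_level(72),  # -> 2
    "cloze_test": 3,
    "informal_reading_inventory": 2,
    "running_record": 1,
}

composite = sum(level_scores.values()) / len(level_scores)
print(composite)  # 2.0 (the single number the scheme would report)
```

Most of my objections can be seen in those few lines: unexplained cutoffs, a percentile in the 70s labeled merely “instructional,” four very different instruments given equal weight, and a rich set of scores collapsed to one number.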
1. Score ranges are given for independent, instructional, and frustration levels for each of the four assessments, but we are given no specific information about how those ranges were determined. In the case of informal reading inventories, I suspect the criteria for the independent, instructional, and frustration levels are based on the well-known Betts Criteria, but those criteria are not cited here, nor is any other source. Similarly, the criteria used for the cloze test and running records probably came from the literature on those assessments, but those sources are not cited. I have no idea where the score ranges for standardized tests came from, but there is an even bigger problem with those, which I address in point 2 below. I can infer where most of these criteria came from, but they need to be specifically referenced; otherwise it appears as if Rubin pulled the score ranges out of the air. Even in a short practitioner-oriented article like this, such information is important. Only someone with an advanced background in literacy assessment could make the inferences I did; some practitioner readers will not have that background and would need the references on which the score ranges are based. An editor should have insisted they be there before publication.
2. Rubin’s use of standardized test percentile scores here is shockingly inappropriate. He states that percentiles of 85–100 represent the independent level and are given a score of 3, percentiles of 70–84 represent the instructional level and are given a score of 2, and any percentile below 70 is considered frustration level and is given a score of 1. Furthermore, Rubin inappropriately writes about the ranges in terms of letter grades (e.g., the 85–100 percentile range is called the “high B to an A” range, the 70–84 range is called the “C to middle B” range, and percentile scores below 70 are equated with a “below a C” level). Percentiles are NOT the same as percentages, and should definitely not be assigned letter grades! A percentile of 70 is NOT frustration level; it is a score well above the middle of the distribution, since the 50th percentile, by definition, marks the midpoint of the norm group. With percentiles, a score of 50 is not a “failing grade,” as a score of 50 percent would be; it is a score at the center of the distribution. (A short demonstration of this distinction appears after the numbered list.) Rubin’s misconception of what percentiles mean is a major faux pas. The distinction between percentiles and percentages is something we emphasize heavily at the undergraduate level at my own institution; it is a basic understanding in the assessment field. It is unforgivable that at least one reviewer did not catch this error, and it makes me wonder what is being taught (or not) in undergraduate (and graduate!) courses today.
3. On top of all that, I found myself wondering why in the world I would want to aggregate measures that look at such different things anyway. Wouldn’t it be more useful to look at each measure separately to see where each child’s strengths and needs lie? Do I really want to boil a child’s literacy levels down to one composite score? What would I use such a score for?
4. The way the four assessments are described makes it sound as though, in each case, we are looking at only one reading passage at a time, at one grade level. That may not be the case, especially with informal reading inventories, where children read passages at several grade levels; it can also be true of standardized tests, and of multiple cloze tests and multiple running records. We are not told here whether the scores represent multiple reading levels or multiple assessments.
5. Can we really aggregate such dissimilar measures? Each of the four looks at a different kind of reading, and they are not equivalent enough to make an aggregate score meaningful. Even more concerning, averaging the four scores implies that all four assessments are of equal weight and equal quality, and that is probably not at all the case. These measures will differ widely in reliability and validity, and, moreover, there are many informal reading inventories and many standardized tests, representing a wide range of quality. It makes a difference WHICH inventory or test we use, but no specific ones are mentioned here, and all are treated as if they were created equal.
6. A more esoteric (but still important) issue is whether it is really desirable to reduce these data from the interval (continuous) level, where percentiles and percentages sit, to the less flexible ordinal level, a form of ranking, which is what the 1-2-3 scale here is. We can do a lot more statistically with percentiles and percentages than we can with rankings; ordinal data simply do not allow the options that interval data do. The small example below shows how much information the 1-2-3 compression throws away.
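Here is a tiny illustration of that compression. The cutoffs are the standardized-test ones Rubin reports; the particular percentile values are arbitrary examples of my own.

```python
# The 1-2-3 conversion discards the distinctions the original scores carry.
# Cutoffs are the standardized-test ones reported in the article; the
# percentile values below are arbitrary examples chosen for illustration.
def percentile_to_level(percentile):
    if percentile >= 85:
        return 3
    if percentile >= 70:
        return 2
    return 1

for pct in (70, 84, 5, 69):
    print(pct, "->", percentile_to_level(pct))
# 70 -> 2 and 84 -> 2: a 14-point difference vanishes.
# 5 -> 1 and 69 -> 1: a student at the 5th percentile and one at the 69th
# are treated identically.
```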
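And here, as promised under point 2, is a short demonstration of why a percentile is not a percentage. It assumes a test reported on a normally distributed standard-score scale with a mean of 100 and a standard deviation of 15, a common convention that the article does not specify.

```python
# Where do the 50th, 70th, and 85th percentiles fall on a normal scale with
# mean 100 and SD 15? (The scale itself is an assumption for illustration.)
from statistics import NormalDist

scale = NormalDist(mu=100, sigma=15)

for pct in (0.50, 0.70, 0.85):
    print(f"{int(pct * 100)}th percentile -> standard score {scale.inv_cdf(pct):.1f}")

# 50th percentile -> standard score 100.0  (the middle of the norm group)
# 70th percentile -> standard score 107.9  (above the mean, hardly a failing grade)
# 85th percentile -> standard score 115.5
```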
There are just too many problems here for me to recommend this approach or this article. I am disappointed that the editors and reviewers did not require that some of these problems be addressed before the article was accepted for publication.