Friday, 23 August 2019

Why the WAIS test works less well at the high end

Scaled scores - miss two items, drop a SD
Scaled scores - everything above a certain raw score obtains 19

One of the stumbling blocks faced by a researcher hoping to investigate high cognitive ability in adults is finding a test battery with sufficient discriminatory power at the higher end of the ability distribution. This article briefly mentions how the structure of the tests, especially the Wechsler tests, may artificially compress scores at the extremes.

David Wechsler believed that the value of his tests lay primarily in their clinical interpretation. He insisted on the retention of certain items because he believed the answers patients gave might offer practitioners insight into their mental state, as much as for their utility in measuring reasoning ability. Furthermore, he took a hostile view towards those wanting to use his tests to measure high IQ, insisting that they had never been designed with that purpose in mind (Kaufman & Lichtenberger, 2009). Wechsler was also quite keen that scores on his tests, despite the lack of a theoretical psychometric grounding in their early iterations, would follow a normal distribution.

One day, someone in a Facebook group linked to a magazine article that I found extremely curious. I will not post the link, as the page was on a hardline political website; the article itself, however, contained no political content. It discussed the possibility of intelligence following a Paretian, rather than normal, distribution, and how the construction of the WAIS test artificially forced a normal distribution (Pennington, 2016). Using factual knowledge as an example (represented by the "Information" subtest on the WAIS), the author pointed out that most people know only a tiny amount of the world's accumulated knowledge, while a few know anywhere from 10 times to 1,000 times what the average citizen does. The WAIS test, she argued, was heavily loaded with simple items while limiting the range of item difficulty, compared to what would be theoretically available from the entire world of knowledge. In fact, the item of median difficulty on the test, missed by 50% of testees, is not an especially difficult one for those at the highest levels of cognitive ability, and indeed there may be few or no items that challenge the very brightest individuals at their "level". Lumping everyone beyond a certain IQ level into the "highly superior" category tells us nothing about the range of possible individual variation within that category, which runs the whole gamut from the brighter end of educated professionals to the truly rare individuals who occur once or twice in a generation.

The other two issues I am going to mention have to do with how the test is scored: raw scores are mapped onto age-normed "scaled scores" ranging from 1 to 19, with the average being 10.

Linked to the first issue above is the fact that no matter how bright you are, and even if you obtain a maximum raw score on a subtest, the maximum scaled score is 19. So even if a subtest contained a greater than usual number of difficult items, and you happened to correctly answer all of them, you would get a scaled score of 19 just the same as the guy who only just answered enough items correctly to get a 19. In other words, everyone beyond a certain ability range is just lumped into one point category.
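The ceiling described above can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: real scaled scores come from age-normed lookup tables, whereas the hypothetical function `scaled_score` below uses an idealised linear conversion (mean 10, standard deviation 3) from a latent ability `z`, then clamps the result to the 1-19 range.

```python
def scaled_score(z: float) -> int:
    """Idealised conversion (an assumption, not the real norm tables):
    mean 10, SD 3, clamped to the 1-19 scaled-score range."""
    unclamped = round(10 + 3 * z)
    return max(1, min(19, unclamped))


# Very different latent abilities at the top end collapse onto 19.
for z in (0.0, 2.0, 3.0, 4.0, 5.0):
    print(f"z = {z:+.1f} -> scaled score {scaled_score(z)}")
```

Under this sketch, anyone three or more standard deviations above the mean lands on the same scaled score, however far above that point they actually are.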

Does this mean a preponderance of maximum scores? Not necessarily. There is always measurement error, and the extremes of the scale are more likely to be affected by it because of the test construction. Let's take a hypothetical subtest with 30 items (it doesn't matter what the task is for the purpose of this illustration). An average score on this subtest for a person of our hypothetical test-taker's age is anywhere between 12 and 18, all of which map onto a scaled score of 10. On the other hand, the person might have to get all 30 items correct to get a 19. Further, there might be no raw score corresponding to an 18, as a 29 on the subtest yields a scaled score of 17, and a raw score of 28 yields a 16. Thus, Joe Average makes two unlucky mistakes, and could well be still within that average category (since the scaled score of 10 covers several raw scores on the subtest). Joe Triple Nine makes two unlucky mistakes, and his score on the subtest is busted from 19 to 16 - an entire standard deviation, since scaled scores have a standard deviation of 3.
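The arithmetic of the hypothetical 30-item subtest can be written out as a small Python sketch. The lookup table contains only the raw-to-scaled mappings stated above (12-18 map to 10; 28, 29 and 30 map to 16, 17 and 19); it is not a real WAIS norm table. The name `drop_in_sd` and the choice of a raw score of 16 for Joe Average are assumptions made for illustration.

```python
# Hypothetical norm table: only the mappings given in the example above.
RAW_TO_SCALED = {raw: 10 for raw in range(12, 19)}  # raw 12..18 -> scaled 10
RAW_TO_SCALED.update({28: 16, 29: 17, 30: 19})      # no raw score maps to 18

SCALED_SD = 3  # Wechsler scaled scores: mean 10, standard deviation 3


def drop_in_sd(error_free_raw: int, mistakes: int) -> float:
    """How many SDs the scaled score falls after some unlucky mistakes."""
    before = RAW_TO_SCALED[error_free_raw]
    after = RAW_TO_SCALED[error_free_raw - mistakes]
    return (before - after) / SCALED_SD


print("Joe Average (raw 16, two mistakes):", drop_in_sd(16, 2))
print("Joe Triple Nine (raw 30, two mistakes):", drop_in_sd(30, 2))
```

The same two mistakes cost Joe Average nothing and cost Joe Triple Nine a full standard deviation, because each raw point carries far more weight at the top of the table than in the middle.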

I do not believe that the solution is to go looking for difficult so-called tests made by hobbyists. The reasons for this are many, not least their small sample sizes, lack of theoretical grounding, and lack of statistical validation; this may be the topic of a separate article. There may well be other standardised tests with better discriminatory power than the WAIS at the high end of the ability distribution. One possibility is the SB5, with its greater emphasis on power-test administration (difficulty prioritised over speed) and its experimental extended scale for measuring hypothetical IQs over 160 (s.d. 15). Others may also be available for neurocognitive research.

As to the question of whether intelligence follows a normal distribution, I would want to see more convincing evidence that it does and, if that is what the research shows, that the finding is not simply an artifact of how the most commonly used tests are constructed.



Refs:

Kaufman, A.S. & Lichtenberger, E.O. (2009). Essentials of WAIS-IV Assessment. New York: Wiley.
Pennington, R.W. (2016). The Paretian Distribution of Intelligence. [Online article].