Frequently encountered fallacies regarding test-related statistics

An average can be disproved by a single deviating value

No single value can disprove an average; it is normal for a set of numbers to contain values far removed from the set's average. For instance, if the average I.Q. in a population is 90, this in no way prevents I.Q.'s of 150, 160, 30, 20, or any other value from occurring in that population. The average is therefore not disproved by pointing to a single value, however high or low.

A correlation between two tests can be derived from a single score pair

No, a correlation always requires multiple score pairs to be computed. One single pair, such as formed by the scores of one candidate on two different tests, forms just one data point and gives no information whatsoever as to the correlation between those tests. For instance, if the candidate scores I.Q. 115 on the one test and I.Q. 140 on the other, this still leaves open all possible outcomes for the correlation between those tests, including a perfect correlation (unity, 1.00).
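As an illustration, here is a minimal Python sketch of why one score pair leaves the correlation open. The helper `pearson_r` and both three-candidate samples are made up for this example. Each sample contains the pair (115, 140), yet one correlates perfectly positively and the other perfectly negatively, while the pair on its own yields no correlation at all.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of paired scores; None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None  # a single pair has no spread, so r cannot be computed
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # no variation in one of the tests
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# The single pair from the text: undefined.
single = pearson_r([115], [140])

# Two hypothetical samples that both contain the pair (115, 140).
perfect_pos = pearson_r([100, 115, 130], [125, 140, 155])  # second score = first + 25
perfect_neg = pearson_r([100, 115, 130], [155, 140, 125])  # second score = 255 - first

print(single, perfect_pos, perfect_neg)
```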

A correlation between two tests can be disproved by a single score pair

No single score pair can disprove a correlation. Even a correlation of .99 does not exclude occasional score pairs that are "opposite" to what the correlation suggests. For instance, if a candidate scores I.Q. 145 on the one test and 130 on the other, this does not disprove a possible high positive correlation between those tests, nor does it imply a negative correlation.

A correlation between A and B implies that A is the cause of B

No, a correlation does not imply causality in that way, because (1) the correlation may not be statistically significant, and (2) the direction of causality may be different. Causality is only implied with a probability that depends on the correlation's significance, and the direction of causality may be from A to B, or from B to A, or there may be a common cause behind A and B that makes them go up or down together. The last possibility in particular is often overlooked, leading to conclusions that do not follow.

High-range tests, unlike mainstream psychological tests, are not standardized

This is false; high-range tests are standardized. The word "standardized" means that a test has been normed to yield standard scores, that is, scores based on z-scores, which express distance to the mean in standard deviation units. For instance, I.Q.'s indicate a score's distance to the mean in units of 1/15 standard deviation, the mean being defined as I.Q. 100, but many other forms of standard scores exist. When high-range tests yield standard scores, they are standardized, and there is no justification for calling mainstream psychological tests "standardized" or "standard" to distinguish them from high-range tests.
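The conversion from a raw score to a standard score can be sketched as follows. The raw score, norm mean, and norm standard deviation are invented numbers, and the T-score (mean 50, standard deviation 10) is included only as one example of another standard score.

```python
def z_score(raw, norm_mean, norm_sd):
    """Distance of a raw score to the norm mean, in standard deviation units."""
    return (raw - norm_mean) / norm_sd

def to_iq(z, sd=15):
    """I.Q. as defined in the text: mean 100, one point = 1/15 standard deviation."""
    return 100 + sd * z

def to_t_score(z):
    """T-score, another common standard score: mean 50, standard deviation 10."""
    return 50 + 10 * z

# Hypothetical norms: raw scores with mean 30 and standard deviation 4.
z = z_score(raw=38, norm_mean=30, norm_sd=4)
iq = to_iq(z)
t = to_t_score(z)

print(z, iq, t)  # 2.0 130.0 70.0
```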

One can know what a test measures by looking at the contents of that test

No, such "face validity" tends to be mistaken. Psychological tests often measure something rather different from what they appear to measure as assessed intuitively based on their contents. Only statistics such as validity coefficients and factor loadings reveal what a test measures. The notion that one can know what a test measures by looking at its contents has been termed the "topographical fallacy".

A high g loading means that a test has a high ceiling, is valid in the high range

No, a test's g loading only applies to the measured range of the test, and says nothing about the test's ceiling or level of difficulty. For instance, most mainstream tests are aimed at the I.Q. range 70-130 and lose validity outside that range, however high the g loadings they have within their measured ranges.

The average I.Q. of a group has no relevance because individual members of the group may still have I.Q.'s far above that average

This is false because the bulk of I.Q.'s in a group are situated closely around the average, given a more or less normal distribution as real-world groups have. Additionally, a fact that makes a group's average I.Q. particularly relevant is that the proportions of the group that fall above or below certain societally relevant threshold values in I.Q. are directly dependent on the group's average, and change sharply when the average goes up or down. These sharp changes result from the shape of the normal distribution (and real-world I.Q. distributions follow the normal distribution quite well, especially within plus or minus two standard deviations from the average). An example is I.Q. 80, which is about the minimum needed to function and live independently in modern society. With a standard deviation of 15: if the group average is I.Q. 110, about 2 % fall below 80. If the average is I.Q. 100, about 9 % fall below 80. If the average is I.Q. 90, about 25 % fall below 80. If the average is I.Q. 80, 50 % fall below 80. These differences in proportions falling below or above this and other threshold values have vast societal consequences which in themselves fall outside the scope of this explanation of statistics. The above example goes to illustrate that groups of different average I.Q. will function differently in society, and that a rise or fall of average I.Q. will have dramatic societal consequences.
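The proportions below a threshold can be checked with a short sketch, assuming a normal distribution and a standard deviation of 15 (the function name `share_below` is made up for this example). The exact percentages depend on the standard deviation assumed: for an average of 100, a standard deviation of 15 gives roughly 9 % below I.Q. 80, while a standard deviation of 16 gives roughly 11 %.

```python
from statistics import NormalDist

def share_below(threshold, group_mean, sd=15.0):
    """Proportion of a normally distributed group scoring below the threshold."""
    return NormalDist(mu=group_mean, sigma=sd).cdf(threshold)

for group_mean in (110, 100, 90, 80):
    print(f"average {group_mean}: {share_below(80, group_mean):.1%} below I.Q. 80")
```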

"Gender" is a polite, civilized, appropriate word for "sex"

No, the words "sex" and "gender" refer to different concepts, and the modern use of "gender" as a politically correct euphemism for "sex" is mistaken. Statistical studies of sex differences have often been erroneously termed, and categorized as, "gender differences". In actuality, "sex" refers to a physical state, which may be male, female, or in rare cases something in between (intersex), while "gender" denotes a psychological identity that is entirely self-proclaimed and can not be verified externally. "Sex" depends on tangible physical features like chromosomes and body parts, while "gender" is "how you feel inside" (male, female, or something in between). Sex and gender of an individual need not be the same.

The purposeful confusion of sex with gender that we have seen over the past decades serves a political purpose, to wit the denial of a male/female dichotomy, and concomitant promotion of the ideological doctrine that we are all on a continuous scale, somewhere in between male and female, and should not categorize humans as belonging to either sex. In reality, almost everyone is simply male or female, and in-between cases (intersex individuals, hermaphrodites) are extremely rare. Gender, on the other hand, is more fuzzy than sex, so by confusing sex with gender, political activists create the illusion that sex itself is a continuum, which it by no means is.

For clarity, transsexuals/transgenders are generally not intersex, and in fact the concept of transsexuality/transgenderism exists only by the grace of a male/female dichotomy, since transsexuality/transgenderism by definition requires sex and gender of an individual to be opposite.

A statistical fact can disprove a non-statistical, qualitative judgment

It can not, if there is no logical contradiction between the fact and the judgment. For instance, the judgment "The wages in sector X in our country are too low to live on" is not disproved by the statistical fact "The wages in sector X in our country are already the highest of all countries in Europe". There is no logical contradiction, because the fact that the wages are the highest of all countries in Europe does not imply they are high enough to live on. Sophistry like this is often used in political discourse.

An even more fraudulent form of this fallacy is to use a single statistical number (detached from the distribution to which it belongs) to counter a judgment. For example, the judgment "It rains a lot in country X" is not disproved by the statistical fact "It rains only 3 % of the time in country X"; at the very least, the corresponding percentages for all countries should be considered to see where country X lies in that distribution! Similarly, conclusions like "Humans and chimpanzees are closely related because they share 99 % of their D.N.A." do not follow unless one shows where this value lies in a distribution of percentages of shared D.N.A. of many two-species combinations. One is here making use of the "low" or "high" impression that the mere numbers 3 % and 99 % make on most people, but one is not proving one's point at all!

"Verbal" means "oral" (spoken)

No, in psychometrics, "verbal" means "in words" and may refer to either written or spoken language. "Verbal" therefore does not mean "spoken", although it is in non-psychometric contexts sometimes used thus. The popular confusion of "verbal" with "oral" gives rise to the confusion of talkativeness with verbal ability.

Studies of sex differences in children are representative of sex differences in adults

No; such studies contain sex biases favoring females. This is so because girls mature quicker than boys, both mentally and physically, and from puberty onward, the boys catch up. So, a study of sex differences undertaken just before puberty will contain strong biases in favor of females. In the social sciences, one not infrequently sees such studies being employed to "debunk" male superiority in some field. This is a form of scientific fraud.

Individual cases of different respective within-group centiles are comparable

No, a fair comparison requires that the cases one compares have the same relative level within their respective groups. An example of this fallacy is to compare a world-class female sprinter to an average man and say, "See, it is not true that men run faster than women!" For a fair comparison, one would need a woman and man who are both world-class, or both average, for instance. Another form of this error is to display a highly educated, well-spoken immigrant next to an underclass "White trash" native citizen. The media and politics are constantly bombarding us with this species of fraud to manipulate public perception.

Someone who solves a test in less time has a higher I.Q.

This is not true even on most mainstream, timed tests. It has been found that when more time is allowed for an I.Q. test administration, the g loading of the test rises. It has also been found that the speed at which someone solves easy problems does not correlate with one's ability to solve difficult problems. It appears that test-taking speed is not a component of g (general intelligence) but rather lies in the non-cognitive personality domain and correlates positively with the trait of extraversion. A rule of thumb in test construction is that when a test is administered without a time limit, the ranking of candidates should not change compared to a timed administration; if it does, the time limit is deemed too short.

Test-taking speed is the same as elementary cognitive tasks like reaction speed and decision speed

No; a difference is that elementary cognitive tasks correlate with g and are indeed g's building blocks, while test-taking speed correlates not with g but rather with the non-cognitive trait of extraversion. See also the related fallacy "Someone who solves a test in less time has a higher I.Q."

Persons high in g (and/or introversion) are slow thinkers who need untimed tests to reach their potential

This fallacy rests on the related fallacies "Someone who solves a test in less time has a higher I.Q." and "Test-taking speed is the same as elementary cognitive tasks like reaction speed and decision speed". So, (1) persons high in g are not "slow thinkers" but may have varying test-taking speeds depending on their other (non-cognitive) traits, and (2) introverted people may take more time on tests, but that does not imply that their reaction speed, decision speed, et cetera are low, as those elementary cognitive tasks are distinct from test-taking speed.

Measurement error applies to group averages

This mistake is sometimes made when differences between group averages are dismissed with formulations like "The difference falls within the error margin so it is not significant". Error of measurement, however, applies to individual measurements made with the given instrument. When an average is computed, the error is evened out by the principle of aggregation: error is by definition random in direction, sometimes upward and sometimes downward, so when a large number of measurements are averaged, these errors largely cancel each other out, and the error remaining in the average shrinks roughly with the square root of the number of measurements. The error of measurement of an instrument can therefore not be applied to averages of that instrument's measurements.
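The principle of aggregation can be sketched with a small simulation. All numbers are hypothetical: an individual error of measurement of 5 I.Q. points, groups of 100 members, and (as a simplification) the same true score for every member, so that only the measurement error varies. For independent random errors, the error remaining in an average is the individual error divided by the square root of the number of measurements.

```python
import math
import random
from statistics import fmean, pstdev

def error_of_mean(individual_error_sd, n):
    """Error remaining in an average of n measurements with independent random errors."""
    return individual_error_sd / math.sqrt(n)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
TRUE_SCORE = 100.0       # simplification: same true score for every member
ERROR_SD = 5.0           # hypothetical error of measurement per individual
GROUP_SIZE = 100
N_GROUPS = 2000

# Measure many groups and record each group's average.
group_means = [
    fmean([TRUE_SCORE + rng.gauss(0, ERROR_SD) for _ in range(GROUP_SIZE)])
    for _ in range(N_GROUPS)
]

observed = pstdev(group_means)                   # spread of the group averages
predicted = error_of_mean(ERROR_SD, GROUP_SIZE)  # 5 / sqrt(100) = 0.5

print(f"error of one measurement: {ERROR_SD}")
print(f"predicted error of a group average: {predicted}")
print(f"observed spread of group averages:  {observed:.2f}")
```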

A valid I.Q. test never contains knowledge- or vocabulary-related items, or any verbal items at all

In fact, I.Q. tests have contained such items since the start of I.Q. testing (early twentieth century), and non-verbal, "culture-free" tests only arose decades later. The reason that knowledge- or vocabulary-related items function well is that people of higher intelligence pick up and retain more knowledge, which in turn may be because they have better working memories, and it is the working memory that stores information in the long-term memory. As a result, knowledge- or vocabulary-related items are among the item types with the highest g loadings, even when factor-analyzed in a battery whose other tests are all non-verbal (so, their high g loadings do not simply result from the presence of similar item types in other tests, as is sometimes mistakenly suggested).

Non-verbal "culture-free" or "culture-fair" tests, which are mostly visual reasoning tests, may have high g loadings as well, but have also proved to be non-robust, that is, have undergone much score inflation over the twentieth century as people got more acquainted with the item type, test contents, and solving strategies. Relying on these one-sided and inflationary tests for assessing "giftedness" or selecting for I.Q. society membership has been a major cause of inflation of "giftedness" and of I.Q. society membership. There is a yet unproven notion that the high g loading of some non-verbal tests is actually an artifact caused by the strong score inflation ("Flynn effect") such tests have undergone.

In addition, it has been observed that test candidates lose quite a few I.Q. points by avoiding (partly) verbal tests, misled by the prejudice of "it can not be a valid test if it contains verbal items". If verbal ability is your strongest side, you will be at an obvious disadvantage on a purely non-verbal test. When such people eventually try a partly verbal test, they discover with amazement that their scores get higher, not lower.

I.Q. test scores can be anonymized by removing all personal details

Alas, someone's combination of scores on a number of tests is itself as personal as a fingerprint! If you have access to such an "anonymized" database and happen to know what a particular candidate has scored on a few particular tests (three, even two tests would usually be enough) you can easily find that combination of scores and readily see that candidate's scores on all other tests taken by the candidate.
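A minimal sketch of such a lookup follows. The database, pseudonymous keys, test names, and scores are all invented for illustration. Knowing one score still leaves two candidates, but knowing two scores singles out one record and exposes that candidate's remaining scores.

```python
# Hypothetical "anonymized" database: pseudonymous key -> scores per test.
records = {
    "c001": {"Test A": 132, "Test B": 141, "Test C": 128, "Test D": 150},
    "c002": {"Test A": 132, "Test B": 137, "Test C": 145},
    "c003": {"Test A": 118, "Test B": 141, "Test C": 122, "Test D": 139},
}

def matching_candidates(records, known_scores):
    """Keys of all records that contain every known (test, score) combination."""
    return [
        key
        for key, scores in records.items()
        if all(scores.get(test) == value for test, value in known_scores.items())
    ]

one_known = matching_candidates(records, {"Test A": 132})
two_known = matching_candidates(records, {"Test A": 132, "Test B": 141})

print(one_known)              # ['c001', 'c002']
print(two_known)              # ['c001']
print(records[two_known[0]])  # all other scores of that candidate are now visible
```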

You would need extremely many candidates to create a test that measures at high levels like the 99.9th centile

The I.Q. level up to which a test measures depends solely on the difficulty of the test contents, not on the number of people who have taken the test. If a test is too easy to measure up to the 99.9th centile, you can let a trillion people take it and it will still not measure there. If a test is hard enough to measure up to the 99.9th centile, it will measure there even when no one has taken the test yet. Put differently, the number of test submissions does not affect a test's ceiling; it merely affects the quality of the test's norms.