Statistics explained

I.Q. Tests for the High Range

The reports

All of the statistical reports are in the category Statistical reports related to intelligence tests.

Explanations

Below are brief explanations of the statistics found in the reports.

Raw score

The number of problems solved on the test in question, or the sum of the item scores in the rare cases where some of a test's problems give more credit than others. In any case, the direct, unconverted score from the test.

Protonorms

Protonorms are generalized raw scores that allow comparison between scores on all of the tests, but do not contain information as to where that score stands in any population; They are not standard scores or quantiles, and do not have a fixed mean and S.D. Protonorms are established by rank-equating the shared scores (that is, the scores from the same group of candidates) on each combination of two tests. Protonorms can be normed to I.Q.s by combining the data from several tests, thus using many more data points than when each test is normed by itself and therefore reducing the need for interpolation and extrapolation. Also, the relation between protonorms and I.Q. can be reassessed and adjusted from time to time without needing to update all of the existing norm tables of the individual tests; Only one table needs to be changed instead of many.

Preliminary norms

These are usually established in the same way as regular norms, but before there is enough data for a full statistical report; so, in principle based on fewer than 30 test submissions. In some cases they are based on 30 or more submissions but fewer than 70 score pairs actually used. The criteria to outgrow the preliminary stage are: At least 30 submissions AND at least 70 score pairs used in norming, where each test in its own right shows a clear positive correlation with the object test (not just an aggregate correlation over all of the used tests combined).

Rank equation

The principal norming method used is rank equation. A selected group of candidates provides raw scores on the test to be normed, paired to I.Q.s or protonorms on one or more other tests. Both the raw scores and the norms are ranked from highest to lowest. Each raw score gets a rank and each norm gets a rank, and each raw score receives the norm of the corresponding (that is, same) rank. For instance, the highest score has rank 1, the next has rank 2. Do note that missing scores have ranks too; The possible missing scores between the highest and the next have rank 1.5. Missing scores are normed just like existing scores. Scores with tied ranks (two or more persons with that score) get the rank number that is the median of the tied ranks.
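The ranking step described above can be sketched as follows. This is a minimal illustration, not the author's actual procedure: the function name `median_ranks` is hypothetical, and the sketch covers only the ranking of achieved scores with ties, not the ranks of missing scores or the assignment of norms.

```python
from collections import Counter

def median_ranks(scores):
    """Map each distinct score to its rank (highest score = rank 1).

    Tied scores share the median of the rank positions they occupy.
    """
    counts = Counter(scores)
    ranks = {}
    position = 1  # next unoccupied rank position
    for score in sorted(counts, reverse=True):
        n = counts[score]
        # The score occupies positions position .. position + n - 1;
        # the median of those consecutive positions is their midpoint.
        ranks[score] = position + (n - 1) / 2
        position += n
    return ranks

print(median_ranks([40, 38, 38, 35]))  # {40: 1.0, 38: 2.5, 35: 4.0}
```

In the example, the two candidates scoring 38 occupy positions 2 and 3 and therefore both receive rank 2.5.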

An advantage of rank equation over z-score (mean and S.D.) equation is that it produces correct norms in ALL cases, while z-score equation only produces correct norms when both variables (raw scores and norms) are linear and have a close to normal distribution. In reality, raw scores are almost never linear, so that rank equation must be the principal method.

A special case is the top score on a low-ceiling test, which will receive a far too generous norm with this method when the tests against which it is ranked have higher ceilings; In that case, the top score must be normed at the lowest of the ranks corresponding to it, instead of at the median thereof. In extreme cases this ceiling effect expands over the top several scores.

Proportion outscored

This is a number between 0 and 1 that is computed for each particular protonorm or I.Q. as follows: Take the number of achieved scores lower than that norm; Add to that HALF the number of scores with exactly that norm; Divide by the total number of scores.
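The three steps above translate directly into code. The sketch below is illustrative only; the function name `proportion_outscored` and the sample scores are made up for the example.

```python
def proportion_outscored(norm, scores):
    """Proportion of achieved scores outscored by the given norm.

    Scores below the norm count fully, scores equal to it count half.
    """
    lower = sum(1 for s in scores if s < norm)
    equal = sum(1 for s in scores if s == norm)
    return (lower + 0.5 * equal) / len(scores)

scores = [130, 135, 135, 140, 150]  # invented sample
print(proportion_outscored(135, scores))  # 0.4
```

Here one score lies below 135 and two are exactly 135, giving (1 + 1) / 5 = 0.4.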

Proportion outscored is computed by combining the scores of several selected tests, to obtain representative proportions. Proportions outscored of one particular test are not representative, as different tests draw different groups of candidates; The experience is that the harder a test is, the fewer candidates take it, the higher their average intelligence, and the smaller their internal spread.

Proportion outscored is a generalized form of quantile. It can be converted to any quantile; To obtain milliles, round to the nearest thousandth; for centiles, to the nearest hundredth; for vigintiles, to the nearest twentieth; for deciles, the nearest tenth; for quintiles, the nearest fifth; for quartiles, the nearest fourth; for the median, the nearest half.
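As a small sketch of this conversion, the hypothetical helper `to_quantile` below rounds a proportion to the nearest of a given number of divisions (100 for centiles, 10 for deciles, and so on). Note that Python's built-in `round` resolves exact halves to the nearest even number, which may differ from other rounding conventions.

```python
def to_quantile(proportion, divisions):
    """Round a proportion outscored (0..1) to the nearest 1/divisions.

    Returns the quantile number, e.g. 87 for the 87th centile.
    """
    return round(proportion * divisions)

p = 0.873  # invented proportion outscored
print(to_quantile(p, 100))  # centile
print(to_quantile(p, 10))   # decile
print(to_quantile(p, 4))    # quartile
```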

I.Q.

In the context of "I.Q. Tests for the High Range", "I.Q." is an abbreviation of "Intelligence Quantifier", and meant to approximate where a particular score belongs on the scale of adult intelligence. The word "Quantifier" is used instead of the common "Quotient" because I.Q., as currently computed, is in no way a "quotient", that is, an answer to the question "how often" (does the one fit into the other). The word "Quantifier" therefore fits the meaning better: a number quantifying an amount, either continuous or discrete.

This I.Q. is derived directly from the proportion of high-range candidates outscored, using a table such that the resulting I.Q. is comparable to an adult deviation I.Q. when the general population standard deviation is set at 15. This table is based on past experience from the period when high-range tests were anchor-normed to other tests via reported prior scores, and has replaced that method. The table may be adjusted when future studies show this to be necessary. It may be noted that this way of deriving I.Q. does not assume a normal distribution.

The aim is to make I.Q. approximately intervallic (linear) and objective, independent of age, sex, or population. I.Q. is not a true ratio scale in that it does not have an absolute and meaningful zero. An attempt is made, however, to ensure that a given I.Q. corresponds to the same intelligence level across candidates and across time (years, decades, centuries, eras).

Inherent in the present method for deriving I.Q. is that higher I.Q.s are rarer than lower ones (upward of the high-range mode in the low 130s). Some worry that this makes it impossible for a suspected "bump" in the "gifted" range to show up. Such a bump is present in traditional childhood scores. The reply to these worries must be that the notion of a "bump" in the I.Q. distribution is meaningless as long as we do not have a true physical absolute scale for I.Q. (as we have for distance, mass, etcetera), and therefore cannot know if such a bump exists, or what the distribution looks like altogether. We only know the ranking of scores, and do our best to construct an intervallic scale underneath it. The bump in childhood scores is most likely a result of the fact that the method for computing those scores - dividing the mental by the biological age - was an inferior method. An indication that adult deviation I.Q., at least in the below-average to average range, is indeed intervallic, is the virtually linear relation that exists between the physical measure of brain volume and I.Q., when both are averaged across populations.

It is important not to confuse I.Q. with childhood scores (either mental/biological age ratio scores or standard scores by age group), age-corrected or age-based scores for adults, estimated or quoted "I.Q."s of famous people or self-assumed "I.Q."s of megalomaniacs, each of which tends to be much higher than real I.Q.s. While it is true that "I.Q." started out as the ratio of a child's mental and biological age, this concept is meaningless and impossible for adults, and was abandoned decades ago even for children.

Standard score

The term "I.Q." is reserved for tests that aim to give a good indication of general intelligence, typically containing a variety of item types. For one-sided tests, on which one's score may be relatively far below or above one's general intelligence level if one has an uneven aptitude profile, the use of "I.Q." is avoided and a "Standard score" on the same scale (S.D. = 15) is given.

Number of candidates

This is the number of candidates that have taken the test in question; so, the number of scores that have been achieved (Retests are not allowed so this is also the number of first attempts). It is the true number; no selection is made. In interpreting this number one must realize that quality is more important than quantity in statistics. More is not better. A sample of 30 who have taken the test seriously, done their best, and honestly reported their personal information, is far superior to a sample of 400 who have taken a free online test carelessly, without striving for a good score, and without reporting true information. Not to mention the inflated samples one obtains by mixing first attempts with retests. It is a popular fallacy about statistics that a large sample is better than a small sample.

Median

This is the middle score, when the scores are ranked from highest to lowest. When the number of scores is even it is the mean of the middle two scores (But when the middle two scores are some distance apart, really all of the missing possible scores therebetween are median scores). It divides the test population in two equal parts. It is the simplest form of quantile; Other quantiles are those that divide the population in three, four, five, etcetera parts. Best known are centiles. Note that quantiles are the points that divide the population, and not the intervals or slots BETWEEN those points, as is often thought. Also, the number of quantiles, or which are the highest and lowest, is open to some debate. The highest centile is mostly thought to be 99; However, the 100th centile really belongs to any score that is higher than the highest score in the norm group, and the 0th to any score that is lower than the lowest.

Weighted median

This is the middle value of a set of numbers that each have weights attached to them. It is determined by expanding the set of numbers according to their weights. For instance, if the numbers are 5 (weight = 3), 4 (weight = 1), and 7 (weight = 9), the expanded set looks like:

4 5 5 5 7 7 7 7 7 7 7 7 7

The middle value is a 7, so the weighted median of this set is 7.
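The expansion procedure can be sketched in a few lines. The function name `weighted_median` is made up for this example, and the sketch assumes whole-number weights so that each value can simply be repeated.

```python
from statistics import median

def weighted_median(values_with_weights):
    """Weighted median via expansion: repeat each value `weight` times.

    Assumes integer weights; `values_with_weights` is a list of
    (value, weight) tuples.
    """
    expanded = []
    for value, weight in values_with_weights:
        expanded.extend([value] * weight)
    return median(expanded)  # median() sorts the expanded set itself

print(weighted_median([(5, 3), (4, 1), (7, 9)]))  # 7
```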

Quartile deviation

This is half the difference between the 3rd and 1st quartile (the two points that enclose the middle 50% of the population). It is a measure of spread, like the standard deviation, but unlike the latter it is also meaningful in non-linear situations and non-normal distributions. In a hypothetical normal distribution, the quartile deviation is about two thirds (roughly 0.67 times) the size of the standard deviation.
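A minimal sketch of this computation follows. Quartiles can be located by several slightly different conventions; the "inclusive" method of Python's `statistics.quantiles` is used here purely as an assumption, and the reports may use another. The last lines check the normal-distribution ratio with `statistics.NormalDist`.

```python
from statistics import NormalDist, quantiles

def quartile_deviation(scores):
    """Half the distance between the 3rd and 1st quartile."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return (q3 - q1) / 2

print(quartile_deviation([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # 2.0

# In a standard normal distribution (S.D. = 1), the 3rd quartile lies
# at about 0.6745, which is therefore the quartile deviation.
ratio = NormalDist().inv_cdf(0.75)
print(round(ratio, 4))  # 0.6745
```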

Range

The difference between the highest and the lowest score in the sample.

Hardness

This is the proportion of the possible raw score range that is on average missed; In principle it is computed thus:

(maximum possible score - median score)/maximum possible score

As raw scores are almost always non-linear, the median is used instead of the actual average.

Correlations

A correlation is a number between -1 and 1 that denotes the extent to which two variables tend to go together; the extent to which the one tends to go up when the other goes up.

The basic type of correlation is called "Pearson r", and is computed between the actual values of two paired variables. Theoretically this type of correlation requires the variables to be expressed on linear scales, whereon one would use the means and standard deviations as measures of central tendency and spread.

A less used type of correlation is the rank correlation, which is computed between the rank vectors of two paired variables, rather than between the actual values. Each actual value is then replaced by the rank it has in its data set, so that for instance the highest value receives rank 1, the next rank 2, and so on. Rank correlations are appropriate for non-linear scales, whereon one would use the medians and quartile deviations as measures of central tendency and spread. Rank correlations tend to be lower than Pearson r's, as the information contained in the distances between the raw values, and therefore also part of the variance, is thrown away.
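To make the difference between the two types concrete, the sketch below computes both on the same invented data. The helper names (`pearson_r`, `ranks`, `rank_correlation`) are illustrative; the rank correlation is obtained by applying the Pearson formula to the rank vectors, with tied values sharing the midpoint of their rank positions.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two paired lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank each value, highest = rank 1; ties share the midpoint rank."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

def rank_correlation(x, y):
    """Pearson correlation computed between the rank vectors."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 10]   # invented data with one outlying pair
y = [2, 1, 4, 3, 12]
print(round(pearson_r(x, y), 3))         # 0.966
print(round(rank_correlation(x, y), 3))  # 0.8
```

In this example the outlying pair (10, 12) dominates the Pearson r, while the rank correlation discards the distances between the values and comes out lower.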

Correlations can also be computed when one or both of the variables are dichotomies, that is, can only take two values (typically represented by 0 and 1). This type of correlation too tends to be lower than Pearson r's, as the variance is severely limited in a dichotomy.

In principle, correlations in the reports on intelligence tests are given with all of the tests wherewith there are at least five score pairs. If this results in too few usable pairs for norming, the threshold is lowered to four, three or two pairs, until there are enough pairs. "Enough" means at least 70 with a positive correlation.

The correlations are given per test (and not aggregate), as that is the only objective way to obtain correlations. Aggregate correlations (over a number of tests combined) require interpretation of test scores onto a common scale, and are therefore not objective. Aggregate correlations with intelligence tests as seen in statistical reports by some authors should not be compared with the weighted means of objective correlations as explained here, as they (the aggregate correlations) may be inflated or deflated by subjective interpretation.

Interpretation of correlations

A rule of thumb:

Significance of correlations depends on two factors:

  1. Height of correlation - Higher correlations, either positive or negative, are more significant;
  2. Number of pairs - Greater numbers of pairs give greater significance.

Below is a table that for each number of pairs shows the minimal Pearson r correlation required for significance at the .05 level; that is, the level where the probability of that or a higher correlation occurring by chance, if the true correlation were zero, would be 5%. This computation of significance rests on the assumption that correlation values resulting from mere chance ("error") have a "normal distribution". This is the common way of reporting significance in statistics, but it is (in Paul Cooijmans' opinion) of limited practical value and meaning.

For rank correlations, significance should properly be assessed in a different way, to wit by probability calculation. One thereto computes the correlations of all possible pairings between the data sets, and counts how many thereof equal or exceed the value one wants to know the significance of. Then one divides that number by the total number of possible pairings to obtain the significance. This is so labour-intensive that it is only doable for very low numbers of pairs (otherwise the number of pairings becomes astronomical). It is more precise than the common way of reporting significance though, as probability calculation is "hard", while the assumption of a normal distribution of error is "soft".
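The probability calculation for rank correlations can be sketched as an exhaustive enumeration. The code below is an illustration under stated assumptions: the inputs are already rank vectors without ties, the rank correlation is Spearman's rho in its untied form, and the function names are hypothetical. With n pairs there are n! pairings, so this is only practical for small n.

```python
from itertools import permutations

def spearman_rho(rank_x, rank_y):
    """Rank correlation for untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

def exact_significance(rank_x, rank_y):
    """Share of all possible pairings whose correlation equals or
    exceeds the observed one."""
    observed = spearman_rho(rank_x, rank_y)
    total = 0
    at_least = 0
    for pairing in permutations(rank_y):
        total += 1
        if spearman_rho(rank_x, pairing) >= observed:
            at_least += 1
    return at_least / total

# Five pairs whose ranks differ only by one swap of neighbours (rho = .9)
p = exact_significance([1, 2, 3, 4, 5], [1, 2, 3, 5, 4])
print(round(p, 4))  # 5 of the 120 pairings reach .9 or higher
```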

# pairs    Minimal correlation required for significance at .05
5          .98
6          .88
7          .80
8          .74
9          .69
10         .65
11         .62
12         .59
13         .57
14         .54
15         .52
16         .51
17         .49
20         .45
25         .4
30         .36
40         .31
50         .28
70         .24
100        .2
200        .14
500        .09
1000       .06
10 000     .02
100 000    .01
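The tabulated values appear to agree with the normal approximation 1.96/√(n - 1), where n is the number of pairs and 1.96 is the usual two-sided cutoff at the .05 level. This is an observation about the numbers, not a formula stated in the text, so the sketch below should be read as an assumption that happens to reproduce the table; the function name `minimal_r` is hypothetical.

```python
from math import sqrt

def minimal_r(n_pairs, z=1.96):
    """Approximate minimal correlation for significance at .05,
    assuming chance correlations are normally distributed with
    standard deviation 1/sqrt(n_pairs - 1)."""
    return z / sqrt(n_pairs - 1)

for n in (5, 10, 30, 100, 1000):
    print(n, round(minimal_r(n), 2))
```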

On g factor loadings

Reports for particular tests give conservatively estimated minimum g factor loadings. These are obtained by taking the square root of the weighted average of that test's correlations with other tests. Actual g factor loadings, insofar as available, are in the report called "Correlations between Cooijmans tests (g factor loadings)". These are usually higher than the estimates in the other reports, but it takes many years to collect enough data for factor analysis, so that this is only possible for a limited number of tests that have been in use for a long time.

Actual g factor loadings tend to be higher because they are based on a set of tests among which all of the intercorrelations are known; That is, for which the correlation between every combination of two tests is known; That is, for which there is a group of candidates of whom each has taken each of the tests. Such a situation is rare and occurs only after many years between higher-quality tests and higher-quality candidates, resulting in relatively high correlations and high g factor loadings. The estimated loadings however are based on all of the known correlations of other tests with the object test, no selection being made for quality or height of correlation, so that includes correlations with lower-quality tests with which a good test "should" correlate weakly or negatively, resulting in lower g loadings.
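The conservative estimate described above (square root of the weighted average correlation) can be sketched as follows. The text does not state what the weights are; weighting each correlation by its number of score pairs is an assumption made here for illustration, as are the function name and the sample figures.

```python
from math import sqrt

def estimated_g_loading(correlations):
    """Square root of the weighted mean correlation.

    `correlations` is a list of (r, n_pairs) tuples; weighting by
    n_pairs is an assumption, not something the text specifies.
    """
    total_pairs = sum(n for _, n in correlations)
    weighted_mean = sum(r * n for r, n in correlations) / total_pairs
    if weighted_mean < 0:
        raise ValueError("weighted mean correlation is negative")
    return sqrt(weighted_mean)

# Invented example: r = .6 over 10 pairs and r = .8 over 30 pairs
print(round(estimated_g_loading([(0.6, 10), (0.8, 30)]), 3))  # 0.866
```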

Correlation with national I.Q.s

This is the correlation of raw scores with the national average I.Q. of the candidate's country of origin as published by Lynn and Vanhanen in "I.Q. and the Wealth of Nations".

Robustness

This is the reversed correlation between the chronological ranks and the scores of the test submissions, raised by 1; that is, 1 minus that correlation. Higher numbers mean greater robustness, and a value of 1 means the scores are perfectly stable over time. Lower values mean the scores are rising over time. This statistic is experimental, and its interpretation will have to be learnt by comparing it across statistical reports of different tests.
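One reading of this definition is robustness = 1 - r, where r is the correlation between submission order and score. The sketch below follows that reading and makes two assumptions the text does not confirm: the earliest submission has chronological rank 1, and a Pearson correlation is used. The function names and scores are invented.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two paired lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def robustness(scores_in_order):
    """1 minus the correlation between chronological rank and score.

    `scores_in_order` lists the scores from earliest to latest
    submission (assumed convention).
    """
    chronological_ranks = list(range(1, len(scores_in_order) + 1))
    return 1 - pearson_r(chronological_ranks, scores_in_order)

# Invented scores that drift upward over time give a value below 1
print(round(robustness([20, 22, 19, 25, 24]), 3))  # 0.318
```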

Reliability

Reliability should be understood as the correlation between two hypothetical very similar versions of the test. Some (less accurately) call this "test-retest correlation". A good reliability for an intelligence test is .9 or higher. Reliability is mostly computed internally, for instance by splitting the test into halves and computing the correlation therebetween, after which a correction is applied (Spearman-Brown formula) to obtain the reliability for the test as a whole. This is called the "split-half method".
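The correction step of the split-half method uses the standard Spearman-Brown formula for a test of doubled length, 2r / (1 + r). A minimal sketch, with a hypothetical function name and an invented half-test correlation:

```python
def spearman_brown(half_correlation):
    """Reliability of the whole test from the correlation between
    its two halves (Spearman-Brown correction for doubled length)."""
    return 2 * half_correlation / (1 + half_correlation)

print(round(spearman_brown(0.8), 3))  # 0.889
```

A correlation of .8 between the halves thus corresponds to a reliability of about .89 for the whole test.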

Another formula is "Cronbach's alpha", sometimes called "internal consistency method". This uses the covariances between all of the individual test items to estimate the mean of all possible split-half reliabilities, which is effectively the lower limit of the test's actual reliability. A somewhat simplified version of "alpha", for tests with only dichotomous items, is called "Kuder-Richardson 20". Note that Cronbach's alpha itself is applicable to such tests too and gives the same result.
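The sketch below computes Cronbach's alpha in its common variance form, k/(k - 1) × (1 - Σ item variances / variance of total scores), and Kuder-Richardson 20 for dichotomous items, to show that they coincide. Population variances are used throughout as an assumption; the function names and the tiny data set are invented.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """item_scores: one row per candidate, one column per item."""
    k = len(item_scores[0])
    item_vars = [pvariance(col) for col in zip(*item_scores)]
    total_var = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def kr20(item_scores):
    """Kuder-Richardson 20 for items scored 0 or 1."""
    n = len(item_scores)
    k = len(item_scores[0])
    # p = proportion of candidates solving the item; p*(1-p) is its variance
    pq = sum((sum(col) / n) * (1 - sum(col) / n) for col in zip(*item_scores))
    total_var = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - pq / total_var)

data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(cronbach_alpha(data), 3))  # 0.75
print(round(kr20(data), 3))            # 0.75
```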

Standard error of measurement

Standard error is the standard deviation of the expected error; That is, the standard deviation of an individual's scores if it were possible to take the test repeatedly without a learning effect between the test administrations. The rule of thumb for interpreting standard error: One's true score on the test in question lies with 95 % probability between plus and minus TWO standard errors from one's actual score.

In interpreting standard error one may also consider that its value really only applies to the middle part of the test's score range, and loses meaning at the edges. In general, standard error is only meaningful where the scale on which it is expressed is linear, which is not everywhere and always the case.

Standard error (SE) is computed by combining a test's reliability (Rel) with its raw score standard deviation (SD):

SE = SD × √(1 - Rel)

Quartile error of measurement

Quartile error is the quartile deviation of the expected error; That is, the quartile deviation of an individual's scores if it were possible to take the test repeatedly without a learning effect between the test administrations. The rule of thumb for interpreting quartile error: One's true score on the test in question lies with 50 % probability between plus and minus one quartile error from one's actual score.

Quartile error is computed similarly to standard error, but with the test's raw score quartile deviation instead of its standard deviation.

Range of error of measurement

Range of error is the full range of the expected error; That is, the full range of an individual's scores if it were possible to take the test indefinitely without a learning effect between the test administrations. The rule of thumb for interpreting the range of error: One's true score on the test in question lies with virtual certainty within a band of one range of error.

Range of error is computed similarly to standard error, but with the test's raw score range instead of its standard deviation.
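Since the three error measures above share one formula and differ only in the measure of spread that is plugged in, a single function can illustrate all of them. The function name and the sample figures are invented for this sketch.

```python
from math import sqrt

def error_of_measurement(spread, reliability):
    """spread x sqrt(1 - reliability).

    Pass the raw score standard deviation for the standard error,
    the quartile deviation for the quartile error, or the range
    for the range of error.
    """
    return spread * sqrt(1 - reliability)

rel = 0.91  # invented reliability
print(round(error_of_measurement(6, rel), 2))   # standard error, S.D. = 6
print(round(error_of_measurement(4, rel), 2))   # quartile error, Q.D. = 4
print(round(error_of_measurement(30, rel), 2))  # range of error, range = 30
```

With a reliability of .91 the factor √(1 - Rel) is .3, so the three results are 1.8, 1.2 and 9.0.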

Resolution

This measure reflects the number of consecutive possible raw scores that fit in a unit of spread:

Resolution = (raw score quartile deviation)/16

For tests that consistently give scores in half points, this value should be doubled. 16 is by and large the greatest quartile deviation found in my tests (In a normal distribution, a quartile deviation of 16 implies a standard deviation of 24). The reason for this division is to arrive at a number between 0 and 1, comparable to the several other measures of test quality. By having them all in the same order of magnitude, they can easily be combined into a general higher-level measure of test quality. Without this need for combining, the undivided value would already be sufficient as a measure of resolution.

Quality of norms

This experimental statistic reflects both the number of score pairs used in norming and their correlations with the object test:

Quality of norms = (weighted sum of correlations of tests used in norming)/70

70 is, by experience, the minimum number of pairs needed for acceptable norms. The reason for this division is the same as that mentioned under Resolution. The interpretation of this statistic will have to be learnt by studying its occurrence in future reports. So far it seems norms based on 60 submissions yield higher values than those based on 30. Also, "Quality of norms" may in the future be used to decide how many of the available correlating tests to take for rank-equation; For instance, a minimum value (yet to be set) on this statistic might be applied to determine that.

Test quality

An experimental measure of total test quality, derived from robustness (rob), validity (val), reliability (rel), hardness (hard), resolution (res) and quality of norms (norm) as follows:

Test quality = √((rob² + val² + rel² + hard² + res² + norm²)/6)

As a measure of validity, the estimated or actual g factor loading is used in the case of intelligence tests. When one or more of the statistics are missing they may be left out and the divisor adjusted appropriately.
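The measure is the root of the mean of the squared component statistics, with the divisor equal to the number of components actually available. A minimal sketch, in which the function name `overall_quality`, the use of `None` for a missing statistic, and all sample values are assumptions for illustration:

```python
from math import sqrt

def overall_quality(rob=None, val=None, rel=None,
                    hard=None, res=None, norm=None):
    """Root mean square of the available component statistics.

    Components passed as None are left out and the divisor is
    reduced accordingly.
    """
    available = [v for v in (rob, val, rel, hard, res, norm) if v is not None]
    return sqrt(sum(v * v for v in available) / len(available))

# All six components present (invented values)
print(round(overall_quality(0.9, 0.8, 0.9, 0.5, 0.4, 0.7), 3))  # 0.726

# Robustness and quality of norms missing: divisor becomes 4
print(round(overall_quality(val=0.8, rel=0.9, hard=0.5, res=0.4), 3))
```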

Section statistics

When a test comprises sections or subtests that are treated as tests in their own right, these have statistical reports of their own. When the sections of a test are not treated as tests in their own right, the intercorrelations between the sections are reported in the report of the main test.