“National IQ papers must be retracted”: Why Kevin Bird and Rebecca Sear don’t get it
A recent article by Samorodnitsky, co-authored by two renowned censorship champions, namely Kevin Bird and Rebecca Sear, demands that all published papers that have ever used Lynn & Vanhanen (2006) national IQs (or subsequent revised versions, up to Becker & Lynn, 2019) be retracted. They ignore a plethora of evidence suggesting that the L&V and B&L IQ data are robust, despite the questionable data quality for lower-IQ countries. This is a case of non-classical measurement error. Many other variables commonly used in economics or psychology display a similar form of non-classical measurement error, sometimes with quite dramatic biases in one or both tails of the distribution due to misreporting. The right question to ask is how the biases can be corrected, not whether the research and its authors should be cancelled. Econometricians have proposed a wide variety of techniques to deal with non-classical measurement error, and national IQ researchers have in fact already employed some robustness analyses. This article will dispel the logical fallacies used to negate IQ research.
SECTIONS
1. Non-random error and systematic bias: Nothing new.
2. “Poor” quality data is the rule, not the exception.
3. Robustness check has been used in National IQ studies.
4. Do national assessments reflect cognitive ability?
Recently, a paper by Eppig et al. (2010) was retracted on the basis that L&V NIQ data “contain substantial inaccuracies and biases that throw substantial doubt on inferences made from them, and that these problems had not been resolved”. Given that Eppig et al. used African IQs from Wicherts et al. (2010a) as a robustness check on L&V, it is very likely that the true, unofficial reason lies elsewhere, as the editors later suggested when they pointed to “the potential harms created by using a dataset that appears to portray human populations in some geographical regions as of below normal intelligence on average”. This is concerning because the paper argued that a potential factor explaining NIQ differences is the prevalence of infectious disease. The authors were thus supportive of an environmental explanation, rather than a genetic one. And yet the paper was retracted. Given this, I expected more sinister plans to come. And they did come, faster than I expected.
1. Non-random error and systematic bias: Nothing new.
Samorodnitsky et al. argued that the Lynn and Vanhanen national IQ data (L&V NIQ) are composed of unrepresentative samples and are riddled with error, especially for African countries. They cite Wicherts et al. (2010a), who showed that: 1) L&V’s inclusion criteria are inconsistent, 2) the lower an African IQ estimate, the higher the likelihood that L&V would exclude it (though given 1), intent is hard to prove), and 3) a large portion of the data is of extremely poor quality. While L&V estimated the sub-Saharan African IQ to be 70, Wicherts et al. conducted a systematic review and meta-analysis, concluding that the African IQ should be 82 based on various IQ tests (excluding Raven) or 80 based on Raven’s matrices. Wicherts et al. (2010b) re-estimated African IQs using a stratified sampling method and, restricting the data to what Lynn & Meisenberg (2010) deemed representative (which reduced the samples even more), reported IQs above 85 for adults and 76/78 for primary/secondary school children.
Samorodnitsky et al. also cited Sear (2022), who critiqued the L&V data using arguments similar to Wicherts’, except that her article has no added value beyond what Wicherts (2007) already brought to the table. Even worse, Sear argued that using children’s IQ artificially depresses the true score because psychometric test scores are affected by age. Yet it is basic knowledge that IQ scaled-score formulas are derived from age-specific norms, as was done and explained by Becker & Lynn (2019, pp. 25-26). The rest of Sear’s (2022) article revolves around the argument that the same ability is not measured across countries.
The problem highlighted by these authors is similar to what economists call non-classical measurement error. Under the assumption of classical measurement error, a simple Instrumental Variables (IV) model can fix the problem (Hausman, 2001). When the error is not classical, it is correlated with the true value of the variable, with other variables in the model, or with the errors in measuring those values (Bound et al., 2001; Hyslop & Imbens, 2001), i.e., the errors are no longer random. In this case, error correction becomes challenging but can still be achieved, for instance with IV regression conditioning on and averaging over a control variable (Hahn & Ridder, 2017). In fact, various methods have been designed to handle a variety of variable types, model assumptions, and error structures (Black et al., 2003; Hu & Wansbeek, 2017; Lenis et al., 2017; Murad et al., 2016; Schennach, 2016). If non-classical errors display systematic biases, an effective strategy is the use of validation data (Schennach, 2016), ideally under the assumption that the validation source is itself measured with error (Bingley & Martinello, 2017; Kane et al., 1999; Meyer & Mittag, 2021) – an assumption often ignored in past research. Fortunately, methods have also been developed to circumvent the need for validation data (Schennach, 2016).
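To make the distinction concrete, here is a minimal simulation (all numbers hypothetical): under classical error, OLS attenuates the slope and a second, independently mismeasured report of the same variable can serve as an instrument; under mean-reverting (non-classical) error, even that IV estimator is no longer consistent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0, 1, n)             # true regressor (variance 1)
y = 1.0 * x + rng.normal(0, 1, n)   # true slope = 1

# Classical error: noise independent of x -> attenuation toward zero
x_cl = x + rng.normal(0, 1, n)
b_ols = np.cov(x_cl, y)[0, 1] / np.var(x_cl)        # ~0.5 (reliability 0.5)

# IV fix: a second, independently mismeasured report of x as instrument
z = x + rng.normal(0, 1, n)
b_iv = np.cov(z, y)[0, 1] / np.cov(z, x_cl)[0, 1]   # ~1.0, consistent

# Non-classical case: error correlated with the true value (mean-reverting),
# observed = 0.6*x + noise, i.e., error = -0.4*x + noise
x_nc = 0.6 * x + rng.normal(0, 1, n)
b_iv_nc = np.cov(z, y)[0, 1] / np.cov(z, x_nc)[0, 1]  # ~1.67, IV now biased

print(round(b_ols, 2), round(b_iv, 2), round(b_iv_nc, 2))
```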
It has been recognized in economics research that the assumption of classical error is often untenable, especially in self-report survey data (Bound et al., 2001). The problem is compounded when non-response biases strongly affect both tails of the distribution (Bollinger et al., 2019). Indeed, worrying methodological issues and poor quality data have often plagued economics research. Yet economists continually propose novel methods for correcting non-random measurement errors, whether through complex models or through better methods of data collection that reduce errors (Celhay et al., 2024). But this is not what Samorodnitsky et al. propose, as their goal is not the progress of scientific research: their question has never been “how do we deal with poor quality data?”.
2. “Poor” quality data is the rule, not the exception.
Why do non-random errors occur? There are a variety of reasons (Angel et al., 2019; Bound et al., 2001), but an important one is social desirability. Indeed, there is some indication that the degree of misreporting is sometimes correlated with the level of social desirability (i.e., the need to respond in a culturally appropriate manner) attached to the underlying variable. This is especially true for highly sensitive questions such as income, aid, weight, etc. The problem is not going to evaporate any time soon (Gnambs & Kaspar, 2017; Meyer et al., 2015). Below I present some interesting cases displaying non-random errors.
1. Self-reported income. Countless papers have noted that misreporting errors are concerning. Moore et al. (2000) argued that income misreporting is quite common and often due to cognitive load and systematic bias. Furthermore, people tend to under-report their income, especially the non-salary component (e.g., interest and dividend income), which displays large under-report biases. There are also instances of measurement error displaying mean-reverting bias, i.e., persons with low earnings overstate their earnings and persons with high earnings understate theirs, the consequence of which is lower variance (Bollinger et al., 2019; Bound et al., 2001; Gottschalk & Huynh, 2010; Kim & Tamborini, 2014, Table 3; but see Bingley & Martinello, 2017); a simulation of this mean-reverting pattern follows this list. Misreporting also differs between racial groups: highly educated workers report their earnings more accurately than less educated workers, and Blacks with high earnings under-report to a greater degree than comparable Whites, while Blacks with low earnings over-report to a greater degree (Kim & Tamborini, 2014). That education correlates with the accuracy of earnings reports had been confirmed earlier by Bound et al. (1994). And not just social desirability but also sociodemographic variables (such as age, education, degree of urbanization, country of birth) have a non-negligible impact on misreporting rates (Angel et al., 2019, Table 8).
2. Unemployment duration. Abrevaya & Hausman (1999) reported that in the CPS about 37% of unemployed workers overstated their unemployment durations, with longer spells of unemployment (i.e., uninterrupted periods of time) having a higher proportion of reporting errors, often due to “focal responses” (e.g., at a number of weeks that corresponds to an integer month amount, like four or eight weeks). Bound et al. (2001, p. 3801) reached the same conclusion. Abrevaya & Hausman’s solution was to employ the Monotone Rank Estimator (MRE), which handles measurement error in the dependent variable that is not independent of the covariates.
3. Work accident under-reporting. Probst & Estrada (2010) found that under-reporting is more prevalent in working environments with a poorer (perceived) organizational safety climate and weaker supervisor enforcement of safety policies. Thus, the likelihood of under-reporting increases when working conditions are less safe.
4. Education and any other bounded variables. Kane et al. (1999) stated: “However, there is little reason to believe that the measurement error in self-reported education is classical. In fact, since the most widely used measures of educational attainment are categorical in nature (usually measured in discrete years or degrees), the measurement error generally will not satisfy the classical assumptions (Aigner, 1973). For example, one would expect the measurement error in any categorical measure to vary with the level of education reported, since individuals in the lowest education category can never under-report their education and those in the top education category cannot over-report. Unfortunately, without the classical assumptions traditional IV estimates of the return to education are no longer consistent.” This means that, for any bounded variable (e.g., categorical or binary), the error must be negatively correlated with the true value. Kane et al. found evidence of non-classical error in both survey and administrative data. Bingley & Martinello (2017) confirmed Kane et al.’s suspicions and found that the measurement error in education is negatively correlated with its true value, under the model assumption that both survey and administrative data are affected by non-classical error.
5. Self-reported height, weight and BMI. A large number of studies (Brener et al., 2003; Gorber et al., 2007; Gunnell et al., 2000; Haroun & Ehsanallah, 2024; Lin et al., 2012; Roth et al., 2013) discovered that errors in self-reported height and weight were not randomly distributed. Shorter individuals tend to over-report their height, and overweight (underweight) individuals tend to under-report (over-report) their weight. There is also a gender difference in bias: men generally over-report their height whereas women generally under-report their weight. The impact of such misreporting on the estimated prevalence of obesity is large (Bauhoff, 2011). Because self-reported height and weight display non-random errors, BMI is positively correlated with the absolute error of BMI (Davillas & Jones, 2021). Despite all these problems, a meta-analysis found that these self-report measures and the actually measured variables were highly correlated (Rios-Leyvraz et al., 2023).
6. Self-reported SAT/ACT and grades. Cole & Gonyea (2010) found that self-reported scores were strongly affected by desirability bias, and students largely over-estimated their scores. Unsurprisingly, reported scores from lower achieving students were much less accurate than those from higher achieving students. Rosen et al. (2017) found that a higher GPA predicts a higher probability of accurately reporting the Algebra I grade. This confirms an earlier meta-analysis by Kuncel et al. (2005), who found that lower school performance was associated with considerably lower accuracy in self-reported grades, while actual scores and self-reported scores were strongly correlated. The fact that self-reported college GPA exhibited a large over-report bias (d=1.38) yet still correlated strongly with actual college GPA (r=0.90) is another illustration that means and correlations are not necessarily interrelated; see also Portela et al. (2010, Table 4).
7. Covariates (education and IQ). This last point, which relates to the previous one, is the most important because it affects so many variables. People with higher cognitive ability (who also achieve higher education) understand survey questions better and thus provide more accurate responses. That is, data quality decreases with lower IQ. Several cases are worth mentioning. Choi & Cawley (2018) found that college graduates show better accuracy in self-reported health compared to less educated people. Jerrim & Micklewright (2014, Figure 1) and Kreuter et al. (2010, Tables 4-5) found that students who agree with their parents on their parents’ education level have slightly higher achievement scores. Salvucci et al. (1997, pp. 160-162) reported three important findings: 1) students with higher scores on the achievement tests in the HS&B data had a higher degree of concordance in their responses about family background, 2) Whites have a slightly higher degree of concordance than Hispanics, who in turn have a slightly higher degree of concordance than Blacks, and 3) students over-reported their parents’ high school education and under-reported their parents’ postsecondary education. Thus, whenever a study involves regression with IQ/education variables, the assumption of random error may often be untenable.
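Items 1 and 4 share the same signature: an error term negatively correlated with the true value. A minimal simulation with invented earnings figures shows how mean-reverting misreporting shrinks the variance while leaving the rank ordering largely intact:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_earn = rng.normal(50_000, 15_000, n)   # hypothetical true earnings

# Mean-reverting misreport: low earners overstate, high earners understate
reported = (true_earn + 0.3 * (true_earn.mean() - true_earn)
            + rng.normal(0, 3_000, n))

print(true_earn.std(), reported.std())        # variance shrinks (~15k -> ~11k)
err = reported - true_earn
print(np.corrcoef(true_earn, err)[0, 1])      # error negatively correlated with truth
print(np.corrcoef(true_earn, reported)[0, 1]) # yet rank ordering stays strong (~0.96)
```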
Two patterns are most noticeable: 1) social desirability is pervasive, and 2) the lower the cognitive ability or education, the lower the accuracy of reports. This is relevant to the L&V NIQ data, because lower NIQs seem to correlate with poorer data quality. Whether the errors are random is open to discussion and the debate is not settled yet. Wicherts et al. (2010a) observed that the lower an African IQ, the lower the probability it was included in Lynn’s dataset. It is difficult to prove this was intentional because Lynn’s criteria were not consistent at all. Whatever the case, researchers have already used a variety of methods to deal with this issue.
3. Robustness check has been used in National IQ studies.
Since Wicherts’ study is so highly regarded by Samorodnitsky et al., why do they not cite all the studies that used robustness checks? Perhaps because the available studies showed that the various methods used did not change the main results at all.
1. A group of researchers employed winsorization (a minimal sketch of this procedure follows the list). Jones & Nye (2011) used NIQ (L&M 2010), GDP per capita and corruption to predict unpaid parking tickets, and the results were robust after winsorizing to a minimum of 75 and then 80. Hafer & Jones (2015, Table 3) used NIQ (L&M 2010) to predict global entrepreneurship, along with Gini, GDP per capita, % manufacture and economic freedom as other predictors, and found that the results were unchanged after winsorizing the lowest IQ values to 80. Jones & Potrafke (2014) analyzed the relationship between NIQ and institutional quality, using all available versions of the L&V NIQ data (2002, 2006, 2010 and 2012) and then winsorizing all NIQs to 76 and then 80; the results were robust to data version and winsorization. Potrafke (2012) found that NIQ (L&V 2006 and L&M 2010) was negatively related to corruption after various controls, and winsorizing NIQs to 76 or 80 actually increased the influence of IQ. Potrafke (2019) examined the relationship between NIQ and patience and found that winsorization at either 76 or 80 had a small effect on the relationship. Hafer (2016, Table 3) found a relationship between L&V 2012 NIQ and various measures of saving (liquid liabilities, private credit, bank assets), and the regression coefficients were not affected by the upward correction applied to African IQs (Hafer did not say which method he used, but I assume it was winsorization, similar to his prior studies). Hafer (2017a, Tables 3-4) used L&V 2012 NIQ to predict GDP growth and welfare growth, controlling for a variety of demographic variables, and found that the relationship was robust even after adjusting African IQs to a value of 80. Hafer (2017b) employed a regression with NIQ (L&V 2012), real GDP per capita and some institutional variables to predict female entrepreneurial activity; after raising the minimum IQs to 80, the standardized coefficient of IQ jumped from 0.574 to 0.674. Carl (2014, Tables 3-4) found that the correlations between GDP per capita and NIQ and between generalized trust and NIQ were strong and similar whether NIQ was measured using Rindermann (2013) or L&V 2012 with winsorization to IQ 80. Jones & Podemska-Mikluch (2010) found that NIQ is correlated with US Treasury holdings (a proxy for international capital flows) and that the result is robust to winsorizing IQs below 80 up to 80. Minkov et al. (2016, Table 5) used pathogen prevalence, GDP, Gini, life history strategy, cool water condition and education as predictors of NIQ. Pathogens, GDP and Gini were not significant and thus were dropped, while both life history and cool water were strong predictors of NIQ; the results were robust even after assigning an IQ of 82 to all countries whose IQs, according to L&V, were below 82, and after using math achievement from TIMSS instead of L&V 2012. Loy (2018) employed an instrumental variables regression to predict earnings management, with secrecy as instrument and NIQ (L&M 2010), log GDP, investor protection and accounting enforcement as predictors, and found that the relationship was unchanged after setting the minimum IQ values to 80.
2. Some researchers used Wicherts’ African IQs as robustness checks on L&V. Hassall & Sherratt (2011) analyzed the relationship between NIQ and a variety of outcomes (infectious disease, nutrition, education, GDP, distance from Africa, temperature); these correlations were strong and robust regardless of whether they used the L&V 2006 data with or without Wicherts’ African IQs. Madsen (2016, Table 4) analyzed the impact of PID-IID (parasite-infectious disease + iron-iodine deficiency) on either NIQ (with Wicherts’ African IQs) or PISA, with education and pupil-teacher ratio as covariates and either foreign health aid, pathogen prevalence or density of pathogens as instrumental variable; the regression coefficients were somewhat different depending on whether NIQ or PISA was used as dependent variable, but the overall conclusion did not change. Figueredo et al. (2021) evaluated how much zoogeographic regions account for the variation in NIQ: they reported proportions of variance of 76.6% and 63.9% for the B&L 2019 NIQs and Wicherts’ correction to African IQs, respectively, a difference that is overestimated because R² is a biased measure. Omanbayev et al. (2018, Table 6) found that NIQ predicted lower air pollution after controlling for GDP, democracy, trade and population size, and the IQ coefficient was almost identical whether Wicherts’ African IQs were used or not. Daniele & Ostuni (2013, Table 7) used infectious disease to predict NIQ, along with education, GNI per capita and temperature as predictors, and found that the regression coefficients of the predictors were generally similar regardless of whether Wicherts’ African IQs were used or not. Dama (2013, Tables 3-4) found that NIQ (L&V 2006) predicted the variation in birth sex ratio after controlling for demographics, and the standardized coefficients were somewhat attenuated (by 0.100) but still strong after using Wicherts’ African IQs. Voracek (2013) examined the relationship between the Failed State Index (a measure of state vulnerability) and NIQ (L&V 2012) and noted that the strong negative correlation was unaffected when applying Wicherts’ African IQs. Gallup (2023) used child survival rate, log GDP per capita and education expenditure to predict NIQ (L&M 2010 with Wicherts’ African IQs) and TIMSS math/science/reading; the overall conclusion was unchanged, and child survival was the only predictor that was highly significant. Salahodjaev (2015, Table 2) conducted a regression with NIQ (L&V 2012), money supply, GDP growth, Central Bank independence, deficit, trade, democracy and socialism to predict inflation, and found that the negative coefficient of NIQ was robust after applying Wicherts’ African IQs. Webster & Duffy (2016) used NIQ (L&V 2002) and Quality of Human Conditions and their interaction to predict disbelief in God; results were not affected by the use of Wicherts’ African IQs, which they argue is not surprising given the strong relationship between L&V’s and Wicherts et al.’s IQs.
3. Other researchers decided to drop the African IQs. Nikolaev & Salahodjaev (2016, Tables 3-4) examined the relationship between NIQ and happiness inequality using the L&V 2012 data, and found that the regression coefficient was only slightly attenuated after removing the African IQ data. Obydenkova & Salahodjaev (2017, Table 8) analyzed the impact of NIQ and its interaction with government size on life satisfaction, and found that most regression coefficients were very similar after removing the dummy variable for African countries, except for NIQ, which increased from 0.0058 (ns) to 0.0213 (sig .05). Jones (2012) analyzed the relationship between NIQ and productivity (TFP) growth, controlling for education, life expectancy, years open to trade, ethnolinguistic fractionalization, and tropics; the relationship was strong even after the inclusion of an African country dummy variable.
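For readers unfamiliar with the procedure used by the first group of studies, winsorizing from below simply raises every value under a chosen floor to that floor before rerunning the model. A toy sketch with invented values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical national IQs and a toy outcome; all values invented
niq = rng.normal(88.0, 10.0, 60)
outcome = 0.05 * niq + rng.normal(0, 0.3, 60)

# Winsorize from below: raise every value under the floor to the floor
niq_w = np.maximum(niq, 80.0)

# The robustness check: compare slopes before and after winsorization
b_raw = np.polyfit(niq, outcome, 1)[0]
b_win = np.polyfit(niq_w, outcome, 1)[0]
print(round(b_raw, 3), round(b_win, 3))   # similar slopes -> robust result
```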
These papers confirm that NIQs have external validity. If the estimates and rank ordering of NIQs were so unreliable, they would not correlate with anything. No one claimed that correlates at the group level will always be replicated at the individual level, since within-group factors could be subsets of the between-group factors. In the case of NIQ research, these correlates often hold within countries and make theoretical sense. There is an important lesson to be drawn from the large discrepancy in African IQ means between Wicherts and L&V combined with the extremely high correlation between Wicherts’ and L&V’s NIQ datasets: reports on regression analyses should focus on the strength of the relationship rather than on the predicted means, due to the possible intercept bias at the lower end.
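The coexistence of a large mean gap with a near-perfect correlation is easy to demonstrate with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical L&V-style national IQs, and an alternative dataset that
# scores the same countries ~8 points higher plus a little noise
lv = rng.normal(85, 10, 50)
alt = lv + 8 + rng.normal(0, 2, 50)

print(np.mean(alt - lv))            # large mean gap (~8 points)
print(np.corrcoef(lv, alt)[0, 1])   # yet correlation ~0.98
# Means (intercepts) can be biased while the rank order stays intact,
# which is why slopes matter more than predicted means here.
```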
Since the debate is about the plausibility of the low IQ of African countries, Sear (2022) doubts this value is even theoretically possible because it would mean that African people would “be on the verge of intellectual impairment”. But Sear doesn’t understand the difference between familial and organic mental retardation. Long ago, Arthur Jensen stated that “the familial retarded are biologically normal individuals who deviate statistically from the population mean because of the same factors that cause IQ variation among all other biologically normal individuals in the population”. But deviations caused by organic retardation (which involves impaired mental functioning) are not due to the same factors. This is crucial because Jensen observed that the White/Black ratio in terms of organic retardation is 4/1. This matches what Jensen typically observed during his teaching days: White children with an IQ of 70 were often incapable of doing the most basic things, while Black children with an IQ of 70 behaved quite normally.
Samorodnitsky et al. later objected that convenience samples cannot be averaged to produce national IQs. There are four counter-arguments. First, meta-analytic estimates cancel out errors assuming no systematic bias; even higher data quality won’t prevent systematic bias if the average is mainly composed of elite samples. Second, using both L&V’s and Wicherts’ IQ data alleviates the problem, since the criticism (especially Sear, 2022) focuses strongly on African IQ. Third, weighting the NIQs by their quality or representativeness is one way to reduce the influence of poor quality data on the final regression estimate (a sketch of such weighting follows below). Fourth, if national assessment data are more representative, they would serve as effective robustness checks, assuming they are viable proxies for IQ.
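The third counter-argument amounts to a weighted mean. A sketch with hypothetical samples and quality weights:

```python
import numpy as np

# Hypothetical IQ samples for one country, with quality/representativeness
# weights on a 0-1 scale (all numbers invented)
iqs = np.array([66.0, 74.0, 71.0, 69.0])
quality = np.array([0.2, 0.9, 0.6, 0.4])

niq_unweighted = iqs.mean()
niq_weighted = np.average(iqs, weights=quality)

# Weighting pulls the national estimate toward the better samples
print(round(niq_unweighted, 1), round(niq_weighted, 1))
```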
4. Do national assessments reflect cognitive ability?
Samorodnitsky et al. point out that 1) IQ tests are not culture free, and 2) IQ scores are dependent on formal education. Answering these points is crucial for what follows. 1) As te Nijenhuis & van der Flier (2003) expressed clearly, cultural loading is unavoidable and even desirable insofar as future school and work achievement may themselves have a high cultural loading; removing such items may adversely affect the predictive validity of the test. Furthermore, cultural loading does not equate to cultural bias: there is a vast literature on measurement invariance concluding that IQ tests are not biased against minorities. 2) There are reasons to doubt the causal impact of education on IQ, and not just because of upward biases in the schooling effect owing to inappropriate statistical procedures (Bingley & Martinello, 2017; Eriksson et al., 2024; Kane, 2017; Armor et al., 2018; Marks, 2015, 2024; Marks & O’Connell, 2021b). First, educational programs aimed at boosting the IQ of low-income Blacks did not have any lasting effect. Second, schooling increases observed IQ, but not latent g. Third, a plethora of research showed that increases in parental education after childbirth have almost no relationship with children’s cognitive test scores (Harding et al., 2015), even after controlling for parental cognitive test scores (Augustine & Negraia, 2018; Awada & Shelleby, 2021; Klein & Kühhirt, 2023; Marks & O’Connell, 2021a).
One way NIQ can be validated is by comparing it with national assessments. Samorodnitsky et al., as well as Rutkowski et al. (2024), argued that NIQ cannot be compared or adjusted to national achievement tests because the latter measure learning, not intelligence. Jensen (1998) reviewed several studies showing a strong relationship between the complexity of Elementary Cognitive Tasks (ECTs) – which typically involve no past-learned information content – and the g factor of tests that have a strong knowledge component, such as the ASVAB. Such a high correlation wouldn’t make sense if achievement tests required learning rather than intelligence. Indeed, g is also dominant in achievement tests. Pokropek et al. (2022a) factor analyzed the items of the Polish national PISA subscales along with Raven’s matrices: a bifactor IRT model (versus other g and non-g models) fit the data best, and the g factor explained 70% of the common variance, similar to standard IQ tests. The specific factors orthogonal to g, namely Raven, Math, Reading and Science, had no predictive validity. Pokropek et al. (2022b) used the same procedure and factor analyzed the items of the 2018 PISA across 33 participating OECD countries: the bifactor g model was more interpretable than the bifactor “general” reading model, and the g factor explained more than 80% of the common variance in all countries. Interestingly, both studies revealed that none of the PISA subscales were reliable factors, based on omega hierarchical. Thus both studies confirm Ree & Carretta’s (2022) conclusion about the ASVAB and AFOQT that specific abilities have no predictive power over g. Perhaps more importantly, this g of achievement seems to reflect the g of IQ, given that Zaboski II et al. (2018) found that each area of academic achievement was explained by g to a much greater extent than by any group factors of intelligence. A remaining question is whether the g factor is also found in non-industrialized countries. Warne & Burningham (2019) explored whether the cognitive g factor was dominant in non-Western cultures and in non-industrialized nations, including a non-trivial number of African countries, and found that 94 of the 97 cognitive test data sets from 31 countries exhibited the g factor.
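The statistics the Pokropek studies rely on, explained common variance (ECV) and omega hierarchical, are simple functions of a bifactor loading matrix. A sketch with illustrative loadings (not the actual PISA estimates):

```python
import numpy as np

# Hypothetical bifactor solution: 8 items load on a general factor (g)
# plus one of two specific factors; all loadings invented
g  = np.array([.70, .65, .72, .68, .60, .66, .71, .63])
s1 = np.array([.30, .35, .25, .28, .00, .00, .00, .00])
s2 = np.array([.00, .00, .00, .00, .32, .27, .30, .26])

# Explained common variance: share of common variance attributable to g
common = (g**2).sum() + (s1**2).sum() + (s2**2).sum()
ecv = (g**2).sum() / common

# Omega hierarchical: proportion of total-score variance due to g alone
uniq = 1 - (g**2 + s1**2 + s2**2)                    # item uniquenesses
total_var = g.sum()**2 + s1.sum()**2 + s2.sum()**2 + uniq.sum()
omega_h = g.sum()**2 / total_var

print(round(ecv, 2), round(omega_h, 2))
```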
In a very important article, Warne (2023) reported IQ-scaled international achievement scores from data collected by Angrist et al. (2021) and Gust et al. (2022). These tests include PIRLS, PISA, TIMSS, and regional exams like SACMEQ, LLECE, and PASEC. Based on this IQ metric, Warne observes that the achievement scores are highly congruent with L&V (despite large intranational gaps for a few countries). The high degree of concordance also applies to African countries, with some of the IQ-scaled scores being even lower than L&V’s own IQ scores. This supports the idea that L&V’s extremely low IQs may not be implausible. Warne (2023) provides another, indirect piece of evidence for the robustness of the L&V dataset. In the data for the 2019 TIMSS test, students who meet a “low international benchmark” can add, subtract, multiply, and divide one-digit and two-digit whole numbers. Nearly all fourth-graders in developed nations can do this, compared to a much smaller percentage in African countries. The low performance of African countries was confirmed using the PASEC and SACMEQ exams. The large African/non-African difference in the percentage that meets the benchmark is roughly congruent with the L&V or B&L NIQs, suggesting large or very large African/non-African cognitive gaps.
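The IQ-metric rescaling itself is just a linear transformation of standardized scores; a sketch (the 500/100 anchoring below is illustrative, not necessarily the anchor Warne used):

```python
def achievement_to_iq(score, anchor_mean=500.0, anchor_sd=100.0):
    """Rescale an achievement score to an IQ metric (mean 100, SD 15),
    assuming the anchor population averages anchor_mean with anchor_sd
    on the achievement scale. Both anchors here are illustrative."""
    z = (score - anchor_mean) / anchor_sd
    return 100.0 + 15.0 * z

# A country averaging 380 on a TIMSS-like scale would map to IQ 82:
print(achievement_to_iq(380.0))
```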
Warne (2023) enumerates some problems with the L&V NIQs. First, the geographically imputed IQs show poor concordance with data actually drawn from a country. Second, Raven’s matrices are over-represented in L&V’s African IQ data, yet it seems Africans use visualization rather than abstract reasoning ability when solving Raven’s test (Becker et al., 2022). Third, weighting the meta-analytic estimate of a nation’s IQ by data quality can reduce but not eliminate the loss of accuracy caused by a single low-quality sample, because the number of samples is often small. Fourth, there were some very large intra-national discrepancies between NIQ and achievement scores for a few countries.
The fact that Samorodnitsky et al. wondered how Warne “concluded the dataset should continue to be used is something of a mystery to us” shows that their reading of Warne’s article was superficial, or that their depiction of it was intentionally misleading. The intention of their article is clear enough: the term “racist” was used 11 times, “eugenics” 3 times, “nazi” 3 times, “supremacist” 8 times, and even Goebbels and Hitler were mentioned. A very disappointing piece. Fortunately, Rutkowski et al. have lately provided perhaps the most complete criticism of NIQ and its proxy, the national assessment.
Rutkowski et al. (2024) argued that the PISA subscales were never designed to be averaged and that a latent factor derived from PISA tests would make no sense even if it exists. This is akin to a design fallacy. Dalliard recounted how, in the earlier days, some psychologists were eager to demonstrate that the cognitive g factor was an artefact, arguing it would be easy to make it vanish if the test battery were designed to suppress g, only to find out later that g was pervasive. A good example is the Woodcock-Johnson, now widely recognized as an excellent measure of g despite the fact that it was originally designed around the idea that g was a statistical artefact. Thus, the argument that the designers of these international assessments did not intend to measure g makes no sense. An even stronger case is provided by Reeve & Basalik (2014, Table 4), who reported that the shared variance of health literacy tests and cognitive tests greatly overlapped. Health literacy tests certainly were not designed as IQ tests, yet they reflect g to a great extent. This is crucial here because the PISA is usually understood as a reading literacy test as well.
If test scores are not invariant, group differences are not entirely due to cognitive ability but partly due to nuisance factors not intended to be measured. Rutkowski et al. (2024) claimed that measurement invariance studies are lacking, except for von Davier et al. (2023), who studied the PIRLS 2021. But Rutkowski et al. missed Odell et al. (2021, Table 5), who used a novel technique called the alignment method, which is appropriate for assessing measurement invariance across many groups (i.e., >2), and found that invariance holds for the PISA mathematics and science scores across the 47 countries which participated in the 2015 PISA.
Motivation is one of those nuisance factors that threaten the validity of IQ or achievement tests, especially if there are group differences in motivation. Rutkowski et al. (2024) argued that test scores cannot be compared across nations because observed scores vary depending on test motivation or effort. Their evidence comes from the OECD 2019 questionnaires asking students whether they expended less effort on the PISA test than they would have if the test counted towards their marks: the percentage answering yes varied across countries. Furthermore, they observe that the median response time varies across performance levels.
Their argument has multiple problems. First, this is a self-report question about “what would be”, not an actual measure of test motivation. The issue is best illustrated by Lee & Jia (2014), who found that response time effort (an objective measure) and self-reported motivation are not congruent at all. Second, varying levels of motivation do not necessarily threaten test validity: Gignac et al. (2019) and Gignac & Wong (2020) showed that a valid interpretation of IQ scores does not require maximum effort but only moderate effort, as the link between effort and score holds only at low-to-moderate levels of motivation. Third, experimental studies showed (at best) only a small effect of motivation on test scores (Bates & Gignac, 2022; Gignac, 2018). Fourth, even if test motivation varies to such an extent that it causes large differences in observed scores between groups, a solution is to model rapid-guessing behaviour through response time (Wise & Kong, 2005); methods such as effort-moderated scoring (Wise & Kingsbury, 2016) and motivation filtering (Wise & DeMars, 2010) were found to be successful at mitigating the impact of rapid guesses (a sketch of such filtering follows below). Fifth, even if rapid guessing is hardly avoidable, one needs to assess its overall impact: Wise & Kingsbury (2016) found that this behaviour is idiosyncratic and not pervasive across items, and, more importantly, the impact of rapid guessing was found to be minimal on the NAEP (Lee & Jia, 2014) and PISA (Michaelides et al., 2024). Sixth, DeMars & Wise (2010) discovered that differential propensity for rapid guessing can sometimes lead to item bias (or DIF), which means that DIF adjustment reduces the biasing effect of guessing. Seventh, the mere observation of a positive correlation between median time and performance does not always indicate lower effort. As I noted before, during my investigation of Prolific test takers, poor quality data occurred when the test was not only difficult but also time consuming because it required deep thinking, as is often the case in abstract reasoning tests. The relationship between time and score was usually zero or negative for vocabulary tests with a strong knowledge component. This negative correlation makes sense, because people who know the answer will answer faster than those who struggle and don’t know the answer.
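Wise & Kong’s response time effort (RTE) and the motivation filtering built on it are straightforward to compute once a rapid-guess threshold is chosen. A sketch with simulated response times (the flat 5-second threshold is a simplification; the actual methods derive item-specific thresholds):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical response times (seconds) and scores for one examinee
rt = rng.gamma(shape=2.0, scale=15.0, size=20)       # 20 items
correct = rng.integers(0, 2, size=20).astype(bool)   # toy item scores
threshold = 5.0   # below this, treat the response as a rapid guess

# Response Time Effort (Wise & Kong, 2005): share of items answered
# with solution behavior (RT at or above the threshold)
rte = np.mean(rt >= threshold)

# Motivation filtering: drop rapid guesses before computing the score
valid = rt >= threshold
score_all = correct.mean()
score_filtered = correct[valid].mean() if valid.any() else float("nan")

print(round(rte, 2), round(score_all, 2), round(score_filtered, 2))
```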
To further strengthen their case against NIQ proxies, Rutkowski et al. (2024) argued that item bias methods cannot detect DIF in low-performing countries. This is not true. Methods have been devised to avoid false detection of bias, despite large group differences in latent proficiency level, such as the outlier detection proposed by von Davier & Bezirhan (2023) or weighted item fit statistics by Joo et al. (2024) or the LPE generalization by Huang et al. (2024).
Rutkowski et al. (2024) wrongly believe that if IQ and achievement scores show diverging trends over time (Flynn and anti-Flynn effects), the only possible conclusion is that they don’t measure the same construct. First, they don’t seem to understand the notion of the vehicle (Jensen, 1998, ch. 10), because they fail to distinguish latent ability from observed score. Second, the reasons why IQs increased and why achievement tests such as the SAT declined are not the same (Murray & Herrnstein, 1992). Third, as explained already, means and correlations are not necessarily related (Rodgers, 1999).
Yes, national IQ is fine.
References
Abrevaya, J., & Hausman, J. A. (1999). Semiparametric Estimation with Mismeasured Dependent Variables: An Application to Duration Models for Unemployment Spells. Annales d’Economie et de Statistique, 243–275.
Angel, S., Disslbacher, F., Humer, S., & Schnetzer, M. (2019). What did you really earn last year?: explaining measurement error in survey income data. Journal of the Royal Statistical Society Series A: Statistics in Society, 182(4), 1411–1437.
Armor, D. J., Marks, G. N., & Malatinszky, A. (2018). The impact of school ses on student achievement: Evidence from u.s. statewide achievement data. Educational Evaluation and Policy Analysis, 40(4), 613–630.
Augustine, J. M., & Negraia, D. V. (2018). Can increased educational attainment among lower-educated mothers reduce inequalities in children’s skill development? Demography, 55(1), 59–82.
Awada, S. R., & Shelleby, E. C. (2021). Increases in maternal education and child behavioral and academic outcomes. Journal of Child and Family Studies, 30(7), 1813–1830.
Bates, T. C., & Gignac, G. E. (2022). Effort impacts IQ test scores in a minor way: A multi-study investigation with healthy adult volunteers. Intelligence, 92, 101652.
Bauhoff, S. (2011). Systematic self-report bias in health data: impact on estimating cross-sectional and treatment effects. Health Services and Outcomes Research Methodology, 11, 44–53.
Becker, D., Meisenberg, G., Dutton, E., Bakhiet, S. F., Humad, O. A. M., Abdoulaye, H. A., & Ahmed, S. A. E. S. (2022). Factor structure in Raven’s Progressive Matrices Plus in sub-Saharan Africa–Benin and Djibouti. Journal of Psychology in Africa, 32(2), 103–114.
Bingley, P., & Martinello, A. (2017). Measurement error in income and schooling and the bias of linear estimators. Journal of Labor Economics, 35(4), 1117–1148.
Black, D., Sanders, S., & Taylor, L. (2003). Measurement of higher education in the census and current population survey. Journal of the American Statistical Association, 98(463), 545–554.
Brener, N. D., McManus, T., Galuska, D. A., Lowry, R., & Wechsler, H. (2003). Reliability and validity of self-reported height and weight among high school students. Journal of Adolescent Health, 32(4), 281–287.
Bollinger, C. R., Hirsch, B. T., Hokayem, C. M., & Ziliak, J. P. (2019). Trouble in the tails? What we know about earnings nonresponse 30 years after Lillard, Smith, and Welch. Journal of Political Economy, 127(5), 2143–2185.
Bound, J., Brown, C., Duncan, G. J., & Rodgers, W. L. (1994). Evidence on the validity of cross-sectional and longitudinal labor market data. Journal of Labor Economics, 12(3), 345–368.
Bound, J., Brown, C., & Mathiowetz, N. (2001). Measurement error in survey data. In J. J. Heckman & E. Leamer (Eds.), Handbook of econometrics (Vol. 5, pp. 3705–3843).
Carl, N. (2014). Does intelligence explain the association between generalized trust and economic development? Intelligence, 47, 83–92.
Celhay, P., Meyer, B. D., & Mittag, N. (2024). What leads to measurement errors? Evidence from reports of program participation in three surveys. Journal of Econometrics, 238(2), 105581.
Choi, A., & Cawley, J. (2018). Health disparities across education: The role of differential reporting error. Health Economics, 27(3), e1–e29.
Cole, J. S., & Gonyea, R. M. (2010). Accuracy of self-reported SAT and ACT test scores: Implications for research. Research in Higher Education, 51, 305–319.
Dama, M. S. (2013). Cognitive ability correlates positively with son birth and predicts cross-cultural variation of the offspring sex ratio. Naturwissenschaften, 100(6), 559–569.
Daniele, V., & Ostuni, N. (2013). The burden of disease and the IQ of nations. Learning and Individual Differences, 28, 109–118.
Davillas, A., & Jones, A. M. (2021). The implications of self-reported body weight and height for measurement error in BMI. Economics Letters, 209, 110101.
DeMars, C. E., & Wise, S. L. (2010). Can differential rapid-guessing behavior lead to differential item functioning? International Journal of Testing, 10(3), 207–229.
Duncan, G. J., & Hill, D. H. (1985). An investigation of the extent and consequences of measurement error in labor-economic survey data. Journal of Labor Economics, 3(4), 508–532.
Eppig, C., Fincher, C. L., & Thornhill, R. (2010). RETRACTED: Parasite prevalence and the worldwide distribution of cognitive ability. Proceedings of the Royal Society B: Biological Sciences, 277(1701), 3801–3808.
Eriksson, K., Sorjonen, K., Falkstedt, D., Melin, B., & Nilsonne, G. (2024). A formal model accounting for measurement reliability shows attenuated effect of higher education on intelligence in longitudinal data. Royal Society Open Science, 11(5), 230513.
Figueredo, A. J., Hertler, S. C., & Peñaherrera-Aguirre, M. (2021). The biogeography of human diversity in cognitive ability. Evolutionary Psychological Science, 7, 106–123.
Gallup, J. L. (2023). Cognitive and Economic Development.
Gignac, G. E. (2018). A moderate financial incentive can increase effort, but not intelligence test performance in adult volunteers. British Journal of Psychology, 109(3), 500–516.
Gignac, G. E., Bartulovich, A., & Salleo, E. (2019). Maximum effort may not be required for valid intelligence test score interpretations. Intelligence, 75, 73–84.
Gignac, G. E., & Wong, K. K. (2020). A psychometric examination of the anagram persistence task: More than two unsolvable anagrams may not be better. Assessment, 27(6), 1198–1212.
Glewwe, P. (2007). Measurement error bias in estimates of income and income growth among the poor: analytical results and a correction formula. Economic Development and Cultural Change, 56(1), 163–189.
Gnambs, T., & Kaspar, K. (2017). Socially desirable responding in web-based questionnaires: A meta-analytic review of the candor hypothesis. Assessment, 24(6), 746–762.
Gorber, S. C., Tremblay, M., Moher, D., & Gorber, B. (2007). A comparison of direct vs. self-report measures for assessing height, weight and body mass index: a systematic review. Obesity Reviews, 8(4), 307–326.
Gottschalk, P., & Huynh, M. (2010). Are earnings inequality and mobility overstated? The impact of nonclassical measurement error. The Review of Economics and Statistics, 92(2), 302–315.
Gunnell, D., Berney, L., Holland, P., Maynard, M., Blane, D., Frankel, S., & Smith, G. D. (2000). How accurately are height, weight and leg length reported by the elderly, and how closely are they related to measurements recorded in childhood? International Journal of Epidemiology, 29(3), 456–464.
Hafer, R. W. (2016). Cross-country evidence on the link between IQ and financial development. Intelligence, 55, 7–13.
Hafer, R. W. (2017a). New estimates on the relationship between IQ, economic growth and welfare. Intelligence, 61, 92–101.
Hafer, R. W. (2017b). Female entrepreneurship and IQ. In: Ahmetoglu, G., Chamorro-Premuzic, T., Klinger, B., Karcisky, T. (Eds.), The Wiley Handbook of Entrepreneurship. John Wiley & Sons Ltd., West Sussex, pp. 187–204.
Hafer, R. W., & Jones, G. (2015). Are entrepreneurship and cognitive skills related? Some international evidence. Small Business Economics, 44, 283–298.
Hahn, J., & Ridder, G. (2017). Instrumental variable estimation of nonlinear models with nonclassical measurement error using control variables. Journal of Econometrics, 200(2), 238–250.
Harding, J. F., Morris, P. A., & Hughes, D. (2015). The relationship between maternal education and children’s academic outcomes: A theoretical framework. Journal of Marriage and Family, 77(1), 60–76.
Haroun, D., & Ehsanallah, A. (2024). Validity of self-reported weight and height among female young adults in the United Arab Emirates. PLOS ONE, 19(4), e0302439.
Hassall, C., & Sherratt, T. N. (2011). Statistical inference and spatial patterns in correlates of IQ. Intelligence, 39(5), 303–310.
Hausman, J. (2001). Mismeasured variables in econometric analysis: problems from the right and problems from the left. Journal of Economic Perspectives, 15(4), 57–67.
Hu, Y., & Wansbeek, T. (2017). Measurement error models: editors’ introduction. Journal of Econometrics, 200(2), 151–153.
Huang, Q., Bolt, D. M., & Lyu, W. (2024). Investigating item complexity as a source of cross-national DIF in TIMSS math and science. Large-scale Assessments in Education, 12(1), 12.
Hyslop, D. R., & Imbens, G. W. (2001). Bias from classical and other forms of measurement error. Journal of Business & Economic Statistics, 19(4), 475–481.
Jerrim, J., & Micklewright, J. (2014). Socio-economic gradients in children’s cognitive skills: Are cross-country comparisons robust to who reports family background? European Sociological Review, 30(6), 766–781.
Jones, G. (2012). Cognitive skill and technology diffusion: An empirical test. Economic Systems, 36(3), 444–460.
Jones, G., & Nye, J. V. (2011). Human Capital in the Creation of Social Capital: Evidence from Diplomatic Parking Tickets.
Jones, G., & Podemska-Mikluch, M. (2010). IQ in the utility function: Cognitive skills, time preference, and cross-country differences in savings rates.
Jones, G., & Potrafke, N. (2014). Human capital and national institutional quality: Are TIMSS, PISA, and national average IQ robust predictors? Intelligence, 46, 148–155.
Joo, S., Valdivia, M., Valdivia, D. S., & Rutkowski, L. (2024). Alternatives to Weighted Item Fit Statistics for Establishing Measurement Invariance in Many Groups. Journal of Educational and Behavioral Statistics, 49(3), 465–493.
Kane, M. T. (2017). Measurement error and bias in value-added models. ETS Research Report Series, 2017(1), 1–12.
Kane, T. J., Rouse, C. E., & Staiger, D. (1999). Estimating returns to schooling when schooling is misreported. Working Paper 7235, National Bureau of Economic Research.
Kim, C., & Tamborini, C. R. (2014). Response error in earnings: An analysis of the survey of income and program participation matched with administrative data. Sociological Methods & Research, 43(1), 39–72.
Klein, M., & Kühhirt, M. (2023). Parental education and children’s cognitive development: A prospective approach. PsyArXiv.
Kreuter, F., Eckman, S., Maaz, K., & Watermann, R. (2010). Children’s reports of parents’ education level: Does it matter whom you ask and what you ask about? Survey Research Methods, 4(3), 127–138.
Kuncel, N. R., Credé, M., & Thomas, L. L. (2005). The validity of self-reported grade point averages, class ranks, and test scores: A meta-analysis and review of the literature. Review of Educational Research, 75(1), 63–82.
Lee, Y. H., & Jia, Y. (2014). Using response time to investigate students’ test-taking behaviors in a NAEP computer-based study. Large-scale Assessments in Education, 2(8), 1–24.
Lenis, D., Ebnesajjad, C. F., & Stuart, E. A. (2017). A doubly robust estimator for the average treatment effect in the context of a mean-reverting measurement error. Biostatistics, 18(2), 325–337.
Lin, C. J., DeRoo, L. A., Jacobs, S. R., & Sandler, D. P. (2012). Accuracy and reliability of self-reported weight and height in the Sister Study. Public health nutrition, 15(6), 989–999.
Loy, T. R. (2018). Intelligence, institutions, a culture of secrecy and earnings management. Corporate Ownership & Control, 15(4), 96–106.
Lynn, R., & Meisenberg, G. (2010). The average IQ of sub-Saharan Africans: Comments on Wicherts, Dolan, and van der Maas. Intelligence, 38(1), 21–29.
Madsen, J. B. (2016). Barriers to prosperity: Parasitic and infectious diseases, IQ, and economic development. World Development, 78, 172–187.
Marks, G. N. (2015). Are school-SES effects statistical artefacts? Evidence from longitudinal population data. Oxford Review of Education, 41(1), 122–144.
Marks, G. N. (2024). No substantive effects of school socioeconomic composition on student achievement in Australia: A response to Sciffer, Perry and McConney. Large-Scale Assessments in Education, 12(1), 8.
Marks, G. N., & O’Connell, M. (2021a). Inadequacies in the SES–achievement model: Evidence from PISA and other studies. Review of Education, 9(3), e3293.
Marks, G. N., & O’Connell, M. (2021b). No evidence for cumulating socioeconomic advantage: Ability explains increasing SES effects with age on children’s domain test scores. Intelligence, 88, 101582.
Meyer, B. D., & Mittag, N. (2021). Combining administrative and survey data to improve income measurement. Administrative records for survey methodology, 297–322.
Meyer, B. D., Mok, W. K., & Sullivan, J. X. (2015). Household surveys in crisis. Journal of Economic Perspectives, 29(4), 199–226.
Michaelides, M. P., Ivanova, M. G., & Avraam, D. (2024). The impact of filtering out rapid-guessing examinees on PISA 2015 country rankings. Psychological Test and Assessment Modeling, 66, 50–62.
Minkov, M., Welzel, C., & Bond, M. H. (2016). The impact of genes, geography, and educational opportunities on national cognitive achievement. Learning and Individual Differences, 47, 236–243.
Moore, J. C., Stinson, L. L., & Welniak, E. J. (2000). Income measurement error in surveys: A review. Journal of Official Statistics, 16(4), 331–361.
Murad, H., Kipnis, V., & Freedman, L. S. (2016). Estimating and testing interactions when explanatory variables are subject to non-classical measurement error. Statistical Methods in Medical Research, 25(5), 1991–2013.
Murray, C., & Herrnstein, R. J. (1992). What’s really behind the SAT-score decline. The Public Interest, 106, 32–56.
Nikolaev, B., & Salahodjaev, R. (2016). The role of intelligence in the distribution of national happiness. Intelligence, 56, 38–45.
Obydenkova, A. V., & Salahodjaev, R. (2017). Government size, intelligence and life satisfaction. Intelligence, 61, 85–91.
Odell, B., Gierl, M., & Cutumisu, M. (2021). Testing measurement invariance of PISA 2015 mathematics, science, and ICT scales using the alignment method. Studies in Educational Evaluation, 68, 100965.
Omanbayev, B., Salahodjaev, R., & Lynn, R. (2018). Are greenhouse gas emissions and cognitive skills related? Cross-country evidence. Environmental Research, 160, 322–330.
Pokropek, A., Marks, G. N., & Borgonovi, F. (2022a). How much do students’ scores in PISA reflect general intelligence and how much do they reflect specific abilities? Journal of Educational Psychology, 114(5), 1121–1135.
Pokropek, A., Marks, G. N., Borgonovi, F., Koc, P., & Greiff, S. (2022b). General or specific abilities? Evidence from 33 countries participating in the PISA assessments. Intelligence, 92, 101653.
Portela, M., Alessie, R., & Teulings, C. (2010). Measurement error in education and growth regressions. Scandinavian Journal of Economics, 112(3), 618–639.
Potrafke, N. (2012). Intelligence and corruption. Economics Letters, 114(1), 109–112.
Potrafke, N. (2019). Risk aversion, patience and intelligence: evidence based on macro data. Economics Letters, 178, 116–120.
Probst, T. M., & Estrada, A. X. (2010). Accident under-reporting among employees: Testing the moderating influence of psychological safety climate and supervisor enforcement of safety practices. Accident Analysis & Prevention, 42(5), 1438–1444.
Ree, M. J., & Carretta, T. R. (2022). Thirty years of research on general and specific abilities: Still not much more than g. Intelligence, 91, 101617.
Reeve, C. L., & Basalik, D. (2014). Is health literacy an example of construct proliferation? A conceptual and empirical evaluation of its redundancy with general cognitive ability. Intelligence, 44, 93–102.
Rindermann, H. (2013). African cognitive ability: Research, results, divergences and recommendations. Personality and Individual Differences, 55(3), 229–233.
Rios-Leyvraz, M., Ortega, N., & Chiolero, A. (2023). Reliability of self-reported height and weight in children: a school-based cross-sectional study and a review. Nutrients, 15(1), 75.
Rodgers, J. L. (1999). A critique of the Flynn effect: Massive IQ gains, methodological artifacts, or both? Intelligence, 26(4), 337–356.
Rosen, J. A., Porter, S. R., & Rogers, J. (2017). Understanding student self-reports of academic performance and course-taking behavior. AERA Open, 3(2), 1–14.
Roth, L. W., Allshouse, A. A., Lesh, J., Polotsky, A. J., & Santoro, N. (2013). The correlation between self-reported and measured height, weight, and BMI in reproductive age women. Maturitas, 76(2), 185–188.
Rutkowski, L., Rutkowski, D., & Thompson, G. (2024). What are we measuring in international assessments? Learning? Probably. Intelligence? Not likely. Learning and Individual Differences, 110, 102421.
Salahodjaev, R. (2015). Does intelligence help fighting inflation: an empirical test?.
Salvucci, S., Walter, E., Conley, V., Fink, S., & Saba, M. (1997). Measurement Error Studies at the National Center for Education Statistics.
Schennach, S. M. (2016). Recent advances in the measurement error literature. Annual Review of Economics, 8(1), 341–377.
Sear, R. (2022). ‘National IQ’ datasets do not provide accurate, unbiased or comparable measures of cognitive ability worldwide.
te Nijenhuis, J., & van der Flier, H. (2003). Immigrant–majority group differences in cognitive performance: Jensen effects, cultural effects, or both? Intelligence, 31(5), 443–459.
von Davier, M., & Bezirhan, U. (2023). A robust method for detecting item misfit in large-scale assessments. Educational and Psychological Measurement, 83(4), 740–765.
von Davier, M., Mullis, I. V. S., Fishbein, B., & Foy, P. (Eds.). (2023). Methods and procedures: PIRLS 2021 technical report. TIMSS & PIRLS International Study Center.
Voracek, M. (2013). National intelligence estimates and the failed state index. Psychological Reports, 113(2), 519–524.
Warne, R. T. (2023). National mean IQ estimates: Validity, data quality, and recommendations. Evolutionary Psychological Science, 9(2), 197–223.
Warne, R. T., & Burningham, C. (2019). Spearman’s g found in 31 non-Western nations: Strong evidence that g is a universal phenomenon. Psychological Bulletin, 145(3), 237.
Webster, G. D., & Duffy, R. D. (2016). Losing faith in the intelligence–religiosity link: New evidence for a decline effect, spatial dependence, and mediation by education and life quality. Intelligence, 55, 15–27.
Wicherts, J. M. (2007). Group differences in intelligence test performance. Unpublished doctoral dissertation. University of Amsterdam.
Wicherts, J. M., Dolan, C. V., & van der Maas, H. L. (2010a). A systematic literature review of the average IQ of sub-Saharan Africans. Intelligence, 38(1), 1–20.
Wicherts, J. M., Dolan, C. V., Carlson, J. S., & van der Maas, H. L. (2010b). Another failure to replicate Lynn’s estimate of the average IQ of sub-Saharan Africans. Learning and Individual Differences, 20(3), 155–157.
Wise, S. L., & DeMars, C. E. (2010). Examinee noneffort and the validity of program assessment results. Educational Assessment, 15(1), 27–41.
Wise, S. L., & Kingsbury, G. G. (2016). Modeling student test‐taking motivation in the context of an adaptive achievement test. Journal of Educational Measurement, 53(1), 86–105.
Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18, 163–183.
Zaboski II, B. A., Kranzler, J. H., & Gage, N. A. (2018). Meta-analysis of the relationship between academic achievement and broad abilities of the Cattell-Horn-Carroll theory. Journal of School Psychology, 71, 42–56.