It must be understood that a p-value, or any other statistic based on the Chi-Square, is not a useful number on its own. It is driven by two components: sample size and effect size. Its ability to detect a non-zero difference increases when either one increases, so if sample size grows while the effect stays constant, the statistic becomes inflated. There is also a problem with what is being tested: a test of a merely “non-zero” difference is of no use when what matters is the magnitude, i.e., the effect size. I will provide several examples of the deceptiveness of significance tests.
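As a minimal sketch of this point (the numbers are illustrative, not taken from any study discussed here): for a 2×2 contingency table, the Chi-Square statistic equals N·φ², where φ is the effect size. Holding φ constant and multiplying N therefore multiplies the statistic, and the p-value shrinks even though the effect has not changed.

```python
# Sketch: the Chi-Square statistic for a 2x2 table is N * phi^2,
# so the same effect size yields very different p-values as N grows.
from scipy.stats import chi2

phi = 0.10  # a small, fixed effect size
for n in (100, 1000, 10000):
    chi_sq = n * phi**2          # Chi-Square with 1 degree of freedom
    p = chi2.sf(chi_sq, df=1)    # upper-tail p-value
    print(f"N = {n:6d}  chi2 = {chi_sq:6.1f}  p = {p:.4f}")
```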
If one takes a glance at Google, the web is replete with websites, chapters, and presentations on how to interpret and report the p-value. They all have their share of troubles. Each time the authors present a result with an effect that may be small, modest, or large, but clearly different from zero, they ignore it and go straight to the p-value. If it does not reach significance (below 0.05, or sometimes 0.10), then whatever the effect size is, they conclude there is no difference or no correlation. The correct interpretation would have been to say that, whatever the effect size we find, the non-significant p-value suggests we need a larger sample to have more confidence in our result. A related problem is the cut-off level used for significance. There is no logical reason to affirm that 0.04 is significant but 0.06 is not. If we need some index of “confidence”, then the so-called confidence intervals (CIs), which become wider when the sample is small, would have been largely sufficient and much better. Although even in this case the CI may still be misused, since it is sometimes advanced that if the CI includes zero, we must conclude there is no effect different from zero. Well, this is just the same old-fashioned fallacy in another guise.
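A minimal sketch of the point about interval width (again with illustrative numbers, not from any of the studies discussed): the usual Fisher-z confidence interval for a correlation of 0.20 includes zero at N = 50 but not at N = 200, even though the estimated effect is identical in both cases.

```python
# Sketch: 95% Fisher-z confidence intervals for the same correlation
# at different sample sizes. The estimate does not change; only the
# width of the interval (our uncertainty) does.
import numpy as np
from scipy.stats import norm

r = 0.20                         # illustrative effect size
z_crit = norm.ppf(0.975)         # 1.96 for a 95% interval
for n in (50, 200, 1000):
    z = np.arctanh(r)            # Fisher z-transform of r
    se = 1 / np.sqrt(n - 3)
    lo, hi = np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)
    print(f"N = {n:4d}  r = {r:.2f}  95% CI = [{lo:+.2f}, {hi:+.2f}]")
```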
Rogosa (1980) provides an enlightening illustration of the consequences of this fallacy. In the cross-lagged correlation (CLC) path analysis framework, it is said that when neither variable causally dominates the other, the difference between the cross-lagged correlations should not be significant. However:
Rejection of the null hypothesis of equal cross-lagged correlations (H0: ρ(x1,y2) = ρ(y1,x2)) often is interpreted with little regard for the power of the statistical test. Users of CLC are advised to use large samples; Kenny (1975) advises that “cross-lagged analysis is a low-power test” (p. 887) and that even with moderate sample sizes (defined as 75 to 300), statistically significant differences are difficult to obtain. With large enough samples, trivial deviations from the null hypothesis lead to rejection. For example, Crano, Kenny, and Campbell (1972) found significant differences between cross-lagged correlations of .65 and .67 because the sample size was 5,495.
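A rough sketch of the mechanism (my simplification, not Rogosa’s or Kenny’s computation): treat the two correlations as if they came from independent samples and apply a Fisher-z test to their difference. The actual CLC test uses a statistic for dependent correlations, so this will not reproduce the published p-value exactly; the only point is how the test statistic grows with N while the .02 difference stays fixed.

```python
# Sketch: how a Fisher-z test of the difference between two correlations
# (.65 vs .67) scales with sample size. The correlations are treated as
# independent for simplicity; the dependent-correlations test actually
# used in CLC analysis differs, so this only illustrates the effect of N.
import numpy as np
from scipy.stats import norm

r1, r2 = 0.65, 0.67
for n in (100, 500, 5495, 20000):
    z_diff = np.arctanh(r2) - np.arctanh(r1)     # difference on the Fisher-z scale
    se = np.sqrt(1 / (n - 3) + 1 / (n - 3))      # one variance term per correlation
    z_stat = z_diff / se
    p = 2 * norm.sf(abs(z_stat))
    print(f"N = {n:5d}  z = {z_stat:5.2f}  p = {p:.3f}")
```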
And that’s how the significance test is used: to produce misleading conclusions. Concerning the scientists who claim a “significant” difference that is ridiculously small in effect size, one relevant question is: do they really believe what they say? Sometimes, I doubt it. They don’t have the guts to question the gold standards, as if these were the words of God.
Other dangerous claims come from studies aimed at detecting item bias. In particular, some early DIF studies did not even report the effect sizes of the DIFs. This is what happened in Willson et al. (1989), among quite a few others in the 1980s: they claimed to have found no black-white item bias in the K-ABC, based on significance tests, which is hardly surprising given their small sample (N=100). With regard to the few significant DIF items they did find, they note that “the effects, although statistically significant, tend to be of no real or practical consequence” (p. 295). The first problem is with the interpretation: do they mean no practical consequence in terms of the items’ individual effect sizes, or in terms of the impact of the whole set of DIFs on the total test score? The other problem is that a large number of DIF items has probably been missed owing to low power of detection, and there is no way to tell whether the undetected DIFs would have shown a pattern of DIF cancellation, which would be evidence of no bias.
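To get a feel for how little N=100 buys you, here is a rough power simulation (my own illustration, with assumed proportions; it is not the ability-matched DIF procedure those studies used): a non-trivial group difference on an item is detected only a minority of the time at N = 100, while larger samples catch it almost always.

```python
# Sketch: rough power simulation for detecting a modest item-level group
# difference with a 2x2 Chi-Square test. This is not the matching-based
# DIF procedure used in the actual studies; it only illustrates how low
# the detection rate is at N = 100 for a non-trivial difference.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
p_ref, p_focal = 0.70, 0.55      # assumed proportions correct in each group
n_reps = 2000

for n_total in (100, 400, 1600):
    n_group = n_total // 2
    hits = 0
    for _ in range(n_reps):
        correct_ref = rng.binomial(n_group, p_ref)
        correct_foc = rng.binomial(n_group, p_focal)
        table = [[correct_ref, n_group - correct_ref],
                 [correct_foc, n_group - correct_foc]]
        _, p, _, _ = chi2_contingency(table)
        hits += p < 0.05
    print(f"N = {n_total:4d}  detection rate = {hits / n_reps:.2f}")
```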
Still another illustration of the insidious effect of relying on significance tests: van Soelen et al. (2011, Table 5) claimed that the childhood heritability of PIQ is 0.64 under the AE model instead of 0.46 under the ACE model. Usually, the purpose of such modeling is to find the most parsimonious, i.e., the simplest model, with the fewest free parameters to estimate. When a parameter is removed and the model fit does not worsen, the reduced model is said to be acceptable compared to the full model. Thus, the reason they selected the AE estimates is that the removal of the C parameter (shared environment) did not produce a significant change in the Chi-Square statistic. Their sample size was modest (224+46). The problem is that C has a value of 0.17: modest, but not zero. In the AE model, where C is dropped, the C value obviously becomes zero. Surely, with a statistic less affected by sample size, or with a larger sample, the result would have been different. In the ACE model, A amounts to 0.46, C to 0.17, and E to 0.38. The total is 1.00 (within rounding), as is usually the case with standardized parameters, which must sum to 100%. Now, if we look at the AE model, A equals 0.64 and E equals 0.36. What happened? Simple: when C is dropped, its value is handed to A, which becomes inflated (E being the nonshared environment plus measurement error). This distortion has serious implications for their conclusion, in which they imply, based on the AE models, that the heritability of PIQ does not increase with age, when in fact it has probably increased.
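A minimal sketch of where the extra heritability comes from (the ACE values are those quoted above; the unweighted least-squares refit is my simplification, not the maximum-likelihood model actually estimated): the ACE estimates imply MZ and DZ twin correlations, and refitting an AE model to those same correlations pushes most of the dropped C into A.

```python
# Sketch: why dropping C inflates A. Take the ACE estimates quoted in the
# text, derive the twin correlations they imply, and refit an AE model to
# the same correlations by unweighted least squares (a simplification of
# the actual maximum-likelihood fit).
a2, c2, e2 = 0.46, 0.17, 0.38           # ACE estimates from the text
r_mz = a2 + c2                          # implied MZ twin correlation: A + C
r_dz = 0.5 * a2 + c2                    # implied DZ twin correlation: A/2 + C

# AE model predicts r_mz = A and r_dz = 0.5 * A.  Least-squares solution:
a2_ae = (r_mz + 0.5 * r_dz) / (1 + 0.25)
print(f"implied r_MZ = {r_mz:.2f}, r_DZ = {r_dz:.2f}")
print(f"ACE: A = {a2:.2f}, C = {c2:.2f}")
print(f"AE : A = {a2_ae:.2f}  (C forced to zero, its share absorbed by A)")
```

Under this simplification A comes out near 0.66, close to the 0.64 they report; the exact number depends on the likelihood weighting, but the mechanism is the same: the shared-environment variance that was dropped gets reassigned to A.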
The consequences of all these mistakes are best understood when reading an article that reviews previous research. The authors begin by saying that researchers X found no relationship between A and B, and that researchers Y found no relationship between A, B, and C. When we look at the referenced articles, however, it sometimes turns out that the claim of a null relationship comes from the authors focusing on the p-value instead of the effect size, which can be as low as 0.10 or as high as 0.30. Clearly, that is not zero. It gets even worse when they summarize numerous studies, each with a small sample, all of them reporting no relationship: when the samples are combined, the p-value will be highly significant. This is what happened with Besharov (2011), who repeatedly dismisses every experimental study that fails to improve IQ but improves scholastic achievement. The conclusion itself may still be right, but Besharov’s reliance on significance tests not only obscures the effect sizes but also hides the clearly “significant” difference that would emerge if all the studies were considered collectively. Since we do not necessarily have time to read all of these papers, one would readily prefer to trust the review article. Unfortunately, it may be wrong, and there is no way to know until we read the papers ourselves.
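A minimal sketch of the pooling point (hypothetical studies with assumed values, not the actual literature Besharov reviewed): five studies each finding the same modest correlation in a small sample are all individually non-significant, yet a simple Stouffer combination over the five is clearly significant.

```python
# Sketch: five hypothetical studies, each with the same modest correlation
# and a small sample. None is significant on its own, yet Stouffer's
# combined test over all five clearly is.
import numpy as np
from scipy.stats import norm

r, n, k = 0.20, 40, 5                      # illustrative values, not real studies
z_each = np.arctanh(r) * np.sqrt(n - 3)    # Fisher-z test statistic per study
p_each = 2 * norm.sf(abs(z_each))
z_combined = (z_each * k) / np.sqrt(k)     # Stouffer's method, equal weights
p_combined = 2 * norm.sf(abs(z_combined))
print(f"each study : r = {r:.2f}, N = {n}, p = {p_each:.3f}")
print(f"combined   : k = {k} studies, p = {p_combined:.4f}")
```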
Indeed, one must first ask whether the significance test adds any relevant information at all. A small effect can easily be set aside when the sample is also small: we will conclude that the finding needs replication, without even looking at the p-value. My opinion, obviously, is that significance tests should never be used again. They add no information beyond what is already provided by the sample size and the effect size. They only add confusion.
Further reading:
Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null hypothesis testing: Problems, prevalence, and an alternative. The Journal of Wildlife Management, 912-923.
Dahiru, T. (2008). P-value, a true test of statistical significance? A cautionary note. Annals of Ibadan Postgraduate Medicine, 6(1), 21-26.
Grabowski, B. (2016). “P < 0.05” might not mean what you think: American Statistical Association clarifies P values.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31, 337-350.