Spearman's g Explains Black-White but not Sex Differences in Cognitive Abilities in Project Talent
In this analysis of the Project Talent data, the g factor model as represented by Spearman's Hypothesis (SH) was confirmed for the black-white cognitive difference but not for the sex difference. Results from MGCFA and MCV corroborate each other. MGCFA detected small-to-modest bias with respect to the race difference but strong bias with respect to the sex difference. Full results are available at OSF.
Project Talent is the largest study ever conducted in the United States, involving 377,016 9th-12th grade students surveyed in 1960 and drawn from all 50 states. The sample includes 4,481 twins and triplets from 2,233 families and 84,000 siblings from 40,000 other families. The goal was to identify individuals' strengths (i.e., "talents") and steer them onto paths where those strengths would be best utilized. To this end, data were collected on personal experiences, activities, aptitudes and abilities, health, and plans for college, military service, marriage and careers. Follow-up surveys were conducted until the students were approximately age 30. In 2019, a follow-up study examined the determinants of later-life cognitive diseases.
CONTENT
1.1. Spearman’s Hypothesis
1.2. MGCFA
2.1. CFA/SEM Assumptions
2.2. Measurement invariance assumptions
2.3. Inadmissible solutions
2.4. Bifactor bias
2.5. Cross loadings
2.6. Issues with fit indices and cut-offs
2.7. MGCFA: Lack of power?
2.8. Equivalent model
2.9. Bad use of modification indices
2.10. Scale reliability
3. Results
3.1. Race analysis
3.2. Gender analysis
3.3. Latent mean differences
3.4. Solving power issues with sliced samples
3.5. Method of Correlated Vectors
4. Discussion
Let's start by explaining why MGCFA was NOT a fun journey…
1.1. Spearman’s Hypothesis
The hypothesis tests whether group differences in IQ tests are mainly due to general ability or to specific abilities. Whether this hypothesis is true has no bearing on the hereditarian hypothesis, which proposes a partial genetic cause. The relevance of SH is to provide a general explanation of why and how group differences can magnify and why they matter in everyday life.
SH has been tested mostly using either the Method of Correlated Vectors (Jensen, 1998, Appendix B) or Multi-Group CFA (Dolan, 2000). But there are many other, often overlooked, ways it can be tested. The absence of far transfer from schooling or working memory training is one way of testing whether such interventions affect general ability. Group differences that magnify with increasing task complexity would serve as another confirmation of SH.
1.2. Multi-Group Confirmatory Factor Analysis
The traditional view of culture bias accepts that members of two groups, after being perfectly matched for latent ability, have equal probability of correct response on any given item. In MGCFA, members of different groups matched in latent factor mean (e.g., crystallized, fluid) should get the same score on the indicator variables loading onto this latent factor and each indicator’s loading should be similar across groups. If latent factor means do not account fully for the group difference in the test scores, the remainder is due to external influences, commonly assumed to be culture bias.
MGCFA starts by adding successive constraints to the free, baseline model: configural, metric, scalar, strict. A rejection of configural invariance implies that the groups use different latent abilities to solve the same set of item variables. A rejection of metric (loading) invariance implies that the indicators of a latent factor are unequally weighted across groups. A rejection of scalar (intercept) invariance implies that the subtest scores differ across groups when their latent factor means are equalized. A rejection of strict (residual) invariance implies a group difference in specific variance and/or measurement error. Steinmetz (2013) showed that unequal factor loadings had a minor effect on latent means while unequal intercepts led to serious differences in latent means. A last constraint is the equality of factor variances. Although not a necessary step for measurement invariance, it is still important because its rejection implies that the two groups do not use an equal range of latent ability.
When measurement equivalence is established, latent means are deemed comparable. Model comparison can be employed to test whether the g model fits better than the non-g model (Dolan, 2000). The g factor mean difference can be fixed to zero within a g model to test whether the model fit deteriorates (Frisby & Beaujean, 2015). Ideally, SH should be confirmed using both methods.
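To make the sequence concrete, here is a minimal sketch of these invariance steps in lavaan. The two-factor model, the indicators x1-x6, the data frame dat and the grouping variable group are all hypothetical placeholders, not names from the Project Talent data.

library(lavaan)
model <- '
  verbal =~ x1 + x2 + x3
  fluid  =~ x4 + x5 + x6
'
configural <- cfa(model, data = dat, group = "group")
metric     <- cfa(model, data = dat, group = "group", group.equal = "loadings")
scalar     <- cfa(model, data = dat, group = "group", group.equal = c("loadings", "intercepts"))
strict     <- cfa(model, data = dat, group = "group", group.equal = c("loadings", "intercepts", "residuals"))
variances  <- cfa(model, data = dat, group = "group", group.equal = c("loadings", "intercepts", "lv.variances"))
# chi-square difference tests between nested steps
lavTestLRT(configural, metric, scalar, strict)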
2.1. CFA/SEM Assumptions.
Before applying CFA, EFA must be used to determine the number and pattern of factors. Despite its appeal, Brown (2015, p. 141) argued that EFA may provide misleading information regarding the appropriate number of factors, e.g., when the relationships among indicators are better accounted for by correlated errors than separate factors as in the case of method effects.
Key assumptions of ML include a large sample, indicators measured on continuous scales, and multivariate normality of the indicators (Brown, 2015, pp. 64-65). The latter two assumptions apply to indicators of latent factors but not to covariates. Non-normality otherwise causes biased standard errors and a poorly behaved χ2 test of overall model fit. Curran et al. (1996) determined that univariate skewness of 2.0 and kurtosis of 7.0 are suspect, and that ML is robust to modest departures from multivariate normality but that the ML χ2 is inflated otherwise. Because data transformation has little effect on point estimates and standard errors, Little (2013, p. 17) recommended using robust estimation such as the bootstrap instead. Three options to avoid bias include: 1) normalizing variables before analysis with default ML, 2) using ML with robust standard errors and corrected model statistics such as the Satorra-Bentler statistic, 3) using ML with nonparametric bootstrapping (Kline, 2011, p. 177).
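For reference, options 2) and 3) map directly onto lavaan arguments; this is only a sketch reusing the hypothetical model and data from the previous example.

fit_sb   <- cfa(model, data = dat, estimator = "MLM")                   # Satorra-Bentler scaled chi-square
fit_mlr  <- cfa(model, data = dat, estimator = "MLR")                   # robust (Huber-White) standard errors
fit_boot <- cfa(model, data = dat, se = "bootstrap", bootstrap = 1000)  # nonparametric bootstrap SEs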
ML and GLS estimators are only robust to minor departure from normality, while robust ML can be used in case of severe non-normality (Brown, 2015, pp. 345-346). ML and GLS do not require the variables to have the same scale while ULS does not require a positive-definite covariance matrix (Kline, 2011, p. 176).
In means and covariances models, all the free location (means/intercepts) and scale (variances) parameters have to be uniquely estimated. There are three ways to set the latent scales (Beaujean, 2014, pp. 48, 58; Brown, 2015, p. 54; Kline, 2011, pp. 128-130; Little, 2013, pp. 80-84): 1) fixing a factor loading to 1, known as the marker-variable method, 2) fixing the latent variance to 1 (in the reference group for multi-group models), known as the standardized latent variable or fixed-factor method, 3) constraining the loadings to sum to the number of indicators and, to set the latent variable's location, constraining the intercepts to sum to 0, known as effects-coding. The primary effect of fixing the variance instead of the loading is that the unstandardized estimates take on the same values as the partially standardized estimates (Brown, 2015, pp. 120-121).
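In lavaan these three identification choices correspond to the following calls (again a sketch with the hypothetical model and data above):

fit_marker  <- cfa(model, data = dat)                       # 1) first loading fixed to 1 (marker variable)
fit_stdlv   <- cfa(model, data = dat, std.lv = TRUE)        # 2) latent variances fixed to 1 (fixed factor)
fit_effects <- cfa(model, data = dat, effect.coding = TRUE) # 3) effects-coding identification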
The CFA output includes an unstandardized solution which returns the original metrics of the indicators, a partially standardized solution which returns the unstandardized indicators and standardized latent variables or vice versa, and a fully standardized solution which returns the standardized indicators and latent variables (Brown, 2015, p. 36). Although the standardized estimates are appealing, it is recommended to report the unstandardized estimates as well, especially in SEM.
Kline (2011, ch. 13) enumerated 52 points to remember for properly conducting CFA/SEM. Two points are worth elaborating on. First, parsimonious models, in which some effects are intentionally constrained to zero to test specific hypotheses, are selected as best models if several models fit almost equally well (see also p. 102). Otherwise it is easy to overfit by specifying a model so complex that it no longer respects theory or cannot show poor fit even when grossly misspecified, because it has so many estimated parameters. Second, equality constraints are always applied to unstandardized coefficients only. If the groups differ in their variances, the standardized coefficients will differ between groups even when equality constraints are applied. Only the unstandardized solution matters for measurement invariance.
Typically, it is recommended to use the original scale of the variables. However, Little (2013, p. 18) noted that when variables are on very different metrics, rescaling them to be on comparable scales helps convergence.
2.2. Measurement invariance assumptions.
It is well established that measurement invariance requires that factor patterns (configural), factor loadings (metric) and intercepts (scalar) should be equal across groups. But no agreement is reached regarding residuals (strict). Lubke & Dolan (2003) reported that a model with free residuals overestimates slightly the latent mean differences whenever the groups differ in their residuals because the model has to compensate for the differences in residuals. These authors recommend strict invariance. DeShon (2004, p. 146) explained that the common view that item specific variance is removed from the latent variable is based on the assumption that item uniquenesses are uncorrelated with each other or the latent variable. Violating this assumption will affect the estimation of the latent variables. Widaman & Reise (1997) argued that strict invariance has the advantage of having fewer parameters to estimate but the step can be skipped if the difference in error variance is justified (e.g., growth model with an age-related variable).
On the other hand, it was sometimes argued that this step is dropped due to confoundings between measurement errors and specific/unique variance (Dolan & Hamaker, 2001, p. 15). There is no theoretical reason to expect equality in errors or item-specific content which can vary as a function of measurement factors such as assessment conditions (Little et al., 2007). Strict invariance may be implausible, e.g., in the case of the unequal effect of an intervention which increased the standard deviation over time because some individuals responded favorably whereas others did not (Brown, 2015, p. 234). Admittedly, strict invariance has a biasing effect only if the group difference in residuals is small (Little, 1997, p. 55). Another argument is that latent variables are theoretically perfectly reliable, which makes strict invariance useless when evaluating latent means but useful when evaluating the reliability differences between groups or over time (Vandenberg & Lance, 2000, p. 57). If the sum of the specific and random variance is not equal, the amount of misfit that the constraints on the residuals would create must permeate all other estimated parameters (Little, 2013, p. 143).
Millsap & Olivera-Aguilar (2012, p. 388) argued that one common cause of non-invariance is unwanted nuisance factors beyond the factor(s) that are the intended target of the measures. For example, a math test mixing multiple-choice items with problem-solving items, the latter being story problems, may introduce bias against foreign-language speakers due to its verbal content. If the math factor is not supposed to measure verbal skill, such a test should be discarded from the baseline model.
Invariance is only applied to the unstandardized values (Brown, 2015, p. 207). This means the standardized values between the groups can still differ despite equality constraints if the groups differ in their variance.
It is possible to conduct a partial invariance test if the non-invariant indicators are a minority (Brown, 2015, p. 271; Little, 2013, p. 159), because the factor can still be reliably represented by the remaining indicators; the contrary would mean the construct is not cohesive. Furthermore, if invariance is not found, it is generally recommended to employ the fixed-factor method of scaling, because the marker indicator could itself be non-invariant, which the marker method would conceal.
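In lavaan, partial invariance with fixed-factor scaling can be sketched as follows; the freed parameters are hypothetical examples, not results from this analysis.

fit_partial <- cfa(model, data = dat, group = "group",
                   group.equal   = c("loadings", "intercepts"),
                   group.partial = c("verbal =~ x3", "x5 ~ 1"),  # free a non-invariant loading and intercept
                   std.lv = TRUE)                                # fixed-factor scaling instead of a marker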
Often, researchers do not compute the effect size of partial invariance. Gunn et al. (2020) proposed the SDI, which is similar to Nye and Drasgow's (2011) dMACS for MGCFA non-invariance except that the SDI uses the focal group SD (in this case, the black group) instead of the pooled SD. Either method yields an effect size comparable to Cohen's d. Groskurth (2023) proposed the MIVIs, effect sizes computed using the pooled SD of the latent factor. The latent SD has the advantage of being the same for all subtests loading onto that factor and of consisting of true score variance only; since observed SDs vary across subtests, effect sizes based on them are not comparable across subtests. At the same time, MIVIs are partially but not fully standardized because they do not use the observed SD, making them comparable within but not across factors. Groskurth (2023) notes, however, that the calculation of non-invariant intercepts typically assumes invariant loadings and no cross loadings; the interpretation is less ambiguous in the absence of cross loadings and/or non-invariance in loadings.
2.3. Inadmissible solutions.
Inadmissible solutions fall into two categories: nonpositive definite matrices and Heywood cases. The most common cause is a misspecified model (Brown, 2015, pp. 60-61, 152, 163-167, 179; Kline, 2011, pp. 146-148, 158, 233; Newsom, 2023). Improper solutions and nonconvergence occur because the fitted model is very different from models that the data would support. Underidentification can also lead to improper solutions (Bentler & Chou, 1987, pp. 101-102; especially in correlated methods approaches, see, e.g., Brown, 2015, pp. 179, 199). This may occur due to highly correlated indicators, very low loadings, very low factor correlations, or the omission of an essential constraint (e.g., fixing a reference loading or a variance to 1). In particular, Chen et al. (2001, p. 482) explained that misspecification alone does not affect the probability of an improper solution. A population parameter value close to zero, a more skewed distribution, or a greater standard deviation of the estimator of the error variance all increase the probability of nonpositive error variances.
How does misspecification cause estimation problems? Suppose the variance of an indicator is a small negative number. Constraining it to be positive prevents the iterative algorithm from settling on values for the parameter estimates, because the numbers keep changing: convergence will fail.
A nonpositive definite matrix is often a symptom of linear dependency (i.e., a variable is the sum of some other variables) but sometimes also a result of pairwise deletion or violation of normality when using ML. Another condition is the requirement that all eigenvalues be >0, which can be examined by submitting the fitted variance-covariance matrix to PCA (Brown, 2015, p. 163). Even with all eigenvalues >0, a misspecification can force the ML estimation to push one or more parameters out of the range of admissibility as a compensation (e.g., X1 and X2 specified as loading onto latent factor Y1 despite X1 being much more related to X3 than to X2).
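A quick way to run this eigenvalue check (a sketch; fit stands for any single-group lavaan fit, dat for the raw data):

eigen(cov(dat))$values          # observed covariance matrix
eigen(fitted(fit)$cov)$values   # model-implied (fitted) covariance matrix; all values should be > 0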
Other problems often arise with small samples. A nonpositive definite matrix may be due to sampling error, and outliers are more impactful in a small sample. Outliers can cause collinearities and non-normality and lead to Heywood cases.
Heywood cases most often arise from misspecification/underidentification but also from outliers, small sample sizes and extraction of more factors than are present (Bartholomew, 2011, pp. 67-68). A misconception is that standardized estimates cannot take on values larger than 1. Jöreskog (1999) and Brown (2015, p. 162) explained that when factors are correlated and variables have multiple loadings, the factor loadings are regression coefficients and not correlations. Values larger than 1, despite being admissible, still suggest there is a high degree of multicollinearity.
Model size can help avoid this problem. Gerbing & Anderson (1987) showed that loadings of 0.90 produced the largest proportion of improper solutions (or none at all) when there were 2 indicators per factor (as opposed to 3 or 4).
A related problem is poor discriminant validity, which reflects factor correlations that are too high, exceeding 0.85 or approaching 1.0 (Brown, 2015, pp. 146-147). This may indicate that too many factors have been specified, leading to rejection of the model. A common solution is either to combine factors to obtain a more parsimonious solution or to drop one of the factors and its constituent indicators. However, in intelligence research, r=0.85 is typical. Another, less common, scenario is when the factor correlation becomes too high because too many cross loadings are specified (Bentler & Chou, 1987, p. 106).
Solutions to improper solutions include, if substantively justified, dropping nonsignificant parameters or placing other restrictions on the model such as equality constraints. Another is adding variables (Bentler & Chou, 1987, p. 104). One solution regularly proposed for a negative variance is to fix the residual to 0. This is bad practice because the negative variance may be due to outliers, multicollinearity or misspecification (Brown, 2015, p. 167). The only scenario of negative variance which does not require respecification is sampling error, i.e., when the confidence interval includes zero. On the other hand, Chen et al. (2001, p. 504) argued that if the negative error variance is nonsignificant, constraining the error to zero can reduce bias in estimates, since improper solutions yield more biased estimates than proper solutions. They also evaluated (Table 3) the performance of constrained and unconstrained estimation on samples with improper solutions: the former achieves convergence more often. If constrained estimation achieves convergence but unconstrained does not, the constrained estimation produces more bias; if both converge, the constrained estimation produces less bias. Savalei & Kolenikov (2008) concluded that constrained estimation generally causes more problems than it solves, its few advantages including the omnibus test of model fit.
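A sketch of how one might inspect a negative residual variance before deciding what to do; fixing it to zero, as in the last lines, is the last-resort option discussed by Chen et al. (2001), not a general recommendation (x1 is a hypothetical indicator, model and dat as before).

est <- parameterEstimates(fit)
est[est$op == "~~" & est$lhs == est$rhs & est$est < 0, ]   # negative variances with their confidence intervals
# If the CI includes zero, the problem may be sampling error; otherwise respecify.
model_fixed <- paste(model, "x1 ~~ 0*x1", sep = "\n")      # last resort: fix the offending residual to zero
fit_fixed   <- cfa(model_fixed, data = dat)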
2.4. Bifactor bias.
Murray & Johnson (2013) argued that the bifactor model may simply be better at accommodating unmodelled complexity, e.g., cross-loadings or correlated residuals: “It is arguable that the global fit of the bi-factor model may be less sensitive to these mis-specifications than the higher-order model because it includes more free parameters and a g that loads directly on observed subtests. However, in accommodating these mis-specifications, modelled parameters may become biased. … In the higher-order model, the effect of constraining these cross-loadings to zero is to force the covariances due to these unmodelled cross-loadings to be absorbed by the modelled factor loadings, inflating factor covariances and, in turn, the variance attributable to g (Asparouhov & Muthén, 2009). Thus, specific factor correlations can to some degree compensate for unmodelled cross-loadings that would otherwise decrease global fit (Beauducel & Wittmann, 2005). In a bi-factor model the correlations amongst specific factors and g are constrained to zero so this compensatory route is not available. Instead, subtests are connected directly to g, providing an alternative and more direct mediation route for this unmodelled complexity, i.e. the covariance due to unmodelled cross-loadings can be absorbed by g without going through specific ability factors.” (p. 409). In other words, misspecification leads to smaller reductions in global fit but greater parameter bias for the bifactor model. They suggest that the difference in fit must be very large to establish the superiority of the bifactor, in order to overcome this inherent bias. This is because fit indices, while adjusting for model df, do not make reference to the location of the additional parameters nor to the possibility that additional parameters may model both intended covariances and unintended unmodelled covariances. Yet they also conclude that a higher-order factor model should produce less accurate estimates of the specific factors because it conflates g and specific variance.
Mansolf & Reise (2017) simulated conditions in which bifactor and higher order models are distinguishable. The higher order model imposes more constraints, one of which being the tetrad constraint. It is only when this assumption is not violated that both models display the same Chi-Square value. When cross loadings and correlated residuals are misspecified or when the proportionality constraint is not met, tetrad constraints are violated. The greater the degree of the tetrad violation, the greater the bias favoring bifactor model. They conclude that it is not the disproportionality of loadings per se that causes fit indices to favor a bifactor when there are unmodeled complexities, it is their effect on tetrads.
Comparison between correlated factors (i.e., non-g) model with a bifactor model is still possible under specific conditions. Assuming no unmodeled misspecification, Morgan et al. (2015) demonstrate that when data were sampled from a true correlated factors structure, with unequal factor correlations, fit indices favor a correlated factors model. They also admit, though, that when data were sampled from a true higher order structure, fit indices favor a bifactor model. On the other hand, Greene et al. (2019, Tables 2-4) compared correlated factors and bifactor models under a true correlated three-factor model, using different misspecifications. When there is no misspecification, the SRMR favors the bifactor when sample size is smaller. When cross loadings or between-factor correlated residuals are misspecified, all fit indices favor slightly the bifactor regardless of sample size and size of misspecification, except for SRMR which favors the correlated factors when misspecification of between-factor correlated residuals is large. When within-factor correlated residuals are misspecified, all fit indices favor the correlated factors model regardless of conditions, except for the SRMR. An interesting pattern is that when the size of misspecification is very small, the probifactor bias is negligible.
Conceptually the bifactor can be thought as more parsimonious than the higher order factor model at explaining the relationship between subtests and g because it does not require a theoretical justification for full mediation of the specific factors and does not impose proportionality constraints on the loadings, despite the bifactor model having fewer degrees of freedom due to introducing more parameters (Cucina & Byle, 2017; Gignac, 2008). Perhaps more importantly, a bifactor model is consistent with Spearman’s initial conceptualization of g as having direct influences on the measured tests (Frisby & Beaujean, 2015, p. 95).
The selection of the best g-model may be based on substantive meaning. An advantage of the bifactor is that the specific and general factors are completely separated, whereas the specific factors are represented as residuals in a higher order model, a point made clear by Bornovalova et al. (2020). An illustration is provided by Gignac (2005) in his analysis of the WAIS-R. The Arithmetic and Digit Span subtests loaded onto the VIQ factor in the higher order but not the bifactor model, which led to the conclusion that “oblique and higher order factor models may suggest that a subtest loads onto a group-level factor, when in fact the observed loading between a subtest and a group-level factor is actually mediated by its relationship with the higher order general factor.” (p. 328). This pattern was later replicated in the WAIS-III test (Gignac, 2006a, Table 2). Similarly, Arithmetic showed a strong loading onto VIQ in the higher order model but a zero loading onto VIQ in the bifactor model of the MAB test (Gignac, 2006b). Occasionally, specific factors have very low factor loadings or account for very little variance in some indicators within bifactor models (Brown, 2015, pp. 302-303). A solution is to respecify the model by removing this specific factor and having its indicators load on the general factor only.
Markon (2019) suggests that parameter estimates should also be scrutinized, citing a study that estimated a bifactor model with externalizing and internalizing variables in which the externalizing variables had high loadings on the specific factor but low or modest loadings on the general factor. Such a pattern calls into question the adequacy of the general bifactor model.
2.5. Cross loadings.
A common EFA modeling practice is to ignore all loadings below a certain threshold, such as 0.3 on a standardized scale. Numerous recent studies have shown that this practice is misguided.
Asparouhov and Muthén (2009) were one of the first to investigate the impact of misspecification of zero cross loadings. They noted: “When non-zero cross-loadings are specified as zero, the correlation between factor indicators representing different factors is forced to go through their main factors only, leading to over-estimated factor correlations and subsequent distorted structural relations.” (p. 4). Their simulations compare SEM which assumes no cross loadings and Exploratory SEM which allows small loadings, under condition of small cross loadings (λ=0.25) within a 2-factor model and N=1,000. They found that the SEM model is rejected 100% of the time while the ESEM model is rejected only 7% of the time.
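The mechanism is easy to reproduce. The following small simulation is my own illustration rather than their design: data are generated with one cross loading of 0.25 and a true factor correlation of 0.50, then fitted with a CFA that fixes the cross loading to zero, which typically inflates the estimated factor correlation.

library(lavaan)
set.seed(1)
pop <- '
  f1 =~ 0.7*x1 + 0.7*x2 + 0.7*x3 + 0.25*x6
  f2 =~ 0.7*x4 + 0.7*x5 + 0.7*x6
  f1 ~~ 0.5*f2
'
dat_sim <- simulateData(pop, sample.nobs = 1000)
fit_nocross <- cfa('f1 =~ x1 + x2 + x3
                    f2 =~ x4 + x5 + x6', data = dat_sim, std.lv = TRUE)
lavInspect(fit_nocross, "std")$psi   # factor correlation, typically above the true 0.50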
Similarly, Mai et al. (2018) advocate the use of Exploratory SEM over SEM when cross loadings are ≥0.10. Regression coefficients were unbiased for ESEM, except when cross loadings were as large as 0.25, but were consistently biased for SEM. ESEM also produced more precise estimates, although the difference decreases as sample size and target-factor loadings increase.
Hsu et al. (2014) analyzed the impact of fixing low cross-loadings (>0.13) to zero in a correlated 3-factor SEM model and found that factor coefficients and factor correlations were overestimated. RMSEA and SRMR were not sensitive enough to detect this misspecification under all design conditions. They recognize that the larger the magnitude of the factor pattern coefficients, the more one can generalize from that factor to the measured variables but that this does not mean that smaller factor pattern coefficients can be ignored. If sample size is too small to allow for proper estimation, they recommend using Bayesian SEM with non-informative priors for cross loadings.
Xiao et al. (2019, Table 2) compare the standard CFA (i.e., no cross loadings) with nonstandard CFA (which allows cross loadings), ESEM and BSEM under varying conditions of cross loadings and loading variance. Regarding target loading and cross loading estimation, when cross loading mean was 0.20 and loading variance not large (0.01 or 0.04), the standard CFA and BSEM with large- and small-variance priors showed large bias while the nonstandard CFA showed small bias only when loading variance is larger than 0.01. Regarding factor correlation estimation, when cross loading mean was 0.20 and loading variance was medium or large (0.04 or 0.09), only the ESEM had small bias whereas only the BSEM had small bias when loading variance was small (0.01). While both ESEM and BSEM with correct priors performed well, specifying correct priors is a challenge (e.g., requires knowledge from experts, published research results, pilot studies) and in such a case ESEM might be preferred.
Wei et al. (2022, Figures 3-4) compare SEM, ESEM, and BSEM using a 2-factor structure. When true cross loadings equal zero or when target loadings are very large, all methods perform equally well but SEM should be preferred due to parsimony. If there are cross loadings and adequate information on priors for cross loadings, BSEM shows lower rejection rates and lower estimation bias of regression coefficients regardless of target and cross loadings. Without adequate information on priors, BSEM shows the largest estimation bias when target loadings are low (0.55) and cross loadings are small/modest (0.20-0.30). Under the same condition of loadings, ESEM shows lower rejection rate than SEM.
Zhang et al. (2023) evaluated the impact of fixing small cross loadings (0.20) to zero using several bifactor predictive models: SEM, Exploratory SEM and Bayesian SEM, with either a classical approach or an augmented approach (which uses an additional indicator that loads only onto the general factor). Fixing the cross loadings to zero negatively affected power and Type I error and biased the regression coefficients of both the general and specific factors. The effectiveness of the augmentation approach was not affected by fixing small cross loadings to zero. None of the classical methods (SEM, ESEM, BSEM) was robust to estimation bias; however, the augmented BSEM accurately identified the predictive validity of the general and specific factors.
Ximénez et al. (2022) examined the condition of forcing small cross loadings (λ≥0.20) to zero under a bifactor CFA model and found that the loadings in the general factor are overestimated and those in the specific factors are underestimated. The factor loadings estimates are unbiased when sample is large (N=500) but biased when sample is small (N=200 or less) and the cross loadings are biased when the magnitude of the loadings in the specific factors decreases (λ<0.50) and sample size is very small (N=100). Both the CFI and RMSEA fail to detect misspecification of ignoring low and moderate cross loadings, whereas SRMR adjusted by the communality level performs well. The finding of biased loadings owing to misspecification is consistent with previous simulation studies which did not use a bifactor structure.
An important discovery is the lower power associated with increased model size. Savalei (2012, Figures 1-4) reported that RMSEA becomes less sensitive to misfit, when the number of latent factors increases and (sometimes) when indicators per factor increases. Poor measurement plays a major role as well. For instance, loadings of only 0.40 or 0.50 produced an RMSEA that was largely insensitive to large misspecification (e.g., large size and number of omitted covariance residuals or cross loadings and number of latent factors). Savalei explains that “it is reasonable to expect the RMSEA to decrease with model size, because the misspecification becomes progressively more “diluted” and concerns an increasingly smaller part of the overall model” (p.925). Similarly, Shi et al. (2022) found that RMSEA, CFI and SRMR become less accurate in detecting misspecification when loadings and sample size decrease in small df models. The only remedy to large model size appears to be measurement quality (high loadings).
Even though a model can allow multiple loadings, it may happen that two factors display extremely high correlations if too many variables load on both factors (Bentler & Chou, 1987, p. 106).
2.6. Issues with fit indices and cut-offs.
Usually, researchers are unaware that fit indices are affected by sample size, number of indicators, indicators per factor, factor loadings, and model parsimony (Cheung & Rensvold, 2001, p. 249; 2002). Results from simulations only apply to the conditions set in the analysis; there is no such thing as a one-size-fits-all cutoff. One striking illustration comes from Sivo et al. (2006, Tables 8-10). The optimal cutoff value of fit indices for rejecting misspecified models depends on sample size: among complex models, Mc varies between 0.90-0.95 at N=150 and 0.80-0.85 at N=5,000, RMSEA varies between 0.03-0.05 at N=150 and 0.06-0.07 at N=5,000, and CFI varies between 0.97-0.99 at N=150 and 0.94-0.96 at N=5,000, whereas SRMR is almost unaffected.
Shi et al. (2022, Figure 5) analyzed the performance of RMSEA, SRMR and CFI in small-df SEM model. Using conventional cutoffs (i.e., Hu & Bentler, 1999), RMSEA often rejects correct models and slightly misspecified models but also often fails to reject severely misspecified models, unlike SRMR and CFI. For all fit indices, increasing loadings leads to more accurate detection of misspecification.
Heene et al. (2011, pp. 322-325) observed that the lower the factor loadings, the more likely RMSEA and SRMR are to fail to detect misspecified CFA models (no cross loadings and no factor correlation) when using these cutoffs, while the opposite was true for CFI. Generally, CFI performed well while RMSEA and SRMR failed to reject complex misspecification (i.e., cross loadings) when loadings were low (0.40). An important finding is the impact of model size. Increasing the number of variables per factor from 5 to 15 causes all indices to show better fit, therefore lowering detection: RMSEA fails to detect misspecification while SRMR and CFI performed well only for complex misspecification (i.e., cross loadings).
High loading is therefore always better. One peculiarity observed by Browne et al. (2002) is that excellent measurement (i.e., very high loadings or, equivalently, very small uniqueness) may yield small residuals which would indicate good fit despite fit indices indicating poor fit. The only fit index that shows good fit was the RMR/SRMR because it only depends on the residuals.
One undesirable property of ∆ in goodness of fit is to be correlated with the goodness of fit of the overall model, which implies that less accurately specified models produce larger values of difference statistics when measurement invariance constraints are added. Cheung & Rensvold (2002) found that, if the null hypothesis of invariance is true, ∆CFI, ∆Gamma hat and ∆McDonald’s NCI are not correlated with the overall fit measures and are not affected by model complexity.
Simulations with varying levels of non-invariance, and assuming Type I error rate of 0.01, lead Meade et al. (2008) to recommend a cutoff of 0.002 in ∆CFI to detect metric and scalar invariance while the cutoff for McDonald’s NCI depends on the number of factors and items (Table 12) and most realistic conditions (i.e., up to 6 factors and up to 30 total items) probably lie between ∆Mc 0.0065 and 0.0120. Simulations varying the proportion of non-invariant indicators and pattern of non-invariance (unidirectional or bidirectional bias) and sample size ratios, based on a 1-factor model, lead Chen (2007) to propose a cutoff for testing loading invariance, a change of ≥.005 in CFI, supplemented by a change of ≥.010 in RMSEA or a change of ≥.025 in SRMR would indicate noninvariance; for testing intercept or residual invariance, a change of ≥.005 in CFI, supplemented by a change of ≥.010 in RMSEA or a change of ≥.005 in SRMR would indicate noninvariance. The values of ∆Mc vary greatly depending on the condition and invariance steps (see Tables 4-6) but often lie between 0.010 and 0.015.
Finch & French (2018, Table 2) simulated the power rates of model fit indices for scalar invariance in a single-factor MGCFA model. ΔCFI displays power rates of only 0.45 and 0.67 for 0.3 and 0.5 in intercept differences, respectively. RMSEA shows a 60% proportion of poor fit when intercepts differ by 0.3 but its accuracy also increases with sample size and greater number of indicators. With respect to metric invariance, both indices had very low power rates when only one indicator (as opposed to two or three) displayed loading differences. Both indices reached acceptable power rates when loadings differ by 0.4 with 2 or 3 non-invariant indicators.
The above studies did not assess fit indices within higher order models or bifactor models. Khojasteh & Lo (2015, Table 1) investigated the performance of fit indices in bifactor models for metric invariance and recommended the cutoffs 0.016-0.023 for ΔGamma, 0.077-0.101 for ΔMc, 0.003-0.004 for ΔCFI, 0.021-0.030 for ΔSRMR, 0.030-0.034 for ΔRMSEA. Cutoffs were smaller as sample sizes grow (from 400 to 1,200). However, ΔSRMR showed some type I error and ΔRMSEA showed low power. Their Table 3 displays the cutoff criteria from their study and previous studies.
Savalei et al. (2023) showed that ΔRMSEA is quite insensitive to model misfit when comparing two nested models, because the initial Model A often has large degrees of freedom (dfA) relative to the degrees of freedom added by the constraints in Model B (dfD). When RMSEAB is computed, the additional misfit introduced by the constraints is spread over the entire df of Model B, dfB = dfA + dfD, resulting in very similar values of RMSEAB and RMSEAA and hence a very small ΔRMSEA. This happens more often for models with large df, because any change in the fit function is diluted by dividing by the overall df. They advocate computing the RMSEA associated with the χ2 difference test: RMSEAD = sqrt((D − dfD) / (dfD(N − 1))), where D is the χ2 difference statistic and RMSEAD is set to zero if D < dfD. In small-df models, however, cutoffs for RMSEAD must be relaxed due to its increased sensitivity (e.g., 0.10 instead of 0.08). Generally, 0.08 is considered fair fit and 0.10 poor fit.
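A sketch of the computation, assuming fitA is the less constrained and fitB the more constrained of two nested lavaan fits:

lrt <- lavTestLRT(fitA, fitB)
D   <- lrt[2, "Chisq diff"]                 # chi-square difference statistic
dfD <- lrt[2, "Df diff"]                    # df added by the constraints
N   <- lavInspect(fitA, "ntotal")
rmsea_d <- sqrt(max(0, (D - dfD) / (dfD * (N - 1))))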
On top of model fit indices, the residual matrix and modification indices can be examined to assess model fit. The fitted residual matrix is useful for identifying focal areas of misfit (Brown, 2015, pp. 97-99). Standardized residuals, computed by dividing the fitted residuals by their standard errors, are easier to interpret than unstandardized residuals when the units of measurement of the indicators are disparate. A positive (negative) standardized residual suggests that the model's parameters underestimate (overestimate) the zero-order relationship between two indicators and may indicate an omitted (unnecessary) parameter. Typical cutoffs for standardized residuals are ≥1.96 or ≥2.58 in large samples, corresponding to a significant z score at p<.05 or p<.01, respectively. Another way to evaluate the model is through modification indices (Brown, 2015, p. 102). If the largest indices, which are based on the χ2 statistic with 1 df, are small enough, the fit cannot be significantly improved at p<0.05. Both strategies, though, are highly sensitive to sample size. Brown rightly noted that modification indices should be examined along with the (standardized) Expected Parameter Change, which tells us how much a parameter is expected to change, and in which direction, if it is freely estimated, but he did not mention that the residual matrix can be transformed into a correlation matrix, which produces a more interpretable measure of residual size.
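In lavaan, all three diagnostics can be requested as follows (a sketch; fit is any fitted model):

resid(fit, type = "standardized")                    # standardized residuals
resid(fit, type = "cor")                             # residuals on a correlation metric, easier to read
modindices(fit, sort. = TRUE, maximum.number = 10)   # largest MIs with their expected parameter changes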
If during the inspection of the residuals and modification indices, one variable consistently is associated with misfit, one could simply drop the variable to improve fit (Brown, 2015, p. 156).
2.7. MGCFA: Lack of power?
Several failures to detect a sex difference in g when applying MGCFA led Molenaar et al. (2009) to investigate the statistical power of MGCFA using higher order models. Their Table 2 shows that power increases with subtest correlations and with g effect size. To attain sufficient power (0.80), a g effect size of 0.40 SD is required, and the number of non-invariant intercepts and of first-order residual mean differences causes power to decrease. Even though Molenaar et al. recognize that a larger sample size greatly increases power, their simulation assumed equal group sizes.
When investigating ethnic differences, sample sizes are usually unequal. Yoon & Lai (2018) reported that a higher ratio of largest to smallest sample reduces power to detect non-invariance. They simulated group 1 with a fixed sample of 200 and group 2 with a sample varying from 200 to 3,000. As the sample in group 2 increases from 200 to 3,000, both ΔCFI and ΔRMSEA become much smaller. This was true for both metric and scalar invariance. In the only instance involving an equal total sample size, ΔCFI and ΔRMSEA were smaller in the 200/600 condition than in the 400/400 condition. Their proposed subsampling approach is promising. However, SRMR never performed well in either approach.
Since Dolan (2000) reported a lack of difference in fit between the correlated factors non-g models and higher order g-models with respect to black-white difference in the WISC-R, Lubke et al. (2001, Table 3) calculated the power to reject weak Spearman g-models in favor of the less restrictive four correlated factors model with mean differences in all four factors. Given the sample size used in the study (N=2,400), power equals 0.16 for a weak Spearman model with residual mean differences in 2 specific factors and equals 0.05 for a model with 3 specific factors. To reach high power (0.80), these models require N=13,581 and N=124,320, respectively.
One major concern in MGCFA observed by Fan and Sivo (2009, Figures 2-3) is that the changes in fit indices such as ΔCFI and ΔRMSEA and ΔGamma (but not ΔMc) had a poor performance in detecting latent mean differences, within first order models, because these tests were highly related to model size (i.e., increased number of factors or indicators per factor, or both) and became much less sensitive as model size grew. As indicators per factor increase from 2 to 6, power levels for all fit indices (except Mc) dropped to zero when latent means differ by d=0.20. Only ΔGamma and ΔMc maintained high power levels regardless of the number of factors or indicators when latent means differ by d=0.50. Except for ΔMc, all fit indices performed even worse when the number of factors increases.
With respect to invariance, Lubke & Dolan (2003, Table 4) reported that the sample size needed to reject the specification of free residual variances, but equal loadings and intercepts, in a single factor model at power 0.80 is generally small. Smaller differences in either factor means or residual variances, smaller number of indicators, and more unbalanced sample sizes negatively affect power. Yet even in the worst case scenario, with 5:1 in ratio of group sizes and 0.25 SD in latent means and a group difference of 0.30 in residual variances, the required sample is N=360 or N=561 for the 8- or 4-indicator model, respectively. If the model is respecified with group equality in residuals, the sample sizes are much lower, at N=88 and N=109 for the 8- and 4-indicator model. Strict invariance therefore increases power, but only if the group difference in residual variance is sizeable. Interestingly, increasing the number of factors (2-, 3-, 4-factor models with 4 indicators) does not affect power as long as the misspecification is restricted to one subscale.
Cao & Liang (2022, Tables 3-4) carefully examined the impact of model size on invariance test. For metric invariance, both CFI and RMSEA have very low sensitivity with loading difference of 0.15 regardless of sample size and model size. When loading difference is 0.30, sensitivity for both was very high but CFI (and RMSEA) become completely insensitive when there are 6 (and 2) factors with 8 indicators per factor (there are a few inconsistent pattern with CFI). Increasing only the number of indicators per factor will decrease power for both indices (here again, there are a few inconsistent pattern with CFI). For scalar invariance, both CFI and RMSEA have very low sensitivity (or very high, except few exceptions for RMSEA) with intercept difference of 0.15 (or 0.30) regardless of sample size. Increasing the number of indicators per factor (or total number of indicators) will decrease power for both indices but only for 0.15 difference. In general, increasing only the number of factors improves slightly the sensitivity of CFI (but not RMSEA) for both metric and scalar invariance.
The finding of lower power associated with greater number of residual mean differences is rather concerning given that more complex models are also associated with lower power. The main requirement for power is a combination of large sample and large standardized difference in g.
2.8. Equivalent model.
Kline (2011, p. 246) showed that correlated factors and higher order models can be equivalent. Hershberger (2006, pp. 36-38) proposed several strategies to eliminate equivalent models: 1) select model with smaller misfit and/or lower correlations among parameter estimates, 2) discard model with computational issues associated with misspecification such as negative variances and constrained boundary values, 3) comparing R² values among models, 4) select model with lower residuals based on the Extended Individual Case Residuals (EICR) instead of the residual covariance matrix since the latter will be identical among equivalent models.
Williams (2012, pp. 257-259) argued that one way to reduce the number of equivalent models is to add variables to the research design that are related to some variables but not others, which then allows those path coefficients to be fixed at zero. Variables added in this way serve a purpose similar to that of instrumental variables in cases where a model is just-identified or underidentified. Parameter estimates may vary across alternative models (p. 255). Models with many paths tend to have more equivalent models because these paths can be reversed or replaced with correlated residuals.
Rarely, equal goodness of fit may occur by chance for two models that do not produce the same model-implied covariance (Brown, 2015, p. 185). Truly equivalent models produce identical df and goodness of fit, as well as residuals and fitted matrices (Hershberger, 2006).
2.9. Bad use of modification indices.
When an initial, baseline model does not fit well, one common procedure is to examine modification indices (i.e., specification searches) and see which newly added parameters will most improve fit based on the χ2 value and standardized EPC. Unfortunately, nonsensical parameters are also recommended by the software.
MacCallum (1986) described an extreme view in which specification searches should not be used at all in SEM because doing so is “an admission that the initial model was not well conceived and/or that the research was not well conducted” (p. 109) and further explained that at the very least data-driven model modifications “cannot be statistically tested with any validity, that their goodness of fit and substantive meaning must be evaluated with caution, and that their validity and replicability are open to question” (p. 109).
As noted by MacCallum et al. (1992), modifications based on sample-specific characteristics are problematic because the population value is zero and the results will not generalize to other samples or the population. A large number of modifications increases the likelihood of capitalization on chance relative to a few modifications that are made early in the sequential ordering (Landis et al., 2009, p. 208).
A good illustration of specification search problem is the presence of correlated residuals which, according to Landis et al. (2009, p. 204), is an indication that there exists a common cause of both of these variables but that is not specified in the model. Allowing these residuals to be estimated improves fit because it captures the influence of this omitted cause, but does not specify what the cause is and this is not helpful. On the other hand, a defensible case can be made from identical measures across time, because the residuals attached to identical indicators separated only by time will obviously correlate. Other causes of correlated residuals include: subset of items being speeded or subset of items on a scale being associated with different stimuli (e.g., reading comprehension for different paragraphs). Potentially, the ordering of items could introduce correlated errors, particularly if similarly worded items were adjacent to each other.
Correlated residuals should have a substantive justification, since each one costs a degree of freedom (Kline, 2011, pp. 110, 115). Justified correlated residuals due to, e.g., method effects are discussed by Cole et al. (2007, pp. 383-387) and Brown (2015, ch. 6).
2.10. Scale reliability.
The scale reliability coefficients alpha (α) and omega (ω) can be used to assess the internal consistency of unidimensional scales, and omega hierarchical (ωh) that of multidimensional scales. Although α and ω both estimate the proportion of variance in the observed total score attributable to all sources of common variance, α assumes that the factor loadings of all indicators are equal while ω relaxes this assumption (Brown, 2015, p. 309; Rodriguez et al., 2016). Another disadvantage of α is that the presence of correlated errors among the indicators biases the scale reliability estimate. When data are suitably represented by a bifactor structure, omega hierarchical (ωh) and omega hierarchical subscale (ωhs) are more useful model-based reliability indices: ωh estimates the variance in total scores that is attributable to the general factor while ωhs estimates the variance of a subscale score due to a group factor after controlling for the variance due to the general factor. A high value of ωh indicates that the general factor is the dominant factor and is not affected by multidimensionality caused by specific factors.
Zinbarg et al. (2005) showed that ωh is robust to the presence of cross loadings, number of indicators and sample size. Cho (2022, Figure 2, Tables 4, 8-9) compared several ωh estimators between several categories (non-FA non-PCA, CFA, EBA, PCA). The Schmid-Leiman iterative target rotation (ωSLi) from the Exploratory Bifactor Analysis (EBA) produces the best average performance, yet he warns us that the performance of ωh estimators depends on the data structures (see Table 8). Cho then compared the ω reliability estimators and found that the bottom-up estimators ρBμ and ρBC performed overall better, yet again the performance depends on the conditions. An interesting observation is that ω reliability estimators are more accurate than ωh estimators. The explanation is that the former divides variances into explainable and nonexplainable while the latter further divides the explainable variances into general/specific factors. Sophisticated statistics require high quality data for reliable estimation: “A complex model uses many parameters to detect signals better than a simple model but is also more susceptible to noise (Fortmann-Roe, 2012). A complex model explains more, be it signal or noise, than a simple model. Since reliability is proportional to the variance explained, complex models are prone to overestimating reliability.” (p. 31). Cho provides the reliacoef package to compute these estimators.
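For a quick approximation, the psych package provides a Schmid-Leiman based ωh; using it here is an assumption on my part about an acceptable shortcut, since Cho's own reliacoef package is the tool the text points to (dat_items is a hypothetical item-level data frame).

library(psych)
omega(dat_items, nfactors = 3)   # reports omega hierarchical, omega total and subscale omegas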
3. Results.
The Project TALENT public-use data contain a sample of 377,016 high school students selected to be representative of all 9th, 10th, 11th and 12th grade students in the US. The items of the subtest variables are described in detail in Booklets A1, B1, C1, C2 and by Wise et al. (1979). All files are available in the public documentation. The final sample includes 70,776 white males, 71,381 white females, 2,443 black males, and 3,642 black females, with mean ages of 15.9, 15.8, 16.0, and 15.8, respectively. It is likely that a substantial portion of the black males dropped out of high school.
The data provided us with an IQ composite variable, which is more inclusive than the tests I used below. The weighted means (and SDs) for whites and blacks are 183 (50.3) and 111 (52.9) among males and 184 (46.7) and 109 (47) among females. This gives us a BW d gap of 1.43 and 1.60 for males and females respectively. The sex d gaps are 0.02 and 0.04 for white and black samples, respectively.
The 34 aptitude/achievement test variables used (numbered as in the original battery) are:
1. Vocabulary
2. Knowledge in Literature
3. Knowledge in Music
4. Knowledge in Social Studies
5. Mathematics
6. Knowledge in Physical Science
7. Knowledge in Biological Science
8. Knowledge in Aeronautics and Space
9. Knowledge in Electricity and Electronics
10. Knowledge in Mechanics
11. Knowledge in Art
12. Knowledge in Law
13. Knowledge in Health
14. Knowledge in the Bible
15. Knowledge in Theater
16. Miscellaneous Knowledge
19. Disguised Words
20. Word Spelling
21. Rules of Capitalization
22. Knowledge in the use of Punctuation
23. English Usage
24. Effective Expression
25. Word Function in Sentences
26. Reading Comprehension
28. Mechanical Reasoning
29. Visualization in 2D Figures
30. Visualization in 3D Figures
31. Abstract Reasoning
32. Arithmetic Reasoning
33. High School Math
34. Speed in Arithmetic Computation
35. Speed in Table Reading
36. Speed in Clerical Checking
37. Speed in Object Inspection
Previously, Major et al. (2012) included 37 aptitude tests in their CFA. I removed three tests: Memory for Sentences, Memory for Words, and Creativity. The memory tests are highly correlated with each other but show low correlations with all other variables (between r=0.10 and r=0.20), which makes them unsuitable for CFA. Creativity has moderate correlations with other variables but does not display a clear pattern of loadings and does not theoretically belong to any factor.
Exploratory factor analysis (EFA) was used to determine the number of factors. Like Major, I found the 6-factor model to be the most interpretable. The 4- and 5-factor models blend indicators into factors that are not easy to interpret and cause severe imbalances in factor sizes. The 7- and 8-factor models produce additional factors that are not associated with any particular ability or have no indicators with good loadings. EFA reveals a large number of medium-sized cross loadings. As recommended by the simulation studies cited above, I allow a cross loading when the average across the two groups is close to 0.20, with a minimum of 0.15 per group.
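The EFA step can be sketched with the psych package as follows; dat_tests is a hypothetical data frame holding the 34 test scores, and the 6-factor solution and the 0.15/0.20 cross-loading rule are the ones described above.

library(psych)
fa.parallel(dat_tests, fm = "ml")                                  # suggested number of factors
efa6 <- fa(dat_tests, nfactors = 6, rotate = "oblimin", fm = "ml")
print(efa6$loadings, cutoff = 0.15)                                # display loadings of 0.15 or higher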
While Major et al. (2012) demonstrated that the VPR higher order g-model fits better than the CHC, they used the entire sample, disaggregating it by grade level and gender but not by ethnicity. I do not display results from the VPR g-model because, in my samples disaggregated by race and gender, the VPR g fits only marginally better than the CHC (CFI +0.001 or +0.002). Also, the VPR g often yields parameter values which are either out of bounds (Heywood cases) or close to being inadmissible (near-zero residuals). Another variant of the VPR g-model is the VPR correlated factors model, which was considered for illustration purposes only, because at baseline it is an equivalent model (same df, fit, parameter estimates and residual matrices) to the VPR g-model and because the CHC correlated factors model fits better. Therefore, my analysis evaluates correlated factors (CF), higher order g (HOF) and bifactor g (BF) models within the CHC structure.
Unlike Major et al. (2012), I did not apply imputation to the entire sample; instead, I imputed each group separately. This is because the predictive mean matching method from the mice package is a regression method: using the entire sample assumes that the correlation pattern is identical across groups, an assumption that may not hold and that could conceal measurement non-invariance.
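A sketch of group-wise imputation with mice; the grouping column name and the use of a single imputation are assumptions for illustration, not documented choices.

library(mice)
imp_by_group <- lapply(split(dat, dat$group), function(d) {
  complete(mice(d, method = "pmm", m = 1, printFlag = FALSE))  # predictive mean matching within each group
})
dat_imp <- do.call(rbind, imp_by_group)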
Some variables were not normally distributed, so I normalized them before applying the z-score transformation. This is not a necessary procedure but, as explained above, it has some advantages. Values of univariate kurtosis and skewness were acceptable, but multivariate normality was sometimes rejected. The multivariate non-normality displayed by the QQ plot was moderate for the black-white analysis but perfectly fine for the sex analysis.
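The text does not state which normalization was applied; a common rank-based (rankit) transformation followed by z-scoring would look like this (test_vars is a hypothetical vector of column names).

normalize <- function(x) qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))  # rank-based normal scores
dat_z <- as.data.frame(scale(sapply(dat_imp[test_vars], normalize)))                # then z-score each test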
Initially, the MGCFA models were run without disaggregating by gender, but it was later found that disaggregating produced better fit, obviously because the test battery was gender biased. Strict invariance was not tested: as explained earlier, there is no agreement among statisticians on its necessity for establishing the comparability of latent means, and SH does not even assume group equality in the residuals.
All models adjust for differential age effects across groups by imposing a group equality constraint on the regressions of the aptitude tests on age. Student weights were also applied.
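Both adjustments can be written in lavaan roughly as follows; this is a sketch with the hypothetical indicators from earlier and an assumed weight column name, whereas the actual models regress all 34 tests on age.

model_age <- paste(model, "x1 + x2 + x3 + x4 + x5 + x6 ~ age", sep = "\n")
fit <- cfa(model_age, data = dat, group = "group",
           group.equal = "regressions",              # equal age effects across groups
           sampling.weights = "student_weight",      # assumed name of the weight variable
           estimator = "MLR")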
3.1. Race analysis.
Because the battery was not designed as an IQ test, it is no wonder that so many cross loadings were found; this has the consequence of producing ill-defined latent factor means. The models used in the BW male and BW female analyses, respectively, are:
BW males:
english =~ S1 + S19 + S20 + S21 + S22 + S23 + S24 + S25 + S26 + S31 + S34
math =~ S5 + S6 + S25 + S32 + S33 + S34
speed =~ S19 + S29 + S34 + S35 + S36 + S37
info =~ S1 + S2 + S3 + S4 + S5 + S6 + S7 + S8 + S11 + S12 + S13 + S14 + S15 + S16 + S19 + S26
science =~ S1 + S6 + S7 + S8 + S9 + S10 + S28
spatial =~ S28 + S29 + S30 + S31 + S37
BW females:
english =~ S1 + S13 + S19 + S20 + S21 + S22 + S23 + S24 + S25 + S26 + S31 + S34
math =~ S5 + S25 + S32 + S33 + S34
speed =~ S19 + S34 + S35 + S36 + S37
info =~ S1 + S2 + S3 + S4 + S7 + S8 + S11 + S12 + S13 + S14 + S15 + S16 + S19 + S26
science =~ S1 + S6 + S7 + S8 + S9 + S10
spatial =~ S28 + S29 + S30 + S31 + S37
Using the traditional approach on the full samples, the lack of power was very apparent.
Table 1. Black White analysis among males
Model Level X² df CFI RMSEA SRMR Mc RMSEAD [CI]
CF
1. Configural 69791 990 .955 .044 .033 .625
2. Regression 70088 1024 .955 .043 .034 .624 .014[.012:.015]
3. Metric 71549 1069 .954 .042 .035 .618 .028[.026:.029]
3a. Part.Metric 71072 1066 .954 .042 .035 .620 .023[.022:.025]
4. Scalar 75161 1094 .951 .043 .035 .603 .059[.058:.061]
4a. Part.Scalar 73762 1091 .952 .043 .035 .609 .051[.049:.053]
5. Lv covar 74582 1106 .952 .043 .036 .605 .035[.033:.037]
6. Lv var-covar 74827 1112 .951 .043 .036 .604 .032[.028:.035]
7. Lv reduced 74962 1113 .951 .043 .036 .604 .047[.039:.056]
HOF
1. Configural 83357 1008 .946 .047 .040 .570
2. Regression 83657 1042 .946 .047 .040 .569 .014[.012:.016]
3. Metric 85893 1092 .944 .046 .042 .560 .033[.031:.034]
3a. Part.Metric 84945 1089 .945 .046 .041 .564 .025[.024:.027]
4. Scalar 88575 1116 .942 .046 .041 .550 .056[.054:.057]
4a. Part.Scalar 87413 1114 .943 .046 .042 .555 .048[.046:.049]
5. Lv variance 87615 1121 .943 .046 .042 .554 .025[.022:.029]
6a. Strong SH 88785 1127 .942 .046 .042 .550 .076[.072:.079]
6b. Weak SH 87648 1124 .943 .046 .042 .554 .019[.014:.024]
BF
1. Configural 65836 952 .957 .043 .029 .642
2. Regression 66133 986 .957 .042 .030 .641 .014[.012:.015]
3. Metric 68480 1064 .956 .042 .032 .631 .026[.025:.027]
3a. Part.Metric 67753 1062 .956 .041 .031 .634 .022[.021:.023]
4. Scalar 70835 1089 .954 .042 .032 .621 .053[.051:.055]
4a. Part.Scalar 69688 1087 .955 .042 .032 .626 .043[.041:.045]
5. Lv variance 69961 1094 .955 .041 .032 .625 .029[.026:.032]
6a. Strong SH 71036 1100 .954 .042 .033 .620 .065[.062:.069]
6b. Weak SH 69988 1097 .955 .041 .032 .625 .014[.009:.019]
6c. No SH 74348 1098 .952 .043 .043 .606 .161[.157:.166]
Note: Models a, b, c are compared to the previous level model.
In the male sample, both metric and scalar invariance hold. Examining the χ2 values from the modification indices, 3 loadings and 2 intercepts displayed larger χ2 values than the other parameters in the correlated factors model, and 3 loadings and 2 intercepts showed relatively larger χ2 values in the two g-models. One of the loadings showed a small-to-modest difference. Because intercept differences are not easily interpretable on their own, I computed Gunn et al.’s (2020) SDI effect size. Mechanics (S10), Physical Science (S6) and Arithmetic Reasoning (S32) showed the strongest bias. The SDI computed from the correlated factors model was d=0.36 for S10 and d=0.35 for S32; positive values indicate bias against the focal group (i.e., blacks). The SDI for S6 was not computed due to its double loading.
Table 2. Black White analysis among females
Model Level X² df CFI RMSEA SRMR Mc RMSEAD [CI]
CF
1. Configural 69163 998 .948 .043 .032 .635
2. Regression 69589 1032 .948 .042 .033 .633 .016[.015:.018]
3. Metric 71432 1073 .946 .042 .034 .626 .032[.031:.033]
3a. Part.Metric 70400 1069 .947 .042 .033 .630 .023[.022:.025]
4. Scalar 76280 1097 .943 .043 .035 .606 .070[.069:.072]
4a. Part.Scalar 72652 1093 .945 .042 .034 .621 .047[.046:.049]
5. Lv covar 74067 1108 .944 .042 .037 .615 .044[.042:.047]
6. Lv var-covar 75159 1114 .943 .042 .038 .610 .062[.058:.065]
7. Lv reduced 75451 1115 .943 .042 .038 .609 .075[.067:.084]
HOF
1. Configural 78750 1016 .941 .045 .037 .595
2. Regression 79176 1050 .940 .045 .038 .594 .016[.015:.018]
3. Metric 82098 1096 .938 .044 .041 .583 .038[.037:.039]
3a. Part.Metric 80609 1092 .939 .044 .039 .589 .028[.027:.029]
4. Scalar 86643 1119 .935 .045 .040 .565 .071[.070:.073]
4a. Part.Scalar 82974 1115 .937 .044 .040 .579 .048[.047:.050]
5. Lv variance 83892 1122 .937 .044 .042 .576 .049[.046:.052]
6a. Strong SH 84863 1128 .936 .044 .042 .572 .067[.064:.071]
6b. Weak SH 84004 1125 .937 .044 .042 .575 .035[.030:.040]
BF
1. Configural 57816 960 .957 .040 .028 .684
2. Regression 58247 994 .956 .039 .029 .683 .016[.015:.018]
3. Metric 61807 1068 .954 .039 .032 .667 .032[.031:.033]
3a. Part.Metric 60806 1064 .954 .039 .032 .671 .028[.027:.029]
4. Scalar 66601 1091 .950 .040 .033 .646 .074[.072:.075]
4a. Part.Scalar 63741 1087 .952 .039 .033 .659 .057[.055:.058]
5. Lv variance 64880 1094 .951 .039 .034 .654 .057[.054:.060]
6a. Strong SH 66814 1100 .950 .040 .035 .645 .080[.077:.084]
6b. Weak SH 64889 1095 .951 .039 .034 .654 .010[.003:.020]
6c. No SH 68774 1096 .948 .041 .039 .637 .192[.186:.199]
Note: Models a, b, c are compared to the previous level model.
In the female sample, both metric and scalar invariance barely hold. Examining the χ2 values from the modification indices, 4 loadings and 4 intercepts had large χ2 values, and there was indeed a non-trivial group difference in these loadings. Intercept bias was even more serious: by far the largest χ2 belonged to Mechanics (S10), followed by Clerical Checking (S36), Arithmetic Computation (S34) and Arithmetic Reasoning (S32). The SDI values for S10, S36 and S32 were d=0.67, d=-0.43 and d=0.43; positive values indicate bias against the focal group (i.e., blacks). I did not compute the SDI for S34 because the code provided by Lasker (2021) does not handle cross-loadings.
After establishing partial invariance, we can now examine the Spearman’s Hypothesis models. To be clear, the “Lv variance” and “weak SH” models are similar, except that the former estimates all factor means and is therefore less parsimonious than the latter. The BF g-model fits barely better than the CF in the male sample but clearly better in the female sample. The CF always fits substantially better than the HOF g-model. Within the bifactor g-model, fixing the g difference to zero worsens fit, but not by much, likely owing to a lack of power.
In previous MGCFA studies comparing models, measurement invariance was not evaluated separately for each competing model. This was an important omission, and the present analysis provides an illustration. In the male sample, S32 had a large χ2 in the correlated factors model but a very small χ2 in the higher-order model and a non-significant χ2 (p=0.277) in the bifactor model, despite the extremely large sample size. In both the male and female samples, the BW group differences in loadings depend on the model: in the higher-order and bifactor models, the biased loadings are most often the loadings on g, which do not appear as such in the correlated factors model.
3.2. Gender analysis.
Owing to comparable sample sizes, power is less of an issue here.
Table 3. Male-female analysis among Whites
Model Level X² df CFI RMSEA SRMR Mc RMSEAD [CI]
CF
1. Configural 115791 984 .958 .041 .030 .668
2. Regression 119145 1018 .957 .040 .031 .660 .035[.034:.036]
3. Metric 128322 1066 .953 .041 .039 .639 .048[.047:.049]
4. Scalar 195992 1094 .928 .050 .044 .504 .173[.172:.174]
4a. Part.Scalar 138034 1082 .950 .042 .039 .618 .085[.084:.087]
5. Lv covar 141168 1097 .948 .042 .055 .611 .051[.049:.053]
6. Lv var-covar 142175 1103 .948 .042 .057 .609 .044[.041:.046]
7. Lv reduced 142176 1104 .948 .042 .057 .609 NaN*
HOF
1. Configural 136882 1002 .950 .044 .035 .620
2. Regression 140316 1036 .949 .043 .036 .613 .035[.034:.036]
3. Metric 150122 1089 .945 .044 .043 .592 .047[.047:.048]
4. Scalar 215694 1116 .921 .052 .049 .470 .168[.167:.169]
4a. Part.Scalar 160436 1104 .941 .045 .044 .571 .089[.088:.091]
5. Lv variance 163046 1111 .940 .045 .060 .566 .067[.065:.069]
6a. Strong SH 313949 1117 .885 .063 .091 .333 .615[.613:.618]
6b. Weak SH 163046 1112 .940 .045 .060 .566 NaN*
BF
1. Configural 109567 946 .960 .040 .027 .682
2. Regression 112939 980 .959 .040 .028 .674 .035[.034:.036]
3. Metric 123859 1061 .955 .040 .037 .649 .040[.039:.040]
3a. Part.Metric 131921 1060 .956 .040 .036 .654 .036[.035:.037]
4. Scalar 151477 1087 .945 .044 .039 .589 .114[.113:.115]
4a. Part.Scalar 129982 1080 .953 .041 .037 .635 .066[.065:.068]
5. Lv variance 133080 1087 .951 .041 .056 .629 .074[.071:.076]
6a. Strong SH 303573 1093 .889 .062 .091 .345 .632[.630:.635]
6b. Weak SH 133902 1088 .951 .041 .057 .627 .102[.095:.108]
6c. No SH 142850 1088 .948 .043 .064 .607 .671[.665:.677]
Note: Models a, b, c are compared to the previous level model.
* NaN results from a chi-square difference that is negative or below 1 (the more constrained model fits as well or better); RMSEAD therefore cannot be computed.
Among whites, metric invariance barely holds, but scalar invariance is so severely rejected in all models that 12 intercepts had to be freed in both the CF and HOF models to achieve invariance; the bifactor model required freeing 1 loading and 7 intercepts. Free parameters were selected on the basis of the largest χ2 values. At this point, comparing latent means becomes pointless because the anchor tests, i.e., the unbiased tests constrained to equality in order to estimate the latent means, no longer constitute the great majority. All subsequent model constraints build on the partial scalar model. The assumption of equal latent variances and covariances (for the CF model) is clearly rejected.
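The partial-invariance search can be sketched as follows: score tests on the scalar model flag the most violated equality constraints, and the offending intercepts are freed one at a time via lavaan’s group.partial argument. The model, data and variable names are illustrative and reuse the truncated specification sketched earlier; S10 is used purely as an example of a freed intercept.
library(lavaan)
# Scalar model for the white male-female comparison.
fit_scalar <- cfa(mod_cf, data = pt_white, group = "sex",
                  sampling.weights = "wt", estimator = "MLR",
                  group.equal = c("regressions", "loadings", "intercepts"))
# Score tests for every equality constraint; the largest X2 values point to
# the most violated constraints (map the .p labels back via parTable()).
score <- lavTestScore(fit_scalar)
head(score$uni[order(-score$uni$X2), ])
parTable(fit_scalar)
# Free the worst-fitting intercept and refit; repeat until the remaining
# constraints are acceptable.
fit_partial <- cfa(mod_cf, data = pt_white, group = "sex",
                   sampling.weights = "wt", estimator = "MLR",
                   group.equal = c("regressions", "loadings", "intercepts"),
                   group.partial = "S10 ~ 1")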
The bifactor fits marginally better than the correlated factors model and, as always, the higher-order model performs much worse. Among the BF models, the no-SH model, which constrains the g factor difference to zero, is strongly rejected, with RMSEAD=.671.
Table 4. Male-female analysis among Blacks
Model Level X² df CFI RMSEA SRMR Mc RMSEAD [CI]
CF
1. Configural 5342 994 .961 .038 .034 .699
2. Regression 5422 1028 .961 .037 .034 .697 .019[.013:.025]
3. Metric 6290 1071 .953 .040 .051 .651 .074[.070:.079]
3a. Part.Metric 5875 1068 .957 .038 .043 .674 .054[.049:.059]
4. Scalar 7468 1096 .943 .044 .053 .592 .130[.124:.135]
4a. Part.Scalar 6324 1089 .953 .040 .044 .650 .083[.077:.090]
5. Lv covar 6828 1104 .949 .041 .077 .625 .087[.079:.095]
6. Lv var-covar 6929 1110 .948 .042 .079 .620 .062[.050:.075]
7. Lv reduced 6929 1111 .948 .041 .079 .620 NaN*
HOF
1. Configural 5869 1012 .957 .040 .036 .671
2. Regression 5951 1046 .956 .039 .037 .668 .020[.014:.026]
3. Metric 7194 1094 .945 .043 .064 .606 .085[.081:.089]
3a. Part.Metric 6469 1089 .952 .040 .046 .643 .056[.052:.061]
4. Scalar 8374 1116 .935 .046 .053 .551 .143[.138:.149]
4a. Part.Scalar 6922 1110 .948 .041 .047 .620 .076[.069:.082]
5. Lv variance 7009 1117 .947 .042 .065 .616 .053[.042:.065]
6a. Strong SH 8675 1123 .933 .047 .076 .537 .305[.292:.317]
6b. Weak SH 7009 1118 .947 .042 .065 .616 NaN*
BF
1. Configural 4623 956 .967 .036 .023 .740
2. Regression 4704 990 .967 .035 .024 .737 .020[.014:.026]
3. Metric 5911 1066 .957 .039 .059 .671 .063[.060:.067]
3a. Part.Metric 5251 1059 .963 .036 .042 .708 .043[.039:.046]
4. Scalar 6498 1086 .952 .040 .046 .641 .114[.108:.119]
4a. Part.Scalar 5690 1082 .959 .037 .042 .685 .068[.062:.074]
5. Lv variance 6281 1089 .954 .040 .064 .653 .173[.162:.184]
6a. Strong SH 9095 1095 .929 .049 .079 .518 .427[.415:.439]
6b. Weak SH 6294 1090 .953 .040 .064 .652 .047[.021:.080]
6c. No SH 6554 1091 .951 .041 .072 .638 .248[.227:.270]
Note: Models a, b, c are compared to the previous level model.
* NaN results from a chi-square difference that is negative or below 1 (the more constrained model fits as well or better); RMSEAD therefore cannot be computed.
Among blacks, neither metric nor scalar invariance meets the cutoff criteria. There are 3/5 non-invariant loadings and 7/6 non-invariant intercepts for the CF/HOF models. The number of freed parameters is large enough to jeopardize latent mean comparisons, but a fair comparison may be possible for the bifactor model, which has 7 non-invariant loadings and only 4 non-invariant intercepts. The assumption of equal latent variances and covariances (for the CF model) is also rejected.
The correlated factors and higher-order models fit equally well, but the bifactor fits considerably better than both. Within the BF model, the no-SH model is strongly rejected, with RMSEAD=.248. Examining the d gaps across latent means, the g factor shows a much smaller difference than the specific abilities.
Thus, Dolan’s approach (comparing g and non-g models) still produces an ambiguous conclusion, while Beaujean’s approach (fixing the g mean difference to zero) seems to support the g model. But it remains to be seen whether the contribution of the g latent mean is substantial.
3.3. Latent mean differences.
This table displays the standardized latent mean differences.
Table 5. d gaps from the best fitting bifactor model per group analysis
Factor BW d (male) BW d (female) sex d (white) sex d (black)
English (fixed to 0) -1.081 2.816 1.810
Math -0.326 (fixed to 0) 0.783 0.808
Speed (fixed to 0) 0.225 0.544 0.281
Info (fixed to 0) -0.679 1.974 1.500
Science -0.897 -0.211 -1.740 -1.361
Spatial -0.430 -0.783 -0.329 -0.179
g -1.502 -1.272 -0.853 -0.554
Note 1: Sex gaps may not be trusted because too many intercepts are freed.
Note 2: Negative values indicate advantage for whites (or males).
In the sex analysis, it was shown earlier that the weak SH model fits better than the no-SH model within a g-model. Upon examining the means, however, it is clear that the relative contribution of g to the various test score differences is small compared to the specific factors (e.g., English, Information, Science). So this does not seem to support the SH model as an account of sex differences. Of course, this conclusion, along with the mean estimates, remains jeopardized by the severity of the non-invariance. An interesting finding is that the HOF model, unlike the BF, yields a sex difference in g that is very close to zero in both the white and black samples.
In the black-white analysis, a few factors had very small group differences, so they were fixed to zero to obtain more parsimonious models. It is clear from the estimates that g accounts for a relatively greater portion of the difference than the specific factors.
3.4. Solving power issues with sliced Black-White samples.
As noted in section 2.7, the larger the sample-size ratio, the stronger the bias toward the null. The white-black ratio in these data is 31:1 for males and 21:1 for females. Using the slice_sample() function from the dplyr package, I drew random samples of whites matching the black sample sizes exactly, yielding 2,443/2,443 for males and 3,642/3,642 for females, with 10 runs for each group.
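A sketch of one such run, assuming the combined male data sit in pt_bw_male with a race column (illustrative names) and reusing the truncated mod_cf specification from the earlier sketch; the drop in CFI between the metric and scalar levels is collected across 10 random draws (sampling weights omitted here for brevity).
library(dplyr)
library(lavaan)
run_once <- function(seed) {
  set.seed(seed)
  # Random white subsample matched in size to the black male sample.
  white_sub <- pt_bw_male %>% filter(race == "white") %>% slice_sample(n = 2443)
  dat <- bind_rows(white_sub, filter(pt_bw_male, race == "black"))
  fit_metric <- cfa(mod_cf, data = dat, group = "race",
                    group.equal = c("regressions", "loadings"))
  fit_scalar <- cfa(mod_cf, data = dat, group = "race",
                    group.equal = c("regressions", "loadings", "intercepts"))
  unname(fitMeasures(fit_metric, "cfi") - fitMeasures(fit_scalar, "cfi"))
}
delta_cfi <- sapply(1:10, run_once)   # CFI drop at the scalar step, per run
summary(delta_cfi)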
Starting with the bad news: examining the unstandardized loadings and intercepts in the free (configural) model, there are small-to-modest variations in the loadings across runs but non-trivial variations in the intercepts. Another issue is that some runs yield worse or better fit for all models (e.g., 0.006 in ΔCFI), more often in the BW female sample. And although the most biased tests are consistent across runs, their effects vary: Mechanics was by far the most biased test (especially among females), yet it sometimes causes a change of only 0.001, and sometimes 0.003, in CFI at the scalar invariance step; 0.003 is considered indicative of bias by Meade et al. (2008).
Now for the good news: across runs, the data rank the models very consistently. For both the BW male and BW female groups, the bifactor model always fits better than the correlated factors model (about 0.010 in ΔCFI), which always fits better than the higher-order model (about 0.005 in ΔCFI). This pattern seemed apparent in the analysis using the full white sample, but there the advantage of the bifactor was too small given the positive bifactor bias reported in recent simulations. Another advantage is that model misfit becomes more visible: the ΔCFI values for metric and scalar invariance in the bifactor model are 0.010 and 0.005 in the male group and 0.015 and 0.010 in the female group, while in the correlated factors model they are 0.007 and 0.011 in the male group and 0.007 and 0.016 in the female group. Finally, in both the BW male and BW female samples, the weak SH model fits much better than the no-SH model within the bifactor, whereas this pattern was barely visible using the full white sample.
So it appears that, while subsampling may occasionally produce larger non-invariance by chance, the simulation studies were probably correct in stating that MGCFA is underpowered when sample-size ratios are high. In the highly unbalanced full samples, the non-trivial bias in the female group was hidden by the lack of sensitivity of the fit measures. Fortunately, because only a small number of tests were biased, the overall battery of 34 tests is not severely affected.
3.5. Method of Correlated Vectors.
Proposed by Jensen (1998, Appendix B), MCV correlates the (sub)tests’ g-loadings (corrected for reliability) with the (sub)tests’ standardized group differences. The results corroborate those from MGCFA.
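A minimal sketch of MCV, assuming a data frame subtest_stats (an illustrative name) holding each subtest’s g-loading, reliability and standardized black-white difference.
# Disattenuate the g-loadings by dividing by the square root of the reliability,
# then correlate them with the standardized group differences.
g_load_corrected <- subtest_stats$g_loading / sqrt(subtest_stats$reliability)
cor(g_load_corrected, subtest_stats$d_bw)                       # Pearson MCV correlation
cor(g_load_corrected, subtest_stats$d_bw, method = "spearman")  # rank-order version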
The correlation between g-loadings and the BW test score differences is high (0.7-0.8) regardless of the loading vector used (white, black, estimated from EFA or from the bifactor model). The correlation is not robust, though, because it is mainly driven by the three speed tests (S35-S37). Removing them yields r=0.30-0.44 in the male sample and r=0.48-0.64 in the female sample. Typically, one should not remove tests without good reason, but the speed tests in this battery are not good measures of intelligence owing to their poor g-loadings. At the same time, the information tests have high g-loadings even though most were not even intended to measure intelligence.
The correlations between g-loadings and sex differences range from -0.2 to +0.1, regardless of which loading vector is used (male, female, estimated from EFA or from the bifactor model). Overall, these results confirm the earlier MGCFA findings. It is important to note, however, that MCV and MGCFA are different kinds of tests and are not directly comparable: MCV is not a latent variable approach and does not handle measurement bias. Unsurprisingly, Lasker et al. (2021) reported conflicting results between MCV and MGCFA.
4. Discussion.
Cognitive/achievement tests were strongly biased with respect to gender and minimally biased with respect to race. Therefore, the test of Spearman’s Hypothesis (SH) was suitable only for the black-white analysis. Two approaches were used. Dolan (2000) proposed model comparison: under this approach, the bifactor g-model fits better than the correlated factors model, but the higher-order g-model fits worse. Frisby & Beaujean (2015) proposed constraining the g factor means within the g-model: fixing the g factor mean difference to zero in the bifactor worsens fit in both the male and female subsamples.
The test of SH is not always consistent with MGCFA (Dolan, 2000; Dolan & Hamaker, 2001; Kane & Oakland, 2010; Frisby & Beaujean, 2015; Lasker et al., 2021), but SH can be tested in different ways. McDaniel & Kepes (2014) evaluated SH by manipulating the g saturation of composite tests and found support for the hypothesis. SH is also supported by the examination of Forward and Backward Digit Span, which shows a Backward Digit Span gap (d=0.5) larger than the Forward Digit Span gap (Jensen, 1998, p. 370). The most powerful and direct way of testing Spearman’s hypothesis is by examining the task complexity of elementary cognitive tasks: Jensen (1998, p. 391) reported a high correlation between task complexity and the magnitude of the black-white gap (r=0.86).
Cultural bias and Spearman’s Hypothesis are two unrelated analyses: one says nothing about the other. A culture-loaded item is biased only if the groups differ in the specific knowledge elicited by the item despite equal latent ability, and only culture bias would undermine group comparison. Malda et al. (2010) misconstrued SH by claiming it is valid only if cultural and cognitive complexity do not covary. Jensen (1998, p. 89) argued that culturally loaded items such as vocabulary require a great deal of fluid ability, because most words in a person’s vocabulary are learned by inferring their meaning through the eduction of relations and correlates; the higher a person’s g, the fewer encounters with a word are needed to infer its meaning correctly. A knowledge gap is therefore a necessary outcome of a g gap (Jensen, 1973, pp. 89-90; 1980, pp. 110, 235). Thus, as long as a culture-loaded item is not culturally biased, its scores are comparable across groups, and one should proceed to test a g model only if the tests are unbiased or minimally biased (using partial invariance). Instead, Malda et al. (2010) relied on a sub-optimal DIF method, an effect size from logistic regression that underestimates DIF; they did not evaluate the items’ culture loading with item characteristic curves (see Jensen, 1980, p. 106); and they did not remove the biased items before testing the effect of complexity on the black (urban and rural Tswana) – white (urban Afrikaans) South African gap. Worst of all, their test of SH versus the cultural hypothesis is highly suspect: instead of measuring cultural complexity by word rarity, they measured it by the group difference in familiarity, and instead of measuring cognitive complexity (SH) by item complexity within tests using a latent variable approach, they measured it by arbitrarily ranking complexity between tests, with memory and attention tests assumed to be low and reasoning tests high in complexity.
Whatever the case, the great majority of studies confirm cross-cultural comparability; the exceptions come mainly from South African samples (Dolan et al., 2004; Lasker, 2021). Given the omnipresent force of mass-market culture in developed countries, it is not surprising that culture bias is rarely noticeable (Rowe et al., 1994, 1995).
Bogus attacks and failed attempts to negate the IQ gap using alternative cognitive tests are nothing new (Jensen, 1973, pp. 299-301; 1980, pp. 518, 522). The most recent attempt at reducing the IQ gap comes from Goldstein et al. (2023), who devised a reasoning test composed of novel tasks that do not require previously learned language and quantitative skills. Because they found a black-white gap ranging between d=0.35 and d=0.48 across their 6 independent samples, far below the typically reported d=1.00, they concluded that traditional IQ tests are biased. First, they carefully ignore measurement invariance studies. Second, their analysis effectively adjusted for socio-economic status because they compared blacks and whites holding the same jobs (police officers, deputy sheriffs, firefighters) within the same cities. Third, they committed the same old fallacy as many anti-IQ crusaders before them: that IQ tests are not comparable as long as they contain some degree of cultural content, and that knowledge causes g differences rather than the opposite. Jensen (1980, p. 234) once noted: “verbal analogies based on highly familiar words, but demanding a high level of relation eduction are loaded on gf, whereas analogies based on abstruse or specialized words and terms rarely encountered outside the context of formal education are loaded on gc.”
The Project Talent administered aptitude tests. They serve as a proxy for cognitive tests, but they are not cognitive tests. For instance, most of the information test items require specific knowledge (e.g., who the hero of the Odyssey was, or what a female horse is called) and do not call for the eduction of relations. Other tests, fortunately, require inductive reasoning. Overall, it is a mixed bag. But what the present study shows, together with previous studies of test bias in the ASVAB (Hu et al., 2019; Lasker et al., 2021), is that aptitude tests produce outcomes similar to those of cognitive tests.
Recommended readings
Beaujean, A. A. (2014). Latent variable modeling using R: A step-by-step guide. Routledge.
Bentler, P. M., & Chou, C. P. (1987). Practical issues in structural modeling. Sociological methods & research, 16(1), 78-117.
Brown, T. A. (2015). Confirmatory factor analysis for applied research. Guilford publications.
Browne, M. W., MacCallum, R. C., Kim, C. T., Andersen, B. L., & Glaser, R. (2002). When fit indices and residuals are incompatible. Psychological methods, 7(4), 403.
Cao, C., & Liang, X. (2022). The impact of model size on the sensitivity of fit measures in measurement invariance testing. Structural Equation Modeling: A Multidisciplinary Journal, 29(5), 744-754.
Chen, F., Bollen, K. A., Paxton, P., Curran, P. J., & Kirby, J. B. (2001). Improper solutions in structural equation models: Causes, consequences, and strategies. Sociological methods & research, 29(4), 468-508.
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural equation modeling, 9(2), 233-255.
Cho, E. (2022). Reliability and omega hierarchical in multidimensional data: A comparison of various estimators. Psychological Methods.
Cole, D. A., Ciesla, J. A., & Steiger, J. H. (2007). The insidious effects of failing to include design-driven correlated residuals in latent-variable covariance structure analysis. Psychological methods, 12(4), 381.
Gignac, G. E. (2005). Revisiting the factor structure of the WAIS-R: Insights through nested factor modeling. Assessment, 12(3), 320-329.
Gignac, G. E. (2006a). The WAIS-III as a nested factors model: A useful alternative to the more conventional oblique and higher-order models. Journal of Individual Differences, 27(2), 73-86.
Greene, A. L., Eaton, N. R., Li, K., Forbes, M. K., Krueger, R. F., Markon, K. E., … & Kotov, R. (2019). Are fit indices used to test psychopathology structure biased? A simulation study. Journal of abnormal psychology, 128(7), 740.
Groskurth, K. (2023). Why one size does not fit all: Evaluating the validity of fixed cutoffs for model fit indices and developing new alternatives. Doctoral dissertation.
Heene, M., Hilbert, S., Draxler, C., Ziegler, M., & Bühner, M. (2011). Masking misfit in confirmatory factor analysis by increasing unique variances: a cautionary note on the usefulness of cutoff values of fit indices. Psychological methods, 16(3), 319.
Hershberger, S. L., & Marcoulides, G. A. (2006). The problem of equivalent structural models. In Hancock, G. R., & Mueller, R. O. (Eds.), Structural equation modeling: A second course (pp. 13-41). IAP Information Age Publishing.
Kline, R. B. (2011). Principles and practice of structural equation modeling. Guilford publications.
Landis, R. S., Edwards, B. D., & Cortina, J. M. (2009). On the practice of allowing correlated residuals among indicators in structured equation models. In C. E. Lance & R. J. Vandenberg (Eds.), Statistical and methodological myths and urban legends (pp. 193–215). Routledge.
Little, T. D. (2013). Longitudinal structural equation modeling. Guilford press.
Lubke, G. H., Dolan, C. V., & Kelderman, H. (2001). Investigating group differences on cognitive tests using Spearman’s hypothesis: An evaluation of Jensen’s method. Multivariate Behavioral Research, 36(3), 299-324.
Lubke, G. H., & Dolan, C. V. (2003). Can unequal residual variances across groups mask differences in residual means in the common factor model?. Structural Equation Modeling, 10(2), 175-192.
MacCallum, R. C., Roznowski, M., & Necowitz, L. B. (1992). Model modifications in covariance structure analysis: the problem of capitalization on chance. Psychological bulletin, 111(3), 490.
Rodriguez, A., Reise, S. P., & Haviland, M. G. (2016). Evaluating bifactor models: Calculating and interpreting statistical indices. Psychological methods, 21(2), 137.
Savalei, V. (2012). The relationship between root mean square error of approximation and model misspecification in confirmatory factor analysis models. Educational and Psychological Measurement, 72(6), 910-932.
Savalei, V., Brace, J. C., & Fouladi, R. T. (2023). We need to change how we compute RMSEA for nested model comparisons in structural equation modeling. Psychological Methods.
Shi, D., DiStefano, C., Maydeu-Olivares, A., & Lee, T. (2022). Evaluating SEM model fit with small degrees of freedom. Multivariate behavioral research, 57(2-3), 179-207.
Sivo, S. A., Fan, X., Witta, E. L., & Willse, J. T. (2006). The search for “optimal” cutoff properties: Fit index criteria in structural equation modeling. The journal of experimental education, 74(3), 267-288.
Xiao, Y., Liu, H., & Hau, K. T. (2019). A comparison of CFA, ESEM, and BSEM in test structure analysis. Structural Equation Modeling: A Multidisciplinary Journal, 26(5), 665-677.
Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological instruments: applications in the substance use domain. In K. J. Bryant, M. Windle, & S. G. West (Eds.), The science of prevention (pp. 281−324). Washington, DC: American Psychological Association.
Zinbarg, R. E., Revelle, W., Yovel, I., & Li, W. (2005). Cronbach’s α, Revelle’s β, and McDonald’s ω H: Their relations with each other and two alternative conceptualizations of reliability. psychometrika, 70, 123-133.
Maybe a look at measurement invariance for other SIRE groups could be a future blog post/paper, given that there's been some controversy over the extent of the Asian-white difference in early samples. (James Flynn found that in the mid-20th century East Asian immigrants in the US scored on par with white Americans on IQ tests yet performed significantly above them academically and socioeconomically; Cremieux thinks the difference in samples like Project Talent is mostly due to language/test bias (believable, and with some evidence in Jensen 1973, https://twitter.com/cremieuxrecueil/status/1661449979116806148), that the unbiased gap is the same as in modern samples, and that geographical factors also affected their success.)
If there's subtest data and a large sibling/twin sample, could you have tested between-group heritability directly (using a common pathway model and, if it fits, including group differences)?