Using the General Social Survey data, I try to investigate which factors best predict trust in other people. The present analysis has been edited since I obtained the SPSS data file, available here (click on “Cumulative Data Set (cross-sectional samples from all years): GSS 1972-2012 Cross-Sectional Cumulative Data (Release 2, June 2013)”). All of the following is done in SPSS rather than in the SDA program. For this purpose, I use binary (binomial) logistic regression (this webpage contains an introduction to the topic for SPSS). The dependent variable is trust; the independent variables are race (dichotomized), sex, age, survey year, vocabulary score (WORDSUM), real income, political views, and religious attendance. A short description follows:
TRUST. Generally speaking, would you say that most people can be trusted or that you can’t be too careful in life? 1 = Can trust, 2 = Cannot trust, 3 = Depends. I give the syntax at the end of the post. I don’t think the variable can be recoded like this in the SDA program. What I did is remove 3 = Depends, because this answer is highly ambiguous. My expectation is that people who don’t really know how to respond, or who have no firm opinion, choose this reply.
SEX. 1 = MALE, 2 = FEMALE.
WORDSUM. Vocabulary test (a proxy for IQ, correlation = 0.71; 0.83 for g). It should not be taken as a measure of general intelligence, however. It is a ten-item variable with a rather low reliability of about 0.73. See “Reliability and Stability Estimates for the GSS Core Items from the Three-wave Panels, 2006–2010” (Michael Hout & Orestes P. Hastings, 2012).
REALINC. Family income on 1972-2006 surveys in constant dollars (base = 1986).
POLVIEWS. 1 = Extremely liberal, 4 = Moderate, 7 = Extremely conservative. We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal – point 1 – to extremely conservative – point 7. Where would you place yourself on this scale?
ATTEND. 0 = Never, 8 = More than once a week. How often do you attend religious services?
AGE. Respondent’s age.
RACE. 1 = White, 2 = Black, 3 = Other.
Since the dependent variable is coded as 0 = cannot trust and 1 = can trust, a positive coefficient for an independent (predictor) variable denotes a positive relationship with trust. Concerning the race variable, I recoded it so as to remove all cases with a value of 3. The variable is thus dichotomized as 1 = white, 2 = black, so its coefficient shows the effect of being black as opposed to being white. Also, because the GSS questions have been administered over many years, it is necessary to control for the survey year.
Now, the main problem with logistic regression output is that the coefficients (the “B” column) are initially unstandardized, meaning that the relative contributions of the predictors cannot be compared with one another since their scales (see above) differ. A one-unit change in an independent variable that can take on many values (for instance, year, age, or income) covers only a small part of its range and so has very little effect on the dependent variable. That is why an independent variable with few scale points (say, 2) is expected to have a larger unstandardized coefficient than one with many scale points (say, 10). Following Osborne (2012), I decided to standardize the independent variables (except for the dichotomous ones). This method sounds reasonable because it expresses all these variables in standard deviation units. In other words, the column “B” denotes the change in the log-odds of the dependent variable for a one-SD increase in the predictor.
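To make the standardization step concrete, here is a minimal Python sketch (not the SPSS procedure itself). It assumes z-scores are computed with the sample standard deviation (n - 1 denominator), which is my understanding of what DESCRIPTIVES /SAVE does. The ages and the raw coefficient are hypothetical values chosen for illustration, not GSS results.

```python
import math

def zscore(values):
    """Return (z-scores, sample SD) using the n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values], sd

# Hypothetical ages, for illustration only
ages = [20, 30, 40, 50, 60]
z_ages, sd_age = zscore(ages)

# Hypothetical raw logit coefficient: change in log-odds per year of age
b_raw = 0.01
# The same effect expressed per one-SD increase in age
b_per_sd = b_raw * sd_age

print(round(sd_age, 3), round(b_per_sd, 3))
```

The point of the rescaling is only that a coefficient per SD is the raw coefficient multiplied by the predictor's SD, so predictors measured on long scales stop looking artificially weak.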
Here is the coefficients table (for N=12143) :
Wald is a kind of chi-square statistic. We see that the Wald statistics tend to be higher for the most significant predictors. Exp(B) is the odds ratio, that is, the factor by which the odds of trusting are multiplied for a one-unit increase in the independent variable (here, one SD for the standardized predictors). An odds ratio of 1 denotes no impact, a value higher than 1 means that the predictor increases the odds (and hence the logit), and a value less than 1 means that it decreases them.
There might be a lot of reasons for trusting or distrusting others: perhaps anxiety, or simply that some people dislike those around them. In any case, being black, being a woman, and being conservative (as opposed to liberal) are negatively associated with trusting others, while being old, (verbally) intelligent, rich, and religious tends to enhance trust in other people. The negative sign for YEAR shows that trust among people tends to fade over time. A first possible explanation is that multiculturalism, along with the rise of immigration, accounts for this decline. A second is that the rise in income inequality accounts for it.
This aside, we might be interested in the R² value for this set of predictors, which, to recall, indicates the percentage of variance explained by the set of independent variables (or by the model, roughly speaking). But R² is strictly appropriate only for linear regression, with its continuous dependent variable (not a dichotomous one, as in logistic regression). Logistic regression in SPSS reports instead two pseudo-R² indices, which express the proportional reduction in unexplained variation obtained by adding the predictors to the model. As explained in this page, the Cox & Snell pseudo R² has the problem that its maximum value is less than 1. Nagelkerke’s value is a rescaled version of Cox & Snell that can reach 1; this is why it is usually higher, and it should be used instead. My SPSS output reveals values of 0.130 and 0.175, respectively.
Another statistic is the model fit, which is assessed by the Hosmer & Lemeshow goodness-of-fit chi-square index. The problem with chi-square, however, is that it is very sensitive to sample size. My H&L table shows the following: Chi-square = 10.396, df = 8, Sig. = 0.238. In the case of the H&L test, the p-value must be higher (not lower) than the usual cut-off of 0.05 to indicate reasonable model fit. A lower p-value would indicate misspecification, meaning that the model predicts values significantly different from what we actually observe. With a sample as large as the GSS cumulative data set, the p-value is likely to be significant even for trivial misfit, so we shouldn’t put too much weight on this test.
I also display the above plot, a sort of histogram that shows how well the full model predicts group membership. The x (horizontal) axis is the predicted probability, running from 0 to 1, and each plotted symbol (0 or 1) is the observed value of the dependent (dichotomized) variable for a group of cases. Cases with an observed value of 0 should appear on the left-hand side and those with a value of 1 on the right-hand side. When the points cluster at the center of the graph, those cases have about a 50/50 chance of being correctly predicted by the model. The more accurate the model, the further apart the two groups move from each other and the emptier the middle of the graph; there will be less misclassification as well.
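As a reminder of how these columns relate, the short Python sketch below computes the Wald statistic as (B / SE)² and Exp(B) as the exponential of B, so that B = 0 corresponds to an odds ratio of exactly 1. The coefficient, standard error, and baseline probability are hypothetical, not values from my output, and the helper names are mine.

```python
import math

def odds_ratio(b):
    """Exp(B): multiplicative change in the odds per one-unit increase."""
    return math.exp(b)

def wald(b, se):
    """Wald chi-square (1 df) for a single coefficient."""
    return (b / se) ** 2

def shift_probability(p, b):
    """Probability after applying a log-odds change b to baseline probability p."""
    odds = p / (1.0 - p) * math.exp(b)
    return odds / (1.0 + odds)

# Hypothetical coefficient and standard error, for illustration only
b, se = 0.30, 0.05
print(round(odds_ratio(b), 3), round(wald(b, se), 1))
# Starting from a hypothetical 40% baseline probability of trusting
print(round(shift_probability(0.40, b), 3))
```

Because the odds ratio is multiplicative, the same B moves the probability by different amounts depending on the baseline, which is why Exp(B) is read on the odds scale rather than as a change in percentage points.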
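For readers who want to see why Cox & Snell cannot reach 1, here is a small Python sketch of the standard formulas, written in terms of the log-likelihoods of the intercept-only and fitted models (SPSS reports -2 times these). The sample size and log-likelihood values are hypothetical and are not taken from my output.

```python
import math

def cox_snell_r2(ll_null, ll_model, n):
    """Cox & Snell pseudo R2 from the two log-likelihoods and sample size n."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    """Cox & Snell divided by its maximum attainable value."""
    max_cs = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_cs

# Hypothetical values, for illustration only
n, ll_null, ll_model = 1000, -673.0, -600.0
cs = cox_snell_r2(ll_null, ll_model, n)
nk = nagelkerke_r2(ll_null, ll_model, n)
print(round(cs, 3), round(nk, 3))
```

The ceiling of Cox & Snell depends only on the intercept-only model, that is, on how evenly the outcome is split, and Nagelkerke simply divides by that ceiling.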
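The reported significance can be checked by hand. The Python sketch below reads the chi-square value as 10.396 (the figure written with a decimal comma above) and uses the closed-form upper-tail probability of the chi-square distribution, which holds only for even degrees of freedom. The function name is mine.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for chi-square with even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    k = df // 2
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    total = sum(half ** i / math.factorial(i) for i in range(k))
    return math.exp(-half) * total

p_value = chi2_sf_even_df(10.396, 8)
print(round(p_value, 3))
```

The result comes out at about 0.238, in line with the Sig. value in the H&L table.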
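The classification idea behind this plot can be sketched in a few lines of Python. Each case gets a predicted probability, and the CUT(0.5) criterion in the syntax turns it into a predicted 0 or 1. The probabilities and outcomes here are hypothetical, and treating a probability exactly equal to the cut-off as a predicted 1 is my assumption about ties.

```python
def classification_table(probs, outcomes, cut=0.5):
    """Cross-tabulate observed vs predicted membership; return (table, accuracy)."""
    # Keys are (observed, predicted)
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for p, y in zip(probs, outcomes):
        predicted = 1 if p >= cut else 0
        table[(y, predicted)] += 1
    correct = table[(0, 0)] + table[(1, 1)]
    return table, correct / len(outcomes)

# Hypothetical predicted probabilities and observed outcomes
probs = [0.10, 0.35, 0.48, 0.52, 0.55, 0.80]
outcomes = [0, 0, 1, 0, 1, 1]
table, accuracy = classification_table(probs, outcomes)
print(table, round(accuracy, 3))
```

In this toy example the two misclassified cases are precisely those with probabilities close to 0.5, which is what a crowded middle of the plot signals.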
Regarding the choice of sampling weight, in the GSS codebook, we read :
Due to the adoption of the non-respondent, sub-sampling design described above, a weight must be employed when using the 2004-08 GSSs. One possibility is to use the variable PHASE and weight by it so that the sub-sampled cases were properly represented. If one wanted to maintain the original sample size, one would weight by PHASE*0.86258 in 2004 and PHASE*.80853 in 2006. This weight would only apply to 2004-08 and would not take into account the number of adults weight discussed above. As such, it would be appropriate for generalizing to households and not to adults.
A second possibility is to use the variable WTSS. This variable takes into consideration a) the sub-sampling of non-respondents, and b) the number of adults in the household. It also essentially maintains the original sample size. In years prior to 2004+ a one is assigned to all cases so they are effectively unweighted. To adjust for number of adults in years prior to 2004, a number of adults weight would need to be utilized as described above. WTSSALL takes WTSS and applies an adult weight to years before 2004.
A third possibility is to use the variable WTSSNR. It is similar to WTSS, but adds in an area nonresponse adjustment. Thus, this variable takes into consideration a) the sub-sampling of nonrespondents, b) the number of adults in the household, and c) differential non-response across areas. It also essentially maintains the original sample size.
As with WTSS, WTSSNR has a value of one assigned to all pre-2004 cases and as such they are effectively unweighted. Number of adults can be utilized to make this adjustment for years prior to 2004, but no area non-response adjustment is possible prior to 2004.
For all these reasons, I use WTSSALL rather than WTSSNR. WTSSNR is only operative from 2004 onward; before 2004 it gives a weight of 1 to all cases, which is as if we did not weight at all. Note that the weighted sample size is not exactly the same as the original one, as far as I can see, but it is very close, so that is of no concern. Generally, whether or not a weight is applied, and whichever weight is chosen, the outcome is not affected.
Concerning the syntax below, I noticed no change in the logistic regression output whether I weight just before or just after transforming the predictor variables into z-standardized variables. I also display below the syntax for running the analysis within the white and black samples, respectively. Within the white sample (N=10450), the results do not differ from what is observed above. Within the black sample (N=1693), the only changes are a near-zero coefficient for year and a very strong positive relationship between real income and trust.
I also display the computation of the z-scored SQRT real income because, as usual, income variables in survey data are not normally distributed, so we might want to compare the effects. The real income variable used above has not been SQRT-transformed, but the result is not altered after the SQRT transformation. In linear regression, from what I have seen so far, there is generally some alteration, with a slightly lower impact of income when it is not SQRT-transformed to bring it closer to normality. On the other hand, logistic regression is far less restrictive, as it makes no assumption about the distribution of the independent variables, nor about homoscedasticity.
It might be worth using the Zcohort variable I created in order to check for a cohort effect. When the cohort variable is included instead of the survey year variable, the sign of Zage changes: being old is now negatively related to trust once cohort is taken into account. A refresher: a cohort effect reflects a social change or common experience that characterizes people born at a particular point in time. A period effect is a change, such as a particular historical event, that occurs at a particular time and affects all age groups and cohorts uniformly. Cohort effects and period effects therefore do not necessarily have the same meaning.
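To illustrate what the square-root transformation does to a right-skewed variable, the Python sketch below compares the skewness of a set of incomes before and after taking square roots. The incomes are hypothetical, not GSS values, and the skewness formula is the simple moment-based one.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed incomes, for illustration only
incomes = [5000, 8000, 12000, 15000, 20000, 26000, 40000, 90000]
sqrt_incomes = [math.sqrt(v) for v in incomes]

skew_raw = skewness(incomes)
skew_sqrt = skewness(sqrt_incomes)
print(round(skew_raw, 2), round(skew_sqrt, 2))
```

The transformed variable is still skewed to the right in this toy example, only less so, which is one reason to inspect the histogram after transforming rather than assume normality.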
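One possible reading of the sign change is purely algebraic. If cohort is the year of birth, then cohort = year - age, so a model with age and cohort is a re-expression of a model with age and year. The Python sketch below shows this for raw (unstandardized) coefficients, with the intercept and the other predictors left out. The coefficient values are hypothetical, and the assumption that cohort equals year minus age exactly is mine.

```python
def linear_predictor_year(age, year, b_age, b_year):
    """Age + period specification (intercept and other predictors omitted)."""
    return b_age * age + b_year * year

def linear_predictor_cohort(age, year, b_age_c, b_cohort):
    """Age + cohort specification, assuming cohort = year - age."""
    cohort = year - age
    return b_age_c * age + b_cohort * cohort

def reparametrize(b_age, b_year):
    """Age+cohort coefficients that reproduce a given age+year model."""
    return b_age + b_year, b_year

# Hypothetical raw coefficients: positive age effect, negative year effect
b_age, b_year = 0.010, -0.015
b_age_c, b_cohort = reparametrize(b_age, b_year)
print(round(b_age_c, 3), round(b_cohort, 3))

# Both specifications give the same linear predictor for any respondent
for age, year in [(25, 1980), (60, 1980), (25, 2010), (60, 2010)]:
    a = linear_predictor_year(age, year, b_age, b_year)
    c = linear_predictor_cohort(age, year, b_age_c, b_cohort)
    assert abs(a - c) < 1e-9
```

Under these assumptions, the age coefficient in the cohort model equals the age coefficient plus the year coefficient of the period model, so a negative year effect that is larger in absolute value than the age effect is enough to flip the sign of age. With z-scored predictors the same logic applies after rescaling by the standard deviations.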
SPSS syntax :
RECODE TRUST (2=0) (1=1) (ELSE=SYSMIS) INTO TRUST_DICHOTOMIZED.
EXECUTE.
RECODE race (1=1) (2=2) (ELSE=SYSMIS) INTO RACE_DICHOTOMIZED.
EXECUTE.
COMPUTE SQRTrealinc=SQRT(realinc).
VARIABLE LABELS SQRTrealinc 'square root of R income in constant dollars'.
EXECUTE.
DESCRIPTIVES VARIABLES=age year COHORT WORDSUM SEI realinc SQRTrealinc POLVIEWS ATTEND
/SAVE
/STATISTICS=MEAN STDDEV MIN MAX.
FREQUENCIES VARIABLES=Zsei Zrealinc SQRTrealinc ZSQRTrealinc Zage Zyear Zwordsum Zpolviews Zattend
/FORMAT=NOTABLE
/HISTOGRAM NORMAL
/ORDER=ANALYSIS.
COMPUTE wtssall_oversamp=wtssall*oversamp.
EXECUTE.
WEIGHT BY wtssall_oversamp.
LOGISTIC REGRESSION VARIABLES TRUST_DICHOTOMIZED
/METHOD=ENTER RACE_DICHOTOMIZED sex Zage Zyear Zwordsum Zrealinc Zpolviews Zattend
/CLASSPLOT
/PRINT=GOODFIT CORR ITER(1) CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
WEIGHT BY wtssall.
USE ALL.
COMPUTE filter_$=(race=1).
VARIABLE LABELS filter_$ 'race=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
LOGISTIC REGRESSION VARIABLES TRUST_DICHOTOMIZED
/METHOD=ENTER sex Zage Zyear Zwordsum Zrealinc Zpolviews Zattend
/CLASSPLOT
/PRINT=GOODFIT CORR ITER(1) CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
WEIGHT BY wtssall_oversamp.
USE ALL.
COMPUTE filter_$=(race=2).
VARIABLE LABELS filter_$ 'race=2 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
LOGISTIC REGRESSION VARIABLES TRUST_DICHOTOMIZED
/METHOD=ENTER sex Zage Zyear Zwordsum Zrealinc Zpolviews Zattend
/CLASSPLOT
/PRINT=GOODFIT CORR ITER(1) CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
WEIGHT OFF.
FILTER OFF.
USE ALL.
EXECUTE.