Using the General Social Survey data, I try to investigate which factor is the strongest determinant of health. The present analysis has been edited since I obtained the data in SPSS format, available here (click on “Cumulative Data Set (cross-sectional samples from all years): GSS 1972-2012 Cross-Sectional Cumulative Data (Release 2, June 2013)”). All of the following is done in SPSS rather than in the SDA program. For this purpose, I use binary (binomial) logit regression (this webpage contains an introduction to this topic for SPSS). The dependent variable is health; the independent variables are race (dichotomized), sex, age, survey year, vocabulary score (WORDSUM), real income, political views, and religious attendance. A short description follows:
HEALTH. Would you say your own health, in general, is excellent, good, fair, or poor? 1 = Excellent, 2 = Good, 3 = Fair, 4 = Poor. In the SDA program, I had recoded it as follows: HEALTH (d:1-2). The dummy variable thus created gives a code of 1 to the cases coded 1 through 2 and a code of 0 to the cases coded 3 through 4. For the SPSS syntax, see below.
SEX. 1 = MALE, 2 = FEMALE.
WORDSUM. Vocabulary test (a proxy for IQ, correlation = 0.71; 0.83 for g). It should not, however, be taken as a measure of general intelligence. It is a ten-item variable with a rather low reliability of about 0.73. See “Reliability and Stability Estimates for the GSS Core Items from the Three-wave Panels, 2006–2010” (Michael Hout & Orestes P. Hastings, 2012).
REALINC. Family income on 1972-2006 surveys in constant dollars (base = 1986).
POLVIEWS. 1 = Extremely liberal, 4 = Moderate, 7 = Extremely conservative. We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal – point 1 – to extremely conservative – point 7. Where would you place yourself on this scale?
ATTEND. 0 = Never, 8 = More than once a week. How often do you attend religious services?
AGE. Respondent’s age.
RACE. 1 = White, 2 = Black, 3 = Other.
Initially, higher values of the health variable denote poorer health. I have recoded it so that higher values denote better health. Therefore, a positive sign for a predictor means that as the values of that predictor increase, self-assessed health increases as well.
Concerning the race variable, I recoded it so as to remove all cases having a value of 3. The variable has been dichotomized as 1 = blacks, 2 = whites, so that a higher value shows the effect of being white as opposed to being black. See the ‘BW’ variable below.
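For readers who do not use SPSS, the two recodes can be sketched in Python. This is only an illustration of the logic, not the syntax I actually ran (which is given further below); the function names are hypothetical, and missing values are represented by None.

```python
from typing import Optional


def dichotomize_health(health: Optional[int]) -> Optional[int]:
    """Recode HEALTH so that a higher value means better health.

    1-2 (excellent/good) -> 1, 3-4 (fair/poor) -> 0, anything else -> missing.
    """
    if health in (1, 2):
        return 1
    if health in (3, 4):
        return 0
    return None


def recode_bw(race: Optional[int]) -> Optional[int]:
    """Recode RACE into BW: 1 (white) -> 2, 2 (black) -> 1, other -> missing."""
    if race == 1:
        return 2
    if race == 2:
        return 1
    return None


# A few hypothetical respondents: (HEALTH, RACE)
respondents = [(1, 1), (4, 2), (2, 3), (None, 1)]
recoded = [(dichotomize_health(h), recode_bw(r)) for h, r in respondents]
```

Cases coded 3 on RACE, and any missing answers, drop out of the regression because they end up as missing.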
Because GSS questions have been administered over several years, it is necessary to control for the survey year effect.
Now, the main problem with logistic regression is that it initially reports unstandardized B coefficients, meaning that the relative contribution of the predictors cannot be compared, since their scales (see above) are different. For instance, a one-unit change in an independent variable that can take on many values (for instance, year, age, or income) would have very little effect on the dependent variable. That is why an independent variable with few scale points (say, 2) is expected to have a higher unstandardized coefficient than an independent variable with many scale points (say, 10). Following Osborne (2012), I decided to standardize the independent variables (except for the dichotomized ones). This method sounds reasonable because it expresses all these variables in standard deviation units. In other words, the column “B” denotes the change in the logit of the dependent variable for a one-SD increase in the predictor. More generally, Field (2009) is worth reading on logistic regression.
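To make the effect of standardization concrete, here is a small Python sketch. It assumes z-scores are computed with the sample standard deviation (n - 1 denominator), and all numbers are made up for illustration. Since x = mean + SD × z, the logit a + b·x can be rewritten as (a + b·mean) + (b·SD)·z, so the coefficient per SD is simply the coefficient per raw unit multiplied by the SD.

```python
import statistics


def zscores(values):
    """Return z-scores using the sample SD (n - 1 denominator)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def coefficient_per_sd(b_per_unit, sd):
    """Convert a logit coefficient per raw unit into a coefficient per SD."""
    return b_per_unit * sd


# Hypothetical numbers, not GSS estimates.
ages = [22, 35, 41, 58, 63, 77]
z_ages = zscores(ages)

b_age_per_year = -0.02  # hypothetical unstandardized coefficient
b_age_per_sd = coefficient_per_sd(b_age_per_year, statistics.stdev(ages))
```

A small coefficient per year of age can thus turn into a sizeable coefficient per SD of age, which is what makes the predictors comparable.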
Here is the coefficients table (for N=14741) :
Wald is a kind of Chi-square statistic. We see that the Wald statistics tend to be higher for the most significant predictors. Exp(B) is the odds ratio, that is, the factor by which the odds change for a one-unit increase in the independent variable (one SD for the standardized predictors). An odds ratio of 1 denotes no impact, a value higher than 1 means that the predictor increases the logit, and a value less than 1 means that it decreases the logit.
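The relation between B, its standard error, Wald and Exp(B) can be sketched in a few lines of Python. The B and S.E. values below are hypothetical, not taken from my output, and the 1.96 multiplier assumes the usual normal approximation for a 95% interval.

```python
import math


def odds_ratio(b):
    """Exp(B): multiplicative change in the odds for a one-unit increase."""
    return math.exp(b)


def wald(b, se):
    """Wald chi-square (1 df) for a single coefficient: (B / S.E.) squared."""
    return (b / se) ** 2


def odds_ratio_ci(b, se, z=1.96):
    """Approximate 95% confidence interval for Exp(B)."""
    return math.exp(b - z * se), math.exp(b + z * se)


# Hypothetical coefficient and standard error.
b, se = 0.40, 0.10
print(odds_ratio(b), wald(b, se), odds_ratio_ci(b, se))
```

With B = 0 the odds ratio is exactly 1, which is why 1 (and not 0) is the value of no effect for Exp(B).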
As expected, being (verbally) intelligent, rich, and religious is associated with good health. Being black and old is associated with worse self-reported health. Survey year, gender, and political views have virtually no impact whatsoever.
This aside, we might be interested in the R² value for this set of predictors, which is, to recall, an indicator of the percentage of variance explained by the set of independent variables (or by the model, roughly speaking). But normally, R² is a statistic appropriate only for linear regression, with its continuous dependent variable (not dichotomized, as is the case in logistic regression). Logistic regression reports instead two pseudo R² indices. They indicate the proportion of unexplained variance that is reduced by adding variables to the model. As explained in this page, the Cox & Snell pseudo R² has the problem that its maximum value is less than 1. This is why Nagelkerke’s value (a modified version of Cox & Snell) is usually higher, and should be used instead. My SPSS output reveals values of 0.091 and 0.143, respectively.
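For reference, both indices can be computed from the log-likelihoods of the intercept-only model and of the full model. The Python sketch below uses the standard textbook formulas; the log-likelihood values are hypothetical, since I do not reproduce them from my output here.

```python
import math


def cox_snell_r2(ll_null, ll_model, n):
    """Cox & Snell pseudo R2 from the null and full model log-likelihoods."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke pseudo R2: Cox & Snell divided by its maximum possible value."""
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_cox_snell


# Hypothetical log-likelihoods and sample size.
ll_null, ll_model, n = -100.0, -90.0, 200
cs = cox_snell_r2(ll_null, ll_model, n)
nk = nagelkerke_r2(ll_null, ll_model, n)
```

Because Nagelkerke is just Cox & Snell rescaled by its ceiling, the ratio of the two reported values (0.091 / 0.143, about 0.64) gives the implied maximum of Cox & Snell in this sample.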
Another statistic is the model fit, which is assessed by the Hosmer & Lemeshow goodness-of-fit Chi-square index. The problem with Chi-square, however, is that it is very sensitive to sample size. No wonder my H&L statistics table shows the following: Chi-square = 42.596, df = 8, Sig. = 0.000. In the case of the H&L goodness of fit, the p-value should be higher (not lower) than the usual cut-off of 0.05 to indicate reasonable model fit. Taken at face value, this indicates misspecification, meaning that the model predicts values significantly different from what we actually observe. But because the sample size is pretty large in the present data set, the p-value was likely to be highly significant anyway. We shouldn’t put too much weight on it. Perhaps a better fit assessment is the classification table. An accurate model will predict values close to the observed values. The total percentage correct was 79.7%. Not high.
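The classification table itself is easy to reproduce by hand. The Python sketch below uses a cut-off of 0.5, as in my syntax, and made-up data; the rule for a predicted probability exactly equal to the cut-off is an assumption on my part. It also computes the percentage correct obtained by always predicting the most frequent category, which is a useful yardstick for judging a figure such as 79.7%.

```python
def classification_table(observed, predicted_prob, cut=0.5):
    """Count cases by (observed, predicted) category at the given cut-off."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for y, p in zip(observed, predicted_prob):
        predicted = 1 if p > cut else 0  # tie rule at the cut-off is assumed
        counts[(y, predicted)] += 1
    return counts


def percent_correct(counts):
    """Overall percentage of cases on the diagonal of the table."""
    total = sum(counts.values())
    return 100.0 * (counts[(0, 0)] + counts[(1, 1)]) / total


def baseline_percent_correct(observed):
    """Percentage correct when always predicting the most frequent category."""
    share_of_ones = sum(observed) / len(observed)
    return 100.0 * max(share_of_ones, 1.0 - share_of_ones)


# Hypothetical observed outcomes and predicted probabilities.
observed = [1, 1, 1, 0, 0]
probs = [0.9, 0.8, 0.4, 0.3, 0.6]
table = classification_table(observed, probs)
```

If the overall percentage correct is no better than this baseline, the predictors add little to classification, whatever their significance.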
I also display the above plot, which is a sort of histogram that shows how the full model predicts group membership. The x (i.e., horizontal) axis is the predicted probability, running from 0 to 1, and each case is plotted with a symbol for its observed value (0 or 1) on the dichotomized dependent variable. The cases with an observed value of 0 should appear on the left-hand side and those with a value of 1 on the right-hand side. When the points are clustered at the center of the graph, those cases had about a 50/50 chance of being correctly predicted by the model. The more accurate the model, the further apart the two groups move from each other and the emptier the middle of the graph. When the model is accurate, there will be less misclassification as well.
Regarding the choice of sampling weight, in the GSS codebook, we read :
Due to the adoption of the non-respondent, sub-sampling design described above, a weight must be employed when using the 2004-08 GSSs. One possibility is to use the variable PHASE and weight by it so that the sub-sampled cases were properly represented. If one wanted to maintain the original sample size, one would weight by PHASE*0.86258 in 2004 and PHASE*.80853 in 2006. This weight would only apply to 2004-08 and would not take into account the number of adults weight discussed above. As such, it would be appropriate for generalizing to households and not to adults.
A second possibility is to use the variable WTSS. This variable takes into consideration a) the sub-sampling of non-respondents, and b) the number of adults in the household. It also essentially maintains the original sample size. In years prior to 2004+ a one is assigned to all cases so they are effectively unweighted. To adjust for number of adults in years prior to 2004, a number of adults weight would need to be utilized as described above. WTSSALL takes WTSS and applies an adult weight to years before 2004.
A third possibility is to use the variable WTSSNR. It is similar to WTSS, but adds in an area nonresponse adjustment. Thus, this variable takes into consideration a) the sub-sampling of nonrespondents, b) the number of adults in the household, and c) differential non-response across areas. It also essentially maintains the original sample size.
As with WTSS, WTSSNR has a value of one assigned to all pre-2004 cases and as such they are effectively unweighted. Number of adults can be utilized to make this adjustment for years prior to 2004, but no area non-response adjustment is possible prior to 2004.
For all these reasons, I obviously use WTSSALL, because WTSSNR is only operative from 2004 onward; before 2004 it gives a weight of 1 to all cases, which amounts to not weighting at all. Note that the weighted sample size is not exactly the same as the original, as far as I can see, but it is very close, so that is of no concern. Generally, whether or not one weights, and whichever weight is chosen, the outcome is not affected. Read the complete GSS codebook for more information on weights.
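The idea of a weight that "maintains the original sample size" can be illustrated with a short Python sketch: the weights are multiplied by a constant so that they sum to the number of cases, which leaves weighted proportions unchanged. This is a generic illustration with made-up weights, not the GSS procedure itself, and the function names are hypothetical.

```python
def rescale_weights(weights):
    """Multiply weights by a constant so that they sum to the number of cases."""
    factor = len(weights) / sum(weights)
    return [w * factor for w in weights]


def weighted_share(values, weights):
    """Weighted proportion of cases with value 1 on a 0/1 variable."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical 0/1 outcome and raw weights.
good_health = [1, 0, 1, 0]
raw_weights = [1.0, 2.0, 1.0, 2.0]
scaled_weights = rescale_weights(raw_weights)
```

Rescaling changes the weighted N (and therefore the standard errors), but not the weighted proportions.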
Concerning the syntax below, I noticed no change in the logistic regression output whether I weight just before or just after transforming the predictor variables into z-standardized variables. I also display below the syntax for doing the analyses within the black and white populations, respectively. Within the black population (N=2240), the coefficient for sex is moderately negative; those for year, wordsum, and attending are moderately positive; age is strongly negative; income is strongly positive; and political views is zero. Within the white population (N=12501), the numbers are nearly all the same as the ones displayed above.
I also display the computation of the z-scored SQRT real income because, as usual, income variables in survey data are not normally distributed, so we might want to compare the effect. The real income variable used above has not been SQRT-transformed, but the result is not altered after SQRT transformation. Generally, in linear regression, from what I have seen so far, there is some alteration, tending towards a lower impact of income when it is not SQRT-transformed to bring it closer to a normal distribution. On the other hand, logistic regression is far less restrictive, as it makes no assumption about the distribution of the independent variables, nor about homoscedasticity.
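To show what the SQRT transformation does to a right-skewed variable, here is a minimal Python sketch with made-up incomes (not GSS values). Skewness is computed as the third standardized moment, which is one common definition among several.

```python
import math
import statistics


def skewness(values):
    """Third standardized moment (population version)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed incomes.
incomes = [5000, 8000, 12000, 15000, 20000, 30000, 45000, 120000]
sqrt_incomes = [math.sqrt(v) for v in incomes]

print(skewness(incomes), skewness(sqrt_incomes))
```

The square root pulls in the long right tail, so the transformed variable is less skewed, although not necessarily normal.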
SPSS syntax :
RECODE HEALTH (1 thru 2=1) (3 thru 4=0) INTO HEALTH_DICHOTOMIZED.
EXECUTE.
RECODE race (1=2) (2=1) (ELSE=SYSMIS) INTO BW.
EXECUTE.
COMPUTE SQRTrealinc=SQRT(realinc).
VARIABLE LABELS SQRTrealinc 'square root of R income in constant dollars'.
EXECUTE.
* The SAVE subcommand stores a z-scored copy of each listed variable (Zage, Zyear, and so on).
DESCRIPTIVES VARIABLES=age year COHORT WORDSUM SEI realinc SQRTrealinc POLVIEWS ATTEND
/SAVE
/STATISTICS=MEAN STDDEV MIN MAX.
FREQUENCIES VARIABLES=Zsei Zrealinc SQRTrealinc ZSQRTrealinc Zage Zyear Zwordsum Zpolviews Zattend
/FORMAT=NOTABLE
/HISTOGRAM NORMAL
/ORDER=ANALYSIS.
COMPUTE wtssall_oversamp=wtssall*oversamp.
EXECUTE.
WEIGHT BY wtssall_oversamp.
LOGISTIC REGRESSION VARIABLES HEALTH_DICHOTOMIZED
/METHOD=ENTER BW sex Zage Zyear Zwordsum Zrealinc Zpolviews Zattend
/CLASSPLOT
/PRINT=GOODFIT CORR ITER(1) CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
USE ALL.
COMPUTE filter_$=(BW=1).
VARIABLE LABELS filter_$ 'BW=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
FREQUENCIES VARIABLES=Zsei Zrealinc SQRTrealinc ZSQRTrealinc Zage Zyear Zwordsum Zpolviews Zattend
/FORMAT=NOTABLE
/HISTOGRAM NORMAL
/ORDER=ANALYSIS.
LOGISTIC REGRESSION VARIABLES HEALTH_DICHOTOMIZED
/METHOD=ENTER sex Zage Zyear Zwordsum Zrealinc Zpolviews Zattend
/CLASSPLOT
/PRINT=GOODFIT CORR ITER(1) CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
WEIGHT BY wtssall.
USE ALL.
COMPUTE filter_$=(BW=2).
VARIABLE LABELS filter_$ 'BW=2 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
FREQUENCIES VARIABLES=Zsei Zrealinc SQRTrealinc ZSQRTrealinc Zage Zyear Zwordsum Zpolviews Zattend
/FORMAT=NOTABLE
/HISTOGRAM NORMAL
/ORDER=ANALYSIS.
LOGISTIC REGRESSION VARIABLES HEALTH_DICHOTOMIZED
/METHOD=ENTER sex Zage Zyear Zwordsum Zrealinc Zpolviews Zattend
/CLASSPLOT
/PRINT=GOODFIT CORR ITER(1) CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
FILTER OFF.
USE ALL.
EXECUTE.
WEIGHT OFF.