Get (easily) introduced to SPSS

Oct 24, 2014

I do not want to make multiple posts on this topic, so this post will be updated little by little (whenever I learn something new). Here is a list of essential commands for creating variables and running basic (and more advanced) statistical analyses. I sometimes illustrate with examples using the General Social Survey data.

Set maximum cells

SET MXCELLS=9000.

It seems that SPSS does not display output larger than a certain number of cells (usually 1000). SET MXCELLS raises that maximum.

Generate variables

RECODE IQitem1 (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO IQi1.
RECODE old_variable (lowest thru highest=Copy) INTO new_variable.
RECODE gender ('m' = 1) ('f' = 2) INTO ngender.
COMPUTE IQscale=SUM(item1, item2, item3, item4, item5, item6, item7, item8, item9, item10).
COMPUTE var_mean=MEAN(var1, var2, var3, var4).
RECODE race (1=2) (2=1) (ELSE=SYSMIS) INTO BW.
VALUE LABELS BW 1 'blacks' 2 'whites'.
COMPUTE age2=age**2.
COMPUTE age3=age**3.
COMPUTE BWwordsum=BW*wordsum.
COMPUTE logincome = LN(income).
COMPUTE sqrtincome = SQRT(income).
EXECUTE.

This is an example of how variables are created in SPSS. SYSMIS sets the specified values to system-missing. We can also copy an existing variable into a new one (with a new name) by copying all of its values (lowest through highest). The EXECUTE command is special: it is sometimes needed, but not always. RECODE and COMPUTE are transformations, and they stay pending until something reads the data. EXECUTE forces that data pass, so without it the new variables are not filled in right away. But if a RECODE or COMPUTE command is followed by an analysis (tabulation, correlation, etc.), the variables are generated automatically when the analysis runs.
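For readers who think more easily in a general-purpose language, here is a rough Python sketch (not SPSS syntax) of what the first RECODE line and the SUM function do to the values. The names RECODE_MAP, recode_item and spss_sum are made up for this illustration, and None stands in for system-missing.

```python
# Illustration only: mimics RECODE IQitem1 (0=0) (1=1) (9=0) (ELSE=SYSMIS).
# None plays the role of SPSS system-missing (SYSMIS).
RECODE_MAP = {0: 0, 1: 1, 9: 0}


def recode_item(value):
    """Return the recoded value, or None for anything not listed (ELSE=SYSMIS)."""
    return RECODE_MAP.get(value)


def spss_sum(*values):
    """Sum the non-missing arguments, as SUM() does; None if all are missing."""
    valid = [v for v in values if v is not None]
    return sum(valid) if valid else None


raw_items = [1, 0, 9, 8, 1]          # 8 is not listed, so it becomes missing
recoded = [recode_item(v) for v in raw_items]
print(recoded)                        # [1, 0, 0, None, 1]
print(spss_sum(*recoded))             # 2
```

The point to notice is that a missing item does not make the whole sum missing, which is how I understand SUM() to behave when at least one argument is valid.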

Removing a bunch of variables at once

GET FILE='C:\your_directory_path\GSS7212_R1 (GSS 1972-2012).sav'.
MATCH FILES FILE=*
 /KEEP wtssall oversamp id year cohort educ degree realinc sei marital childs born sibs reg16 res16 family16 wordsum worda wordb wordc wordd worde wordf wordg wordh wordi wordj race hispanic ethnic sex age.
SAVE OUTFILE= 'wordsumDIF'.
EXECUTE.

When a data set is too large because it has too many cases and too many variables, the software works very slowly, so we may want to keep only a portion of the variables. The reduced .sav file can then be converted into .dta for Stata or .csv for R. This is a good way to work with R, which is very slow with big data.

DO IF and SELECT IF

DO IF NOT(MISSING(R0536401)) AND NOT(MISSING(R0536402)).
COMPUTE BIRTHDATE=R0536402+(R0536401/100*(100/12)).
COMPUTE AGEat1997=1997-BIRTHDATE.
END IF.
EXECUTE.
SELECT IF(NOT MISSING(race)) AND (NOT MISSING(age)) AND (NOT MISSING(sex)).
EXECUTE.

The DO IF condition must be followed by a command generating/modifying a variable, and finally by the END IF command. Cases that do not satisfy the condition are simply skipped, so they keep a missing value on the new variables. SELECT IF is based on the same logic. However, we must keep in mind that all cases/observations that do not satisfy the condition are removed from the data. After running SELECT IF, you probably don't want to save over the original data file when your analysis is done.
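The arithmetic inside the DO IF block is easier to check outside SPSS. Below is a small Python sketch of it, assuming (from the way the formula is written) that R0536401 holds the birth month and R0536402 the birth year. The function name age_at_1997 is mine, and None stands in for a missing value.

```python
def age_at_1997(birth_month, birth_year):
    """Mimic the DO IF block: skip the computation when either input is missing."""
    if birth_month is None or birth_year is None:
        return None                   # the case keeps a missing AGEat1997
    # month/100*(100/12) is simply month/12: the month as a fraction of a year
    birthdate = birth_year + (birth_month / 100 * (100 / 12))
    return 1997 - birthdate


print(round(age_at_1997(6, 1982), 2))  # 14.5
print(age_at_1997(None, 1982))         # None
```

So a person born in June 1982 gets a birth date of 1982.5 and an age of 14.5 in 1997.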

Subsetting data (filter)

USE ALL.
COMPUTE filter_$=(race=2 and sex=1).
VARIABLE LABELS filter_$ 'race=2 and sex=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
FILTER OFF.
USE ALL.
EXECUTE.

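This allows us to work with a subset of the data defined by one condition or by several conditions combined with AND. Any analysis placed between FILTER BY and FILTER OFF uses only the selected cases. Unlike SELECT IF, the other cases are not deleted. We turn the filter off when we no longer need it.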

Split file

SORT CASES BY race sex.
SPLIT FILE SEPARATE BY race sex.
SPLIT FILE OFF.

This command is useful. Instead of repeating the syntax for each (sub)group with a series of filters, SPLIT FILE repeats the analysis for each subgroup. The analysis commands go between SPLIT FILE and SPLIT FILE OFF. In the example above, we ask for the analysis to be performed by race and by sex. With 3 categories of race and 2 of sex, we get 6 analyses.
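To make the "one analysis per subgroup" idea concrete, here is a Python sketch of the same logic with made-up cases. It is only an analogy for what SPLIT FILE does, and the variable names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up cases: (race, sex, score)
cases = [
    (1, 1, 6), (1, 2, 7), (2, 1, 5), (2, 2, 6), (3, 1, 6), (3, 2, 5),
    (1, 1, 8), (1, 2, 5), (2, 1, 7), (2, 2, 4), (3, 1, 4), (3, 2, 7),
]

groups = defaultdict(list)
for race, sex, score in cases:
    groups[(race, sex)].append(score)   # one bucket per race-by-sex combination

# The same analysis (here, a mean) is repeated within every subgroup
group_means = {key: mean(scores) for key, scores in sorted(groups.items())}
for key, group_mean in group_means.items():
    print(key, group_mean)
```

With 3 race codes and 2 sex codes, the loop prints 6 lines, one per subgroup.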

Reshape (cases to vars)

SORT CASES BY family_id id.
CASESTOVARS
/ID = family_id
/GROUPBY = VARIABLE.

This command restructures the data from long to wide format, turning rows (cases) into columns (variables). Say we have 2 identifier columns, id and family_id. Using family_id as the ID multiplies every variable by the number of persons sharing the same family id. This way, it is possible to conduct a sibling analysis. Instead of having one variable of IQ, education, income, etc., there will be as many variables of IQ, education, income, as there are siblings (or members with the same family id).
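Here is a Python sketch of the long-to-wide restructuring, using three made-up persons in two families. The .1, .2 suffixes imitate the naming I get from CASESTOVARS by default, and None plays the role of a missing value for families with fewer members. Treat both as assumptions of the illustration.

```python
from collections import defaultdict

# Made-up long-format data: one row per person
long_rows = [
    {"family_id": 10, "id": 1, "IQ": 101, "educ": 12},
    {"family_id": 10, "id": 2, "IQ": 95, "educ": 14},
    {"family_id": 20, "id": 3, "IQ": 110, "educ": 16},
]

by_family = defaultdict(list)
for row in sorted(long_rows, key=lambda r: (r["family_id"], r["id"])):
    by_family[row["family_id"]].append(row)

max_members = max(len(members) for members in by_family.values())

wide_rows = []
for family_id, members in by_family.items():
    wide = {"family_id": family_id}
    # Columns are grouped by original variable: all IQ columns, then all educ columns
    for var in ("id", "IQ", "educ"):
        for index in range(1, max_members + 1):
            member = members[index - 1] if index <= len(members) else None
            wide[f"{var}.{index}"] = member[var] if member else None
    wide_rows.append(wide)

for row in wide_rows:
    print(row)
```

Each family now occupies a single row, and the second family has empty slots for its missing second member.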

Data imputation

*Analyze Patterns of Missing Values.
MULTIPLE IMPUTATION MOM_EDUC DAD_EDUC SQRT_P_INCOME ASVAB1999 RGRADE2010 GPA_OVERALL GS AR WK PC NO CS AI SI MK MC EI AO
/IMPUTE METHOD=NONE
/MISSINGSUMMARIES OVERALL VARIABLES (MAXVARS=25 MINPCTMISSING=0) PATTERNS.

Pattern analysis is useful to understand the pattern of missing data (e.g., arbitrary or monotonic). This should help us decide which imputation method to run. MINPCTMISSING is set at zero, because we want the summary for all variables (even ones with no missing data).

*Impute Missing Data Values.
DATASET DECLARE NLSY97_ASVAB_SES_GPA_W.
DATASET DECLARE nlsy97iteration_W.
MULTIPLE IMPUTATION MOM_EDUC DAD_EDUC SQRT_P_INCOME ASVAB1999 RGRADE2010 GPA_OVERALL GS AR WK PC NO CS AI SI MK MC EI AO
/IMPUTE METHOD=FCS MAXITER=100 NIMPUTATIONS=5 SCALEMODEL=PMM INTERACTIONS=NONE SINGULAR=1E-012
MAXPCTMISSING=NONE
/MISSINGSUMMARIES NONE
/IMPUTATIONSUMMARIES MODELS DESCRIPTIVES
/OUTFILE IMPUTATIONS=NLSY97_ASVAB_SES_GPA_W FCSITERATIONS=nlsy97iteration_W.

DATASET DECLARE names the data set in which the imputed data will be saved. The maximum number of iterations is set at 100. The IMPUTE METHOD is FCS (Fully Conditional Specification, an MCMC method), but we can replace FCS with MONOTONE if there is evidence of monotone missingness, or with AUTO if we want to let SPSS decide which method appears most appropriate. The SCALEMODEL is PMM (Predictive Mean Matching), but we can replace it with LINEAR if we want a linear regression model for scale variables. NIMPUTATIONS can be set at 10 if we want 10 imputed data sets.

Descriptive statistics

MEANS TABLES=IQtest education age BY race BY sex
/CELLS MEAN COUNT STDDEV.
DESCRIPTIVES VARIABLES=IQtest income education occupation
/SAVE
/STATISTICS=MEAN STDDEV MIN MAX KURTOSIS SKEWNESS.
FREQUENCIES VARIABLES=IQtest income education occupation
/FORMAT=NOTABLE
/STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT
/HISTOGRAM NORMAL
/ORDER=ANALYSIS.
CROSSTABS
/TABLES=degree BY race
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ CC CORR CMH(1)
/CELLS=COUNT EXPECTED 
/COUNT ROUND CELL.

The first command gives the mean, SD and N for each cell. We have two "BY", which means we are requesting the stats of IQ score (and education, age) by sex within each racial group. The second command gives the mean, SD, minimum and maximum values as well as the kurtosis and skewness for each variable. The /SAVE subcommand generates a z-score variable for each selected variable. The third command gives the histogram, the skewness and kurtosis values as well as their standard errors. /FORMAT=NOTABLE suppresses the frequency table. If we remove it, the frequency table is generated. The fourth command requests the cross-tabulation, and gives the sample size (N) for each cell. We can also request the expected counts as well as the Chi-Square, the contingency coefficient and the Cochran-Mantel-Haenszel stats.
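For the record, the z-scores saved by DESCRIPTIVES are just (value - mean) / SD. A minimal Python sketch with made-up income values, assuming the sample SD (n - 1 in the denominator), which is what the descriptive output reports:

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values with the sample mean and sample SD (n - 1 denominator)."""
    m = mean(values)
    sd = stdev(values)
    return [(v - m) / sd for v in values]


income = [20, 30, 40, 50, 60]        # made-up values
z = z_scores(income)
print([round(v, 3) for v in z])      # [-1.265, -0.632, 0.0, 0.632, 1.265]
```

The standardized variable has a mean of 0 and an SD of 1 by construction.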

Data distribution

EXAMINE VARIABLES=IQtest income
/PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
/COMPARE GROUPS
/STATISTICS DESCRIPTIVES EXTREME
/CINTERVAL 95
/MISSING PAIRWISE
/NOTOTAL.

The EXAMINE command (called Explore in the menus) gives an overall look at the distribution of the data. Box plot, Stem-and-Leaf plot, histogram, normal Q-Q plot and detrended normal Q-Q plot are displayed. Also displayed are the mean, CIs (95%) for the mean, the median, the variance, SD, min and max values, range, interquartile range as well as the kurtosis and skewness values, and the Kolmogorov-Smirnov test of normality (although I do not recommend it).

Scatterplot and other graphs

GRAPH
/SCATTERPLOT(BIVAR)=IQtest WITH probability_of_correct_answer_in_item1 BY race
/MISSING=LISTWISE.
GRAPH
/LINE(MULTIPLE)=PCT BY income BY race.
GRAPH
/BAR(GROUPED)=COUNT BY income BY race.
GRAPH
/HISTOGRAM(NORMAL)=income
/PANEL COLVAR=sex COLOP=CROSS ROWVAR=race ROWOP=CROSS.
EXAMINE VARIABLES=income BY race BY sex
/PLOT=BOXPLOT
/STATISTICS=NONE
/NOTOTAL.

The first GRAPH command gives the scatter plot of the probability of correct answer in a given IQ item (Y axis) for each level of IQ total score (X axis) by racial group. It is possible to display the fitted line and the confidence interval bands. We double-click on the graph in the output, and in the window that opens we click on "Elements" and "Fit Line at Total". A new window "Properties" will open, and under the Confidence Intervals option we select "Mean". Select "Apply" and close the window.

GRAPH LINE displays the number of cases (or the percentage, if PCT is used instead of COUNT) for the X-axis variable (income) for each category of the grouping variable. Using BAR instead of LINE displays a clustered bar chart (similar to a histogram) by each category of the grouping variable. GRAPH HISTOGRAM displays the histogram (as well as the normal curve) of income by gender within each racial group. The EXAMINE command displays the boxplot by gender within each racial group.

Correlation

CORRELATIONS
/VARIABLES=IQtest race income education health region
/PRINT=TWOTAIL NOSIG
/STATISTICS DESCRIPTIVES XPROD
/MISSING=PAIRWISE.
NONPAR CORR
/VARIABLES=IQtest race income education health region
/PRINT=BOTH TWOTAIL NOSIG
/MISSING=PAIRWISE.
PARTIAL CORR
/VARIABLES=IQtest cohort BY race sex education income
/SIGNIFICANCE=TWOTAIL
/STATISTICS=DESCRIPTIVES CORR
/MISSING=LISTWISE.

The command CORRELATIONS uses Pearson's correlation, while NONPAR CORR uses rank-order correlations. With PRINT=BOTH it reports Kendall's tau-b as well as Spearman's correlation. If we use LISTWISE in correlations, the analysis is done on cases with no missing values on any of the variables. DESCRIPTIVES gives the mean, SD and N for each variable. XPROD gives the sum of squares and cross-products as well as the covariance. In partial correlation, the variables after BY are those that are partialled out. CORR is important because it gives the zero-order correlations.

Cronbach's alpha

RECODE wordsum (0 thru 10=COPY) (ELSE=SYSMIS) INTO GSSwordsum.
RECODE worda (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_a.
RECODE wordb (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_b.
RECODE wordc (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_c.
RECODE wordd (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_d.
RECODE worde (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_e.
RECODE wordf (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_f.
RECODE wordg (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_g.
RECODE wordh (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_h.
RECODE wordi (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_i.
RECODE wordj (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_j.
SELECT if age<70.
EXECUTE.
SELECT IF(NOT MISSING(GSSwordsum)).
EXECUTE.
RELIABILITY
/VARIABLES=word_a word_b word_c word_d word_e word_f word_g word_h word_i word_j
/SCALE('ALL VARIABLES') ALL
/MODEL=ALPHA
/STATISTICS=DESCRIPTIVE SCALE HOTELLING CORR COV TUKEY
/SUMMARY=TOTAL MEANS VARIANCE COV CORR.

The above syntax gives the internal consistency reliability, using the unstandardized variables. Displayed are the inter-item correlation matrix and covariance matrix, the descriptive statistics of the test scale, the scale mean and scale variance if a given item is deleted, the Cronbach's alpha if a given item is deleted, and also the corrected item-total correlation and squared multiple correlation. Also requested are the ANOVA with Tukey's test for nonadditivity and Hotelling's T-Squared test.
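To see where the alpha value comes from, here is a Python sketch of the usual textbook formula for the unstandardized alpha, k/(k-1) * (1 - sum of item variances / variance of the total score). The responses are made up and the function name cronbach_alpha is mine. This is not a description of how SPSS computes it internally.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of scores per item, same respondents in the same order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # total score per respondent
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Made-up 0/1 responses: 3 items, 5 respondents
items = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
]
alpha = cronbach_alpha(items)
print(round(alpha, 3))               # 0.794
```

The more the items covary, the larger the variance of the total score relative to the sum of item variances, and the closer alpha gets to 1.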

Analysis of variance

ONEWAY IQtest BY race
/POLYNOMIAL=2
/STATISTICS DESCRIPTIVES EFFECTS HOMOGENEITY
/PLOT MEANS
/MISSING ANALYSIS
/POSTHOC=TUKEY BTUKEY DUNCAN BONFERRONI ALPHA(0.05).

The one-way ANOVA uses the harmonic mean of the group sizes when Ns are unequal across groups. The plot of mean scores in the (list of) dependent variable(s) is displayed. We can request Post-Hoc multiple comparisons, e.g., Tukey, Tukey's-b, Duncan, or Bonferroni. We can also request between-group comparison for linear and quadratic terms with POLYNOMIAL=2 and cubic terms with POLYNOMIAL=3.

UNIANOVA IQtest BY race parental_income WITH sex
/METHOD=SSTYPE(3)
/INTERCEPT=INCLUDE
/PLOT=PROFILE(parental_income*race)
/EMMEANS=TABLES(race) WITH(sex=MEAN)
/EMMEANS=TABLES(parental_income) WITH(sex=MEAN)
/EMMEANS=TABLES(race*parental_income) WITH(sex=MEAN)
/PRINT=LOF OPOWER ETASQ HOMOGENEITY DESCRIPTIVE
/PLOT=SPREADLEVEL RESIDUALS
/CRITERIA=ALPHA(.05)
/DESIGN=sex race parental_income race*parental_income.

ANCOVA is an ANOVA in which we adjust for covariate(s). The variable(s) after WITH is(are) the covariate(s). The variables after BY are the fixed factors. Here, we allow the interaction between the fixed factors. The profile plot displays the mean score by parental income for each race. A matrix of plots between the observed dependent variable, predicted dependent variable and standardized residuals is also provided. We also have a plot of variance in the dep var against mean in the dep var, and a plot of standard deviation in the dep var against mean in the dep var. We have also requested Levene's test of equality of error variance (although I won't recommend it), the tests of between-subjects effects and a test for lack of fit.

Regression

SELECT is used as a filter for US born = 1. REGWGT uses sampling weight but in a way that does not inflate sample size or distort standard errors and p-values.

REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N 
/SELECT=USBORN EQ 1
/MISSING LISTWISE
/REGWGT=weight
/STATISTICS COEFF OUTS CI(95) BCOV R ANOVA COLLIN TOL CHANGE ZPP
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT IQtest
/METHOD=ENTER age race cohort2 TO cohort6
/METHOD=ENTER racecohort2 TO racecohort6
/PARTIALPLOT ALL
/SCATTERPLOT=(*ZRESID, *ZPRED)
/RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID)
/SAVE PRED ZPRED RESID ZRESID.

This is a dummy variable (multiple) regression. One dummy needs to be dropped in order to avoid perfect multicollinearity. In SPSS, it is possible to specify a hierarchical model, and this is why there are several lines for METHOD. In the first model, we have age, race and the cohort dummies, and in the second model, we add the interactions between race and the cohort dummies. We can request descriptive statistics such as the mean, SD, correlations with significance levels and sample size. We can also request additional statistics such as confidence intervals, R² and F changes, the covariance matrix, collinearity diagnostics (with VIF and tolerance values), the part and partial correlations along with the zero-order correlations for the independent variables, and a summary of the predicted values and residuals. The Durbin-Watson statistic is also available, but it requires adding DURBIN to the /RESIDUALS subcommand. We can also request a scatterplot of predicted values (X axis) against residuals (Y axis), a partial regression plot for each independent variable against the dependent variable, and a histogram and P-P plot of the standardized residuals. We can finally save the predicted values (unstandardized and standardized) and the residuals (unstandardized and standardized).

LOGISTIC REGRESSION VARIABLES IQitem1
/METHOD=ENTER IQscale
/METHOD=ENTER race
/METHOD=ENTER raceIQscale
/SAVE=PRED
/CLASSPLOT
/PRINT=GOODFIT CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

Above is an example of logistic regression, applied to DIF analysis: the item is the dependent variable, with IQ total score in model 1, IQ total score + race in model 2, and IQ total score + race + race*IQ interaction in model 3. We also request the Hosmer & Lemeshow goodness of fit (although I don't trust it) as well as the classification plot and the CI for exp(B), and we save the predicted probabilities. If the predicted probability variable is plotted against IQ total score by race, we get the Item Characteristic Curve (ICC) of the racial groups.
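The saved predicted probabilities are nothing more than the logistic function applied to the model 3 equation. Here is a Python sketch with hypothetical coefficients (they come from no real data set) and race coded 0/1 for simplicity:

```python
import math


def predicted_probability(iq, race, b0, b_iq, b_race, b_inter):
    """Logistic model 3: intercept + IQ + race + race*IQ interaction."""
    logit = b0 + b_iq * iq + b_race * race + b_inter * race * iq
    return 1 / (1 + math.exp(-logit))


# Hypothetical coefficients, chosen only to draw two curves
coefs = dict(b0=-3.0, b_iq=0.6, b_race=-0.5, b_inter=0.0)

for iq in range(0, 11, 2):
    p0 = predicted_probability(iq, 0, **coefs)
    p1 = predicted_probability(iq, 1, **coefs)
    print(iq, round(p0, 3), round(p1, 3))
```

With a nonzero race coefficient and a zero interaction, the two curves are shifted but keep the same slope on the logit scale, which is the usual picture of uniform DIF. A nonzero interaction would make the gap vary with the total score (non-uniform DIF).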

* 2-Stage Least Squares.
TSET MXNEWVAR=2.
2SLS IQtest WITH number_of_siblings income
/INSTRUMENTS income region size_of_town
/CONSTANT
/SAVE PRED RESID
/PRINT COV.

The two-stage least squares (2SLS) is also called instrumental variable (IV) regression. The purpose is to correct biases due to measurement errors, simultaneous causality and/or omitted variables. The variables used as instruments should be correlated with the targeted independent variable but not with the error term (residuals). One variable (here, income) can serve both as independent and instrumental variable, in which case SPSS treats it as exogenous. A predictor that is not listed among the instruments (here, number_of_siblings) is treated as endogenous, i.e., it is suspected to be correlated with omitted variables contained in the error term. We try to find some good IVs for it (i.e., having a good correlation with it), here region and size of town. SPSS requires at least as many instruments as there are independent variables. Here, we have 2 independent vars (number of siblings and income) and 3 instrumental vars (income, region and size of town). We can request covariances and the residuals and predicted values.
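A toy example helps to see why the instrument removes the bias. The Python sketch below uses the simple IV estimator cov(z, y) / cov(z, x), which coincides with 2SLS when there is one endogenous predictor and one instrument. The data are constructed by hand so that the instrument z is uncorrelated with the error u, and all names are hypothetical.

```python
def covariance(a, b):
    """Population covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


# Made-up data: z is the instrument, u the unobserved error
z = [1, 2, 3, 4, 1, 2, 3, 4]
u = [1, 1, 1, 1, -1, -1, -1, -1]               # uncorrelated with z by construction
x = [zi + ui for zi, ui in zip(z, u)]           # x is contaminated by u
y = [2 * xi + 3 * ui for xi, ui in zip(x, u)]   # true slope of x is 2

ols_slope = covariance(x, y) / covariance(x, x)
iv_slope = covariance(z, y) / covariance(z, x)
print(round(ols_slope, 3), round(iv_slope, 3))  # 3.333 2.0
```

The ordinary regression slope is inflated because x and the error move together, while the IV slope recovers the value of 2 used to build the data.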

Factor analysis

FACTOR
/VARIABLES var1 TO var10
/MISSING LISTWISE
/ANALYSIS var1 TO var10
/PRINT UNIVARIATE INITIAL CORRELATION SIG DET KMO EXTRACTION
/PLOT EIGEN
/CRITERIA MINEIGEN(1) ITERATE(25)
/EXTRACTION PAF
/ROTATION NOROTATE
/SAVE REG(ALL)
/METHOD=CORRELATION.

The extraction PAF specifies principal axis factoring as the factor analytic method. Replacing PAF by PC or ML specifies a principal component or maximum likelihood extraction. We can request the scree plot of eigenvalues, the univariate stats, and the KMO measure with Bartlett's test. MINEIGEN(1) specifies that the minimum eigenvalue is 1 for any factor extracted. If we type FACTORS(3) instead, we are asking that 3 factors be extracted. There could be as many factors as there are variables, but no more. Sometimes the number of iterations (25 by default) is not sufficient to reach convergence, and in this case we can set the number at 100, for example. We can request factor scores using the method "regression", "Bartlett" or "Anderson-Rubin". If we do not rotate the solution (NOROTATE), the first factor or component is the so-called general factor. If we prefer a rotated solution, we can replace NOROTATE with VARIMAX, QUARTIMAX, EQUAMAX, OBLIMIN or PROMAX(4). Note that in SPSS the kappa of 4 in PROMAX(4) is the default (and recommended) value. We have specified METHOD=CORRELATION for the method of analysis, but we could have specified COVARIANCE instead. The above syntax is to be used with raw data. If we have only a correlation matrix as input data, the following syntax must be run.

MATRIX DATA VARIABLES=GS AR WK PC NO CS AI SI MK MC EI AO
/contents=corr
/N=257.
BEGIN DATA.
1
.397 1
.587 .455 1
.452 .521 .509 1
.252 .513 .276 .453 1
.22 .372 .166 .374 .409 1
.274 .197 .281 .191 .087 .021 1
.25 .13 .234 .108 -.02 -.035 .255 1
.404 .563 .36 .488 .532 .349 .168 .157 1
.383 .413 .279 .381 .32 .131 .204 .287 .426 1
.472 .429 .47 .402 .286 .198 .318 .364 .35 .392 1
.262 .327 .244 .424 .288 .287 .089 .099 .384 .345 .246 1
END DATA.
EXECUTE.
FACTOR MATRIX=IN(COR=*)
/MISSING LISTWISE
/PRINT UNIVARIATE INITIAL CORRELATION SIG DET KMO EXTRACTION
/PLOT EIGEN
/CRITERIA FACTORS(3) ITERATE(100)
/EXTRACTION PC
/ROTATION NOROTATE
/METHOD=CORRELATION.

Meng Hu on HBD and Austrian Economics
