I do not want to make multiple posts on this topic, so this post will be updated little by little as I learn something new. Here is a list of essential commands for creating variables and running basic (and advanced) statistical analyses.
Updates and help
help whatsnew
help xtmixed
This is a special and useful feature of Stata. Typing the first command lists all recent changes, corrections, and additions to Stata. The help command opens a new window with syntax help for the command we specify (e.g., regress, factor, sem).
Generate variables
Here is a list of basic commands. Open the Data Editor (Browse) window to see your data.
keep if age<70
* keep if age<70 & !(ethnic==17) & !(ethnic==22) & !(ethnic==38)
* keep if age<70 & !missing(wordsum)
gen age2s = age^2
gen age3s = age^3
gen sweight = wtssall*oversamp
gen racism = wkracism-1
gen gender = sex-1
gen bw = 0 if race==2
replace bw = 1 if race==1
replace bw = . if missing(race)
gen blackwhite2000after=1 if year>=2000 & race==1 & hispanic==1
replace blackwhite2000after=0 if year>=2000 & race==2 & hispanic==1
gen blackwhite2000before=1 if year<2000 & race==1
replace blackwhite2000before=0 if year<2000 & race==2
gen bw1 = max(blackwhite2000after,blackwhite2000before)
gen bw1age = bw1*age
gen bw1age2 = bw1*age2s
gen bw1age3 = bw1*age3s
gen inc = realinc
replace inc = . if inc==0
gen logincome = ln(inc)
gen logincomec = logincome-10.13
replace educ = . if educ>20
replace degree = . if degree>4
replace sibs = . if sibs==-1
replace sibs = . if sibs>37
replace res16 = . if res16==0
replace res16 = . if res16>=8
gen residence = res16-3
replace family16 = . if family16==-1
replace family16 = . if family16==9
recode family16 (1=1) (0=0) (2/8=0), generate(family)
recode reg16 (0=0) (1/4=1) (5/7=0) (8/9=1), generate(reg)
replace wordsum = . if wordsum<0
replace wordsum = . if wordsum>10
recode age (18/26=1) (27/35=2) (36/44=3) (45/55=4) (56/69=5), generate(agegroup)
tabulate agegroup, gen(agedummy)
recode age (18/23=1) (24/28=2) (29/33=3) (34/38=4) (39/44=5) (45/50=6) (51/56=7) (57/62=8) (63/69=9), generate(age9)
tabulate age9, gen(aged)
replace cohort = . if cohort==0
replace cohort = . if cohort==9999
recode cohort (1905/1928=0) (1929/1943=1) (1944/1953=2) (1954/1962=3) (1963/1973=4) (1974/1994=5), generate(cohort6)
replace cohort6 = . if cohort6>5
tabulate cohort6, gen(cohortdummy)
recode cohort (1905/1915=0) (1916/1920=1) (1921/1925=2) (1926/1929=3) (1930/1933=4) (1934/1937=5) (1938/1941=6) (1942/1944=7) (1945/1947=8) (1948/1950=9) (1951/1953=10) (1954/1956=11) (1957/1959=12) (1960/1962=13) (1963/1965=14) (1966/1968=15) (1969/1972=16) (1973/1976=17) (1977/1980=18) (1981/1985=19) (1986/1994=20), generate(cohort21)
replace cohort21 = . if cohort21>20
tabulate cohort21 bw1 if !missing(wordsum)
gen bw1C1 = bw1*cohortdummy1
gen bw1C2 = bw1*cohortdummy2
gen bw1C3 = bw1*cohortdummy3
gen bw1C4 = bw1*cohortdummy4
gen bw1C5 = bw1*cohortdummy5
gen bw1C6 = bw1*cohortdummy6
gen bw1agedummy1 = bw1*agedummy1
gen bw1agedummy2 = bw1*agedummy2
gen bw1agedummy3 = bw1*agedummy3
gen bw1agedummy4 = bw1*agedummy4
gen bw1agedummy5 = bw1*agedummy5
gen bw1agegroup = bw1*agegroup
gen bw1aged1 = bw1*aged1
gen bw1aged2 = bw1*aged2
gen bw1aged3 = bw1*aged3
gen bw1aged4 = bw1*aged4
gen bw1aged5 = bw1*aged5
gen bw1aged6 = bw1*aged6
gen bw1aged7 = bw1*aged7
gen bw1aged8 = bw1*aged8
gen bw1aged9 = bw1*aged9
gen mycohort = year-age
pwcorr wordsum mycohort cohort year age [aweight = sweight], obs sig
egen zeduc = std(educ)
The commands gen and generate are equivalent. They let us recode variables, create interaction terms or squared/cubed terms, and set values to missing (".") under specified conditions; !missing() keeps non-missing cases, and !(var==value) excludes a given value. The keep command drops all observations that fail the condition, here everyone aged 70 or older. The egen std() function transforms a variable into a z-score (i.e., standardizes it). The recode command with generate() is an easy way to create categorical variables, whereas tabulate with gen() is the straightforward way to generate dummy variables. We can also use recode to reverse-code a 0;1 dummy (female) into 1;0 (male).
The signs = and == are different: = assigns a value when generating or replacing a variable, while == tests for equality in a condition. Multiple conditions are combined with & ("and"); for "or" we use |.
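A minimal sketch of these points, reusing the age and gender variables created above (the young and male variables are hypothetical):

```stata
* "=" assigns a value; "==" tests equality inside an if condition
gen young = 0
replace young = 1 if age<30
replace young = . if missing(age)
* reverse-code the 0/1 dummy gender (female) into 1/0 (male)
recode gender (0=1) (1=0), generate(male)
```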
keep worda wordb wordc wordd worde wordf wordg wordh wordi wordj race
save wordsumDIF2021
The keep command preserves only the listed variables and deletes all others (drop does the opposite), while save writes this new dataset to disk.
Descriptive statistics
Now, here's a list of useful descriptive statistics for variables wordsum and education.
tabulate race sex if !missing(age) & !missing(wordsum) & !missing(educ)
summarize wordsum
by race sex, sort : summarize wordsum educ [aweight = sweight] if age>18 & age<70
svyset id [pweight=sweight]
svy: mean wordsum educ if sex==1, over(race)
estat sd
mean wordsum educ if sex==1, over(race)
describe sex
According to the Stata FAQs, aweight (analytic weight) is used for descriptive statistics, while pweight (sampling weight) is used for inferential statistics and should therefore be applied in all multi-variable analyses. svyset specifies the survey design characteristics. svy can be used for regression and allows variance estimators with vce(cluster clustvar), which assumes observations are independent across groups (clusters) but not necessarily within them; clustvar specifies the group each observation belongs to in the svyset.
tabulate gives the sample size by category of race and sex when there are no missing values of IQ, education, or age. summarize displays means, standard deviations, and (optionally weighted) sample sizes within a subset of the data (a given age range) for each combination of race and sex. mean with over() does the same thing. describe displays the variable's storage type (string, numeric, etc.).
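As a sketch of vce(cluster clustvar), assuming the data contain a (hypothetical) primary sampling unit variable psu:

```stata
* robust SEs allowing correlation within sampling units (psu is hypothetical)
svyset psu [pweight=sweight]
svy: regress wordsum educ age
* the same idea outside the svy framework:
regress wordsum educ age [pweight=sweight], vce(cluster psu)
```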
Scatterplot
scatter wordsum educ
twoway (scatter wordsum educ) if born==1 & age<70, ytitle(IQ score) xtitle(educ level) by(sex, title(scatter of IQ-educ relationship) subtitle(by gender group))
graph twoway (scatter wordsum educ if bw1==0, msymbol(Oh)) (scatter wordsum educ if bw1==1, msymbol(O)), legend(label(1 black) label(2 white))
twoway (scatter wordsum educ, msymbol(O) jitter(0)) (lfit wordsum educ if bw1==0) (lfit wordsum educ if bw1==1), title("scatter plot of IQ against education by race") legend(order(2 "black" 3 "white"))
graph matrix wordsum educ inc age if age <70 [pweight = sweight], half maxis(ylabel(none) xlabel(none)) by(bw1)
The first line is the easiest way to make a scatterplot. The second generates separate scatterplots for males and females in a single figure, with X and Y axis titles. The third is a scatterplot with distinct data points by racial group; msymbol sets the marker symbol (here hollow versus solid circles). The fourth adds fitted lines to the scatterplot with lfit. The fifth generates a scatterplot matrix; the half option requests only the lower triangle of the matrix.
Data distribution
Next, we look at different ways to assess whether the variables are normally distributed: stem-and-leaf displays, P-P and Q-Q plots, box plots, histograms, skewness and kurtosis tests, and the Shapiro-Wilk test by subgroup.
stem educ if race==1
stem educ if race==2
symplot wordsum if bw1==0
symplot wordsum if bw1==1
quantile wordsum, rlopts(clpattern(dash))
qnorm wordsum, grid ylabel(, angle(horizontal) axis(1)) ylabel(, angle(horizontal) axis(2))
graph box educ wordsum, over(sex) over(race)
graph box educ wordsum [pweight = sweight] if age<=60 & born==1, over(sex) over(race) ytitle("level of IQ or education") title("Education and IQ by race and by sex") subtitle("(500 participants)" " ") note("Source: your data")
histogram wordsum, kdensity normal
kdensity wordsum, normal
twoway histogram wordsum, discrete by(race)
twoway histogram inc, freq by(race)
sktest inc
by race, sort : swilk wordsum inc
Cronbach's alpha
alpha worda-wordj
alpha worda-wordj, casewise generate(new_var) item std
ssc install kr20
kr20 worda wordb wordc wordd worde wordf wordg wordh wordi wordj if !missing(wordsum)
replace worda = . if worda<0
replace worda = 0 if worda==9
replace wordb = . if wordb<0
replace wordb = 0 if wordb==9
replace wordc = . if wordc<0
replace wordc = 0 if wordc==9
replace wordd = . if wordd<0
replace wordd = 0 if wordd==9
replace worde = . if worde<0
replace worde = 0 if worde==9
replace wordf = . if wordf<0
replace wordf = 0 if wordf==9
replace wordg = . if wordg<0
replace wordg = 0 if wordg==9
replace wordh = . if wordh<0
replace wordh = 0 if wordh==9
replace wordi = . if wordi<0
replace wordi = 0 if wordi==9
replace wordj = . if wordj<0
replace wordj = 0 if wordj==9
. alpha worda wordb wordc wordd worde wordf wordg wordh wordi wordj if !missing(wordsum), item
Test scale = mean(unstandardized items)
average
item-test item-rest interitem
Item | Obs Sign correlation correlation covariance alpha
-------------+-----------------------------------------------------------------
worda | 23817 + 0.4347 0.2727 .0344404 0.7072
wordb | 23817 + 0.4814 0.3710 .034422 0.6951
wordc | 23817 + 0.5228 0.3587 .0321716 0.6940
wordd | 23817 + 0.4543 0.3496 .0351039 0.6985
worde | 23817 + 0.6071 0.4495 .029806 0.6776
wordf | 23817 + 0.6025 0.4543 .0302186 0.6773
wordg | 23817 + 0.5509 0.3682 .0310597 0.6935
wordh | 23817 + 0.5788 0.4086 .0304111 0.6854
wordi | 23817 + 0.4806 0.3059 .0331632 0.7032
wordj | 23817 + 0.5835 0.4263 .0305587 0.6821
-------------+-----------------------------------------------------------------
Test scale | .0321355 0.7137
-------------------------------------------------------------------------------
. kr20 worda wordb wordc wordd worde wordf wordg wordh wordi wordj if !missing(wordsum)
Kuder-Richardson coefficient of reliability (KR-20)
Number of items in the scale = 10
Number of complete observations = 23817
Item Item Item-rest
Item | Obs difficulty variance correlation
---------+------------------------------------------
worda | 23817 0.8240 0.1450 0.2727
wordb | 23817 0.9153 0.0776 0.3710
wordc | 23817 0.2199 0.1716 0.3587
wordd | 23817 0.9280 0.0668 0.3496
worde | 23817 0.7380 0.1934 0.4495
wordf | 23817 0.7792 0.1721 0.4543
wordg | 23817 0.3230 0.2187 0.3682
wordh | 23817 0.2895 0.2057 0.4086
wordi | 23817 0.7685 0.1779 0.3059
wordj | 23817 0.2384 0.1815 0.4262
---------+------------------------------------------
Test | 0.6024 0.3765
KR20 coefficient is 0.7138
Here is an analysis of internal-consistency reliability. Cronbach's alpha can be written as a function of the number of test items and the average inter-item correlation, and we can use two variables or more. The casewise option deletes cases with missing values. The item option shows the effect of removing each variable from the scale: the alpha reported for each variable is the reliability of the test scale when that variable is removed, and the last column of the Test scale row gives the alpha for the full scale. The std option standardizes the items to mean 0 and variance 1 (we then get the average inter-item correlation instead of the average inter-item covariance). The generate() option saves the resulting scale as a new variable. However, for dichotomous variables such as IQ test items, the Kuder-Richardson reliability (KR-20) is preferable; it must first be installed in Stata (ssc install kr20). Above, we see that the internal reliability of the Wordsum vocabulary test is 0.71.
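For reference, the standardized alpha and the KR-20 coefficient reported above can be written as:

```latex
% standardized Cronbach's alpha, with k items and average inter-item correlation \bar{r}
\alpha_{\text{std}} = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}

% KR-20, with p_i the proportion passing item i ("item difficulty" above),
% q_i = 1 - p_i, and \sigma_X^2 the variance of the total score
\rho_{\text{KR-20}} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i\,q_i}{\sigma_X^2}\right)
```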
Correlation
A piece of advice: never use correlate; use pwcorr instead. pwcorr applies pairwise deletion while correlate applies listwise deletion, so correlate drops every observation with a missing value on any listed variable, which is wasteful whenever more than two variables are involved.
correlate age race gender wordsum inc educ
by race, sort : correlate age race gender wordsum inc educ [aweight = sweight] if age>=18, means covariance
by sex, sort : pwcorr wordsum inc educ age [aweight = sweight] if age<70, obs sig
by sex, sort : spearman wordsum inc educ age if age<70, stats(rho obs p) star(0.05) bonferroni pw
by sex, sort : ktau wordsum inc educ, stats(taua taub p) pw
by sex, sort : pcorr wordsum degree inc age race [aweight = sweight] if age<70
This is how to generate a correlation matrix. We can apply weights with aweight; in some analyses aweight is not available, in others pweight is not. pweight is the true sampling weight (observations weighted by their different probabilities of being in the sample). aweight and pweight yield the same point estimates but different variance (standard error) estimates. The calculation of aweight is given here. To display the mean, standard deviation, and min/max values, we add means; for covariances, we add covariance. We can restrict the correlation to a subset of the data, e.g., people aged 18+. The prefix "by categorical_variable, sort :" repeats the analysis for each category of the variable after by. pwcorr is the pairwise Pearson correlation; obs and sig add the sample sizes and p-values. spearman (Spearman's rank correlation) does not allow sampling weights; star() marks correlations significant at the specified level, pw requests pairwise analysis, and bonferroni applies the Bonferroni correction to the p-values. ktau is Kendall's rank correlation. pcorr is the partial correlation of the first variable (wordsum) with each of the others in turn: with degree, then income, then age, then race, each time holding all remaining variables constant.
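For two variables x and y with a single control z, the partial correlation computed by pcorr reduces to:

```latex
r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{(1 - r_{xz}^2)\,(1 - r_{yz}^2)}}
```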
Analysis of variance
by race, sort : anova wordsum cohort sex cohort#sex [aweight = sweight] if age<70
Multiple regression
Now, I show how to run a regression. Below is an OLS regression with cohort dummies and their interactions with race, which lets us track the trend in the black-white IQ gap across cohort categories. One dummy variable must be dropped to avoid perfect multicollinearity; it serves as the reference category from which the other dummies are expressed, and the choice of reference affects the intercept but not the coefficients of the independent variables. The first variable listed is always the dependent variable. Unless we specify beta, the program will not report standardized coefficients. As in all other analyses, we can use a subset of the data and apply sampling weights. We can also choose the type of standard errors (robust, bootstrap, jackknife) with vce().
regress wordsum bw1 gender age educ cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bw1C2 bw1C3 bw1C4 bw1C5 bw1C6 if age>=18 & age<=50 [pweight = sweight], vce(robust) beta
predict predicted_values if e(sample), xb
predict residualized_values if e(sample), res
predict standardized_residuals if e(sample), rstandard
graph twoway (lfitci residualized_values predicted_values) (scatter residualized_values predicted_values)
pnorm residualized_values
qnorm residualized_values
histogram residualized_values, kdensity normal
kdensity residualized_values, normal
The "regress" command must be immediately followed by the "predict" command if we want to examine the normality of residuals (an assumption researchers often, and wrongly, neglect). In most software packages, predicted values and/or residuals are computed even for cases not used in the regression model, so the e(sample) condition should be specified. xb generates the predicted (fitted) values of the dependent variable, whereas res (or resid) generates the unstandardized residuals (Y minus Yhat). We use rstandard to request standardized residuals and rstudent to request studentized (jackknifed) residuals. Plotting the residuals against the fitted values allows us to inspect the residuals' behavior (lfitci can be specified if we are interested in confidence intervals).
It is not necessary to specify vce(robust) when pweight is already applied: Stata automatically estimates robust SEs when pweight is specified in a regression, because robust SEs are, according to the Stata manual, the correct specification. This also holds for other types of regression (logit, tobit, xtmixed, ...) that use pweights. The few estimation commands that lack a vce(robust) option do not allow pweights.
nestreg: reg zeduc (bw1) (wordsum logincomec) [pweight = sweight], vce(robust) beta
Hierarchical (or nested) regression is done with nestreg: here wordsum and log income are added together in a second block after bw1.
test x5=x6=x7=x8=x9=0
After a regression, it is possible to perform an F-test (the test statistic for the F-change) to compare nested models. Say model 1 has four independent variables x1-x4 and model 2 adds x5-x9; ideally, both models should have identical sample sizes. To decide whether these variables should be added to the initial model, we test the null hypothesis that the regression coefficients of x5 through x9 are all zero. A significant p-value indicates they should be added, but since p-values depend on sample size this is not very informative by itself; it is better to look at Cohen's f² effect size.
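With R²₁ and R²₂ the R-squared of the restricted and full models, and k₁ and k₂ their numbers of predictors, the two quantities just mentioned are:

```latex
% F-change test for adding (k_2 - k_1) predictors, with n observations
F = \frac{(R_2^2 - R_1^2)/(k_2 - k_1)}{(1 - R_2^2)/(n - k_2 - 1)}

% Cohen's f^2 effect size for the added block
f^2 = \frac{R_2^2 - R_1^2}{1 - R_2^2}
```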
regress read write math science
estimates store m2
generate samplem2 = e(sample)
regress read write if samplem2==1
estimates store m1
lrtest m1 m2
We can perform a likelihood-ratio test between two nested models, as shown above. The LRT requires that both models be fit on exactly the same cases. The e(sample) function, which can be used each time we run a new model, serves this purpose: model m1 is then fit only on the cases used in m2. However, if we used vce(robust) for robust standard errors, we will get the message "LR test likely invalid for models with robust vce".
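The statistic computed by lrtest is:

```latex
% log likelihoods of the restricted (m1) and full (m2) models;
% d is the difference in the number of estimated parameters
\mathrm{LR} = -2\,\bigl(\ln L_{m1} - \ln L_{m2}\bigr) \;\sim\; \chi^2_{d}
```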
Errors-in-variables (EIV) regression
eivreg wordsum educ logincomec [aweight = sweight] if race==1, reliab(educ .95 logincomec .80)
Here, we have an errors-in-variables regression. The reliab() option (or its equivalent abbreviation r()) specifies the assumed reliability (we can write 0.90 or .90) of each independent variable. Standardized coefficients are not available.
Tobit/Truncated regressions
. tobit wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bw1C2 bw1C3 bw1C4 bw1C5 bw1C6 agedummy1 agedummy2 agedummy4 agedummy5 sex [pweight = sweight], ll ul
Tobit regression Number of obs = 22156
F( 16, 22140) = 103.09
Prob > F = 0.0000
Log pseudolikelihood = -47841.919 Pseudo R2 = 0.0175
------------------------------------------------------------------------------
| Robust
wordsum | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bw1 | 2.022845 .155358 13.02 0.000 1.718332 2.327358
cohortdummy2 | .3783523 .1814003 2.09 0.037 .0227949 .7339097
cohortdummy3 | 1.033257 .1718304 6.01 0.000 .6964568 1.370056
cohortdummy4 | .7742905 .1708343 4.53 0.000 .4394431 1.109138
cohortdummy5 | 1.0802 .1747871 6.18 0.000 .7376051 1.422795
cohortdummy6 | 1.308605 .1937606 6.75 0.000 .9288205 1.68839
bw1C2 | -.2276471 .1918261 -1.19 0.235 -.6036399 .1483457
bw1C3 | -.5661765 .1790605 -3.16 0.002 -.9171478 -.2152051
bw1C4 | -.6056165 .1763168 -3.43 0.001 -.95121 -.260023
bw1C5 | -1.008771 .1797605 -5.61 0.000 -1.361115 -.6564281
bw1C6 | -.9942551 .1999675 -4.97 0.000 -1.386206 -.6023045
agedummy1 | -.7602832 .0534722 -14.22 0.000 -.8650924 -.6554739
agedummy2 | -.247756 .0475324 -5.21 0.000 -.3409229 -.154589
agedummy4 | .0323851 .0517914 0.63 0.532 -.0691299 .1339
agedummy5 | .0456177 .0623807 0.73 0.465 -.0766528 .1678883
sex | .1725331 .0317426 5.44 0.000 .1103153 .2347509
_cons | 4.008042 .1630947 24.57 0.000 3.688365 4.327719
-------------+----------------------------------------------------------------
/sigma | 2.107777 .0135762 2.081167 2.134388
------------------------------------------------------------------------------
Obs. summary: 140 left-censored observations at wordsum<=0
20698 uncensored observations
1318 right-censored observations at wordsum>=10
Tobit regression is displayed above. ll and ul stand for the lower and upper censoring limits. If the IQ test has a minimum of 0 and a maximum of 10 correct answers, we can set ll(0) ul(10). Although ll ul alone does the job automatically, Cameron & Trivedi (2009, p. 523) recommend always specifying these values explicitly. There is no reason to do it by hand unless we want to censor at a value inside the observed range. Standardized betas are not available because tobit estimates the expected value of the latent (i.e., unobserved) variable Y; in other words, the SD of the latent Y is not given.
truncreg wordsum bw1 gender age educ cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bw1C2 bw1C3 bw1C4 bw1C5 bw1C6 if age>=18 & age<=50 [pweight = sweight], ll(80)
Above is a truncated regression for values truncated from below at 80. If the survey sampled only persons having at least 80 IQ points, the data are truncated from below and we use ll(80). Similarly, if only persons with a maximum of 120 points were selected, the data are truncated from above at 120 and we set ul(120). For the same reason as tobit, truncreg does not provide standardized betas.
Logistic regression
logit racism bw1 wordsum bw1wordsum if age>=25 & age<60 [pweight = sweight], or
predict logit if e(sample), xb
graph twoway (scatter logit wordsum if bw1==0, msymbol(Oh)) (scatter logit wordsum if bw1==1, msymbol(O)), legend(label(1 black) label(2 white))
Above is a logistic regression. If we do not request odds ratios (or), only the b coefficients are displayed. Unlike other statistical packages such as SPSS, the logit command in Stata only recognizes a dependent variable coded 0 and 1. If it is coded otherwise (e.g., 1 and 2), we can get the message "outcome does not vary; remember: 0 = negative outcome, all other nonmissing values = positive outcome". The predict, xb command allows a graphical representation of the model's predicted values.
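The model behind logit, and the link between the b coefficients and the odds ratios, can be written as:

```latex
% logistic model for p = P(Y = 1 \mid X)
\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k

% a one-unit increase in x_j multiplies the odds by
\mathrm{OR}_j = e^{\beta_j}
```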
Multilevel regression (or linear mixed effect model)
xtmixed wordsum bw1 aged2-aged9 bw1aged2-bw1aged9 [pweight=sweight] || cohort21: bw1, mle cov(un) var pweight(1) pwscale(size)
predict xtfitted if e(sample), fitted
predict xtfixed if e(sample), xb
predict xtres if e(sample), residuals
predict xtu* if e(sample), reffects
estimates store lme
graph twoway (lfitci xtres xtfitted) (scatter xtres xtfitted)
qnorm xtres
swilk xtres
histogram xtres, kdensity normal
kdensity xtres, normal
graph twoway (lfitci xtres xtu1) (scatter xtres xtu1)
qnorm xtu1
swilk xtu1
histogram xtu1, kdensity normal
kdensity xtu1, normal
graph twoway (lfitci xtres xtu2) (scatter xtres xtu2)
qnorm xtu2
swilk xtu2
histogram xtu2, kdensity normal
kdensity xtu2, normal
This is a multilevel regression, also called a linear mixed-effects model: wordsum (the IQ test score) is the dependent variable; the race dichotomy, the age dummies, and the race-by-age (dummy) interactions enter the fixed component as covariates; cohort (21 categories) provides random intercepts, and race gets a random slope so that the race difference can vary across cohorts. mle requests (full) maximum likelihood; reml requests restricted (residual) maximum likelihood, a better estimator than mle when the focus is on variance components. cov(un) (or cov(unstructured)) lets the model estimate an intercept-by-slope covariance. We can instead use cov(independent), which allows a distinct variance for each random effect but fixes the covariances at zero, if we assume random intercepts and slopes are uncorrelated. Another option is cov(identity), which assumes all variances are equal and all covariances are zero. Sometimes the model reaches convergence but Stata displays the message "standard-error calculation failed"; this usually means the model is incorrectly specified.
The reffects option produces two random-effects variables, u1 (slopes) and u2 (intercepts). estimates store saves the results for later use, e.g., an lrtest comparing the fit of two models.
Sampling weights can be applied in multilevel regression. The problem is that a sampling weight for the level-2 cluster variable usually does not exist in survey data. However, we can set the second-level sampling weight to just 1 (or create a variable "one" equal to 1 for all cases with gen and weight the level-2 equation with it), provided we can safely assume the categories of the level-2 cluster variable have equal probabilities of being selected into the sample.
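A sketch of that workaround, mirroring the xtmixed call above but with an explicit level-2 weight variable:

```stata
* constant level-2 weight: assumes cohorts had equal selection probabilities
gen one = 1
xtmixed wordsum bw1 aged2-aged9 bw1aged2-bw1aged9 [pweight=sweight] ///
    || cohort21: bw1, mle cov(un) var pweight(one) pwscale(size)
```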
Factor analysis
by race, sort : pca educ logincomec degree [aweight = sweight] if age<70, components(3) means
by race, sort : factor educ logincomec degree [aweight = sweight] if age<70, factors(3) blanks(.1)
by race, sort : factor educ logincomec degree [aweight = sweight] if age<70, ml factors(3) blanks(.1)
Above is the syntax for factor analysis and principal component analysis. We can request descriptive statistics with means, set the minimum loading to display with blanks(), and set the maximum number of factors to retain with factors() for FA or components() for PCA. Maximum-likelihood factor analysis is obtained with the ml option.
Good references.
Stata FAQ: How can I estimate effect size for xtmixed?
Stata FAQ: How can I identify cases used by an estimation command using e(sample)?
Why should I not do a likelihood-ratio test after an ML estimation (e.g., logit, probit) with clustering or pweights?
Cameron, A. C., & Trivedi, P. K. (2009). Microeconometrics Using Stata. Stata Press books.
Everitt, B. S., & Rabe-Hesketh, S. (2006). Handbook of Statistical Analyses Using Stata. CRC Press.
Hamilton, L. (2012). Statistics with Stata: Version 12. Cengage Learning.
West, B.T., Welch, K.B., & Galecki, A.T. (2014). Linear Mixed Models: A Practical Guide Using Statistical Software, Second Edition. Chapman Hall/CRC Press: Boca Raton, FL.