Using Survey Data: Some Technical Notes

May 15, 2013

When working with survey data, difficulties often arise from the specific variables being used. Several things must be kept in mind.

The advantage of averaging the scores from questionnaires administered several times, or from questionnaires measuring (more or less) the same dimension of, say, personality, is that it improves the reliability of the measurement. Cronbach's alpha can be used to check the reliability of the items being combined. Higher alphas indicate stronger internal consistency. With the reliability coefficients at hand, correction for attenuation (unreliability) is possible.
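
As a rough illustration of the two computations just mentioned, here is a minimal Python sketch using only the standard library. The function names and the toy scores are made up for the example, and the sketch assumes complete data (no missing values) for every respondent.

```python
import math
from statistics import variance  # sample variance (n - 1 denominator)

def cronbach_alpha(items):
    """items: list of item columns, one list of scores per item,
    all for the same respondents in the same order."""
    k = len(items)
    # Total score of each respondent across the k items.
    totals = [sum(row) for row in zip(*items)]
    sum_item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))

def disattenuate(r_xy, rel_x, rel_y):
    """Classical correction for attenuation: r / sqrt(rxx * ryy)."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical data: three items answered by five respondents.
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
alpha = cronbach_alpha(items)
corrected = disattenuate(0.30, 0.80, 0.70)
```

The corrected correlation is always at least as large in absolute value as the observed one, since both reliabilities are at most 1.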

One problem with computing an index longitudinally, by averaging or summing the relevant variables, is that the sample size gets smaller year after year. So when summing variables measured at different points in time, we need to look at the sample size for each variable. Adding up the scores of people with unequal numbers of completed questionnaires is nonsense. One way of overcoming this difficulty is to use the SPSS syntax "do if not missing(...) and not missing(...)", although it restricts the sample size. Another method is to average the scores, which does not have the drawbacks of the SUM command.

If we need to combine two or more questionnaires administered at different points in time because of missing values in the first questionnaire(s), averaging is a better method than summing, since the SUM command adds together the two scores of respondents who answered the question twice but has only a single score for those who answered once. Alternatively, the syntax "do if missing(first questionnaire)" can be used.
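
The difference between summing and averaging across waves can be sketched in Python as follows. This is only an illustration: missing answers are represented by None, the function names are invented for the example, and sum_available is meant to mimic a sum that skips missing values.

```python
def sum_available(scores):
    """Sum of the non-missing scores (None = missing)."""
    present = [s for s in scores if s is not None]
    return sum(present) if present else None

def sum_complete(scores):
    """Sum only when every wave was answered, similar in spirit to
    'do if not missing(...) and not missing(...)'."""
    return None if any(s is None for s in scores) else sum(scores)

def mean_available(scores):
    """Mean of the non-missing scores."""
    present = [s for s in scores if s is not None]
    return sum(present) / len(present) if present else None

# Two respondents with the same answer (4) on the same question:
twice = [4, 4]      # answered in both waves
once = [4, None]    # answered in one wave only

print(sum_available(twice), sum_available(once))    # sums differ
print(sum_complete(twice), sum_complete(once))      # second case is dropped
print(mean_available(twice), mean_available(once))  # means agree
```

The mean keeps both respondents on the same scale, whereas the sum either penalizes the respondent with one answer or, in the complete-case version, drops that respondent.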

Sometimes, when averaging the items of, say, a personality scale, higher values on one item indicate lower risk aversion while higher values on another item indicate higher risk aversion. In that case, summing or averaging the original scores makes no sense. Some items need to be reversed by multiplying them by -1. Also, do not forget to correlate the variables being averaged and check what the correlations look like, that is, whether they are significantly positive or not.
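
A small Python sketch of the reversal and of the correlation check, with invented item names and toy scores on a hypothetical 1-to-5 scale:

```python
import math

def pearson(x, y):
    """Plain Pearson correlation for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical items: high = risk seeking vs. high = risk averse.
risk_seeking = [1, 2, 4, 5, 3]
risk_averse = [5, 4, 2, 1, 3]

r_before = pearson(risk_seeking, risk_averse)  # negative: items point in opposite directions

# Reverse the second item by multiplying by -1, as described above.
reversed_item = [-v for v in risk_averse]
r_after = pearson(risk_seeking, reversed_item)  # now positive
```

Multiplying by -1 flips the sign of the correlation but also shifts the item to negative values. If the items will be averaged on their original scale, an alternative (not mentioned in the text) is to compute (minimum + maximum) - score, here 6 - score.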

And some variables are just horrible. For example, in the NLSF, a large portion of respondents have SAT scores of 998 or 999, not because they actually scored 999 but because these codes denote respondents who have not taken the SAT. These variables need to be recoded so that values of 998 and higher are set as missing. The NLSY also has some variables with similar features. Needless to say, leaving such codes in place is likely to mess up the entire analysis. Finally, running a frequency table helps us see which values are originally set as missing.
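
The recoding step can be sketched in Python like this. The scores are invented, and the threshold of 998 is an assumption that only makes sense for a variable whose valid values are all below 998.

```python
from collections import Counter

def recode_missing(values, threshold=998):
    """Set codes at or above the threshold to missing (None)."""
    return [None if v is not None and v >= threshold else v for v in values]

def mean_valid(values):
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid)

# Hypothetical SAT section scores with 998/999 as 'not taken' codes.
sat = [520, 998, 640, 999, 710]

freq = Counter(sat)           # a frequency table reveals the suspicious codes
cleaned = recode_missing(sat)

print(mean_valid(sat))        # inflated by the 998/999 codes
print(mean_valid(cleaned))    # mean of the real scores only
```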

It is necessary to look at the frequency distribution of the variables being used (using histograms). When a variable is skewed, Pearson correlations seem to be weakened, so Spearman correlations should be computed too. When computing a point-biserial correlation, we also need to look at the frequency distribution of the dichotomous variable (e.g., a question asking whether or not the respondent has been arrested by the police in the past), because a departure from a 50/50 split artificially reduces the obtained correlation. Correction for unequal sample sizes must be done using the appropriate formulas, for example like I did here.
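
For illustration, one commonly cited correction for an uneven split is the one given by Hunter and Schmidt, which returns the correlation that the same standardized mean difference would produce under a 50/50 split. Whether this is the formula used in the post linked above is not known, so the Python sketch below is an assumption, not a reproduction of that analysis.

```python
import math

def correct_uneven_split(r, p):
    """r: observed point-biserial correlation.
    p: proportion of cases in one of the two groups (0 < p < 1).
    Returns the correlation expected under an equal (50/50) split."""
    q = 1 - p
    a = math.sqrt(0.25 / (p * q))
    return a * r / math.sqrt((a ** 2 - 1) * r ** 2 + 1)

# Hypothetical case: 10% of respondents report an arrest, observed r = 0.20.
r_corrected = correct_uneven_split(0.20, 0.10)
print(round(r_corrected, 3))
```

With p = 0.5 the factor a equals 1 and the correlation is returned unchanged. The further the split departs from 50/50, the larger the upward correction.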

When creating a variable by standardizing, regressing out a confounding variable, or using factor analysis, we first need to check whether a filter is active. If the filter restricts the sample to a particular group, the new variable will only concern that restricted sample.

In the NLSY, a variable whose survey year is XRND, meaning the year of last interview, collects the most recent information from participants. A problem could be that, for some persons, the year of last interview is not recent. A good example is children ever born. The question is asked every year (or almost), and at each subsequent year the N diminishes, so we need to check the sample size. For example, the XRND variable contains 12686 respondents. We can select the same question in survey years 2006, 2008 and 2010 and check their sample sizes. If they came close to 12686 (which is not the case, since they average ~8000 or so), this would mean that the non-interviewed respondents do not constitute a problem. Nonetheless, if we correlate two variables, one with XRND and the other with survey year 2006, the participants who dropped out of the most recent survey years will probably not be selected. So this is unlikely to be a problem.

Some variables in survey data may originally be stored as 'string' variables. Because of this, the variable will not work if we try to correlate it with another variable. The "type" of the variable must be changed to 'numeric' in the data window.

Another concern is the use of sampling weights, because they inflate the sample size, so that the p-values are no longer informative. The method of scaled weights can be used. Unlike Pearson's r, Spearman's rho is very sensitive to sample size. And because correlation matrices are usually computed with pairwise rather than listwise deletion, the N varies across cells, which can complicate the use of scaled weights. It is better to use pairwise deletion when the N is not large enough. One way to overcome the inflation problem is to divide the sampling weight variable by the factor by which the N is inflated. If the weights inflate the original N by a factor of, say, 100,000, we must divide them by 100,000. I also noticed that the N inflation may differ across racial/ethnic groups.
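
The scaling step can be sketched in Python as follows. The weights are invented, and the inflation factor is taken to be the weighted N divided by the actual N, which is one straightforward reading of the procedure described above.

```python
def scale_weights(weights):
    """Divide each weight by the inflation factor (weighted N / actual N),
    so that the scaled weights sum to the actual sample size."""
    n = len(weights)
    factor = sum(weights) / n
    return [w / factor for w in weights]

# Hypothetical sampling weights that inflate N = 4 to a weighted N of 400,000.
weights = [150000, 50000, 100000, 100000]
scaled = scale_weights(weights)

print(sum(weights) / len(weights))  # inflation factor
print(scaled, sum(scaled))          # relative weights preserved, sum equals N
```

Because the inflation may differ across groups, the same function could be applied separately to the weights of each group when analyses are run within groups.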

When the dataset contains a large number of variables (NLSF, Add Health, ECLS, ...), it is sometimes very difficult, if not impossible, to find where the variable we need for a multiple regression, or anything else, is located in the never-ending list of variables. But we can copy the variable into a new variable, which will then appear at the very bottom of the list. In SPSS, the syntax should look like "RECODE ... (lowest thru highest=COPY) INTO ..." followed by EXECUTE.

Copy-pasting a variable from one data window to another is possible, but not always straightforward, especially if we want to work in Excel. For example, the SPSS data window may display a comma rather than a decimal point (this depends on the locale settings). Copy-pasting an SPSS data column into an Excel sheet may then pose a problem, because Excel cannot correlate values written with a comma "," instead of a dot ".". The easiest fix is to click on Edit, Find, Replace, insert a comma in the row "Find what", insert a point in the row "Replace with", click on "Find all", and then click on "Replace all".

We can also highlight the specific cells to which we want to apply this conversion. Conversely, if we want to copy-paste from Excel to, say, SPSS, we have to change the decimal points into commas. Then we have to change the variable's "Type" column from 'String' to 'Numeric' and set the number of decimals.
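
The same conversion can be done outside the spreadsheet. Below is a minimal Python sketch with invented function names. It assumes the cells contain plain numbers with a decimal comma and no thousands separators.

```python
def comma_to_float(cell):
    """Convert a string such as '0,35' into the number 0.35."""
    return float(cell.strip().replace(",", "."))

def float_to_comma(value, decimals=2):
    """Format a number with a decimal comma, e.g. 0.5 -> '0,50'."""
    return f"{value:.{decimals}f}".replace(".", ",")

# Hypothetical column copied from a data window that uses decimal commas.
column = ["0,35", " 1,20", "-0,07"]
numbers = [comma_to_float(c) for c in column]
back = [float_to_comma(x) for x in numbers]
```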

Meng Hu on HBD and Austrian Economics
