OLS regression is a widely applied technique, and many variants of the classical regression exist. Among them are the tobit and truncated regressions. Their use is recommended when the dependent variable (Y) is constrained in some way. Both share a common feature: the Y variable is treated as a latent variable (denoted Y*) rather than an observed variable. This raises several complications compared to classical OLS.
I decided to cover this topic because I have applied this kind of analysis in my paper on the black-white score changes in the GSS Wordsum test. These techniques are not available in SPSS. One reason may be that they are applied mainly by economists (who mainly use Stata), not by psychologists (who mainly use SPSS and may not even be aware of these techniques). However, the problem raised by data censoring and data truncation is also relevant in the field of psychology.
The tobit (or censored) regression is proposed for a dependent variable that is censored at the lower end of its distribution, at the upper end, or at both. Censoring is essentially a problem of floor and ceiling effects. For instance, some individuals are stacked at a certain threshold value (τ) because they cannot have a higher or lower score on the variable. This may have different causes; the test may be too easy or too difficult. But censoring can take another form. An income variable may have been coded into categories, e.g., $10,000-$20,000, etc., but then at the very end, the last category may be something like “$100,000 and above”. In this case, the variable is censored at the upper end. As mentioned earlier, the data can be censored at both ends, in which case we specify a two-limit tobit regression (by setting the lower and upper censoring values); see Long (1997, pp. 212-213) for a development. For instance, in insurance coverage, there is a minimum coverage, a maximum coverage, and values in between.
The truncated regression is proposed for a dependent variable whose distribution is not representative of the entire population. Truncation is essentially a problem of range restriction (although it is inaccurate to equate truncation with range restriction). For instance, the data may have been collected only for people who purchased durable goods. People who did not purchase these goods due to, e.g., their price levels, are then said to be truncated from below (instead of above). This is not to say that OLS is necessarily biased. It depends on the goal of the analysis. If we are interested in the value of Y for the entire population, OLS is biased. But if we are merely interested in our subsample, OLS is sufficient (see the Stata manual). However, we must be aware that when we omit a portion of the data in this manner, the truncated data points are missing not at random (because the value of Y differs between truncated and untruncated observations).
A graphical representation of censoring and truncation is given by Long (1997) :
Panel A shows the “latent” variable Y* that tobit and truncated regressions try to estimate (based on the set of independent variables). With censoring, the observations at or below the threshold value τ=1 are stacked at zero. With truncation, the observations at or below the threshold value τ=1 literally disappear.
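The difference between the two mechanisms can be sketched in a few lines of Python. This is only an illustration under assumed values: the latent sample, its mean and spread, and the helper name `censor_and_truncate` are hypothetical, while the threshold τ=1 and the censored value 0 follow the description of the figure.

```python
import random

def censor_and_truncate(y_star, tau=1.0, censored_value=0.0):
    """Return (censored, truncated) versions of a latent sample.

    censored:  same length as y_star; values at or below tau are replaced
               by censored_value (they are stacked at one point).
    truncated: only the values strictly above tau are kept.
    """
    censored = [y if y > tau else censored_value for y in y_star]
    truncated = [y for y in y_star if y > tau]
    return censored, truncated

rng = random.Random(0)                                  # fixed seed
y_star = [rng.gauss(2.0, 1.5) for _ in range(1000)]     # hypothetical latent Y*
y_cens, y_trunc = censor_and_truncate(y_star)

n_below = sum(1 for y in y_star if y <= 1.0)
print("cases at or below tau:", n_below)
print("censored sample size: ", len(y_cens))    # nothing is dropped
print("truncated sample size:", len(y_trunc))   # cases below tau are gone
```

The censored sample keeps every case but loses the information on how far below τ the low values were; the truncated sample loses the cases altogether.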
Both techniques use maximum likelihood (ML) to estimate the effect of changes in the independent variables (Xs) on the expected (i.e., “potential”) value of the dependent variable (Y), assuming a Gaussian (i.e., normal) distribution. Because the expected value of the dependent variable is latent (i.e., not observed), it is not possible to obtain standardized coefficients unless we apply a special procedure (Long, 1997, pp. 207-208).
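To make the ML idea concrete, here is a minimal sketch of the standard tobit log-likelihood for censoring from below: uncensored cases contribute the normal density of the observed value, and censored cases contribute the probability that Y* falls at or below the threshold. The function name `tobit_loglik`, the single predictor, and the simulated data are assumptions for illustration, not the routine used by any particular software.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: pdf = density, cdf = distribution

def tobit_loglik(b0, b1, sigma, xs, ys, tau=0.0):
    """Log-likelihood of a one-predictor tobit model censored from below at tau."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = b0 + b1 * x                      # mean of the latent Y* given x
        if y > tau:
            # Uncensored case: normal density of the observed value.
            total += math.log(STD_NORMAL.pdf((y - mu) / sigma) / sigma)
        else:
            # Censored case: probability that Y* is at or below tau.
            total += math.log(STD_NORMAL.cdf((tau - mu) / sigma))
    return total

# Hypothetical data: Y* = 1 + 2x + e, with observed Y stacked at tau = 0.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
ys = [max(1.0 + 2.0 * x + rng.gauss(0.0, 1.0), 0.0) for x in xs]

ll_true = tobit_loglik(1.0, 2.0, 1.0, xs, ys)   # parameters used to simulate
ll_flat = tobit_loglik(1.0, 1.0, 1.0, xs, ys)   # deliberately wrong slope
print(ll_true, ll_flat)
```

An ML routine searches for the values of the intercept, slope and sigma that make this quantity as large as possible; here the parameters used to generate the data should score higher than the deliberately wrong ones.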
As for tobit, the technique allows a decomposition of the effect of X on the expected value of Y into two parts: (1) the change in the probability of being above the censored value, multiplied by the expected value of Y if above, plus (2) the change in the expected Y for the cases above the censored value, multiplied by the probability of being above the censored value (McDonald & Moffitt, 1980). Mathematically, the decomposition is given by :
δEy/δXi = F(z)(δEy*/δXi) + Ey*(δF(z)/δXi)
where F(z) is the proportion of cases (i.e., probability) being above the threshold, Ey* is the expected value of Y for cases above the threshold, δEy*/δXi is the change in that expected value associated with an independent variable, and δF(z)/δXi is the change in the probability of being above the threshold associated with an independent variable.
Long (1997, p. 196) presents the formula in a more intuitive way :
E(y) = [Pr(Uncensored) x E(y|y>τ)] + [Pr(Censored) x E(y|y=τy)]
where Pr stands for probability, E(y) for the expected value of y, | y>τ for conditional on y being above τ, and τy is the value assigned to y when y* is censored (at least in Long’s book; see p. 197).
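Both formulas can be checked numerically. The sketch below assumes the simplest case, censoring from below at τ=0 with censored value τy=0 and normal errors, and uses the textbook expressions for that case; the parameter values and the function names `expected_y` and `decompose` are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def expected_y(xb, sigma):
    """E(y) = Pr(Uncensored) * E(y | y > 0) + Pr(Censored) * 0."""
    z = xb / sigma
    prob_above = STD_NORMAL.cdf(z)                             # F(z)
    mean_above = xb + sigma * STD_NORMAL.pdf(z) / prob_above   # E(y | y > 0)
    return prob_above * mean_above + (1.0 - prob_above) * 0.0

def decompose(xb, beta, sigma):
    """Return the two parts of the change in E(y) for a unit change in X."""
    z = xb / sigma
    prob_above = STD_NORMAL.cdf(z)
    lam = STD_NORMAL.pdf(z) / prob_above          # inverse Mills ratio
    mean_above = xb + sigma * lam
    d_mean_above = beta * (1.0 - z * lam - lam ** 2)   # change in E(y | y > 0)
    d_prob_above = beta * STD_NORMAL.pdf(z) / sigma    # change in F(z)
    return prob_above * d_mean_above, mean_above * d_prob_above

beta, sigma, xb = 0.8, 1.5, 0.4        # hypothetical coefficient, sigma, X*beta
part_above, part_prob = decompose(xb, beta, sigma)
total = part_above + part_prob

# Numerical derivative of E(y) with respect to X, for comparison.
h = 1e-5
numeric = (expected_y(xb + beta * h, sigma) - expected_y(xb - beta * h, sigma)) / (2 * h)
print(part_above, part_prob, total, numeric)
```

Under these assumptions the two parts add up to the tobit coefficient multiplied by F(z), so the effect on the observed Y is always smaller in absolute value than the coefficient on the latent Y*.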
If we are only interested in the effects of the Xs on the latent Y, the coefficients obtained from tobit regression can be interpreted in the same way as those obtained from OLS regression (Roncek, 1992).
The formula for truncated regression can be found in Long (1997, p. 194) and in the Stata manual entry for the truncreg command.
We have not yet explained in detail why OLS is inconsistent with truncated data when our interest lies in the population estimates. One crucial assumption of OLS regression concerns the errors (residuals): they must have mean zero and be uncorrelated with all explanatory variables. The problem is that truncation causes the sample selection indicator (s) to be correlated with the error term (u). Wooldridge (2012, pp. 616-617) provides an example with a selection indicator s, where s=1 if the observation is observed and s=0 otherwise. With data truncated from above, s=1 if Y is lower than or equal to the threshold τ. Equivalently, s=1 if u≤τ-Xβ, where Xβ is shorthand for β0 + β1X1 + β2X2 + … . This means that the value of s covaries with u.
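A small simulation illustrates the point. It assumes a hypothetical model Y = 1 + 2X + u with standard normal X and u, truncated from above at τ=2; in the full sample X and u are unrelated, but among the selected cases (s=1) they are not.

```python
import random

def mean(values):
    return sum(values) / len(values)

def correlation(a, b):
    """Pearson correlation of two equally long lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(2)
tau = 2.0                                           # truncation from above
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
us = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# s = 1 only if Y = 1 + 2X + u is at or below tau, i.e. u <= tau - (1 + 2X).
kept = [(x, u) for x, u in zip(xs, us) if 1.0 + 2.0 * x + u <= tau]
x_kept = [x for x, _ in kept]
u_kept = [u for _, u in kept]

print("corr(X, u), full sample:    ", correlation(xs, us))
print("corr(X, u), selected sample:", correlation(x_kept, u_kept))
print("mean of u, selected sample: ", mean(u_kept))
```

In the selected sample a high X can only be observed together with a low u, so the errors no longer have mean zero and are negatively correlated with X.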
Long (1997) illustrates the consequences of censoring and truncation for OLS estimation with Figure 7.2. The solid line is the OLS estimate of Y when the data is not censored. The long dashed line, OLS with censored data, has a lower intercept and a steeper slope because the many values set at zero (shown as triangles), just below the horizontal threshold line τ=1, pull down the left side of the line. The short dashed line is the OLS estimate when the data points below τ=1 are truncated (i.e., removed) instead of being censored, and it shows a higher intercept and a smaller slope.
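The pattern described for this figure can be reproduced with simulated data. The sketch below is not Long's data: the model Y* = 0.5 + 0.4X + e, the range of X and the error spread are assumptions chosen so that censoring and truncation occur mostly on the left side, with τ=1 and censored values set to zero as in the figure.

```python
import random

def ols(xs, ys):
    """Return (intercept, slope) of a simple least-squares line."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

rng = random.Random(3)
tau = 1.0
xs = [rng.uniform(0.0, 10.0) for _ in range(5000)]
y_star = [0.5 + 0.4 * x + rng.gauss(0.0, 0.7) for x in xs]   # latent Y*

# Censoring: values at or below tau are set to zero, all cases are kept.
y_cens = [y if y > tau else 0.0 for y in y_star]
# Truncation: cases at or below tau are removed from the sample.
pairs_trunc = [(x, y) for x, y in zip(xs, y_star) if y > tau]

a_lat, b_lat = ols(xs, y_star)
a_cens, b_cens = ols(xs, y_cens)
a_trunc, b_trunc = ols([x for x, _ in pairs_trunc], [y for _, y in pairs_trunc])

print("latent:    intercept %.3f, slope %.3f" % (a_lat, b_lat))
print("censored:  intercept %.3f, slope %.3f" % (a_cens, b_cens))
print("truncated: intercept %.3f, slope %.3f" % (a_trunc, b_trunc))
```

With this setup the censored fit should show the lower intercept and steeper slope, and the truncated fit the higher intercept and flatter slope, relative to the fit on the latent variable.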
Figure 7.7 (page 202) also shows in a very simple manner the effects of censoring and truncation. The difference here is that the censored data points are equal to the threshold rather than being below it. The dots below the threshold τ=2 are truncated data points. E(y*|x), the solid line, is the correct estimate. E(y|y>2, x) is given by the long dashed line. We see that the long dashed line is indistinguishable from the solid line as we move toward the right side, but it lies above the solid line as we move to the left side. This is because few data points are truncated on the right side and many on the left side. The long dashed line gets closer and closer to τ as we move to the left. We also see circles along the horizontal line τ=2. These are censored data points. The short dashed line, E(y|x), is slightly below the long dashed line on the left side of the x axis, because the censored cases were not eliminated.
Both types of regression require normality and homoscedasticity of the residuals, even in the case of tobit, which always considers a censored distribution to be non-normal. But since the Y* variable is not observable, we cannot obtain the residuals by computing Y minus Yhat, because we would have to use Y* instead of Y. In tobit regression, a complex procedure must be applied to get the generalized residuals and conduct the test of normality (Cameron & Trivedi, 2009, pp. 535-538).
A particular feature of these kinds of regressions is that standardized coefficients are usually not reported by statistical software because their calculation is not straightforward. Normally, the fully standardized coefficients are obtained with the operation coeff(X)*SD(X)/SD(Y). In the case of tobit regression, Roncek (1992, p. 506) shows that the standardized tobit coefficient can be obtained by coeff(X)*f(z)/“sigma”. f(z) is the unit normal density; this is (in my opinion) a complicated way of presenting the formula because one could have replaced the ambiguous f(z) by the more intuitive notation SD(X). “Sigma” is the estimated standard error of the tobit regression model (usually reported by the software) and is comparable with the estimated root mean squared error in OLS regression. But since sigma is the standard deviation of Y* conditional on the set of X variables, and need not be equal to the unconditional standard deviation of Y*, which is what we need, Long (1997, pp. 207-208) argues that the unconditional variance of Y* should be computed with the quadratic form :
σ^y*² = β^′ Var^(x) β^ + σ^ε²
where β^ is the vector of estimated tobit coefficients, Var^(x) is the estimated covariance matrix among the x’s and σ^ε² is the ML estimate of the variance of ε. Thus, Long suggests we use the formula coeff(X)*SD(X)/σ^y*, where σ^y* is the square root of σ^y*².
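As a worked illustration, the computation can be sketched as follows. It assumes the usual definition of a fully standardized coefficient (the coefficient times SD(X), divided by the unconditional standard deviation of Y*) and the unconditional variance of Y* as the quadratic form in the coefficients plus the error variance. All numbers and function names are hypothetical; in practice the coefficients and sigma would come from the tobit output and the covariance matrix from the data.

```python
import math

def unconditional_sd_ystar(beta, cov_x, sigma_eps):
    """sqrt(beta' Var(x) beta + sigma_eps^2), the unconditional SD of Y*."""
    k = len(beta)
    quad = sum(beta[i] * cov_x[i][j] * beta[j]
               for i in range(k) for j in range(k))
    return math.sqrt(quad + sigma_eps ** 2)

def standardized_coefficients(beta, cov_x, sigma_eps):
    """coeff(X) * SD(X) / SD(Y*) for each predictor."""
    sd_ystar = unconditional_sd_ystar(beta, cov_x, sigma_eps)
    return [b * math.sqrt(cov_x[i][i]) / sd_ystar for i, b in enumerate(beta)]

beta = [0.5, -1.0]              # hypothetical tobit coefficients (no intercept)
cov_x = [[4.0, 1.0],            # hypothetical covariance matrix of the x's
         [1.0, 9.0]]
sigma_eps = 2.0                 # hypothetical "sigma" reported by the software

sd_ystar = unconditional_sd_ystar(beta, cov_x, sigma_eps)
std_beta = standardized_coefficients(beta, cov_x, sigma_eps)
print(sd_ystar, std_beta)
```

With these numbers the quadratic form equals 9, the unconditional variance of Y* equals 13, and dividing by sigma alone (2.0) instead of the square root of 13 would overstate both standardized coefficients.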
Even though standardized coefficients seem to be usually preferred by psychologists, economists (and particularly econometricians) dislike them and probably won’t recommend their use.
Finally, it should be noted that OLS is not always inconsistent with data subject to sample selection (Wooldridge, 2012, pp. 615-616). We will re-use his example of the s indicator of sample selection. If sample selection (s) is random in the sense that s is independent of X and u, OLS is unbiased. But OLS remains unbiased even if s depends on the explanatory X variables and on additional random terms that are independent of X and u. If IQ is an important predictor but is missing for some people, such that s=1 if IQ≥v and s=0 if IQ<v, where v is an unobserved random variable that is independent of IQ, u and the other X variables, then s is still independent of u. It is not required that s be uncorrelated with the X variables, provided that the X variables are uncorrelated with u, because this implies that the product of s and X is also uncorrelated with the residuals u.
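This case can also be simulated. The sketch assumes a hypothetical model Y = 1 + 2·IQ + u with a standardized IQ variable, and keeps an observation only when IQ≥v, where v is drawn independently of IQ and u. Selection clearly depends on IQ, yet the slope estimated on the selected cases should stay close to the value used in the simulation.

```python
import random

def ols_slope(xs, ys):
    """Slope of a simple least-squares line."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

rng = random.Random(4)
n = 20000
iq = [rng.gauss(0.0, 1.0) for _ in range(n)]    # standardized IQ (hypothetical)
u = [rng.gauss(0.0, 1.0) for _ in range(n)]     # error term
v = [rng.gauss(0.0, 1.0) for _ in range(n)]     # unobserved, independent of IQ and u
y = [1.0 + 2.0 * x + e for x, e in zip(iq, u)]

# s = 1 if IQ >= v: selection depends on X and on v, but not on u.
selected = [(x, yy) for x, yy, vv in zip(iq, y, v) if x >= vv]

slope_full = ols_slope(iq, y)
slope_sel = ols_slope([x for x, _ in selected], [yy for _, yy in selected])
print("slope, full sample:    ", slope_full)
print("slope, selected sample:", slope_sel)
```

The selected sample over-represents high-IQ cases, but because the selection rule never looks at u, the relationship between Y and IQ within that sample is left intact.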
References.
Cameron, A. C., & Trivedi, P. K. (2009). Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press.
Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.
McDonald, J. F., & Moffitt, R. A. (1980). The uses of Tobit analysis. The Review of Economics and Statistics, 318-321.
Roncek, D. W. (1992). Learning more from tobit coefficients: Extending a comparative analysis of political protest. American Sociological Review, 503-507.
Wooldridge, J. (2012). Introductory econometrics: A modern approach. Cengage Learning.