Data Editing and Imputation

Data Editing

All respondent-reported data are edited for logical consistency. For example, suppose a respondent reports an age of 67 but is enrolled in 11th grade and is the biological child of a 47-year-old household member. These facts indicate that the respondent’s actual age is likely 17, not 67, and data editing would correct the reported age. Variables whose names start with ‘E’ are edited variables.
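The age example can be sketched as a small consistency rule. This is only an illustration: the function name, the field names, and the rule itself (typical age is about grade + 6, and a child cannot be as old as a biological parent) are assumptions made for this sketch, not the actual SIPP edit specifications.

```python
def edit_age(reported_age, grade, parent_age):
    """Return (edited_age, was_edited) under a hypothetical consistency rule."""
    # Hypothetical check: a child cannot be as old as a biological parent.
    inconsistent = parent_age is not None and reported_age >= parent_age
    if inconsistent and grade is not None:
        # Assumption for illustration: typical age for a grade is about grade + 6.
        expected = grade + 6
        # Candidate correction: keep the ones digit, assume the tens digit should be 1.
        candidate = 10 + reported_age % 10
        if abs(candidate - expected) <= 2:
            return candidate, True
    # No plausible correction found, so the reported value is kept.
    return reported_age, False


e_age, edited = edit_age(reported_age=67, grade=11, parent_age=47)
print(e_age, edited)
```

The rule only changes a value when related responses both contradict it and point to a plausible replacement; otherwise the reported value is left alone.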

During processing, some variables are created from the values of one or more other variables. These are recoded variables, and their names start with ‘R’. Some variables are topcoded, bottomcoded, or collapsed before they are placed on the public-use file; their names start with ‘T’.
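As a rough illustration of the naming convention and of topcoding, the sketch below classifies a variable by its first letter and caps values at a threshold. The variable names and the threshold are hypothetical, and a simple cap is an assumption; the text does not say which topcoding rule is applied to any given variable.

```python
def variable_kind(name):
    """Classify a variable by the first letter of its name."""
    kinds = {
        "E": "edited",
        "R": "recoded",
        "T": "topcoded, bottomcoded, or collapsed",
    }
    return kinds.get(name[:1].upper(), "other")


def topcode(values, threshold):
    """Cap every value above `threshold` at `threshold` (a simple topcode)."""
    return [min(v, threshold) for v in values]


# Hypothetical variable names, used only to show the prefix rule.
for name in ["E_EXAMPLE", "R_EXAMPLE", "T_EXAMPLE"]:
    print(name, "->", variable_kind(name))

# Hypothetical amounts and threshold.
print(topcode([10, 50, 200], threshold=100))
```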

Missing Data

As in all surveys, there are two general types of missing data in SIPP: unit nonresponse and item nonresponse. Unit nonresponse occurs when one or more of the people residing at a sample address is not interviewed and no proxy interview is obtained. Most types of unit nonresponse are dealt with through weighting adjustments.
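One common form of weighting adjustment, shown here only as a generic sketch, inflates the weights of interviewed units within an adjustment cell so that they also represent the cell's noninterviewed units. The cell definitions, field names, and weights below are hypothetical, and this is not a description of the actual SIPP weighting procedure.

```python
from collections import defaultdict


def nonresponse_adjust(units):
    """Return adjusted weights; noninterviewed units receive a weight of 0.

    Assumes every cell contains at least one interviewed unit.
    """
    total = defaultdict(float)      # sum of weights of all units in the cell
    responded = defaultdict(float)  # sum of weights of interviewed units
    for u in units:
        total[u["cell"]] += u["weight"]
        if u["responded"]:
            responded[u["cell"]] += u["weight"]

    adjusted = []
    for u in units:
        if u["responded"]:
            factor = total[u["cell"]] / responded[u["cell"]]
            adjusted.append(u["weight"] * factor)
        else:
            adjusted.append(0.0)
    return adjusted


units = [
    {"cell": "A", "weight": 100.0, "responded": True},
    {"cell": "A", "weight": 100.0, "responded": True},
    {"cell": "A", "weight": 100.0, "responded": False},
    {"cell": "B", "weight": 200.0, "responded": True},
]
adjusted = nonresponse_adjust(units)
print(adjusted)
```

The adjustment keeps the total weight in each cell unchanged while moving it entirely onto the interviewed units.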

Item nonresponse occurs when a respondent completes most of the questionnaire but does not answer one or more individual questions. Item nonresponse in SIPP occurs under the following circumstances:

Responding sample persons refuse or are unable to provide requested information
Interviewers fail to ask a question or incorrectly record a response
A response is inconsistent with related responses or is incompatible with response categories

Three different approaches are used for dealing with missing data in SIPP:

Data editing, as described above
Statistical (or stochastic) imputation for some types of item nonresponse
Weighting adjustments for some types of noninterviews

Imputation

There are two key problems caused by missing data:

A lack of consistency across analyses because data users compensate for missing data in different ways, and their analyses may be based on different subsets of data
Biased estimates of population parameters, because nonresponse is unlikely to be completely random and the responding sample may therefore be unrepresentative

Because missing data are always present to some degree, analyses of survey data must be based on assumptions about patterns of missing data. When missing data are not imputed or otherwise accounted for in the model being estimated, the implicit assumption is that data are missing at random after controlling for other variables in the model. The imputation procedures used for SIPP are based on the assumption that data are missing at random within subgroups of the population. The statistical goal of imputation is to reduce the bias of survey estimates. This goal is achieved to the extent that systematic patterns of item nonresponse are correctly identified and modeled. In SIPP, the statistical goals of imputation are general, rather than specific. Instead of addressing the estimation of specific parameters, SIPP procedures are designed to provide reasonable estimates for a variety of analytical purposes.

SIPP uses three main imputation strategies:

Model-Based Imputation
Sequential Hot-Deck Imputation
Cold-Deck Imputation

Model-Based Imputation

Model-based imputation predicts topic flags for respondents who originally did not report information for a topic. Each flag indicates whether the respondent should have answered questions about a specific content area (e.g., Social Security or TANF). The output of this prediction is a Y/N topic flag variable, with ‘Y’ indicating that there should be data related to the topic.

In addition to topic flag variables, select variables pertaining to earnings, assets, liabilities, employment characteristics, and retirement content also use model-based imputation to fill in missing information. The imputed values for these variables can be binary (e.g., owning an IRA retirement account), nominal (e.g., wageworker/self-employed/other), or continuous (e.g., earnings).

This modeling method has several advantages over hot-deck imputation. We can include many more explanatory variables in the models than can be included as stratifiers in a hot-deck. For topic flag variables, this means we can condition the imputation for a given topic flag on the imputed values for every other topic flag, which is intended to approximate a joint distribution of values rather than a series of independent imputations. Parent and spouse variables can also be used as regressors or conditioning variables in the models, which allows us to better preserve the relationships among the model-based imputed variables (MBIVs) of household members. Finally, non-SIPP data can be used to mitigate the problem of respondents with missing values differing in unobservable ways from respondents with non-missing values. The rich administrative data from the Social Security Administration (SSA) and Internal Revenue Service (IRS) are particularly helpful in predicting values not only for missing topic flags, but also for missing information regarding earnings, assets, liabilities, employment, and select retirement content. This is why these select non-topic flag variables use model-based imputation.
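To make the idea of a stochastic, model-based draw concrete, the sketch below fits a small logistic regression on reporters and then draws a Y/N flag for a nonreporter from the predicted probability. The choice of logistic regression, the two predictors (an age indicator and an indicator for a matching administrative record), and the toy data are all assumptions made for illustration; the text does not specify the form of the SIPP models.

```python
import math
import random


def predict_prob(beta, x):
    """Predicted probability that the flag is 'Y' for predictor vector x."""
    z = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit by batch gradient ascent; returns [intercept, b1, b2, ...]."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = yi - predict_prob(beta, xi)
            grad[0] += err
            for j, xij in enumerate(xi):
                grad[j + 1] += err * xij
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta


def impute_flag(beta, x, rng):
    """Stochastic imputation: draw Y/N from the predicted probability."""
    return "Y" if rng.random() < predict_prob(beta, x) else "N"


# Hypothetical reporters: [age_65_plus, has_admin_benefit_record] -> flag (1 = Y).
X = [[1, 1], [1, 1], [1, 0], [1, 0], [0, 1], [0, 0], [0, 0], [0, 0]]
y = [1, 1, 1, 0, 1, 0, 0, 0]

beta = fit_logistic(X, y)
rng = random.Random(0)
p_high = predict_prob(beta, [1, 1])
p_low = predict_prob(beta, [0, 0])
flag = impute_flag(beta, [1, 1], rng)
print(round(p_high, 3), round(p_low, 3), flag)
```

Drawing from the predicted probability, rather than always assigning the most likely value, is what makes the imputation stochastic. Adding further regressors, such as other topic flags or spouse variables, would only lengthen the predictor vector.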

Sequential Hot-Deck Imputation

The statistical imputation method used to impute most missing items in SIPP is known as sequential hot-deck imputation. For many topics, sequential regression multivariate imputation (SRMI) models determine whether a respondent with missing data should have data for a topic (e.g., receipt of unemployment insurance any time during the reference year), whereas detailed information about the topic (e.g., months of receipt or monthly amount received) is imputed using sequential hot-deck imputation. In some cases a ratio is imputed and then used to derive the value, rather than imputing the value itself. This preserves relationships between certain variables (e.g., asset value and income).
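The ratio approach can be illustrated with a short sketch. The function name and the dollar amounts are hypothetical; the point is only that the donor supplies a ratio (here, income relative to asset value) and the recipient's own reported asset value determines the final amount.

```python
def impute_via_ratio(recipient_base, donor_value, donor_base):
    """Derive a missing value from a donor's ratio, not the donor's raw value."""
    ratio = donor_value / donor_base
    return ratio * recipient_base


# Hypothetical donor: asset value 16,000 with asset income 500.
# Hypothetical recipient: asset value 80,000 with asset income missing.
imputed_income = impute_via_ratio(
    recipient_base=80_000, donor_value=500, donor_base=16_000
)
print(imputed_income)
```

Copying the donor's income of 500 directly would ignore that the recipient holds five times the assets; scaling by the ratio keeps income in proportion to the recipient's asset value.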

In a general sense, the sequential hot-deck procedure matches a record with missing data to that of a donor with similar background characteristics and uses the donor’s values. This procedure differs from data editing, which replaces missing data with inferred values based on non-missing data from the same case.
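A minimal sketch of the general procedure follows. Records are processed in file order; each cell of the hot-deck holds the value from the most recent reporter with those background characteristics, and a record with a missing value takes whatever its cell currently holds. The cell variables, the target variable, and the starting values are hypothetical, and the real procedures use many more details than are shown here.

```python
def sequential_hot_deck(records, cell_keys, target, start_values):
    """Impute `target` in file order; returns copies with an `imputed` flag."""
    hot_deck = dict(start_values)  # each cell begins with a starting value
    out = []
    for rec in records:
        cell = tuple(rec[k] for k in cell_keys)
        new = dict(rec)
        if rec[target] is None:
            # Recipient: take the value of the most recent donor in this cell,
            # or the starting value if no donor has been seen yet.
            new[target] = hot_deck[cell]
            new["imputed"] = True
        else:
            # Reporter: becomes the current donor for this cell.
            hot_deck[cell] = rec[target]
            new["imputed"] = False
        out.append(new)
    return out


# Hypothetical data: months of benefit receipt, with cells defined by sex and age group.
records = [
    {"sex": "F", "age_group": "25-44", "months": None},
    {"sex": "F", "age_group": "25-44", "months": 12},
    {"sex": "M", "age_group": "25-44", "months": 3},
    {"sex": "F", "age_group": "25-44", "months": None},
    {"sex": "M", "age_group": "25-44", "months": None},
]
start_values = {("F", "25-44"): 6, ("M", "25-44"): 6}

result = sequential_hot_deck(records, ["sex", "age_group"], "months", start_values)
print([(r["months"], r["imputed"]) for r in result])
```

Because the deck is updated as the file is read, the sort order of the file determines which donor a recipient receives.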

Hot-decks are cross-sectional; only values from current wave responses are used in the definition of the hot-deck cells. SIPP hot-deck procedures are designed to preserve the univariate distribution of each variable subjected to imputation. However, they do not generally preserve the covariances among variables. One consequence is that imputation can introduce inconsistencies into the data. For example, if a respondent has reported program participation, but his or her income is too high for that program, it is possible that the income data have been imputed. Whenever users detect inconsistencies, it is wise to check the status flag to see if the inconsistent data might have been imputed.
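The check suggested above might look like the following sketch. The field names, the income limit, and the coding of the status flag (0 for a reported value, nonzero for an imputed one) are assumptions for illustration; consult the file documentation for the actual status flag names and codes.

```python
def flag_imputed_inconsistencies(records, income_limit):
    """List (id, income_was_imputed) for cases where participation and income conflict."""
    findings = []
    for r in records:
        if r["in_program"] and r["income"] > income_limit:
            # Assumed coding: a nonzero status flag marks an imputed value.
            findings.append((r["id"], r["income_status"] != 0))
    return findings


# Hypothetical records and a hypothetical program income limit.
records = [
    {"id": 1, "in_program": True, "income": 900, "income_status": 0},
    {"id": 2, "in_program": True, "income": 5200, "income_status": 1},
    {"id": 3, "in_program": True, "income": 4800, "income_status": 0},
]
findings = flag_imputed_inconsistencies(records, income_limit=2000)
print(findings)
```

Cases where the conflicting value was imputed are candidates for sensitivity checks, while conflicts between two reported values call for a different explanation.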

Cold-Deck Imputation

Cold-deck values are the values to which each cell in the hot-deck matrix is initialized. If a value cannot be assigned via logical imputation or the hot-deck imputation process, cold-deck imputation is used as a last resort.

Please see the SIPP Users’ Guide specific to your year or panel of analysis for more information on data editing and imputation.

Page Last Revised - August 18, 2022