Data Editing and Imputation

This section describes the data editing and imputation procedures applied to data from the Survey of Income and Program Participation (SIPP) after completion of the interviews. Three different approaches are used for dealing with missing data in SIPP:

  • Weighting adjustments are used for some types of noninterviews;
  • Data editing (also referred to as logical imputation) is used for some types of item nonresponse; and
  • Statistical (or stochastic) imputation is used for some types of unit nonresponse and some types of item nonresponse.

Weighting is discussed in Chapter 8 of the SIPP Users' Guide.

The section begins with a brief discussion of the types of missing data and the goals of imputation in SIPP. It then presents an overview of the editing and imputation procedures used to deal with missing and inconsistent data. Next, the section provides a detailed description of each of the major steps the Census Bureau uses when creating its internal files and the files that are released for public use. Prior to the 1996 Panel, the development of cross-sectional wave files involved mainly cross-sectional editing and imputation, while the longitudinal files involved longitudinal editing. Beginning with the 1996 Panel, the processing procedures for the wave files were replaced with methods that use prior-wave information to inform the editing and imputation of the current wave (after Wave 1). The generic imputation technique, the hot-deck method, is still used in the 1996 and later Panels, but donors are now chosen on the basis of similarities in reported prior-wave information when that information exists.
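The hot-deck idea described above can be illustrated with a minimal sketch. It is not the Census Bureau's production procedure: the field names (`prior_wave_employed`, `age_group`, `earnings`), the `imputed` flag, and the choice of matching cells are all hypothetical, chosen only to show a recipient receiving a value from a donor who reported similar prior-wave information.

```python
import random

def hot_deck_impute(records, item, match_keys, seed=0):
    """Fill missing `item` values from a randomly chosen donor in the same cell.

    A cell is defined by the values of `match_keys` (for example, characteristics
    reported in a prior wave). Records with no donor in their cell are left missing.
    """
    rng = random.Random(seed)

    # Collect donors (records that reported the item) by matching cell.
    donors = {}
    for rec in records:
        if rec[item] is not None:
            cell = tuple(rec[k] for k in match_keys)
            donors.setdefault(cell, []).append(rec)

    result = []
    for rec in records:
        out = dict(rec)
        out["imputed"] = False  # hypothetical flag marking imputed values
        if rec[item] is None:
            pool = donors.get(tuple(rec[k] for k in match_keys))
            if pool:
                out[item] = rng.choice(pool)[item]
                out["imputed"] = True
        result.append(out)
    return result

# Hypothetical records: two respondents did not report current-wave earnings.
records = [
    {"id": 1, "prior_wave_employed": True, "age_group": "25-44", "earnings": 3200},
    {"id": 2, "prior_wave_employed": True, "age_group": "25-44", "earnings": None},
    {"id": 3, "prior_wave_employed": False, "age_group": "25-44", "earnings": 0},
    {"id": 4, "prior_wave_employed": False, "age_group": "25-44", "earnings": None},
]
result = hot_deck_impute(records, "earnings", ["prior_wave_employed", "age_group"])
```

Because each cell in this toy data set contains exactly one donor, record 2 receives the earnings of record 1 and record 4 receives the earnings of record 3. In practice the cells would be defined by many more characteristics.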

Types of Missing Data

As in all surveys, there are two general types of missing data in SIPP: unit nonresponse and item nonresponse. Unit nonresponse occurs in SIPP when one or more of the people residing at a sample address are not interviewed and no proxy interview is obtained. This can happen for a number of reasons, described in Chapter 2 of the SIPP Users' Guide. Most types of unit nonresponse are dealt with through weighting adjustments (see Chapters 2 and 8 of the SIPP Users' Guide). However, the data editing and statistical imputation procedures described in this chapter are used with one type of unit nonresponse: Type Z noninterviews, which occur when an interview is obtained from at least one household member but interviews are not obtained from one or more other sample persons in that household.1 Prior to the 1996 Panel and in some instances in the 1996 Panel, the method used to adjust for person-level noninterviews in the core wave files is known as Type Z imputation, which is discussed below.

Item nonresponse occurs when a respondent completes most of the questionnaire but does not answer one or more individual questions. Item nonresponse in SIPP occurs under the following circumstances:

  • Responding sample persons refuse or are unable to provide requested information;
  • Interviewers fail to ask a question or incorrectly record a response;
  • A response is inconsistent with related responses or is incompatible with response categories; and
  • Interviewers make an error when recording or keying in the data.2

Item nonresponse data are generally imputed for core items, as well as for many topical module items.

Goals of Imputation

Missing data cause a number of problems. Analyses of data sets with missing data are more difficult than analyses of complete data sets. Analyses are also less consistent with one another, because analysts compensate for missing data in different ways and may base their work on different subsets of the data. Finally, when nonresponse is unlikely to be completely random, estimates of population parameters are biased.

Because missing data are always present to some degree, analyses of survey data must be based on assumptions about patterns of missing data. When missing data are not imputed or otherwise accounted for in the model being estimated, the implicit assumption is that data are missing at random after controlling for other variables in the model. The imputation procedures used for SIPP are based on the assumption that data are missing at random within subgroups of the population. The statistical goal of imputation is to reduce the bias of survey estimates. This goal is achieved to the extent that systematic patterns of item nonresponse are correctly identified and modeled. In SIPP, the statistical goals of imputation are general, rather than specific. Instead of addressing the estimation of specific parameters, SIPP procedures are designed to provide reasonable estimates for a variety of analytical purposes.

Data editing is generally preferred over statistical imputation, and it is used whenever a missing item can be logically inferred from other data that have been provided. When information exists on the same record from which missing information can logically be inferred, that information is used to replace the missing information. The advantage of data editing is that it avoids the increase in variance that occurs when missing items on one record are imputed with nonmissing responses from other records.
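As a minimal sketch of data editing (logical imputation), the rule below infers a missing yes/no item from an amount reported on the same record. The field names (`received_benefit`, `benefit_amount`), the `edited` flag, and the rule itself are hypothetical and are not taken from the actual SIPP edit specifications.

```python
def edit_record(rec):
    """Fill a missing yes/no item when the same record logically implies the answer."""
    out = dict(rec)
    out["edited"] = False  # hypothetical flag marking logically imputed values

    # If the amount was reported, receipt follows logically from it:
    # a positive amount implies receipt, a zero amount implies no receipt.
    if out["received_benefit"] is None and out["benefit_amount"] is not None:
        out["received_benefit"] = out["benefit_amount"] > 0
        out["edited"] = True
    return out

inferred = edit_record({"received_benefit": None, "benefit_amount": 150})
unresolved = edit_record({"received_benefit": None, "benefit_amount": None})
```

No other record is consulted, so the edit introduces no donor-related variance. The second record cannot be resolved this way and would be left for statistical imputation.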

Effects of Imputed Data on Analysis

Users of SIPP data interested in assessing the influence of imputed data on their analyses should consider whether SIPP imputation procedures have properties that affect their specific analytical requirements. A general discussion of the treatment of missing data in sample surveys is given in Kalton and Kasprzyk (1986). Sedransk (1985), Little (1986), and Jinn and Sedransk (1987) discuss properties of commonly used imputation processes. An example of the impact of imputation procedures on the distributional characteristics of a low-income population is discussed in Doyle and Dalrymple (1987).

An evaluation of the effects of imputed data should include a review of rates of unit nonresponse and an assessment of the extent of item nonresponse. Unit nonresponse tends to increase over the life of a panel, as does the likelihood that nonresponse is not random. As the percentage of eligible sample members re-interviewed decreases, the pool from which donors3 are selected shrinks accordingly. This smaller pool increases the likelihood that individual donors will be used more than once, which in turn increases the variance of an estimate.
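The link between a shrinking donor pool and donor reuse can be explored with a small simulation. This is only an illustrative sketch: it assumes donors are drawn uniformly at random with replacement, which is a simplification and not a description of how SIPP selects donors, and the function names are hypothetical.

```python
import random
from collections import Counter

def donor_usage(n_donors, n_recipients, seed=0):
    """Draw one donor per recipient, with replacement, and count uses per donor."""
    rng = random.Random(seed)
    return Counter(rng.randrange(n_donors) for _ in range(n_recipients))

def share_reused(usage):
    """Fraction of imputations that came from a donor used more than once."""
    total = sum(usage.values())
    if total == 0:
        return 0.0
    return sum(count for count in usage.values() if count > 1) / total

# Same number of recipients, progressively smaller donor pools.
for pool_size in (1000, 100, 10):
    usage = donor_usage(pool_size, n_recipients=50)
    print(pool_size, round(share_reused(usage), 2))
```

With the number of recipients held fixed, smaller pools will tend to show a larger share of imputations drawn from repeated donors, which is the mechanism behind the variance increase noted above.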

The effects of imputation will likely be small for items with low rates of missing data as long as rates of item nonresponse are not high among important subclasses. Lepkowski et al. (1987), using data from a large federal survey, provide a framework for evaluating the effect of imputed values on analyses. This framework can be readily adapted to SIPP analyses.

Processing SIPP Data

At the conclusion of each wave of interviewing, the data collected during that wave are processed, creating the core wave and topical module files.

Figure 4-1 illustrates the steps that generate the Census Bureau's internal core wave and full panel files.

Figure 4-1. Sequence of Cross-Sectional Imputation and Longitudinal Editing Procedures

Summary of Processing

There are six steps in the first phase of SIPP data processing:

  1. As each wave of interviewing is completed, core data collected during the wave are edited for internal consistency.
  2. Following data editing, the statistical matching and hot-deck procedures described later in this chapter are used to impute missing data in the core wave file.
  3. A public use version of the core wave file is then created from the resulting internal core wave file. The public use file is the same as the Census Bureau's internal file except that it has certain information suppressed or topcoded to protect the confidentiality of survey respondents (see sections on Topcoding and Suppression of Geographic Information, at the bottom of this page).
  4. On a separate production track from the core data, data from the topical module file administered with the wave are edited for internal consistency. The extent of data editing varies across the topical modules, and some topical modules receive almost no editing.
  5. Next, hot-deck procedures are used to impute missing data in the topical module. The extent of imputation varies across the topical modules; some topical modules have no missing data imputed.
  6. A public use version of the topical module file is created from the resulting internal file. As with the public use core wave files, the public use topical module files have certain information suppressed to protect the confidentiality of survey respondents.

These steps are repeated at the conclusion of each wave of interviews. Prior to the 1996 Panel, each wave was processed independently of other waves of data. Thus, when multiple core wave files are linked, apparent changes in a respondent's status could be due to different applications of data edits and imputations to the files being combined (file linkage is the subject of Chapter 13 of the SIPP Users' Guide). With the 1996 data, the hot-deck procedure was redesigned to rely on historical information reported in prior waves. In addition, other forms of longitudinal imputation, such as carryover methods, were adapted.

Confidentiality Procedures for the Public Use Files

All of the editing and imputation procedures described in the preceding sections are part of the process of preparing the data for internal Census Bureau use. Before the files are released for public use, they undergo additional editing to protect the confidentiality of respondents. Two procedures are used: topcoding of selected variables (income, assets, and age) and suppression of geographic information. Because of these procedures, estimates based on data from the public use files will differ slightly from the Census Bureau's published estimates.

Topcoding

One piece of information that might reveal a respondent's identity is a very high income. For that reason, the Census Bureau topcodes income before making that information publicly available, recoding any income amounts over a certain maximum value to that maximum. In other words, income on the public use data files has a ceiling value. Although income is the primary variable that is topcoded, other variables that may disclose a respondent's identity, such as age, are also topcoded. A few variables, such as starting dates for employment, may be bottomcoded if they pose a disclosure risk.
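Topcoding and bottomcoding amount to clamping a value at a ceiling or floor. The sketch below uses made-up thresholds; the actual SIPP topcode and bottomcode values are not given in this section, and the function names are hypothetical.

```python
def topcode(value, ceiling):
    """Recode any value above `ceiling` to `ceiling`; leave missing values alone."""
    if value is None:
        return None
    return min(value, ceiling)

def bottomcode(value, floor):
    """Recode any value below `floor` to `floor`; leave missing values alone."""
    if value is None:
        return None
    return max(value, floor)

# Hypothetical thresholds, for illustration only.
INCOME_CEILING = 150_000
START_YEAR_FLOOR = 1960

public_incomes = [topcode(x, INCOME_CEILING) for x in (40_000, 250_000, None)]
public_start_year = bottomcode(1950, START_YEAR_FLOOR)
```

Because every value above the ceiling collapses to the same number, statistics computed from the public use file that are sensitive to the upper tail (such as means) can differ from those computed on the internal file.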

Suppression of Geographic Information

Geographic information that can be used to directly identify survey respondents, such as an address, is removed from the public use files. In addition, states and metropolitan areas with populations of fewer than 250,000 are not identified. Specific nonmetropolitan areas (such as counties outside of metropolitan areas) are never identified. In certain states, when the nonmetropolitan population is small enough to present a disclosure risk, a fraction of that state's metropolitan sample is recoded to nonmetropolitan status. For that reason, the SIPP data cannot be used to estimate characteristics of the population residing outside metropolitan areas. Chapter 10 of the SIPP Users' Guide provides details.

For the 1996 Panel, state-level geography is shown for 45 individual states and the District of Columbia; the remaining five states are combined into two groups:

  1. Maine, Vermont; and
  2. North Dakota, South Dakota, Wyoming.

For the 1984 through 1993 Panels, state-level geography is shown for 41 individual states and the District of Columbia; the nine other states are combined into three groups:

  1. Maine, Vermont;
  2. Iowa, North Dakota, South Dakota; and
  3. Alaska, Idaho, Montana, Wyoming.
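An analyst working across panels may want to map individual states onto the combined groups listed above. The following is a minimal sketch: the group memberships come from the lists in this section, but the slash-joined group labels and the function names are hypothetical and do not correspond to the codes used on the public use files.

```python
# State groups as listed in this section.
GROUPS_1996 = [
    ("Maine", "Vermont"),
    ("North Dakota", "South Dakota", "Wyoming"),
]
GROUPS_1984_1993 = [
    ("Maine", "Vermont"),
    ("Iowa", "North Dakota", "South Dakota"),
    ("Alaska", "Idaho", "Montana", "Wyoming"),
]

def build_lookup(groups):
    """Map each grouped state to a label naming every state in its group."""
    return {state: "/".join(group) for group in groups for state in group}

def public_state(state, lookup):
    """Return the group label for a combined state, or the state itself otherwise."""
    return lookup.get(state, state)

lookup_1996 = build_lookup(GROUPS_1996)
lookup_pre_1996 = build_lookup(GROUPS_1984_1993)
```

Note that the same state can be identified individually in one panel and grouped in another (Iowa, for example), so comparisons across panels need the coarser of the two groupings.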

Footnotes

1 That can happen either because people refuse to be interviewed or because they are unavailable for the interview and a proxy interview is not obtained.

2 Prior to the 1996 Panel, errors could also occur when data-entry workers were keying in results from the paper survey.
