Imputation of Unreported Data Items

The CPS is subject to two sources of nonresponse. The larger is noninterview households. To compensate for this data loss, the weights of noninterviewed households are distributed among interviewed households. The second source of data loss is item nonresponse, which occurs when a respondent either does not know the answer to a question or refuses to provide the answer. Item nonresponse in the CPS is modest.
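
The redistribution of noninterview weights can be illustrated with a minimal sketch. It is not the CPS production procedure: the `cell` grouping key, the field names, and the simple ratio adjustment are all assumptions made for illustration, since the text above does not specify how the weights are distributed.

```python
def redistribute_noninterview_weights(households):
    """Scale interviewed households' weights so that, within each
    (hypothetical) adjustment cell, they also carry the weight of the
    noninterviewed households. Returns interviewed households only."""
    total_weight = {}        # cell -> weight of all households
    interviewed_weight = {}  # cell -> weight of interviewed households
    for h in households:
        cell = h["cell"]
        total_weight[cell] = total_weight.get(cell, 0.0) + h["weight"]
        if h["interviewed"]:
            interviewed_weight[cell] = (
                interviewed_weight.get(cell, 0.0) + h["weight"]
            )

    adjusted = []
    for h in households:
        if not h["interviewed"]:
            continue  # noninterviews drop out; their weight is redistributed
        factor = total_weight[h["cell"]] / interviewed_weight[h["cell"]]
        adjusted.append({**h, "weight": h["weight"] * factor})
    return adjusted
```

Under these assumptions, the total weight in each cell is unchanged by the adjustment; only its allocation across households shifts.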

One of three imputation methods is used to compensate for item nonresponse in the CPS. Before the edits are applied, the daily data files are merged and the combined file is sorted by state and by PSU within state. This sort ensures that allocated values come from geographically related records; that is, missing values for records in Maryland will not receive values from records in California. This is an important distinction because many labor force, industry, and occupation characteristics are geographically clustered.
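
As a rough illustration of the merge-and-sort step, the sketch below combines several daily files and orders the records by state, then by PSU within state. The record layout (dictionaries with `state` and `psu` keys) is a hypothetical stand-in for the actual CPS file format.

```python
from operator import itemgetter

def merge_and_sort(daily_files):
    """Merge daily files (each a list of record dicts) and sort the
    combined file by state, then by PSU within state."""
    combined = [record for daily in daily_files for record in daily]
    # sorted() is stable, so records within the same state and PSU
    # keep their original relative order.
    return sorted(combined, key=itemgetter("state", "psu"))
```

Because imputation passes through the file in this order, a record that needs a value is likely to be preceded by records from the same state and PSU.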

The edits effectively blank all entries in inappropriate questions (for example, entries made after an incorrect path of questions was followed) and ensure that all appropriate questions have valid entries. For the most part, illogical or out-of-range entries have been eliminated by the use of electronic instruments; however, the edits still address these possibilities, which may arise from data transmission problems and occasional instrument malfunctions. The main purpose of the edits, however, is to assign values to questions where the response was "Don’t know" or "Refused." This is accomplished by using one of the three imputation techniques described below.

The edits are run in a deliberate and logical sequence. Demographic variables are edited first because several of those variables are used to allocate missing values in the other modules. The labor force module is edited next since labor force status and related items are used to impute missing values for industry and occupation codes and so forth.

The three imputation methods used by the CPS edits are described below:

Relational imputation infers the missing value from other characteristics on the person’s record or within the household. For instance, if race is missing, it is assigned based on the race of another household member, or failing that, taken from the previous record on the file. Similarly, if relationship data is missing, it is assigned by looking at the age and sex of the person in conjunction with the known relationship of other household members. Missing occupation codes are sometimes assigned by analyzing the industry codes and vice versa. This technique is used as appropriate across all edits. If missing values cannot be assigned using this technique, they are assigned using one of the two following methods.
Longitudinal edits are used in most of the labor force edits, as appropriate. If a question is blank and the individual is in the second or later month’s interview, the edit procedure looks at last month’s data to determine whether there was an entry for that item. If so, last month’s entry is assigned; otherwise, the item is assigned a value using the appropriate hot deck, as described next.
The third imputation method is commonly referred to as "hot deck" allocation. This method assigns a missing value from a donor record with similar characteristics; the stored donor values make up the hot deck. Hot decks are defined by variables such as age, race, and sex. Other characteristics used in hot decks vary depending on the nature of the unanswered question. For instance, most labor force questions use age, race, sex, and occasionally another correlated labor force item such as full- or part-time status. This means the number of cells in labor force hot decks is relatively small, perhaps fewer than 100. On the other hand, the weekly earnings hot deck is defined by age, race, sex, usual hours, occupation, and educational attainment. This hot deck has several thousand cells.
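
The fallback order among the three methods can be sketched as follows. This is a simplified illustration, not the CPS edit code: it assumes missing responses are represented as `None`, and `relational_rule`, `last_month`, and `hot_deck_value` are hypothetical inputs standing in for the item-specific logic.

```python
def impute_item(record, item, relational_rule, last_month, hot_deck_value):
    """Return (value, method) for one item on one record.

    relational_rule: function(record) -> inferred value, or None
    last_month:      the person's record from last month, or None if
                     this is the first month's interview
    hot_deck_value:  value currently held in the matching hot deck cell
    """
    value = record.get(item)
    if value is not None:
        return value, "reported"

    # 1. Relational: infer from other characteristics on the record.
    inferred = relational_rule(record)
    if inferred is not None:
        return inferred, "relational"

    # 2. Longitudinal: reuse last month's entry if there was one.
    if last_month is not None and last_month.get(item) is not None:
        return last_month[item], "longitudinal"

    # 3. Hot deck: take the value stored for similar records.
    return hot_deck_value, "hot deck"
```

Returning the method alongside the value is a design choice of this sketch; it makes it easy to flag which items were allocated and how.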

All CPS items that require imputation for missing values have an associated hot deck. The initial values for the hot decks are the ending values from the preceding month. As a record passes through the editing procedures, it will either donate a value to each hot deck in its path or receive a value from the hot deck. For instance, in a hypothetical case, the hot deck for question X is defined by the characteristics Black/non-Black, male/female, and age 16−25/25+. Further assume a record has the values White, male, and age 64. When this record reaches question X, the edits determine whether it has a valid entry. If so, that record’s value for question X replaces the value in the hot deck cell reserved for non-Black, male, and age 25+. Conversely, if the record were missing a value for question X, it would be assigned the value in the hot deck cell designated for non-Black, male, and age 25+.
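
The donate-or-receive behavior in the hypothetical question X example can be sketched as below. The field names, the use of `None` for a missing entry, and the assignment of age 25 to the younger group (the text's "16−25/25+" leaves that boundary ambiguous) are assumptions for illustration only.

```python
def cell_key(record):
    """Map a record to its hot deck cell for hypothetical question X."""
    race = "Black" if record["race"] == "Black" else "non-Black"
    # Assumption: age 25 is placed in the younger group.
    age_group = "16-25" if record["age"] <= 25 else "25+"
    return (race, record["sex"], age_group)

def run_hot_deck(records, item, deck):
    """Pass records through the hot deck in file order.

    deck maps each cell to its current value and is assumed to be
    seeded for every cell with the ending values from the preceding
    month. Records are updated in place; the final deck is returned.
    """
    for record in records:
        key = cell_key(record)
        if record.get(item) is not None:
            deck[key] = record[item]   # valid entry: donate to the cell
        else:
            record[item] = deck[key]   # missing entry: receive from the cell
            record[item + "_allocated"] = True
    return deck
```

For example, with the cell for non-Black, male, 25+ seeded from the preceding month, a White male aged 64 with no entry for X receives the seeded value, while a later record in the same cell with a valid entry overwrites it for whoever follows.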

As stated above, the various edits are logically sequenced, in accordance with the needs of subsequent edits. The edits and codes, in order of sequence, are:

Household edits and codes. This processing step performs edits and creates recodes for items pertaining to the household. It classifies households as interviews or noninterviews and edits items appropriately. Hot deck allocations defined by geography and other related variables are used in this edit.
Demographic edits and codes. This processing step ensures consistency among all demographic variables for all individuals within a household. It ensures all interviewed households have one and only one reference person and that entries stating marital status, spouse, and parents are all consistent. It also creates families based upon these characteristics. It uses longitudinal editing, hot deck allocation defined by related demographic characteristics, and relational imputation.

Page Last Revised - October 8, 2021