On March 1, 2001, the U.S. Census Bureau issued the recommendation of the Executive Steering Committee for A.C.E. Policy (ESCAP) that the Census 2000 Redistricting Data not be adjusted based on the Accuracy and Coverage Evaluation (A.C.E.). By mid-October 2001, the Census Bureau had to recommend whether Census 2000 data should be adjusted for future uses, such as the census long form data products, post-censal population estimates, and demographic survey controls. In order to inform that decision, the ESCAP requested that further research be conducted.
Between March and September 2001, the Demographic Analysis-Population Estimates (DAPE) research project addressed the discrepancy between the demographic analysis data and the A.C.E. adjusted estimates of the population. Specifically, the research examined the historical levels of the components of population change to address the possibility that the 1990 Demographic Analysis understated the national population and assessed whether demographic analysis had not captured the full population growth between 1990 and 2000. Assumptions regarding the components of international migration (specifically, emigration, temporary migration, legal migration, and unauthorized migration) contain the largest uncertainty in the demographic analysis estimates. Therefore, evaluating the components of international migration was a critical activity in the DAPE project.
This report addressed the question: "How do edit and imputation procedures affect the consistency of foreign-born and Hispanic populations?" Comparisons were made between the edit and imputation specifications for the 1990 census and Census 2000 for the questions on place of birth and Hispanic origin to determine what impact, if any, such differences might have had on comparisons of numbers between the censuses. There were few significant differences in the specifications for the question on place of birth. The most significant difference - "hot deck" imputation of specific countries of birth in Census 2000 but not in 1990 - did not affect the overall total of foreign-born people. Regarding the specifications for the Hispanic question, several important differences were noted, the most important of which was the use of surname-assisted "hot decks" in assigning an origin. Overall, the Census 2000 edit and imputation procedures seemed to be more accurate than the 1990 procedures in assigning an origin. The improvement in assigning an origin was assisted by a substantial decline between 1990 and 2000 in the level of nonresponse to the question on Hispanic origin.
This paper reports the results of research and analysis undertaken by Census Bureau Staff. It has undergone a more limited review than official Census Bureau publications. This report is released to inform interested parties of research and to encourage discussion.
An extremely important context for understanding the impact of these differences is the fact that the number of allocations for the origin question dropped by 34 percent between 1990 and 2000, from 25.5 million allocations in 1990 to 16.8 million allocations in 2000. In addition to the drop in overall allocations, there was a fundamental shift in the type of allocation made. In 1990, 75.6 percent of allocations occurred through the "hot deck" (nearest neighbor) method. By contrast, only 41.2 percent of allocations required hot deck allocation in Census 2000. This is an important point because, of the three techniques used (imputation based on other information provided by the respondent, allocation from other household members, and hot deck allocation), hot deck allocation is the least reliable. We can attribute this improvement, in large part, to moving the question on origin before the question on race.
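The 34 percent figure follows directly from the two allocation counts quoted above. A minimal check in Python; the function name `percent_decrease` is illustrative and does not come from the report.

```python
def percent_decrease(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    return (before - after) / before * 100.0

# Allocation counts quoted in the text, in millions.
allocations_1990 = 25.5
allocations_2000 = 16.8

drop = percent_decrease(allocations_1990, allocations_2000)
print(f"{drop:.0f} percent")
```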
There is strong evidence that the less restrictive 1990 edit and imputation procedures and greater reliance on hot deck allocation, combined with a much higher level of nonresponse to the Hispanic origin question in 1990, may have resulted in "over-editing" at least 161,000 people as Hispanic. Although we did not attempt to run the Census 2000 edit and imputation program on 1990 data, we believe the Census 2000 program would have imputed fewer people as Hispanic than did the 1990 program.
Edit and imputation procedures attempt to rely as much as possible on sources of information about which there is the most confidence (other information provided by the respondent or responses of other household members) and to rely less on last resort procedures such as hot deck allocation. Even with hot decks, efforts are made to improve the accuracy of allocation by matching donors and donees according to one or more key characteristics. For example, in the 1990 census, origin hot decks used race as a matching variable for donors and donees. In contrast, Census 2000 used not only race, but also age and whether the surname was Spanish or not Spanish, as matching variables. We believe these additional variables improved the accuracy of origin allocation from the hot deck.
While these changes reflect an attempt to provide a more precise allocation of state or country of birth, they do not appear to be of sufficient importance to affect adversely comparisons of levels of foreign born compared with natives between the two censuses. It is unclear, however, how differences between the edit and imputation procedures may have affected comparisons between specific states or countries of birth for the two censuses. We will need to evaluate these issues when we obtain the long form data in Spring 2002.
The use of a native or foreign-born check box in the question may have had some impact for prompting people to report a place of birth. However, because the question relies primarily on a write-in entry for appropriate classification as native or foreign born (in fact, the write-in entry takes precedence over the check box), it is not clear that we would have obtained different results because of the check box categories. The check box categories played a role in the edit and imputation procedures when no write-in response was provided, but this role was a rather limited one. When there was no write-in response, a citizenship response, in some instances, was actually given higher weight in assigning a place of birth than the check box response.
The most important difference between the 1990 and 2000 edit and imputation procedures was in the assignment of a specific country of birth for people not reporting a place of birth who were assigned as foreign born. In 1990, people who were assigned as foreign born were not assigned a specific country of birth. Instead, these people were classified as "Area not reported." By contrast, the edit and imputation procedures for Census 2000 assign a specific country of birth. While this difference does not affect comparisons of the total foreign born between the two censuses, it does affect any comparison by country of birth. In fact, we had to distribute the "Area not reported" population among countries of birth for intercensal estimates that required detailed country of birth data. Another DAPE task team is analyzing how these distributions were made, so they will not be discussed further in this report.2
The allocation rate for the place of birth question in 1990 was 5.4 percent. By contrast, the rate for Census 2000 was 9.0 percent.3 The difference in the level of nonresponse between the two censuses can be explained partially by the fact that the 1990 census had a content edit follow-up operation that attempted to obtain answers from census forms that had more than a pre-specified threshold of questions with no answers. Census 2000 did not implement a content edit follow-up operation. The increased level of nonresponse, however, does not necessarily imply that comparisons of data on specific countries of birth between 1990 and 2000 would be adversely affected, especially given the improvements in Census 2000 edit and imputation procedures and the fact that specific country of birth was not assigned in the 1990 census procedures.
Table A summarizes the key differences between the edit and imputation procedures for the Hispanic origin question in 1990 and 2000. First, while multiple responses were not allowed in either census, Census 2000 allowed for the data capture of more than one response and the edit and imputation procedures assigned one origin. In the case of multiple non-Hispanic or multiple Hispanic responses, a respondent remained non-Hispanic or Hispanic, respectively. However, in the case of a conflicting Hispanic/non-Hispanic response, an attempt was made to resolve this conflict by using other information provided by the respondent (for example, an Hispanic response in the race question), responses of other people in the household, or responses of people living nearby who are of the same race.
Census 2000 edit and imputation procedures also differed from the 1990 procedures in how origin could be assigned from other people in the household. In 1990, anyone in the household could donate an origin regardless of their race. By contrast, Census 2000 rules only allowed other household members to "donate" an origin if the person needing an origin and the donor had the same race.
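The contrast between the two within-household rules can be sketched in a few lines of Python. This is an illustration only: the `Person` class, the `household_donor_origin` function, and the choice to take the first eligible household member in listing order are all assumptions, since the report does not describe the order in which household members were searched.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Person:
    race: str
    origin: Optional[str]  # None means origin was not reported

def household_donor_origin(person: Person,
                           household: List[Person],
                           require_same_race: bool) -> Optional[str]:
    """Return an origin donated by another household member, or None.

    require_same_race=False mimics the 1990 rule (any member may donate);
    require_same_race=True mimics the Census 2000 rule (same race only).
    """
    for other in household:
        if other is person or other.origin is None:
            continue  # skip the person themselves and non-reporters
        if require_same_race and other.race != person.race:
            continue  # Census 2000 rule: donor and donee must share a race
        return other.origin
    return None

# Hypothetical two-person household with different races.
donee = Person(race="Black", origin=None)
donor = Person(race="White", origin="Hispanic")
household = [donee, donor]

origin_1990_rule = household_donor_origin(donee, household, require_same_race=False)
origin_2000_rule = household_donor_origin(donee, household, require_same_race=True)
```

Under the 1990-style rule the donee receives the donor's origin. Under the Census 2000-style rule no household donor qualifies, so the case would fall through to a later step such as hot deck allocation.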
One of the most important differences between the two procedures was how "hot deck" allocation was implemented.4 In 1990, hot deck values were stored and assigned by the race of the "donor" and "donee." In Census 2000, the hot decks were likewise controlled by the race of the donor, but they were also controlled by four broad age groups.
More importantly, Census 2000 origin hot decks were further differentiated by whether the donor (and donee) had a Spanish or non-Spanish surname. Use of surname in storing and assigning an origin was one of the most important innovations implemented in Census 2000 in that it allowed a much more precise method for assigning an origin from a hot deck. This innovation was cited in a recent evaluation as having a "profound" impact on the assignment of origin.5
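The effect of adding matching variables to a hot deck can be illustrated with a simplified sequential hot deck in Python. This is a sketch under stated assumptions, not the Census Bureau's program: the record layout, the function names, the age-group label, and the absence of starting ("cold deck") values are all hypothetical. Each cell of the deck holds the most recently reported origin for that combination of matching variables, and a nonrespondent takes the value currently stored in his or her cell.

```python
from typing import Callable, Dict, Hashable, List, Optional

Record = Dict[str, object]

def hot_deck(records: List[Record],
             key_fn: Callable[[Record], Hashable]) -> List[Optional[str]]:
    """Impute missing 'origin' values from the last reporter in the same cell."""
    deck: Dict[Hashable, str] = {}
    result: List[Optional[str]] = []
    for rec in records:
        key = key_fn(rec)
        if rec["origin"] is not None:
            deck[key] = rec["origin"]     # a reported value refreshes the cell
            result.append(rec["origin"])
        else:
            result.append(deck.get(key))  # None if the cell has no donor yet
    return result

def key_1990(rec: Record) -> Hashable:
    # 1990-style cells: race only.
    return (rec["race"],)

def key_2000(rec: Record) -> Hashable:
    # Census 2000-style cells: race, broad age group, Spanish-surname flag.
    return (rec["race"], rec["age_group"], rec["spanish_surname"])

# Hypothetical records in processing order; the last one needs an origin.
records = [
    {"race": "White", "age_group": "adult", "spanish_surname": False, "origin": "Not Hispanic"},
    {"race": "White", "age_group": "adult", "spanish_surname": True, "origin": "Hispanic"},
    {"race": "White", "age_group": "adult", "spanish_surname": False, "origin": None},
]

imputed_1990 = hot_deck(records, key_1990)[-1]
imputed_2000 = hot_deck(records, key_2000)[-1]
```

With race-only cells the nonrespondent inherits the origin of the nearest preceding reporter of the same race, here a person with a Spanish surname. With the finer cells the donor is the nearest reporter who also shares the age group and surname type.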
Finally, if both race and Hispanic origin were not reported, the edit attempted to assign both a race and an origin from another donor (both within household imputation and hot deck allocation). The 1990 procedures assigned race and origin independently of each other, thus increasing the possibility of creating race/origin combinations that were not that common in the population.
Before assessing the impact of these differences on the Hispanic origin population, it is important to understand the differing contexts within which each edit operated. One of the hallmarks of the Hispanic origin question in 1990 was the relatively high level of nonresponse. Table 1 compares the allocation rates6 for Census 2000 and the 1990 census. It is clear from this table that the allocation rate for this question was almost twice as high in 1990 as it was in 2000 (10.4 percent versus 5.6 percent). What is striking is that the range of allocation rates by region narrowed considerably from 1990 to 2000. In 1990, the rates ranged from 7.2 percent in the West to 11.8 percent in the Northeast - a difference of 4.6 percentage points. Among states and the District of Columbia, the range was even wider with Idaho having the lowest percent (4.2 percent) and the District of Columbia having the highest (18.3 percent) - a difference of 14.1 percentage points. In Census 2000, by contrast, the range by region was much narrower, with the Midwest having the lowest rate (4.7 percent) and the South having the highest rate (6.0 percent) - a difference of only 1.3 percentage points. By state, Minnesota had the lowest rate in Census 2000 (4.0 percent), while the District of Columbia had the highest rate (11.0 percent) - a difference of 7.0 percentage points. It is clear that the biggest improvement in these rates occurred for states that had high allocation rates in 1990. This dramatic improvement in response can be attributed in large part to the placement of the Hispanic question before the question on race in Census 2000.
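The narrowing of the ranges can be recomputed from the low and high rates quoted above. A minimal sketch in Python; the dictionary labels are illustrative, and only the rates themselves come from the text.

```python
# (lowest rate, highest rate) in percent, as quoted in the text.
rates = {
    "1990 regions": (7.2, 11.8),   # West, Northeast
    "1990 states": (4.2, 18.3),    # Idaho, District of Columbia
    "2000 regions": (4.7, 6.0),    # Midwest, South
    "2000 states": (4.0, 11.0),    # Minnesota, District of Columbia
}

# Spread in percentage points, rounded to one decimal place.
spreads = {label: round(high - low, 1) for label, (low, high) in rates.items()}

for label, spread in spreads.items():
    print(f"{label}: {spread} percentage points")
```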
Tables 2-7 show the impact of the higher level of nonresponse to the origin question in the 1990 census.7 Table 2 shows that at the national level, hot deck allocation was the largest source of origin response after "reported origin." This means that for a significant proportion of the population (8.5 percent), no one in the household answered the Hispanic origin question. This relationship held for all states.
Table 3 shows that for the 1990 Hispanic population alone, there was about equal reliance on "within household" and "hot deck" allocation, with some regions and states having a higher proportion of within-household allocation. This is not surprising since the question is primarily oriented to the Hispanic population. Table 4, by contrast, shows that for non-Hispanics, the proportion of responses coming from hot deck allocation was much higher than that from within household allocation. Tables 5-7 show the distribution of allocated responses by source of allocation and support the same conclusions but from a slightly different perspective.
One of the most important changes made to the Hispanic origin question in Census 2000 to address the problem of nonresponse was to shift the order of the Hispanic origin and race questions. In the 1990 census, the race question appeared first and the Hispanic origin question appeared several questions later. It seems clear that after answering the question on race, many people felt that the Hispanic origin question did not apply and simply skipped the question. Shifting the order of the questions in tests conducted before Census 2000 seemed to improve overall response to the Hispanic origin question with some increased nonresponse to the question on race.
Table 1 and Tables 8-13 show very clearly not only that the level of nonresponse was reduced but also that the relative contribution of within household and hot deck allocation was much more balanced for non-Hispanics in Census 2000 than in the 1990 census. More importantly, allocation from surname-assisted hot decks overall was greater than allocation from non-surname-assisted hot decks (Tables 8-10). Table 10, in particular, shows that for non-Hispanics, allocation from surname-assisted hot decks was about three times the level of allocation from non-surname-assisted hot decks (2.0 percent compared to 0.6 percent).
The impact of surname-assisted programs is clearly more dramatic when observing the source of allocations in Tables 11-13. Overall, surname-assisted hot decks represented 31.4 percent of all allocations, while non-surname assisted hot decks accounted for only 9.6 percent of all allocations. For Hispanic allocations, surname-assisted hot decks overall represented 8.1 percent of all allocations while non-surname assisted hot decks represented about 4.0 percent. For non-Hispanics, surname assisted hot decks provided 36.9 percent of all allocations, while non-surname assisted hot decks provided only 10.9 percent of all allocations (Table 13). In some states where the proportion of Hispanics is very low (such as Alabama, Georgia, Mississippi, North Carolina, South Carolina, and West Virginia), the proportion of people receiving an origin from a surname-assisted hot deck is five times the proportion receiving an origin from a non-surname assisted hot deck.
It is clear from Tables 2-13 that there was a significant increase in Census 2000 in the level of substitution, from 0.7 percent of the population in households in 1990 to 1.2 percent of the total population in Census 2000 (Tables 2 and 8). Substitution occurs when there are no data for anyone in the housing unit, and we use data from a neighboring household of similar size, using the hot deck method, to allocate characteristics for the people in that housing unit. Given that the same basic method was used in both censuses, there is no reason to believe that the procedure itself created any upward or downward bias in assigning origin in 1990 and 2000.
Tables 9 and 10 show that the percent substituted is slightly higher for the Hispanic population (1.6 percent) than for the non-Hispanic population (1.2 percent). There was a similar pattern in 1990, but at a lower level. Tables 3 and 4 show that in 1990 the percent substituted for the Hispanic population (0.9 percent) was again slightly higher than that for the non-Hispanic population (0.6 percent). In addition, it is also clear that substitution played a much larger role in the source of allocation of origin in 2000, with substitution constituting about 20 percent of allocations overall. Interestingly, as shown in Tables 12 and 13, the share of substitution was higher for the non-Hispanic population (21.1 percent) than for the Hispanic population (17.5 percent). By contrast, Tables 6 and 7 show that in 1990 the share of substitution in total allocations was much higher for Hispanics (11.0 percent) than for non-Hispanics (5.9 percent). The reasons for the increase in substitution will be part of the Census Bureau's evaluation of Census 2000.
Finally, to put all these results in a broader perspective by including the results from the Census 2000 Supplemental Survey (C2SS), Tables 14-16 show that the trend toward improved response to the origin question is continuing. Editing procedures were basically the same for Census 2000 and the C2SS, except that there was no substitution in the C2SS. Table 14, in particular, shows that allocation rates are lower for the total population and for the Hispanic and non-Hispanic populations in the C2SS than in Census 2000 and in 1990. Table 15 shows an even greater reliance on surname-assisted hot decks in the C2SS, with Table 16 showing a much greater reliance on surname-assisted hot decks for the non-Hispanic population than for the Hispanic population. It should be noted, however, that the level of response in C2SS was improved through the use of field follow-up procedures for people who did not fully answer the questions on the questionnaire, a procedure that was not used in Census 2000.
In the 1990 census, there was an unusually high level of dependence on hot deck allocation because many of the people needing an imputed origin had no reported origin for anyone in the household. This greater reliance on hot deck allocation, combined with a relatively high level of nonresponse, meant that most allocations came from the hot deck, especially for the non-Hispanic population. For example, 75.6 percent of non-Hispanic allocations came from a hot deck, excluding substitutions. By contrast, only 29.9 percent of Hispanic allocations came from a hot deck (Tables 5 and 6), again excluding substitutions. This reflects the fact that the 1990 census hot decks matched donors and donees by their race, but did not match by age and by whether the donee had a Spanish or non-Spanish surname as did Census 2000 origin hot decks.
Concerns about the impact of 1990 edit and imputation procedures emerged when the results of the sample data processing, including a separate edit and imputation for sample questionnaires, became available. The Hispanic origin question on the sample form was edited in sample processing independent of the 100-percent edit and imputation program. Although the basic structure of the two procedures was the same, the edit and imputation procedures for the Hispanic origin question during sample processing differed in a very important way from those used in 100-percent processing. Unlike the 100-percent procedures, sample procedures made use of the rich source of ethnic-related questions from the sample form (ancestry, place of birth, language spoken at home) that could assist in imputing for nonresponse. The use of ethnic-related information, combined with a higher response rate for the Hispanic origin question on the sample form, meant a much lower dependence on hot deck allocation.
The estimate of the Hispanic origin population that resulted from sample processing was about 454,000 below the total of Hispanics obtained from 100-percent processing, with the 100-percent total exceeding the sample estimate for most states. This difference existed despite the fact that sample estimates were controlled to 100-percent totals, including race and Hispanic origin.8
Thompson (1991) addressed this difference and the difference between 100-percent totals and sample estimates for the American Indian population. He noted that the difference for the Hispanic population could be attributed to three factors: 1) weighting procedures; 2) a form of allocation bias; and 3) sample processing. Thompson attributed the difference between 100-percent totals and sample estimates primarily to undersampling of Hispanics and to a form of allocation bias. He also attributed part of the difference to different data processing procedures.9 His analysis, however, did not quantify how much each factor contributed to this difference.
The "allocation bias" to which Thompson's analysis refers is directly related to the focus of this analysis. Thompson noted that the nonresponse for the Hispanic question on the short form was 10 percent while the nonresponse rate for the same question on the sample form was only 4 percent. This difference was due partly to the fact that during data collection all sample forms were subject to content edit follow-up (field follow-up of cases where the number of non-reported items exceeded a certain threshold). By contrast, only 10 percent of short forms were subject to content edit follow-up.
Thompson reasoned that Hispanics were more likely to answer the Hispanic origin question than were non-Hispanics, making the donor pool more heavily Hispanic than it would have been had both Hispanics and non-Hispanics reported. If the nonresponse rate for the Hispanic question was high, there was an increased risk that an Hispanic origin would be disproportionately assigned. Evidence of this comes from Del Pinal (1994) who noted that the 1990 edit and imputation procedures tended to increase the overlap between various racial groups and the Hispanic population. For example, although there were very few Black Mexican origin persons, about 62 percent of Black Mexicans were created by the edit and imputation procedures.10 Not surprisingly, the Black population had a much higher nonresponse rate (18.4 percent) in the Hispanic origin question than did the White population (9.6 percent). (See Table 17.) The corresponding nonresponse rates for American Indians and Alaska Natives and Asians and Pacific Islanders were 10.2 percent and 9.7 percent, respectively. All these rates were still much higher than the nonresponse rates for other 100-percent questions such as race, age, gender and household relationship - all of which had nonresponse rates below 3 percent - and increased the possibility of a misallocation of respondents as Hispanic. To give a sense of the potential impact on the data, a net misallocation of only 1 percent of nonresponses as Hispanic out of a total of 24 million needing an origin would result in a net increase of 240,000 Hispanics.
To attempt to quantify at some minimal level the impact of the potential misallocation of responses as Hispanic, we obtained records from the sample edited detailed file (SEDF) for 1990. On these records, we had not only the origin value from sample processing (along with its allocation flag to indicate whether the value was reported or imputed) but also the origin value from 100-percent processing along with its corresponding allocation flag. In particular, we were interested in determining how people who received an allocated origin in the 100-percent edit had their origin allocated in the sample edit. For the purposes of this analysis, the results of the sample edit are considered the standard for accuracy because sample editing procedures made use of data from additional ethnic-related questions (ancestry, place of birth, and language spoken at home) not available on the short form.
Table 18 shows that, overall, the 100-percent edit produced a net of about 181,000 more Hispanics than did the sample edit when origin was allocated both in 100-percent and sample editing procedures. This net difference in edit outcomes represented only 2.1 percent of the 8.6 million people for whom origin was allocated in both 100-percent and sample processing.
If we take into consideration also the situations in which we imputed a value in the 100-percent procedures but did not impute a value in the sample procedures, the 100-percent edit produced a net overall of about 161,000 more Hispanics than did the sample procedures.11 Assuming that the sample edit and imputation process is more accurate, the 100-percent edit appears to have imputed as Hispanic a net total of 161,000 people who were probably not Hispanic. However, this number represents only 1.8 percent of all people whose origin was imputed. It is also important to keep in mind that both edit procedures agreed on the edit outcome 96 percent of the time.
It is clear from this table that the impact of this potential misallocation is different by race. The apparent degree of over-editing of Hispanics (as measured by taking the ratio of "Hispanic-100%; Not Hispanic - Sample" to "Not Hispanic - 100%; Hispanic - Sample") appeared to be much greater for Blacks (10.0) and Asian and Pacific Islanders (13.1) than for Whites (4.4). Analysis of the unweighted data shows the same pattern, but slightly lower ratios for each group. This finding is consistent with Del Pinal's finding that certain race/Hispanic combinations were more significantly affected by the editing procedures.
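The two summary measures used in this comparison, the net difference and the over-editing ratio, are both computed from the two discordant cells of a cross-tabulation of 100-percent and sample edit outcomes. A minimal sketch in Python; the function name is illustrative, and the counts are invented for the example and are not taken from Table 18.

```python
from typing import Tuple

def discordance_summary(hisp_100_not_sample: int,
                        not_100_hisp_sample: int) -> Tuple[int, float]:
    """Summarize disagreement between the two edits.

    hisp_100_not_sample: imputed Hispanic by the 100-percent edit,
                         not Hispanic by the sample edit.
    not_100_hisp_sample: imputed not Hispanic by the 100-percent edit,
                         Hispanic by the sample edit.
    Returns (net excess of Hispanics in the 100-percent edit, ratio).
    """
    net = hisp_100_not_sample - not_100_hisp_sample
    ratio = hisp_100_not_sample / not_100_hisp_sample
    return net, ratio

# Hypothetical counts, for illustration only (NOT values from Table 18).
net, ratio = discordance_summary(500, 100)
print(net, ratio)
```

A ratio near 1 would indicate that the two kinds of disagreement roughly cancel. A ratio well above 1 indicates that the 100-percent edit assigned Hispanic origin more often than the sample edit did.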
It is important to keep in mind that the estimate of 161,000 is probably a lower bound because these data were obtained from sample forms that had a lower nonresponse rate and had much more ethnic-related information than did short form questionnaires. It is possible that the level of misallocation would be higher among the population that received only the short form, which experienced a higher nonresponse rate for origin than did the sample form. However, it is unlikely that the upper bound would be as high as the difference between the 100-percent and sample totals (454,000) because: 1) sample processing changed about 262,000 responses from "Other Spanish/Hispanic" to not Hispanic12 and 2) to an unknown degree there was undersampling of Hispanics for which the sample weighting procedures did not compensate.
It is also very important to keep in mind that the impact on the overall total Hispanic population was very small. Overall, this net difference (161,000) represented only 0.7 percent of the total Hispanic population.
There are no comparable data available at this time from Census 2000 to perform the same type of analysis that was conducted on the 1990 census edit and imputation procedures. However, it is very clear that the Census 2000 procedures operated in an environment that was profoundly different from that in which the 1990 procedures operated. Significantly reduced nonresponse to the question, combined with more restrictions on the conditions under which origin could be assigned to an individual, probably has led to a much lower level of erroneous imputations as Hispanic (or non-Hispanic).13 At the same time, innovations such as the surname-assisted hot deck have improved the accuracy and, therefore, the quality of data from the Hispanic origin question.
We will continue our analysis of the quality of Census 2000 origin data as sample data and data from other evaluation studies become available.
Sources cited in report:
Demographic Analysis-Population Estimates (DAPE) Research Project Reports Related to Evaluating Components of International Migration (in order of Working Paper Series Number):
Table 1. Total Allocation Rates for the Hispanic Question for the United States, Regions, and States: 1990 and 2000.
Table 2. Total Household Population for the Hispanic Origin Question by Allocation Status and Type of Allocation Flag for the United States, Regions, and States: 1990.
Table 3. Total Hispanic Household Population for the Hispanic Origin Question by Allocation Status and Type of Allocation Flag for the United States, Regions, and States: 1990.
Table 4. Total Non-Hispanic Household Population for the Hispanic Origin Question by Allocation Status and Type of Allocation Flag for the United States, Regions, and States: 1990.
Table 5. Total Allocation Counts for the Hispanic Origin Question by Type of Allocation Flag for the United States, Regions, and States: 1990.
Table 6. Total Allocation Counts for Hispanics by Type of Allocation Flag for the United States, Regions, and States: 1990.
Table 7. Total Allocation Counts for Non-Hispanics by Type of Allocation Flag for the United States, Regions, and States: 1990.
Table 8. Total Population for the Hispanic Origin Question by Allocation Status and Type of Allocation Flag for the United States, Regions, and States: 2000.
Table 9. Total Hispanic Population for the Hispanic Origin Question by Allocation Status and Type of Allocation Flag for the United States, Regions, and States: 2000.
Table 10. Total Non-Hispanic Population for the Hispanic Origin Question by Allocation Status and Type of Allocation Flag for the United States, Regions, and States: 2000.
Table 11. Total Allocation Counts for the Hispanic Origin Question by Type of Allocation Flag for the United States, Regions, and States: 2000.
Table 12. Total Allocation Counts for Hispanics by Type of Allocation Flag for the United States, Regions, and States: 2000.
Table 13. Total Allocation Counts for Non-Hispanics by Type of Allocation Flag for the United States, Regions, and States: 2000.
Table 14. Allocation Rates by Type of Hispanic Origin for Census 2000, Census 1990, and Census 2000 Supplemental Survey, for the United States.
Table 15. Total Edit and Allocation Counts by Type of Allocation Flag for Census 2000, Census 1990, and Census 2000 Supplemental Survey, for the United States.
Table 16. Total Edit and Allocation Counts by Type of Hispanic Origin and by Type of Allocation Flag for Census 2000, Census 1990, and Census 2000 Supplemental Survey, for the United States.
Table 17. Allocation Rates for the Hispanic Origin Question by Race for the United States: 1990 Census.
Table 18. Allocation of Origin - 100% Edit Outcome vs Sample Edit Outcome for the United States: 1990 Census.
Table A. Differences Between Census 2000 and 1990 Census Edit and Imputation Procedures for the Questions on Place of Birth and Hispanic Origin.