- Summary of Differences
Table A summarizes the key differences between the edit and imputation procedures for the Hispanic origin question in 1990 and 2000. First, while multiple responses were not allowed in either census, Census 2000 allowed for the data capture of more than one response, and the edit and imputation procedures then assigned a single origin. In the case of multiple non-Hispanic or multiple Hispanic responses, a respondent remained non-Hispanic or Hispanic, respectively. However, in the case of a conflicting Hispanic/non-Hispanic response, an attempt was made to resolve the conflict by using other information provided by the respondent (for example, an Hispanic response in the race question), responses of other people in the household, or responses of people living nearby who are of the same race.
Census 2000 edit and imputation procedures also differed from the 1990 procedures in how origin could be assigned from other people in the household. In 1990, anyone in the household could donate an origin regardless of their race. By contrast, Census 2000 rules only allowed other household members to "donate" an origin if the person needing an origin and the donor had the same race.
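The difference in the within-household donor rule can be illustrated with a minimal sketch. The `Person` class, the function name, and the choice of the first eligible household member as donor are assumptions made for illustration only; the source does not say how a donor was selected among eligible members.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    race: str
    origin: Optional[str]  # "Hispanic", "Not Hispanic", or None if not reported


def household_donor_origin(
    donee: Person, household: List[Person], require_same_race: bool
) -> Optional[str]:
    """Return an origin donated by another household member, or None.

    require_same_race=False mimics the 1990 rule (any member may donate);
    require_same_race=True mimics the Census 2000 rule (same race only).
    Taking the first eligible member is an assumption of this sketch.
    """
    for member in household:
        if member is donee or member.origin is None:
            continue  # skip the donee and members with no reported origin
        if require_same_race and member.race != donee.race:
            continue  # Census 2000 rule: donor and donee must share a race
        return member.origin
    return None  # no eligible donor; the case would fall through to a hot deck


household = [Person("White", "Hispanic"), Person("Black", None)]
donee = household[1]
origin_1990_rule = household_donor_origin(donee, household, require_same_race=False)
origin_2000_rule = household_donor_origin(donee, household, require_same_race=True)
```

Under the 1990-style rule the donee in this example receives the other member's origin; under the Census 2000-style rule no household donor qualifies, so the case would go on to hot deck allocation.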
One of the most important differences between the two procedures was how "hot deck" allocation was implemented.4 In 1990, hot deck values were stored and assigned by the race of the "donor" and "donee." In Census 2000, the hot decks were likewise controlled by race, but they were also controlled by four broad age groups.
More importantly, Census 2000 origin hot decks were further differentiated by whether the donor (and donee) had a Spanish or non-Spanish surname. Use of surname in storing and assigning an origin was one of the most important innovations implemented in Census 2000 in that it allowed a much more precise method for assigning an origin from a hot deck. This innovation was cited in a recent evaluation as having a "profound" impact on the assignment of origin.5
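A sequential hot deck keyed on race, age group, and surname type might look like the sketch below. It is a simplified illustration, not the Census Bureau's implementation: the age cut points in `AGE_BOUNDS` are hypothetical (the text says only "four broad age groups"), the class and method names are invented, and cells are not seeded with starting values.

```python
from typing import Dict, Optional, Tuple

# Hypothetical cut points producing four broad age groups (an assumption).
AGE_BOUNDS = (18, 40, 65)

CellKey = Tuple[str, int, bool]  # (race, age group, Spanish surname?)


def age_group(age: int) -> int:
    """Map an age to one of four groups, numbered 0 through 3."""
    return sum(age >= bound for bound in AGE_BOUNDS)


class OriginHotDeck:
    """Stores the most recently reported origin for each cell."""

    def __init__(self) -> None:
        self.cells: Dict[CellKey, str] = {}

    def process(
        self, race: str, age: int, spanish_surname: bool, origin: Optional[str]
    ) -> Tuple[Optional[str], bool]:
        """Return (origin, was_allocated) for one record, in processing order."""
        key = (race, age_group(age), spanish_surname)
        if origin is not None:
            self.cells[key] = origin  # a reported value becomes the next donor
            return origin, False
        # Missing value: take the last stored donor in the same cell.
        # Returns None if the cell is still empty (no seeding in this sketch).
        return self.cells.get(key), True


deck = OriginHotDeck()
deck.process("White", 30, True, "Hispanic")       # donor record
allocated = deck.process("White", 35, True, None)  # donee in the same cell
```

Dropping the age group and surname flag from the key reduces the sketch to a race-only deck of the kind described for 1990.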
Finally, if both race and Hispanic origin were not reported, the edit attempted to assign both a race and an origin from another donor (both within household imputation and hot deck allocation). The 1990 procedures assigned race and origin independently of each other, thus increasing the possibility of creating race/origin combinations that were not that common in the population.
- Context for Comparing Edit and Imputation Procedures
Before assessing the impact of these differences on the Hispanic origin population, it is important to understand the differing contexts within which each edit operated. One of the hallmarks of the Hispanic origin question in 1990 was the relatively high level of nonresponse. Table 1 compares the allocation rates6 for Census 2000 and the 1990 census. It is clear from this table that the allocation rate for this question was almost twice as high in 1990 as it was in 2000 (10.4 percent versus 5.6 percent). What is striking is that the range of allocation rates by region narrowed considerably from 1990 to 2000. In 1990, the rates ranged from 7.2 percent in the West to 11.8 percent in the Northeast - a difference of 4.6 percentage points. Among states and the District of Columbia, the range was even wider with Idaho having the lowest percent (4.2 percent) and the District of Columbia having the highest (18.3 percent) - a difference of 14.1 percentage points. In Census 2000, by contrast, the range by region was much narrower, with the Midwest having the lowest rate (4.7 percent) and the South having the highest rate (6.0 percent) - a difference of only 1.3 percentage points. By state, Minnesota had the lowest rate in Census 2000 (4.0 percent), while the District of Columbia had the highest rate (11.0 percent) - a difference of 7.0 percentage points. It is clear that the biggest improvement in these rates occurred for states that had high allocation rates in 1990. This dramatic improvement in response can be attributed in large part to the placement of the Hispanic question before the question on race in Census 2000.
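The ranges quoted above can be checked with a few lines of arithmetic. The sketch uses only the rates cited in the paragraph; the dictionary layout and function name are illustrative.

```python
# Allocation rates (percent) quoted in the text: (lowest, highest).
rates = {
    1990: {"national": 10.4, "region": (7.2, 11.8), "state": (4.2, 18.3)},
    2000: {"national": 5.6, "region": (4.7, 6.0), "state": (4.0, 11.0)},
}


def spread(low: float, high: float) -> float:
    """Difference between highest and lowest rate, in percentage points."""
    return round(high - low, 1)


spreads = {
    year: {level: spread(*rates[year][level]) for level in ("region", "state")}
    for year in rates
}
# Ratio of the 1990 national rate to the 2000 national rate.
national_ratio = round(rates[1990]["national"] / rates[2000]["national"], 2)

print(spreads)
print(national_ratio)
```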
Tables 2-7 show the impact of the higher level of nonresponse to the origin question in the 1990 census.7 Table 2 shows that at the national level, hot deck allocation was the largest source of origin response after "reported origin." This means that for a significant proportion of the population (8.5 percent), no one in the household answered the Hispanic origin question. This relationship held for all states.
Table 3 shows that for the 1990 Hispanic population alone, there was about equal reliance on "within household" and "hot deck" allocation, with some regions and states having a higher proportion of within-household allocation. This is not surprising since the question is primarily oriented to the Hispanic population. Table 4, by contrast, shows that for non-Hispanics, the proportion of responses coming from hot deck allocation was much higher than that from within household allocation. Tables 5-7 show the distribution of allocated responses by source of allocation and support the same conclusions but from a slightly different perspective.
One of the most important changes made to the Hispanic origin question in Census 2000 to address the problem of nonresponse was to shift the order of the Hispanic origin and race questions. In the 1990 census, the race question appeared first and the Hispanic origin question appeared several questions later. It seems clear that after answering the question on race, many people felt that the Hispanic origin question did not apply and simply skipped the question. Shifting the order of the questions in tests conducted before Census 2000 seemed to improve overall response to the Hispanic origin question with some increased nonresponse to the question on race.
Table 1 and Tables 8-13 show very clearly not only that the level of nonresponse was reduced but also that the relative contribution of within household and hot deck allocation was much more balanced for non-Hispanics in Census 2000 than in the 1990 census. More importantly, allocation from surname-assisted hot decks overall was greater than allocation from non-surname-assisted hot decks (Tables 8-10). Table 10, in particular, shows that for non-Hispanics, allocation from surname-assisted hot decks was about three times the level of allocation from non-surname-assisted hot decks (2.0 percent compared to 0.6 percent).
The impact of surname-assisted programs is clearly more dramatic when observing the source of allocations in Tables 11-13. Overall, surname-assisted hot decks represented 31.4 percent of all allocations, while non-surname assisted hot decks accounted for only 9.6 percent of all allocations. For Hispanic allocations, surname-assisted hot decks overall represented 8.1 percent of all allocations while non-surname assisted hot decks represented about 4.0 percent. For non-Hispanics, surname assisted hot decks provided 36.9 percent of all allocations, while non-surname assisted hot decks provided only 10.9 percent of all allocations (Table 13). In some states where the proportion of Hispanics is very low (such as Alabama, Georgia, Mississippi, North Carolina, South Carolina, and West Virginia), the proportion of people receiving an origin from a surname-assisted hot deck is five times the proportion receiving an origin from a non-surname assisted hot deck.
It is clear from Tables 2-13 that there was a significant increase in Census 2000 in the level of substitution, from 0.7 percent of the population in households in 1990 to 1.2 percent of the total population in Census 2000 (Tables 2 and 8). Substitution occurs when there are no data for anyone in the housing unit, and we use data from a neighboring household of similar size, using the hot deck method, to allocate characteristics for the people in that housing unit. Given that the same basic method was used in both censuses, there is no reason to believe that the procedure itself created any upward or downward bias in assigning origin in 1990 and 2000.
Tables 9 and 10 show that the percent substituted is slightly higher for the Hispanic population (1.6 percent) than for the non-Hispanic population (1.2 percent). There was a similar pattern in 1990, but at a lower level. Tables 3 and 4 show that in 1990 the percent substituted for the Hispanic population (0.9 percent) was again slightly higher than that for the non-Hispanic population (0.6 percent). In addition, it is clear that substitution played a much larger role in the source of allocation of origin in 2000, with substitution constituting about 20 percent of allocations overall. Interestingly, as shown in Tables 12 and 13, the share of substitution was higher for the non-Hispanic population (21.1 percent) than for the Hispanic population (17.5 percent). By contrast, Tables 6 and 7 show that in 1990 the share of substitution in total allocations was much higher for Hispanics (11.0 percent) than for non-Hispanics (5.9 percent). The reasons for the increase in substitution will be part of the Census Bureau's evaluation of Census 2000.
Finally, to put all these results in a broader perspective by including the results from the Census 2000 Supplemental Survey (C2SS), Tables 14-16 show that the trend toward improved response to the origin question is continuing. Editing procedures were basically the same for Census 2000 and the C2SS, except that there was no substitution in the C2SS. Table 14, in particular, shows that allocation rates are lower for the total population and for the Hispanic and non-Hispanic populations in the C2SS than in Census 2000 and in 1990. Table 15 shows an even greater reliance on surname-assisted hot decks in the C2SS, with Table 16 showing a much greater reliance on surname-assisted hot decks for the non-Hispanic population than for the Hispanic population. It should be noted, however, that the level of response in C2SS was improved through the use of field follow-up procedures for people who did not fully answer the questions on the questionnaire, a procedure that was not used in Census 2000.
- Impact of Editing on Hispanic Origin Population in 1990
In the 1990 census, there was an unusually high level of dependence on hot deck allocation because many of the people needing an imputed origin had no reported origin for anyone in the household. This greater reliance on hot deck allocation, combined with a relatively high level of nonresponse, meant that most allocations came from the hot deck, especially for the non-Hispanic population. For example, 75.6 percent of non-Hispanic allocations came from a hot deck, excluding substitutions. By contrast, only 29.9 percent of Hispanic allocations came from a hot deck (Tables 5 and 6), again excluding substitutions. This reflects the fact that the 1990 census hot decks matched donors and donees by their race, but did not match by age and by whether the donee had a Spanish or non-Spanish surname as did Census 2000 origin hot decks.
Concerns about the impact of 1990 edit and imputation procedures emerged when the results of the sample data processing, including a separate edit and imputation for sample questionnaires, became available. The Hispanic origin question on the sample form was edited in sample processing independently of the 100-percent edit and imputation program. Although the basic structure of the two procedures was the same, the edit and imputation procedures for the Hispanic origin question during sample processing differed in a very important way from those used in 100-percent processing. Unlike the 100-percent procedures, sample procedures made use of the rich source of ethnic-related questions from the sample form (ancestry, place of birth, language spoken at home) that could assist in imputing for nonresponse. The use of ethnic-related information, combined with a higher response rate for the Hispanic origin question on the sample form, meant a much lower dependence on hot deck allocation.
The estimate of the Hispanic origin population that resulted from sample processing was about 454,000 below the total of Hispanics obtained from 100-percent processing with the 100-percent total exceeding the sample estimate for most states. This difference existed despite the fact that sample estimates were controlled to 100-percent totals, including race and Hispanic origin.8
Thompson (1991) addressed this difference and the difference between 100-percent totals and sample estimates for the American Indian population. He noted that the difference for the Hispanic population could be attributed to three factors: 1) weighting procedures; 2) a form of allocation bias; and 3) sample processing. Thompson attributed the difference between 100-percent totals and sample estimates primarily to undersampling of Hispanics and to a form of allocation bias. He also attributed part of the difference to different data processing procedures.9 His analysis, however, did not quantify how much each factor contributed to this difference.
The "allocation bias" to which Thompson's analysis refers is directly related to the focus of this analysis. Thompson noted that the nonresponse for the Hispanic question on the short form was 10 percent while the nonresponse rate for the same question on the sample form was only 4 percent. This difference was due partly to the fact that during data collection all sample forms were subject to content edit follow-up (field follow-up of cases where the number of non-reported items exceeded a certain threshold). By contrast, only 10 percent of short forms were subject to content edit follow-up.
Thompson reasoned that Hispanics were more likely to answer the Hispanic origin question than were non-Hispanics, making the donor pool more heavily Hispanic than it would have been had both Hispanics and non-Hispanics reported. If the nonresponse rate for the Hispanic question was high, there was an increased risk that an Hispanic origin would be disproportionately assigned. Evidence of this comes from Del Pinal (1994) who noted that the 1990 edit and imputation procedures tended to increase the overlap between various racial groups and the Hispanic population. For example, although there were very few Black Mexican origin persons, about 62 percent of Black Mexicans were created by the edit and imputation procedures.10 Not surprisingly, the Black population had a much higher nonresponse rate (18.4 percent) in the Hispanic origin question than did the White population (9.6 percent). (See Table 17.) The corresponding nonresponse rates for American Indians and Alaska Natives and Asians and Pacific Islanders were 10.2 percent and 9.7 percent, respectively. All these rates were still much higher than the nonresponse rates for other 100-percent questions such as race, age, gender and household relationship - all of which had nonresponse rates below 3 percent - and increased the possibility of a misallocation of respondents as Hispanic. To give a sense of the potential impact on the data, a net misallocation of only 1 percent of nonresponses as Hispanic out of a total of 24 million needing an origin would result in a net increase of 240,000 Hispanics.
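Thompson's reasoning about the donor pool can be expressed as a simple proportion. The sketch below is an illustration under stated assumptions, not a reconstruction of his analysis: the population share and the two response rates are hypothetical inputs, and the race stratification of the actual hot decks is ignored.

```python
def donor_pool_hispanic_share(
    true_share: float, response_rate_hispanic: float, response_rate_non_hispanic: float
) -> float:
    """Share of reporting persons (potential donors) who are Hispanic.

    true_share: proportion of the population that is Hispanic.
    response_rate_*: proportion of each group answering the origin question.
    """
    hispanic_donors = true_share * response_rate_hispanic
    non_hispanic_donors = (1.0 - true_share) * response_rate_non_hispanic
    return hispanic_donors / (hispanic_donors + non_hispanic_donors)


# Hypothetical inputs chosen only to show the direction of the effect.
true_share = 0.09
pool_share = donor_pool_hispanic_share(true_share, 0.95, 0.89)
excess = pool_share - true_share  # positive when Hispanics respond at a higher rate
```

When the two response rates are equal the donor pool share equals the population share; the larger the gap in response rates, the more the pool overrepresents Hispanics, and the more nonrespondents there are, the more people that skewed pool is applied to.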
To attempt to quantify at some minimal level the impact of the potential misallocation of responses as Hispanic, we obtained records from the sample edited detailed file (SEDF) for 1990. On these records, we had not only the origin value from sample processing (along with its allocation flag to indicate whether the value was reported or imputed) but also the origin value from 100-percent processing along with its corresponding allocation flag. In particular, we were interested in determining how people who received an allocated origin in the 100-percent edit had their origin allocated in the sample edit. For the purposes of this analysis, the results of the sample edit are considered the standard for accuracy because sample editing procedures made use of data from additional ethnic-related questions (ancestry, place of birth, and language spoken at home) not available on the short form.
Table 18 shows that, overall, the 100-percent edit produced a net of about 181,000 more Hispanics than did the sample edit when origin was allocated both in 100-percent and sample editing procedures. This net difference in edit outcomes represented only 2.1 percent of the 8.6 million people for whom origin was allocated in both 100-percent and sample processing.
If we take into consideration also the situations in which we imputed a value in the 100-percent procedures but did not impute a value in the sample procedures, the 100-percent edit produced a net overall of about 161,000 more Hispanics than did the sample procedures.11 Assuming that the sample edit and imputation process is more accurate, the 100-percent edit appears to have imputed as Hispanic a net total of 161,000 people who were probably not Hispanic. However, this number represents only 1.8 percent of all people whose origin was imputed. It is also important to keep in mind that both edit procedures agreed on the edit outcome 96 percent of the time.
It is clear from this table that the impact of this potential misallocation is different by race. The apparent degree of over-editing of Hispanics (as measured by taking the ratio of "Hispanic-100%; Not Hispanic - Sample" to "Not Hispanic - 100%; Hispanic - Sample") appeared to be much greater for Blacks (10.0) and Asian and Pacific Islanders (13.1) than for Whites (4.4). Analysis of the unweighted data shows the same pattern, but slightly lower ratios for each group. This finding is consistent with Del Pinal's finding that certain race/Hispanic combinations were more significantly affected by the editing procedures.
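The two measures used in this comparison, the net difference and the over-editing ratio, are both derived from the discordant cells of a cross-tabulation of 100-percent and sample edit outcomes. The sketch below shows the calculation; the function name is illustrative, and the example counts passed to it are made up, not values from Table 18. The final line checks the 2.1 percent figure quoted earlier from the published totals.

```python
from typing import Tuple


def net_and_ratio(
    hispanic_100_not_sample: int, not_100_hispanic_sample: int
) -> Tuple[int, float]:
    """Return (net excess of Hispanics in the 100-percent edit, over-editing ratio).

    hispanic_100_not_sample: allocated Hispanic in 100-percent, not Hispanic in sample.
    not_100_hispanic_sample: allocated not Hispanic in 100-percent, Hispanic in sample.
    """
    net = hispanic_100_not_sample - not_100_hispanic_sample
    ratio = hispanic_100_not_sample / not_100_hispanic_sample
    return net, ratio


# Made-up counts, for illustration only.
example_net, example_ratio = net_and_ratio(1000, 100)

# Figures quoted in the text: a net of about 181,000 out of 8.6 million
# people whose origin was allocated in both edits.
net_share_percent = round(100 * 181_000 / 8_600_000, 1)
```

A ratio of 1.0 would mean the two kinds of disagreement cancel out; ratios well above 1.0 indicate that the 100-percent edit assigned Hispanic origin far more often than the sample edit did in cases where the two disagreed.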
It is important to keep in mind that the estimate of 161,000 is probably a lower bound because these data were obtained from sample forms that had a lower nonresponse rate and had much more ethnic-related information than did short form questionnaires. It is possible that the level of misallocation would be higher among the population that received only the short form, which experienced a higher nonresponse rate for origin than did the sample form. However, it is unlikely that the upper bound would be as high as the difference between the 100-percent and sample totals (454,000) because: 1) sample processing changed about 262,000 responses from "Other Spanish/Hispanic" to not Hispanic12 and 2) to an unknown degree there was undersampling of Hispanics for which the sample weighting procedures did not compensate.
It is also very important to keep in mind that the impact on the overall total Hispanic population was very small. Overall, this net difference (161,000) represented only 0.7 percent of the total Hispanic population.
- Impact of Edit and Imputation Procedures on Hispanic Origin Population in Census 2000
There are no comparable data available at this time from Census 2000 to perform the same type of analysis that was conducted on the 1990 census edit and imputation procedures. However, it is very clear that the Census 2000 procedures operated in an environment that was profoundly different from that in which the 1990 procedures operated. Significantly reduced nonresponse to the question, combined with more restrictions on the conditions under which origin could be assigned to an individual, has probably led to a much lower level of erroneous imputations as Hispanic (or non-Hispanic).13 At the same time, innovations such as the surname-assisted hot deck have improved the accuracy and, therefore, the quality of data from the Hispanic origin question.