Census Bureau

Race and Ethnicity Classification Consistency Between the Census Bureau and the National Center for Health Statistics

Larry Sink

Administrative Records and Methodology Research Branch
Population Division
U.S. Bureau of the Census
Washington, D.C. 20233

February 1997




The method of demographic analysis is applied to individual birth and death records obtained from the National Center for Health Statistics (NCHS) to produce a series of estimates of the population of age 0 at the time of the 1990 Census. These estimates differ in the way that race or Hispanic origin is assigned, and they are compared to the corresponding 1990 Census figures to determine the degree of consistency between the race and ethnicity classifications used by the two agencies and the effect on this consistency of changing the rules by which race and Hispanic origin are assigned. The principal findings are that assigning births the race and Hispanic origin of the mother produces the greatest consistency with Census results and that under this rule the agreement between Census and NCHS on Hispanic origin is good and the agreement on race is good except for a problem with American Indians.

Race and Ethnicity Classification Consistency between the Census Bureau and the National Center for Health Statistics 1

by Larry Sink


The National Center for Health Statistics (NCHS) is a prominent supplier of vital statistics and the Bureau of the Census is a prominent supplier of population statistics. It is common practice to construct vital rates using a numerator obtained from NCHS and a denominator obtained from the Census Bureau. Since both sources offer these statistics broken down by racial and ethnic categories that appear to be consistent with one another, it is also common practice to use this same procedure to construct vital statistics broken down into these same racial and ethnic categories. However, the Census Bureau relies on self-identification in assigning racial and ethnic categories whereas in NCHS data these categories may be assigned by an observer; and it is not clear that these two approaches would necessarily produce consistent results. This paper uses the method of demographic analysis to construct estimates of the population less than one year of age at the time of the 1990 census from NCHS vital statistics, and compares these estimates to the corresponding estimate from the 1990 census to determine the degree of comparability between the racial and ethnic categories used by the two agencies. It should be noted that the results presented here only pertain to the race and ethnicity classification systems currently in use by Census and NCHS; this is important because proposed changes to Office of Management and Budget (OMB) Directive 15 could require both agencies to change their classification systems.

Data and Methods

Demographic analysis is a method of population estimation that does not rely on a census or surveys, but rather on birth and death records and estimates of migration (see ref. 1 for a more complete discussion of this topic). Demographic analysis population estimates are constructed by age group for a specified geographic area, usually a nation. Constructing a demographic analysis estimate of, for example, persons 50 years of age in the United States would involve tabulating all births recorded in the United States occurring at least 50 but less than 51 years ago, subtracting all deaths recorded to this cohort and all moves abroad by members of this cohort, and adding all moves from abroad into this cohort. Other data may be used to supplement the vital statistics where they prove to be inadequate; for example, the Census Bureau uses administrative data on Medicare enrollment in its estimates of the population 65 and older. Demographic analysis has been used by the Census Bureau to assess the completeness of coverage of the census following every census since 1960. Because a demographic analysis estimate of the full population requires the use of data covering a long period of time, it is limited in the amount of racial and ethnic detail that can be included, owing to the difficulty of finding consistent classification schemes that have been in place over the whole period in question. By focusing on the population of age 0 at the time of the 1990 census, we may construct a demographic analysis estimate entirely from recent birth and death data which contain race and ethnic detail comparable to that found in the 1990 census.
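The cohort bookkeeping described above can be illustrated with a minimal Python sketch. The function name and the counts are hypothetical and chosen only to show the arithmetic; they are not figures from this paper.

```python
def da_estimate(births, deaths, emigrants, immigrants):
    """Demographic analysis estimate for a single birth cohort.

    births:     births recorded to the cohort
    deaths:     deaths recorded to members of the cohort
    emigrants:  moves abroad by members of the cohort
    immigrants: moves from abroad into the cohort
    """
    return births - deaths - emigrants + immigrants


# Hypothetical, made-up counts for one cohort (illustration only).
example = da_estimate(births=3_500_000, deaths=400_000,
                      emigrants=50_000, immigrants=250_000)
print(example)
```

For the age 0 population the same identity applies, but every component comes from records covering a single year, which is why recent race and ethnicity detail can be carried through.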

In keeping with OMB Directive 15, all federal agencies have been moving toward a four-race classification system (White, Black, American Indian and Alaska Native, and Asian and Pacific Islander) and an ethnicity classification system that would permit classification of all individuals as either Hispanic or non-Hispanic. The OMB system is designed to promote standardization in federal record-keeping and reporting, and does not permit the use of categories which cannot be aggregated into the specified categories. What is of particular importance for this paper is that the OMB system does not permit the use of categories such as "other" or "unknown". Both the Census Bureau and NCHS deal with missing or otherwise unusable race data by replacing the value with one randomly selected from observations with valid values (see refs. 1 and 5 for descriptions of the selection processes). However, the NCHS data used here contains the original values, which permits the comparison of a variety of approaches to the missing value problem in order to see how the compatibility of the two data sources may be maximized.

Since 1989, NCHS has had a new birth registration system in effect, which includes detailed racial and ethnic information about both parents. The Census Bureau has received individual record data on all births and deaths recorded in the US since 1989 from NCHS, which includes detailed information on race and ethnicity. The birth data contains race and ethnicity information for both parents, and NCHS has used the parents' race information to impute the race of the child (see ref. 2 for a description of the imputation procedure). I have used this data to construct demographic analysis estimates of the population less than one year of age at the time of the 1990 Census by race and by Hispanic origin. This paper compares those estimates to the corresponding Census counts.

The Census Bureau data used here is the Modified Age Race Sex (MARS) file, which contains data from the 1990 census modified to correct age and race mis-reporting. The modification methodology is described in Census Report CPH-L-74. There are two aspects of this modification which are important to note here. The 1990 census results included about 10 million persons for whom the race response was not one of the categories listed on the census form, and who thus had to be recoded into one of the four OMB categories in accordance with Directive 15. The intent of the age question on the census was to obtain age in completed years on April 1, 1990, but many respondents gave age at the time they answered the question, and tended to round to the nearest year. This produced a substantial undercount of the population of age 0 on April 1, 1990, which had to be modified in the MARS file.

Table 1 presents a list of the race categories present in the NCHS data and shows how they correspond to the four-race system just mentioned. Table 2 presents the corresponding comparison for ethnicity. Because the intent of this paper is to examine the underlying consistency of the two data sources, a variety of methods of recoding the missing values will be presented to see how the confounding effect of differing recoding schemes may be minimized. Separate comparisons of data from the mother and father, and, in the case of race, for the child, will be presented to determine which gives the closest agreement with the Census Bureau data.

The demographic analysis estimates used in this paper were constructed by tabulating the births occurring between April 1, 1989 and March 31, 1990 and subtracting the deaths to that cohort occurring in the same period, by race and ethnicity. Estimates of net international migration to this cohort were prepared using the methodology described in the technical documentation accompanying PE-29 (ref. 6). Since the goal here is to compare NCHS and Census Bureau data, these estimates were subtracted from the corresponding MARS figures to produce an estimate of the native-born population comparable to the demographic analysis estimate constructed from NCHS data. No attempt was made to allow for domestic migration, owing to the lack of a reliable method for estimating state-to-state migration by race and ethnicity for the age 0 population. This omission will not affect the national-level estimates, and its effect is likely to be small for most states over a one-year period. Consequently, the detailed tables present results at the state as well as the national level to illustrate how the effects differ from state to state, though it should be kept in mind that the state-level results are merely illustrative.
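The tabulation just described can be sketched roughly as follows. The record layouts (tuples of dates and a group label), the function names, and the dictionaries of MARS counts and net migration are all assumptions made for illustration; the actual NCHS and MARS files are structured differently.

```python
from collections import Counter
from datetime import date

# Cohort window stated in the text: births from April 1, 1989 to March 31, 1990.
WINDOW_START = date(1989, 4, 1)
WINDOW_END = date(1990, 3, 31)


def in_window(d):
    """True if date d falls inside the cohort window (inclusive)."""
    return WINDOW_START <= d <= WINDOW_END


def age0_estimates(births, deaths, mars, net_migration):
    """Return (da, census_native) counts by group.

    births:        iterable of (birth_date, group)              [hypothetical layout]
    deaths:        iterable of (birth_date, death_date, group)  [hypothetical layout]
    mars:          dict group -> MARS count of age 0
    net_migration: dict group -> estimated net international migration
    """
    born = Counter(g for d, g in births if in_window(d))
    # Deaths to the cohort: born in the window and died in the window.
    died = Counter(g for b, d, g in deaths if in_window(b) and in_window(d))
    da = {g: born[g] - died[g] for g in born}
    # Remove net international migration so the census figure is native-born.
    census_native = {g: mars[g] - net_migration.get(g, 0) for g in mars}
    return da, census_native


# Tiny made-up example.
births = [(date(1989, 4, 1), "A"), (date(1989, 12, 31), "A"),
          (date(1990, 3, 31), "B"), (date(1990, 4, 1), "B"),
          (date(1989, 3, 31), "A")]
deaths = [(date(1989, 5, 1), date(1989, 6, 1), "A"),
          (date(1989, 5, 1), date(1990, 4, 2), "A")]
da, census_native = age0_estimates(births, deaths,
                                   mars={"A": 10, "B": 5},
                                   net_migration={"A": 2})
```

Domestic migration is deliberately absent from this sketch, mirroring the omission noted above.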

Table 3 presents NCHS- and Census-based population estimates and the percentage by which the Census-based estimate is less than the NCHS-based one (a negative sign indicates that the Census-based estimate is actually larger than the NCHS-based estimate). Demographic analysis of the 1990 census revealed a net undercount of slightly less than 2% for the total population. Since under-registration of births is known to be about 1/2%, we would expect the Census total in Table 3 to be about 1 1/2% below that of the NCHS total if the net undercount for age 0 is the same as for the population as a whole. The fact that the Census total is actually about 2 1/2% below NCHS is probably the result of the particular problem the 1990 census encountered with the age 0 population that was discussed previously, and still constitutes reasonable conformity with our expectations. It is interesting to note that there are 18 states for which the census-based estimate is higher than the demographic analysis estimate. This may be the result of the failure to account for inter-state migration or reflect a net overcount in the Census for these states, but it may also be at least partially attributable to the mis-assignment of state of residence by NCHS in cases where women give birth outside of their state of residence. The large discrepancy between the demographic analysis and census-based estimates for the District of Columbia is probably only partially attributable to census undercount in DC, and is likely to be in large part the result of geo-coding errors by NCHS, since it is known that a substantial number of Virginia and Maryland residents give birth in DC hospitals.
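The sign convention and the expected gap can be made concrete with a short Python sketch. It assumes that the Table 3 percentage is taken relative to the NCHS-based estimate, which the text implies but does not state outright; the function name and the example counts are hypothetical.

```python
def pct_census_below(nchs, census):
    """Percent by which the census-based estimate falls below the NCHS-based one.

    A negative result means the census-based estimate is the larger of the two.
    Assumption: the NCHS-based estimate is the denominator.
    """
    return 100.0 * (nchs - census) / nchs


# Expected gap if age 0 shared the overall net undercount (about 2%),
# offset by birth under-registration (about 1/2%).
expected_gap = 2.0 - 0.5

# Made-up counts showing both signs.
below = pct_census_below(nchs=1000, census=975)
above = pct_census_below(nchs=1000, census=1010)
```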

Because demographic analysis is generally considered to be the preferable method of population measurement, it is considered to be the standard by which the census is judged when the two are compared with respect to the level of population. With regard to race and ethnicity distributions, however, the following analysis will show that the error attributable to undercount is relatively small when compared to the differences which can be introduced by differing classification systems. Because of the difficulty and the empirical impact of issues related to racial and ethnic classification, the focus of this paper cannot be on which source is correct, but rather it must be the degree to which NCHS and the Census Bureau are consistent with one another. During the research that went into this paper it became clear that the issues which most affect this consistency are mixed parentage and treatment of missing values. Consequently, the following analysis compares the race and ethnicity distributions in the MARS data to those in demographic analysis populations constructed under a variety of assumptions regarding mixed parentage and treatment of missing values.

To measure the degree of comparability between the two sets of estimates, I use the mean absolute percent error (MAPE), which in this circumstance involves computing 100*abs(MARS-DA)/MARS for a given racial or ethnic group [where abs() refers to the absolute value function, MARS denotes the proportion of the MARS population in the group in question, and DA denotes the corresponding proportion from the demographic analysis population], and taking the mean of this quantity over all the groups in the comparison. This calculation effectively treats the MARS distribution as the standard in this comparison, because the demographic analysis distributions are being altered experimentally to investigate the circumstances under which they will and will not resemble the MARS distributions.
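The MAPE defined above can be written as a short Python sketch. The function names and the two-group example are hypothetical; only the formula 100*abs(MARS-DA)/MARS, averaged over groups, comes from the text.

```python
def proportions(counts):
    """Convert a dict of group counts into a dict of group proportions."""
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}


def mape(mars_counts, da_counts):
    """Mean absolute percent error, with the MARS distribution as the standard."""
    mars = proportions(mars_counts)
    da = proportions(da_counts)
    errors = [100.0 * abs(mars[g] - da.get(g, 0.0)) / mars[g] for g in mars]
    return sum(errors) / len(errors)


# Made-up two-group example: MARS shares 0.80 / 0.20, DA shares 0.75 / 0.25.
# Percent errors are 6.25 and 25, so the mean is 15.625.
example_mape = mape({"X": 80, "Y": 20}, {"X": 75, "Y": 25})
```

Because each group's error is divided by that group's own MARS proportion, a small group contributes as much to the mean as a large one, so a modest absolute discrepancy in a small group can dominate the result.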


Tables 4-21 pertain to the comparability of NCHS and Census Bureau race classifications, and these results are summarized in Summary Table A. These tables compare the race distributions found in the NCHS and Census Bureau data just discussed, both at the national and state levels, and present MAPEs for these comparisons. The same admonitions mentioned earlier regarding the state-level estimates should be applied here. Additionally, the MAPE tends to be unreliable if one of the groups in the comparison contains a small number of observations. Nonetheless, the state-level MAPEs serve to show where the discrepancies between the two sources are the most pronounced, and indicate the states where particular problems arise from the application of certain schemes for dealing with unknown values.

In Tables 4-6, the missings are deleted before computing the proportions, which has the same effect as recoding the missings to the four race groups in the same proportions as the non-missing values. In Tables 7-9, the missings are recoded to White. In Tables 10-12, 13-15, and 16-18, the missings are recoded to Black, American Indian, and Asian, respectively. In Tables 19-21 the missings are recoded to the most recently observed valid value in the data, which is basically the same method used by NCHS and should produce results comparable to those that would be obtained by using the data NCHS releases to the public. Tables 4, 7, 10, 13, 16, and 19 use NCHS race of child, Tables 5, 8, 11, 14, 17, and 20 use race of mother, while Tables 6, 9, 12, 15, 18, and 21 use race of father.

It is immediately clear that race of father performs poorly. Race of father has more unknowns than either race of mother or race of child, and this is probably the reason for its poor performance. Race of mother with unknowns recoded to American Indian produced the lowest national-level MAPE of any of these approaches. Race of mother substantially underestimates the American Indian population regardless of what is done with the unknowns, but consistently provides the best estimates of the other three race groups.
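The three kinds of missing-value treatment compared in Tables 4-21 can be sketched in Python as follows. The function names, the use of None to mark a missing value, and the sample list are assumptions for illustration; in particular, the carry-forward function only approximates the "most recently observed valid value" rule and is not NCHS's actual procedure.

```python
MISSING = None  # hypothetical marker for an unknown race value


def drop_missing(values):
    """Delete missing values before computing proportions (Tables 4-6)."""
    return [v for v in values if v is not MISSING]


def recode_to(values, category):
    """Recode every missing value to one fixed category (Tables 7-18)."""
    return [category if v is MISSING else v for v in values]


def recode_to_last_valid(values, initial):
    """Recode each missing value to the most recent valid value (Tables 19-21).

    `initial` is used only if the data begin with a missing value.
    """
    out, last = [], initial
    for v in values:
        if v is not MISSING:
            last = v
        out.append(last)
    return out


# Made-up sequence of records in file order.
records = ["White", None, "Black", None, None, "Asian"]
dropped = drop_missing(records)
to_white = recode_to(records, "White")
carried = recode_to_last_valid(records, initial="White")
```

Dropping missing values leaves the proportions among valid records unchanged, which is why it is equivalent to allocating the missings proportionally.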

Summary Table A

National-level MAPEs from NCHS vs. Census Race Comparison

Treatment of missing values    Race of child   Race of mother   Race of father
deleted                             5.15            5.54            17.76
recoded to White                    4.98            5.54            26.89
recoded to Black                    5.15            5.68            31.73
recoded to American Indian          6.71            4.07           313.0
recoded to Asian                    5.69            6.09           133.6
recoded randomly                    5.07            5.59            14.67

To further investigate this problem with American Indians, the Table 14 calculations were rerun the same as before except that race of father was used when the father was American Indian. The results are presented in Table 22, which shows that this procedure reduced the average difference from 4.1% to 2.8%. While the comparisons reported in Tables 4-22 apply the same procedure to all missing values, the results presented in Table 23 are based on a procedure which recodes missing values on a state-by-state basis. In this procedure, each state's missing values are recoded to the race whose proportion in the demographic analysis population is furthest below the corresponding proportion in the MARS data. This approach was developed to determine if it would improve upon the results reported in Table 22, and it does, lowering the national-level MAPE to 1.9%. Table 24 displays the number of missing values by state (for race of mother) and the race to which they were recoded under this scheme. Though this approach produces the best overall result, it does particularly badly in the District of Columbia. It is interesting to note that the approach which works the best in the District of Columbia is to use race of father and recode missing values to Black, which is one of the worst approaches for the nation as a whole.
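The two rules introduced in this paragraph can be sketched in Python. Both functions are illustrative readings of the text rather than the code actually used: the names are hypothetical, "furthest below" is interpreted as the largest absolute shortfall in proportion, and the example proportions are made up.

```python
def child_race(mother_race, father_race):
    """Table 22 rule as described: race of mother, unless the father is American Indian."""
    if father_race == "American Indian":
        return "American Indian"
    return mother_race


def state_recode_target(mars_props, da_props):
    """Table 23 rule as described: pick the race whose demographic analysis
    proportion is furthest below the corresponding MARS proportion.

    Assumption: the shortfall is measured as a simple difference of proportions.
    """
    return max(mars_props, key=lambda r: mars_props[r] - da_props.get(r, 0.0))


# Made-up proportions for one hypothetical state.
mars_props = {"White": 0.80, "Black": 0.12, "American Indian": 0.03, "Asian": 0.05}
da_props = {"White": 0.81, "Black": 0.125, "American Indian": 0.02, "Asian": 0.045}
target = state_recode_target(mars_props, da_props)
```

Since the target race is chosen by looking at the MARS distribution itself, this scheme should be read as a bound on how much agreement recoding can achieve rather than as a procedure that could be applied without the census data.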

Since NCHS uses race of mother in the tabulations it presents in its official reports and randomly recodes missings as previously discussed, the results presented in Table 20 give a good idea of the level of agreement between the Census Bureau's racial categories and those used in NCHS's official publications. The overall MAPE of 5.6% indicates reasonably good agreement, but closer inspection reveals that almost all the discrepancy comes from the estimates of the American Indian population, where the national-level NCHS-based estimate is some 20% below that obtained from Census Bureau data. Consequently, it would be more accurate to say that Census Bureau and NCHS racial classifications show very close agreement except with regard to American Indians, where a problem clearly exists. This problem probably has to do with the fact that significant privileges and advantages can result from membership in certain American Indian tribes, thus giving parents an incentive to designate their children as American Indian if either parent is American Indian.

Tables 25-34 deal with the comparability of the NCHS and Census Bureau Hispanic/non-Hispanic categories, and these results are summarized in Summary Table B. The comparisons are analogous to those just presented for race. They present all possible combinations of the choices between mother's and father's ethnicity and the choices among omitting unknowns, recoding them to non-Hispanic or Hispanic, and assigning them the last valid value in the data. Additionally, Tables 33 and 34 treat unknowns by recoding them to the value of the other parent. At the national level the best choice appears to be using the mother's ethnicity and omitting unknowns, which produces a disagreement between the NCHS and Census Bureau classifications of less than 1%, though very similar results were obtained using mother's ethnicity with unknowns recoded to father's ethnicity. As can be seen from the tables, however, there is considerable variation in the performance of the various approaches at the state level. Thus, while mother's ethnicity with omitted unknowns fares best at the national level, at the state level this approach produces the best results only for California, with the other states obtaining better results from one of the other approaches. Using ethnicity of father with unknowns recoded to Hispanic is consistently the worst approach.

Summary Table B

National-level MAPEs from NCHS vs. Census Ethnicity Comparison

Treatment of missing values          Ethnicity of mother   Ethnicity of father
deleted                                     0.98                  3.07
recoded to Hispanic                        15.20                 66.70
recoded to non-Hispanic                     1.35                  7.77
recoded randomly                            4.19                  1.92
recoded to value of other parent            1.00                  1.92

In its published data relating to Hispanic origin, NCHS uses ethnicity of mother and recodes missings to non-Hispanic. Consequently, the results presented in Table 29 should show the level of agreement between the Census Bureau's Hispanic/non-Hispanic categories and those used in NCHS's published reports. As can be seen, the agreement is quite good, with a national-level MAPE of only 1.3% and no serious problems in any of the individual states.

Conclusions and Suggestions for Further Research

The overall conclusion to be drawn from this work is that the racial and ethnic classification systems used by the Census Bureau and NCHS show a high degree of agreement with each other, except for a problem with American Indians. A secondary conclusion is that ascribing the racial and ethnic characteristics of the mother to the child seems to afford consistency with Census Bureau classifications except when the father is an American Indian. The cause of the American Indian problem seems to be twofold. First, because they are such a small proportion of the population, American Indians are more likely than other racial groups to marry a person of another race. In the NCHS data used in this analysis, there are more births where one parent was American Indian and one was not than there are where both parents were American Indian. Second, in mixed race couples where one partner is American Indian, there is an incentive to identify the children as American Indian because of the advantages that can result from tribal membership. Consequently, any racial classification scheme that automatically assigns children the race of one parent or the other would report substantially fewer American Indian children than would be reported in a system permitting self-identification. Further, this tendency of children of mixed-race marriages to identify racially with their American Indian parent means that there are self-identified American Indians who are biologically only a small portion American Indian and who would thus be unlikely to be identified as American Indian by an observer. This suggests that this inconsistency with respect to American Indians is likely to be found in mortality data as well as fertility data. As a result, those who are interested in the fertility and mortality of American Indians need to take great care in combining Census and NCHS data and in using data that draws from both sources (e.g. NCHS's published vital rates). 
Those who are not concerned with this problem or with race-ethnicity cross-classification may regard the current NCHS and Census Bureau race and Hispanic origin classification systems as completely compatible. It is important to keep in mind that these results only pertain to the present systems. If these systems were to be changed, which currently seems likely, this analysis would have to be repeated using the new systems.

It should be stressed that consistency between the two agencies with respect to race classification and Hispanic origin classification does not imply that they are consistent with respect to race/Hispanic origin cross-classification, and research done at the Census Bureau indicates that they are not. My next work in this area will be to extend the analysis of this paper to the issue of the apparent inconsistency in race-ethnicity cross-classification between the Census Bureau and NCHS. Research is already in progress at the Census Bureau on a reliable method of estimating U.S. internal migration of the age 0 population with race and ethnic detail. Such a method would allow us to extend the analysis of this paper to smaller levels of geography, which could be very helpful, given that the illustrative state-level results presented here indicate that classification inconsistency and the reasons behind it may vary considerably from region to region. The issue which is likely to be of the greatest interest, however, is which classification system yields the best estimates. This issue is fraught with difficulties, not the least of which is defining what is meant by "best estimate", and the work presented here is only a small step towards dealing with it. In addition to further work along the lines of inquiry begun here, more work is also needed on how we define race on a scientific level, on how we identify ourselves racially on a personal level, and on the nature and extent of the differences between these two definitions.


  1. National Center for Health Statistics: Vital Statistics of the United States, 1988, Vol. I, Natality. Public Health Service, Washington, DC. U.S. Government Printing Office, 1990.

  2. Instruction Manual Part 3a, Classification and Coding Instructions for Live Birth Records, 1993. Public Health Service, Washington, DC.

  3. Instruction Manual Part 4, Demographic Classification and Coding Instructions for Death Records, 1993. Public Health Service, Washington, DC.

  4. Robinson, J. Gregory, Bashir Ahmed, Prithwis Das Gupta, and Karen A. Woodrow "Estimation of Population Coverage in the 1990 United States Census Based on Demographic Analysis", Journal of the American Statistical Association, Sept. 93, vol. 88 no. 423, 1061-1071.

  5. U.S. Bureau of the Census "Age, Race, and Hispanic Origin Information from the 1990 Census: A Comparison of Census Results Where Age and Race have been Modified". 1990 CPH-L-74.

  6. "Estimates of the Population of States by Age, Sex, Race, and Hispanic Origin: 1990 to 1992". PE-29 (technical documentation).

1 This paper was originally prepared for presentation at the Population Association of America 1996 Annual Meeting.


Source: U.S. Census Bureau, Population Division,
Administrative Records & Methodology Research Branch

Author: Larry Sink
Last Revised: October 31, 2011 at 10:03:08 PM