Demographic analysis is a method of population estimation that does not rely on a census or surveys, but rather on birth and death records and estimates of migration (see ref. 1 for a more complete discussion of this topic). Demographic analysis population estimates are constructed by age group for a specified geographic area, usually a nation. Constructing a demographic analysis estimate of, for example, persons 50 years of age in the United States would involve tabulating all births recorded in the United States occurring at least 50 but less than 51 years ago, subtracting all deaths recorded for this cohort and all moves abroad by members of this cohort, and adding all moves from abroad into this cohort. Other data may be used to supplement the vital statistics where they prove to be inadequate; for example, the Census Bureau uses administrative data on Medicare enrollment in its estimates of the population 65 and older. The Census Bureau has used demographic analysis to assess the completeness of census coverage following every census since 1960. Because a demographic analysis estimate of the full population requires data covering a long period of time, the amount of racial and ethnic detail it can include is limited by the difficulty of finding consistent classification schemes that have been in place over the whole period in question. By focusing on the population of age 0 at the time of the 1990 census, we may construct a demographic analysis estimate entirely from recent birth and death data, which contain racial and ethnic detail comparable to that found in the 1990 census.
In keeping with OMB Directive 15, all federal agencies have been moving toward a four-race classification system (White, Black, American Indian and Alaska Native, and Asian and Pacific Islander) and an ethnicity classification system that would permit classification of all individuals as either Hispanic or non-Hispanic. The OMB system is designed to promote standardization in federal record-keeping and reporting, and does not permit the use of categories which cannot be aggregated into the specified categories. What is of particular importance for this paper is that the OMB system does not permit the use of categories such as "other" or "unknown". Both the Census Bureau and NCHS deal with missing or otherwise unusable race data by replacing the value with one randomly selected from observations with valid values (see refs. 1 and 5 for descriptions of the selection processes). However, the NCHS data used here contains the original values, which permits the comparison of a variety of approaches to the missing value problem in order to see how the compatibility of the two data sources may be maximized.
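The random reassignment of unusable race values described above can be illustrated with a minimal sketch. It assumes the simplest possible form, an unconditional draw from all records with valid values; the actual selection processes used by the Census Bureau and NCHS are those described in refs. 1 and 5 and may differ from this. The function name impute_missing and the sample category labels are hypothetical.

```python
import random

def impute_missing(values, valid_categories, rng):
    """Replace each unusable value with one drawn at random from the valid ones.

    Assumption: donors are drawn unconditionally from all valid records.
    This is an illustration only, not the agencies' documented procedure.
    """
    donors = [v for v in values if v in valid_categories]
    if not donors:
        raise ValueError("no valid values available to draw from")
    return [v if v in valid_categories else rng.choice(donors) for v in values]

# Hypothetical records; "Other" and "Unknown" are not permitted categories.
valid = {"White", "Black", "American Indian and Alaska Native",
         "Asian and Pacific Islander"}
records = ["White", "Unknown", "Black", "Other", "White"]
imputed = impute_missing(records, valid, random.Random(0))
```

Because the original values are retained in the NCHS data used here, a function of this kind could be swapped for other recoding rules and the resulting distributions compared.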
Since 1989, NCHS has had a new birth registration system in effect, which includes detailed racial and ethnic information about both parents. The Census Bureau has received from NCHS individual record data on all births and deaths recorded in the US since 1989, including detailed information on race and ethnicity. The birth data contains race and ethnicity information for both parents, and NCHS has used the parents' race information to impute the race of the child (see ref. 2 for a description of the imputation procedure). I have used this data to construct demographic analysis estimates of the population less than one year of age at the time of the 1990 Census by race and by Hispanic origin. This paper compares those estimates to the corresponding Census counts.
The Census Bureau data used here is the Modified Age Race Sex (MARS) file, which contains data from the 1990 census modified to correct age and race mis-reporting. The modification methodology is described in Census Report CPH-L-74. There are two aspects of this modification which are important to note here. The 1990 census results included about 10 million persons for whom the race response was not one of the categories listed on the census form, and who thus had to be recoded into one of the four OMB categories in accordance with Directive 15. The intent of the age question on the census was to obtain age in completed years on April 1, 1990, but many respondents gave age at the time they answered the question, and tended to round to the nearest year. This produced a substantial undercount of the population of age 0 on April 1, 1990, which had to be modified in the MARS file.
Table 1 presents a list of the race categories present in the NCHS data and shows how they correspond to the four-race system just mentioned. Table 2 presents the corresponding comparison for ethnicity. Because the intent of this paper is to examine the underlying consistency of the two data sources, a variety of methods of recoding the missing values will be presented to see how the confounding effect of differing recoding schemes may be minimized. Separate comparisons of data from the mother and father, and, in the case of race, for the child, will be presented to determine which gives the closest agreement with the Census Bureau data.
The demographic analysis estimates used in this paper were constructed by tabulating the births occurring between April 1, 1989 and March 31, 1990 and subtracting the deaths to that cohort occurring in the same period, by race and ethnicity. Estimates of net international migration to this cohort were prepared using the methodology described in the technical documentation accompanying PE-29 (ref. 6). Since the goal here is to compare NCHS and Census Bureau data, these estimates were subtracted from the corresponding MARS figures to produce an estimate of the native-born population comparable to the demographic analysis estimate constructed from NCHS data. No attempt was made to allow for domestic migration, owing to the lack of a reliable method for estimating the state-to-state migration by race and ethnicity for the age 0 population. This omission will not affect the national-level estimates, and its effect is likely to be small for most states over a one-year period. Consequently, the detailed tables present results at the state as well as the national level to illustrate how the effects differ from state to state, though it should be kept in mind that the state-level results are merely illustrative.
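The construction just described can be sketched as follows. This is an illustration under stated assumptions, not the processing actually applied to the NCHS files: the record layouts (tuples of dates and a group label), the function names, and the treatment of both window endpoints as inclusive are all hypothetical.

```python
from collections import Counter
from datetime import date

# Cohort window from the text: April 1, 1989 through March 31, 1990.
WINDOW_START = date(1989, 4, 1)
WINDOW_END = date(1990, 3, 31)

def in_window(d):
    """True if the date falls inside the cohort window (endpoints included)."""
    return WINDOW_START <= d <= WINDOW_END

def demographic_analysis_age0(births, deaths):
    """Births in the window minus deaths to that cohort in the window, by group.

    births: iterable of (birth_date, group)              -- hypothetical layout
    deaths: iterable of (birth_date, death_date, group)  -- hypothetical layout
    """
    estimate = Counter()
    for birth_date, group in births:
        if in_window(birth_date):
            estimate[group] += 1
    for birth_date, death_date, group in deaths:
        # Only deaths to cohort members that occur before the census date count.
        if in_window(birth_date) and in_window(death_date):
            estimate[group] -= 1
    return dict(estimate)

def native_born_census(mars_counts, net_migration):
    """Subtract estimated net international migration from the MARS counts."""
    return {g: mars_counts[g] - net_migration.get(g, 0) for g in mars_counts}
```

Removing migration from the census side, rather than adding it to the demographic analysis side, keeps the NCHS-based figure a pure function of the birth and death records.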
Table 3 presents NCHS- and Census-based population estimates and the percentage by which the Census-based estimate is less than the NCHS-based one (a negative sign indicates that the Census-based estimate is actually larger than the NCHS-based estimate). Demographic analysis of the 1990 census revealed a net undercount of slightly less than 2% for the total population. Since under-registration of births is known to be about 1/2%, we would expect the Census total in Table 3 to be about 1½% below that of the NCHS total if the net undercount for age 0 is the same as for the population as a whole. The fact that the Census total is actually about 2½% below NCHS is probably the result of the particular problem the 1990 census encountered with the age 0 population that was discussed previously, and still constitutes reasonable conformity with our expectations. It is interesting to note that there are 18 states for which the census-based estimate is higher than the demographic analysis estimate. This may be the result of the failure to account for inter-state migration or reflect a net overcount in the Census for these states, but it may also be at least partially attributable to the mis-assignment of state of residence by NCHS in cases where women give birth outside of their state of residence. The large discrepancy between the demographic analysis and census-based estimates for the District of Columbia is probably only partially attributable to census undercount in DC, and is likely to be in large part the result of geo-coding errors by NCHS, since it is known that a substantial number of Virginia and Maryland residents give birth in DC hospitals.
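The arithmetic behind the expected gap of about 1½% can be checked with a short sketch. Two assumptions are made here that the text does not state explicitly: the percentage in Table 3 is taken relative to the NCHS-based estimate, and "slightly less than 2%" is rounded to 2% for illustration. The function name and the round true population of 1,000,000 are hypothetical.

```python
def percent_census_below(nchs, census):
    """Percentage by which the Census-based estimate falls below the NCHS-based one.

    Assumption: the NCHS-based estimate is the denominator. A negative result
    means the Census-based estimate is the larger of the two.
    """
    return 100.0 * (nchs - census) / nchs

# Illustrative inputs, not figures from Table 3.
true_population = 1_000_000.0
census_undercount = 0.02      # "slightly less than 2%", rounded up here
under_registration = 0.005    # about 1/2% of births not registered

census_based = true_population * (1 - census_undercount)
nchs_based = true_population * (1 - under_registration)

expected_gap = percent_census_below(nchs_based, census_based)
```

Under these assumptions the gap comes out marginally above 1.5, because the denominator is the slightly under-registered NCHS figure rather than the true population; the simple subtraction of 1/2% from 2% is a close approximation.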
Because demographic analysis is generally considered to be the preferable method of population measurement, it is considered to be the standard by which the census is judged when the two are compared with respect to the level of population. With regard to race and ethnicity distributions, however, the following analysis will show that the error associated with undercount is relatively small when compared to the differences which can be introduced by differing classification systems. Because of the difficulty and the empirical impact of issues related to racial and ethnic classification, the focus of this paper cannot be on which source is correct, but rather it must be the degree to which NCHS and the Census Bureau are consistent with one another. During the research that went into this paper it became clear that the issues which most affect this consistency are mixed parentage and treatment of missing values. Consequently, the following analysis compares the race and ethnicity distributions in the MARS data to those in demographic analysis populations constructed under a variety of assumptions regarding mixed parentage and treatment of missing values.
To measure the degree of comparability between the two sets of estimates, I use the mean absolute percent error (MAPE), which in this circumstance involves computing 100*abs(MARS-DA)/MARS for a given racial or ethnic group [where abs() refers to the absolute value function, MARS denotes the proportion of the MARS population in the group in question, and DA denotes the corresponding proportion from the demographic analysis population], and taking the mean of this quantity over all the groups in the comparison. This calculation effectively treats the MARS distribution as the standard in this comparison, because the demographic analysis distributions are being altered experimentally to investigate the circumstances under which they will and will not resemble the MARS distributions.
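The MAPE calculation defined above can be written out as a minimal sketch. The function names and the three-group example are hypothetical, and the proportions are invented for illustration rather than taken from the MARS or NCHS data.

```python
def proportions(counts):
    """Convert group counts to proportions of the total."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def mape(mars_props, da_props):
    """Mean over groups of 100*abs(MARS-DA)/MARS, with MARS as the standard."""
    errors = [
        100.0 * abs(mars_props[g] - da_props[g]) / mars_props[g]
        for g in mars_props
    ]
    return sum(errors) / len(errors)

# Invented proportions for three groups.
mars = {"A": 0.50, "B": 0.25, "C": 0.25}
da = {"A": 0.45, "B": 0.30, "C": 0.25}
score = mape(mars, da)
```

In this example the same absolute difference of 0.05 contributes 10 for group A and 20 for group B, so the measure weights discrepancies in small groups more heavily than equal-sized discrepancies in large ones.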