The U.S. Census Bureau will provide two sets of Public Use
Microdata Sample (PUMS) files: a 1 percent national
characteristics file and 5 percent state files. These files will
provide the greatest possible detail, while protecting the
confidential nature of the data. For Puerto Rico, 1 percent
and 5 percent files also will be created.1
Because of the rapid advances in computer technology and the
increased accessibility of census data to the user community,
the Census Bureau has had to adopt more stringent measures to
protect the confidentiality of public use microdata through
disclosure-limitation techniques. At the same time, the Census
Bureau recognizes the needs of data users for greater
characteristic detail and greater geographic specificity. Hence,
two sets of files will be produced: one that provides a fuller
range of detailed characteristics (the 1 percent national
characteristics file) and one that provides greater geographic
detail but less characteristic detail (the 5 percent state files).
This paper describes the confidentiality protection, the content
of the two file types, and the approximate release dates.
Confidentiality will be protected by the use of the following
processes: data swapping, top-coding of selected variables,
geographic population thresholds, age perturbation for large
households, and reduced detail on some categorical variables.
Data swapping is a method of disclosure limitation
designed to protect confidentiality in tables of frequency data
(the number or percent of the population with certain
characteristics). Data swapping is done by editing the source
data or exchanging records for a sample of cases. Swapping is
applied to individual records and, therefore, also protects
Top-coding is a method of disclosure limitation in
which all cases at or above a certain percentage of the
distribution are placed into a single category.
Geographic population thresholds prohibit the disclosure
of data for individuals or households for geographic units with
population counts below a specified level (see descriptions of
Public Use Microdata Areas (PUMAs) and super-PUMAs in Section III).
Age perturbation, that is, modifying the age of
household members, will be required for large households
(households containing ten people or more) due to concerns
Detail for categorical variables will be collapsed if
the categories do not meet a specified national minimum
- FILE TYPES
- National Characteristics 1 Percent
The national characteristics file will provide the maximum
amount of social, economic, and housing information available.
The goal of this file is to match as closely as possible the
amount of detail that was in the 1990 PUMS files (and, in some
cases, to exceed it). No national minimum population threshold
for the identification of variable categories is planned,
except for race and Hispanic origin. Limits on certain
variables, deemed necessary to protect confidentiality, are
covered in Section C below.
To maintain the level of detail described above, however,
the minimum geographic population threshold must be raised
above 100,000 (the PUMA minimum). A new geographical entity
is being created--the super-PUMA. Super-PUMAs have a minimum
population of 400,000 and are composed of a PUMA or PUMAs
delineated on the companion state-level PUMA file.2
Each state will be identified, and any state with a population
of 800,000 or greater can be subdivided into two or more
- State-Level 5 Percent PUMS Files
State-level 5 percent PUMS files will provide information
for PUMAs that will represent many metropolitan areas,
cities, and more populous counties, as well as groups of
less populous counties. In order to protect confidentiality,
characteristic information for these smaller areas will be
less detailed than in the national 1 percent file.
- Population Thresholds for PUMAs
Each geographic unit in the 5 percent files--PUMAs--must
meet a minimum population threshold of 100,000. The
minimum PUMA threshold will be held at 100,000 people
by increasing the degree of variable collapsing to an
appropriate level to maintain confidentiality. There
are two main arguments favoring this approach.
First, from a user's standpoint, raising the minimum
population threshold for PUMAs above 100,000 would
greatly restrict a wide variety of local-level geographic
analyses, such as studies of nonmetropolitan, metropolitan,
and intrametropolitan areas, conducted by public agencies,
academic researchers, and others in the private sector.
Second, the 100,000 minimum population threshold--the
threshold set for both the 1980 and 1990 PUMS
files--permits historical comparability. Users interested in
time-series analysis were clearly displeased at the
possibility of an increase in the threshold for Census
2000. Those users noted the difficulty in comparing
the results from different decades if the PUMA threshold
was raised. Additionally, the Census Bureau's use of
250,000 as the minimum threshold for PUMAs in 1970 was
criticized by users--an important reason for the decision
to lower the minimum threshold to 100,000 people for the
1980 PUMS files and to maintain it in the 1990 PUMS files.
- Minimum Population Threshold for Categorical Variables
To maintain confidentiality, while retaining as much
characteristic detail as possible, a minimum threshold
of 10,000 in the national population will be set for
the identification of groups within categorical variables
in the state-level PUMS files. At the PUMS Users
Conference held in Alexandria, Virginia, on May 22,
2000, some users suggested a minimum population threshold
of 25,000 in response to concerns about confidentiality.
The Census Bureau subsequently determined that a minimum
threshold of 10,000 would maintain the confidentiality
of responses, while providing greater detail to the user.
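As an illustration only, the threshold rule could be sketched in Python as follows. The function name, the catch-all label "Other", and the sample counts are hypothetical; the actual files collapse categories variable by variable, and a single catch-all is a simplification of that process.

```python
NATIONAL_MINIMUM = 10_000  # threshold stated for the state-level (5 percent) files


def collapse_small_categories(national_counts, minimum=NATIONAL_MINIMUM, fallback="Other"):
    """Keep a category if its national count meets the minimum; otherwise
    recode it to a catch-all label (the label is an assumption)."""
    return {
        category: (category if count >= minimum else fallback)
        for category, count in national_counts.items()
    }


# Hypothetical national counts, for illustration only.
counts = {"Group A": 250_000, "Group B": 10_000, "Group C": 9_999}
recode = collapse_small_categories(counts)
```

In this sketch a category with exactly 10,000 people is retained, and only categories below the minimum are recoded.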
The state-level files will require significant
post-processing. Instead of identifying variable
categories based upon pretabulation assumptions about
the composition of the population, this approach sets
variable collapsing requirements after the microdata
samples have been drawn. Each variable will be analyzed,
and only those values that do not meet the 10,000 minimum
national population threshold will be collapsed into more
Post-processing will improve the PUMS products by
offering a more precise means of ensuring confidentiality.
However, this procedure will increase the processing
and analytic work load and delay the release of the 5
percent PUMS products to the public by approximately
- Additional Specifications for the PUMS Files
Additional PUMS file specifications are included for the
following variables in the national characteristics and
- Dollar Amounts
Dollar amounts will be rounded before all summations,
ratio calculations, or presentations of amounts. The
dollar amounts will be represented, including negative
amounts, as follows:
| | round to the nearest $10 |
| | round to the nearest $100 |
| $50,000 or more | round to the nearest $1,000 |
This rule will be applied to income types, utility
costs, mortgage costs, rent, condominium fees, hazard
insurance costs, and mobile home fees.
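A minimal sketch of graduated rounding, offered as an illustration rather than the Census Bureau's procedure. Only the $50,000 boundary comes from the text; the lower boundary (`low_cutoff`, set here to $1,000), the function name, and the round-half-up treatment of magnitudes are assumptions.

```python
def round_dollar_amount(amount, low_cutoff=1_000, high_cutoff=50_000):
    """Round a whole-dollar amount on a graduated scale.

    high_cutoff ($50,000) is taken from the text. low_cutoff is an assumed
    placeholder, because the lower bracket boundaries are not given here.
    """
    magnitude = abs(amount)
    if magnitude >= high_cutoff:
        unit = 1_000
    elif magnitude >= low_cutoff:
        unit = 100
    else:
        unit = 10
    # Round half up on the magnitude, then restore the sign (assumed convention).
    rounded = (magnitude + unit // 2) // unit * unit
    return rounded if amount >= 0 else -rounded
```

Under these assumptions, `round_dollar_amount(52_600)` gives 53,000 and `round_dollar_amount(-347)` gives -350.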
Implementing income top-coding: An individual's
income will be rounded on a graduated scale and
independently top-coded by variable type. The value
inserted for observations at and above the top-code
will be the state mean of all cases at and above the
top-code minimum value. Incomes will then be summed
across household members to obtain household totals,
without any additional top-coding. The bottom-code
for all income types that can have negative dollar
values will be set at -$10,000.
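The top-coding and bottom-coding steps could be sketched as follows for a single income type. This is an illustration under stated assumptions: the top-code value of $100,000, the state labels, and the sample records are hypothetical, and the rounding step is omitted for brevity.

```python
from collections import defaultdict

BOTTOM_CODE = -10_000  # floor stated in the text for income types that can be negative


def top_code_by_state(records, top_code):
    """records is a list of (state, amount) pairs for one income type.

    Amounts at or above top_code are replaced by the mean of all such
    amounts in the same state; amounts below BOTTOM_CODE are raised to it.
    """
    totals = defaultdict(lambda: [0, 0])  # state -> [sum, count] of top-coded cases
    for state, amount in records:
        if amount >= top_code:
            totals[state][0] += amount
            totals[state][1] += 1

    coded = []
    for state, amount in records:
        if amount >= top_code:
            total, count = totals[state]
            coded.append((state, total / count))
        else:
            coded.append((state, max(amount, BOTTOM_CODE)))
    return coded


# Hypothetical records and top-code value, for illustration only.
records = [("AA", 150_000), ("AA", 250_000), ("AA", 40_000),
           ("BB", 300_000), ("BB", -25_000)]
coded = top_code_by_state(records, top_code=100_000)
```

Here both top-coded cases in state "AA" receive 200,000, the single top-coded case in "BB" keeps 300,000, and the loss of 25,000 is bottom-coded to -10,000.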
Housing-related dollar amount variables:
Property taxes will be categorized in a similar way to
1990, with the exception of the higher tax categories.
The categories shown below will be used for the 1
percent file. The categories for the 5 percent file
may have to be collapsed in order to protect
- Property tax ranges:
- Not applicable
- $50 increments from $1 to $999
- $100 increments from $1,000 to $4,999
- $500 increments from $5,000 to $5,999
- $1,000 increments from $6,000 to $9,999
- $10,000 or more3
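The ranges above could be applied to a reported tax amount along these lines. This is a sketch, not the published category list: the exact band boundaries (for example, "$1 to $49" as the first $50 band), the label format, and the use of `None` for "Not applicable" are assumptions.

```python
# (lower bound, upper bound, increment) for each banded range listed above
TAX_BANDS = [
    (1, 999, 50),
    (1_000, 4_999, 100),
    (5_000, 5_999, 500),
    (6_000, 9_999, 1_000),
]
TOP_CATEGORY = 10_000


def property_tax_category(tax):
    """Return a category label for a whole-dollar property tax amount."""
    if tax is None:  # assumed encoding of "Not applicable"
        return "Not applicable"
    if tax >= TOP_CATEGORY:
        return "$10,000 or more"
    for low, high, step in TAX_BANDS:
        if low <= tax <= high:
            start = max(low, tax // step * step)
            end = min(high, tax // step * step + step - 1)
            return f"${start:,} to ${end:,}"
    raise ValueError(f"tax amount outside the listed ranges: {tax}")
```

For example, an amount of $1,234 falls in the band labeled "$1,200 to $1,299" under these assumptions.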
All other housing-related dollar amounts will be treated
similarly to income (see above). That is, the variables
will use the same rounding scale as for income, and each
case will receive the state mean of top-coded cases for
each respective variable. For the items that are
aggregated to create selected monthly owner costs
(SMOC) and gross rent, each item will be rounded
independently and top-coded before summing to the SMOC
or gross rent total. No further rounding will be
performed on the aggregated amount.
- Race and Hispanic Origin Data
Data on race will include "yes/no" variables for the
five Office of Management and Budget (OMB) races4
and Some other race on both the 1 percent and the 5
percent files. This will allow data users to construct
the 63 possible race combinations shown on the
redistricting data file.
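The count of 63 follows from six yes/no indicators: every non-empty subset of the six is a possible combination, so there are 2^6 - 1 = 63. A short sketch of that enumeration is shown below; the variable and function names are hypothetical, and the race labels are those given in the notes to this paper.

```python
from itertools import combinations

RACE_FLAGS = [
    "White",
    "Black or African American",
    "American Indian and Alaska Native",
    "Asian",
    "Native Hawaiian and Other Pacific Islander",
    "Some other race",
]


def race_combinations(flags=RACE_FLAGS):
    """List every non-empty subset of the yes/no race indicators."""
    return [
        combo
        for size in range(1, len(flags) + 1)
        for combo in combinations(flags, size)
    ]


all_combos = race_combinations()
```

The list contains 6 single-race entries and 57 entries of two or more races.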
In addition, in both the 1 percent and the 5 percent
files, we will show all combinations of the 15 race
categories shown on the census questionnaire, specific
American Indian and Alaska Native tribes alone, and
detailed Asian and Native Hawaiian and Other Pacific
Islander groups alone that meet the relevant thresholds.5
In the 1 percent file, we are planning a national minimum
population threshold of 8,000 for the identification of
categories in the race and Hispanic origin variables;
in the 5 percent files, we are planning a national
minimum population threshold of 10,000 for the
identification of categories in these variables. For
example, the racial category "Black or African American
and Filipino" will be shown on both files because there
are more than 10,000 people in the United States who
reported this combination in Census 2000.
- Age Detail
For both the state-level and national characteristics
files, single-year age categories will be provided
through age 89. There will be one nationwide top-code (age
90), and top-coded cases in each state will receive the mean
age of individuals in that state who are 90 years and over.
- Ancestry Variables
The Census Bureau codes up to two responses for the
ancestry question. For the state-level files, if the
combined total national population from both of these
responses for an ancestry group is 10,000 or greater,
that group will be identified by itself in both the
first response and second response variables, even if
the total for the category in either or both of the
individual ancestry variables does not meet the 10,000
- Industry and Occupation
Two sets of codes for each occupation and industry will
be provided: (1) the census code and the Standard
Occupational Classification (SOC)-based code for
occupation and (2) the census code and the North
American Industry Classification System (NAICS)-based
code for industry.
- Continuous Variables
Continuous variables are treated the same on both files.
Additional specifications for departure time (when a
person usually left for work in the week before their
census form was filled out) and year of entry into the
U.S. are described below.
Departure time will be categorized as follows:
12 midnight - 2:59 a.m. in 30-minute increments
3 a.m. - 4:59 a.m. in 10-minute increments
5 a.m. - 10:59 a.m. in 5-minute increments
11 a.m. - 11:59 p.m. in 10-minute increments
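The departure time categories could be derived from a reported time as sketched below. Representing times as minutes after midnight, and the function and constant names, are assumptions made for illustration.

```python
# (first minute, last minute, increment), all in minutes after midnight
DEPARTURE_BANDS = [
    (0, 179, 30),     # 12 midnight - 2:59 a.m.
    (180, 299, 10),   # 3 a.m. - 4:59 a.m.
    (300, 659, 5),    # 5 a.m. - 10:59 a.m.
    (660, 1439, 10),  # 11 a.m. - 11:59 p.m.
]


def departure_interval(minutes):
    """Return the (first, last) minute of the category containing `minutes`."""
    for start, end, step in DEPARTURE_BANDS:
        if start <= minutes <= end:
            low = start + (minutes - start) // step * step
            return low, low + step - 1
    raise ValueError(f"minutes must be between 0 and 1439, got {minutes}")
```

For example, a departure at 7:42 a.m. (minute 462) falls in the 5-minute category running from 7:40 to 7:44 a.m., which the function returns as `(460, 464)`.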
Year of entry into the country will have a
bottom-code of 1910.
- TIMETABLES FOR PUMS FILES
The 1 percent national characteristics file will be the first
file released to the public. It is planned for release in 2002.
The 5 percent state-level files, requiring more time for
post-processing, will be released to the public in 2003.
Paul J. Mackun
U.S. Census Bureau
1 For two Island Areas, Guam and the U.S. Virgin
Islands, 10 percent PUMS files will be created.
2 The super-PUMAs will be identified in the 5 percent
files, as well.
3 Each state receives the mean value of all cases
in that state at and above the national top-code value.
4 The five OMB races include White, Black or African
American, American Indian and Alaska Native, Asian, and Native
Hawaiian and Other Pacific Islander.
5 The following 15 race categories appear on the
form: White, Black or African American, American Indian or
Alaska Native, Asian Indian, Chinese, Filipino, Japanese,
Korean, Vietnamese, Other Asian, Native Hawaiian, Guamanian
or Chamorro, Samoan, Other Pacific Islander, and Some other race.
6 For example, if there are 9,638 individuals who
identify themselves as Alsatian in the "ancestry, first response"
variable and 6,782 individuals who identify themselves as
Alsatian in the "ancestry, second response" variable, Alsatian
will appear as a separate category in both variables--first
response and second response--because the total number of
Alsatian responses nationwide, 16,420, surpasses the 10,000
national minimum population threshold.