Chapter 2. Introduction

OVERVIEW

Public use microdata sample files are ASCII files which contain individual records of the characteristics for a sample of people and housing units. Information which could identify a household or an individual is excluded in order to protect the confidentiality of respondents. Within the limits of the sample size, the geographic detail, and the confidentiality protection, these files allow users to prepare virtually any tabulation they require.

WHAT ARE MICRODATA?

Microdata are the individual records which contain information collected about each person and housing unit. They include the census basic record types, computerized versions of the questionnaires collected from households, as coded and edited during census processing. The Census Bureau uses these confidential microdata in order to produce the summary data that go into the various reports, summary files, and special tabulations. Public use microdata samples are extracts from the confidential microdata taken in a manner that avoids disclosure of information about households or individuals. For Census 2000, the microdata are only available to the public through the Public Use Microdata Sample (PUMS) products.

PROTECTING CONFIDENTIAL INFORMATION

All data released (in print or electronic media) by the Census Bureau are subject to strict confidentiality measures imposed by the legislation under which our data are collected: Title 13, U.S. Code. Responses to the questionnaire can be used only for statistical purposes, and Census Bureau employees are sworn to protect respondents' identities. Because of the rapid advances in computer technology since 1990 and the increased accessibility of census data to the user community, the Census Bureau has had to adopt more stringent measures to protect the confidentiality of public use microdata through enhanced disclosure limitation techniques.
At the same time, the Census Bureau recognizes the data user's need for characteristic detail and geographic specificity. Hence, there are two sets of files: one that provides a fuller range of detailed characteristics (the 1-percent files) and one that provides greater geographic detail but less characteristic detail (the 5-percent files). Confidentiality is protected, in part, by the use of the following processes: data swapping, top-coding of selected variables, geographic population thresholds, age perturbation for large households, and reduced detail on some categorical variables.

Data swapping is a method of disclosure limitation designed to protect confidentiality in tables of frequency data (the number or percent of the population with certain characteristics). Data swapping is done by editing the source data or exchanging records for a sample of cases. Swapping is applied to individual records and, therefore, also protects microdata.

Top-coding is a method of disclosure limitation in which all cases in or above a certain percentage of the distribution are placed into a single category.

Geographic population thresholds prohibit the disclosure of data for individuals or housing units for geographic units with population counts below a specified level.

Age perturbation, that is, modifying the age of household members, is required for large households (households containing ten people or more) due to concerns about confidentiality.

Detail for categorical variables is collapsed if the number of occurrences in each category does not meet a specified national minimum threshold.

1-Percent Files

The 1-percent files give users the maximum amount of social, economic, and housing information available. There is no national minimum threshold for the identification of variable categories, with the exception of a national minimum population of 8,000 for race and Hispanic origin.
The goal of these files is to provide a level of detail similar to that available in the 1990 PUMS files (and, in some cases, more detail). In order to provide the level of characteristic detail for the 1-percent files described above, the minimum geographic population threshold needed to be raised above 100,000 (the PUMA minimum). A new geographic entity was created: the super-PUMA. Super-PUMAs have a minimum population of 400,000 and are composed of a PUMA or PUMAs delineated on the 5-percent PUMS files (see footnote 1). Each state will be identified, and any state with a population of 800,000 or greater can be subdivided into two or more super-PUMAs.

5-Percent Files

To maintain confidentiality while retaining as much characteristic detail as possible, a minimum threshold of 10,000 nationally is set for the identification of variable categories within categorical variables in the 5-percent PUMS files. Each PUMA in the 5-percent files must meet a minimum population threshold of 100,000. The minimum PUMA threshold was held at 100,000 by increasing the degree of variable collapsing as described above. The 100,000 minimum population threshold, the same threshold set for both the 1980 and 1990 PUMS files, permits greater historical comparability.

USES OF MICRODATA FILES

Public use microdata files essentially allow "do-it-yourself" special tabulations. The Census 2000 files furnish nearly all of the detail recorded on long-form questionnaires in the census, subject to the limitations of sample size, geographic identification, and confidentiality protection. Users can construct a wide variety of tabulations interrelating any desired set of variables. They have almost the same freedom to manipulate the data that they would have if they had collected the data in their own sample survey, yet these files offer the precision of census data collection techniques and sample sizes larger than would be feasible in most independent sample surveys.
Microdata samples are useful to users who are doing research that does not require the identification of specific small geographic areas or detailed crosstabulations for small populations. Microdata users frequently study relationships among census variables not shown in existing census tabulations, or concentrate on the characteristics of specially defined populations.

SAMPLE DESIGN AND SIZE

Each microdata file is a stratified sample of the population which was created by subsampling the full census sample (approximately 15.8 percent of all housing units) that received census long-form questionnaires. Initial sampling was done address-by-address in order to allow the study of family relationships and housing unit characteristics for occupied and vacant units. Sampling of people in institutions and other group quarters was done on a person-by-person basis. There are two independently drawn samples, designated "5 percent" and "1 percent," each featuring a different geographic scheme.

Nationwide, the Census 2000 5-percent sample provides the user records for over 14 million people and over 5 million housing units. For the 1-percent sample, there are records for over 2.8 million people and over 1 million housing units. Since processing a smaller sample is less resource intensive, some users may want to produce extracts using the subsample numbers provided in the housing record. The sample design is discussed more thoroughly in Chapter 5, Sample Design and Estimation.

Footnote 1: The super-PUMAs will be identified in the 5-percent files as well.

As in 1990, each file contains individual weights for both the housing units and the people. The user can estimate the frequency of a particular characteristic for the entire population by summing the weight variables for records with that characteristic from the microdata file. A section of Chapter 5 discusses the preparation and verification of estimates (see page 5-2) and Appendix I provides control counts.
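The weight-summing estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the Census Bureau's procedure: the field names PWEIGHT, AGE, and SEX, the coding of SEX (2 for female), and the sample values are all assumptions made for the example, and the records are assumed to have been parsed into dictionaries already.

```python
from typing import Callable, Iterable, Mapping


def weighted_estimate(
    records: Iterable[Mapping[str, int]],
    has_characteristic: Callable[[Mapping[str, int]], bool],
    weight_field: str = "PWEIGHT",  # assumed name of the person weight field
) -> int:
    """Estimate a population count by summing the weights of the
    sample records that have the characteristic of interest."""
    return sum(rec[weight_field] for rec in records if has_characteristic(rec))


# Hypothetical parsed person records (values invented for illustration).
sample = [
    {"AGE": 80, "SEX": 2, "PWEIGHT": 21},
    {"AGE": 34, "SEX": 1, "PWEIGHT": 19},
    {"AGE": 77, "SEX": 2, "PWEIGHT": 23},
]

# Estimated number of females 75 years old or over (assumes SEX == 2 is female).
women_75_plus = weighted_estimate(
    sample, lambda r: r["SEX"] == 2 and r["AGE"] >= 75
)
print(women_75_plus)  # 44
```

For housing unit estimates the same function would be applied to housing records with the housing weight field in place of the person weight.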
Reliability improves with increases in sample size, so the choice of sample size must represent a balance between the level of precision desired and the resources available for working with microdata files. By using tables provided in Chapter 4 (see page 4-3), one can estimate the degree to which sampling error will affect any specific estimate prepared from a microdata file of a particular sample size.

Many factors affect the user's decision on which file to use. Users of microdata files for state or Metropolitan Area (MA) estimates would normally use a 1-percent or 5-percent sample, while users concerned only with national figures can frequently get by with a smaller sample, say a 0.1-percent (one-in-a-thousand) sample. Although we do not provide a 0.1-percent file, we do provide subsample numbers which allow scientifically designed extracts of various sizes to be drawn. Even national users may need a 1-percent or a 5-percent sample if extremely detailed tabulations are desired, or if users are concerned with very small segments of the population, for example, females 75 years old or over of Italian ancestry. One of the examples in Chapter 4 discusses the selection of the appropriate sample size for a particular study.

SUBJECT CONTENT

Microdata files contain the full range of population and housing information collected in Census 2000. These files allow users to study how characteristics are interrelated (for example, income and educational attainment of husbands and wives). Information for each housing unit in the sample appears on a 314-character record with geographic, household, and housing items, followed by a variable number of 314-character records with person-level information, one record for each member of the household. Information for each group quarters person in the sample appears on a 314-character pseudo housing unit record. Items on the housing record are listed beginning on page 6-23; items on the person record are listed beginning on page 6-42.
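The record layout described above (one 314-character housing record followed by that household's person records) can be read with a simple grouping loop. The sketch below is an illustration under stated assumptions: it assumes the first character of each record is a record-type code, "H" for housing and "P" for person, which the text above does not state; the function name and the demo records are hypothetical. The data dictionary in Chapter 6 is the place to confirm the actual field positions.

```python
from typing import Iterable, Iterator, List, Tuple

RECORD_LENGTH = 314  # record length stated in the text


def group_households(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (housing_record, [person_records]) pairs from a stream of
    fixed-width records in which each housing record precedes its people."""
    housing = None
    persons: List[str] = []
    for raw in lines:
        record = raw.rstrip("\r\n")  # drop line endings, keep padding spaces
        if not record:
            continue
        if len(record) != RECORD_LENGTH:
            raise ValueError(f"expected {RECORD_LENGTH} characters, got {len(record)}")
        rec_type = record[0]  # ASSUMPTION: column 1 holds the record type
        if rec_type == "H":
            if housing is not None:
                yield housing, persons
            housing, persons = record, []
        elif rec_type == "P":
            if housing is None:
                raise ValueError("person record found before any housing record")
            persons.append(record)
        else:
            raise ValueError(f"unknown record type {rec_type!r}")
    if housing is not None:
        yield housing, persons


def _pad(prefix: str) -> str:
    """Build a hypothetical 314-character record for the demo."""
    return prefix.ljust(RECORD_LENGTH)


demo = [_pad("H0000001"), _pad("P0000001"), _pad("P0000001"), _pad("H0000002")]
households = list(group_households(demo))
print([len(people) for _, people in households])  # [2, 0]
```

Keeping each household's records together preserves the ability to study relationships among household members, which is the reason the file is organized this way. A housing record with no person records, as in the second demo household, corresponds to a vacant unit.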
Although the subjects are further defined in Appendix B of this document, it is important to note that some items on the microdata file were modified in order to provide protection for individual respondents. The sample questionnaires were edited for completeness and consistency, and substitutions or allocations were made for most missing data. Allocation flags appear interspersed throughout the file indicating each item that has been allocated. Thus, a user who wants to tabulate only actually observed values can exclude the allocated values. Editing and allocation flags are discussed beginning on page 4-17.

GEOGRAPHIC CONTENT

The Census Bureau offered State Data Centers (SDCs) the opportunity to delineate, or coordinate the delineation of, the super-PUMAs and the PUMAs. The SDCs (or their equivalents) in 48 states, the District of Columbia, and Puerto Rico participated in the delineation program. The Florida and Rhode Island SDCs did not participate; in these two states, the Census Bureau delineated the super-PUMAs and the PUMAs.

Super-PUMAs are identified by a 5-digit code. The first two digits of each super-PUMA code within a given state contain that state's Federal Information Processing Standard (FIPS) code. A 5-digit number, unique within state, identifies each PUMA; PUMA codes must be used in conjunction with the 2-digit FIPS state codes. Maps of super-PUMAs and PUMAs, as well as a geographic equivalency file, also are provided to the user via File Transfer Protocol (FTP) and on CD-ROM/DVD.

To maintain the confidentiality of the PUMS data, minimum population thresholds are set for PUMAs and super-PUMAs. For the 1-percent state-level files, the super-PUMAs contain a minimum population of 400,000 and are composed of a PUMA or a group of contiguous PUMAs delineated on the 5-percent state-level PUMS files. Super-PUMAs are a new geographic entity for Census 2000.
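Because PUMA codes are unique only within a state, while super-PUMA codes already begin with the state FIPS code, code that merges files across states needs a combined key for PUMAs. The following sketch shows one way to build it. The function names and the example PUMA and super-PUMA codes are hypothetical; only the code widths and the role of the state FIPS code come from the text above.

```python
def _digits(code, width: int, name: str) -> str:
    """Normalize a code to a zero-padded digit string of the given width."""
    text = str(code).strip().zfill(width)
    if len(text) != width or not text.isdigit():
        raise ValueError(f"{name} must be a {width}-digit code, got {code!r}")
    return text


def puma_key(state_fips, puma) -> str:
    """Combine the 2-digit state FIPS code with the 5-digit PUMA code,
    since PUMA codes are unique only within a state."""
    return _digits(state_fips, 2, "state FIPS") + _digits(puma, 5, "PUMA")


def super_puma_state(super_puma) -> str:
    """Return the state FIPS code carried in the first two digits of a
    5-digit super-PUMA code."""
    return _digits(super_puma, 5, "super-PUMA")[:2]


# "06" is the FIPS code for California; the PUMA and super-PUMA codes
# below are invented for illustration.
print(puma_key("06", "00101"))    # 0600101
print(puma_key(6, 101))           # 0600101 (numeric input is zero-padded)
print(super_puma_state("06010"))  # 06
```

Storing the codes as zero-padded strings rather than integers avoids losing leading zeros, which matters for states whose FIPS codes are below 10.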
The 5-percent state-level files contain PUMAs, each having a minimum population of 100,000; the 5-percent files also will show corresponding super-PUMA codes. Each state is separately identified and may be composed of one or more super-PUMAs or PUMAs. Large metropolitan areas may be subdivided into super-PUMAs and PUMAs. PUMAs and super-PUMAs do not cross state lines.

In addition to super-PUMAs and PUMAs, there also are modified super-PUMAs and PUMAs for two specific variables, place of residence on April 1, 1995 and place of work. The descriptions that follow apply to PUMAs, as well as to super-PUMAs. Migration super-PUMAs and place of work super-PUMAs are the geographic units that contain information on place of residence on April 1, 1995 and place of work, respectively. Outside of the six New England states (Maine, New Hampshire, Vermont, Massachusetts, Rhode Island, and Connecticut), migration super-PUMAs and place of work super-PUMAs are defined only to the whole county (or county equivalent) or groups of counties. In some instances, place of work super-PUMAs are defined to places. In the six New England states, migration super-PUMAs and place of work super-PUMAs are defined to minor civil divisions (MCDs) or groups of MCDs. Appendix K illustrates the relationship between migration super-PUMAs (MIGPUMA1) and super-PUMAs (PUMA1), and Appendix L illustrates the relationship between place of work super-PUMAs (POWPUMA1) and super-PUMAs (PUMA1).

CORRESPONDING MICRODATA FROM EARLIER CENSUSES

PUMS files exist for the 1960, 1970, 1980, and 1990 censuses. Samples from the 1960 through 1990 censuses employed a 1-percent sample size; the 5-percent sample has only been produced since 1980. In 2000, all states met the minimum population threshold for the 1-percent files, so a separate file was produced for each state. Very little comparability exists between geographic identifiers on each of the previous files, but housing and population characteristics are similar.
Because of this similarity, microdata files from the most recent censuses are a rich resource for analysis of trends. Items which were added, dropped, or substantially changed between 1990 and 2000 are listed in Chapter 3, How to Use This File. Appendix B discusses historical comparability of items in greater detail.