Disclosure Avoidance and the 2018 Census Test: Release of the Source Code

The 2020 Census marks the first time that any federal government statistical agency has applied the rigorous rule of differential privacy – the gold standard of privacy protection – to defend the confidentiality of survey responses for a nationwide census. In the design of the 2020 Decennial Response Processing System (DRPS), the differential privacy algorithm is implemented in a system called the Disclosure Avoidance System (DAS), so named because its purpose is to prevent unauthorized disclosure of confidential data.

In February 2019, we used a prototype of the 2020 DAS to enforce privacy protections for the data that we collected in Providence, R.I., as part of the 2018 Census Test. The DAS ingested the 2018 Census Edited File, ran the differential privacy algorithms, and produced the Microdata Detail File (MDF). The MDF was then used to produce the tabulations that we released on April 15, 2019.

When we announced that we were moving to differential privacy, we also said that the move would be accompanied by dramatically improved transparency for the Census Bureau's disclosure avoidance practices without sacrificing the integrity of the standard itself or respondent confidentiality. For example, while we used swapping to protect the respondent data for the 2010 Census, we have never publicly stated how many households were swapped, or even the details of the algorithm that was used to choose candidate households for swapping.

As part of our transparency initiative, today we are releasing the source code for the version of the DAS that we used for the 2018 Census Test.

The DAS is built on the Census Bureau's DAS framework, an architecture that we created to develop and experiment with differential privacy. The DAS framework is a plugin system that supports modules for reading data, applying differential privacy, and casting the results into a form that can be used by the Census Bureau's tabulation systems. The 2018 Census Test code also produces measures of the extent to which the output differs from the input, and makes those computations in a manner consistent with differential privacy.
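To make the reader, engine, and writer stages concrete, here is a minimal sketch of a three-stage pipeline in Python. It is not the DAS framework's actual API: the function names, the `geo,age_category` input format, and the use of a plain Laplace mechanism are all assumptions chosen for illustration, and the released DAS uses a different, optimizer-based algorithm.

```python
import random
from typing import Dict, Iterable, List, Optional, Tuple

Cell = Tuple[str, int]  # hypothetical cell key: (geography code, age category)


def read_records(rows: Iterable[str]) -> List[Cell]:
    """Reader stage: parse 'geo,age_category' lines (an assumed input format)."""
    records = []
    for row in rows:
        geo, age = row.strip().split(",")
        records.append((geo, int(age)))
    return records


def laplace_engine(
    records: List[Cell], domain: List[Cell], epsilon: float, rng: random.Random
) -> Dict[Cell, float]:
    """Engine stage: count every cell in the domain, then add Laplace noise.

    Assumes each person falls in exactly one cell and that neighboring
    datasets differ by adding or removing one person (sensitivity 1).
    Records outside the domain raise KeyError. Illustrative only: this
    sampler is not hardened for production use.
    """
    counts = {cell: 0 for cell in domain}
    for rec in records:
        counts[rec] += 1
    # The difference of two Exponential(rate=epsilon) draws is
    # Laplace-distributed with scale 1/epsilon.
    return {
        cell: n + rng.expovariate(epsilon) - rng.expovariate(epsilon)
        for cell, n in counts.items()
    }


def write_counts(noisy: Dict[Cell, float]) -> Dict[Cell, int]:
    """Writer stage: post-process noisy counts into non-negative integers."""
    return {cell: max(0, round(value)) for cell, value in noisy.items()}


def run_pipeline(
    rows: Iterable[str], domain: List[Cell], epsilon: float, seed: Optional[int] = None
) -> Dict[Cell, int]:
    """Chain the three stages; each could be swapped for another plugin."""
    rng = random.Random(seed)
    return write_counts(laplace_engine(read_records(rows), domain, epsilon, rng))
```

Because each stage only depends on the output of the one before it, a reader for a different input file can be substituted without touching the engine or the writer, which is the property the plugin design is meant to provide.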

The code we are releasing was designed to run on the data we collected in 2018, so people outside the Census Bureau cannot run it as it exists. With that in mind, we are also releasing two additional modules that can read microdata from the 1940 Census, which is now available from the Integrated Public Use Microdata Series USA (IPUMS USA). This means that any interested person can download the IPUMS USA data, run the DAS, and formally evaluate how our differential privacy algorithms compare to alternative methods for protecting confidentiality.

We ran the DAS in the Amazon Web Services GovCloud on a system that had an m4.16xlarge master node with 64 cores and 256 GiB of RAM and two r4.16xlarge core nodes with 64 cores and 488 GiB of RAM each. We used Apache Spark and Amazon's Elastic MapReduce (EMR), with all data stored in an encrypted Amazon Simple Storage Service (S3) bucket. On that system, the DAS typically processes the 1940 data in 30 to 60 minutes.

Running the DAS also requires the use of the Gurobi optimizer, which requires a license. Free Gurobi licenses are available for academic use, but the free license will not work easily in the Amazon EMR environment. Therefore, we have also provided a script that allows the program to run on a single Amazon Elastic Compute Cloud (EC2) node. In our tests, it takes between five and 10 hours to run the DAS on the 1940 data using a four-core EC2 node.

We should note that the accuracy achieved by the DAS algorithm for a given epsilon depends on the size of the histogram (i.e., the product of the number of values that each variable can take). Therefore, the accuracy achieved using the DAS algorithm for the 1940 data using an epsilon value of 0.25 is not equivalent to the accuracy that will be achieved using an epsilon value of 0.25 on the 2020 data. The 1940 histogram used for demonstration purposes is considerably smaller than what will be processed in 2020 by the Census Bureau.
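The dependence on histogram size can be illustrated with a short calculation. The sketch below assumes a simple per-cell Laplace mechanism with sensitivity 1, which is not the DAS algorithm, and the variable cardinalities are hypothetical values chosen only to show the arithmetic.

```python
import math
from typing import Dict


def histogram_size(cardinalities: Dict[str, int]) -> int:
    """Number of histogram cells: the product of each variable's value count."""
    return math.prod(cardinalities.values())


def expected_total_abs_noise(n_cells: int, epsilon: float) -> float:
    """Expected sum of absolute noise when every cell gets Laplace(1/epsilon).

    The mean absolute value of a Laplace variable with scale b is b, so the
    total over n_cells cells is n_cells / epsilon.
    """
    return n_cells / epsilon


# Hypothetical schemas, for illustration only.
small_schema = {"age_category": 2, "ethnicity": 2, "race": 6}
large_schema = {"age_category": 2, "ethnicity": 2, "race": 63}

for name, schema in (("small", small_schema), ("large", large_schema)):
    cells = histogram_size(schema)
    print(name, cells, expected_total_abs_noise(cells, epsilon=0.25))
```

Under these assumptions the same epsilon spreads noise over more cells in the larger schema, so the total error grows even though the privacy-loss parameter is unchanged.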

We are also releasing several runs of the DAS on the 1940 data for different values of epsilon. These data sets can be downloaded from the Census Bureau’s website at <https://www2.census.gov/census_1940/>. Each run is made available in a ZIP file that contains two files: MDF_PER.txt (the person-level file) and MDF_UNIT.txt (the housing unit-level file). These files are microdata, which means that you will need to tabulate them yourself if you wish to determine (for example) the number of people living in each state or produce a protected table to compare with its unprotected version.
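As a starting point for tabulating the microdata, the sketch below counts records by the value of one column. The layout is an assumption: it presumes a header row and a pipe delimiter, and the column name `STATE` is hypothetical, so check the released files and adjust both before use.

```python
import csv
from collections import Counter
from typing import Iterable


def tabulate(lines: Iterable[str], column: str, delimiter: str = "|") -> Counter:
    """Count microdata records by the value found in one named column.

    Assumes the first line is a header row; `delimiter` is a guess at the
    file layout and may need to be changed.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    return Counter(row[column] for row in reader)


if __name__ == "__main__":
    # Small in-memory sample standing in for a person-level file.
    sample = ["STATE|AGE", "44|17", "44|18", "09|18"]
    for state, people in sorted(tabulate(sample, "STATE").items()):
        print(state, people)

    # With a downloaded file, the call might look like this
    # (the column name is hypothetical):
    # with open("MDF_PER.txt", newline="") as f:
    #     counts = tabulate(f, "STATE")
```

The same function can be run on a protected file and on an unprotected tabulation of the IPUMS USA data to compare the two sets of counts cell by cell.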

Please note that the 2018 Census Test only involved the production of the tables for the PL94-171 redistricting data product. The files based on the 1940 Census were processed to match the output requirements of this test. Hence, the only age categories are 17 (ages 0 to 17) and 18 (ages 18 and older); the race and ethnicity data are based on 1940s reporting standards; and the only use of the housing units file is to provide a geography for the individuals — the grouping of people into households is arbitrary because there are no PL94-171 tables that require this linkage.
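The two-category age coding described above can be written as a small recode function. This is a sketch of the rule as stated in the text; the function name is hypothetical and the released code may implement the recode differently.

```python
def recode_age(age: int) -> int:
    """Collapse a single year of age into the two categories described above.

    Returns 17 for ages 0 to 17 and 18 for ages 18 and older.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    return 17 if age <= 17 else 18
```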

The Census Bureau takes its legal and professional obligation to safeguard the information it gathers from the public seriously. From the time the Bureau collects the data, through processing, publication and storage, we are bound by Title 13 of the United States Code to ensure that information about any specific individual, household, or business is never revealed, even indirectly through our published statistics.

Please submit questions about the DAS source code using this link.

Page Last Revised - October 28, 2021