The world has evolved quite a bit since 1990, the last time we changed the way we protect individual responses in published census statistics. Back in the pre-Internet age, statistics were shared via floppy disk. Most people had to visit a library to look up census and other publicly available information.
Today, that information is freely available online. The amount of personal data about each of us, from commercial and public databases as well as social media, is massive and growing.
With today’s powerful computers and cloud-ready software, bad actors can easily find and download data from multiple databases. They can use sophisticated computer programs to match information between those databases and identify the people behind the statistics we publish. And they can do it at lightning speed. This is called “re-identification.” It poses a serious threat that didn’t exist even 10 years ago during the last census.
The sheer quantity of statistics we publish makes data matching and re-identification easier. The 2010 Census published billions of statistics, summarizing dozens of specific pieces of information about each of us.
Our research over the past decade confirmed that our traditional (“legacy”) protection methods can no longer defend against these modern threats. We needed to update our methods to comply with Federal law, as well as take a hard look at the quantity and granularity of the statistics we publish.
Conducting an accurate census is more than just a count of individuals; it’s the bedrock of our form of government.
We need to ensure the data we’re releasing are usable – that they are sufficiently accurate and granular to support the multitude of data user needs.
At the same time, we also need to ensure that everyone feels safe responding to our censuses and surveys. People need to know that we guard their privacy zealously.
Our challenge is to strike the best balance between the need to release detailed, usable statistics from the 2020 Census and our responsibility to protect the privacy of the people behind those numbers.
Exploring our options in the face of today’s re-identification threats, we faced two choices: dramatically reduce the quantity and granularity of the statistics we publish, or adopt a modern protection method known as differential privacy.

We chose the latter.
Unlike legacy methods, Differential Privacy doesn’t assume that we can know in advance whose records will be most at risk, or that merely protecting the “riskier” data will be sufficient to prevent re-identification.
Today we know that every released statistic “leaks” a little bit of information about the people behind it, and that modern technology makes it possible for someone to exploit those leaks.
Differential Privacy plugs the leaks using mathematical principles, applying carefully calibrated statistical noise to a dataset. It allows us to strike a balance between privacy and accuracy in a surgical way.
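To make “carefully calibrated statistical noise” concrete, here is a minimal sketch of the Laplace mechanism, one of the simplest differentially private mechanisms. The function name, parameters, and counts below are purely illustrative; the Census Bureau’s actual DAS uses a more elaborate algorithm, not this exact code.

```python
import numpy as np

def laplace_mechanism(true_count, epsilon, rng):
    """Return a noisy count: Laplace noise calibrated to a counting query.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    A smaller epsilon means more noise and stronger privacy protection.
    """
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

rng = np.random.default_rng(42)
# Hypothetical published block count of 1234, protected with epsilon = 1.
protected = laplace_mechanism(1234, epsilon=1.0, rng=rng)
```

Because the noise is zero-centered, individual published figures are perturbed while aggregate patterns remain usable – the “surgical” trade-off between privacy and accuracy described above.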
Unlike legacy methods, Differential Privacy protects against unknown future threats as well as today’s known threats. It also provides a level of transparency about the impact of the privacy protections on data accuracy, which wasn’t possible with the legacy methods.
We released the first “beta” version of the new protection system, which we call the “Disclosure Avoidance System” (DAS), in October 2019. Since then, we’ve gathered extensive feedback from independent experts engaged by the National Academy of Sciences, our own National and Scientific Advisory Committees, the National Conference of State Legislatures, the Civil Rights Division of the Department of Justice, American Indian and Alaska Native tribal leaders, the Federal-State Cooperative for Population Estimates, and many other groups. Their analyses, along with our own, have helped us refine the mechanics of the system.
We’ve applied the incremental system design updates to 2010 Census data to allow an easier analysis of the system’s impact on the data. We’ve published comparison metrics with each release to help with that analysis. We’ve conducted hundreds of briefings with stakeholders to answer questions and gather suggestions to help shape each subsequent iteration. Our focus has since shifted to setting the system’s tunable parameters, including the “dial” that turns the balance between precision and privacy in the data up or down.
Today we released a new set of “demonstration data” that applies the latest DAS design to 2010 Census data. We use 2010 Census results to allow easy comparisons between the latest DAS and the legacy protections, offering a preview of the potential impact on 2020 Census results.
This release is the fifth such demonstration data set but the first that reflects a change in that adjustable “dial,” known in technical terms as the “privacy-loss budget.” The “budget” has nothing to do with any sort of financial value. It refers to a chosen limit on how much privacy is traded for increased accuracy. This data set uses a higher limit — effectively turning the dial “up”— adding more accuracy or precision in the data in exchange for a comparable, but limited, reduction in privacy protection. The changes in this data set reflect the cumulative feedback received from the data user community throughout the development process.
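The effect of turning that “dial” up can be sketched with a toy calculation. This uses the simple Laplace model, in which the noise scale for a counting query is sensitivity divided by the per-query budget; the helper function, the even budget split, and the numbers are all hypothetical, not the DAS’s actual budget allocation.

```python
# Illustrative only: in the simple Laplace model, the noise scale for a
# counting query is sensitivity / epsilon. When a total privacy-loss
# budget is split evenly across several queries, each query receives a
# smaller share of the budget and therefore more noise.
def per_query_noise_scale(eps_total, num_queries, sensitivity=1.0):
    eps_each = eps_total / num_queries  # even split (a simplifying assumption)
    return sensitivity / eps_each

# Turning the "dial" up (doubling the total budget) halves the noise
# scale for each query: more accuracy, in exchange for less privacy.
scale_low = per_query_noise_scale(eps_total=1.0, num_queries=4)   # 4.0
scale_high = per_query_noise_scale(eps_total=2.0, num_queries=4)  # 2.0
```

This is why the privacy-loss budget behaves like a dial rather than a switch: any increase in the budget buys a proportional reduction in noise, and vice versa.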
The demonstration data focus solely on the tables and statistics required for the P.L. 94-171 Redistricting data that will be released in the legacy summary file format in August for states and in a more user-friendly format on the Census Bureau’s website in September. The data meet the accuracy criteria we developed after extensive discussions with the redistricting community and the Civil Rights Division at the U.S. Department of Justice.
We are requesting your feedback on the new demonstration data (via the email address below). Later, in early June, based on that feedback, our Data Stewardship Executive Policymaking Committee (DSEP), composed of senior career executives, will choose the final DAS system design for the redistricting data, including the tunable parameters and privacy-loss budget. While we believe that the accuracy reflected in the latest demonstration data balances the concerns of data users and privacy advocates, the feedback we receive will be reflected in DSEP’s final decisions.
Prior to the September 2021 redistricting data release on data.census.gov, we will publish a final set of demonstration data using the chosen design.
Then, after the September redistricting release, we will shift our full attention to building out the algorithms needed to produce the more detailed data products from the 2020 Census. We pledge our continued transparency and engagement as we work with you to develop those products.
We encourage data users to closely analyze today’s demonstration data. Feedback received by May 28, 2021, will be considered by DSEP.
We will provide metrics and educational webinars throughout the month of May to help you with that analysis. (Subscribe to our newsletter for the release and other updates. Email feedback to: 2020DAS@census.gov. Include “April PPMF” in the subject line.)
Please note that as with legacy protection methods, Differential Privacy is not applied to the apportionment census counts.