The U.S. Census Bureau is required by law to protect respondent confidentiality at every stage of the data lifecycle. From the time we collect the data—through processing, publication, and storage—we are bound by the Census Act, codified at Title 13 of the United States Code, to ensure that information about any specific individual, household, or business is never revealed, in our published statistics or otherwise. In fact, there are criminal penalties that apply if Bureau officials fail to do so. 13 U.S.C. § 214.
Title 13 contains two confidentiality provisions. Section 8 of Title 13 provides in part that the Secretary of Commerce “may furnish copies of tabulations and other statistical materials which do not disclose the information reported by, or on behalf of, any particular respondent” (emphasis added). Section 9 of Title 13 prohibits “any publication whereby the data furnished by any particular establishment or individual under this title can be identified” (emphasis added). The Supreme Court decided, in Baldrige v. Shapiro, that these provisions “preclude all disclosure of raw census data reported by or on behalf of individuals.” (emphasis in original)
The plain language of Title 13’s confidentiality provisions evidences Congress’s intent that the Bureau protect from disclosure information that can be used to identify data supplied by a particular establishment or individual. We live in an era where quantum computers are becoming a reality and, even with the computing power that exists today, it is possible to reverse-engineer releases of aggregated data to identify individual data. Title 13’s confidentiality provisions give the Secretary of Commerce the flexibility to address technological advances, consistent with the discretion Congress vested in the Secretary when delegating the responsibility to conduct the census “in such form and content as [s]he may determine.” 13 U.S.C. § 141(a).
Congress has used Title 13’s protections as a model for legislating confidentiality protections for other federal statistical agencies. In 2002, Congress passed the Confidential Information Protection and Statistical Efficiency Act (CIPSEA), which borrows concepts from Title 13 to provide data confidentiality protections for federal agencies otherwise without such protections. While CIPSEA does not apply to the Census Bureau’s procedures for guarding the confidentiality of responses to Census surveys, some data users have found the description of “indirect identification” in CIPSEA’s implementation guidance to be a particularly accessible way of understanding the concept of indirect disclosure originally codified in Title 13. CIPSEA’s guidance reads:
“Indirect identification refers to using information in conjunction with other data elements to reasonably infer the identity of a respondent. For example, data elements such as a combination of gender, race, date of birth, geographic indicators, or other descriptors may be used to identify an individual respondent.”
The Census Bureau has used a variety of tools to keep your data confidential. We call these tools “disclosure avoidance systems.” In 1920, we began using manual suppression and compression techniques to prevent disclosure of business data. Census Bureau specialists would “eyeball” data tables, look for suspicious data, and manually hide (suppress) it or combine (collapse) it into larger categories. In 1930, we stopped publishing small-area data because we could not prevent disclosure at those levels of geography. In 1940, similar techniques were extended to data about people.
Since 1940, we have modernized our disclosure avoidance system nearly every decade, as threats to the confidentiality of your data have become more complex. From 1950 to 1980, we used various forms of suppression and collapsing to protect confidentiality. In 1990, as our country had become increasingly diverse, data users were frustrated by the number of tables and cells that were required to be hidden under the necessary suppression rules. That decade, we introduced the concept of adding statistical “noise” to the data we published, using techniques called “swapping” and “blank and impute.” These techniques added uncertainty to the data, which allowed us to publish the data at lower levels of geography without putting your data at risk of disclosure.
Data swapping is a disclosure avoidance method that “swaps” data between households in different locations that have similar characteristics on a set of variables. “Blank and impute” is a disclosure avoidance method that identifies outliers in a data set that would make a respondent especially easy to identify, blanks out the data for that variable or set of variables, and imputes a response to take its place using statistical models.
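As a rough illustration of the swapping idea described above, the following Python sketch exchanges the block location of paired households that match on a chosen set of variables. It is not the Census Bureau's actual procedure: the record layout, the matching variables (`size`, `tenure`), the `swap_rate` parameter, and the function name `swap_locations` are all hypothetical choices made for this example.

```python
import random
from collections import Counter

def swap_locations(households, match_keys, swap_rate, seed=0):
    """Return a copy of `households` in which some matched pairs exchange 'block'."""
    rng = random.Random(seed)
    out = [dict(h) for h in households]  # copy so the input is not modified
    # Group record indices by their values on the matching variables.
    groups = {}
    for i, h in enumerate(out):
        groups.setdefault(tuple(h[k] for k in match_keys), []).append(i)
    for idxs in groups.values():
        rng.shuffle(idxs)
        # Pair up candidates; swap only pairs that live in different blocks.
        for a, b in zip(idxs[0::2], idxs[1::2]):
            if out[a]["block"] != out[b]["block"] and rng.random() < swap_rate:
                out[a]["block"], out[b]["block"] = out[b]["block"], out[a]["block"]
    return out

# Made-up records for illustration only.
households = [
    {"id": 1, "block": "A", "size": 2, "tenure": "own", "race": "X"},
    {"id": 2, "block": "B", "size": 2, "tenure": "own", "race": "Y"},
    {"id": 3, "block": "A", "size": 4, "tenure": "rent", "race": "X"},
    {"id": 4, "block": "C", "size": 4, "tenure": "rent", "race": "Y"},
    {"id": 5, "block": "B", "size": 1, "tenure": "rent", "race": "X"},
]
swapped = swap_locations(households, ("size", "tenure"), swap_rate=1.0)

def table(records, *keys):
    """Count records by the given keys, like a small published table."""
    return Counter(tuple(r[k] for k in keys) for r in records)

# Tables on the matching variables are unchanged by the swap ...
same_on_keys = table(households, "block", "size", "tenure") == table(swapped, "block", "size", "tenure")
# ... while tables on other variables (here, race by block) can change.
changed_elsewhere = table(households, "block", "race") != table(swapped, "block", "race")
```

In this sketch, block-level counts on the matching variables are preserved exactly, while the link between a specific household and its published location becomes uncertain. That is the sense in which swapping adds uncertainty without hiding whole tables.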
Over the last two decades, as computers have become increasingly powerful and common, new threats to the confidentiality of your data have emerged. To meet these growing threats, we began using a new disclosure avoidance system built on a framework known as “differential privacy.” Differential privacy, which is explained more thoroughly in other FAQs on this page, was developed by nongovernmental scientists in 2006. In 2008, the Census Bureau first used it to protect against disclosure in our OnTheMap data tool. While we were unable to implement differential privacy for the 2010 Census, research over the past decade has prepared us to use it to protect the data we publish from the 2020 Census.
The 2010 Census produced and published more than 150 billion statistics, many on very small populations. The core demographic publications at the person level consisted of about 8 billion non-redundant numbers, or approximately 25 statistics (including block-level location data) for every person in the census.
Our research since 2010 found that the disclosure avoidance methods we used to protect 2010 Census (and earlier) statistics are no longer able to defend against the risk of reconstruction and reidentification posed by today’s technology. The combination of growing computing power, advances in mathematics, and easy access to large public databases could allow attackers to identify common data points between our published statistics, or between our statistics and outside databases. They could use these common threads to potentially identify an individual respondent’s data. A modernized system is necessary to protect against modern threats.
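To make the idea of reconstruction concrete, here is a toy Python sketch that searches for every set of person records consistent with a handful of published statistics for one very small block. The published values, the age range, and all names in the code are invented for illustration. Real attacks on census-scale data rely on optimization solvers rather than the brute-force search used here.

```python
from itertools import combinations_with_replacement
from statistics import median

# Hypothetical published tables for one tiny block (not real census data).
PUBLISHED = {
    "total": 3,
    "females": 2,
    "mean_age": 44,
    "mean_age_females": 33,
    "median_age": 36,
}

AGES = range(20, 80)  # assumed toy age domain
DOMAIN = [(sex, age) for sex in ("F", "M") for age in AGES]

def consistent(records):
    """True if this candidate set of records reproduces every published value."""
    ages = [age for _, age in records]
    female_ages = [age for sex, age in records if sex == "F"]
    return (
        len(female_ages) == PUBLISHED["females"]
        and sum(ages) == PUBLISHED["mean_age"] * PUBLISHED["total"]
        and sum(female_ages) == PUBLISHED["mean_age_females"] * PUBLISHED["females"]
        and median(ages) == PUBLISHED["median_age"]
    )

# Enumerate every possible block of three people and keep the consistent ones.
candidates = [
    records
    for records in combinations_with_replacement(DOMAIN, PUBLISHED["total"])
    if consistent(records)
]
print(candidates)  # a single candidate: females aged 30 and 36, one male aged 66
```

Because only one candidate survives, the five aggregate numbers pin down each person's sex and age exactly. Linking such reconstructed records to an outside file that carries names would be the separate reidentification step.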
The Census Bureau has a long history of innovations in statistical protection. With each advance in data science, we’ve applied better and stronger protections to keep the statistics we release anonymous. We discuss these new challenges and our response in more detail in a series of blogs and newsletters that can be found on our disclosure avoidance web site. Even with the new safeguards, protecting confidentiality requires us to restructure and reevaluate many of the statistical tables that we publish.
Since October 2019, we have released a series of data sets using 2010 Census data to demonstrate how differential privacy works when applied to the census products we release. These releases have helped data users understand how the system works and have given them an opportunity to work with the new structures and begin to analyze the new system’s fitness-for-use. See: A History of Census Privacy Protections, and Disclosure Avoidance and the 2020 Census.
Data Stewardship is a comprehensive framework designed to protect information over the course of the information lifecycle, from collection to dissemination, and it starts with creating a culture of confidentiality that is based on the law and designed to maintain public trust. Research conducted by both the Census Bureau and non-governmental researchers has shown that concerns about privacy and confidentiality are among the reasons most often given by potential respondents for unwillingness to participate in surveys.
In addition to the impact of confidentiality protections on response rates, our disclosure avoidance system protects against direct threats to the disclosure of our respondents’ data. Many vendors collect, sell, and publish data about people living in the United States. While many commercial vendors have access to data on name, address, and date of birth, fewer have access to the type of rich demographic data the Census Bureau collects on characteristics like race, ethnicity, and household relationships.
The demographic characteristics these vendors lack are precisely the sort of information collected by the decennial census. Disclosure of these characteristics could make it easier to target individuals, particularly in vulnerable populations like communities of color, same-sex couples, older adults, or parents of very young children, for fraud, enforcement actions, disinformation, or physical or virtual abuse. It would also undermine the public’s trust in the confidentiality of census responses, which would make people less likely to respond to future censuses, and the accuracy of the census would necessarily suffer as a result.
Yes, if they have access to additional outside data sources or perform some minimal fieldwork to verify their results. This is precisely why the Census Bureau must seriously address the threat of disclosure and apply a comprehensive and coordinated program of disclosure avoidance.
The Census Bureau has the only copy of the confidential microdata, but an adversary could have access to many different outside data sources. Unless we protect the data, an adversary could independently confirm their reidentifications with reasonable certainty.
As outside data sources containing names, addresses, and birth dates grow in volume and improve in quality, so does the number of adversaries’ presumed and actual matches. Our analysis of 2010 Census reidentification vulnerability used a large database of commercial information available at the time of that census. The risks associated with using 2010 Census disclosure avoidance methods today and into the future will only increase.
To date, we are not aware of successful reidentifications by bad actors, though we would not necessarily expect bad actors to publicize their results. We have, however, documented reidentifications that users have brought to our attention here: Reidentification Studies. There has been a dramatic increase in the availability of both large-scale computing resources and commercial-strength optimizers that can solve systems of billions of simultaneous equations. Together, these resources and tools have changed the threat of database reconstruction from a theoretical risk to an issue that the Census Bureau is legally required to address. The adoption of differential privacy for 2020 Census data releases is intended to guard against successful reconstructions and reidentifications by those who seek to reverse-engineer the census data, including those who would be especially difficult to identify, like state actors (e.g., foreign governments), corporations, and cybercriminals, all of whom would be unlikely to publicly announce a successful reconstruction or reidentification attack.
Differential privacy was too new to deploy in 2010. It was developed in 2006 at Microsoft and first used by the Census Bureau in 2008 to protect block-level population data in the OnTheMap application. Expanding these hardened confidentiality protections to our flagship products, beginning with the 2020 Census, is a much more complicated task that has been years in the making. Many big tech companies use what is called “local” differential privacy (DP), meaning they apply the confidentiality filter on the user’s side, before the data reaches their servers. This allows Big Data analysis without access to the “raw,” unfiltered data. However, the Census Bureau uses “central” DP because we collect and process the raw data for tabulations and quality assurance purposes. We must ensure that the differentially private output is fit for its intended purposes. The 2020 Census is believed to be one of the largest applications of central DP.
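The difference between local and central DP can be sketched with two textbook mechanisms. In the Python example below, the local version uses randomized response, where each respondent perturbs a yes/no answer before it is collected, and the central version adds Laplace noise to a count computed from raw answers. Both mechanisms, the privacy-loss value `epsilon = 1.0`, the simulated answers, and the function names are assumptions chosen for brevity. They are not a description of the Census Bureau's production system.

```python
import math
import random

def local_dp_reports(bits, epsilon, rng):
    """Local DP (randomized response): each respondent perturbs their own bit."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return [b if rng.random() < p_truth else 1 - b for b in bits]

def estimate_count_from_reports(reports, epsilon):
    """Unbiased estimate of the true count of 1s from randomized-response reports."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    n = len(reports)
    return (sum(reports) - n * (1.0 - p_truth)) / (2.0 * p_truth - 1.0)

def central_dp_count(bits, epsilon, rng):
    """Central DP (Laplace mechanism): the curator sees raw bits, then adds noise."""
    scale = 1.0 / epsilon  # a count changes by at most 1 when one record changes
    # The difference of two exponential draws is a Laplace draw with this scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(bits) + noise

rng = random.Random(2020)
true_bits = [1] * 300 + [0] * 700  # simulated answers; the true count is 300
epsilon = 1.0

reports = local_dp_reports(true_bits, epsilon, rng)
local_estimate = estimate_count_from_reports(reports, epsilon)
central_estimate = central_dp_count(true_bits, epsilon, rng)
print(round(local_estimate, 1), round(central_estimate, 1))
```

At the same `epsilon`, the central estimate in this sketch is typically much closer to the true count: its noise has a standard deviation of about 1.4, versus roughly 30 for the local estimate with 1,000 respondents. This is one reason a trusted curator that already holds the raw data would favor the central model.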
Our research verified that traditional disclosure avoidance methods leave personal data exposed with today’s faster computers, high-powered machine learning software, and large public databases. This left us with two choices: we could publish significantly less information, or we could adopt a modernized approach to confidentiality protection. We chose the latter, and there is no other technique that can be reliably employed to assure the confidentiality of the underlying data while simultaneously assuring the highest quality statistical product for our data users. The Census Bureau has both a legal and an ethical responsibility to use the strongest data protection technology available. The Census Bureau has a dual mandate to produce quality statistical information while protecting the confidentiality of respondent data. We know that the nation needs timely and accurate information to make informed decisions. People must know that we will guard their privacy zealously if we want them to entrust us with their personal information.
The disclosure avoidance system (DAS) applies carefully calibrated noise to effectively protect data while preserving valid statistical outcomes. It applies noise to all data points except these “invariants”: