The U.S. Census Bureau’s commitment to data stewardship, which means protecting respondent privacy and confidentiality at every stage of the data lifecycle, is grounded in law that is straightforward, robust, and strong. From the time we collect the data, through processing, publication, and storage, we are bound by Title 13 of the United States Code to ensure that information about any specific individual, household, or business is never revealed, even indirectly through our published statistics.
We call the steps we take to prevent any outside entity from identifying individuals or businesses in the statistics we publish “disclosure avoidance.” This is the first of two Research Matters blog posts in which I discuss the ongoing work at the Census Bureau to modernize how we protect respondent confidentiality when we publish statistics on the U.S. population and economy.
Throughout our history, we have been leaders in statistical data protection. Other statistical agencies call it “disclosure limitation” or “disclosure control”; all of these terms are synonymous with disclosure avoidance. Disclosure avoidance methods have evolved since the censuses of the early 1800s, when the only protection used was simply removing names. Executive orders and a series of laws modified the legal basis for these protections, which were finally codified in the 1954 Census Act (13 U.S.C. Sections 8(b) and 9). We have continually added better and stronger protections to keep the data we publish anonymous and the underlying records confidential.
However, historical methods cannot completely defend against the threats posed by today’s technology. Growth in computing power, advances in mathematics, and easy access to large, public databases pose a significant threat to confidentiality. These forces have made it possible for sophisticated users to ferret out common data points between databases using only our published statistics. If left unchecked, those users might be able to stitch together these common threads to identify the people or businesses behind the statistics, as was done in the case of the Netflix Challenge. 1
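As a toy illustration of how common data points can tie an "anonymous" record back to a name, the Python sketch below links two small, entirely made-up tables on the fields they share. It is not the algorithm used in the Netflix case or any attack on Census Bureau data; the function name `link_records`, the field names, and every record are hypothetical.

```python
# Hypothetical release with names removed but shared attributes kept.
released = [
    {"zip": "00001", "birth_year": 1970, "sex": "F", "response": "A"},
    {"zip": "00002", "birth_year": 1985, "sex": "M", "response": "B"},
    {"zip": "00003", "birth_year": 1990, "sex": "F", "response": "C"},
]

# Hypothetical public database that carries names and the same attributes.
public = [
    {"name": "Person 1", "zip": "00001", "birth_year": 1970, "sex": "F"},
    {"name": "Person 2", "zip": "00002", "birth_year": 1985, "sex": "M"},
    {"name": "Person 3", "zip": "00002", "birth_year": 1985, "sex": "M"},
    {"name": "Person 4", "zip": "00004", "birth_year": 1960, "sex": "M"},
]


def link_records(released, public, keys=("zip", "birth_year", "sex")):
    """Return (name, response) pairs for released rows that match exactly
    one named row in the public table on the shared key fields."""
    # Index the public table by the combination of shared fields.
    index = {}
    for row in public:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])

    matches = []
    for row in released:
        candidates = index.get(tuple(row[k] for k in keys), [])
        # Only a unique candidate counts as a re-identification.
        if len(candidates) == 1:
            matches.append((candidates[0], row["response"]))
    return matches


print(link_records(released, public))
```

In this made-up data only the first released row is re-identified: the second matches two named people and the third matches none. The more attributes two sources share, the more often a match is unique.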
The Census Bureau has been addressing these issues from every feasible angle and changing rapidly with the times to ensure that we protect the data our census and survey respondents provide us. We are doing this by moving to a new, advanced, and far more powerful confidentiality protection system, which uses a rigorous mathematical process that protects respondents’ information and identity in all of our publications.
The new tool is based on the concept known in scientific and academic circles as “differential privacy.” It is also called “formal privacy” because it provides provable mathematical guarantees about confidentiality protection, similar to those found in modern cryptography, and those guarantees can be independently verified without compromising the underlying protections.
“Differential privacy” is based on the cryptographic principle that an attacker should not be able to learn meaningfully more about you from the statistics we publish using your data than from statistics that did not use your data, with the difference held within a strict, quantifiable limit. After tabulating the data, we apply carefully constructed algorithms to modify the statistics in a way that protects individuals while continuing to yield accurate results. We assume that everyone’s data are vulnerable and provide the same strong, state-of-the-art protection to every record in our database.
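To make the idea of modifying a tabulated statistic concrete, here is a minimal Python sketch of the Laplace mechanism from the Dwork et al. paper cited in note 2. It is an illustration only and should not be read as the Census Bureau's production algorithm: the function name `laplace_count`, the privacy-loss parameter `epsilon`, and the example counts are all assumptions introduced here.

```python
import random


def laplace_count(true_count, epsilon, rng):
    """Return true_count plus Laplace noise with scale 1/epsilon.

    Assumption: the statistic is a simple count, so adding or removing one
    person's record changes it by at most 1 (sensitivity = 1).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = 1.0
    rate = epsilon / sensitivity
    # The difference of two independent exponential draws with the same rate
    # follows a Laplace distribution centered at 0 with scale 1/rate.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise


rng = random.Random(2018)  # fixed seed only so the sketch is repeatable

# Hypothetical block counts tabulated with and without one respondent.
with_you = laplace_count(100, epsilon=0.5, rng=rng)
without_you = laplace_count(99, epsilon=0.5, rng=rng)
print(f"published with your record:    {with_you:.1f}")
print(f"published without your record: {without_you:.1f}")
```

Because the noise is typically as large as or larger than one person's contribution, the two published values are hard to tell apart, which is the sense in which the output reveals little about any single record. Smaller values of `epsilon` add more noise and give stronger protection; larger values give more accurate statistics. A real system also has to handle floating-point arithmetic and post-processing carefully, which this sketch ignores.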
The Census Bureau did not invent the science behind differential privacy. 2 However, we were the first organization anywhere to use it when we incorporated differential privacy into the OnTheMap application in 2008. In that application, it protects block-level residential population data. 3 Recently, Google, Apple, Microsoft, and Uber have all followed the Census Bureau’s lead, adopting differentially private systems as the standard for protecting user data confidentiality inside their browsers (Chrome), products (iPhones), operating systems (Windows 10), and apps (Uber).
Expanding these hardened and tested confidentiality protections to our flagship products, beginning with the 2020 Census, is a complicated task that the Bureau has taken years to meticulously plan and implement. Nothing of this scope and scale has ever been done before by a statistical agency or a private business.
The first Census Bureau product that will use the new system will be prototype redistricting data from the 2018 Census Test. This confidentiality protection system will provide the foundation for safeguarding all the data of the 2020 Census. It will then be adapted to protect publications from the American Community Survey, economic censuses, and eventually all of our statistical releases.
1. Narayanan, Arvind and Vitaly Shmatikov. 2008. “Robust De-anonymization of Large Sparse Datasets,” SP’08, pp. 111-124. Washington, DC, USA: IEEE Computer Society, DOI: 10.1109/SP.2008.33.
2. Dwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. “Calibrating Noise to Sensitivity in Private Data Analysis,” TCC’06, pp. 265-284. Berlin, Heidelberg: Springer-Verlag, DOI: 10.1007/11681878_14.
3. Machanavajjhala, Ashwin, Daniel Kifer, John Abowd, Johannes Gehrke, and Lars Vilhuber. 2008. “Privacy: Theory Meets Practice on the Map,” ICDE’08, pp. 277-286. Washington, DC, USA: IEEE Computer Society, DOI: 10.1109/ICDE.2008.4497436.