In our last blog, we discussed the feedback we received from the data user community about demonstration data released last fall that were produced using the interim version of the 2020 Disclosure Avoidance System (DAS). It was clear that the fall 2019 version of the DAS TopDown Algorithm (TDA) introduced unacceptable amounts of error and distortion into statistics used for many important use cases. In that blog, we also discussed our ongoing plans to improve the algorithm to address and mitigate this error.
The team responsible for developing the DAS uses an agile development approach, which implements improvements to the system in a series of four-week development sprints. During the sprint that concluded in March 2020, we began implementing changes to address those issues. The most notable change involved how the TDA converts the formally private noisy tabulations taken from the confidential data into the non-negative integer counts that will be published, an operation that we call “post-processing.”
Previously, the TDA conducted the post-processing of all of the statistics for a particular geographic level at the same time. Unfortunately, as we saw in the demonstration data, the TDA had difficulty accurately performing this optimization when large quantities of statistics with zeros or very small values were processed at the same time. The result was distortions in the data that effectively moved individuals from high- to low-density populations (e.g., from cities to rural areas, or from larger race groups to smaller race groups).
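One way to build intuition for this effect is a toy simulation. The sketch below is not the TDA's optimization: it assumes symmetric zero-mean noise (drawn with `random.gauss` purely for illustration), made-up counts, and a naive "clip negatives, then rescale" rule standing in for post-processing.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Made-up data: one large cell and many empty cells.
true_counts = [1000] + [0] * 99
total = sum(true_counts)

# Assumption: symmetric, zero-mean noise. The scale of 5 is arbitrary.
noisy = [c + random.gauss(0, 5) for c in true_counts]

# Naive joint post-processing: clip negatives to zero, then rescale
# so that the cells still add up to the known total.
clipped = [max(0.0, v) for v in noisy]
factor = total / sum(clipped)
adjusted = [v * factor for v in clipped]

large_cell = adjusted[0]
small_cells_sum = sum(adjusted[1:])
print(f"large cell after adjustment: {large_cell:.1f} (true value {true_counts[0]})")
print(f"sum of truly empty cells:    {small_cells_sum:.1f} (true value 0)")
```

Because clipping can only raise the empty cells, their combined value ends up above zero, and rescaling to the fixed total then pulls the large cell down. In this toy setting, population appears to move from the dense cell to the sparse ones.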
During the March sprint, we implemented a change to the algorithm design to address and mitigate this issue. Now, the TDA conducts the post-processing in a series of passes through all the geographic levels.
At the national level, the state level, and finally at each lower level of geography, the first pass of the algorithm solely determines the population count for each unit within that geographic level (e.g., for all census tracts within a county).
Once those total population counts are determined, the second pass of the algorithm processes just the statistics necessary to produce the redistricting data (also known as the Public Law 94-171 data file), constraining those statistics to the sum of the population counts determined in the first pass.
The third pass through the algorithm then processes the core statistics necessary to support population by age, sex, and broad race/ethnicity categories for the demographic analyses that underlie the Population Estimates Program. Third-pass statistics are constrained to the sum of the statistics produced for the redistricting data.
A final pass through TDA processes the remainder of the statistics necessary for the Demographic and Housing Characteristics files and the Demographic Profiles, constraining these values to the sum of the ones produced in the third pass.
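The way each pass is constrained by the one before it can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the TDA's actual method: the helper `fit_to_total`, its largest-remainder rounding rule, and every number below are hypothetical and chosen only to show totals being fixed first and detail being fitted to them afterward.

```python
from math import floor


def fit_to_total(noisy, total):
    """Turn noisy values into non-negative integers that sum to `total`.

    Toy stand-in for post-processing: clip negatives to zero, rescale
    proportionally, then round with the largest-remainder rule.
    """
    clipped = [max(0.0, v) for v in noisy]
    base = sum(clipped)
    if base == 0:
        # Nothing left after clipping: spread the total evenly.
        scaled = [total / len(noisy)] * len(noisy)
    else:
        scaled = [v * total / base for v in clipped]
    counts = [floor(v) for v in scaled]
    shortfall = total - sum(counts)
    # Give the leftover units to the cells with the largest fractional parts.
    order = sorted(range(len(scaled)),
                   key=lambda i: scaled[i] - counts[i], reverse=True)
    for i in order[:shortfall]:
        counts[i] += 1
    return counts


# First pass (toy): settle total population for each tract in one county.
county_total = 1000  # assumed already fixed at the level above
noisy_tract_totals = [412.3, -3.1, 595.8]
tract_totals = fit_to_total(noisy_tract_totals, county_total)

# Later pass (toy): fit two detail cells per tract to that tract's total.
noisy_detail = [
    [300.4, 110.9],
    [1.7, -2.2],
    [450.2, 139.5],
]
detail = [fit_to_total(cells, t) for cells, t in zip(noisy_detail, tract_totals)]

print(tract_totals)
print(detail)
```

In this sketch the detail cells can never disturb the tract totals, because the totals are settled before the detail is processed. That ordering is the point of the example; the rounding rule itself is interchangeable.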
At the same time, the team examined options for improving the accuracy of population counts for legal and political entities, including American Indian, Alaska Native and Native Hawaiian areas, minor civil divisions, incorporated places, etc. Census Bureau geography experts determined the optimal geographic entities to prioritize for accuracy within each state based on knowledge gathered from decades of preparing geographic hierarchies in support of state and local government objectives.
The DAS geographic hierarchy itself was not modified. However, the way the total population query was handled in the latest version of the DAS demonstrates that population accuracy is now controlled directly by the privacy-loss budget rather than by errors induced by post-processing.
Identifying and prioritizing future improvements to the DAS requires ongoing dialogue with our data users. To facilitate that dialogue, we are committed to demonstrating how much each major change to the TDA design improves accuracy and “fitness for use” of the resulting data for many of the priority use cases identified by our data users. As we previously discussed, the Census Bureau has developed a comprehensive suite of error measures to use to evaluate the improvements we are making to the algorithm throughout 2020. We are consulting with a group of experts identified by the Committee on National Statistics to ensure that these are the appropriate accuracy measures to use. We also welcome input from our other data users. You can send suggestions and feedback to <2020DAS@census.gov>.
On May 27, we published Detailed Summary Metrics, which are an evaluation of a full run of TDA from the March sprint that incorporated our new multipass approach to post-processing. Comparing the accuracy of this data set to baseline measures run on the 2010 Demonstration Data Products shows we have substantially reduced the error associated with population counts in the demonstration data.
For example, in the 2010 Demonstration Data Products, the total population count for the average county was off by approximately 82 people (0.78%). With the algorithmic improvements we implemented in March, that error dropped to just 16 people (0.14%). These improvements are also observable at lower levels of geography. In the demonstration product run, total population for the average census tract was off by almost 26 people; now that error has been reduced to just 14.5 people. At the block level, error in the population for the average urban census block dipped from 9.2 to 7.7 people.
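To put these figures on a common scale, the relative reduction in error at each level can be computed directly from the numbers above. The short sketch below is only arithmetic on the quoted values; the helper name `relative_reduction` is illustrative, and the tract baseline uses 26 as an approximation of "almost 26."

```python
# (error before, error after), in people, as quoted in the text above.
errors = {
    "county": (82.0, 16.0),
    "census tract": (26.0, 14.5),  # "almost 26" approximated as 26
    "urban census block": (9.2, 7.7),
}


def relative_reduction(before, after):
    """Fraction by which the error shrank relative to the baseline."""
    return (before - after) / before


for level, (before, after) in errors.items():
    print(f"{level}: error reduced by {relative_reduction(before, after):.0%}")
```

On these figures the reduction is roughly 80% for counties, 44% for census tracts, and 16% for urban census blocks, so the gains are largest at higher levels of geography.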
These accuracy improvements come without any reduction in the strength of the privacy guarantee. That is, the privacy-loss budget for both DAS runs held constant, so the observed improvements are directly attributable to improvements in our post-processing algorithm. More work remains to be done, however, and we look forward to sharing our progress with you through this blog and additional releases of the accuracy measures on future runs of the DAS.