The term “synthetic data” can mean different things depending on how it is used. In computer programming, it often means data that are completely simulated for testing purposes. In statistics, it can mean combining data, often from multiple sources, to produce estimates for more granular populations than any one source can support; the U.S. Census Bureau’s Small Area Income and Poverty Estimates are an example of this usage. In data confidentiality applications, synthetic data are modeled statistical outputs released in a format that closely resembles the confidential data format. Synthetic data can be disaggregated to the individual- or business-record level, or aggregated into tabular format.
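To make the data confidentiality sense of the term concrete, here is a minimal sketch in Python. It fits a very simple model to a handful of made-up records and then draws new records from that model, so the released records resemble the originals in format without copying any of them. The record layout, the per-group normal model, and the function names `fit_model` and `synthesize` are all assumptions made for illustration; they are not the Census Bureau’s method, which relies on far richer models.

```python
import random
import statistics

def fit_model(records):
    """Estimate, per group, the share of records and the mean/stdev of the value."""
    by_group = {}
    for group, value in records:
        by_group.setdefault(group, []).append(value)
    total = len(records)
    return {
        group: {
            "share": len(values) / total,
            "mean": statistics.mean(values),
            # stdev needs at least two observations; fall back to 0 otherwise.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
        for group, values in by_group.items()
    }

def synthesize(model, n, seed=0):
    """Draw n synthetic (group, value) records from the fitted model."""
    rng = random.Random(seed)
    groups = list(model)
    weights = [model[g]["share"] for g in groups]
    synthetic = []
    for _ in range(n):
        group = rng.choices(groups, weights=weights)[0]
        params = model[group]
        value = rng.gauss(params["mean"], params["stdev"])
        synthetic.append((group, round(value)))
    return synthetic

# Made-up stand-in for confidential microdata: (age group, income).
confidential = [
    ("18-34", 31000), ("18-34", 42000), ("18-34", 38000),
    ("35-64", 58000), ("35-64", 67000), ("35-64", 61000), ("35-64", 72000),
    ("65+", 29000), ("65+", 35000),
]

model = fit_model(confidential)
synthetic = synthesize(model, n=1000)
```

Every record in `synthetic` is drawn from the model rather than copied from a respondent, which is what allows a release in the same record-level format as the confidential file. How well the synthetic records reproduce real patterns depends entirely on how good the model is.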
The Census Bureau hasn’t made any decisions yet about the use of fully synthetic data in the ACS. During this exploratory research phase, we welcome feedback from our data users about whether this tool should be considered as a strategy to help us mitigate the data quality issues we have begun to see from lower response rates to surveys, and to provide more accurate data when survey respondents are not representative of the broader community. Throughout this research, we continue to ensure that your responses to our surveys are kept confidential.
We are also invested in ensuring our data users feel confident in the synthetic data if we do determine that this tool is valuable for the ACS. We know that making synthetic data that will satisfy every use case is impossible. That’s why we’re experimenting with allowing data users to validate the synthetic output against internal data. You can learn more about that process in the presentation available at <acsdatacommunity.prb.org/p/conferences>.
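As a rough illustration of what validating synthetic output against internal data could involve, the sketch below computes the same statistic on a synthetic file and on an internal file and reports how far apart the two estimates are. The function names, the 5 percent tolerance, and the income values are hypothetical, chosen only for this example; the actual validation process is the one described in the presentation linked above, and a real system would presumably also review results for disclosure risk before releasing them.

```python
import statistics

def relative_difference(synthetic_estimate, internal_estimate):
    """Absolute difference between the estimates, relative to the internal one."""
    if internal_estimate == 0:
        raise ValueError("internal estimate is zero; relative difference is undefined")
    return abs(synthetic_estimate - internal_estimate) / abs(internal_estimate)

def validate(statistic, synthetic_values, internal_values, tolerance=0.05):
    """Run the same statistic on both files and compare the two results.

    The internal estimate itself is deliberately left out of the result,
    since it is computed from confidential data.
    """
    synthetic_estimate = statistic(synthetic_values)
    internal_estimate = statistic(internal_values)
    diff = relative_difference(synthetic_estimate, internal_estimate)
    return {
        "synthetic_estimate": synthetic_estimate,
        "relative_difference": diff,
        "within_tolerance": diff <= tolerance,
    }

# Made-up incomes standing in for a synthetic file and the internal file.
synthetic_incomes = [50000, 52000, 61000, 47000]
internal_incomes = [51000, 53000, 60000, 48000]

result = validate(statistics.median, synthetic_incomes, internal_incomes)
```

A data user would run an analysis on the public synthetic file, submit the same analysis for validation, and learn whether the synthetic result is close enough to rely on for that particular use.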
Several products from the Longitudinal Employer-Household Dynamics (LEHD) program use synthetic data, including:
Other data products also generate synthetic data using administrative data sets. These include:
We've also developed fully synthetic datasets with validation systems, notably the SIPP Synthetic Beta (SSB) and the Synthetic Longitudinal Business Database (SynLBD).
Synthetic data methods are well known to the statistical community: they have been used in other surveys, such as the SIPP, and are implemented in statistical software packages (e.g., synthpop within R). The Census Bureau is researching a new fully synthetic data product to explore whether this method would allow us to produce more accurate data for our users while maintaining our respondents’ privacy, by correcting for known sources of error and potentially allowing for more tabulations at lower levels of geography. We began conversing with data users about this work in 2019 to solicit feedback and engagement on what we expect will be a multiyear process to explore and research alternatives for the ACS’s future, in an era of declining survey response rates and increasing reliance on data and statistics that reflect an increasingly diverse country. We will continue to provide public updates in multiple forums as well as in blog posts available at <www.census.gov/newsroom/blogs/research-matters/2020/08/acs-disclosure-avoidance-and-release-plans.html>.
All Census Bureau surveys have to balance the competing requirements of releasing statistics and protecting privacy. Modern privacy theory makes clear that retaining accuracy and privacy in our statistical products requires a trade-off. While sampling in surveys may increase privacy, the interaction between formal privacy methods and surveys is an active research area. Synthetic data are one tool we can use to continue to provide granular data at low levels of geography without sacrificing the privacy of our respondents.