What Are Synthetic Data?

May 27, 2021

Synthetic data can mean many different things depending upon the way they are used. Sometimes, as in computer programming, the term means data that are completely simulated for testing purposes. Other times, as in statistics, the term means combining data, often from multiple sources, to produce estimates for more granular populations than any one source can support. An example of this usage is the U.S. Census Bureau’s Small Area Income and Poverty Estimates. In data confidentiality applications, synthetic data are modeled statistical outputs released in a format that closely resembles the confidential data format. Synthetic data can be disaggregated to the individual- or business-record level, or aggregated into tabular format.
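To make the data confidentiality sense of the term concrete, here is a minimal sketch in Python. It assumes a toy setting that is not the Census Bureau's method: a single numeric variable, a simple normal model, and made-up values. The names `fit_model` and `synthesize` are hypothetical. The point it illustrates is that the released records are drawn from a model fitted to the confidential data, and that they keep the same record-level format as the original.

```python
import random
import statistics

def fit_model(confidential_values):
    """Estimate a simple normal model (mean, standard deviation) from the real values."""
    return statistics.mean(confidential_values), statistics.stdev(confidential_values)

def synthesize(model, n, seed=0):
    """Draw n synthetic records from the fitted model instead of releasing real ones."""
    mean, sd = model
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [round(rng.gauss(mean, sd), 2) for _ in range(n)]

# Made-up "confidential" incomes, for illustration only.
confidential = [31000.0, 42000.0, 38500.0, 55000.0, 47250.0, 29900.0]

model = fit_model(confidential)
synthetic = synthesize(model, n=len(confidential))

# The synthetic list has the same shape as the confidential one (one value per
# record), but its values are model draws rather than anyone's reported income.
print(synthetic)
```

A production synthesizer would model many variables jointly and would be evaluated for both accuracy and disclosure risk; this sketch only shows the basic fit-then-sample idea.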

What decisions about the use of synthetic data in the American Community Survey (ACS) have been made?

The Census Bureau hasn’t made any decisions yet about the use of fully synthetic data in the ACS. During this exploratory research phase, we welcome feedback from our data users about whether this tool should be considered as a strategy to help us mitigate the data quality issues we have begun to see from lower response rates to surveys, and to provide more accurate data when survey respondents are not representative of the broader community. We do this while ensuring your responses to our surveys are kept confidential.

We are also invested in ensuring our data users feel confident in the synthetic data if we do determine that this tool is valuable for the ACS. We know that making synthetic data that will satisfy every use case is impossible. That’s why we’re experimenting with allowing data users to validate the synthetic output against internal data. You can learn more about that process in the presentation available at <acsdatacommunity.prb.org/p/conferences>.
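The idea of validating synthetic output against internal data can be sketched as follows. This is an illustrative assumption, not a description of the Census Bureau's actual validation system: the function `validate`, the 5 percent `tolerance`, and the sample values are all hypothetical. The sketch runs the same analysis on both datasets and reports how far the synthetic result is from the internal one.

```python
import statistics

def validate(analysis, synthetic, internal, tolerance=0.05):
    """Run one analysis on both datasets and compare the results.

    `tolerance` is a hypothetical threshold on the relative difference.
    """
    synthetic_result = analysis(synthetic)
    internal_result = analysis(internal)
    if internal_result == 0:
        raise ValueError("relative difference is undefined when the internal result is 0")
    relative_difference = abs(synthetic_result - internal_result) / abs(internal_result)
    return {
        "synthetic": synthetic_result,
        "internal": internal_result,
        "relative_difference": relative_difference,
        "within_tolerance": relative_difference <= tolerance,
    }

# Made-up values, for illustration only.
report = validate(statistics.mean, synthetic=[10, 20, 30], internal=[10, 20, 33])
print(report)
```

In a setup like this, the data user would only ever see the synthetic records and the comparison report; the internal records would stay inside the agency.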

How has the Census Bureau researched and used synthetic data in its products?

Several products from the Longitudinal Employer-Household Dynamics (LEHD) program use synthetic data, including:

the LEHD Origin-Destination Employment Statistics (LODES),
the OnTheMap web application (mapping where workers live),
the Post-Secondary Employment Outcomes (PSEO) Explorer data product,
and the Veterans Employment Outcomes (VEO) Explorer.

Other data products also generate synthetic data using administrative datasets. These include:

the Opportunity Atlas (which measures adulthood outcomes of children by tract),
the Small Area Income and Poverty Estimates (SAIPE) Program,
and the Small Area Health Insurance Estimates (SAHIE) Program.

We have also developed fully synthetic datasets with validation systems, notably the Survey of Income and Program Participation (SIPP) Synthetic Beta (SSB) and the Synthetic Longitudinal Business Database (SynLBD).

What research is the Census Bureau currently exploring concerning synthetic data use on the ACS?

Synthetic data methods are well known in the statistical community: they have been used in other surveys, such as the SIPP, and are implemented in statistical software packages (e.g., synthpop for R). The Census Bureau is researching a new fully synthetic data product to explore whether this method would allow us to produce more accurate data for our users (correcting for known sources of error and potentially allowing for more tabulations at lower levels of geography) while maintaining our respondents’ privacy. We began conversing with data users about this work in 2019 to solicit feedback and engagement. We expect this to be a multiyear process of researching alternatives for the future of the ACS in an era of declining survey response rates and increasing reliance on data and statistics that reflect an increasingly diverse country. We will continue to provide public updates in multiple forums as well as in blog posts available at <www.census.gov/newsroom/blogs/research-matters/2020/08/acs-disclosure-avoidance-and-release-plans.html>.

How are synthetic data and privacy/confidentiality connected?

All Census Bureau surveys have to balance the competing requirements of releasing statistics and protecting privacy. Modern privacy theory makes clear that there is a trade-off between accuracy and privacy in our statistical products: improving one generally comes at some cost to the other. While sampling in surveys may increase privacy, the interaction between formal privacy methods and surveys is an active research area. Synthetic data are one tool we can use to continue to provide granular data at low levels of geography without sacrificing the privacy of our respondents.


Page Last Revised - November 18, 2021