Synthetic Data: Public-Use Micro Data for a Big Data World

October 14, 2014

Written by:

Ron S. Jarmin, Assistant Director, Research and Methodology Directorate
Thomas A. Louis, Associate Director, Research and Methodology Directorate
Javier Miranda, Principal Economist, Center for Economic Studies

Businesses, households and policymakers need timely and accurate data to make informed decisions. National statistical offices around the world hold a wealth of information from survey and administrative sources that could meet these needs. However, their ability to release these data is constrained by the confidentiality pledge they make to respondents.

Synthetic data offer a way to expand the amount of information that national statistical offices can publicly release while maintaining respondent confidentiality. In synthetic datasets, some or all data values are simulated (synthesized) using statistical models designed to mimic the (joint) distributions of the underlying data.
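To make the idea concrete, here is a minimal sketch of synthesis for two numeric variables. It assumes a bivariate normal model, which is far simpler than the models behind any production synthetic dataset, and the function names (`fit_bivariate_normal`, `synthesize`) and the toy "confidential" records are invented for illustration. It shows only the basic pattern: fit a model to the confidential data, then release draws from the model instead of the records themselves.

```python
import math
import random
import statistics


def fit_bivariate_normal(rows):
    """Estimate means, standard deviations, and correlation from (x, y) pairs."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y)


def synthesize(params, n, rng):
    """Draw n simulated (x, y) pairs from the fitted bivariate normal model."""
    mean_x, mean_y, sd_x, sd_y, rho = params
    # Guard against tiny negative values from floating-point rounding.
    resid_scale = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        # Mixing z1 into y reproduces the fitted correlation between x and y.
        y = mean_y + sd_y * (rho * z1 + resid_scale * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Toy stand-in for confidential records (made-up values, not real data).
confidential = []
for _ in range(2000):
    a = rng.gauss(10, 1)
    confidential.append((a, 0.8 * a + rng.gauss(0, 0.5)))

params = fit_bivariate_normal(confidential)
synthetic = synthesize(params, len(confidential), rng)

print("confidential correlation:", round(params[4], 3))
print("synthetic correlation:   ", round(fit_bivariate_normal(synthetic)[4], 3))
```

The synthetic records roughly preserve the means and the correlation of the toy data, yet no synthetic record is a copy of a confidential one. This sketch by itself provides no formal confidentiality guarantee; assessing disclosure risk is a separate step.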

Researchers at the Census Bureau, in partnership with academic economists and statisticians through the Census Bureau’s secure research data centers, recently produced two synthetic public micro datasets. The SIPP-Synthetic Beta product combines survey data from the Survey of Income and Program Participation with administrative records from the Internal Revenue Service and the Social Security Administration (see Benedetto, Stinson and Abowd 2013). The Synthetic Longitudinal Business Database is the first business establishment-level public-use micro dataset made available by a U.S. statistical agency (see Kinney et al. 2011).

Research findings on the development, use, and future applications of synthetic data were presented in a session of the World Statistical Congress held in Hong Kong in August 2013. These articles are accessible in the Statistical Journal of the International Society of Official Statistics.

While synthetic data are exciting and hold great promise, there are challenges to expanding their development and use. Creating synthetic data requires significant technical expertise that is not widely available within many statistical agencies. Census Bureau progress on synthetic data has relied on robust collaboration with academic experts. Users also confront challenges. Synthetic microdata are still experimental and not as straightforward to use as conventional microdata. Users of apps and online tools built on synthetic data, such as OnTheMap, may not realize that the underlying values come from a model, and so may understate the variance of the estimates those tools supply.
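The variance point can be illustrated with a short sketch. It assumes a setting the text above does not spell out: that the data provider releases several partially synthetic copies of the same file, and that the analyst computes an estimate and its usual variance on each copy. Under those assumptions, a combining rule from the partially synthetic data literature adds the spread of the estimates across copies to the average within-copy variance. The function name and the numbers below are hypothetical.

```python
import statistics


def combine_partially_synthetic(estimates, variances):
    """Combine per-dataset estimates from m partially synthetic datasets.

    Returns the averaged point estimate and a total variance equal to the
    average within-dataset variance plus the between-dataset variance / m.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching estimates and variances from m >= 2 datasets")
    q_bar = statistics.fmean(estimates)      # combined point estimate
    b = statistics.variance(estimates)       # between-dataset variance
    u_bar = statistics.fmean(variances)      # average within-dataset variance
    return q_bar, u_bar + b / m


# Hypothetical estimates of one quantity from four synthetic datasets.
estimates = [10.2, 9.8, 10.5, 10.1]
variances = [0.04, 0.05, 0.04, 0.05]

q_bar, total_var = combine_partially_synthetic(estimates, variances)
print("combined estimate:", q_bar)
print("total variance:   ", total_var)
```

In this made-up example the combined variance is larger than the variance computed from any single synthetic dataset, which is the sense in which treating one synthetic file as if it were the original data can understate uncertainty.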

Synthetic data are one way for national statistical organizations to take the lead in making high-quality, reliable official statistics more accessible and relevant. However, creating and supporting synthetic data requires staffing and resources beyond what is generally available to them. The Census Bureau’s “two-way-street” strategy of developing partnerships with academic and funding institutions offers a way to move forward.

Ron S. Jarmin, Assistant Director, Research and Methodology Directorate