Background on the SIPP Synthetic Beta
The SIPP Synthetic Beta (SSB) is a Census Bureau product that integrates person-level micro-data from a household survey with administrative tax and benefit data. These data link respondents from the Survey of Income and Program Participation (SIPP) to Social Security Administration (SSA)/Internal Revenue Service (IRS) Form W-2 records and SSA records of retirement and disability benefit receipt, and were produced by Census Bureau staff economists and statisticians in collaboration with researchers at Cornell University, the SSA, and the IRS. The purpose of the SSB is to provide access to linked data that are usually not publicly available due to confidentiality concerns. To overcome these concerns, Census has synthesized, or modeled, all the variables in a way that changes the record of each individual while preserving the underlying covariate relationships between the variables. The only variables that were not altered by the synthesis process and still contain their original values are gender and a link to the first reported marital partner in the survey.
Seven SIPP panels (1990, 1991, 1992, 1993, 1996, 2001, and 2004) form the basis for the SSB, with a large subset of variables available across all the panels selected for inclusion and harmonization across the years. Administrative data were added and some editing was done to correct for logical inconsistencies in the IRS/SSA earnings and benefits data. Thus the SSB is a particularly appealing data set for new SIPP users because little data preparation is needed. A complete list of variables included in SSB version 5.1, along with details about the harmonization and editing, is available in our Codebook.
As part of the synthesis process, data that are missing, either due to missing survey interviews or missing administrative data, were multiply-imputed. The resulting data sets are called the Completed Gold Standard Files and contain all original, non-missing, confidential values and imputed values in place of originally missing data. These files form the basis for evaluating results from the synthetic data. The goal of the SSB is to produce results that are qualitatively the same as results from the Completed Gold Standard Files.
The synthesis process itself involves estimating the joint distribution of all the variables in the data and taking random draws from this modeled distribution. These draws are then used to replace actual data values. This process is repeated multiple times to create a set of 16 files, also called implicates. For more information on the statistical methods used to create the SSB and formulae for combining results across implicates, please see “The Creation and Use of the SIPP Synthetic Beta.”
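As a rough illustration of what combining results across implicates involves, the sketch below averages a point estimate over the implicates and inflates the variance by the spread between them. It uses Rubin's classic multiple-imputation rule, which is an assumption made here for illustration; the SSB's own combining formulae are given in the document cited above and should be used for real analyses. The function name `combine_implicates` and the sample numbers are hypothetical.

```python
import math
from statistics import mean, variance


def combine_implicates(estimates, variances):
    """Combine per-implicate point estimates and their sampling variances.

    Assumption: this applies Rubin's classic multiple-imputation rule,
    not the SSB-specific formulae from the Census documentation.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two implicates")

    q_bar = mean(estimates)          # combined point estimate
    u_bar = mean(variances)          # average within-implicate variance
    b = variance(estimates)          # between-implicate variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b  # Rubin's total variance
    return q_bar, total, math.sqrt(total)


if __name__ == "__main__":
    # Hypothetical coefficient estimates and squared standard errors,
    # one pair per implicate (only four shown for brevity).
    coefs = [0.112, 0.108, 0.121, 0.115]
    se_sq = [0.0004, 0.0005, 0.0004, 0.0006]
    estimate, total_var, std_err = combine_implicates(coefs, se_sq)
    print(f"combined estimate = {estimate:.4f}, standard error = {std_err:.4f}")
```

In practice the same model would be run once on each of the 16 implicates, and the per-implicate estimates and variances would be passed to a combining step like this one.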
Before releasing these data, Census staff performed extensive testing for disclosure risk and determined that the probability of linking a record from the SSB to an actual person was negligible. For details on this risk assessment, see “DRB Memo SSB Version 5.1.” Based on the conclusions of this testing, the Census Disclosure Review Board and their counterparts at IRS and SSA have approved these data for use by researchers working outside secure Census facilities.
Announcing Release of Version 5.1, May 2013
Version 5.1 of the SSB incorporates modeling improvements and new SIPP variables that expand the scope of analyses that can be performed relative to version 5.0 (released in 2010). In particular, we have added SIPP monthly time series for the following variables: weeks with a job, weeks with pay, usual hours worked, survey-reported earnings, total personal income, any health insurance coverage, and employer-provided health insurance coverage. We have also added two often-requested variables: first, a categorical variable for state of residence at the beginning of the SIPP panel; second, an indicator for whether the individual was linked to administrative records via SSN or whether these records were imputed because no SSN was available. Finally, we have edited the administrative earnings variables prior to the data completion and synthesis process in order to modify some values that we determined to be clerical data errors.
How to Access the SSB
Researchers must submit an application to use the Synthetic Data Server. The application requires contact information, a brief description of the project, and a list of variables to be used. File access will be approved or denied based only on the feasibility of the proposal, which is determined by evaluating whether the data necessary to conduct the analysis are included on the file. Census generally expects to be able to approve applications within five business days. To apply, please submit “Application to use the SIPP Synthetic Beta File” to email@example.com.
The SSB is housed on the Synthetic Data Server (SDS) at the Virtual RDC at Cornell University. A free account on the SDS will be created for each approved user. Using this account, researchers may run SAS and Stata programs using SSB data. Census staff will email program and log files to researchers upon request and without disclosure review since these data are public-use. Users will receive instructions about accessing their account for the first time once their application is approved.
Analytic Validity of the SSB: Disclaimer
The data synthesis process employed by Census to protect the linked data from the risk of disclosing the identity of individuals is relatively new and substantially changes both the survey and administrative data. The intent of the modeling done as part of the synthesis is to preserve relationships among variables that are of interest to researchers while ensuring that personally identifiable information is not revealed to the data user. It has not been feasible to ensure accuracy by comparing every relationship among SSB variables with the corresponding relationship in the underlying confidential micro-data. Hence, we strongly urge researchers not to publish results produced from the SSB without first requesting that Census validate these results with confidential data housed in a secure environment at the Census Bureau. Census will perform this validation free of charge to researchers, as resources permit and according to the protocol established by the three agencies involved and outlined below.
Without validation of results, Census, SSA, and IRS make no guarantee of the validity of the SSB for any research purpose.
Protocol for Validation of Results
Census will validate results obtained from the SSB on the internal, confidential version of these data (Completed Gold Standard Files). Users who wish to obtain validated results should follow the protocol outlined here.
The Future of the SSB and Feedback from Researchers
The SSB is a product that continues to be developed and refined. Census staff are currently working on adding the 1984 and 2008 SIPP panels as well as additional disability variables, weights, and more employer information. We are always interested in hearing from users about which variables they would like to see added to the file. Similarly, unexpected data patterns or variable values, from either SSB or Gold Standard results, should be reported to Census in order to help us continually improve the file.
We request that researchers who publish results from analyses done using these data cite the SSB as their data source and acknowledge the use of the SDS server at Cornell and the support of Census staff in running any validation programs. These citations will help ensure continued funding for the SDS server and the creation of the Gold Standard File and the SSB.
“This analysis was first performed using the SIPP Synthetic Beta (SSB) on the Synthetic Data Server housed at Cornell University which is funded by NSF Grant #SES-1042181. These data are public use and may be accessed by researchers outside secure Census facilities. For more information, visit www.census.gov/sipp/synth_data.html. Final results for this paper were obtained from a validation analysis conducted by Census Bureau staff using the SIPP Completed Gold Standard Files and the programs written by this author and originally run on the SSB. The validation analysis does not imply endorsement by the Census Bureau of any methods, results, opinions, or views presented in this paper.”
U.S. Census Bureau. SIPP Synthetic Beta: Version 5.1 [Computer file]. Washington DC; Cornell University, Synthetic Data Server [distributor], Ithaca, NY, 2013.
Further Questions
For further information about the SIPP Synthetic Beta, please email firstname.lastname@example.org.