
Center for Statistical Research and Methodology (CSRM)

Record Linkage

Motivation: Record linkage is intrinsic to efficient, modern survey operations. It is used to unduplicate and update name and address lists, and in applications such as matching and inserting addresses for geocoding, coverage measurement, the Primary Selection Algorithm during decennial processing, Business Register unduplication and updating, re-identification experiments that verify the confidentiality of public-use microdata files, and new applications with groups of administrative lists. Significant theoretical and algorithmic progress (Winkler 2006a, 2006b, 2008, 2009a, 2013b, 2014a, 2014b; Yancey 2005, 2006, 2007, 2011, 2013) demonstrates the potential for this research. For cleaning up administrative records files that need to be linked, theoretical and extreme computational results (Winkler 2010, 2011b, 2018) yield methods for editing, handling missing data, and even producing synthetic data with valid analytic properties and reduced or eliminated re-identification risk. Because such synthetic data are easy to construct, files can be passed among groups in a straightforward way.

Research Problem:

  • The research problems fall into three major categories. First, we need to develop effective ways of further automating our major record linkage operations. The software needs improvements for matching large sets of files with hundreds of millions of records against other large sets of files. Second, a key open research question is how to estimate matching error rates effectively and automatically. Third, we need to investigate how to develop effective statistical tools for analyzing data from groups of administrative records when unique identifiers are not available. These methods need to show how to carry out correct demographic, economic, and statistical analyses in the presence of matching error. Specific projects conduct methodological research on multiple-list record linkage, error rates, and statistical inference from linked files.
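
To make the matching and error-rate problem concrete, the sketch below scores a pair of records with log-likelihood-ratio agreement weights in the style of the classical Fellegi-Sunter model. The choice of that model is an assumption for illustration and is not stated on this page; the field names, the m- and u-probabilities, and the two thresholds are all hypothetical values, not Census Bureau parameters or software.

```python
import math

# Hypothetical m- and u-probabilities (illustrative values only).
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "birth_year": (0.85, 0.02),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum log2 likelihood ratios over fields, using exact agreement."""
    total = 0.0
    for field, (m, u) in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # missing value: the field contributes no evidence
        if a == b:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(weight, upper=8.0, lower=0.0):
    """Three-way decision rule; both thresholds are hypothetical."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "clerical review"


a = {"last_name": "smith", "first_name": "ann", "birth_year": 1970}
b = {"last_name": "smith", "first_name": "anne", "birth_year": 1970}
print(classify(match_weight(a, b)))
```

In such a model the thresholds control the false-match and false-non-match rates, so choosing them without training data is one way to picture the second research problem above. A production system would also replace exact agreement with string comparators and restrict comparisons through blocking, neither of which is shown here.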

Potential Applications:

  • The projects encompass the Demographic, Economic, and Decennial areas and feature linking administrative records with census (decennial and economic) and sample survey data.

Accomplishments (October 2017 - September 2018):

  • Published results describing new theory and computational algorithms that are many times faster than previous algorithms. The algorithms in the generalized software allow us to clean multiple files with 300 million to 2 billion records in weeks instead of years.
  • Provided advice and software on record linkage methods to Census Bureau Program Divisions.

Short-Term Activities (FY 2019):

  • Provide advice to individuals who plan to update and maintain the programs for record linkage and related data preparation.
  • Conduct research on record linkage error-rate estimation, particularly for unsupervised and semi-supervised situations.

Longer-Term Activities (beyond FY 2019):

  • Develop methods for adjusting statistical analyses for record linkage error. We believe that twenty-two papers in fifty-three years provide between ten and twenty percent of the solution.

Selected Publications:

Alvarez, M., Jonas, J., Winkler, W. E., and Wright, R. “Interstate Voter Registration Database Matching: The Oregon-Washington 2008 Pilot Project,” Electronic Voting Technology.

Herzog, T. N., Scheuren, F., and Winkler, W. E. (2007). Data Quality and Record Linkage Techniques, New York, NY: Springer.

Herzog, T. N., Scheuren, F., and Winkler, W. E. (2010). “Record Linkage,” in (Y. H. Said, D. W. Scott, and E. Wegman, eds.) Wiley Interdisciplinary Reviews: Computational Statistics.

Winkler, W. E. (2006a). “Overview of Record Linkage and Current Research Directions,” Research Report Series (Statistics #2006-02), Statistical Research Division, U.S. Census Bureau, Washington, DC.

Winkler, W. E. (2006b). “Automatically Estimating Record Linkage False-Match Rates without Training Data,” Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, VA, CD-ROM.

Winkler, W. E. (2008). “Data Quality in Data Warehouses,” in (J. Wang, Ed.) Encyclopedia of Data Warehousing and Data Mining (2nd Edition).

Winkler, W. E. (2009a). “Record Linkage,” in (D. Pfeffermann and C. R. Rao, eds.) Sample Surveys: Theory, Methods and Inference, New York: North-Holland, 351-380.

Winkler, W. E. (2009b). “Should Social Security numbers be replaced by modern, more secure identifiers?”, Proceedings of the National Academy of Sciences.

Winkler, W. E. (2010). “General Discrete-data Modeling Methods for Creating Synthetic Data with Reduced Re-identification Risk that Preserve Analytic Properties,” https://www.census.gov/srd/papers/pdf/rrs2010-02.pdf .

Winkler, W. E. (2011). “Machine Learning and Record Linkage” in Proceedings of the 2011 International Statistical Institute.

Winkler, W. E. (2013). “Record Linkage,” in Encyclopedia of Environmetrics. J. Wiley.

Winkler, W. E. (2013). “Cleanup and Analysis of Sets of National Files,” Federal Committee on Statistical Methodology, Proceedings of the Bi-Annual Research Conference, http://www.copafs.org/UserFiles/file/fcsm/J1_Winkler_2013FCSM.pdf, https://fcsm.sites.usa.gov/files/2014/05/J1_Winkler_2013FCSM.pdf

Winkler, W. E. (2014a). “Matching and Record Linkage,” Wiley Interdisciplinary Reviews: Computational Statistics, http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WICS1317.html, DOI: 10.1002/wics.1317, available from author by request for academic purposes.

Winkler, W. E. (2014b). “Very Fast Methods of Cleanup and Statistical Analysis of National Files,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM.

Winkler, W. E. (2015). “Probabilistic Linkage,” in (H. Goldstein, K. Harron, C. Dibben, eds.) Methodological Developments in Data Linkage, J. Wiley: New York.

Winkler, W. E. (2018, to appear). “Cleaning and Using Administrative Lists: Enhanced Practices and Computational Algorithms for Record Linkage and Modeling/Editing/Imputation,” in (A. Y. Chun and M. D. Larsen, eds.) Administrative Records for Survey Methodology, J. Wiley: New York, NY.

Winkler, W. E., Yancey, W. E., and Porter, E. H. (2010). “Fast Record Linkage of Very Large Files in Support of Decennial and Administrative Records Projects,” Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, VA.

Yancey, W. E. (2005). “Evaluating String Comparator Performance for Record Linkage,” Research Report Series (Statistics #2005-05), Statistical Research Division, U.S. Census Bureau, Washington, DC.

Yancey, W. E. (2007). “BigMatch: A Program for Extracting Probable Matches from a Large File,” Research Report Series (Computing #2007-01), Statistical Research Division, U.S. Census Bureau, Washington, DC.

Contact: William E. Winkler, Edward H. Porter, Emanuel Ben-David

Funding Sources for FY 2018:

  • 0331 – Working Capital Fund / General Research Project
    Various Decennial, Demographic, and Economic Projects


Source: U.S. Census Bureau | Research and Methodology Directorate | Center for Statistical Research & Methodology | (301) 763-9862 (or lauren.emanuel@census.gov) |   Last Revised: October 02, 2018