Record linkage is intrinsic to efficient, modern survey operations. It is used to unduplicate and update name and address lists, and for applications such as matching and inserting addresses for geocoding, coverage measurement, the Primary Selection Algorithm during decennial processing, Business Register unduplication and updating, re-identification experiments that verify the confidentiality of public-use microdata files, and new applications with groups of administrative lists. Significant theoretical and algorithmic progress (Winkler 2006ab, 2008, 2009a, 2013b, 2014a, 2014b; Yancey 2005, 2006, 2007, 2011, 2013) demonstrates the potential for this research. For cleaning up administrative records files that need to be linked, theoretical and extreme computational results (Winkler 2010, 2011b) yield methods for editing, handling missing data, and even producing synthetic data with valid analytic properties and reduced or eliminated re-identification risk. Easy means of constructing synthetic data make it straightforward to pass files among groups.

Research Problem

  • The research problems fall into three major categories. First, we need to develop effective ways of further automating our major record linkage operations. The software needs improvements for matching large sets of files with hundreds of millions of records against other large sets of files. Second, a key open research question is how to effectively and automatically estimate matching error rates. Third, we need to investigate how to develop effective statistical analysis tools for analyzing data from groups of administrative records when unique identifiers are not available. These methods need to show how to do correct demographic, economic, and statistical analyses in the presence of matching error. Specific projects conduct methodological research on multiple-list record linkage, error rates, and statistical inference from linked files.
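To make the matching problem concrete, the sketch below scores a pair of records with a Fellegi-Sunter style match weight, one standard way to frame probabilistic record linkage when no unique identifier is available. It is only an illustration under stated assumptions: the field names, the m- and u-probabilities, and the decision cutoffs are hypothetical, and nothing here is taken from the Census Bureau's production software.

```python
import math

# Hypothetical agreement probabilities, for illustration only:
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
M_PROBS = {"last_name": 0.95, "first_name": 0.90, "zip": 0.85}
U_PROBS = {"last_name": 0.01, "first_name": 0.05, "zip": 0.10}


def match_weight(rec_a, rec_b):
    """Sum log2 likelihood ratios over fields, assuming conditional independence."""
    total = 0.0
    for field, m in M_PROBS.items():
        u = U_PROBS[field]
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Map a weight to a decision; the cutoffs are arbitrary placeholders."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"  # would typically go to clerical review


# Made-up records: a and b share a last name but differ elsewhere.
a = {"last_name": "SMITH", "first_name": "JOHN", "zip": "20233"}
b = {"last_name": "SMITH", "first_name": "JON", "zip": "20746"}
c = {"last_name": "JONES", "first_name": "MARY", "zip": "10001"}

for other in (a, b, c):
    w = match_weight(a, other)
    print(f"{w:7.2f}  {classify(w)}")
```

Exact string comparison is used only to keep the sketch short; in practice, approximate string comparators and blocking would be needed before anything like this could scale to files with hundreds of millions of records.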

Potential Applications

  • The projects encompass the Demographic, Economic, and Decennial areas and feature linking administrative records with census (decennial and economic) and sample survey data.

Accomplishments (October 2017 - September 2018)

  • Published results describing new theory and computational algorithms that are many times faster than previous algorithms. The algorithms in the generalized software allow us to clean multiple files with 300 million to 2 billion records in weeks instead of years.
  • Provided advice and software on record linkage methods to Census Bureau Program Divisions.

Short-Term Activities (FY 2019 and FY 2020)

  • Provide advice to individuals who plan to update and maintain the programs for record linkage and related data preparation.
  • Conduct research on record linkage error-rate estimation, particularly for unsupervised and semi-supervised situations.
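As a rough illustration of what unsupervised error-rate estimation can look like, the sketch below fits a two-class mixture (matches and non-matches) to field agreement patterns with the EM algorithm, then uses the fitted posterior probabilities to estimate the false match rate among pairs designated as links. This is a textbook-style sketch, not the method under research: the conditional independence assumption, the starting values, the 0.5 cutoff, and the synthetic counts (generated from made-up parameters) are all assumptions introduced here.

```python
from itertools import product


def class_likelihoods(pattern, p, m, u):
    """Joint likelihood of a 0/1 agreement pattern with each latent class."""
    lm, lu = p, 1.0 - p
    for j, agrees in enumerate(pattern):
        lm *= m[j] if agrees else 1.0 - m[j]
        lu *= u[j] if agrees else 1.0 - u[j]
    return lm, lu


def posterior(pattern, p, m, u):
    """P(match | agreement pattern) under the mixture model."""
    lm, lu = class_likelihoods(pattern, p, m, u)
    return lm / (lm + lu)


def em_two_class(patterns, counts, n_iter=300, eps=1e-6):
    """EM for a two-class mixture with conditionally independent binary fields."""
    k = len(patterns[0])
    total = sum(counts)
    p, m, u = 0.1, [0.9] * k, [0.1] * k  # arbitrary starting values
    for _ in range(n_iter):
        # E-step: posterior match probability for each pattern
        w = [posterior(g, p, m, u) for g in patterns]
        # M-step: re-estimate the match proportion and agreement probabilities
        n_match = sum(c * wi for c, wi in zip(counts, w))
        n_non = total - n_match
        p = min(max(n_match / total, eps), 1 - eps)
        for j in range(k):
            agree_m = sum(c * wi * g[j] for g, c, wi in zip(patterns, counts, w))
            agree_u = sum(c * (1 - wi) * g[j] for g, c, wi in zip(patterns, counts, w))
            m[j] = min(max(agree_m / n_match, eps), 1 - eps)
            u[j] = min(max(agree_u / n_non, eps), 1 - eps)
    return p, m, u


def estimated_false_match_rate(patterns, counts, p, m, u, cutoff=0.5):
    """Model-based share of non-matches among pairs with posterior >= cutoff."""
    linked = [(c, posterior(g, p, m, u)) for g, c in zip(patterns, counts)]
    linked = [(c, w) for c, w in linked if w >= cutoff]
    n_linked = sum(c for c, _ in linked)
    if n_linked == 0:
        return 0.0
    return sum(c * (1 - w) for c, w in linked) / n_linked


# Synthetic pair counts built from made-up "true" parameters (no real data).
TRUE_P, TRUE_M, TRUE_U = 0.05, (0.9, 0.9, 0.8), (0.1, 0.05, 0.2)
patterns = list(product((0, 1), repeat=3))
counts = [round(100_000 * sum(class_likelihoods(g, TRUE_P, TRUE_M, TRUE_U)))
          for g in patterns]

p_hat, m_hat, u_hat = em_two_class(patterns, counts)
fmr = estimated_false_match_rate(patterns, counts, p_hat, m_hat, u_hat)
print(f"estimated match proportion: {p_hat:.3f}")
print(f"estimated false match rate among links: {fmr:.3f}")
```

An estimate of this kind is only as good as the mixture model behind it. When the independence assumption fails or the two classes overlap heavily, the estimated error rate can be badly off, which is part of why automatic error-rate estimation remains an open question.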

Longer-Term Activities (beyond FY 2020)

  • Develop methods for adjusting statistical analyses for record linkage error. We believe that twenty-two papers in fifty-three years provide between ten and twenty percent of the solution.

Selected Publications

Yves Thibaudeau, Edward H. Porter, Emanuel Ben-David, Dan Weinberg

Funding Sources for FY 2019/FY 2020

  • 0331 - Working Capital Fund / General Research Project
    Various Decennial, Demographic, and Economic Projects
