Record linkage continues to grow in importance as a fundamental activity in statistical agencies. The number of available administrative lists and commercial files has grown exponentially, presenting statistical agencies with opportunities to accumulate information through record linkage to support the production of official statistics. In addition to cost, new obstacles to traditional data collection have emerged in the form of possibly recurrent pandemics. These circumstances further motivate the accumulation of information by linking public, private, and administrative files.
The solutions developed at the Census Bureau, such as BigMatch, have been shown to perform better than competitors in general (see, for example, Arthun, Gilary, McGinnis, and Zamora 2025) and are highly flexible. The challenge going forward is to take advantage of the ever-increasing computational power available and to incorporate the latest scientific advances into an equally functional setup.
Thibaudeau (2020) describes the strides the Census Bureau and the Statistical Research Division (now the Center for Statistical Research & Methodology) have made over the years. While this progress is impressive, more is needed. The Census Bureau must remain abreast of the ever-improving state of the art in record linkage and be prepared to champion its own methodologies as among the best in the world. Our goal is to achieve the synergy of methods and software that will most benefit the Census Bureau and its mission. System portability is also an objective: the Census Bureau should have the freedom to upgrade its IT infrastructure knowing that record-linkage applications will remain functional.
· Multiple evaluations at the Census Bureau (see, for example, Arthun, Gilary, McGinnis, and Zamora 2025) have shown that the record-linkage software developed in CSRM/SRD, such as BigMatch, generally performs better than open-source competitors. The challenge is to maintain the same versatility as the methodology of BigMatch is improved. An important effort in that direction was initiated to perform “Multi-file Simultaneous Record Linkage.” Sadinle and Fienberg (2013) introduce a formal theory for multi-file record linkage based on comprehensive partitioning. Partitioning accounts for all possible configurations of simultaneous record matches within a set of files, thereby ensuring pairwise transitivity and preventing logical contradictions (a small illustrative sketch follows this list). This approach transcends traditional attempts at linking multiple files at the Census Bureau and other institutions, which were mostly based on linking pairs of records in isolation and enforcing business rules to combine the pairs into multi-record matches. Multi-file record linkage, as proposed by Sadinle and Fienberg, proceeds from information on the constructs underlying all logically possible assignments of multi-record matches as a whole. As such, rigorous multi-file record linkage is an NP-hard problem. The work of Sadinle and Fienberg (2013) and more recent work (Steorts 2015; Marchant et al. 2021, 2023) aim at finessing the computational difficulty of multi-file record linkage through probabilistic algorithms. The prospect is logically valid multi-file record linkage, which cannot be achieved using traditional (Fellegi-Sunter) methods. This raises the potential of retrieving a full spectrum of logically valid record matches among the records of a census, a post-enumeration survey, and internal or third-party administrative files simultaneously, rather than piecing together initially independent record pairs, as is mostly done at this time.
· Markov chain Monte Carlo (MCMC) methods, like those powering d-blink, give a full probabilistic characterization of the record-linkage process and are becoming indispensable for full comprehension of a record-linkage application. At the same time, MCMCs can be tuned to deliver fast snapshots of the linked population, and research in that direction is crucial. Older programs like BigMatch have been greatly optimized for fast linking but lack nuance. They need to be augmented with richer comparison schemes, such as dictionary-assisted fuzzy string comparisons (a sketch follows this list).
· Only MCMCs and other dynamic processes offer a full probabilistic characterization of record linkage, but they struggle to achieve the scalability of Fellegi-Sunter and other clustering algorithms, such as latent-class analysis. Approximations improving scalability include large-sample approximations and variational approximations, which are known to be accurate and computationally frugal. Hybrids and compromises between MCMCs and the static approach are also possible. One is the dimensionally collapsible models described in Weinberg and Thibaudeau (2025). Models derived from algebraic geometry can be expanded “on the spot” to reflect the dimensionality of the clusters subject to matching (entity resolution). d-blink takes care of dimensionally extending or collapsing structures automatically at the unit level, which is computationally expensive. The approach of Weinberg and Thibaudeau instead fits several dimensionally collapsed models to a specific situation, which offers an advantageous middle ground when the number of models to be fitted is not too large. As the number of dimensions (matching fields) and the number of possible dimension collapses increase, this approach also becomes computationally onerous. Identifying the most practical solution in specific situations is the basic challenge of record linkage going forward.
· New data structures for record linkage of multiple large lists need to be explored. d-blink is an example of a more efficient data structure: node-connected structures minimize the number of comparisons, as opposed to traditional all-pairwise comparison. Other structures are possible, such as cyclical linked lists (Thibaudeau 1992), and should be researched (a sketch of the general idea follows this list).
· As new techniques continue to be implemented and tested on various software (R, Python, C) and hardware (Windows, OSX, IRE, CAES) platforms, the dominant paradigms are emerging, and work toward integration and unification, while maintaining versatility, is moving into high gear.
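The following minimal sketch, in Python with hypothetical record identifiers and accepted links, illustrates the core idea behind partition-based multi-file linkage: when the linkage is represented as a partition of all records (here via a simple union-find structure), transitivity holds by construction, whereas accepting pairwise links in isolation can leave logical contradictions. It is only an illustration of the principle, not the algorithm of Sadinle and Fienberg (2013) nor the Census Bureau's implementation.

from collections import defaultdict

class DisjointSet:
    """Union-find over record identifiers; connected components play the role of entities."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

# Hypothetical accepted pairwise links among records of three files A, B, and C.
accepted_pairs = [("A1", "B7"), ("B7", "C3"), ("A2", "B9")]

ds = DisjointSet()
for u, v in accepted_pairs:
    ds.union(u, v)

# Group records by entity: "A1", "B7", and "C3" land in one cluster even though
# the pair ("A1", "C3") was never compared directly, so transitivity holds.
clusters = defaultdict(list)
for rec in {r for pair in accepted_pairs for r in pair}:
    clusters[ds.find(rec)].append(rec)
print(sorted(sorted(c) for c in clusters.values()))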
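The next sketch, also in Python, shows what a dictionary-assisted fuzzy string comparison can look like: known nicknames are mapped to a canonical form before applying the Jaro-Winkler comparator long used in the record-linkage literature (see Yancey 2005). The nickname table and the prefix weight are illustrative assumptions, not an official Census Bureau dictionary.

def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings (0 = no agreement, 1 = exact)."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        return 0.0
    window = max(max(n1, n2) // 2 - 1, 0)
    match1 = [False] * n1
    match2 = [False] * n2
    matches = 0
    for i, c in enumerate(s1):                          # count matching characters
        lo, hi = max(0, i - window), min(i + window + 1, n2)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0                            # count transpositions
    for i in range(n1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / n1 + matches / n2 + (matches - t) / matches) / 3.0

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler: boost the Jaro score for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)

# Illustrative nickname dictionary (an assumption for this sketch): canonicalize
# known nicknames before applying the fuzzy comparator.
NICKNAMES = {"BILL": "WILLIAM", "LIZ": "ELIZABETH", "BOB": "ROBERT"}

def compare_names(a: str, b: str) -> float:
    a, b = a.strip().upper(), b.strip().upper()
    return jaro_winkler(NICKNAMES.get(a, a), NICKNAMES.get(b, b))

print(compare_names("Bill", "William"))        # 1.0 after dictionary lookup
print(compare_names("Jonathan", "Johnathan"))  # high, but below 1.0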
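Finally, the sketch below illustrates the general point of the data-structure bullet: an indexed structure restricts comparisons to records sharing a key, so the number of candidate pairs grows with block sizes rather than with the product of the file sizes. For simplicity it uses a plain inverted index (blocking); it does not reproduce d-blink's node-connected structures or the cyclical linked lists of Thibaudeau (1992), and the records and blocking key are made up.

from collections import defaultdict
from itertools import combinations

# Hypothetical records from two lists; the blocking key is ZIP code plus the
# first letter of the surname.
records = [
    {"id": "A1", "name": "SMITH", "zip": "20746"},
    {"id": "A2", "name": "JONES", "zip": "20746"},
    {"id": "B1", "name": "SMYTH", "zip": "20746"},
    {"id": "B2", "name": "JONES", "zip": "21201"},
]

def blocking_key(r):
    return (r["zip"], r["name"][0])

index = defaultdict(list)            # inverted index: key -> records in the block
for r in records:
    index[blocking_key(r)].append(r)

candidate_pairs = [
    (a["id"], b["id"])
    for block in index.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)               # only within-block pairs, not all 6 possible pairs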
· Adjusting the Statistical Analysis on Integrated Data (Ben-David)
· Entity Resolution and Merging Noisy Databases (Steorts, Brown/CES, Blalock/DSMD, Thibaudeau, Aleshin-Guendel)
· Possible massive concurrent record-linkage implementations for Census 2030. The objective is counting all distinguishable persons in linked and unduplicated administrative and commercial person-level lists.
· Unduplication and record-linkage for frame construction in the demographic and economic areas.
· Re-identification through record-linking for proofing confidentiality of data lists.
· Analysis and estimation based on linked lists.
· Linking probabilistic design-based surveys to large non-probability lists and samples for probabilistic calibration.
· Staff performed IRS 1040 to 2020 Census matching nationwide. Matches were placed into six categories based on the degree of address similarity.
· Staff performed analyses of population counts using administrative records, specifically the Demographic Frame dataset, and identified a subset of people who have multiple records at different physical addresses. Staff is researching where those people should be counted, as well as alternatives for handling those cases in model fitting and prediction.
· Staff evaluated the software package SPLINK against BigMatch, examining speed, accuracy, and work requirements.
· Staff evaluated the current LRS model, tested whether the Census and ACS response rates result in a similar LRS model, updated the LRS model with 2020 response rates, and developed a new LRS model and additional summary scores for the Planning Data Base (PDB).
· Staff authored an application of the EM algorithm using generalized linear models for estimating the weights of the Fellegi-Sunter record-linkage model and supporting the record-linkage engine BigMatch (a simplified sketch of EM weight estimation follows this list). The new software is written in R and replaces the SRD FORTRAN programs of Winkler/Thibaudeau.
· Used BigMatch for multiple linkage projects, including the linkage of commercial files, in the construction of a master reference file at the person and housing unit levels for research and experimentation in preparation for Census 2030.
· Continuing the roll-out of supporting software (parameter estimation, frequency-table construction), guidance, and expanded documentation for implementing and exploiting BigMatch in general applications as well as for targeted use at the Census Bureau.
· Further documenting the current version of BigMatch (C code) using Doxygen.
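The sketch below, in Python with NumPy and entirely hypothetical comparison-vector counts, shows the basic EM iteration for estimating the match proportion and the Fellegi-Sunter m- and u-probabilities under conditional independence. The production software described above is written in R and uses generalized linear models, so this is only an illustration of the estimation principle, not that software.

import numpy as np

# Each row is a comparison vector over three fields (1 = agree, 0 = disagree)
# observed for candidate record pairs; the counts of pairs per pattern are hypothetical.
gamma = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 1, 0],
])
counts = np.array([50, 20, 100, 800, 30])

p = 0.1                          # starting proportion of matched pairs
m = np.array([0.9, 0.9, 0.9])    # P(agree | match), one per field
u = np.array([0.1, 0.1, 0.1])    # P(agree | nonmatch)

for _ in range(100):
    # E-step: posterior probability that each pattern comes from the match class.
    pm = p * np.prod(m**gamma * (1 - m)**(1 - gamma), axis=1)
    pu = (1 - p) * np.prod(u**gamma * (1 - u)**(1 - gamma), axis=1)
    g = pm / (pm + pu)
    # M-step: update p, m, and u from the weighted pattern counts.
    w_match = g * counts
    w_non = (1 - g) * counts
    p = w_match.sum() / counts.sum()
    m = (w_match[:, None] * gamma).sum(axis=0) / w_match.sum()
    u = (w_non[:, None] * gamma).sum(axis=0) / w_non.sum()

# Fellegi-Sunter agreement weights (log likelihood ratios) per field.
weights = np.log2(m / u)
print("P(match) =", round(p, 3), "agreement weights =", np.round(weights, 2))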
· Continue to research statistical and data-science methods for record linkage. Explore and compare in-house and “off-the-shelf” packages implementing these methods. Ascertain the competency of record-linkage methods at the Census Bureau.
· Extending record linkage outside the PIK universe.
· Develop and implement a generalized and fully supported “user-friendly” version of BigMatch. The user-friendly version is being developed with the assistance of artificial intelligence and will comprise Python code, so users across the Census Bureau can access, easily modify, and customize the code. Full documentation is being developed with AI support so that users can easily navigate the code.
· Construct census-based equivalence dictionaries of U.S. given names and surnames for cross-referencing and supervised learning in record-linkage.
· Further develop Markov chain Monte Carlo applications embedding record-linkage methods in massive parallel processing. Develop methods for extracting record-linkage snapshots from MCMCs (a sketch of one such snapshot follows this list).
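As an illustration of the last bullet, the sketch below takes hypothetical MCMC samples of entity labels and derives a simple linkage “snapshot”: pairwise posterior co-reference probabilities are computed from the samples, and pairs above a threshold are retained. This is a deliberately simple point estimate, not d-blink's procedure; thresholding pairwise probabilities does not by itself guarantee transitivity.

from itertools import combinations

# Hypothetical MCMC output: each sample assigns every record an entity label.
samples = [
    {"A1": 0, "B7": 0, "C3": 0, "A2": 1},
    {"A1": 0, "B7": 0, "C3": 2, "A2": 1},
    {"A1": 0, "B7": 0, "C3": 0, "A2": 1},
]

records = sorted(samples[0])
pairs = list(combinations(records, 2))

# Posterior co-reference probability: share of samples in which the two records
# carry the same entity label.
coref = {
    (r, s): sum(smp[r] == smp[s] for smp in samples) / len(samples)
    for r, s in pairs
}

# "Snapshot": keep links whose posterior co-reference probability exceeds 0.5.
snapshot = [pair for pair, prob in coref.items() if prob > 0.5]
print(coref)
print("linked pairs:", snapshot)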
Thibaudeau, Y. (2025). “A Review of Modern Multinomial-Derived and Partition-Based Record Linkage Methods,” Wiley Interdisciplinary Reviews: Computational Statistics, 17, e70015.
Wang, Z., Ben-David, E., and Slawski, M. (2023). “Regularization for Shuffled Data Problems via Exponential Family Priors on the Permutation Group,” Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Volume 206, 2939-2959. https://proceedings.mlr.press/v206/wang23a.
Steorts, R. (2023). “A Primer on the Data Cleaning Pipeline,” Journal of Survey Statistics and Methodology, 11, 553-568.
Marchant, N.G., Rubinstein, B.I.P., and Steorts, R. (2023). “Bayesian Graphical Entity Resolution Using Exchangeable Random Partition Priors,” Journal of Survey Statistics and Methodology, 11, 569-596.
Deo, N., Sanguthevar R., Joyanta B., Soliman, A., Weinberg, D., and Steorts, R. (In Press). “Novel Blocking Techniques and Distance Metrics for Record Linkage,” Proceedings of the 25th International Conference on Information Integration and Web Intelligence (iiWAS), Lecture Notes in Computer Science, Springer.
Basak, J., Soliman, A., Deo, N., Haase, K., Mathur, A., Park, K., Steorts, R., Weinberg, D., Sahni, S., and Sanguthevar, R. (2023). “On Computing the Jaro Similarity Between Two Strings,” Proceedings of the 19th International Symposium on Bioinformatics Research and Applications, Springer, 31-44.
Aleshin-Guendel, S. and Steorts, R. (In Press). “Monitoring Convergence Diagnostics for Entity Resolution,” Annual Review of Statistics and Its Application.
Betancourt, B., Zanella, G., and Steorts, R. (In Press). “Random Partition Models for Microclustering Tasks,” Journal of the American Statistical Association, Theory and Methods.
Mosaferi, S., Ghosh, M., and Steorts, R. (In Press). “Measurement Error Models for Small Area Estimation,” Communications in Statistics: Simulation and Computation.
Wang, Z., Ben-David, E., Diao, G., & Slawski, M. (In Press). “Estimation in Exponential Family Regression Based on Linked Data Contaminated by Mismatch Error,” Statistics and Its Interface.
Wang, Z., Ben-David, E., Diao, G., & Slawski, M. (2022). “Regression with Linked Datasets Subject to Linkage Error,” Wiley Interdisciplinary Reviews: Computational Statistics, 14(4).
Marchant, N., Kaplan, A., Rubinstein, B.I.P., Elazar, D., and Steorts, R. (2021). “d-blink: Distributed End-to-End Bayesian Entity Resolution,” Journal of Computational and Graphical Statistics, 30(2), 406-421.
Slawski, M., Diao, G., and Ben-David, E. (2021). “A Pseudo-Likelihood Approach to Linear Regression with Partially Shuffled Data,” Journal of Computational and Graphical Statistics, DOI: 10.1080/10618600.2020.1870482
Thibaudeau, Y., Slud, E., and Cheng, Y. (2021). “Small-Area Estimation of Cross-Classified Gross Flows Using Longitudinal Survey Data,” in Advances in Longitudinal Survey Methodology (Peter Lynn, ed.), Wiley, 469-489.
Slawski, M. and Ben-David, E. (2019). “Linear Regression with Sparsely Permuted Data,” Electronic Journal of Statistics, Vol 13, No. 1, 1-36.
Slud, E. and Thibaudeau, Y. (2019). “Multi-outcome Longitudinal Small Area Estimation – A Case Study,” Statistical Theory and Related Fields, DOI: 10.1080/24754269.2019.1669360.
Steorts, R.J., Tancredi, A., and Liseo, B. (2018). “Generalized Bayesian Record Linkage and Regression with Exact Error Propagation” in Privacy in Statistical Databases (Lecture Notes in Computer Science 11126) (Eds.) Domingo-Ferrer, J. and Montes, F., Springer, 297-313.
Steorts, R.J. and Shrivastava, A. (2018). “Probabilistic Blocking with an Application to the Syrian Conflict,” in Privacy in Statistical Databases (Lecture Notes in Computer Science 11126) (Eds.) Domingo-Ferrer, J. and Montes, F., Springer, 314- 327.
Winkler, W.E. (2018). “Cleaning and Using Administrative Lists: Enhanced Practices and Computational Algorithms for Record Linkage and Modeling/Editing/Imputation,” in (A.Y. Chun and M. D. Larsen, eds.) Administrative Records for Survey Methodology, J. Wiley, New York: NY.
Thibaudeau, Y., Slud, E., and Gottschalck, A. (2017). “Log-Linear Conditional Probabilities for Estimation in Surveys,” Annals of Applied Statistics, 11, 680-697.
Czaja, W., Hafftka, A., Manning, B., and Weinberg, D. (2015). “Randomized Approximations of Operators and their Spectral Decomposition for Diffusion Based on Embeddings of Heterogeneous Data,” 3rd International Workshop on Compressed Sensing Theory and Its Applications to Radar, Sonar and Remote Sensing (CoSeRa).
Winkler, W.E. (2015). “Probabilistic Linkage,” in (H. Goldstein, K. Harron, C. Dibben, eds.) Methodological Developments in Data Linkage, J. Wiley: New York.
Weinberg, D. and Levy, D. (2014). “Modeling Selective Local Interactions with Memory: Motion on a 2D Lattice,” Physica D 278-279, 13-30.
Winkler, W.E. (2014a). “Matching and Record Linkage,” Wiley Interdisciplinary Reviews: Computational Statistics, http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WICS1317.html, DOI:10.1002/wics.1317, available from author by request for academic purposes.
Winkler, W.E. (2013). “Record Linkage,” in Encyclopedia of Environmetrics. J. Wiley.
Herzog, T. N., Scheuren, F., and Winkler, W. E. (2010). “Record Linkage,” in (Y. H. Said, D. W. Scott, and E. Wegman, eds.) Wiley Interdisciplinary Reviews: Computational Statistics.
Winkler, W.E. (2009a). “Record Linkage,” in (D. Pfeffermann and C. R. Rao, eds.) Sample Surveys: Theory, Methods and Inference, New York: North-Holland, 351-380.
Winkler, W.E. (2009b). “Should Social Security Numbers be Replaced by Modern, More Secure Identifiers?”, Proceedings of the National Academy of Sciences.
Alvarez, M., Jonas, J., Winkler, W.E., and Wright, R. “Interstate Voter Registration Database Matching: The Oregon- Washington 2008 Pilot Project,” Electronic Voting Technology.
Winkler, W. E. (2008). “Data Quality in Data Warehouses,” in (J. Wang, Ed.) Encyclopedia of Data Warehousing and Data Mining (2nd Edition).
Herzog, T. N., Scheuren, F., and Winkler, W.E. (2007). Data Quality and Record Linkage Techniques, New York, NY: Springer.
Thibaudeau, Y. (2002). “Model Explicit Item Imputation for Demographic Categories,” Survey Methodology, 28, 135-143.
Thibaudeau, Y. (1993). “The Discrimination Power in Dependency Structure in Record Linkage,” Survey Methodology, 19, 31-38.
Winkler, W.E. (2014b). “Very Fast Methods of Cleanup and Statistical Analysis of National Files,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM.
Winkler, W.E. (2013). “Cleanup and Analysis of Sets of National Files,” Federal Committee on Statistical Methodology, Proceedings of the Bi-Annual Research Conference, http://www.copafs.org/UserFiles/file/fcsm/J1_Winkler_2013FCSM.pdf, https://fcsm.sites.usa.gov/files/2014/05/J1_Winkler_2013FCSM.pdf
Winkler, W.E. (2011). “Machine Learning and Record Linkage” in Proceedings of the 2011 International Statistical Institute.
Winkler, W.E. (2010). “General Discrete-data Modeling Methods for Creating Synthetic Data with Reduced Re-identification Risk that Preserve Analytic Properties,” https://www.census.gov/srd/papers/pdf/rrs2010-02.pdf .
Winkler, W.E., Yancey, W. E., and Porter, E. H. (2010). “Fast Record Linkage of Very Large Files in Support of Decennial and Administrative Records Projects,” Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, VA.
Yancey, W.E. (2007). “BigMatch: A Program for Extracting Probable Matches from a Large File,” Research Report Series (Computing #2007-01), Statistical Research Division, U.S. Census Bureau, Washington, D.C.
Winkler, W.E. (2006a). “Overview of Record Linkage and Current Research Directions,” Research Report Series (Statistics #2006-02), Statistical Research Division, U.S. Census Bureau, Washington, D.C.
Winkler, W.E. (2006b). “Automatically Estimating Record Linkage False-Match Rates without Training Data,” Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, VA, CD-ROM.
Yancey, W.E. (2005). “Evaluating String Comparator Performance for Record Linkage,” Research Report Series (Statistics #2005-05), Statistical Research Division, U.S. Census Bureau, Washington, D.C.
Thibaudeau, Y. (1992). “Identifying Discriminatory Models in Record Linkage,” Proceedings of the Section on Statistical Computing, American Statistical Association, Alexandria, VA.
Winkler, W. and Thibaudeau, Y. (1991). “An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 Decennial Census,” Research Report Series (Statistics) RR91/09, Statistical Research Division, U.S. Census Bureau, Washington, D.C.
Yves Thibaudeau, Edward H. Porter, Emanuel Ben-David, Rebecca Steorts, Dan Weinberg, Serge Aleshin-Guendel
0331 – Working Capital Fund / General Research Project
Various Decennial, Demographic, and Economic Projects