Record linkage continues to grow in importance as a fundamental activity in statistical agencies. The number of available administrative lists and commercial files has grown exponentially, presenting statistical agencies with opportunities to accumulate information through record linkage to support the production of official statistics. In addition to cost, new obstacles to traditional data collection have emerged in the form of possibly recurrent pandemics. These circumstances further motivate the accumulation of information by linking public, private, and administrative files. Thibaudeau (2020) describes the strides the Census Bureau, a pioneer in record linkage, has made over the years. While these strides are impressive, more is needed. With its own suite of in-house record-linkage software packages, such as the “SAS (PVS) Matcher,” “BigMatch,” “d-blink,” and “MAMBA,” and easy access to open-source packages, such as “fastLink” and “RecordLinkage in R,” the Census Bureau now has access to a wide spectrum of methodologies and the potential to rapidly develop and integrate new ones. The Census Bureau must remain abreast of the ever-improving state of the art in record linkage and be prepared to champion its own methodologies as some of the best in the world. Our goal is to achieve the synergy of methods and software that will most benefit the Census Bureau and its mission. System portability is also an objective: the Census Bureau should have the freedom to upgrade its IT infrastructure knowing that record-linkage applications will remain functional.
One challenge is continuing to research and experiment with new methodologies on multiple software platforms while also moving toward integration. Descriptions of such experiments appear in:
Betancourt, B., Zanella, G., and Steorts, R. (In Press). “Random Partition Models for Microclustering Tasks,” Journal of the American Statistical Association, Theory and Methods.
Mosaferi, S., Ghosh, M., and Steorts, R. (In Press). “Measurement Error Models for Small Area Estimation,” Communications in Statistics: Simulation and Computation.
Wang, Z., Ben-David, E., Diao, G., and Slawski, M. (In Press). “Estimation in Exponential Family Regression Based on Linked Data Contaminated by Mismatch Error,” Statistics and Its Interface.
Wang, Z., Ben-David, E., Diao, G., and Slawski, M. (2022). “Regression with Linked Datasets Subject to Linkage Error,” Wiley Interdisciplinary Reviews: Computational Statistics, 14(4), DOI: 10.1002/wics.1570.
Marchant, N., Kaplan, A., Rubinstein, B., Elazar, D., and Steorts, R. (2021). “d-blink: Distributed End-to-End Bayesian Entity Resolution,” Journal of Computational and Graphical Statistics, 30(2), 406-421.
Slawski, M., Diao, G., and Ben-David, E. (2021). “A Pseudo-Likelihood Approach to Linear Regression with Partially Shuffled Data,” Journal of Computational and Graphical Statistics, DOI: 10.1080/10618600.2020.1870482.
Thibaudeau, Y., Slud, E., and Cheng, Y. (2021). “Small-Area Estimation of Cross-Classified Gross Flows Using Longitudinal Survey Data,” Advances in Longitudinal Survey Methodology, 469-489, Peter Lynn ed., Wiley.
Thibaudeau, Y. (In progress). “New Record Linkage Solutions for Demographic Methods at the Census Bureau,” Research Report Series (Statistics #2020-??), Center for Statistical Research & Methodology, U.S. Census Bureau, Washington, D.C.
Slawski, M. and Ben-David, E. (2019). “Linear Regression with Sparsely Permuted Data,” Electronic Journal of Statistics, Vol 13, No. 1, 1-36.
Slud, E. and Thibaudeau, Y. (2019). “Multi-outcome Longitudinal Small Area Estimation – A Case Study,” Statistical Theory and Related Fields, DOI: 10.1080/24754269.2019.1669360.
Steorts, R.J., Tancredi, A., and Liseo, B. (2018). “Generalized Bayesian Record Linkage and Regression with Exact Error Propagation” in Privacy in Statistical Databases (Lecture Notes in Computer Science 11126) (Eds.) Domingo-Ferrer, J. and Montes, F., Springer, 297-313.
Steorts, R.J. and Shrivastava, A. (2018). “Probabilistic Blocking with an Application to the Syrian Conflict,” in Privacy in Statistical Databases (Lecture Notes in Computer Science 11126) (Eds.) Domingo-Ferrer, J. and Montes, F., Springer, 314-327.
Winkler, W. E. (2018). “Cleaning and Using Administrative Lists: Enhanced Practices and Computational Algorithms for Record Linkage and Modeling/Editing/Imputation,” in (A.Y. Chun and M. D. Larsen, eds.) Administrative Records for Survey Methodology, J. Wiley, New York: NY.
Thibaudeau, Y., Slud, E., and Gottschalck, A. (2017). “Log-Linear Conditional Probabilities for Estimation in Surveys,” Annals of Applied Statistics, 11, 680-697.
Czaja, W., Hafftka, A., Manning, B., and Weinberg, D. (2015). “Randomized Approximations of Operators and their Spectral Decomposition for Diffusion Based on Embeddings of Heterogeneous Data,” 3rd International Workshop on Compressed Sensing Theory and Its Applications to Radar, Sonar and Remote Sensing (CoSeRa).
Winkler, W. E. (2015). “Probabilistic Linkage,” in (H. Goldstein, K. Harron, C. Dibben, eds.) Methodological Developments in Data Linkage, J. Wiley: New York.
Weinberg, D. and Levy, D. (2014). “Modeling Selective Local Interactions with Memory: Motion on a 2D Lattice,” Physica D 278-279, 13-30.
Winkler, W. E. (2014a). “Matching and Record Linkage,” Wiley Interdisciplinary Reviews: Computational Statistics, http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WICS1317.html, DOI:10.1002/wics.1317, available from author by request for academic purposes.
Winkler, W. E. (2014b). “Very Fast Methods of Cleanup and Statistical Analysis of National Files,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM.
Winkler, W. E. (2013). “Record Linkage,” in Encyclopedia of Environmetrics. J. Wiley.
Winkler, W. E. (2013). “Cleanup and Analysis of Sets of National Files,” Federal Committee on Statistical Methodology, Proceedings of the Bi-Annual Research Conference, http://www.copafs.org/UserFiles/file/fcsm/J1_Winkler_2013FCSM.pdf, https://fcsm.sites.usa.gov/files/2014/05/J1_Winkler_2013FCSM.pdf.
Winkler, W. E. (2011). “Machine Learning and Record Linkage” in Proceedings of the 2011 International Statistical Institute.
Herzog, T. N., Scheuren, F., and Winkler, W. E. (2010). “Record Linkage,” in (Y. H. Said, D. W. Scott, and E. Wegman, eds.) Wiley Interdisciplinary Reviews: Computational Statistics.
Winkler, W. E. (2010). “General Discrete-data Modeling Methods for Creating Synthetic Data with Reduced Re-identification Risk that Preserve Analytic Properties,” https://www.census.gov/srd/papers/pdf/rrs2010-02.pdf.
Winkler, W. E., Yancey, W. E., and Porter, E. H. (2010). “Fast Record Linkage of Very Large Files in Support of Decennial and Administrative Records Projects,” Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, VA.
Winkler, W. E. (2009a). “Record Linkage,” in (D. Pfeffermann and C. R. Rao, eds.) Sample Surveys: Theory, Methods and Inference, New York: North-Holland, 351-380.
Winkler, W. E. (2009b). “Should Social Security numbers be replaced by modern, more secure identifiers?”, Proceedings of the National Academy of Sciences.
Alvarez, M., Jonas, J., Winkler, W. E., and Wright, R. “Interstate Voter Registration Database Matching: The Oregon-Washington 2008 Pilot Project,” Electronic Voting Technology.
Winkler, W. E. (2008). “Data Quality in Data Warehouses,” in (J. Wang, Ed.) Encyclopedia of Data Warehousing and Data Mining (2nd Edition).
Herzog, T. N., Scheuren, F., and Winkler, W. E. (2007). Data Quality and Record Linkage Techniques, New York, NY: Springer.
Yancey, W. E. (2007). “BigMatch: A Program for Extracting Probable Matches from a Large File,” Research Report Series (Computing #2007-01), Statistical Research Division, U.S. Census Bureau, Washington, D.C.
Winkler, W. E. (2006a). “Overview of Record Linkage and Current Research Directions,” Research Report Series (Statistics #2006-02), Statistical Research Division, U.S. Census Bureau, Washington, D.C.
Winkler, W. E. (2006b). “Automatically Estimating Record Linkage False-Match Rates without Training Data,” Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, VA, CD-ROM.
Yancey, W. E. (2005). “Evaluating String Comparator Performance for Record Linkage,” Research Report Series (Statistics #2005-05), Statistical Research Division, U.S. Census Bureau, Washington, D.C.
Thibaudeau, Y. (2002). “Model Explicit Item Imputation for Demographic Categories,” Survey Methodology, 28, 135-143.
Thibaudeau, Y. (1993). “The Discrimination Power in Dependency Structure in Record Linkage,” Survey Methodology, 19, 31-38.
Thibaudeau, Y. (1992). “Identifying Discriminatory Models in Record Linkage,” Proceedings of the Section on Statistical Computing, American Statistical Association, Alexandria, VA.
Winkler, W. and Thibaudeau, Y. (1991). “An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 Decennial Census,” Research Report Series (Statistics) RR91/09, Statistical Research Division, U.S. Census Bureau, Washington, D.C.
Yves Thibaudeau, Edward H. Porter, Emanuel Ben-David, Rebecca Steorts, Dan Weinberg
0331 – Working Capital Fund / General Research Project
Various Decennial, Demographic, and Economic Projects