Four Cooperative Agreements: Census Bureau Research on Record Linkage and Entity Resolution

October 26, 2021
WRITTEN BY: J. DAVID BROWN, PRINCIPAL ECONOMIST, CENTER FOR ECONOMIC STUDIES; KENNETH HAASE, SENIOR COMPUTER SCIENTIST FOR ARTIFICIAL INTELLIGENCE APPLICATIONS, RESEARCH AND METHODOLOGY DIRECTORATE; ANUP MATHUR, SENIOR COMPUTER SCIENTIST FOR RESEARCH COMPUTING ARCHITECTURE, RESEARCH AND METHODOLOGY DIRECTORATE

The U.S. Census Bureau is delighted to announce the award of four cooperative agreements that will help ensure it can take advantage of advances in entity resolution and record linkage methodology and technology. 

Entity resolution and record linkage refer to the process of matching records in one data source with records in another that describe the same entity. These cooperative agreements, which focus on improving such capabilities, give the Census Bureau a mechanism to engage with the research community and to encourage methodological research and technology development.

These awards are part of an expanded effort by the Census Bureau to use its Cooperative Agreement Authority to partner with top academic and other experts to produce innovative work and to ensure the Census Bureau remains a leading source of quality data about the nation’s people and economy.

Cooperative Agreements for Improvement of Methodology Related to Record Linkage, Entity Resolution and Evaluation

The agreements will focus on improving methodological research on record linkage and entity resolution to facilitate greater use of administrative data in producing statistical information and to reduce survey respondent burden and related costs. In addition, explicitly incorporating linkage uncertainty into measures of total error will enable better quality inferences from blended estimators. The three projects described below will help the Census Bureau achieve these goals.

Linkage and Cleaning of Data (University of Arkansas at Little Rock)

Names and addresses are typically cleaned and standardized before attempting to link records. This can result in the loss of information that is potentially useful for the linkage. The Census Bureau has awarded a cooperative agreement to the University of Arkansas at Little Rock to test a procedure that first links records, then cleans the data.

Household and business relationships contain valuable information that can facilitate record linkage. The University of Arkansas at Little Rock team will develop a graph-based approach to incorporate household relationships in the linking process.
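To illustrate why household relationships can help, here is a minimal Python sketch. It is not the University of Arkansas at Little Rock team's method. All record identifiers, names, and the `household_support` helper are hypothetical. The sketch stores each household as a small graph (record to household members) and counts how many other household members have agreeing names across the two files.

```python
# Hypothetical toy data: each record maps to the set of records in its household.
household_a = {"a1": {"a1", "a2"}, "a2": {"a1", "a2"}}
household_b = {
    "b1": {"b1", "b2"}, "b2": {"b1", "b2"},
    "b3": {"b3", "b4"}, "b4": {"b3", "b4"},
}
name_a = {"a1": "MARIA LOPEZ", "a2": "JUAN LOPEZ"}
name_b = {
    "b1": "MARIA LOPEZ", "b2": "JUAN LOPEZ",
    "b3": "MARIA LOPEZ", "b4": "ANA RUIZ",
}


def household_support(rec_a, rec_b):
    """Count names shared by the *other* members of the two households."""
    others_a = {name_a[r] for r in household_a[rec_a] if r != rec_a}
    others_b = {name_b[r] for r in household_b[rec_b] if r != rec_b}
    return len(others_a & others_b)


# Record a1 has two same-name candidates in file B; household evidence
# is used here as an illustrative tiebreaker.
support = {b: household_support("a1", b) for b in ("b1", "b3")}
print(support)
```

In this toy example both candidates share the name MARIA LOPEZ, but only b1 lives with a JUAN LOPEZ, so b1 has one corroborating household member and b3 has none.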

Principal investigator John Talburt is executive director of the Center for Advanced Research in Entity Resolution and Information Quality at the University of Arkansas at Little Rock. He has published books, journal articles, and book chapters on entity resolution. Co-principal investigator Xiaowei Xu is a professor in the Department of Information Science at the University of Arkansas at Little Rock, specializing in data mining and machine learning. He received the prestigious Test of Time Award from the Association for Computing Machinery’s Special Interest Group on Knowledge Discovery and Data Mining for his seminal article on Density-Based Spatial Clustering of Applications with Noise (DBSCAN), one of the most popular clustering algorithms. Mariofanna Milanova, also a co-principal investigator, is a professor in the Department of Computer Science at the University of Arkansas at Little Rock. A Fulbright scholar, Milanova conducts research in artificial intelligence and machine learning. All three researchers hold patents.

University of Arkansas at Little Rock – data sharing plan outlines how the data from this project will be managed and shared.

Improved Use of Clustering Methods for Record Linkage (University of Connecticut)

The Census Bureau’s current production record linkage system relies on Fellegi-Sunter, a methodology based on restrictive assumptions that may not hold and are difficult to test. More recent methodologies do not rely on these assumptions but have not been implemented at the speed and scale required for Census Bureau production. A cooperative agreement awarded to the University of Connecticut will facilitate the development of a methodology using clustering that relaxes the Fellegi-Sunter assumptions. It will be designed to handle large numbers of files and complex datasets. The UConn team will develop an efficient sequential algorithm to link more than two datasets together. Its incremental record linkage methodology will address the common Census Bureau use case in which a small number of records need to be linked to a larger, previously linked dataset. The team will also research ways to further decrease processing time through more efficient data blocking strategies.
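For readers unfamiliar with the ideas mentioned above, the following Python sketch shows a Fellegi-Sunter style match weight together with a simple blocking step. It is an illustration only, not the Census Bureau's production system or the UConn team's method. The field names, the m- and u-probabilities, and the toy records are all assumptions. The sketch relies on the conditional independence assumption commonly used with Fellegi-Sunter, which is one example of the kind of restrictive assumption described above.

```python
import math
from collections import defaultdict

# Hypothetical per-field probabilities (illustrative values only):
#   m = P(field agrees | records truly match)
#   u = P(field agrees | records do not match)
M_U = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "zip": (0.85, 0.10),
}


def match_weight(rec_a, rec_b):
    """Sum log2 likelihood ratios, treating fields as independent."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def candidate_pairs(file_a, file_b, block_key):
    """Blocking: only compare records that share the same block key."""
    blocks = defaultdict(list)
    for rec in file_b:
        blocks[rec[block_key]].append(rec)
    for rec_a in file_a:
        for rec_b in blocks.get(rec_a[block_key], []):
            yield rec_a, rec_b


file_a = [
    {"id": "a1", "last_name": "SMITH", "birth_year": 1980, "zip": "72204"},
    {"id": "a2", "last_name": "JONES", "birth_year": 1975, "zip": "06269"},
]
file_b = [
    {"id": "b1", "last_name": "SMITH", "birth_year": 1980, "zip": "72204"},
    {"id": "b2", "last_name": "SMYTH", "birth_year": 1991, "zip": "72204"},
    {"id": "b3", "last_name": "JONES", "birth_year": 1975, "zip": "48109"},
]

scores = {
    (a["id"], b["id"]): match_weight(a, b)
    for a, b in candidate_pairs(file_a, file_b, block_key="zip")
}
```

Blocking on ZIP code means a1 is compared only with b1 and b2. The pair (a1, b1) receives a positive weight and (a1, b2) a negative one. The pair (a2, b3) is never compared because the ZIP codes differ. This shows both the speed benefit of blocking and its risk of missing true matches.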

Sanguthevar Rajasekaran, a professor and head of the university’s Computer Science and Engineering Department, is leading the team. Rajasekaran is a pioneer in randomized parallel algorithms and big data. Co-principal investigator Ofer Harel is a professor and associate dean of Research and Graduate Affairs at the University of Connecticut. He has specific expertise in incomplete data techniques. Sartaj Sahni, a co-principal investigator, is a distinguished professor in the University of Florida’s Department of Computer and Information Science and Engineering. His research publications and patents are on the design and analysis of efficient algorithms, parallel computing, interconnection networks, design automation, and medical algorithms.

University of Connecticut – data sharing plan outlines how the data from this project will be managed and shared.

Quantification of Record Linkage Error and Uncertainty (University of Michigan)

The Census Bureau aims to quantify and incorporate record linkage error and uncertainty into its data products. A cooperative agreement awarded to the University of Michigan will work toward this objective. The University of Michigan team will develop a multiple imputation approach to quantifying the error and uncertainty and help the Census Bureau extend its record linkage coverage for people without Social Security numbers (SSNs) by introducing a new reference dataset containing records for them. It will also incorporate household relationships into record linkage by matching couples in tax data.
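As a rough illustration of how multiple imputation can carry uncertainty into a final estimate, the Python sketch below applies the standard combining rules for multiply imputed data (often called Rubin's rules). It assumes, hypothetically, that each imputation is one plausible set of links and that the same analysis has been run on each. The numbers are invented, and this is not a description of the University of Michigan team's procedure.

```python
from statistics import mean, variance


def combine_rubin(estimates, within_variances):
    """Combine estimates from m imputed datasets.

    Returns the pooled estimate and its total variance:
        total = mean within-variance + (1 + 1/m) * between-variance
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(within_variances)
    between = variance(estimates)  # sample variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from three imputed sets of links.
estimates = [10.0, 12.0, 11.0]
within_variances = [1.0, 1.0, 1.0]

pooled, total_variance = combine_rubin(estimates, within_variances)
print(pooled, total_variance)
```

The between-imputation term is what reflects linkage uncertainty in this sketch: the more the estimate changes from one plausible set of links to another, the larger the total variance.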

Principal investigator Margaret Levenstein is the director of the Inter-university Consortium for Political and Social Research (ICPSR) and a research professor at the Survey Research Center and School of Information at the University of Michigan. Her recent research develops a probabilistic record linkage methodology that propagates linkage uncertainty when conducting inferences. J. Trent Alexander, co-principal investigator, is the ICPSR’s associate director and a research professor at the Institute for Social Research. He has two decades of experience building data infrastructure projects in academia and the federal government and currently leads the Census Bureau’s Decennial Census Digitization and Linkage Project.

University of Michigan – data sharing plan outlines how the data from this project will be managed and shared.

Cooperative Agreement for Improvement of Technology for Record Linkage and Entity Resolution

Development and Implementation of Improved Algorithms, Architecture and Systems (University of Washington)

The Census Bureau has awarded a cooperative agreement to the University of Washington to develop and implement algorithms, architecture and systems that will:

  • Produce a sequence of software products (within five years) that progress from proof-of-concept, through pilot, and on to fully robust production systems that provide the Census Bureau with a new flexible and scalable record linkage technology platform. In terms of flexibility, the Census Bureau needs a single integrated system capable of resolving entities like people, households, businesses, and establishments. In terms of scalability, the Census Bureau needs a system able to match billions of records simultaneously. Given the scale of data required by the Census Bureau, any solution architecture must be capable of operating on a single instance as well as across multiple servers without impacting the quality of the statistical output.
  • Be fully open source, have thorough documentation consistent with strong industry standards (NIST 2018), and be readily used and extended by knowledgeable personnel of large-scale statistical organizations in the federal government, academia, and the private sector.
  • Be accompanied by a “sandbox” of multiple large-scale, public-domain datasets that would provide a basis for shared evaluation of methodology and technology in ways that are aligned with high-priority challenges frequently encountered in entity resolution and record linkage. All the data for this sandbox should be entirely within the public domain, and not subject to any confidentiality or intellectual-property constraints.

Principal investigator Abraham Flaxman is an associate professor of health metrics sciences at the Institute for Health Metrics and Evaluation (IHME) at the University of Washington. He is also affiliate faculty with UW’s Department of Computer Science and Engineering, Center for Statistics and Social Sciences, Center for Study of Demography and Ecology, and eScience Institute. He has a mathematics and theoretical computer science background as well as specific training and expertise in global health metrics, randomized algorithms, and combinatorial optimization. As a health metrician trained in these areas, Flaxman is drawn to research that has both methodological challenges and important implications. In his role at IHME, he leads software and methodological development for cost-effectiveness microsimulations, verbal autopsy analysis, and other areas. Flaxman developed a software tool known as DisMod-MR that is central to estimation processes within the IHME-led Global Burden of Disease (GBD) study. Additionally, he has played an instrumental role in developing IHME’s microsimulation framework, Vivarium, which is used to answer “what-if?” questions related to global health.

University of Washington – data sharing plan outlines how the data from this project will be managed and shared.
