Approximate String Comparator Search Strategies for Very Large Administrative Lists

Written by:
RRS2005-02

Abstract

Rather than collecting data from a variety of surveys, it is often more efficient to merge information from administrative lists. Person files might be matched using name and date-of-birth as the primary identifying information. Commonly occurring names pose an obvious difficulty: a name such as John Smith may occur 30,000 or more times (about 1.5 times for each date-of-birth). If there is a 5% typographical error rate in each field, then fast character-by-character searches can miss 20% of true matches among records with less common names, where name plus date-of-birth might be unique. This paper describes some existing solutions and current research directions.
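
The 20% figure can be roughly reproduced with simple arithmetic. The sketch below is illustrative only and rests on assumptions the abstract does not state: that errors occur independently in each field, and that a record is compared on about four fields (the field count is hypothetical). It also includes a plain Levenshtein edit distance as one simple example of an approximate string comparator; this is a stand-in for illustration, not necessarily the comparator studied in the paper.

```python
def exact_match_miss_rate(error_rate: float, num_fields: int) -> float:
    """Probability that at least one of num_fields fields contains an error.

    Assumes (hypothetically) that errors are independent and that every
    field has the same error rate. An exact character-by-character match
    fails whenever any compared field contains an error.
    """
    return 1.0 - (1.0 - error_rate) ** num_fields


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Miss rate of exact matching for a few assumed field counts.
    for n in (3, 4, 5):
        print(n, round(exact_match_miss_rate(0.05, n), 3))
    # An exact comparison rejects a transposed name outright,
    # while the edit distance shows the two strings are close.
    print("SMITH" == "SMTIH", edit_distance("SMITH", "SMTIH"))
```

Under these assumptions, four fields give a miss rate of about 18.5% and five fields about 22.6%, which brackets the 20% quoted above. The transposed name "SMTIH" fails an exact comparison with "SMITH" but lies at edit distance 2, which is the kind of near match an approximate comparator is meant to recover.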

Page Last Revised - October 28, 2021