WITHDRAWN Optimizing Blocking Strategies for Record De-duplication at Scale

Theodore Charm, First Author
U.S. Census Bureau
 
Wednesday, Aug 6: 11:05 AM - 11:20 AM
1320 
Contributed Papers 
Music City Center 
Record linkage in large datasets requires efficient blocking strategies. By restricting comparisons to records that share similar identifying attributes, blocking typically enables faster matching and better scalability. Nonetheless, attribute-based blocking strategies often produce large blocks, which can make record linkage computationally expensive. This study implements a more efficient blocking strategy, Locality Sensitive Hashing (LSH), for de-duplication of person records at a nationwide scale. LSH is a hashing technique that maps similar records into the same hash bucket with high probability. Applying MinHash LSH Forest to a dataset with a defined truth deck, this study computes the number of candidate record pairs, the percentage of true duplicate record pairs covered across all LSH buckets, and the runtime of the LSH algorithm. It then compares these LSH performance metrics with those of attribute-based blocking strategies. The study finds that LSH, with hyperparameter tuning, can efficiently generate candidate record pairs that include the vast majority of true duplicate pairs at a nationwide scale.
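
To make the blocking idea concrete, the following is a minimal sketch of MinHash-based LSH blocking and of the coverage metrics described above. It is not the study's implementation: it uses plain banded MinHash LSH rather than the MinHash LSH Forest named in the abstract, relies only on the Python standard library, and every name and parameter (shingles, minhash_signature, candidate_pairs, blocking_metrics, num_perm, bands, the 3-character shingle size) is a hypothetical choice made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Character k-grams of a lowercased, whitespace-normalized record string."""
    s = " ".join(text.lower().split())
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}


def minhash_signature(tokens, num_perm=64):
    """MinHash signature: the minimum of each of num_perm salted hash functions.

    Assumption: salted BLAKE2b hashes stand in for random permutations.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for t in tokens
        ))
    return signature


def candidate_pairs(records, num_perm=64, bands=16):
    """Bucket records by signature band; records sharing any bucket become a pair.

    records maps a record id to a single string of identifying attributes.
    """
    rows = num_perm // bands
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), num_perm)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def blocking_metrics(pairs, true_pairs, n_records):
    """Candidate pair count, share of true duplicate pairs covered, and reduction ratio."""
    all_pairs = n_records * (n_records - 1) // 2
    return {
        "candidate_pairs": len(pairs),
        "pair_completeness": len(pairs & true_pairs) / len(true_pairs),
        "reduction_ratio": 1 - len(pairs) / all_pairs,
    }


if __name__ == "__main__":
    # Toy records and truth deck, invented for this example only.
    demo = {
        1: "John Smith 1980-01-01 Nashville TN",
        2: "Jon Smith 1980-01-01 Nashville TN",
        3: "Maria Garcia 1975-06-30 Denver CO",
    }
    found = candidate_pairs(demo)
    print(blocking_metrics(found, {(1, 2)}, len(demo)))
```

In this sketch, the number of bands and the rows per band play the role of the tunable hyperparameters: more bands with fewer rows each tend to raise the share of true duplicate pairs covered at the cost of more candidate pairs, while fewer, longer bands tend to do the opposite.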

Keywords

Record linkage

De-duplication

Blocking

Locality Sensitive Hashing 

Main Sponsor

Government Statistics Section