WITHDRAWN Optimizing Blocking Strategies for Record De-duplication at Scale
Wednesday, Aug 6: 11:05 AM - 11:20 AM
1320
Contributed Papers
Music City Center
Record linkage in large datasets requires efficient blocking strategies. By restricting comparisons to records that share similar identifying attributes, blocking typically speeds up matching and makes record linkage scalable. Nonetheless, attribute-based blocking strategies often produce large blocks, which can make record linkage computationally expensive. As a more efficient alternative, this study implements Locality Sensitive Hashing (LSH) for de-duplication of person records at a nationwide scale. LSH is a hashing technique that maps similar records into the same hash bucket with high probability. Applying MinHash LSH Forest to a dataset with a defined truth deck, this study computes the number of candidate record pairs, the percentage of true duplicate record pairs covered across all LSH buckets, and the runtime of the LSH algorithm. It also compares these performance metrics with those of attribute-based blocking strategies. The study finds that LSH, with hyperparameter tuning, can efficiently generate candidate record pairs that include the vast majority of true duplicate pairs at a nationwide scale.
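As a rough illustration of the blocking idea described in the abstract, the following minimal sketch uses plain banded MinHash LSH in pure Python. It is not the study's implementation and it is not the MinHash LSH Forest variant the abstract names. The function names, the character 3-gram shingling, and the parameter values (`num_perm`, `bands`) are all assumptions chosen for illustration. It shows how candidate pairs are drawn from shared buckets and how the share of true duplicate pairs covered can be computed against a truth deck.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a whitespace/case-normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}


def _hash(seed, token):
    """Stable 64-bit hash of a token under a given seed (stands in for one permutation)."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens, num_perm=32):
    """MinHash signature: the minimum hash value of the token set under each seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_perm))


def lsh_candidate_pairs(records, num_perm=32, bands=16):
    """Bucket records by signature band and return all pairs sharing any bucket.

    `records` maps a sortable record id to a string of identifying attributes.
    `num_perm` and `bands` are illustrative hyperparameters, not the study's values.
    """
    rows = num_perm // bands
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), num_perm)
        for b in range(bands):
            # Records whose signatures agree on an entire band land in the same bucket.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def pair_completeness(candidates, true_pairs):
    """Fraction of true duplicate pairs (the truth deck) found among the candidates."""
    truth = {tuple(sorted(p)) for p in true_pairs}
    return len(candidates & truth) / len(truth)
```

In this sketch, using more bands with fewer rows per band makes bucket collisions more likely, which tends to raise coverage of true duplicates at the cost of more candidate pairs. That trade-off is the kind of hyperparameter tuning the abstract refers to. Calling `lsh_candidate_pairs(records)` and passing its result to `pair_completeness` with a list of known duplicate id pairs gives the two pair-based metrics described above.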
Record linkage
De-duplication
Blocking
Locality Sensitive Hashing
Main Sponsor
Government Statistics Section