Record Linkage and Estimation using Alternative and Linked Data

Oksana Balabay Chair
Westat

Wednesday, Aug 6: 10:30 AM - 12:20 PM
4185
Contributed Papers

Music City Center

Room: CC-104D

Main Sponsor

Government Statistics Section

Presentations

Objectivity and Reproducibility of Probabilistic Record Linkage.

There has been extensive research on consolidating health and medical data, and researchers continue to explore efficient ways of consolidating and validating fragmented health and medical data to improve the services provided and to support downstream analyses. The two main approaches to record linkage are deterministic and probabilistic linkage. Probabilistic linkage has several advantages over the deterministic approach, but applying it involves subjective decisions. To support our proposed method, we generated data with 20% noise introduced into the source dataset. We present an approach to selecting the optimal matching score for declaring a match. Our results indicate an increase in the matching rate, along with improvements in sensitivity, specificity, and precision.
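As an illustration of how a match threshold can be chosen objectively rather than by eye, the sketch below scans candidate cut-offs on a small labeled sample and keeps the one that maximizes Youden's J (sensitivity + specificity - 1). This criterion is an assumption made for the example; the abstract does not state which criterion the authors use, and the scores and labels are made-up toy values.

```python
def confusion(scores, labels, threshold):
    """Count TP, FP, TN, FN when pairs scoring >= threshold are called matches."""
    tp = fp = tn = fn = 0
    for score, is_match in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1
        elif not predicted and is_match:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def best_threshold(scores, labels):
    """Return (threshold, J) maximizing Youden's J over the observed scores.

    Ties are resolved in favor of the lowest threshold.
    """
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion(scores, labels, t)
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        j = sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy match scores with known truth (1 = true match, 0 = non-match).
scores = [0.1, 0.2, 0.35, 0.5, 0.55, 0.7, 0.85, 0.95]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
print(best_threshold(scores, labels))  # (0.7, 0.75)
```

Because the rule is fixed in advance, rerunning it on the same labeled pairs always yields the same cut-off, which is the kind of reproducibility the abstract is concerned with.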

Keywords

Record Linkage, probabilistic record linkage, matching probabilities, optimal threshold, medical data, health data

Co-Author

Bong-Jin Choi, North Dakota State University

First Author

Moruf Disu

Presenting Author

Moruf Disu

Using Augmented Twin Neural Networks to Match Occupational Injury and Illness Data

When record linkage efforts involve complex characteristics, there is potential for machine learning (ML) techniques to succeed where traditional probabilistic linkage methods (e.g., Fellegi-Sunter) might fall short. However, pre-processing (e.g., geocoding) and hand-picked metrics (e.g., edit distances) can still improve linkage outcomes beyond what ML models achieve on their own. We present a fusion of the two approaches, which we call an Augmented Twin Neural Network. This approach leverages the nonlinear flexibility of Twin Neural Networks while adding layers that accommodate hand-curated comparators, which may be difficult for ML optimizers to identify implicitly without sufficiently large labeled data sets. The framework is used to match businesses from the BLS Survey of Occupational Injuries and Illnesses to businesses in the OSHA Injury Tracking Application data. Difficulties in matching company names and addresses, together with the existence of multi-establishment firms, make this a prime application for testing. Linkage outcome metrics for the augmented method are compared with results from both probabilistic and standard ML methods to illustrate the added benefits.
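The general idea of augmenting a twin network with hand-curated comparators can be sketched as a forward pass: one shared encoder embeds both records, the embeddings are compared element-wise, and a hand-picked comparator (here a normalized edit-distance similarity) is appended before the final scoring layer. Everything below is a hypothetical illustration rather than the authors' architecture: the weights are random and untrained, and the names `featurize`, `encode`, and `match_score` are invented for the example.

```python
import math
import random

DIM_IN, DIM_OUT = 16, 4
rng = random.Random(0)
# Shared encoder weights: the SAME matrix encodes both records (the "twin" part).
W_ENC = [[rng.uniform(-1, 1) for _ in range(DIM_IN)] for _ in range(DIM_OUT)]
# Output weights over [learned comparison features..., hand-curated comparator].
W_OUT = [rng.uniform(-1, 1) for _ in range(DIM_OUT + 1)]


def featurize(name):
    """Hash character bigrams of a name into a fixed-length count vector."""
    vec = [0.0] * DIM_IN
    s = name.lower()
    for i in range(len(s) - 1):
        vec[(ord(s[i]) * 31 + ord(s[i + 1])) % DIM_IN] += 1.0
    return vec


def encode(name):
    """Shared encoder: one dense layer with tanh activation."""
    x = featurize(name)
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W_ENC]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Hand-curated comparator: 1.0 for identical strings, lower as edits grow."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def match_score(a, b):
    """Combine learned and hand-curated comparisons into a match probability."""
    ea, eb = encode(a), encode(b)
    learned = [abs(x - y) for x, y in zip(ea, eb)]
    augmented = learned + [edit_similarity(a, b)]  # the augmentation step
    z = sum(w * f for w, f in zip(W_OUT, augmented))
    return 1.0 / (1.0 + math.exp(-z))


print(match_score("Acme Corp", "ACME Corporation"))
```

In a trained model the encoder and output weights would be fit on labeled pairs; the point of the sketch is only that the edit-distance feature enters the scoring layer directly instead of having to be rediscovered by the optimizer.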

Keywords

record linkage

entity resolution

machine learning

neural networks

probabilistic matching

First Author

Elan Segarra, US Bureau of Labor Statistics

Presenting Author

Elan Segarra, US Bureau of Labor Statistics

WITHDRAWN Optimizing Blocking Strategies for Record De-duplication at Scale

Record linkage in large datasets requires efficient blocking strategies. By restricting comparisons to records with similar identifying attributes, blocking typically enables faster and more scalable matching. Nonetheless, attribute-based blocking strategies often produce large blocks, which can make record linkage computationally expensive. Employing a more efficient blocking strategy, this study implements Locality Sensitive Hashing (LSH) for de-duplication of person records at a nationwide scale. LSH is a hashing technique that maps similar records into the same hash bucket with high probability. Applying the MinHash LSH Forest to a dataset with a defined truth deck, this study computes the number of candidate record pairs, the percentage of real duplicate record pairs covered across all LSH buckets, and the runtime of the LSH algorithm. It also compares these LSH performance metrics with those of attribute-based blocking strategies. This study finds that LSH, with hyperparameter tuning, can efficiently create candidate record pairs that include the vast majority of real duplicate pairs at a nationwide scale.
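To make the bucketing idea concrete, here is a minimal sketch of MinHash with banded LSH using only the Python standard library. It is a simpler relative of the MinHash LSH Forest named in the abstract, not that algorithm itself, and the shingle size, number of hash functions, and band count are arbitrary assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES, BANDS = 32, 16        # assumed hyperparameters
ROWS = NUM_HASHES // BANDS        # signature rows per band


def shingles(text, k=3):
    """Set of overlapping character k-grams of the lowercased text."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}


def seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def minhash(tokens):
    """Signature: the minimum hash of the token set under each seed."""
    return [min(seeded_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES)]


def candidate_pairs(records):
    """Pair up record ids whose signatures agree on at least one full band."""
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash(shingles(text))
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    1: "john smith nashville",
    2: "jon smith nashville",    # near-duplicate of 1
    3: "john smith nashville",   # exact duplicate of 1
    4: "maria garcia fargo",
}
print(sorted(candidate_pairs(records)))
```

Only records that share a bucket are compared downstream, so the number of candidate pairs and the share of true duplicates they cover depend on the band and row settings, which is where hyperparameter tuning comes in.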

Keywords

Record linkage

De-duplication

Blocking

Locality Sensitive Hashing

First Author

Theodore Charm, U.S. Census Bureau

Overcoming Challenges in Infrastructure, Accessibility, Usability and Technical Adoption in Privacy-Preserving Record Linkage

Privacy Preserving Record Linkage (PPRL) is an important privacy-preserving technology that helps address security and privacy concerns in data integration. It is integral to enabling organizations to link data while protecting sensitive information and maintaining public trust. Implementing the tools entails exploring practical applications, addressing challenges, and deploying PPRL tools with the appropriate technical experience. In terms of implementation strategies, approvals and technical support at different stages of the project's lifecycle are imperative. Learning the correct configuration, navigating information technology (IT) infrastructure and data security requirements, and understanding the technical criteria for creating encrypted cryptographic hashes or cryptographic linkage keys are critical to the success of a PPRL project. This presentation will highlight PPRL tool adoption across diverse environments, which requires data management strategies and an understanding of IT infrastructures. We will conclude by discussing some key lessons learned in our exploration of deploying PPRL tools.
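For readers unfamiliar with cryptographic linkage keys, one common construction is a keyed hash (HMAC) over normalized identifiers, so that parties sharing a secret can match records without exchanging the identifiers themselves. The sketch below assumes that construction; it does not describe any specific PPRL tool from the presentation, and the field choice, normalization rules, and secret value are hypothetical.

```python
import hashlib
import hmac
import unicodedata


def normalize(value):
    """Strip accents, case, spaces, and punctuation so trivial variants agree."""
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return "".join(ch for ch in text.lower() if ch.isalnum())


def linkage_key(secret, first, last, dob):
    """HMAC-SHA256 over the normalized fields, returned as a hex string."""
    message = "|".join(normalize(part) for part in (first, last, dob))
    return hmac.new(secret, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical shared secret; in practice it would be managed securely.
SECRET = b"example-shared-secret"

key_a = linkage_key(SECRET, "José", "O'Neil", "1980-01-02")
key_b = linkage_key(SECRET, "jose ", "ONEIL", "1980/01/02")
print(key_a == key_b)  # True
```

Exact-match keys like this tolerate only the variation removed by normalization, which is one reason the configuration choices mentioned above matter for linkage quality.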

Keywords

PPRL tools

Data Management

IT infrastructure

NSDS demonstration projects

Cryptographic Linkage

First Author

Kabirat Nasiru, Oak Ridge Institute for Science and Education Data Science Fellow - Research Ambassadors Program through National Center for Science and Engineering Statistics within the National Science Foundation

Presenting Author

WITHDRAWN Causal Inference with Linked Data Files

Causal analysis of observational studies requires data that comprise a set of covariates, a treatment assignment, and the observed outcomes. However, data confidentiality restrictions may distribute these variables across two or more files. In the absence of unique identifiers to link records across files, probabilistic record linkage algorithms can be leveraged to merge the datasets. Current applications of record linkage are concerned with estimation of associations, and not causal relationships. We present a Bayesian framework for record linkage and causal inference when causally relevant variables are spread across two files. Using simulations, we show that the new framework can improve the linkage accuracy, and provide accurate post-linkage causal inferences.

Keywords

record linkage

causal inference

missing data

Co-Author

Roee Gutman, Brown University

First Author

Gauri Kamat, Brown University