Wednesday, Aug 6: 10:30 AM - 12:20 PM
4185
Contributed Papers
Music City Center
Room: CC-104D
Main Sponsor
Government Statistics Section
Presentations
The consolidation of health and medical data has been researched extensively over the years, and researchers continue to explore efficient ways of consolidating and validating fragmented health and medical data to improve the services provided and to better support subsequent analyses. The two main approaches to record linkage are deterministic and probabilistic linkage. Probabilistic linkage offers several advantages over the deterministic approach; however, applying it involves subjective decisions. To support our proposed method, we generated data with 20% noise introduced into the source dataset. We present an approach to selecting the optimal matching score for declaring a match. Our results indicate an increase in the matching rate, with improvements in sensitivity, specificity, and precision.
Keywords
Record Linkage
probabilistic record linkage
matching probabilities
optimal threshold
medical data
health data
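The abstract above does not state how the optimal matching score is chosen. As a minimal sketch of one common option, assuming a labeled sample of candidate pairs is available, the snippet below scans candidate thresholds and keeps the one that maximizes Youden's J (sensitivity + specificity - 1). The function names, the scores, and the labels are hypothetical and are not taken from the presentation.

```python
from typing import List, Optional, Sequence, Tuple


def confusion_at(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[int, int, int, int]:
    """Count (tp, fp, tn, fn) when pairs scoring >= threshold are called matches."""
    tp = fp = tn = fn = 0
    for score, is_match in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1
        elif not predicted and is_match:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def best_threshold(
    scores: Sequence[float], labels: Sequence[bool]
) -> Tuple[Optional[float], float]:
    """Return the observed score that maximizes Youden's J, and that J value."""
    best_t: Optional[float] = None
    best_j = -1.0
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion_at(scores, labels, t)
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        j = sensitivity + specificity - 1.0
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical match scores and ground-truth labels for eight candidate pairs.
scores: List[float] = [0.95, 0.90, 0.80, 0.70, 0.55, 0.40, 0.30, 0.10]
labels: List[bool] = [True, True, True, True, False, True, False, False]

threshold, j = best_threshold(scores, labels)
print(threshold, round(j, 2))  # 0.7 0.8
```

Other criteria (for example maximizing F1, or fixing a minimum precision) fit the same loop by changing the quantity being maximized.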
When record linkage efforts involve complex characteristics, there is potential for machine learning (ML) techniques to succeed where traditional probabilistic linkage methods (e.g., Fellegi-Sunter) might fall short. However, pre-processing (e.g., geocoding) and hand-picked metrics (e.g., edit distances) can still improve linkage outcomes beyond what ML models achieve on their own. We present a fusion of the two approaches, which we call an Augmented Twin Neural Network. This approach leverages the nonlinear flexibility of Twin Neural Networks while adding layers that allow for hand-curated comparators, which may be difficult for ML optimizers to identify implicitly without sufficiently large, labeled data sets. The framework is used to match businesses from the BLS Survey of Occupational Injuries and Illnesses to businesses in the OSHA Injury Tracking Application data. Difficulties in matching company names and addresses, together with the existence of multi-establishment firms, make this a prime application for testing. Linkage outcome metrics for the augmented method are compared with results from both probabilistic and standard ML methods to illustrate the added benefits.
Keywords
record linkage
entity resolution
machine learning
neural networks
probabilistic matching
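To make the idea of hand-picked comparators concrete, the sketch below computes two such features for a pair of business records: a normalized Levenshtein similarity on names and an exact-match flag on ZIP code. It illustrates only the kind of inputs that could be supplied alongside learned representations; the network itself is omitted, and the record fields (`name`, `zip`) and function names are assumptions rather than details from the presentation.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic program, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical after light cleaning."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def comparator_features(rec_a: Dict[str, str], rec_b: Dict[str, str]) -> List[float]:
    """Hand-picked comparison features for one candidate pair (hypothetical fields)."""
    return [
        name_similarity(rec_a["name"], rec_b["name"]),
        1.0 if rec_a["zip"] == rec_b["zip"] else 0.0,
    ]


# Hypothetical establishment records.
features = comparator_features(
    {"name": "Acme Corp", "zip": "37203"},
    {"name": "Acme Corp.", "zip": "37203"},
)
print(features)
```

In an augmented architecture of the kind described, a vector like `features` would be concatenated with the learned pair representation before the final classification layers.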
Record linkage in large datasets requires efficient blocking strategies. By restricting comparisons to records with similar identifying attributes, blocking typically enables faster matching and better scalability. Nonetheless, attribute-based blocking strategies often produce large blocks, which can make record linkage computationally expensive. As a more efficient alternative, this study implements Locality Sensitive Hashing (LSH) for de-duplication of person records at a nationwide scale. LSH is a hashing technique that maps similar records into the same hash bucket with high probability. Applying the MinHash LSH Forest to a dataset with a defined truth deck, this study computes the number of candidate record pairs, the percentage of real duplicate record pairs covered across all LSH buckets, and the runtime of the LSH algorithm. It also compares these LSH performance metrics with those of attribute-based blocking strategies. This study finds that LSH, with hyperparameter tuning, can efficiently create candidate record pairs that include the vast majority of real duplicate pairs at a nationwide scale.
Keywords
Record linkage
De-duplication
Blocking
Locality Sensitive Hashing
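The blocking idea can be illustrated with a small, standard-library-only sketch of banded MinHash LSH. Note that this is the simpler banding variant, not the MinHash LSH Forest used in the study, and the shingle size, number of hash functions, and band count are arbitrary assumptions. Records whose signatures agree on every row of at least one band fall into the same bucket and become a candidate pair.

```python
import hashlib
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[str]:
    """Character k-grams of the lowercased text (at least one token is returned)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash; each seed acts as a separate hash function."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens: Set[str], num_perm: int = 32) -> List[int]:
    """One minimum per hash function; similar token sets share many minima."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def candidate_pairs(
    records: Dict[int, str], num_perm: int = 32, bands: int = 16
) -> Set[Tuple[int, int]]:
    """Bucket records by signature band and pair up records sharing any bucket."""
    rows = num_perm // bands
    buckets: Dict[Tuple[int, Tuple[int, ...]], Set[int]] = defaultdict(set)
    for rec_id, text in records.items():
        signature = minhash_signature(shingles(text), num_perm)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(rec_id)
    pairs: Set[Tuple[int, int]] = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical person records: 1 and 2 are exact duplicates, 3 is a near
# duplicate, and 4 shares no character 3-grams with the others.
records = {
    1: "john smith 1980-01-01",
    2: "john smith 1980-01-01",
    3: "jon smith 1980-01-01",
    4: "zzzz qqqq 7777-99-99",
}
pairs = candidate_pairs(records)
print(sorted(pairs))
```

Fewer rows per band makes bucket collisions more likely, which raises coverage of true duplicates at the cost of more candidate pairs; this is the kind of trade-off that hyperparameter tuning adjusts.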
Privacy Preserving Record Linkage (PPRL) is an important privacy-preserving technology that addresses security and privacy concerns in data integration. It is integral to enabling organizations to link data while protecting sensitive information and maintaining public trust. Implementing PPRL tools entails exploring practical applications, addressing challenges, and deploying the tools with the appropriate technical expertise. In terms of implementation strategy, obtaining approvals and securing technical support at different stages of the project's lifecycle are imperative. Learning the correct configuration, navigating information technology (IT) infrastructure and data security requirements, and understanding the technical criteria for creating cryptographic hashes or cryptographic linkage keys are critical to the success of a PPRL project. This presentation will highlight PPRL tool adoption across diverse environments, which requires data management strategies and an understanding of IT infrastructure. We will conclude by discussing key lessons learned from our exploration of deploying PPRL tools.
Keywords
PPRL tools
Data Management
IT infrastructure
NSDS demonstration projects
Cryptographic Linkage
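As a rough illustration of what a cryptographic linkage key can look like, the sketch below normalizes a few identifying fields and applies a keyed hash (HMAC-SHA256) with a secret shared by the data providers. The choice of fields, the normalization rules, and the function names are assumptions for illustration only; production PPRL tools define their own specifications and key-management procedures.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Strip accents, whitespace, and case so equivalent spellings hash alike."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(without_accents.upper().split())


def linkage_key(first: str, last: str, dob: str, secret: bytes) -> str:
    """Keyed hash of the normalized fields; only the hex digest is shared."""
    message = "|".join(normalize(part) for part in (first, last, dob)).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


# Hypothetical shared secret; in practice it would come from a secure key store.
secret = b"example-shared-secret"

key_a = linkage_key(" José ", "Garcia", "1990-05-17", secret)
key_b = linkage_key("JOSE", "garcia", "1990-05-17", secret)
print(key_a == key_b)  # True
```

Because the hash is keyed, a party without the secret cannot recompute keys from guessed names and dates, but exact-match keys like this tolerate no typographical variation beyond what normalization removes.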
First Author
Kabirat Nasiru, Oak Ridge Institute for Science and Education Data Science Fellow - Research Ambassadors Program through National Center for Science and Engineering Statistics within the National Science Foundation
Presenting Author
Kabirat Nasiru, Oak Ridge Institute for Science and Education Data Science Fellow - Research Ambassadors Program through National Center for Science and Engineering Statistics within the National Science Foundation
Causal analysis of observational studies requires data that comprise a set of covariates, a treatment assignment, and the observed outcomes. However, data confidentiality restrictions may distribute these variables across two or more files. In the absence of unique identifiers to link records across files, probabilistic record linkage algorithms can be leveraged to merge the datasets. Current applications of record linkage are concerned with the estimation of associations rather than causal relationships. We present a Bayesian framework for record linkage and causal inference when causally relevant variables are spread across two files. Using simulations, we show that the new framework can improve linkage accuracy and provide accurate post-linkage causal inferences.
Keywords
record linkage
causal inference
missing data