A Flexible Bayesian Approach to Link Massive Noisy Databases

Andee Kaplan Co-Author
Colorado State University
 
Matthew Koslovsky Co-Author
Colorado State University
 
Hyungjoon Kim First Author
Colorado State University
 
Hyungjoon Kim Presenting Author
Colorado State University
 
Sunday, Aug 3: 2:20 PM - 2:35 PM
1388 
Contributed Papers 
Music City Center 
In many applications, from government to ecology, integrating data from diverse and noisy sources is critical for downstream inference. However, a unique identifier to link records from the same entity may not exist. Record linkage merges such databases by finding duplicates within and across them. A popular method represents the truth of each entity as a latent variable and links records by clustering observations to that truth, allowing for potential data distortions. This method assumes the truth is a single fixed value, which may not match reality. For example, survey participants may not recall the exact value of their net income and may provide an approximation instead. Any attempt to link such a response to official data necessarily encodes it as a random distortion rather than an approximate truth. We present a novel generalization of the latent variable record linkage model that allows values to be treated as "fuzzy truths" instead of random distortions and that handles both discrete and continuous fields. We provide two options for fitting the model, Markov chain Monte Carlo and variational inference for massive data, and demonstrate its value through simulation and by linking a longitudinal survey of Italian household wealth.
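
As an illustration only, the contrast between a random distortion and a "fuzzy truth" can be sketched with a toy simulation. This is not the authors' model: the function names, the uniform income range, the multiplicative Gaussian noise, and the distortion probability are all assumptions chosen for the example. Each entity has a latent true income; a record either reports a value unrelated to that truth (a random distortion) or an approximation centered on it (a fuzzy truth).

```python
import random


def simulate_records(n_entities=500, records_per_entity=2,
                     fuzzy_sd=0.05, distortion_prob=0.3, seed=0):
    """Toy generative sketch (hypothetical, not the model from the abstract)."""
    rng = random.Random(seed)
    # Latent truth per entity: net income in thousands (assumed uniform range).
    truths = [rng.uniform(10.0, 100.0) for _ in range(n_entities)]
    records = []
    for entity_id, truth in enumerate(truths):
        for _ in range(records_per_entity):
            if rng.random() < distortion_prob:
                # Random distortion: the reported value ignores the truth entirely.
                value, kind = rng.uniform(10.0, 100.0), "distorted"
            else:
                # Fuzzy truth: an approximation centered on the true value.
                value, kind = truth * (1.0 + rng.gauss(0.0, fuzzy_sd)), "fuzzy"
            records.append({"entity": entity_id, "value": value, "kind": kind})
    return truths, records


def mean_abs_error(truths, records, kind):
    """Average absolute gap between reported values of one kind and the latent truth."""
    errs = [abs(r["value"] - truths[r["entity"]])
            for r in records if r["kind"] == kind]
    return sum(errs) / len(errs) if errs else float("nan")


if __name__ == "__main__":
    truths, records = simulate_records()
    print("fuzzy MAE:    ", round(mean_abs_error(truths, records, "fuzzy"), 2))
    print("distorted MAE:", round(mean_abs_error(truths, records, "distorted"), 2))
```

Under these assumptions, fuzzy reports stay close to the latent truth and so still carry information for linkage, while randomly distorted reports do not. That is the distinction the abstract draws for fields such as self-reported income.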

Keywords

Record linkage

Entity resolution

Bayesian hierarchical model

Variational inference

Measurement error 

Main Sponsor

Section on Bayesian Statistical Science