A Flexible Bayesian Approach to Link Massive Noisy Databases
Sunday, Aug 3: 2:20 PM - 2:35 PM
1388
Contributed Papers
Music City Center
In many applications, from government to ecology, integrating data from diverse and noisy sources is critical for downstream inference. However, a unique identifier linking records that refer to the same entity may not exist. Record linkage merges such databases by identifying duplicate records within and across them. A popular method represents the true attributes of each entity as latent variables and links records by clustering observations around those latent truths, allowing for potential data distortions. This method assumes the truth is a single fixed value, which may not match reality. For example, survey participants may not recall the exact value of their net income and instead provide an approximation. Any attempt to link such a response to official data necessarily encodes it as a random distortion rather than an approximate truth. We present a novel generalization of the latent variable record linkage model that allows values to be treated as "fuzzy truths" instead of random distortions and that handles both discrete and continuous fields. We provide two options for fitting the model: Markov chain Monte Carlo, and variational inference for massive data. We demonstrate the model's value via simulation and by linking a longitudinal survey of Italian household wealth.
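The contrast between a random distortion and an approximate truth can be made concrete with a toy calculation. The sketch below is only an illustration under stated assumptions, not the model presented in the talk: the names `link_weights`, `exact_kernel`, and `fuzzy_kernel`, the Gaussian kernel, the 5% relative spread, the uniform background over a fixed value range, and the income figures are all hypothetical. Incomes are assumed to be recorded in whole units, so a density times a unit bin width stands in for a probability mass.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def exact_kernel(reported, truth):
    """Single fixed truth: an undistorted report must equal it exactly."""
    return 1.0 if reported == truth else 0.0

def fuzzy_kernel(reported, truth, rel_sd=0.05):
    """Assumed 'fuzzy truth': an undistorted report may sit near the truth."""
    return normal_pdf(reported, truth, rel_sd * truth)

def link_weights(reported, truths, match_kernel,
                 distort_prob=0.1, value_range=100_000):
    """Normalized weight of each candidate entity for one reported value.

    With probability distort_prob the report is pure noise, drawn uniformly
    over value_range (an assumption); otherwise it follows match_kernel.
    """
    background = 1.0 / value_range
    likelihoods = [
        (1.0 - distort_prob) * match_kernel(reported, truth)
        + distort_prob * background
        for truth in truths
    ]
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]

# Hypothetical official incomes for two candidate entities, and a rounded
# survey response that matches neither exactly.
official_incomes = [31_250, 48_900]
survey_report = 30_000

exact_weights = link_weights(survey_report, official_incomes, exact_kernel)
fuzzy_weights = link_weights(survey_report, official_incomes, fuzzy_kernel)

print("exact-truth weights:", exact_weights)
print("fuzzy-truth weights:", fuzzy_weights)
```

Under the exact kernel the rounded response can only be explained as noise, so this field gives both candidates equal weight and contributes nothing to the link. Under the assumed fuzzy kernel nearly all of the weight falls on the entity whose official income is close to the reported figure.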
Record linkage
Entity resolution
Bayesian hierarchical model
Variational inference
Measurement error
Main Sponsor
Section on Bayesian Statistical Science