A general framework for regression with mismatched data based on mixture modeling

Conference: International Conference on Health Policy Statistics 2023
01/10/2023: 9:00 AM - 10:45 AM MST


Data sets obtained from linking multiple files are frequently affected by mismatch error, as a result of non-unique or noisy identifiers used during record linkage. Accounting for such mismatch error in downstream analysis performed on the linked file is critical to ensure valid statistical inference. In this talk, we present a generic framework to enable valid post-linkage inference in the challenging secondary analysis setting in which only the linked file is given. The proposed framework can flexibly incorporate additional information about the underlying record linkage process, and covers a wide selection of statistical models. Specifically, we propose a pseudo-likelihood approach that is based on two-component mixture models whose two components represent specific distributions conditional on a pair of records being a correct match or mismatch, respectively. We will illustrate the effectiveness of the proposed approach via a simulation study, and then present two applications of the approach to real-world data sets, demonstrating contingency table analysis and semiparametric regression using penalized splines.
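The abstract does not spell out the model, so the following is only a minimal sketch of the two-component mixture idea under assumptions that are not taken from the talk: a Gaussian linear regression for correctly matched pairs, a response that is independent of the covariates (and follows the marginal distribution of y) for mismatched pairs, and a single constant mismatch rate alpha. The mismatch component is estimated once from the observed responses and then held fixed, which is what makes this a pseudo-likelihood rather than a full likelihood. All function and variable names (fit_mixture_regression, simulate_linked_file, alpha0) are hypothetical.

```python
import numpy as np


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def fit_mixture_regression(X, y, n_iter=200, alpha0=0.2):
    """EM for the assumed model
        y_i | x_i ~ (1 - alpha) * N(x_i' beta, sigma^2) + alpha * f_y(y_i),
    where f_y is a plug-in Gaussian estimate of the marginal density of y.
    """
    # Mismatch component: marginal of y, estimated once and held fixed.
    f_mismatch = normal_pdf(y, y.mean(), y.std())

    # Initialize from the naive least-squares fit that ignores mismatches.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma = np.std(y - X @ beta)
    alpha = alpha0

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a correct match.
        f_match = normal_pdf(y, X @ beta, sigma)
        w = (1 - alpha) * f_match / ((1 - alpha) * f_match + alpha * f_mismatch)

        # M-step: weighted least squares, weighted residual scale, mismatch rate.
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        resid = y - X @ beta
        sigma = np.sqrt(np.sum(w * resid ** 2) / np.sum(w))
        alpha = 1.0 - w.mean()

    return beta, sigma, alpha


def simulate_linked_file(n=2000, mismatch_rate=0.3, seed=0):
    """Simulated linked file: responses are shuffled within a random subset."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    beta_true = np.array([1.0, 2.0])
    y = X @ beta_true + rng.normal(scale=0.5, size=n)
    idx = rng.choice(n, size=int(mismatch_rate * n), replace=False)
    y[idx] = y[rng.permutation(idx)]  # mismatched pairs get another record's y
    return X, y, beta_true


X, y, beta_true = simulate_linked_file()
beta_naive = np.linalg.lstsq(X, y, rcond=None)[0]
beta_mix, sigma_mix, alpha_mix = fit_mixture_regression(X, y)

print("true slope:    ", beta_true[1])
print("naive slope:   ", beta_naive[1])
print("mixture slope: ", beta_mix[1])
print("estimated mismatch rate:", alpha_mix)
```

In this simulated setting the naive fit is attenuated toward zero because shuffled responses carry no information about the covariates, while the mixture fit downweights pairs that look like mismatches. Information about the linkage process, which the abstract says the framework can incorporate, could enter a sketch like this by letting alpha vary across records instead of being a single constant.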


record linkage

secondary analysis

mismatch error

mixture models


Brady West, Institute for Social Research


Guoqing Diao, George Washington University
Martin Slawski, George Mason University
Zhenbang Wang, George Mason University
Emanuel Ben-David, US Census Bureau