Methods for Data from Multiple Sources: Transfer Learning, Data Fusion, and More

Oana Enache Chair
Stanford University
 
Monday, Aug 4: 10:30 AM - 12:20 PM
4056 
Contributed Papers 
Music City Center 
Room: CC-201A 

Main Sponsor

Health Policy Statistics Section

Presentations

Bayesian Hierarchical Approach for Handling Non-ignorable Drop-out Across Multiple Clinical Trials in Schizophrenia

Clinical trials in schizophrenia assess symptom severity over time using a clinician-rated scale such as the Positive and Negative Syndrome Scale (PANSS). However, patients taking psychiatric medication have shown higher variability of response than patients taking medication for a physical disorder. Within randomized trials, dropout rates can also be quite large and vary between treatment groups, possibly introducing non-ignorable missingness, i.e., data missing not at random (MNAR). Combining such RCTs to evaluate treatment efficacy in an individual patient data (IPD) network meta-analysis (NMA) without accounting for non-ignorable dropout could bias the estimated treatment effects. To address these challenges and make full use of all available data, we aim to combine a popular method for addressing MNAR, pattern-mixture modeling, with Bayesian IPD NMA to improve estimation of the treatment effects. Through simulations, we examine the performance of our approach under varying data availability and complexity. We then apply our methods to clinical trials of schizophrenia treatments, demonstrating their effectiveness in handling non-ignorable dropout.
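The pattern-mixture idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' Bayesian IPD NMA model: it assumes only two patterns per arm (completers and dropouts) and a hypothetical sensitivity parameter `delta` that shifts the unobserved dropout mean away from the completer mean. All numbers are made up for illustration.

```python
def pattern_mixture_mean(mean_completers, dropout_rate, delta):
    """Marginal mean under a two-pattern mixture (completers vs. dropouts).

    Assumption: the unobserved mean among dropouts equals the completers'
    mean shifted by `delta`. `delta` cannot be estimated from observed data;
    delta = 0 treats dropouts exactly like completers.
    """
    if not 0.0 <= dropout_rate <= 1.0:
        raise ValueError("dropout_rate must lie in [0, 1]")
    mean_dropouts = mean_completers + delta
    return (1.0 - dropout_rate) * mean_completers + dropout_rate * mean_dropouts


def treatment_effect(arm_a, arm_b, delta):
    """Difference in marginal means (arm_a minus arm_b) for a common delta."""
    mean_a = pattern_mixture_mean(arm_a["mean"], arm_a["dropout"], delta)
    mean_b = pattern_mixture_mean(arm_b["mean"], arm_b["dropout"], delta)
    return mean_a - mean_b


# Hypothetical change-from-baseline summaries (illustrative values only).
drug = {"mean": -15.0, "dropout": 0.40}
placebo = {"mean": -8.0, "dropout": 0.20}

# The estimated effect shrinks as delta grows because dropout differs by arm.
for delta in (0.0, 2.0, 5.0):
    print(delta, round(treatment_effect(drug, placebo, delta), 2))
```

Because the two arms have different dropout rates, the same `delta` moves their marginal means by different amounts, which is one way differential dropout can bias a naive comparison.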

Keywords

Item Response Theory

Bayesian Statistics

Comparative Effectiveness Research

Missing Data

Mental Health

Meta-Analysis 

Co-Author

Hwanhee Hong

First Author

Elaona Lemoto, Duke School of Medicine, Department of Biostatistics and Bioinformatics

Presenting Author

Elaona Lemoto, Duke School of Medicine, Department of Biostatistics and Bioinformatics

Data fusion causal effect estimation for joint outcomes with missing data and correlated components

Introduction: Data fusion to generalize health economic data from RCTs is a promising approach to inform healthcare policymaking. Recent research comparing 7 estimators found that the augmented calibration weighting (ACW) estimator is consistent and precise even under model misspecification and strong sampling bias (Colnet et al. 2024). However, its performance in estimating ratio statistics (e.g., the incremental cost-effectiveness ratio) used in health economic studies has not been explored, particularly in settings with missing data and correlated outcome components.
Methods: We assess the estimators compared by Colnet et al. for ratio statistics under varying missingness mechanisms and correlation structures. Simulated observational (N=49000) and weakly shifted RCT (N=1000) datasets were resampled, and estimators were calculated across 100 iterations.
Results: Estimator variance for ratio statistics is sensitive to correlation of components. The ACW, AIPSW, and g-formula estimators are consistent and precise under NMAR missingness and correlation (MSE < 0.05; SV < 0.01).
Discussion: ACW's robustness for joint outcomes with correlated components and NMAR missingness supports its use in health economic analysis. 
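Why the variance of a ratio statistic depends on the correlation of its components can be seen from the standard first-order delta-method approximation. The sketch below is an illustration only, not the simulation design of the abstract; the function names and the numeric inputs are hypothetical.

```python
from math import sqrt


def icer(delta_cost, delta_effect):
    """Incremental cost-effectiveness ratio: incremental cost / incremental effect."""
    if delta_effect == 0:
        raise ZeroDivisionError("incremental effect is zero; the ratio is undefined")
    return delta_cost / delta_effect


def ratio_variance_delta(mu_c, mu_e, var_c, var_e, corr):
    """First-order delta-method variance of the ratio C / E.

    mu_c, mu_e   : means of the numerator and denominator estimators
    var_c, var_e : their variances
    corr         : correlation between the two estimators
    """
    cov = corr * sqrt(var_c * var_e)
    r = mu_c / mu_e
    return r ** 2 * (var_c / mu_c ** 2 + var_e / mu_e ** 2 - 2.0 * cov / (mu_c * mu_e))


# Hypothetical inputs: incremental cost 1000 (SD 100), incremental effect 0.5 (SD 0.05).
for corr in (-0.5, 0.0, 0.5):
    print(corr, ratio_variance_delta(1000.0, 0.5, 100.0 ** 2, 0.05 ** 2, corr))
```

With both means positive, positive correlation between the components lowers the approximate variance of the ratio and negative correlation raises it.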

Keywords

Data fusion

Causal inference

Health economic evaluation

Missing data

Incremental cost effectiveness ratio

Joint outcomes 

Co-Author(s)

Catherine Rabin, Weill Cornell Medicine
Ali Jalali, Weill Cornell Medicine

First Author

Caroline Andy, Weill Cornell Medicine

Presenting Author

Caroline Andy, Weill Cornell Medicine

Enhancing OMOP Vocabulary Mapping with a Transformer-Based Semantic-Hierarchical Framework

Limited interoperability across EHR systems, caused by inconsistent medical terminologies, is a critical barrier to leveraging healthcare data for policy and research. The OMOP Common Data Model (CDM) offers a standardized framework to harmonize data across platforms. However, traditional rule-based mapping is labor-intensive, which disproportionately impacts underserved hospitals with limited resources. Existing tools, such as USAGI, alleviate this burden by automating the mapping process, but they struggle with semantic complexity. For example, mapping "Leukemia" to its superclass "Hematologic neoplasm" requires understanding hierarchical relationships that go beyond surface-level text similarity.

In this talk, we propose a novel transformer-based model for automated OMOP terminology mapping that integrates OMOP's vocabulary structure and relational hierarchy. Two special tokens were added to guide the model's focus during training. This dual-task training approach captures ontology-based dependencies beyond surface-level semantics. Preliminary evaluation on the unseen CIEL vocabulary (condition domain) demonstrates improved accuracy and scalability compared to existing methods. 
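The "Leukemia" example above can be made concrete with a small sketch of surface-level matching. This uses Python's `difflib.SequenceMatcher` as a stand-in character-level similarity; it is neither USAGI's algorithm nor the proposed transformer model, and the candidate list is hypothetical.

```python
from difflib import SequenceMatcher


def surface_similarity(a, b):
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_surface_match(term, candidates):
    """Pick the candidate whose spelling is closest to the source term."""
    return max(candidates, key=lambda c: surface_similarity(term, c))


# Hypothetical candidates: the true superclass and a similarly spelled,
# clinically different term.
candidates = ["Hematologic neoplasm", "Leukopenia"]

for c in candidates:
    print(c, round(surface_similarity("Leukemia", c), 3))
print(best_surface_match("Leukemia", candidates))
```

A purely spelling-based matcher prefers the lookalike term over the superclass, which is the kind of error that hierarchy-aware training is meant to reduce.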

Keywords

sentence transformer

OMOP Common Data Model

semantic similarity

hierarchical relationships

terminology mapping

healthcare data integration 

Co-Author(s)

Dian Zhou, University of Illinois Urbana-Champaign
Enshuo Hsu, University of Texas MD Anderson Cancer Center
Jin Zhou, Hunan University

First Author

Jiefei Wang, University of Texas Medical Branch

Presenting Author

Jiefei Wang, University of Texas Medical Branch

Improving cancer risk prediction for underrepresented groups using transfer learning

Using risk prediction models tailored to specific populations to support medical decision making has the potential to improve patient outcomes, but developing such models for underrepresented groups is challenging due to limited sample sizes. In such cases, borrowing information from models developed for the majority population may enhance performance. We compare multiple approaches for improving prediction in an underrepresented target population by leveraging source and target data, including regularized regression and pre-trained neural networks. Using simulations, we assess performance across varying degrees of departure between the source and target populations in covariate distribution and model architecture. We apply these methods in the context of breast cancer risk prediction. Our findings provide insights into strategies for improving prediction in data-limited populations.
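One simple form of borrowing from a source model is to penalize the target estimate for departing from the source coefficient. The sketch below assumes a one-covariate linear model without intercept and a ridge-type penalty; the abstract does not specify its regularized regression in this form, and the function name and data are hypothetical.

```python
def transfer_ridge_slope(x, y, beta_source, lam):
    """Slope b minimizing sum((y_i - b * x_i)^2) + lam * (b - beta_source)^2.

    lam = 0 gives ordinary least squares on the target data alone;
    a large lam pulls the estimate toward the source coefficient.
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    if sxx + lam == 0:
        raise ValueError("no information: all x are zero and lam is zero")
    return (sxy + lam * beta_source) / (sxx + lam)


# Hypothetical small target sample with true slope 2; source model has slope 1.
x_target = [1.0, 2.0, 3.0]
y_target = [2.0, 4.0, 6.0]

for lam in (0.0, 14.0, 1e6):
    print(lam, round(transfer_ridge_slope(x_target, y_target, 1.0, lam), 4))
```

The penalty weight `lam` controls how much is borrowed: it would typically be tuned, for example by cross-validation on the target data, so that borrowing decreases when the source and target populations differ.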

Keywords

risk prediction

transfer learning

machine learning

health equity 

Co-Author

Rebecca Hubbard, Brown University

First Author

Mengyue Liu, Brown University

Presenting Author

Mengyue Liu, Brown University

Methodology for Supervised Optimization of the Construction of Physician Shared-Patient Networks

There is growing use of shared-patient physician networks in health services research and practice, but minimal study of the consequences of decisions made in constructing them. To address this gap, we surveyed physician employees of a national physician organization (NPO) on their peer physician relationships. Using the physicians' survey nominations as ground truth, we evaluated the diagnostic accuracy of shared-patient edge-weights and the optimal construction of physician networks from sequences of patient-physician encounters. To further improve diagnostic accuracy, we optimized network construction with respect to the within-dyad difference and summation of edge-strength (two orthogonal measures), optimally combining them to form a final edge-weight. To achieve these goals, we develop statistical procedures to quantify the extent to which directionality and other features of referral paths yield edge-weights with improved diagnostic properties. We also develop network models of the survey nominations incorporating directed (edge) and undirected (dyadic) shared-patient network measures as predictors to demonstrate that the measurement of the network as a whole is improved.
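The construction described above, from encounter sequences to directed edge-weights and then to within-dyad sums and differences, can be sketched as follows. The directed weight used here (number of patients who first saw physician a before first seeing physician b) is an assumed definition for illustration, not necessarily the one used by the authors; the data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def directed_shared_patient_weights(encounters):
    """Project patient-physician encounter sequences onto directed physician edges.

    encounters: dict mapping patient id -> list of physician ids in visit order.
    weights[(a, b)] counts patients whose first visit to a preceded their
    first visit to b (an assumed, illustrative definition).
    """
    weights = defaultdict(int)
    for visits in encounters.values():
        # Unique physicians in order of first visit.
        first_seen = list(dict.fromkeys(visits))
        for a, b in combinations(first_seen, 2):
            weights[(a, b)] += 1
    return dict(weights)


def dyad_sum_and_difference(weights, a, b):
    """Undirected strength (sum) and directional imbalance (difference) of a dyad."""
    w_ab = weights.get((a, b), 0)
    w_ba = weights.get((b, a), 0)
    return w_ab + w_ba, w_ab - w_ba


# Hypothetical encounter sequences.
encounters = {
    "p1": ["A", "B", "A"],
    "p2": ["B", "A"],
    "p3": ["A", "B", "C"],
    "p4": ["A", "C"],
}
weights = directed_shared_patient_weights(encounters)
print(weights)
print(dyad_sum_and_difference(weights, "A", "B"))
```

The sum is the usual undirected shared-patient count, while the difference carries the directional information that an undirected projection discards.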

Keywords

Bipartite network

Diagnostic accuracy

Directional information

Optimal unipartite projection

Physician beliefs

Shared-patient physician network 

Co-Author(s)

Yifan Zhao
Carly Bobak
Chuanling Qin
Erika Moen, Geisel School of Medicine at Dartmouth
Daniel Rockmore, Dartmouth College

First Author

James O'Malley, Dartmouth University, Geisel School of Medicine

Presenting Author

James O'Malley, Dartmouth University, Geisel School of Medicine

Optimal Policy Learning Under Spatial Dependence With Applications to Groundwater in Wisconsin

When installing drinking water wells, it is well understood that increasing well depth improves groundwater quality but also raises costs. Policymakers must therefore determine the minimum well depth required to meet public health standards for contaminants in groundwater, such as nitrates, a common contaminant from fertilizers. In Wisconsin, the current approach to setting the minimum well depth is often a single, static number, which ignores local hydrogeological characteristics. In this paper, we propose a data-driven method for estimating the Spatial Minimum Resource Threshold Policy (spMRTP), which determines the minimum treatment level needed at each location to meet the target outcome. A key feature of spMRTP is that it accounts for spatial dependence of contaminants, where high contaminant levels in one area often imply high contaminant levels in adjacent areas. We estimate spMRTP by empirical risk minimization with a novel, nonparametric, doubly robust loss function. For computation, we propose using the Vecchia approximation to efficiently evaluate the minimizer. Our simulation results demonstrate that the proposed method outperforms competing approaches, including non-spatial methods for policy learning and indirect estimation methods. We also apply our method to water quality data collected from 2014 to 2024 in Wisconsin and generate a spatial map of optimal minimum well depths in Wisconsin that meet the 10-ppm public health standard for nitrates.
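The target quantity, the minimum depth at a location that meets the nitrate standard, can be illustrated with a plug-in sketch. This is not the proposed doubly robust estimator and it ignores spatial dependence; it assumes a predicted nitrate curve is already available for each location, and the curves, location names, and depth grid below are hypothetical.

```python
def minimum_depth(predict_nitrate, depths, threshold_ppm=10.0):
    """Smallest candidate depth whose predicted nitrate level meets the threshold.

    predict_nitrate: function mapping depth -> predicted nitrate (ppm) at one location.
    Returns None when no candidate depth meets the threshold.
    """
    for depth in sorted(depths):
        if predict_nitrate(depth) <= threshold_ppm:
            return depth
    return None


# Hypothetical predicted curves for three locations (nitrate falls with depth).
locations = {
    "site_1": lambda d: 25.0 - d / 10.0,
    "site_2": lambda d: 14.0 - d / 25.0,
    "site_3": lambda d: 40.0 - d / 100.0,
}
candidate_depths = range(0, 301, 25)

for name, curve in locations.items():
    print(name, minimum_depth(curve, candidate_depths))
```

A single statewide depth would be too shallow for some of these hypothetical sites and needlessly deep for others, which is the motivation for a location-specific policy.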

Keywords

Transportability

Overlap condition

Density ratio

Poisson regression

Inhomogeneous Poisson point process 

Co-Author(s)

Christopher Zahasky, University of Wisconsin-Madison
Xindi Lin
Hyunseung Kang, University of Wisconsin-Madison

First Author

Xinran Miao

Presenting Author

Xindi Lin

Using administrative data to complement survey results for a population-based health survey

The Hennepin County (Minnesota) Public Health Department administers a large random address-based sample survey (SHAPE) on the health of the adult population living in the county every 4 years. Over 7000 households responded to the most recent iteration of the survey in 2022. As with all surveys, some respondents skip some questions (item non-response) or enter unusable answers. Since some of the questions, e.g., household size, household income, are key to either weighting the data or assigning the respondent to demographic groups of interest, it is important that these be as complete as possible.

Although the SHAPE survey does not identify the person completing the survey, the respondent's household address is known. The SHAPE team has attempted to use other administrative data with household-level information, available through the County, to complement the survey results by replacing or imputing the missing information. This effort tests the applicability and usability of matching survey and administrative data at the local level to improve the quality of the data.
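Address-level matching of this kind can be sketched in a few lines. The record layout, field names, and normalization rule below are hypothetical and are not taken from the SHAPE project; real address matching usually needs more careful standardization.

```python
def normalize_address(address):
    """Crude normalization: uppercase, drop periods, commas to spaces, squeeze whitespace."""
    return " ".join(address.upper().replace(".", "").replace(",", " ").split())


def fill_from_admin(survey_rows, admin_rows, field):
    """Fill a missing survey field from administrative records matched on address.

    Adds a '<field>_source' flag so filled values stay distinguishable
    from self-reported ones.
    """
    admin_by_address = {normalize_address(r["address"]): r for r in admin_rows}
    filled = []
    for row in survey_rows:
        new_row = dict(row)
        source = "survey"
        if new_row.get(field) is None:
            match = admin_by_address.get(normalize_address(row["address"]))
            if match is not None and match.get(field) is not None:
                new_row[field] = match[field]
                source = "admin"
            else:
                source = "missing"
        new_row[field + "_source"] = source
        filled.append(new_row)
    return filled


# Hypothetical records.
survey = [
    {"address": "123 Main St.", "household_size": None},
    {"address": "9 Oak Ave", "household_size": 2},
    {"address": "77 Elm Rd", "household_size": None},
]
admin = [{"address": "123 main st", "household_size": 4}]

for row in fill_from_admin(survey, admin, "household_size"):
    print(row)
```

Keeping the source flag lets analysts check how sensitive weighted estimates are to values taken from administrative records rather than from respondents.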

Keywords

Address based sampling

Public Health

survey methodology

local government

administrative data

health research 

First Author

Urban Landreman, Hennepin County

Presenting Author

Urban Landreman, Hennepin County