Recent Topics in Missing Data and Model Selection

LINGPENG SHAN Chair
The Ohio State University
 
Monday, Aug 4: 2:00 PM - 3:50 PM
4074 
Contributed Papers 
Music City Center 
Room: CC-210 
This session will present the latest research on strategies for handling missing data, as well as the latest research on model selection.

Main Sponsor

Biometrics Section

Presentations

A Robust Method for Integrating Heterogeneous and Summary-Level Data from Various Data Sources

The dramatic increase in data sources for scientific research has highlighted the need for statistical methods that efficiently combine data at different levels to build a comprehensive model. In our previous work, we demonstrated that the parameters of a full model can be estimated from summary-level data by integrating straightforward score equations, provided the random sampling assumption holds. In this research, we propose an extended method that combines data from potentially heterogeneous populations and summary-level data while accounting for this heterogeneity using the Fisher information matrix. The technique uses this information to estimate the sampling weights of each study, which are then used to recalibrate the estimating equations for the full model coefficients. The performance of the proposed method will be evaluated under various sampling designs using simulation studies. The method will also be applied, with the random sampling assumption relaxed, to a reanalysis of data from U.S. cancer registries and summary-level odds ratio estimates for selected colorectal cancer (CRC) risk factors.

Keywords

Data integration

Information synthesis

Summary level information

Sampling weight calibration

propensity score 

Co-Author

Andriy Derkach, Memorial Sloan Kettering Cancer Center

First Author

Farimah Shamsi

Presenting Author

Farimah Shamsi

Beyond the Surface: Analyzing the Effects of Intercurrent Events in Randomized Clinical Trials

The ICH E9(R1) Addendum provides methods for addressing intercurrent events (ICEs), events occurring after treatment initiation that can influence the interpretation of clinical outcomes. This guidance is critical for establishing an estimand, which defines the expected treatment effect. The Principal Stratification (PS) strategy, introduced by Frangakis & Rubin in 2002, categorizes participants based on potential ICE occurrences in different treatment groups to define a causal estimand. However, this category of models is associated with assumptions that some have deemed unverifiable and almost extreme (Vansteelandt & Van Lancker, 2024). In this presentation, we report simulation studies (CITIES: Abdul Wahab et al., 2023) and investigate the effects of incorrectly specifying the covariates intended to model principal strata membership, a misspecification that contravenes a fundamental assumption of PS models.

Keywords

estimand

causal

intercurrent events

principal stratification

clinical trial

ICH E9(R1) 

First Author

Ahmad Hakeem Abdul Wahab, Janssen

Presenting Author

Ahmad Hakeem Abdul Wahab, Janssen

Knowledge-Guided Bayesian Factor Analysis of Multi-Omics Data in the Presence of Missingness

The integration of high-dimensional multi-omics data is critical for identifying joint mechanisms underlying complex diseases and phenotypes. Bayesian factor analysis models with informative, sparsity-inducing priors based on domain knowledge can decompose these data into low-dimensional representations. However, missingness poses a challenge for inferring latent factors; traditional complete-case and imputation approaches may induce bias when partially observed modalities are not missing completely at random. We propose a novel Bayesian factor model that employs data augmentation to dynamically impute incomplete -omics layers during inference. Hierarchical priors in our model enable the incorporation of biological graphs, promoting joint selection of biologically relevant factors that may be missing. Simulation studies showed that our method was robust to ignorable missingness and outperformed the state-of-the-art in the case of block-missingness. In a real-world application to Alzheimer's disease, we achieved interpretable dimension reduction and diagnosis prediction, illustrating the ability to elucidate complex biological systems in the presence of incomplete multi-omics data. 

Keywords

Missing data

Bayesian factor analysis

Multi-omics integration

Dimension reduction

Multivariate data analysis

Knowledge graph 

Co-Author(s)

Qiyiwen Zhang, University of Pittsburgh
Qi Long

First Author

Konstantinos Tsingas, University of Pennsylvania

Presenting Author

Konstantinos Tsingas, University of Pennsylvania

Robust Causal Inference for Point Exposures in Electronic Health Record Based Observational Studies

Missingness in variables that define eligibility criteria is a pervasive challenge in electronic health record (EHR)-based observational studies. Typically, patients with incomplete eligibility information are excluded from analysis without consideration of the assumptions that are (implicitly) being made, leaving study conclusions subject to potential selection bias. To the best of our knowledge, however, very little work has been done to mitigate this concern, and existing solutions require correct specification of all relevant models for outcome/treatment/imputation to ensure consistent estimation of causal contrasts. In this work, we propose a robust and efficient estimator of the causal average treatment effect on the treated, study-eligible population in cohort studies where eligibility-defining covariates are missing at random. We demonstrate the use of our method on EHR data from Kaiser Permanente to analyze differences between two common bariatric surgical interventions for long-term weight and glycemic outcomes among a cohort of severely obese patients with type II diabetes mellitus.

Keywords

Missing Data

Causal Inference

Multiply Robust

Influence Functions

Bariatric Surgery

Diabetes 

Co-Author(s)

Alexander Levis, Carnegie Mellon University
Sebastien Haneuse, Harvard T.H. Chan School of Public Health

First Author

Luke Benz, Harvard University, Department of Biostatistics

Presenting Author

Luke Benz, Harvard University, Department of Biostatistics

Sensitivity analysis with iterative outlier detection for systematic reviews and meta-analyses

Meta-analysis is a widely used tool for synthesizing results from multiple studies. A critical problem in meta-analyses and systematic reviews is that outlying studies are frequently included, which can lead to invalid conclusions and affect the robustness of decision-making. Outliers may arise from several factors, such as study selection criteria, low study quality, and small-study effects. The conventional outlier detection method in meta-analysis is based on a leave-one-study-out procedure. However, when the deviation of a potentially outlying study is calculated, other outliers can substantially distort the result. This article proposes an iterative method for detecting potential outliers that reduces this confounding influence. Furthermore, we adopt bagging to provide valid inference for sensitivity analyses that exclude outliers. Based on simulation studies, the proposed iterative method yields smaller bias and heterogeneity after the identified outliers are removed and provides higher accuracy in outlier detection. Two case studies are used to illustrate the proposed method's real-world performance.
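To make the contrast between a single leave-one-study-out pass and an iterative search concrete, here is a minimal Python sketch. It is not the authors' procedure: it assumes a fixed-effect inverse-variance pool, a hypothetical cutoff of 1.96 on the standardized deviation, and made-up effect sizes, and it omits the bagging step entirely. All function names are illustrative.

```python
import math

def pooled_fixed_effect(effects, variances):
    """Inverse-variance weighted mean and the variance of that mean."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * y for w, y in zip(weights, effects)) / total
    return mean, 1.0 / total

def loo_z_scores(effects, variances, active):
    """Standardized deviation of each active study from the pool of the others."""
    scores = {}
    for i in active:
        others = [j for j in active if j != i]
        mean, var = pooled_fixed_effect(
            [effects[j] for j in others], [variances[j] for j in others]
        )
        scores[i] = (effects[i] - mean) / math.sqrt(variances[i] + var)
    return scores

def iterative_outliers(effects, variances, threshold=1.96, min_studies=3):
    """Repeatedly drop the most deviant study until none exceeds the cutoff.

    The threshold and min_studies defaults are assumptions for illustration.
    """
    active = list(range(len(effects)))
    flagged = []
    while len(active) > min_studies:
        scores = loo_z_scores(effects, variances, active)
        worst = max(scores, key=lambda i: abs(scores[i]))
        if abs(scores[worst]) <= threshold:
            break
        flagged.append(worst)
        active.remove(worst)
    return flagged, active

# Made-up data: five similar studies plus two large effects.
effects = [0.10, 0.12, 0.08, 0.11, 0.09, 0.90, 0.85]
variances = [0.01] * 7
flagged, kept = iterative_outliers(effects, variances)
print(flagged, kept)
```

With these invented inputs, a single leave-one-out pass pushes every study past the cutoff because the two large effects contaminate each reference pool, whereas the iterative loop flags only the two large studies (indices 5 and 6) and then stops.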

Keywords

Meta-analysis

heterogeneity

iterative method

outlier

sensitivity analysis 

Co-Author(s)

Jingshen Wang, UC Berkeley
Chong Wu, The University of Texas MD Anderson Cancer Center
Lifeng Lin

First Author

Zhuo Meng

Presenting Author

Zhuo Meng

Smiling through the gaps: Missing data solutions for periodontal estimates from partial-mouth exams

Partial-mouth periodontal examinations (PMPE) are a suggested alternative to full-mouth examinations in oral-health epidemiological studies. While more cost-effective and less burdensome for study participants, they often introduce systematic missingness that leads to substantial underestimation of disease prevalence. Viewing the problem from an incomplete-data perspective, our previous work employed multiple imputation (MI) as a framework for representing observed patterns of association while also reflecting uncertainty in individual values. While the MI approach was helpful in reducing bias, questions remain as to the effectiveness of alternative modeling assumptions and missing-data approaches over our initial MI approach for estimating periodontal disease prevalence from PMPE designs. We will outline an evaluation of trade-offs between bias reduction, coverage, and robustness across each of the newly considered missing-data mechanisms using both empirical data and simulations. Our results will provide methodological insights for improving the handling of missing data arising from PMPE designs in epidemiological studies.
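The way MI reflects uncertainty in individual values can be illustrated with the standard combining step known as Rubin's rules. The sketch below is a generic illustration, not the authors' analysis: the function name and the prevalence estimates are hypothetical, and the abstract does not say which pooling rule was used.

```python
def rubin_pool(estimates, variances):
    """Combine per-imputation estimates and their variances with Rubin's rules.

    Returns the pooled point estimate and its total variance, where
    total = within + (1 + 1/m) * between for m imputed data sets.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = sum(estimates) / m                       # pooled point estimate
    within = sum(variances) / m                      # average sampling variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between       # adds imputation uncertainty
    return q_bar, total

# Made-up prevalence estimates from five imputed data sets.
prevalence = [0.40, 0.42, 0.38, 0.41, 0.39]
sampling_var = [0.0004] * 5
pooled, total_var = rubin_pool(prevalence, sampling_var)
print(pooled, total_var)
```

The between-imputation term is what separates MI from single imputation: if the imputed data sets disagree, the total variance grows beyond the average within-imputation variance.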

Keywords

Dentistry

Oral health

Missing data

Epidemiology

Periodontitis

Public health 

Co-Author

Thomas Belin, University of California-Los Angeles

First Author

Danielle LaVine, University of California, Los Angeles

Presenting Author

Danielle LaVine, University of California, Los Angeles

Statistical analysis of outcomes involving missing data due to mortality or dropout

Multiple approaches are available for handling missing data when the data are missing at random. We often observe missing outcome data in clinical studies due to death and dropout. Removing observations that are missing due to mortality or dropout can reduce the sample size and, in turn, statistical power. Moreover, the results are then generalizable only to the subjects represented by the non-mortality data. Missing data imputation that ignores death or dropout information may produce biased estimates and uninterpretable group estimates. We propose several approaches for analyzing different forms of outcome data after adjusting for the proportion of mortality or dropout rates. We used zero hurdle regression, multinomial logistic regression, ordinal logistic regression, and prognostic score-adjusted models. We applied these methods to determine the effects of statin on different outcomes in COVID-19 patients. Based on descriptive and simulation comparisons, our findings suggest directly analyzing the joint distribution of mortality or dropout and outcome data. However, the specific type of analysis depends on the type of outcome and prevalence of dropout or mortality in the study.
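As a rough illustration of the two-part structure behind a zero hurdle model, the sketch below writes out a hurdle distribution with a Poisson count part. The Poisson choice, the parameter values, and the function names are assumptions for illustration; the abstract does not state which count distribution or covariate structure the authors used.

```python
import math

def poisson_pmf(y, lam):
    """Ordinary Poisson probability of the count y."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def hurdle_poisson_pmf(y, p_zero, lam):
    """Hurdle model: a point mass p_zero at zero, and a zero-truncated
    Poisson for the positive counts."""
    if y < 0:
        return 0.0
    if y == 0:
        return p_zero
    truncated = poisson_pmf(y, lam) / (1.0 - math.exp(-lam))
    return (1.0 - p_zero) * truncated

def hurdle_poisson_mean(p_zero, lam):
    """Mean of the hurdle model: P(Y > 0) times the truncated Poisson mean."""
    return (1.0 - p_zero) * lam / (1.0 - math.exp(-lam))

# Hypothetical parameters: 30% of subjects in the zero state.
p_zero, lam = 0.3, 2.5
total_mass = sum(hurdle_poisson_pmf(y, p_zero, lam) for y in range(60))
print(total_mass, hurdle_poisson_mean(p_zero, lam))
```

In a regression setting, both the zero probability and the count rate would typically depend on covariates, which lets the zero state and the positive outcomes be modeled jointly instead of discarding either part.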

Keywords

Zero hurdle models

Missing data

Multinomial logistic regression

Finite mixed models

prognostic score-adjusted models

ordinal logistic regression 

Co-Author

Shakeel Ahmed, Texas Tech University Health Sciences Center El Paso

First Author

Alok Dwivedi, Texas Tech University Health Sciences Center El Paso

Presenting Author

Alok Dwivedi, Texas Tech University Health Sciences Center El Paso