Monday, Aug 4: 2:00 PM - 3:50 PM
4074
Contributed Papers
Music City Center
Room: CC-210
This session will present the latest research on strategies for handling missing data, as well as the latest research on model selection.
Main Sponsor
Biometrics Section
Presentations
The dramatic increase in data sources for scientific research has highlighted the need for statistical methods that efficiently combine data at different levels into a comprehensive model. In our previous work, we demonstrated that the parameters of the full model can be estimated from summary-level data by integrating straightforward score equations, provided the random sampling assumption holds. In this research, we propose an extended method that combines data from potentially heterogeneous populations and summary-level data while accounting for this heterogeneity using the Fisher information matrix. The technique uses this information to estimate the sampling weights of each study, which are then used to recalibrate the estimating equations for the full-model coefficients. The performance of the proposed method will be evaluated under various sampling designs using simulation studies, and the method will be applied to a reanalysis of data from U.S. cancer registries and summary-level odds ratio estimates of selected colorectal cancer (CRC) risk factors while relaxing the random sampling assumption.
Keywords
Data integration
Information synthesis
Summary level information
Sampling weight calibration
propensity score
The ICH E9(R1) Addendum provides methods for addressing intercurrent events (ICEs), occurrences post-treatment initiation that can influence the interpretation of clinical outcomes. This guidance is critical for establishing an estimand, which defines the expected treatment effect. The Principal Stratification (PS) strategy, introduced by Frangakis & Rubin in 2002, categorizes participants based on potential ICE occurrences in different treatment groups to define a causal estimand. However, this category of models is associated with assumptions that some have deemed unverifiable and almost extreme (Vansteelandt & Van Lancker, 2024). In this presentation, we present simulation studies (CITIES: Abdul Wahab et al., 2023) and investigate the effects of incorrectly specifying covariates intended to model principal strata membership, which contravenes a fundamental assumption of PS models.
Keywords
estimand
causal
intercurrent events
principal stratification
clinical trial
ICH E9(R1)
The integration of high-dimensional multi-omics data is critical for identifying joint mechanisms underlying complex diseases and phenotypes. Bayesian factor analysis models with informative, sparsity-inducing priors based on domain knowledge can decompose these data into low-dimensional representations. However, missingness poses a challenge for inferring latent factors; traditional complete-case and imputation approaches may induce bias when partially observed modalities are not missing completely at random. We propose a novel Bayesian factor model that employs data augmentation to dynamically impute incomplete -omics layers during inference. Hierarchical priors in our model enable the incorporation of biological graphs, promoting joint selection of biologically relevant factors that may be missing. Simulation studies showed that our method was robust to ignorable missingness and outperformed the state-of-the-art in the case of block-missingness. In a real-world application to Alzheimer's disease, we achieved interpretable dimension reduction and diagnosis prediction, illustrating the ability to elucidate complex biological systems in the presence of incomplete multi-omics data.
Keywords
Missing data
Bayesian factor analysis
Multi-omics integration
Dimension reduction
Multivariate data analysis
Knowledge graph
Missingness in variables that define eligibility criteria is a pervasive challenge in electronic health record (EHR)-based observational studies. Patients with incomplete eligibility information are typically excluded from analysis without consideration of the assumptions that are implicitly being made, leaving study conclusions subject to potential selection bias. To the best of our knowledge, however, very little work has been done to mitigate this concern, and existing solutions require correct specification of all relevant models for outcome/treatment/imputation to ensure consistent estimation of causal contrasts. In this work, we propose a robust and efficient estimator of the causal average treatment effect on the treated, study-eligible population in cohort studies where eligibility-defining covariates are missing at random. We demonstrate the use of our method on EHR data from Kaiser Permanente to analyze differences between two common bariatric surgical interventions for long-term weight and glycemic outcomes among a cohort of severely obese patients with type II diabetes mellitus.
Keywords
Missing Data
Causal Inference
Multiply Robust
Influence Functions
Bariatric Surgery
Diabetes
Co-Author(s)
Alexander Levis, Carnegie Mellon University
Sebastien Haneuse, Harvard T.H. Chan School of Public Health
First Author
Luke Benz, Harvard University, Department of Biostatistics
Presenting Author
Luke Benz, Harvard University, Department of Biostatistics
Meta-analysis is a widely used tool for synthesizing results from multiple studies. A critical problem in meta-analyses and systematic reviews is that outlying studies are frequently included, which can lead to invalid conclusions and affect the robustness of decision-making. Outliers may be caused by several factors, such as study selection criteria, low study quality, and small-study effects. The conventional outlier detection method in meta-analysis is based on a leave-one-study-out procedure. However, when a potentially outlying study's deviation is calculated, other outliers can substantially affect the result. This article proposes an iterative method to detect potential outliers, which reduces the influence of other outliers that could confound the detection. Furthermore, we adopt bagging to provide valid inference for sensitivity analyses that exclude outliers. Based on simulation studies, the proposed iterative method yields smaller bias and heterogeneity after removing the identified outliers and provides higher accuracy in outlier detection. Two case studies are used to illustrate the proposed method's real-world performance.
Keywords
Meta-analysis
heterogeneity
iterative method
outlier
sensitivity analysis
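The masking problem described in the meta-analysis abstract above can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes a fixed-effect inverse-variance pooled estimate, a standardized leave-one-out deviation, and a simple rule that removes the single most extreme study per pass. The function names, the 1.96 threshold, and the `min_studies` floor are all hypothetical choices made for illustration.

```python
import math


def pooled_fixed_effect(effects, variances):
    """Inverse-variance (fixed-effect) pooled estimate and its variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, effects)) / total
    return estimate, 1.0 / total


def leave_one_out_z(effects, variances, index, active):
    """Standardized deviation of study `index` from the pooled
    estimate of the other studies still in `active`."""
    others = [i for i in active if i != index]
    est, var = pooled_fixed_effect(
        [effects[i] for i in others], [variances[i] for i in others]
    )
    return (effects[index] - est) / math.sqrt(variances[index] + var)


def iterative_outliers(effects, variances, threshold=1.96, min_studies=3):
    """Repeatedly drop the study with the largest |z| until none exceeds
    the threshold (assumed rule, for illustration only)."""
    active = list(range(len(effects)))
    flagged = []
    while len(active) > min_studies:
        z = {i: leave_one_out_z(effects, variances, i, active) for i in active}
        worst = max(active, key=lambda i: abs(z[i]))
        if abs(z[worst]) <= threshold:
            break
        flagged.append(worst)
        active.remove(worst)
    return flagged
```

With made-up effects such as `[0.10, 0.12, 0.08, 0.11, 0.09, 1.50, 1.40]` and equal variances of `0.01`, a single leave-one-out pass gives every study a large deviation, because the two extreme studies pull the pooled estimate toward themselves. Removing one study at a time and recomputing is one simple way to limit that distortion.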
Partial-mouth periodontal examinations (PMPE) are a suggested alternative to full-mouth examinations in oral-health epidemiological studies. While more cost-effective and less burdensome for study participants, they often introduce systematic missingness, leading to substantial underestimation of disease prevalence. Viewing the problem from an incomplete-data perspective, our previous work employed multiple imputation (MI) as a framework for representing observed patterns of association while also reflecting uncertainty in individual values. While the MI approach was helpful in reducing bias, questions remain as to the effectiveness of alternative modeling assumptions and missing-data approaches relative to our initial MI approach for estimating periodontal disease prevalence from PMPE designs. We will outline an evaluation of trade-offs between bias reduction, coverage, and robustness across each of the newly considered missing-data mechanisms using both empirical data and simulations. Our results will provide methodological insights for improving the handling of missing data arising from PMPE designs in epidemiological studies.
Keywords
Dentistry
Oral health
Missing data
Epidemiology
Periodontitis
Public health
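The PMPE abstract above describes multiple imputation as reflecting uncertainty in individual values. One way to see how that uncertainty enters the final estimate is through the standard combining rules (Rubin's rules), sketched below. The abstract does not say how the authors pool their results, so this is an illustration only; the function name and the inputs are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Combine m completed-data estimates with Rubin's rules.

    Returns the pooled estimate and its total variance,
    T = W + (1 + 1/m) * B, where W is the average within-imputation
    variance and B is the between-imputation variance.
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = mean(estimates)
    within = mean(within_variances)
    between = variance(estimates)  # sample variance, divisor m - 1
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total
```

The between-imputation term is what separates MI from single imputation: when the imputed datasets disagree about prevalence, the total variance grows accordingly.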
Multiple approaches are available to handle missing data if the data are missing at random. We often observe missing outcome data in clinical studies due to death and dropout. Removing observations that are missing due to mortality or dropout can reduce the sample size and, subsequently, statistical power. Moreover, the results are then generalizable only to subjects who did not die or drop out. Missing data imputation that ignores death or dropout information may produce biased estimates and uninterpretable group estimates. We propose several approaches for analyzing different forms of outcome data after adjusting for the proportion of mortality or dropout. We used zero hurdle regression, multinomial logistic regression, ordinal logistic regression, and prognostic score-adjusted models. We applied these methods to determine the effects of statins on different outcomes in COVID-19 patients. Based on descriptive and simulation comparisons, our findings suggest directly analyzing the joint distribution of mortality or dropout and outcome data. However, the specific type of analysis depends on the type of outcome and the prevalence of dropout or mortality in the study.
Keywords
Zero hurdle models
Missing data
Multinomial logistic regression
Finite mixed models
prognostic score-adjusted models
ordinal logistic regression
Co-Author
Shakeel Ahmed, Texas Tech University Health Sciences Center El Paso
First Author
Alok Dwivedi, Texas Tech University Health Sciences Center El Paso
Presenting Author
Alok Dwivedi, Texas Tech University Health Sciences Center El Paso
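For readers unfamiliar with the zero hurdle regression named in the abstract above, the generic two-part form for a count outcome is shown below. The abstract does not give the authors' specification, so the notation here is an assumption: $\pi_i$ is the probability of the zero state, $f(\cdot\,;\mu_i)$ is a count distribution such as the Poisson, and $x_i$, $\gamma$ are hypothetical covariates and coefficients. Treating the zero state as death or dropout is one possible reading, not something the abstract states.

```latex
\begin{align*}
\Pr(Y_i = 0) &= \pi_i, \qquad \operatorname{logit}(\pi_i) = x_i^{\top}\gamma, \\
\Pr(Y_i = y) &= (1 - \pi_i)\,\frac{f(y;\mu_i)}{1 - f(0;\mu_i)}, \qquad y = 1, 2, \dots
\end{align*}
```

The first line models whether the hurdle is crossed, and the second models the positive outcomes through a zero-truncated distribution, so the two parts are estimated jointly rather than by discarding the zero-state subjects.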