CS004 Innovative Statistical and Computational Approaches

Conference: Symposium on Data Science and Statistics (SDSS) 2025
04/30/2025: 10:30 AM - 12:00 PM MDT
Refereed 
Room: Alpine West 

Chair

Julian Chan, Weber State University

Target Audience

Expert

Tracks

Computational Statistics
Symposium on Data Science and Statistics (SDSS) 2025

Presentations

Efficient Bayesian inference for two-stage models

Statistical models often require inputs that are not completely known. This can occur when those inputs are measured with error or indirectly, or when they correspond to an unobservable parameter in another model. A prominent application is environmental epidemiology, where individual air pollution exposure is a key variable for health outcomes, yet it cannot be observed directly and is instead estimated by a model. In these cases, the common choice is the two-stage Bayesian modeling approach, in which the two levels of the model are written down separately. The stage-one model estimates the unknown parameter, and those estimates are then incorporated as inputs in the stage-two model. However, to target the correct posterior distributions, two-stage Bayesian models must correctly propagate the uncertainty from the first stage to the second. In practice, researchers often fail to do so and use simplified, incorrect methods. We show both analytically and empirically the negative consequences of failing to account for uncertainty correctly, even in a simple setting. Plug-in methods that estimate and fix the inputs are subject to attenuation bias and underestimate uncertainties. Partial posterior methods that propagate uncertainty from the stage-one model without adjusting for the stage-two model fail to correct this bias and inflate uncertainties. We propose two algorithms for two-stage modeling that propagate the uncertainty across the two stages. The first is a streamlined importance sampling algorithm that performs best when the inputs from the stage-one posterior are approximately independent, while the second provides a correction when they are not. We then use analytical and empirical results in a variety of settings to show that, unlike the common competing methods, our algorithms correctly propagate uncertainties and target the correct distributions when their assumptions are met.
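As a toy illustration of reweighting stage-one draws with stage-two information, the sketch below uses self-normalized importance sampling in a conjugate normal setting where the exact answer is known. It is not the authors' algorithm, whose details the abstract does not give. The setup is an assumption made for illustration: a scalar input x with stage-one posterior N(m1, v1) and a single stage-two observation y ~ N(x, v2) with known variance. All names (m1, v1, y, v2, reweight_stage_one_draws, weighted_mean) are hypothetical.

```python
import math
import random


def reweight_stage_one_draws(draws, y, obs_var):
    """Return self-normalized weights proportional to N(y | x, obs_var)."""
    log_w = [-0.5 * (y - x) ** 2 / obs_var for x in draws]
    max_log_w = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - max_log_w) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def weighted_mean(draws, weights):
    """Importance-sampling estimate of the posterior mean."""
    return sum(w * x for w, x in zip(weights, draws))


rng = random.Random(0)

# Assumed toy values: stage-one posterior N(m1, v1), stage-two datum y.
m1, v1 = 0.0, 1.0
y, v2 = 2.0, 1.0

# Draws from the stage-one posterior alone (the "partial posterior").
draws = [rng.gauss(m1, math.sqrt(v1)) for _ in range(20000)]

# Reweight by the stage-two likelihood to target the joint posterior.
weights = reweight_stage_one_draws(draws, y, v2)
is_mean = weighted_mean(draws, weights)

# Closed-form posterior mean for this conjugate normal model.
exact_mean = (m1 / v1 + y / v2) / (1 / v1 + 1 / v2)

print(f"importance-sampling mean: {is_mean:.3f}, exact mean: {exact_mean:.3f}")
```

In this toy model the unweighted average of the draws stays near m1, while the reweighted average moves toward the value implied by both stages, which is the kind of feedback from stage two that a partial posterior approach omits.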

Presenting Author

Konstantin Larin, Amherst College

First Author

Konstantin Larin, Amherst College

CoAuthor

Dan Kowal, Cornell University

Ensemble Learning for Survival Analysis of Clinical and Genomic Biomarkers in Advanced Non-Small Cell Lung Cancer

Lung cancer is the leading cause of cancer-related deaths in the U.S., with non-small cell lung cancer (NSCLC) comprising approximately 85% of cases. Survival analysis for NSCLC is essential for identifying clinical and genomic biomarkers influencing progression-free survival (PFS), the time until progression or death due to NSCLC. Such biomarkers enable personalized treatment and prognosis prediction for NSCLC, improving patient outcomes and advancing precision oncology. In this study, we analyze a cohort of 216 U.S. patients with advanced NSCLC using two ensemble learning survival methods, random survival forests (RSF) and a gradient-boosted machine (GBM), and a stratified Cox proportional hazards model. All models accounted for censoring. RSF employs multiple decision trees to estimate hazards, with overall hazard predictions derived by averaging outputs from all trees. GBM uses regression trees as base learners, optimized with the Cox proportional hazards model's log-likelihood function. The models' PFS prediction performance was evaluated using the concordance index (C-index). All models demonstrated better-than-random prediction. GBM (C-index: 0.733) had the highest predictive capability, followed by RSF (C-index: 0.732) and the stratified Cox proportional hazards model (C-index: 0.726). Key biomarkers were identified using permutation- and impurity-based feature importance, and the effects of these biomarkers on PFS were characterized with hazard ratios. The models identified several significant biomarkers, including circulating albumin, derived neutrophil-to-lymphocyte ratio (dNLR), PD-L1 expression, and tumor mutational burden (TMB). Albumin and dNLR, markers of systemic inflammation, were linked to survival outcomes, reflecting the role of inflammation in cancer progression. PD-L1 and TMB, key immunotherapy biomarkers, showed modest protective effects, consistent with immunotherapy benefits for certain NSCLC patients.
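For readers unfamiliar with the concordance index, the following is a minimal sketch of Harrell's C-index for right-censored data. It assumes the common convention that a pair is comparable only when the subject with the shorter time had an observed event, and that tied risk scores count as one half. The abstract does not say which variant or software the authors used, and the function name and toy data here are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly.

    times  -- observed follow-up times
    events -- 1 if progression/death was observed, 0 if censored
    risks  -- predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable only if subject i had an observed event
            # strictly before subject j's observed time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up toy data: four patients, the second one censored.
times = [2, 4, 6, 8]
events = [1, 0, 1, 1]
risks = [0.9, 0.3, 0.1, 0.2]

c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values of 0.726 to 0.733 should be read.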

Presenting Author

Owen Sun, California Academy of Mathematics and Science

First Author

Owen Sun, California Academy of Mathematics and Science

CoAuthor

Olga Korosteleva, California State University-Long Beach

Framework for distinguishing anomalous diffusion models with constant and random parameters using statistical testing procedures

Anomalous diffusion refers to processes where the mean squared displacement grows non-linearly with time, following the relation E(X^2(t))~t^β, with β representing the anomalous exponent. This type of behavior, observed in complex systems such as biological cells, deviates from traditional diffusion models. Classical models, such as fractional Brownian motion (FBM) and scaled Brownian motion (SBM), assume fixed exponents and therefore do not account for dynamics with varying anomalous parameters. To overcome this limitation, models such as FBM with random exponents (FBMRE) and SBM with random exponents (SBMRE) have been developed. This work presents a universal procedure based on statistical testing to distinguish between anomalous diffusion models with constant and random anomalous exponents. This is done using time-averaged statistics and their ratio-based counterparts. In addition, a novel approach to optimizing time-lag selection using a divergence measure, specifically the Hellinger distance, is proposed. The methodology is broadly applicable for distinguishing constant from random anomalous exponents, with its effectiveness depending on the choice of statistics, time lags, and process characteristics, as demonstrated through simulations (using a two-point distribution of the anomalous exponent) and analysis of real-world data.
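To make the scaling relation concrete, the sketch below computes the time-averaged mean squared displacement (TAMSD) of a single trajectory and estimates the exponent as a log-log slope. The abstract does not list the specific time-averaged statistics used, so treating the TAMSD as one of them is an assumption, and the function names are hypothetical. This is not the testing procedure proposed in the talk. The example trajectory is deterministic ballistic motion, X(t) = t, for which the exponent is exactly 2.

```python
import math


def tamsd(traj, lag):
    """Time-averaged MSD at a given lag for an evenly sampled 1-D trajectory."""
    n = len(traj)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(traj)")
    total = sum((traj[t + lag] - traj[t]) ** 2 for t in range(n - lag))
    return total / (n - lag)


def loglog_slope(lags, values):
    """Least-squares slope of log(values) against log(lags)."""
    xs = [math.log(lag) for lag in lags]
    ys = [math.log(v) for v in values]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Deterministic ballistic trajectory X(t) = t, so TAMSD(lag) = lag**2.
traj = [float(t) for t in range(100)]
lags = [1, 2, 4, 8, 16]
values = [tamsd(traj, lag) for lag in lags]
beta_hat = loglog_slope(lags, values)
print(beta_hat)
```

For ordinary Brownian motion the same estimate would fluctuate around 1, and values below or above 1 indicate subdiffusion or superdiffusion, respectively.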

Presenting Author

Katarzyna Maraj-Zygmąt, Wrocław University of Science and Technology

First Author

Katarzyna Maraj-Zygmąt, Wrocław University of Science and Technology

CoAuthor(s)

Aleksandra Grzesiek, Wrocław University of Science and Technology
Diego Krapf, Colorado State University
Agnieszka Wyłomańska, Wrocław University of Science and Technology