Invited E-Poster Session I

Shirin Golchi Chair
McGill University
 
Sunday, Aug 3: 8:30 PM - 9:25 PM
4030 
Invited Posters 
Music City Center 
Room: CC-Hall B 

Presentations

01: Wastewater Epidemiology 2.0

Wastewater-based epidemiology (WBE) has emerged as a vital public health tool for understanding community viral dynamics. Its application has evolved significantly, with health departments worldwide now utilizing WBE for disease surveillance. This study introduces a straightforward state-space hierarchical modeling strategy to analyze virus levels in wastewater, with special consideration given to small populations and low viral concentrations. A key focus is establishing a reliable link between observed viral levels in wastewater and the corresponding number of infections in the community. By addressing these challenges, our model enhances the ability to translate wastewater data into actionable public health insights, supporting timely interventions and improving pandemic preparedness. 
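As a schematic illustration (a generic state-space form introduced here for concreteness; the hierarchical specification in the poster may differ), let I_t denote latent community infections and W_t the observed wastewater viral concentration at time t:

    \log I_t = \log I_{t-1} + \eta_t,                     \eta_t \sim N(0, \sigma_\eta^2)            (state equation)
    \log W_t = \alpha + \beta \log I_t + \varepsilon_t,   \varepsilon_t \sim N(0, \sigma_\varepsilon^2)   (observation equation)

In a sketch of this kind, calibrating the observation-level parameters (\alpha, \beta) is what links measured viral levels back to an estimated number of infections, the connection the abstract emphasizes for small populations and low concentrations.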

Speaker

Jose Palacio, Rice University

02: Pixel by Pixel: A Second Chance

In the April 2023 issue of CHANCE, editors Donna LaLonde and Wendy Martinez presented a generative art challenge. One requirement drew from art historian Jason Bailey's definition of generative art as "art programmed using a computer that intentionally introduces randomness." Submissions needed to include the image, code, and a description of the creation process. The work presented in this poster continues this exploration of classical artistic techniques and modern computational tools by developing a generative art program that creates original works inspired by Vincent van Gogh's distinctive visual style. Using MATLAB, we simulated Van Gogh's characteristic color palettes and compositional patterns. To make this technology accessible to a broader audience, future plans include the development of an interactive Shiny app.

Speaker

Wendy Martinez, US Census

03: The Current Landscape of Statistics Instruction in High School Intermediate Algebra

PK-12 statistics education is evolving to address the critical need for developing students' statistical literacy in an increasingly data-driven world. To facilitate this development, national recommendations, including the Guidelines for Assessment and Instruction in Statistics Education II report, have identified the importance of increasing students' exposure to and engagement with statistics throughout their PK-12 mathematics education. Additionally, recommendations to integrate statistics using a data-analytic and simulation-based approach leave many mathematics teachers to navigate potentially new and unfamiliar content and practices when teaching statistics in mathematics courses, such as high school Intermediate Algebra, where there are several new statistics content standards. Our research study aimed to learn about the ways in which high school mathematics teachers across the United States are teaching statistics in Intermediate Algebra. In this poster, we will share the results of this research, summarizing high school teachers' experiences, choices, and constraints when teaching statistics in Intermediate Algebra courses. These results offer insight into the statistical preparation of teachers and how mathematics teacher educators can continue supporting high school mathematics teachers' development as statistics instructors.

Speaker

Jennifer Green, Michigan State University

04: Mini-batch Estimation for Cox Models via Stochastic Gradient Descent

The stochastic gradient descent (SGD) algorithm has been widely used to optimize deep Cox neural networks (Cox-NN) by updating model parameters using mini-batches of data. We show that SGD aims to optimize the average of the mini-batch partial-likelihoods, which is different from the standard partial-likelihood. This distinction requires developing new statistical properties for the global optimizer, namely, the mini-batch maximum partial-likelihood estimator (mb-MPLE). We establish that mb-MPLE for Cox-NN is consistent and achieves the optimal minimax convergence rate up to a polylogarithmic factor. For Cox regression with linear covariate effects, we further show that mb-MPLE is root-n-consistent and asymptotically normal, with asymptotic variance approaching the information lower bound as the batch size increases. Additionally, we offer practical guidance on using SGD. For Cox-NN, we demonstrate that the ratio of the learning rate to the batch size is critical in SGD dynamics, offering insight into hyperparameter tuning. For Cox regression, we characterize the iterative convergence of projected SGD, ensuring that the global optimizer, mb-MPLE, can be approximated with sufficiently many iterations. Finally, we demonstrate the effectiveness of mb-MPLE in a large-scale real-world application where the standard MPLE is intractable.
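To make the stated distinction concrete, here is a schematic in generic notation (linear predictor \eta_i = x_i^\top \beta, event indicator \delta_i, risk set R(t_i); the poster's exact formulation may differ). The standard log partial-likelihood is

    \ell(\beta) = \sum_{i:\,\delta_i = 1} \Big[ \eta_i - \log \sum_{j \in R(t_i)} e^{\eta_j} \Big],

whereas a mini-batch B contributes

    \ell_B(\beta) = \sum_{i \in B:\,\delta_i = 1} \Big[ \eta_i - \log \sum_{j \in R(t_i) \cap B} e^{\eta_j} \Big],

with risk sets restricted to the batch. SGD then targets the average of \ell_B(\beta) over random batches, which differs from \ell(\beta) because the log-sum term is nonlinear in the risk set; the mb-MPLE is the maximizer of that averaged objective.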

Speaker

Lang Zeng, University of Pittsburgh

05: A Multiple Imputation Approach in Enhancing Causal Inference for Overall Survival in Randomized Controlled Trials with Crossover

Speaker

Junjing Lin, Takeda Pharmaceuticals

06: Multi-Teacher Bayesian Knowledge Distillation

Knowledge distillation is a powerful method for model compression, enabling the efficient deployment of complex deep learning models (teachers), including large language models. However, its underlying statistical mechanisms remain unclear, and uncertainty evaluation is often overlooked, especially in real-world scenarios requiring diverse teacher expertise. To address these challenges, we introduce Multi-Teacher Bayesian Knowledge Distillation (MT-BKD), where a distilled student model learns from multiple teachers within the Bayesian framework. Our approach leverages Bayesian inference to capture inherent uncertainty in the distillation process. We introduce a teacher-informed prior, integrating external knowledge from teacher models and task-specific training data, offering better generalization, robustness, and scalability. Additionally, an entropy-based weighting mechanism adaptively adjusts each teacher's influence, allowing the student to combine multiple sources of expertise effectively. MT-BKD enhances the interpretability of the student model's learning process, improves predictive accuracy, and provides uncertainty quantification. We validate MT-BKD on both synthetic and real-world tasks, including protein subcellular location prediction and image classification. Our experiments show improved performance and robust uncertainty quantification, highlighting the strengths of our MT-BKD framework.
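As one plausible instantiation of an entropy-based weighting mechanism (an illustrative assumption, not necessarily the exact rule used in MT-BKD), teacher k's influence at input x could decrease with the entropy of its predictive distribution p_k(\cdot \mid x):

    w_k(x) = \frac{\exp\{-H(p_k(\cdot \mid x))\}}{\sum_{k'} \exp\{-H(p_{k'}(\cdot \mid x))\}},
    \qquad H(p) = -\sum_c p(c) \log p(c),

so that more confident teachers contribute more heavily to the distillation target for that input.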

Speaker

Luyang Fang, University of Georgia

07: Mixture of Directed Graphical Models for Discrete Spatial Random Fields

Current approaches for modeling discrete-valued outcomes associated with spatially-dependent areal units incur computational and theoretical challenges, especially in the Bayesian setting when full posterior inference is desired. As an alternative, we propose a novel statistical modeling framework for this data setting, namely a mixture of directed graphical models (MDGMs). The components of the mixture, directed graphical models, can be represented by directed acyclic graphs (DAGs) and are computationally quick to evaluate. The DAGs representing the mixture components are selected to correspond to an undirected graphical representation of an assumed spatial contiguity/dependence structure of the areal units, which underlies the specification of traditional modeling approaches for discrete spatial processes such as Markov random fields (MRFs). We introduce the concept of compatibility to show how an undirected graph can be used as a template for the structural dependencies between areal units to create sets of DAGs which, as a collection, preserve the structural dependencies represented in the template undirected graph. We then introduce three classes of compatible DAGs and corresponding algorithms for fitting MDGMs based on these classes. In addition, we compare MDGMs to MRFs and a popular Bayesian MRF model approximation used in high-dimensional settings in a series of simulations and an analysis of ecometrics data collected as part of the Adolescent Health and Development in Context Study. 
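Schematically, and in notation introduced here, a mixture of directed graphical models over areal outcomes y = (y_1, \dots, y_n) takes the form

    p(y) = \sum_{k=1}^{K} \pi_k \prod_{i=1}^{n} p\big( y_i \mid y_{\mathrm{pa}_k(i)} \big),

where \pi_k are mixture weights and \mathrm{pa}_k(i) denotes the parents of unit i in the k-th compatible DAG. Each component factorizes over a DAG and so avoids the intractable normalizing constant that makes discrete MRF likelihoods expensive to evaluate.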

Speaker

Brandon Carter, University of Texas at Austin

08: Toward Finding Graphical Rules for the Efficient Estimation of Time-varying Treatment Effects

Criteria for identifying optimal adjustment sets (i.e., yielding a consistent estimator with minimal asymptotic variance) for estimating average treatment effects in parametric and nonparametric models have recently been established. In a single treatment time point setting, it has been shown that the optimal adjustment set can be identified based on a causal directed acyclic graph alone. In a longitudinal treatment setting, previous work has established graphical rules to compare the asymptotic variance of estimators based on nested time-dependent adjustment sets. However, these rules do not always permit the identification of an optimal time-dependent adjustment set based on a causal graph alone. We extend previous results by exploiting conditional independencies that can be read from the graph and show this can yield estimators with lower asymptotic variance. We conjecture that our new results may even allow the identification of an optimal time-dependent adjustment set based on the causal graph and provide numerical examples supporting this conjecture. 

Speaker

Denis Talbot, Université Laval

09: On the Testing of Statistical Software

Testing statistical software presents unique challenges, especially when developers and test engineers are often the same individuals, who may lack formal training in software testing and have limited time for it. It is therefore crucial to adopt a testing approach that is efficient, effective, and easily understood by developers. As it turns out, constructing test cases can be viewed as a design of experiments (DOE) problem. This poster introduces the concept of applying DOE principles to the testing of statistical software, highlighting how this approach can streamline the testing process and improve the quality of your own software packages.
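As a minimal illustration of the DOE view of test-case construction (a hypothetical example sketched here, not the workflow of any particular product), each input condition of a statistical routine can be treated as a factor, and a designed set of level combinations becomes the test suite; a full factorial design is shown, which DOE methods would typically shrink to a fractional or covering design:

    from itertools import product

    # Hypothetical factors for testing a regression routine: each factor is an
    # input condition the software must handle correctly.
    factors = {
        "missing_values": [False, True],
        "weights": ["none", "uniform", "varying"],
        "n_predictors": [1, 5, 50],
        "response_type": ["continuous", "binary"],
    }

    # Full factorial design: every combination of factor levels is a test case.
    # Fractional factorials or covering arrays would select far fewer cases while
    # still exercising all low-order combinations of factor levels.
    test_cases = [dict(zip(factors, levels)) for levels in product(*factors.values())]

    for case in test_cases[:3]:
        print(case)
    print(f"{len(test_cases)} test cases in the full factorial design")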

Speaker

Ryan Lekivetz, JMP

10: Unlocking Efficiency in Real-world Collaborative Studies: A Multi-site International Study with Collaborative One-shot Lossless Algorithm for Generalized Linear Mixed Model

The widespread adoption of real-world data (RWD) has given rise to numerous centralized and decentralized distributed research networks (DRNs) in health care. However, multi-site analysis using the data within these networks often remains challenging because of administrative burden and privacy concerns, especially in decentralized settings. To address these challenges, we developed the Collaborative One-shot Lossless Algorithm for Generalized Linear Mixed Models (COLA-GLMM), the first-ever algorithm that achieves both lossless and one-shot properties. This novel federated learning algorithm ensures accuracy against the gold standard of pooled patient-level data and offers two additional benefits: (1) it requires only summary statistics, thereby preserving patient privacy, and (2) it delivers results after a single round of communication rather than the multiple back-and-forth communications conventionally required, thereby reducing administrative burden. Additionally, we introduce an enhanced version of COLA-GLMM that employs homomorphic encryption to reduce risks of summary statistics misuse at the level of the coordinating center. We validated our proposed algorithm through simulations and a data application in a real-world study that analyzed decentralized data from eight databases to identify COVID-19 mortality risk factors across multiple sites.  

Speaker

Jiayi Tong, Johns Hopkins University

11: Seemingly Unrelated Regression (SUR) Copula Mixed Models for Multivariate Loss Reserving

In property and casualty (P&C) insurance, estimating unpaid claims is a critical task that directly impacts an insurer's reserve levels and risk capital requirements. Insurance companies often underwrite multiple, interrelated lines of business (LOBs), and appropriately modelling dependence across these LOBs is essential for accurate loss prediction and capital allocation.

The Seemingly Unrelated Regression (SUR) copula regression framework has been proposed to model such dependence using loss triangle data from a single company. However, this model can suffer from high bias due to limited data and its inability to fully capture heterogeneity across LOBs and firms.

To address these challenges, we propose a SUR copula mixed model that incorporates data from multiple companies and explicitly models heterogeneity via random effects and flexible distributional assumptions for each LOB. Furthermore, we introduce a shrinkage component to stabilize estimation in high-dimensional settings and improve generalization across heterogeneous company data.

Using multiple pairs of loss triangles from the National Association of Insurance Commissioners (NAIC) database, we demonstrate that our model reduces the bias between predicted and actual reserves when compared to the classical SUR copula regression. We also show that it delivers improved diversification benefits, as reflected in higher estimated risk capital gains. These results are validated through both empirical analysis and a targeted simulation study. 

Speaker

Anas Abdallah, McMaster University

12: Insights from Data Monitoring Committee Meeting Closed Sessions: Case Studies of Concerns, Resulting Recommendations, and Follow-up Over Time

Data monitoring committees (DMCs) review ongoing clinical trial data to make recommendations regarding trial conduct based on risk-benefit. The objective of this poster is to provide case studies of example situations that arose from a DMC's statistical review of unblinded by-arm data that led to concerns, how the DMC made recommendations based on the situations, and the subsequent actions that were taken to ensure patient safety and trial integrity. In many instances, DMCs recommend trials continue without modification as there are no concerns or the concerns do not rise to an actionable threshold. However, if a concern rises to an actionable level, the background for the actions is motivated by by-arm data and is not available to sponsors to protect trial integrity and minimize the potential for bias. These case studies will describe example concerns that can arise based on unblinded data, how a DMC may arrive at their recommendation and action items, and how the follow-up over time addresses the underlying concern.  

Speaker

Emily Woolley, Axio, a Cytel company

13: A Bayesian Record Linkage Approach to Applications in Tree Demography Using Overlapping LiDAR Scans

Increasingly, it has become common for data containing records about overlapping individuals to be distributed across multiple sources, making it necessary to identify which records refer to the same individual. The goal of record linkage is to estimate this unknown structure in the absence of a unique identifiable attribute. We introduce a Bayesian record linkage model for spatial location data motivated by the estimation of individual growth-size curves for conifers using overlapping LiDAR scans. Annual tree growth may be estimated upon correctly identifying unique individuals across scans in the presence of noise. We formalize a two-stage modeling framework, connecting the record linkage and downstream individual tree growth models, that provides uncertainty propagation through both stages of the modeling pipeline. In this work, we discuss the two-stage formulation, outline computational strategies to achieve scalability, assess model performance on simulated data, and fit the model to bi-temporal LiDAR scans of the Upper Gunnison Watershed to assess the impact of key topographic covariates on the growth behavior of conifer species in the Southern Rocky Mountains (USA). 

Speaker

Andee Kaplan, Colorado State University

14: Ranked Sparsity in Penalized Regression: Fun with Penalty Factors

The ranked sparsity framework is critical in widespread settings where predictor variables can be expected to contribute differing levels of information to describe or characterize the outcome (i.e., "mixed signals"). We motivate ranked sparsity via the Bayesian interpretation of the lasso, challenging the presumption that all covariates are equally worthy of entering into a model. Specifically, we illustrate the utility of ranked sparsity in the following settings: 1) for evaluating covariates belonging to groups of varying sizes or qualities, 2) for evaluating covariates representing derived variables (e.g., interactions), 3) for fitting time series models with complex seasonality and/or exogenous features, 4) for facilitating hypothesis testing for time-based interventions on complex time series data, and 5) for performing incomplete principal components regression. We highlight specific examples of each application and present a large-scale predictive-model bake-off, showing how sparsity-ranked penalized regression can produce highly interpretable, transparent models with competitive prediction accuracy.
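To illustrate what differential penalty factors do in practice (a sketch using scikit-learn purely for illustration; it is not the authors' software), one can use the standard equivalence between a weighted lasso penalty and an ordinary lasso fit to rescaled columns:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 200, 10
    X = rng.normal(size=(n, p))
    beta_true = np.array([2.0, -1.5, 1.0] + [0.0] * (p - 3))
    y = X @ beta_true + rng.normal(size=n)

    # Ranked-sparsity idea: penalize "less worthy" covariates more heavily; here
    # the last five columns (imagine derived terms such as interactions) get a
    # penalty factor of 3 while the main effects keep a factor of 1.
    penalty_factor = np.ones(p)
    penalty_factor[5:] = 3.0

    # A weighted lasso, min ||y - Xb||^2 / (2n) + lambda * sum_j w_j |b_j|,
    # is equivalent to an ordinary lasso on the rescaled columns X_j / w_j,
    # with the fitted coefficients divided by w_j afterwards.
    fit = Lasso(alpha=0.1).fit(X / penalty_factor, y)
    beta_hat = fit.coef_ / penalty_factor

    print(np.round(beta_hat, 2))  # heavier-penalized columns are shrunk harder toward zero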

Speaker

Ryan Peterson, University of Colorado - Anschutz Medical Campus

15: Scalable Learning for Partially Censored Gaussian Processes

Gaussian processes (GPs), known for their flexibility, uncertainty quantification, and interpretability, are particularly useful for modeling environmental variables across space and time. However, datasets with censoring, arising from detection limits of sensors, pose computational challenges. Common approaches, such as substituting censored values with detection limits or Markov chain Monte Carlo (MCMC) methods, can introduce biases or inefficiencies. Our work develops linear-complexity solutions for multivariate normal (MVN) probability estimation and sampling from truncated MVN (TMVN) distributions. The accompanying R packages, VeccTMVN and nntmvn, have been developed and published. Future academic travel would help promote this scalable inference for partially censored datasets and the development of the accompanying computational software.

Speaker

Jian Cao, University of Houston

16: Selective Inference for Correlation Thresholding

We consider testing whether a set of Gaussian variables, selected from the data, is independent of the remaining variables. We assume that this set is selected via a very simple approach that is commonly used across scientific disciplines: we select a set of variables for which the correlation with all variables outside the set falls below some threshold. Unlike other settings in selective inference, failure to account for the selection step leads, in this setting, to excessively conservative (as opposed to anti-conservative) results. Our proposed test properly accounts for the fact that the set of variables is selected from the data and thus is not overly conservative. To develop our test, we condition on the event that the selection resulted in the set of variables in question. To achieve computational tractability, we develop a new characterization of the conditioning event in terms of the canonical correlation between the groups of random variables. In simulation studies and in the analysis of gene co-expression networks, we show that our approach has much higher power than a naive approach that ignores the effect of selection. 
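As a small sketch of the selection rule that is conditioned on (a simplified reading: a candidate set S is selected when every variable in S has absolute sample correlation below a threshold with every variable outside S; the function and data below are illustrative only):

    import numpy as np

    def selection_event(X, S, tau):
        """Check the correlation-thresholding selection event:
        |cor(X_j, X_k)| < tau for every j in S and every k outside S."""
        R = np.corrcoef(X, rowvar=False)               # p x p sample correlation matrix
        S = np.asarray(S)
        outside = np.setdiff1d(np.arange(X.shape[1]), S)
        return bool(np.all(np.abs(R[np.ix_(S, outside)]) < tau))

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 6))
    X[:, 1] += 0.9 * X[:, 0]                           # variables 0 and 1 are correlated

    print(selection_event(X, S=[0, 1], tau=0.3))       # likely True: {0, 1} vs. the rest
    print(selection_event(X, S=[0], tau=0.3))          # False: variable 1 is outside and correlated

The selective test then asks whether the selected set is independent of the remaining variables while accounting for the fact that the same data produced the selection.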

Speaker

Arkajyoti Saha, University of California, Irvine

17: Investigating the Spatial Component of Serving Strategies in Tennis

A crucial element of a tennis player's strategy is where to aim their serve. While prior research has examined the mix between serving "out wide" (Wide) and "up the T" (T), less attention has been given to the precise aiming location within these regions. We address this by modeling the serve as a two-period Markov decision process (MDP), incorporating execution error and expected rewards. Using data from the 2020-21 Australian Open, we estimate player-specific execution error distributions as bivariate Gaussians, accounting for net-induced censoring via a Bayesian model. We then integrate point win probabilities from Kovalchik et al. [2020] to determine optimal aiming locations. We present our model results for many players on the ATP and WTA circuits. 
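As a toy sketch of the underlying calculation (the covariance, court coordinates, and reward below are made-up stand-ins, not estimates from this work), the expected reward of an aiming location under bivariate Gaussian execution error can be approximated by Monte Carlo:

    import numpy as np

    rng = np.random.default_rng(42)

    def expected_reward(aim, cov, reward_fn, n_sims=100_000):
        """Monte Carlo estimate of E[reward(landing spot)] when the serve lands at
        aim plus bivariate Gaussian execution error with covariance cov."""
        landings = rng.multivariate_normal(mean=aim, cov=cov, size=n_sims)
        return reward_fn(landings).mean()

    # Hypothetical reward: 1 if the serve lands in the deuce service box
    # (x: metres from the centre service line, y: metres from the net), else 0.
    def in_box_reward(pts):
        x, y = pts[:, 0], pts[:, 1]
        return ((0 <= x) & (x <= 4.11) & (0 <= y) & (y <= 6.40)).astype(float)

    cov = np.array([[0.30, 0.05], [0.05, 0.50]])       # made-up execution-error covariance
    for aim in ([3.8, 6.2], [3.0, 5.5]):               # aggressive wide aim vs. a safer aim
        print(aim, round(expected_reward(aim, cov, in_box_reward), 3))

In the poster's framework, the 0/1 in-box reward would be replaced by point-win probabilities and embedded in the two-period MDP, but the trade-off is already visible here: aiming nearer the lines can raise the reward of serves that land in while increasing the chance of a fault.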

Speaker

Nathan Sandholtz, Brigham Young University

18: Multi-Fidelity, Parallel Bayesian Optimization with Expensive Simulators

Computer simulation plays a central role in the modern design of physical experiments and engineered systems. However, the computational expense of high-fidelity simulation limits the throughput for searching through potential design spaces for optimal and novel cases. Bayesian optimization (BO) is a common approach to making this process more efficient by leveraging the predictive power of machine learning. In this work, we advance BO to multi-fidelity BO, using evaluations from multiple physical simulations of differing fidelity, and do so with asynchronous, parallel evaluation. This allows us to autonomously use fast, lower-accuracy models to broadly search the design space and thoughtfully deploy more expensive, higher-fidelity simulations in the most promising subsets of the design space. We demonstrate our results on the optimal design of an inertial confinement fusion capsule.
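One common cost-aware heuristic in the multi-fidelity BO literature (stated here only for orientation; it is not necessarily the acquisition criterion used in this work) selects the next design point x and fidelity level m by maximizing acquisition value per unit cost,

    (x^\ast, m^\ast) = \arg\max_{x,\,m} \; \frac{\alpha_m(x)}{c_m},

where \alpha_m(x) is, for example, the expected improvement computed from a multi-fidelity surrogate at fidelity m and c_m is the computational cost of one fidelity-m run; in an asynchronous, parallel setting the maximization is simply repeated whenever a worker becomes free.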

Speaker

Michael Grosskopf, Los Alamos National Laboratory

19: Doubly Robust Pivotal Confidence Intervals for a Monotonic Continuous Treatment Effect Curve

A large majority of the literature on evaluating the significance of a treatment effect based on observational data has focused on discrete treatments. These methods are not applicable to drawing inference for a continuous treatment, which arises in many important applications. Here, we develop doubly robust confidence intervals for the continuous treatment effect curve (at a fixed point) under the assumption that it is monotonic, via a likelihood ratio-type procedure. Monotonicity is often a very natural assumption in the setting of a continuous treatment effect curve, and it removes the need to choose a smoothing parameter for the nonparametrically estimated curve (or the related, and challenging, need to estimate the curve's unknown bias). We illustrate the new methods via simulations and a study of a dataset examining the effect of nurse staffing hours on hospital performance.

Speaker

Charles Doss, University of Minnesota

20: A Fully Bayesian Joint Modeling Framework for Complex Intercurrent Event Handling

We propose a fully Bayesian joint modeling framework for analyzing longitudinal outcomes in the presence of one or more intercurrent events. Our innovative approach leverages a pattern-mixture model consisting of a marginal distribution for the intercurrent event(s) and a conditional distribution for the longitudinal outcome given the intercurrent event times. We demonstrate how, with this model, one can represent complex estimands (i.e., marginal treatment effects) intuitively as functions of model parameters from the marginal and conditional models. The framework is applied to a case study from a recent trial involving two intercurrent events: one addressed with a hypothetical strategy and the other with a composite strategy.
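Schematically, with a denoting treatment arm, t the intercurrent event time(s), and y the longitudinal outcome (notation introduced here), the pattern-mixture factorization and the resulting marginal quantity are

    f(y, t \mid a) = f(t \mid a)\, f(y \mid t, a),
    \qquad E(Y \mid a) = \int E(Y \mid t, a)\, f(t \mid a)\, dt,

so a marginal treatment contrast such as E(Y \mid a = 1) - E(Y \mid a = 0) is an explicit function of parameters from both the marginal model for the intercurrent event and the conditional model for the outcome.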

Co-Author

Matthew Psioda, GSK