Sunday, Aug 3: 9:35 PM - 10:30 PM
4031
Invited Posters
Music City Center
Presentations
Models based on recursive adaptive partitioning such as decision trees and their ensembles are popular for high-dimensional regression as they can potentially avoid the curse of dimensionality. Because empirical risk minimization (ERM) is computationally infeasible, these models are typically trained using greedy algorithms. Although effective in many cases, these algorithms have been empirically observed to get stuck at local optima. We explore this phenomenon in the context of learning sparse regression functions over d binary features, showing that when the true regression function f* does not satisfy Abbe et al. (2022)'s Merged Staircase Property (MSP), greedy training requires exp(Ω(d)) samples to achieve low estimation error. Conversely, when f* does satisfy MSP, greedy training can attain small estimation error with only O(log d) samples. This dichotomy mirrors that of two-layer neural networks trained with stochastic gradient descent (SGD) in the mean-field regime, thereby establishing a head-to-head comparison between SGD-trained neural networks and greedy recursive partitioning estimators. Furthermore, ERM-trained recursive partitioning estimators achieve low estimation error with O(log d) samples irrespective of whether f* satisfies MSP, thereby demonstrating a statistical-computational trade-off for greedy training. Our proofs are based on a novel interpretation of greedy recursive partitioning using stochastic process theory and a coupling technique that may be of independent interest.
Correlated survival data are prevalent in various clinical settings and have been extensively discussed in the literature. One of the most common types of correlated survival data is clustered survival data, where the survival times within a cluster are correlated. Our study is motivated by invasive mechanical ventilation data from different intensive care units (ICUs) in Ontario, Canada, forming multiple clusters. To account for correlation within clusters, we introduce a shared frailty log-logistic accelerated failure time model with a random intercept specific to each cluster. We present a novel, fast variational Bayes (VB) algorithm for parameter inference and evaluate its performance using simulation studies that vary the number of clusters and their sizes. We also compare the performance of our proposed VB algorithm with the h-likelihood method and a Markov Chain Monte Carlo (MCMC) algorithm. The proposed algorithm delivers satisfactory results and demonstrates computational efficiency over the MCMC algorithm. We apply our method to the ICU ventilation data from Ontario to investigate the random effect of ICU site on ventilation duration.
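As a schematic reference (notation assumed here rather than taken from the abstract), a shared-frailty log-logistic AFT model with a cluster-specific random intercept can be written as

log T_ij = x_ij'β + b_i + σ ε_ij,   b_i ~ N(0, σ_b^2),   ε_ij ~ Logistic(0, 1),

where T_ij is the ventilation duration of subject j in ICU cluster i, b_i is the ICU-level random intercept, and the logistic error makes T_ij log-logistic conditional on b_i; the VB algorithm approximates the joint posterior of (β, σ, σ_b, b_1, ..., b_m), typically with a factorized variational family.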
Identifying disease-indicative genes is critical for deciphering disease mechanisms and has attracted significant interest in biomedical research. Spatial transcriptomics offers unprecedented insights for the detection of disease-specific genes by enabling within-tissue contrasts. However, this new technology poses challenges for conventional statistical models developed for RNA-sequencing, as these models often neglect the spatial organization of tissue spots. In this article, we propose a Bayesian shrinkage model to characterize the relationship between high-dimensional gene expressions and the disease status of each tissue spot, incorporating spatial correlation among these spots through autoregressive terms. Our model adopts a hierarchical structure to facilitate the analysis of multiple correlated samples and is further extended to accommodate missing data within tissues. To ensure the model's applicability to datasets of varying sizes, we develop two computational frameworks for Bayesian parameter estimation, tailored to small and large sample scenarios. Simulation studies are conducted to evaluate the performance of the proposed model, which is then applied to data from a HER2-positive breast cancer study.
Large sample behavior of dynamic information borrowing (DIB) estimators is investigated. Asymptotic properties of several DIB approaches (adaptive risk minimization, adaptive LASSO, Bayesian procedures with empirical power prior, fully Bayesian procedures, and a Bayes-frequentist compromise) are explored against shrinking-to-zero alternatives. As shown theoretically and with simulations, local asymptotic distributions of DIB estimators are often non-normal. A simple Gaussian setting with external information borrowing illustrates that none of the considered DIB methods outperforms the others in terms of mean squared error (MSE): at different conflict values, the MSEs of DIB estimators range between the MSEs of the maximum likelihood estimators based on the current and pooled data. To uniquely determine an optimality criterion for DIB, a prior distribution on the conflict needs to be either implicitly or explicitly determined using data-independent considerations. Data-independent assumptions on the conflict are also needed for DIB-based hypothesis testing. New families of DIB estimators parameterized by a sensitivity-to-conflict parameter "s" are suggested and their use is illustrated in an infant mortality example. The choice of "s" is determined in a data-independent manner by a cost-benefit compromise associated with the use of external data.
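For orientation (a standard form, with notation assumed here), the empirical power prior variant of dynamic borrowing combines the current-data likelihood with a discounted external-data likelihood,

π(θ | D, D_0) ∝ L(θ | D) L(θ | D_0)^{a_0} π(θ),   0 ≤ a_0 ≤ 1,

where a_0 = 0 ignores the external data D_0 and a_0 = 1 fully pools the two data sources; estimating a_0 from the data is what ties the behavior of the estimator to the unknown conflict between D and D_0 discussed above.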
One of the many difficulties in modelling epidemic spread is that caused by behavioural change in the underlying population. This can be a major issue in public health since behaviour can change drastically as infection levels vary, both due to government mandates and personal decisions. Such changes in behaviour may subsequently produce major changes in disease transmission dynamics. Nor are these issues confined to public health: they also arise in agriculture, as changes in farming practice are often observed as disease prevalence changes. We propose a model formulation wherein time-varying transmission is captured by the level of alarm in the population, specified as a function of recent epidemic history. This alarm function may be parametric or a non-parametric function such as a Gaussian process. The alarm function itself can also vary dynamically, allowing for phenomena such as "lockdown fatigue", or depend upon multiple epidemic metrics (e.g. case counts, hospitalisations and/or death counts). The models are set in a data-augmented Bayesian framework as epidemic data are often only partially observed, and we can utilize prior information to help with parameter identifiability. The benefit and utility of the proposed approach are illustrated via COVID-19 data.
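A minimal sketch of the alarm-function formulation (symbols assumed here): the transmission rate at time t is discounted according to the population's alarm, itself a function of recent epidemic history, for example

β_t = β (1 − a(Ī_t)),   0 ≤ a(·) ≤ 1,

where Ī_t summarizes recent incidence (or hospitalisations and/or deaths) over a preceding window and a(·) is a parametric curve or a Gaussian process; allowing a(·) to drift over time captures phenomena such as lockdown fatigue.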
Multinomial outcomes arise in numerous fields, from sports and species counts to genomics, yet existing software often focuses on simpler fixed-effects or purely multinomial (logistic) frameworks. The MMLN package introduces a suite of functions to fit more complex multinomial regression models, including models with random effects, and to evaluate the fit of any multinomial regression model using the squared Mahalanobis distance residuals proposed by Gerber and Craig (2023). These residuals generalize the randomized quantile residuals of Dunn and Smyth (1996) beyond the binomial case. The MMLN() function fits mixed-effects multinomial logistic-normal models via MCMC sampling, while the MDRes() function computes the squared Mahalanobis distance residuals for a comprehensive evaluation of model adequacy. Users can assess these residuals visually with quantile-quantile plots or formally with Kolmogorov-Smirnov tests. The MMLN package broadens the practical toolbox for analyzing multinomial data by integrating flexible modeling with robust diagnostic tools.
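For reference (a generic form rather than the package's exact definition), the squared Mahalanobis distance residual for a vector outcome y_i with fitted mean μ̂_i and fitted covariance Σ̂_i is

d_i^2 = (y_i − μ̂_i)' Σ̂_i^{-1} (y_i − μ̂_i),

which behaves approximately like a chi-squared variable when the model is correctly specified; the MDRes() residuals build on this idea for multinomial counts, and departures from the reference distribution show up in the quantile-quantile plots and Kolmogorov-Smirnov tests mentioned above.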
Multivariate positive-valued time series are ubiquitous in many application domains including environment, finance, and insurance. We describe a Bayesian approach for dynamic modeling and forecasting of multivariate positive-valued time series with multivariate gamma distributions. We discuss a flexible level correlated model (LCM) framework which allows us to combine marginal gamma distributions for the positive-valued component responses while accounting for association among the components at a latent level. We introduce vector autoregressive evolution of the latent states, derive its precision matrix, and implement fast approximate posterior estimation using integrated nested Laplace approximation (INLA). We use the R-INLA package, building custom functions to handle our framework. We use the proposed approach to jointly model hourly concentrations of the air pollutants PM 2.5 and ozone as a function of other pollutants and weather variables. Our goal is a comparative temporal analysis of air pollution in New Delhi relative to other polluted cities such as Los Angeles. In New Delhi, we analyze data from each of 22 air pollution monitoring stations from January 2018 to August 2023. Together with an analysis of aggregate pollution in the city over time, this will help us understand useful patterns in the pollution in New Delhi and its surrounding regions. This is joint work with Nalini Ravishanker (University of Connecticut), Anirban Chakraborti (Jawaharlal Nehru University) and Sourish Das (Chennai Mathematical Institute).
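Schematically (notation assumed here), the level correlated model ties marginal gamma responses to a latent Gaussian level that evolves as a vector autoregression:

y_{tj} | η_{tj} ~ Gamma with mean exp(η_{tj}),   η_{tj} = x_{tj}'β_j + α_{tj},   α_t = Φ α_{t−1} + w_t,   w_t ~ N(0, Σ),

where j indexes the component series (e.g. PM 2.5 and ozone), the latent vector α_t carries the association among components, and the sparse precision matrix implied by the VAR evolution is what makes fast approximate inference with R-INLA feasible.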
There are no standard approaches for analyses of endpoints in randomized controlled trials with incomplete measurements. When data are missing not at random (MNAR), estimates from commonly used statistical approaches for longitudinal analyses of continuous endpoints (i.e. mixed models for repeated measures or random-coefficient mixed effects models) can be biased. Pattern mixture models (PMMs), joint models (JMs), and multiple imputation (MI) address missing data and can be implemented as ad hoc sensitivity analyses of the endpoints. The objective of the present study is to compare the performance of PMMs, JMs, and MI under different scenarios. Our simulations show that PMMs with reference-based imputation methods produce biased estimates for parameters of interest and inflate their variances. MI with reference-based imputation methods produces similarly biased estimates with smaller variances. Estimates from JMs have bias and variance similar to those from MI but require fewer assumptions about the model and the missingness mechanism. When applicable, we recommend using JMs over PMMs and MI for sensitivity analyses.
In many empirical settings, directly observing a treatment variable may be infeasible, although an error-prone surrogate measurement of it is often available. Causal inference based solely on the surrogate measurement is particularly challenging without validation data. We propose a method that obviates the need for validation data by carefully combining the surrogate measurement with a proxy of the hidden treatment to obtain nonparametric identification of several causal effects of interest, including the population average treatment effect, the effect of treatment on the treated, quantile treatment effects, and causal effects under marginal structural models. For inference, we provide general semiparametric theory for causal effects identified using our approach and derive a large class of semiparametric efficient estimators with an appealing multiple robustness property. A significant obstacle to our approach is the estimation of nuisance functions that involve the hidden treatment, which prevents the direct use of standard machine learning algorithms; we resolve this by introducing a novel semiparametric EM algorithm. We examine the finite-sample performance of our method using simulations and an application that aims to estimate the causal effect of Alzheimer's disease on hippocampal volume using data from the Alzheimer's Disease Neuroimaging Initiative.
Patient-reported outcome (PRO) data provide valuable insights into treatment effects from the patient's perspective and are informative for detecting subgroup treatment effects. This study focuses on estimating individualized treatment effects using PRO data with monotone missing responses in longitudinal studies. Given the heterogeneity of treatment effects and the challenges in analyzing PRO data, such as missing data, longitudinal structure, and non-normal distributions, a semiparametric quantile regression model is proposed. The model treats the treatment effect as an unknown functional curve of a weighted linear combination of covariates to explore covariate-specific treatment effects. For data with missing values, an inverse probability weighting (IPW)-based estimator is first introduced, and an improved IPW estimator is then developed by incorporating an auxiliary mean model and empirical likelihood to account for within-subject correlations. The effectiveness of our approach is demonstrated through numerical experiments and an application to PRO data from the breast cancer clinical trial that motivated this study.
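As a rough sketch (notation assumed here, not the authors' exact estimator), the basic IPW step for the τ-th conditional quantile weights each observed response by the inverse of its estimated probability of being observed:

Σ_{i,t} (R_{it} / π̂_{it}) ψ_τ(Y_{it} − m(X_{it}; θ)) ∂m(X_{it}; θ)/∂θ = 0,   ψ_τ(u) = τ − 1(u < 0),

where R_{it} indicates that subject i's outcome at visit t is observed, π̂_{it} comes from a model for the monotone missingness process, and m(·; θ) is the working quantile model; the improved estimator described above further incorporates an auxiliary mean model and empirical likelihood to exploit within-subject correlation.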
In oncology trials, attaining sufficient events to ensure adequate statistical power can prolong trial durations and delay access to potentially life-changing treatments. In the Bayesian paradigm, existing historical and/or external data can be leveraged to supplement current trial data, reducing this burden. However, borrowing information has traditionally required parametric models, whereas frequentist models for time-to-event data are typically non- or semi-parametric. If these parametric assumptions are violated, they can introduce bias and inflate type I error, posing regulatory concerns. We introduce Bayesian nonparametric dynamic borrowing (BNPDB), which generalizes the latent exchangeability prior (LEAP) to enable individual-specific discounting without assuming a parametric model for the current data. Unlike LEAP, BNPDB is effectively model-free, allowing for flexible borrowing while mitigating bias and type I error risks. Through extensive simulations, we demonstrate BNPDB's performance against correctly specified and misspecified parametric alternatives, and we showcase its practical applicability in a real oncology trial.
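For context (a schematic of the LEAP construction being generalized, with notation assumed here), the latent exchangeability prior models each external observation as a mixture of an exchangeable and a non-exchangeable component,

y_{0i} ~ ω f(y_{0i} | θ) + (1 − ω) f(y_{0i} | θ_0),

so that a latent indicator for each external subject determines whether that subject is borrowed; BNPDB keeps this individual-specific discounting while removing the parametric model f(· | θ) for the current data.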
Health policy evidence-building requires data sources such as healthcare claims, electronic health records, probability and nonprobability survey data, epidemiological surveillance databases, administrative data, and more, all of which have strengths and limitations for a given policy analysis. Data integration techniques leverage the relative strengths of the input sources to obtain a blended source that is richer, more informative, and with better fitness-for-use than any single input component. We note the expanding opportunities to use data integration for health policy analyses, review key methodological approaches for adding variables to a data set or increasing the precision of estimates, and provide directions for future research. We also highlight some innovative projects related to data integration. Because data quality improvement motivates data integration, key data quality frameworks are presented to structure assessments of candidate input data sources.
We propose a flexible Bayesian approach for sparse Gaussian graphical modeling of multivariate time series. We account for temporal correlation in the data by assuming that observations are characterized by an underlying, unobserved hidden discrete autoregressive process. We assume multivariate Gaussian emission distributions and capture spatial dependencies by modeling the state-specific precision matrices via graphical horseshoe priors. We characterize the mixing probabilities of the hidden process via a cumulative shrinkage prior that accommodates zero-inflated parameters for non-active components, and further incorporate a sparsity-inducing Dirichlet prior to estimate the effective number of states from the data. For posterior inference, we develop a sampling procedure that allows estimation of the number of discrete autoregressive lags and the number of states, and that avoids having to deal with the changing dimensions of the parameter space. We thoroughly investigate the performance of our proposed methodology through several simulation studies. We further illustrate the use of our approach for the estimation of dynamic brain connectivity based on fMRI data collected on a subject performing a task-based latent-learning experiment.
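In outline (notation assumed here), the model pairs a hidden discrete autoregressive state process with state-specific Gaussian graphical emissions:

y_t | s_t = k ~ N(μ_k, Ω_k^{-1}),   P(s_t = k | s_{t−1}, ..., s_{t−P}) = π_k(s_{t−1}, ..., s_{t−P}),

with graphical horseshoe priors on the precision matrices Ω_k inducing sparsity in the state-specific connectivity graphs, a cumulative shrinkage prior on the mixing weights, and a sparsity-inducing Dirichlet prior governing the effective number of states.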
The USDA's National Agricultural Statistics Service (NASS) estimates planted acres, harvested acres, production, and yield at the county, state, and national levels for 13 different crops each year in its Crops County Estimates Program (CCEP). Several data sources are available to inform these estimates. The County Agricultural Production Survey is conducted at the end of each growing season for this purpose. The USDA Farm Service Agency and the USDA Risk Management Agency collect administrative data from producers participating in USDA crop programs or crop insurance programs, respectively. Because not all farms participate in these USDA programs, the numbers of planted acres and harvested acres in each administrative dataset provide, respectively, a lower bound and an upper bound. Depending on the crop, other data (e.g., the National Crop Commodity Productivity Index) can be informative. This poster focuses on the continuing evolution of the processes used to produce official statistics for the CCEP. These include development of Bayesian hierarchical small area models, programs that round the estimates according to pre-set rules, and outlier detection.
Models for areal data are traditionally defined using the neighborhood structure of the regions on which the data are observed. The unweighted adjacency matrix of a graph is commonly used to characterize relationships between locations, resulting in the implicit assumption that all pairs of neighboring regions interact similarly, an assumption which may not hold in practice. It has been shown that more complex spatial relationships between graph nodes can be represented when edge weights are allowed to vary. Christensen and Hoff (2024) introduced a covariance model for data observed on graphs which is more flexible than traditional alternatives, parameterizing covariance as a function of an unknown edge weight matrix. However, their treatment of each edge weight as a unique parameter presents computational issues as graph sizes increase. In this work we propose a framework for estimating edge weight matrices that reduces their effective dimension via a basis function representation of the edge weights. We show that this method can enhance the performance and flexibility of covariance models parameterized by such matrices in a series of simulations and data examples.
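A minimal sketch of the proposed dimension reduction (symbols assumed here): instead of treating every edge weight as a free parameter, the edge weight matrix is expanded in a small set of fixed basis matrices,

W = Σ_{k=1}^{K} c_k B_k,   with K much smaller than the number of edges,

so that the covariance model of Christensen and Hoff (2024), which depends on W, is estimated through only the K basis coefficients c_1, ..., c_K, keeping computation manageable as the graph grows.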
While functional variable selection plays an important role in reducing the dimension of variables, methods for simultaneously selecting subsets of the functional domain and identifying functional graphical nodes remain quite limited. Functional variable selection for recovering sparsity in nonadditive, nonparametric models with high-dimensional variables has been challenging. Our main interest is to identify subsets of the domain of the functional covariates and the graph structure among locations, and to test whether the selected subsets and graph are significantly associated with the response. No existing functional variable selection method can conduct these multi-tasking inferences. In this presentation, we develop a Bayesian multi-tasking inference method under a joint functional kernel graphical model framework. Our method unifies Bayesian optimization, an Ising graphical model, and a score-type test within a Bayesian approach so that we can perform (1) selection of subsets within the domain of the functional covariates, (2) identification of relevant brain locations to correct for the confounding effect of correlation between the signals, and (3) testing of the significance of the selection and identification. The advantage of our multi-tasking inference method is demonstrated using simulations and our motivating example of fMRI in autism spectrum disorder.
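As one concrete ingredient (a generic form, not necessarily the exact prior used), an Ising model on binary inclusion indicators γ = (γ_1, ..., γ_p) for locations on the functional domain lets the selection respect the estimated graph:

p(γ) ∝ exp( Σ_j a_j γ_j + Σ_{j~k} b_{jk} γ_j γ_k ),

where j ~ k runs over neighboring locations, so that correlated signals at nearby brain locations tend to be selected or excluded together; the score-type test then assesses whether the selected subsets and graph are significantly associated with the response.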
For healthcare systems, determining optimal treatment decisions for any given patient requires consideration of both the potential benefits and the costs of an intervention. In recent work, traditional methods for identifying optimal multi-stage treatment regimes have been adapted to penalize treatments which are prohibitively expensive or harmful. These methods, however, typically assume distinct treatment rules at each decision interval. Rules of this type may be undesirable in certain clinical applications, such as the management of high blood pressure, where patients are expected to undergo consistent treatment or monitoring over an extended period. In this work, we develop methodology for estimating shared-parameter (time-invariant) treatment regimes which explicitly account for the potential costs of an intervention. Estimation of rules in this setting is made more difficult by: (1) highly skewed and zero-inflated cost distributions, (2) informatively censored cost data, and (3) confounding in treatment decisions. We propose the use of doubly robust g-methods, augmented with inverse probability weights, to identify optimal time-invariant treatment decisions for use in chronic disease settings. To illustrate usage, we identify optimally cost-effective time-invariant management strategies for patients with hypertension.
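A rough sketch of the objective (an assumed form, not the authors' exact criterion): a time-invariant rule d can be evaluated through a cost-penalized value,

V_λ(d) = E[ Σ_t { Y_t(d) − λ C_t(d) } ],

where Y_t(d) and C_t(d) are the potential outcome and cost accrued at decision interval t under consistent application of d, and λ trades off clinical benefit against cost; the doubly robust g-methods with inverse probability weights described above provide estimates of such quantities in the presence of confounding and informatively censored costs.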
Arctic sea ice plays a critical role in the global climate system. Physical models of sea ice simulate key characteristics such as thickness, concentration, and motion, offering valuable insights into its behavior and future projections. However, these models often exhibit large parametric uncertainties due to poorly constrained input parameters. Statistical calibration provides a formal framework for estimating these parameters using observational data while also quantifying the uncertainty in model projections. Calibrating sea ice models poses unique challenges, as both model output and observational data are high-dimensional multivariate spatial fields. In this talk, I present a hierarchical latent variable model that leverages principal component analysis to capture spatial dependence and radial basis functions to model discrepancies between simulations and observations. This method is demonstrated through the calibration of MPAS-Seaice, the sea ice component of the E3SM, using satellite observations of Arctic sea ice.
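In schematic form (notation assumed here), the calibration model relates an observed spatial field to reduced-dimension model output plus a structured discrepancy:

z(s) = Σ_{j=1}^{J} w_j(θ) φ_j(s) + Σ_{l=1}^{L} δ_l b_l(s) + ε(s),

where the φ_j are principal components of the simulated sea ice fields with parameter-dependent weights w_j(θ), the b_l are radial basis functions capturing simulation-observation discrepancy with coefficients δ_l, and ε is observation error; the posterior for the MPAS-Seaice input parameters θ then follows from comparing this representation with the satellite observations.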
The rise of virtual instruction in recent years has highlighted its flexibility, accessibility, and the elimination of physical space constraints. Online learning is especially beneficial for students with diverse needs and schedules that make attending in-person classes challenging. Asynchronous online courses further enhance these advantages by allowing students to engage with course content at their convenience. To foster peer interaction, many online courses incorporate discussion forums. However, these assignments often lack the structure, format, and accountability necessary for high-quality and meaningful interactions. This study employs a classroom research model to introduce, refine, and evaluate Collaborative Keys (CKs) within an undergraduate setting. Specifically, we explore how revisions to the CKs improved cooperative learning in an asynchronous online introductory statistics course. Using a modified version of the Community of Inquiry framework, a social constructivism framework, we assessed key social elements of student interactions before and after implementing substantial revisions. The revised CKs were designed to enhance student-to-student engagement by shifting from whole-class discussions to small-group collaborations, promoting individual accountability through personal responses, and fostering positive interdependence by requiring a shared final answer within each group. Our initial findings indicate that these revisions strengthened group connections and improved social interactions, suggesting that structured, small-group collaboration can enhance the effectiveness of asynchronous online learning.
The killer whale is an apex predator of the world's oceans and a cultural icon in the Pacific Northwest. In the past two decades, the Southern Resident Killer Whale (SRKW) population has declined by more than 25 percent, putting the population at risk of extinction. Much of the whales' habitat in our local waters overlaps with the shipping lanes that connect the Pacific Ocean with ports in southern British Columbia and northern Washington State. The continued decline in SRKW numbers has been linked to disturbance from commercial vessels servicing our regional ports. In recent years, citizen science networks connected through social media have been providing real-time sighting information, while substantial infrastructure investment has resulted in real-time underwater acoustic monitoring stations. This has opened the possibility of forecasting systems that fuse these real-time data streams with movement models to predict future trajectories of whales. There has been substantial recent progress in building A.I. detection/classification algorithms for underwater hydrophones and for infrared imaging from vantage-point cameras, as well as stochastic movement models. These new statistical methods have accelerated the development of computing infrastructure for real-time vessel impact mitigation within SRKW critical habitat. These innovations support data-driven conservation strategies to protect this endangered population of killer whales.
Speaker
Ruth Joy, Department of Statistics, SFU