Tuesday, Aug 5: 10:30 AM - 12:20 PM
4104
Contributed Posters
Music City Center
Room: CC-Hall B
Main Sponsor
Section on Statistical Learning and Data Science
Presentations
16S sequencing is a widely used approach for studying microbial communities, but its resolution limits its ability to provide comprehensive functional and mutational insights. We introduce a computational framework that integrates 16S sequencing data with a phylogenetic reference tree constructed through ancestral state reconstruction, utilizing the most up-to-date reference databases from long-read sequencing. By leveraging this long-read-informed phylogenetic framework, our method enhances functional predictions and mutation assessments, addressing key limitations of 16S sequencing and enabling more precise insights into microbial functional and genetic diversity. Importantly, this framework empowers researchers to extract richer information from 16S data, even in the absence of direct access to long-read sequencing technologies. Furthermore, we generalize this framework to accommodate shotgun sequencing data, broadening its applicability and utility across diverse microbiome research applications.
Keywords
16S sequencing
Long-read sequencing
Phylogenetic reference tree
Ancestral state reconstruction
Microbiome analysis
Causal discovery in finite-sample settings presents significant challenges in feasibility and precision, yet most existing work assumes asymptotic conditions or discrete support of the causal variables. This paper examines the fundamental limits of causal discovery under finite data constraints, focusing on function complexity and statistical guarantees in continuous settings. We introduce a novel framework for identifying approximate causal relationships, utilizing KL-divergence minimization to estimate causal effects. Our approach adapts inherently to the finite-sample regime, offering robustness guarantees and capturing non-linear dependencies through extensions into reproducing kernel Hilbert spaces. Additionally, we develop a testing procedure to discern the direction of causality, enhancing the practical applicability of our framework in data-limited contexts. These contributions clarify the feasibility of causal inference when data are scarce and establish theoretical bounds for the estimation of complex functional relationships.
Keywords
Causal Discovery
KL Divergence
Finite samples
This study aims to optimize the k-nearest neighbors (kNN) search by reducing the computational burden of the brute-force method while providing the same solution. Our method leverages data structures and probabilistic assumptions to enhance the scalability of the search. By focusing on the training set, we define a sample space that limits the k-nearest neighbors search to a smaller space. For each observation in the query set, a fixed-radius search is employed, with the radius stochastically linked to the desired number of neighbors. This approach allows us to find the k nearest neighbors using only a fraction of the entire training set. Through simulations and a theoretical complexity analysis, we demonstrate that our method outperforms the brute-force approach, particularly when the training and query set sample sizes are large. In addition, a benchmark on an Alzheimer's disease dataset further demonstrated this, showing a 62.5-fold improvement in CPU time. Overall, our stochastic approach significantly reduces the computational load of kNN search while maintaining accuracy, making it a viable alternative to traditional methods for large datasets.
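A minimal sketch of the fixed-radius idea described above, using a KD-tree to restrict the candidate set before an exact k-nearest selection; the fixed radius and fallback rule here are simple placeholders, not the authors' stochastic radius choice.

```python
# Sketch: fixed-radius candidate search, then exact k-NN among candidates,
# falling back to brute force when the radius returns fewer than k points.
import numpy as np
from scipy.spatial import cKDTree

def knn_fixed_radius(X_train, X_query, k, radius):
    tree = cKDTree(X_train)
    neighbors = []
    for q in X_query:
        cand = tree.query_ball_point(q, radius)    # indices within the radius
        if len(cand) < k:                          # radius too small: fall back
            d = np.linalg.norm(X_train - q, axis=1)
            cand = np.argsort(d)[:k].tolist()
        cand = np.asarray(cand)
        d = np.linalg.norm(X_train[cand] - q, axis=1)
        neighbors.append(cand[np.argsort(d)[:k]])  # exact k nearest among candidates
    return neighbors

rng = np.random.default_rng(0)
X_train, X_query = rng.normal(size=(10000, 5)), rng.normal(size=(100, 5))
nbrs = knn_fixed_radius(X_train, X_query, k=10, radius=1.0)
```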
Keywords
kNN search
Computational efficiency
fixed-radius search
Branch and bound
Machine learning
Various studies have been conducted to design classification models in situations where human error is present or where the population distribution is not precisely known. However, research explicitly addressing imbalanced data is still in its early stages. In this context, we propose a novel optimal sampling method that enhances classification performance without requiring additional data collection or sacrificing the desirable distributional properties of the classification model. Among optimal sampling methods, the Inverse Probability Weighted (IPW) estimator is utilized to sub-sample more informative instances from the dataset. In particular, under imbalanced data settings, the amount of available information is more closely tied to the number of positive instances than to the total data size. Therefore, all positive instances are retained, and the negative instances are substantially reduced using a non-uniform sampling strategy, thereby improving estimation efficiency. This study derives the asymptotic distribution of the IPW estimator combined with a kernel-based method and shows that the proposed estimator is not only unbiased but also consistent. Furthermore, through extensive simulation studies and application to a real dataset, we demonstrate that the proposed method remains effective under imbalanced data and unspecified model settings. The results confirm that the proposed estimator achieves superior efficiency compared to existing methods.
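A minimal sketch of the subsampling idea: retain every positive, subsample negatives non-uniformly, and reweight by inverse inclusion probabilities. The pilot-based sampling rule is illustrative, not the authors' optimal design, and logistic regression stands in for their kernel-based estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_subsample_fit(X, y, n_neg, seed=0):
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    pilot = LogisticRegression(max_iter=1000).fit(X, y)    # pilot fit
    score = pilot.predict_proba(X[neg])[:, 1]              # negative "informativeness"
    pi = np.minimum(1.0, n_neg * score / score.sum())      # inclusion probabilities
    keep = rng.random(len(neg)) < pi                       # Poisson sampling of negatives
    idx = np.concatenate([pos, neg[keep]])
    w = np.concatenate([np.ones(len(pos)), 1.0 / pi[keep]])  # IPW weights
    return LogisticRegression(max_iter=1000).fit(X[idx], y[idx], sample_weight=w)
```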
Keywords
Active learning
Optimal sampling
Imbalanced data
Label noise
Binary classification
Semi-supervised learning
The increasing availability of tensor-valued time series data has created new challenges for statistical modeling, particularly when both responses and covariates are high-dimensional and high-order tensors. To address the issues of over-parameterization and limited sample sizes, this paper introduces a novel CP-based low-rank structure for coefficient tensors in tensor-on-tensor regression and autoregressive models. The method uses CP decomposition to extract features from responses and covariates, enabling a supervised factor modeling framework that enhances both interpretability and estimation efficiency. The method further incorporates a sparse component to account for heterogeneous signals and potential model misspecifications. Estimation is performed using the alternating least squares (ALS) algorithm with updates cast as linear regression problems. Non-asymptotic estimation error bounds are established, and simulations and a real-world ENSO dataset confirm the method's effectiveness.
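An illustrative fragment of the feature-extraction step only, using tensorly's ALS-based CP routine: the mode-0 CP factors serve as per-sample scores. This mirrors the supervised factor-modeling idea at a high level and is not the paper's estimator, which couples the decomposition with the regression and a sparse component.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
Y = tl.tensor(rng.normal(size=(50, 10, 10)))   # mode 0 indexes samples/time
weights, factors = parafac(Y, rank=3)          # ALS-based CP decomposition
scores = factors[0]                            # 50 x 3 per-sample CP scores,
                                               # usable as supervised features
```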
Keywords
CP decomposition
ENSO area detection
High-dimensional time series data
Low-rank plus sparse modeling
Tensor autoregression
In the fields of environmental science and medicine it is increasingly common to have access to data collected on subjects over time. Given a sufficiently dense sampling, these data can often be smoothed and analyzed as functional variables. While functional variables can be used as covariates in regression models, traditional methods, such as the functional linear model, impose constraints that limit the usefulness of functional covariates as predictors. In this paper, we introduce Basis Bayesian Additive Regression Trees (bBART), an adaptation of the original BART model that allows for the inclusion of functional covariates through basis expansions. By leveraging the BART model, bBART inherits many attractive features, including requiring no assumption of additivity or smoothness of effects and enabling posterior inference with MCMC samples.
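A minimal sketch of the basis-expansion step, assuming curves observed on a common grid: each curve is projected onto a small Fourier basis by least squares, and the resulting scalar scores become ordinary covariates for a tree ensemble such as BART. The basis family and size are illustrative choices, not the paper's.

```python
import numpy as np

def basis_scores(curves, grid, n_basis=5):
    # Fourier design matrix on the observation grid
    t = (grid - grid.min()) / (grid.max() - grid.min())
    cols = [np.ones_like(t)]
    for j in range(1, (n_basis - 1) // 2 + 1):
        cols += [np.sin(2 * np.pi * j * t), np.cos(2 * np.pi * j * t)]
    B = np.column_stack(cols[:n_basis])
    # least-squares projection: one row of basis scores per curve
    scores, *_ = np.linalg.lstsq(B, curves.T, rcond=None)
    return scores.T

grid = np.linspace(0, 1, 100)
curves = np.sin(2 * np.pi * grid) + np.random.default_rng(0).normal(0, .1, (30, 100))
Z = basis_scores(curves, grid)   # 30 x n_basis matrix, ready for a BART fit
```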
Keywords
Bayesian Additive Regression Trees
BART
Functional data
Several new Transformer-based time series models have been developed in the past five years, and research has provided evidence of these models' superior performance compared to classic statistical models such as ARIMA. While Transformer-based models show impressive performance on baseline datasets, no research has examined the robustness of these models on datasets with controlled modifications. In this paper, the temporal fusion transformer (TFT) model was compared to the classical statistical ARIMA model on simulated data with the following modifications: (1) increases in dependent-variable noise, (2) addition of exogenous variables that are uncorrelated with the dependent variable, and (3) reduction in training set size. The TFT and ARIMA models were compared using mean squared error (MSE) and mean absolute error (MAE) over various horizons. Results show X, Y, Z.
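A toy version of one controlled modification (added noise) on the ARIMA side of the comparison: simulate an AR(1) series, vary the noise level, and score an ARIMA fit by MSE on a holdout horizon. Model orders, noise levels, and the horizon are placeholders.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
n, h, noise_sd = 500, 24, 1.0              # try noise_sd in {0.5, 1.0, 2.0}
y = np.zeros(n)
for t in range(1, n):                      # AR(1) with controllable noise
    y[t] = 0.7 * y[t - 1] + rng.normal(0, noise_sd)
train, test = y[:-h], y[-h:]
fit = ARIMA(train, order=(1, 0, 0)).fit()
mse = np.mean((fit.forecast(h) - test) ** 2)
```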
Keywords
Time Series
Transformer
Temporal Fusion Transformer (TFT)
ARIMA
Simulated data
Prognostic models for older adults admitted to a skilled nursing facility (SNF) following hospitalization are needed to guide clinical decisions. Using the 20% Medicare sample (2017-2019), we developed models predicting 6-month mortality and community discharge post-SNF. A hybrid approach combined machine learning (Gradient Boosting, Random Forest, Neural Networks, SuperLearner) for feature selection with Bayesian logistic regression to estimate inclusion probabilities, quantify uncertainty, and compute credible intervals for individual risk predictions. Performance was assessed by discrimination (C-statistic) and calibration. Model outputs include Bayesian odds ratios with 95% credible intervals, which represent the range within which the true odds ratio lies with 95% probability. The final model provides interpretable effect estimates and predictive uncertainty, making it a valuable tool for SNF clinicians.
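A minimal sketch of the second stage only, assuming the ML step has already reduced the data to a few selected features in X (n x p) with binary outcome y: Bayesian logistic regression in PyMC yields posterior draws, odds ratios, and 95% credible intervals. Priors and sampler settings are illustrative.

```python
import numpy as np
import pymc as pm

with pm.Model():
    beta0 = pm.Normal("beta0", 0, 2.5)
    beta = pm.Normal("beta", 0, 2.5, shape=X.shape[1])
    p = pm.math.sigmoid(beta0 + pm.math.dot(X, beta))
    pm.Bernoulli("obs", p=p, observed=y)
    idata = pm.sample(1000, tune=1000)

draws = idata.posterior["beta"].values.reshape(-1, X.shape[1])
odds_ratios = np.exp(draws)                        # posterior odds ratios
ci = np.percentile(odds_ratios, [2.5, 97.5], 0)    # 95% credible intervals
```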
Keywords
Prognostic Models
Prediction
Bayesian
Machine Learning
Skilled Nursing Facility
Permutation tests provide exact finite-sample inference under minimal assumptions, making them invaluable for analyzing experimental data. Though these tests are widely used for hypothesis testing, existing methods for inverting them to obtain confidence intervals have focused largely on location-scale models, leaving a gap in our ability to construct exact, randomization-based confidence intervals for many common outcome models. In this paper, we present a new approach to inverting permutation tests applicable to a broad class of outcome types, such as binary, count, heavy-tailed, and censored outcomes, with natural extensions to common semiparametric models, such as the logistic partially linear model. Importantly, our method is computationally efficient, requiring only a one-dimensional grid search across the real line, while related approaches are generally more computationally intensive and conservative. Through an extensive simulation study, we demonstrate the efficacy of our confidence interval construction across diverse outcome types in both parametric and semiparametric settings.
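For orientation, a minimal sketch of test inversion in the simplest case, a constant additive treatment effect: for each tau on a grid, subtract tau from treated outcomes and run a permutation test of no effect; the confidence interval collects the taus not rejected. The paper's method covers far more general outcome models than this shift model.

```python
import numpy as np

def perm_pvalue(y, z, n_perm, rng):
    obs = y[z == 1].mean() - y[z == 0].mean()
    stats = np.array([np.abs(y[p == 1].mean() - y[p == 0].mean())
                      for p in (rng.permutation(z) for _ in range(n_perm))])
    return (1 + np.sum(stats >= abs(obs))) / (n_perm + 1)

def invert_ci(y, z, grid, alpha=0.05, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    # one-dimensional grid search: keep effects the test cannot reject
    keep = [tau for tau in grid
            if perm_pvalue(y - tau * z, z, n_perm, rng) > alpha]
    return min(keep), max(keep)
```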
Keywords
Randomization testing
Exact confidence intervals
Semiparametric inference
Finite-sample inference
Our objective is to estimate the green value of housing by focusing on energy performance labels, in order to understand how housing prices evolve when energy performance improves.
Instead of fitting a hedonic model, a particular kind of linear model used in previous work, we fit random forest or XGBoost models.
Unlike linear models, which directly reveal the relative importance of the variables through their coefficients, these complex models require alternative methods to quantify the impact of the input variables. Shapley values are often used for this purpose with random forest and XGBoost models, which do not provide explicit coefficients. Their calculation guarantees that each feature is fairly represented, taking into account all possible combinations of variables.
However, with non-linear and complex models such as random forests and XGBoost, the exact calculation of Shapley values becomes computationally prohibitive.
We therefore use more efficient approximation methods such as SHAP, KernelSHAP, and FastSHAP to interpret the models' predictions, and we propose an estimate of the "green value" of a home.
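A minimal sketch with the SHAP library, assuming a pandas feature matrix X with a hypothetical "energy_label" column and price vector y: TreeSHAP attributes each prediction to features, and averaging the label column's attributions gives a crude per-dwelling readout of the energy-label effect.

```python
import shap
import xgboost as xgb

# X: pandas DataFrame of dwelling features (includes "energy_label"); y: prices
model = xgb.XGBRegressor(n_estimators=300).fit(X, y)
explainer = shap.TreeExplainer(model)          # exact TreeSHAP for tree models
shap_values = explainer.shap_values(X)         # one attribution row per dwelling
# average contribution of the (hypothetical) energy-label feature
green_effect = shap_values[:, X.columns.get_loc("energy_label")].mean()
```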
Keywords
Machine Learning
Shapley Values
Green Effect
Hedonic Regression
Random Forests
XGBoost
Mixed spatial effects models are widely used in the analysis of geospatial data. Such models typically consist of a mean function, which depends on covariates, and spatial random effects. Various approaches are available to model the mean structure in a non-linear way, mostly non-parametric methods such as generalized additive models (GAM) or machine learning methods. A common assumption is that the flexible mean function accounts for all the spatial dependence, thus implying that there is no residual spatial dependence and the errors can be taken as independent. Recent work has sought to relax this assumption on the errors, while still retaining the hypothesis that the spatially dependent errors are second-order stationary.
In this talk, we relax the assumption that the spatial random effects arise as the realization of a stationary spatial process, and we highlight how the Bayesian additive regression trees (BART) model leads to systematic errors in this situation. To address this shortcoming, we propose a new BART-based approach that accommodates both stationary and nonstationary geospatial data. Specifically, our proposal addresses the tendency to overly assign locations to the same leaf nodes as their neighboring observations.
Keywords
Random Forest
Non-stationary spatial data
The research presents a comprehensive overview of feature selection methodologies in machine learning, addressing the challenges posed by high-dimensional data and the need to mitigate the curse of dimensionality. The project is built upon improving model performance via feature selection methods. Various approaches for feature selection, including supervised and unsupervised methods, have been outlined, and new strategies have been proposed to introduce robustness and sparsity in the feature selection process. Furthermore, the study highlights the importance of evaluating these methods within a multi-class classification framework using simulated and real-world datasets. The study's contributions include introducing an SNR-based feature selection technique, exploring feature recovery guarantees, proposing robust methods for outlier handling, incorporating per-class feature selection for multi-class classification, and conducting extensive experiments to validate the proposed methods' efficacy and robustness.
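One plausible reading of an SNR-based, per-class screen (the abstract does not give the exact statistic): rank features by between-class mean separation relative to within-class spread, one class against the rest, and keep the top-scoring features.

```python
import numpy as np

def snr_scores(X, y):
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):                    # one-vs-rest, per class
        mu1, mu0 = X[y == c].mean(0), X[y != c].mean(0)
        s1, s0 = X[y == c].std(0), X[y != c].std(0)
        scores = np.maximum(scores, np.abs(mu1 - mu0) / (s1 + s0 + 1e-12))
    return scores

# keep the 20 highest-SNR features (illustrative threshold)
# top = np.argsort(snr_scores(X, y))[::-1][:20]
```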
Keywords
Feature Selection
Latent Factor Models
Robust Loss Optimization
Multiclass Classification
Signal-to-Noise Ratio (SNR)
Surface-enhanced Raman spectroscopy (SERS) holds remarkable potential for the rapid and portable detection of trace molecules. However, the analysis and comparison of SERS spectra are challenging due to the diverse range of instruments used for data acquisition. A spectra instrument transformation framework based on the penalized functional regression model (SpectraFRM) is introduced for cross-instrument mapping with subsequent machine learning classification to compare transformed spectra with standard spectra. In particular, the nonparametric forms of the functional response, predictors, and coefficients employed in SpectraFRM allow for efficient modeling of the nonlinear relationship between target spectra and standard spectra. With an additional feature extraction step, the transformed spectra outperform the original spectra by 10% in analyte identification tasks. Overall, the proposed method is shown to be flexible, robust, accurate, and interpretable across a variety of analytes and instruments, making it a potentially powerful tool for the standardization of SERS spectra from various instruments.
Keywords
Functional Regression
spectrum transformation
surface-enhanced Raman scattering
The standard regression tree method applied to observations within clusters poses both methodological and implementation challenges. Effectively leveraging these data requires methods that account for both individual-level and sample-level effects. We propose the Generalized Tree-Informed Mixed Model (GTIMM), which replaces the linear fixed effect in a generalized linear mixed model (GLMM) with the output of a regression tree. Traditional parameter estimation and prediction techniques, such as the expectation-maximization algorithm, scale poorly in high-dimensional settings, creating a computational bottleneck. To address this, we employ a quasi-likelihood framework with stochastic gradient descent for optimized parameter estimation. Additionally, we establish a theoretical bound for the mean squared prediction error. The predictive performance of our method is evaluated through simulations and compared with existing approaches. Finally, we apply our model to predict country-level GDP based on trade, foreign direct investment, unemployment, inflation, and geographic region.
Keywords
tree-based regression
clustered data
mixed effects
penalized quasi-likelihood
stochastic gradient descent
prediction
Functional graphical models have recently been employed to capture the conditional independence structure of high-dimensional functional data and functional time series. However, real-world datasets often encounter challenges due to latent processes that can confound multiple variables. To address this issue, we propose a novel estimator for functional graphical models that accounts for the presence of unobserved confounders. Our estimator is constructed using the projection of the right singular functions of the multivariate random functions, and is the functional extension of Right Singular Vector Projection estimation for random variables. Unlike previous approaches that focus on "spiked" confounders and rely on removing principal components from multivariate functional PCA, our method is designed to accommodate a broader range of confounder effects, from strong and spiked to weak. We establish the consistency of our estimator in the high-dimensional setting, and demonstrate its effectiveness through synthetic simulations and fMRI data analysis.
Keywords
Functional Data Analysis
Graphical Models
High-dimensional Statistics
Structure Learning
Hidden Confounders
Covariance Estimator
Cleaning unstructured text data, particularly in large data contexts, presents a considerable challenge given the size of the data, the complexity of language, and the diversity of data sources. In this work, we propose a methodology that combines Large Language Models (LLMs) and Research-Based Prompts (RBP) to effectively streamline and improve text data cleaning for tabular data analysis. To evaluate the methodology's efficacy, we examine the impact of three key sources of variability on overall performance relative to a human label: LLM choice (e.g., ChatGPT, Llama), prompt choice, and text data type (e.g., nurse vs. physician text).
This poster outlines each phase of our methodology, including initial data collection, the use of specialized prompts, iterative LLM refinements, and automated code generation for advanced text processing. We demonstrate the effectiveness of these approaches on two types of datasets: clinical text and Reddit posts. Lastly, we address limitations such as potential biases in LLM outputs and the need for continuous human oversight to ensure accuracy and reliability.
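A minimal sketch of one cleaning call, assuming the OpenAI Python client; the prompt wording and extracted fields are hypothetical stand-ins for the study's research-based prompts.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def clean_note(raw_text: str) -> str:
    # hypothetical cleaning prompt: extract structured fields from free text
    prompt = (
        "Extract medication name, dose, and frequency from the note below "
        "and return them as one JSON object; use null for missing fields.\n\n"
        f"Note: {raw_text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```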
Keywords
Data management
Data cleaning
Large Language Model (LLM)
unstructured text data
Transfer learning is a critical part of real-world machine learning deployments and has been extensively studied in experimental works with overparameterized neural networks. However, even in the simplest setting of linear regression, a notable gap still exists in the theoretical understanding of transfer learning. In-distribution research on high-dimensional linear regression has led to the identification of a phenomenon known as benign overfitting, in which linear interpolators overfit to noisy training labels and yet still generalize well. This behavior occurs under specific conditions on the source covariance matrix and input data dimension. Therefore, it is natural to wonder how such high-dimensional linear models behave under transfer learning. We prove the first non-asymptotic excess risk bounds for benignly-overfit linear interpolators in the transfer learning setting. From our analysis, we propose a taxonomy of beneficial and malignant covariate shifts based on the degree of overparameterization. We follow our analysis with empirical studies showing these beneficial and malignant covariate shifts for linear interpolators and simple neural networks in certain settings.
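A small experiment in the spirit of the abstract: train a minimum-norm linear interpolator (via the pseudoinverse) on overparameterized source data, then compare excess risk on source versus shifted target covariates. The isotropic shift construction is illustrative, not the paper's taxonomy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 500                                  # d >> n: interpolation regime
beta = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ beta + 0.1 * rng.normal(size=n)
beta_hat = np.linalg.pinv(X) @ y                 # min-norm interpolator

def risk(cov_scale):
    Xt = cov_scale * rng.normal(size=(2000, d))  # (possibly shifted) target data
    return np.mean((Xt @ (beta_hat - beta)) ** 2)

print(risk(1.0), risk(2.0))                      # in- vs. out-of-distribution
```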
Keywords
high-dimensional statistics
statistical learning
distribution shift
generalization
deep learning theory
transfer learning
Credit card fraud poses a significant challenge and leads to substantial financial losses. Although machine learning and deep learning models have been extensively studied in this domain, few address the issue of data imbalance, which can bias predictions. In this paper, we explore techniques to address data imbalance, including Synthetic Minority Oversampling Technique (SMOTE), simple oversampling, and Variational Autoencoders (VAE). These methods are evaluated using metrics tailored for imbalanced datasets. In real-world scenarios, there is often a trade-off between recall and precision, both of which significantly impact revenue.
Our preliminary results show that SMOTE favors recall (0.897) over precision (0.098) but generates distributionally similar synthetic data, while the VAE achieves better precision (0.903) and generalizability. Combining VAE-generated data with a baseline logistic regression significantly improves performance (ROC-AUC 0.978), offering a computationally efficient solution for large-scale fraud detection on imbalanced datasets. This study highlights the trade-offs between the different techniques and provides a practical solution for fraud detection.
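A minimal sketch of the SMOTE-plus-logistic-regression baseline, assuming pre-split X_train/X_test arrays; the numbers in the abstract come from the authors' data, not from this toy setup.

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# oversample the minority (fraud) class in the training fold only
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
proba = clf.predict_proba(X_test)[:, 1]
print(precision_score(y_test, proba > 0.5),
      recall_score(y_test, proba > 0.5),
      roc_auc_score(y_test, proba))
```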
Keywords
Fraud Detection
Synthetic Data
Machine Learning
Neural Network
Deep Learning
Linear discriminant analysis (LDA) faces significant challenges when the number of features (p) exceeds the number of observations (n). While various methods have been proposed to address this issue, most assume n and p are comparable or impose restrictive structural assumptions on the population covariance matrix. In this study, we present a unified framework for LDA based on an optimal shrinkage method designed for ultra-high dimensional data, where p grows polynomially in n. As examples within our framework, we consider two types of shrinkage estimators: a linear shrinker, leading to a regularized LDA, and a nonlinear shrinker under the generalized spiked covariance matrix model. Leveraging recent advances in random matrix theory, we establish theoretical guarantees for our approach by analyzing the asymptotic behavior of outlier eigenvalues and eigenvectors, as well as deriving a quantum unique ergodicity estimate for non-outlier eigenvectors of the spiked sample covariance matrix. These results also reveal a phase transition phenomenon in LDA, allowing us to characterize the conditions under which LDA succeeds or fails based on the magnitude of the mean difference.
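As a point of reference for the linear-shrinker case, scikit-learn's LDA with Ledoit-Wolf shrinkage regularizes the covariance estimate when p is large relative to n; the paper's optimal shrinkers, and its nonlinear shrinker under the spiked model, go beyond this built-in.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n, p = 80, 400                             # p >> n regime
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, n)
X[y == 1] += 0.5                           # mean shift between classes
# shrinkage="auto" selects the Ledoit-Wolf shrinkage intensity
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)
```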
Keywords
Linear discriminant analysis
Optimal shrinkage estimation
Spiked model
Random matrix theory
The current landscape of functional data analysis (FDA) predominantly caters to single variables from one view of a dataset, overlooking the prevalence of multiview multivariate functional data in biomedical research. While canonical correlation analysis (CCA) stands out as a popular choice for integrative analysis, its applicability is limited to cross-sectional data, failing to address longitudinal or functional data scenarios. In response to these limitations, we propose an innovative integrative sparse functional canonical correlation analysis approach for multiview data. This novel framework aims to tackle the challenges posed by multiview functional datasets, seamlessly integrating both cross-sectional and longitudinal/functional data while accounting for sparsity. The method aims to identify linear combinations of variable functions for each view such that the correlation between the sets of linear combinations is maximized. Our method will also identify interpretable variables that maximize this association over time. We will conduct simulation studies to evaluate the effectiveness of our approach. We will use our method to investigate multi-omics biomarkers in inflammatory bowel disease (IBD).
Keywords
High-dimensional data
Longitudinal data
Microbiome data
Sparsity
Variable selection
The immunological nature of the tumor microenvironment (TME) in cancer is strongly associated with treatment response and clinical outcomes. The dynamics driving these associations are believed to be spatially dependent and to occur at various scales within the tissue. Thus, detection of the immunological features of the TME requires a holistic spatial analysis integrating information across scales. In this work, we develop a computational pipeline automating the discovery and spatial localization of immunological characteristics of the tumor microenvironment in melanoma. We use a novel deep learning-based framework implementing a self-supervised spatial segmentation of multiplexed immunofluorescence tissue samples into interpretable categories representing distinct immunological features and cell types.
Keywords
Deep Learning
Self Supervised Learning
Gigapixel Imaging
Cancer - Melanoma
Image Segmentation
Multiplexed Immunofluorescence
Latent space models (LSMs) provide a powerful framework for analyzing network data by embedding nodes in a latent space. Incorporating covariate information via edge covariates is a natural extension that enhances the practical utility of such models. Prior work has shown that effective estimation under this setting can be achieved using maximum likelihood estimators (MLEs). However, the asymptotic normality of the estimators for the edge effect remains unknown, making valid statistical inference challenging. In this work, we establish theoretical guarantees for MLEs under this setting, including consistency and asymptotic normality. Through extensive numerical simulations, we demonstrate that our proposed method enables valid statistical inference for the edge effect. These findings contribute to the statistical methodology for LSMs, providing a principled framework for parameter estimation and inference in network models with edge covariates.
Keywords
latent space models
maximum likelihood inference
network with edge covariates
Robust estimation of heterogeneous treatment effects is a fundamental challenge for optimal decision-making in domains ranging from personalized medicine to educational policy. In recent years, predictive machine learning has emerged as a valuable toolbox for causal estimation, enabling more flexible effect estimation. However, accurately estimating conditional average treatment effects (CATE) remains a major challenge, particularly in the presence of many covariates. In this article, we propose pretraining strategies that leverage a phenomenon common in real-world applications: factors that are prognostic of the outcome are frequently also predictive of treatment effect heterogeneity. In medicine, for example, components of the same biological signaling pathways frequently influence both baseline risk and treatment response. Specifically, we demonstrate our approach within the R-learner framework, which estimates the CATE by solving individual prediction problems based on a residualized loss. We use this structure to incorporate "side information" and develop models that can exploit synergies between risk prediction and causal effect estimation. In settings where these synergies are present, this cross-task learning enables more accurate signal detection, yielding lower estimation error, reduced false discovery rates, and higher power for detecting heterogeneity.
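A minimal sketch of the base R-learner's residualization step with cross-fitted nuisance estimates; the pretraining/shared-structure idea from the abstract is not shown, and the learners and the pseudo-outcome stabilization are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def r_learner(X, w, y):
    # cross-fitted outcome model m(x) and propensity model e(x)
    m_hat = cross_val_predict(GradientBoostingRegressor(), X, y, cv=5)
    e_hat = cross_val_predict(GradientBoostingClassifier(), X, w, cv=5,
                              method="predict_proba")[:, 1]
    y_res, w_res = y - m_hat, w - e_hat
    # weighted regression of pseudo-outcomes yields a CATE model tau(x);
    # in practice, w_res near zero needs clipping or regularization
    return Ridge().fit(X, y_res / w_res, sample_weight=w_res ** 2)
```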
Keywords
Causal inference
Heterogeneous treatment effects
statistical learning
The widespread deployment of smart meters in the residential and tertiary sectors has made it possible to collect high-frequency electricity consumption data at the consumer level (individuals, professionals, etc.). These data are the raw material for research on electricity consumption forecasting at this level, research largely aimed at meeting the needs of industry, such as smart-home applications and programs for managing and reducing consumption.
The objective of this work is to implement short-term (day-ahead, D+1) electrical load forecasting models at the consumer level. The difficulty of the problem lies in the fact that consumption data at this scale are highly volatile: they contain a large amount of noise and depend on the consumer's lifestyle and consumption habits.
In this study, RNN-LSTM models were deployed to predict household load, taking into account various industrial constraints.
The models were tested and evaluated on a large sample of disparate load curves in the residential sector. An approach was also proposed for the prediction of the most volatile load curves.
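A minimal day-ahead sketch with Keras, assuming half-hourly readings: sliding windows of the past week predict the next day's 48 values. The architecture and sizes are illustrative, not the study's tuned models.

```python
import numpy as np
import tensorflow as tf

def make_windows(load, past=336, horizon=48):      # 7 days in, 1 day out
    X, y = [], []
    for i in range(len(load) - past - horizon):
        X.append(load[i:i + past])
        y.append(load[i + past:i + past + horizon])
    return np.array(X)[..., None], np.array(y)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(336, 1)),
    tf.keras.layers.Dense(48),                     # one output per half-hour
])
model.compile(optimizer="adam", loss="mse")
# X, y = make_windows(household_load)              # household_load: 1-D array
# model.fit(X, y, epochs=20, batch_size=32)
```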
Keywords
Machine Learning
Deep Learning
RNN-LSTM
Electricity load curves
Short-term individual prediction