Tuesday, Aug 5: 10:30 AM - 12:20 PM
4104
Contributed Posters
Music City Center
Room: CC-Hall B
Main Sponsor
Section on Statistical Learning and Data Science
Presentations
16S sequencing is a widely used approach for studying microbial communities, but its resolution limits its ability to provide comprehensive functional and mutational insights. We introduce a computational framework that integrates 16S sequencing data with a phylogenetic reference tree constructed through ancestral state reconstruction, utilizing the most up-to-date reference databases from long-read sequencing. By leveraging this long-read-informed phylogenetic framework, our method enhances functional predictions and mutation assessments, addressing key limitations of 16S sequencing and enabling more precise insights into microbial functional and genetic diversity. Importantly, this framework empowers researchers to extract richer information from 16S data, even in the absence of direct access to long-read sequencing technologies. Furthermore, we generalize this framework to accommodate shotgun sequencing data, broadening its applicability and utility across diverse microbiome research applications.
Keywords
16S sequencing
Long-read sequencing
Phylogenetic reference tree
Ancestral state reconstruction
Microbiome analysis
Causal discovery in finite-sample settings presents significant challenges in feasibility and precision, yet most existing work assumes asymptotic conditions or discrete support of the causal variables. This paper examines the fundamental limits of causal discovery under finite data constraints, focusing on function complexity and statistical guarantees in continuous settings. We introduce a novel framework for identifying approximate causal relationships, utilizing KL-divergence minimization to estimate causal effects. Our approach adapts inherently to the finite-sample regime, offering robustness guarantees and capturing non-linear dependencies through extensions into reproducing kernel Hilbert spaces. Additionally, we develop a testing procedure to discern the direction of causality, enhancing the practical applicability of our framework in data-limited contexts. These contributions clarify the feasibility of causal inference when data are scarce and establish theoretical bounds for the estimation of complex functional relationships.
Keywords
Causal Discovery
KL Divergence
Finite samples
This study aims to optimize the k-nearest neighbors (kNN) search by reducing the computational burden of the brute-force method while providing the same solution. Our method leverages data structures and probabilistic assumptions to enhance the scalability of the search. By focusing on the training set, we define a sample space that limits the k-nearest neighbors search to a smaller space. For each observation in the query set, a fixed-radius search is employed, with the radius stochastically linked to the desired number of neighbors. This approach allows us to find the k nearest neighbors using only a fraction of the entire training set. Through simulations and a theoretical complexity analysis, we demonstrate that our method outperforms the brute-force approach, particularly when the training and query set sample sizes are large. In addition, a benchmark on an Alzheimer's disease dataset further demonstrated this, showing a 62.5-fold improvement in CPU time. Overall, our stochastic approach significantly reduces the computational load of kNN search while maintaining accuracy, making it a viable alternative to traditional methods for large datasets.
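A minimal sketch of the fixed-radius idea described above, using a KD-tree to restrict the candidate set before an exact k-nearest selection; the fixed radius and fallback rule here are simple placeholders, not the authors' stochastic radius choice.

```python
# Sketch: fixed-radius candidate search, then exact k-NN among candidates,
# falling back to brute force when the radius returns fewer than k points.
import numpy as np
from scipy.spatial import cKDTree

def knn_fixed_radius(X_train, X_query, k, radius):
    tree = cKDTree(X_train)
    neighbors = []
    for q in X_query:
        cand = tree.query_ball_point(q, radius)    # indices within the radius
        if len(cand) < k:                          # radius too small: fall back
            d = np.linalg.norm(X_train - q, axis=1)
            cand = np.argsort(d)[:k].tolist()
        cand = np.asarray(cand)
        d = np.linalg.norm(X_train[cand] - q, axis=1)
        neighbors.append(cand[np.argsort(d)[:k]])  # exact k nearest among candidates
    return neighbors

rng = np.random.default_rng(0)
X_train, X_query = rng.normal(size=(10000, 5)), rng.normal(size=(100, 5))
nbrs = knn_fixed_radius(X_train, X_query, k=10, radius=1.0)
```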
Keywords
kNN search
Computational efficiency
fixed-radius search
Branch and bound
Machine learning
Various studies have been conducted to design classification models in situations where human error is present or where the population distribution is not precisely known. However, research explicitly addressing imbalanced data is still in its early stages. In this context, we propose a novel optimal sampling method that enhances classification performance without requiring additional data collection or sacrificing the desirable distributional properties of the classification model. Among optimal sampling methods, the Inverse Probability Weighted (IPW) estimator is utilized to sub-sample more informative instances from the dataset. In particular, under imbalanced data settings, the amount of available information is more closely tied to the number of positive instances than to the total data size. Therefore, all positive instances are retained, and the negative instances are substantially reduced using a non-uniform sampling strategy, thereby improving estimation efficiency. This study derives the asymptotic distribution of the IPW estimator combined with a kernel-based method and shows that the proposed estimator is not only unbiased but also consistent. Furthermore, through extensive simulation studies and application to a real dataset, we demonstrate that the proposed method remains effective under imbalanced data and unspecified model settings. The results confirm that the proposed estimator achieves superior efficiency compared to existing methods.
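A minimal sketch of the subsampling idea: retain every positive, subsample negatives non-uniformly, and reweight by inverse inclusion probabilities. The pilot-based sampling rule is illustrative, not the authors' optimal design, and logistic regression stands in for their kernel-based estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_subsample_fit(X, y, n_neg, seed=0):
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    pilot = LogisticRegression(max_iter=1000).fit(X, y)    # pilot fit
    score = pilot.predict_proba(X[neg])[:, 1]              # negative "informativeness"
    pi = np.minimum(1.0, n_neg * score / score.sum())      # inclusion probabilities
    keep = rng.random(len(neg)) < pi                       # Poisson sampling of negatives
    idx = np.concatenate([pos, neg[keep]])
    w = np.concatenate([np.ones(len(pos)), 1.0 / pi[keep]])  # IPW weights
    return LogisticRegression(max_iter=1000).fit(X[idx], y[idx], sample_weight=w)
```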
Keywords
Active learning
Optimal sampling
Imbalanced data
Label noise
Binary classification
Semi-supervised learning
The increasing availability of tensor-valued time series data has created new challenges for statistical modeling, particularly when both responses and covariates are high-dimensional and high-order tensors. To address the issues of over-parameterization and limited sample sizes, this paper introduces a novel CP-based low-rank structure for coefficient tensors in tensor-on-tensor regression and autoregressive models. The method uses CP decomposition to extract features from responses and covariates, enabling a supervised factor modeling framework that enhances both interpretability and estimation efficiency. The method further incorporates a sparse component to account for heterogeneous signals and potential model misspecifications. Estimation is performed using the alternating least squares (ALS) algorithm with updates cast as linear regression problems. Non-asymptotic estimation error bounds are established, and simulations and a real-world ENSO dataset confirm the method's effectiveness.
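An illustrative fragment of the feature-extraction step only, using tensorly's ALS-based CP routine: the mode-0 CP factors serve as per-sample scores. This mirrors the supervised factor-modeling idea at a high level and is not the paper's estimator, which couples the decomposition with the regression and a sparse component.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
Y = tl.tensor(rng.normal(size=(50, 10, 10)))   # mode 0 indexes samples/time
weights, factors = parafac(Y, rank=3)          # ALS-based CP decomposition
scores = factors[0]                            # 50 x 3 per-sample CP scores,
                                               # usable as supervised features
```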
Keywords
CP decomposition
ENSO area detection
High-dimensional time series data
Low-rank plus sparse modeling
Tensor autoregression
In the fields of environmental science and medicine it is increasingly common to have access to data collected on subjects over time. Given a sufficiently dense sampling, these data can often be smoothed and analyzed as functional variables. While functional variables can be used as covariates in regression models, traditional methods, such as the functional linear model, impose constraints that limit the usefulness of functional covariates as predictors. In this paper, we introduce Basis Bayesian Additive Regression Trees (bBART), an adaptation of the original BART model that allows for the inclusion of functional covariates through basis expansions. By leveraging the BART model, bBART inherits many attractive features, including requiring no assumption of additivity or smoothness of effects and enabling posterior inference with MCMC samples.
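A minimal sketch of the basis-expansion step, assuming curves observed on a common grid: each curve is projected onto a small Fourier basis by least squares, and the resulting scalar scores become ordinary covariates for a tree ensemble such as BART. The basis family and size are illustrative choices, not the paper's.

```python
import numpy as np

def basis_scores(curves, grid, n_basis=5):
    # Fourier design matrix on the observation grid
    t = (grid - grid.min()) / (grid.max() - grid.min())
    cols = [np.ones_like(t)]
    for j in range(1, (n_basis - 1) // 2 + 1):
        cols += [np.sin(2 * np.pi * j * t), np.cos(2 * np.pi * j * t)]
    B = np.column_stack(cols[:n_basis])
    # least-squares projection: one row of basis scores per curve
    scores, *_ = np.linalg.lstsq(B, curves.T, rcond=None)
    return scores.T

grid = np.linspace(0, 1, 100)
curves = np.sin(2 * np.pi * grid) + np.random.default_rng(0).normal(0, .1, (30, 100))
Z = basis_scores(curves, grid)   # 30 x n_basis matrix, ready for a BART fit
```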
Keywords
Bayesian Additive Regression Trees
BART
Functional data
Several new Transformer-based time series models have been developed in the past five years, and research has provided evidence of these models' superior performance compared to classic statistical models such as ARIMA. While Transformer-based models show impressive performance on baseline datasets, no research has examined the robustness of these models on datasets with controlled modifications. In this paper, the temporal fusion transformer (TFT) model was compared to the classical statistical ARIMA model on simulated data with the following modifications: (1) increases in dependent-variable noise, (2) addition of exogenous variables that are uncorrelated with the dependent variable, and (3) reduction in training set size. The TFT and ARIMA models were compared using mean squared error (MSE) and mean absolute error (MAE) over various horizons. Results show X, Y, Z.
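A toy version of one controlled modification (added noise) on the ARIMA side of the comparison: simulate an AR(1) series, vary the noise level, and score an ARIMA fit by MSE on a holdout horizon. Model orders, noise levels, and the horizon are placeholders.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
n, h, noise_sd = 500, 24, 1.0              # try noise_sd in {0.5, 1.0, 2.0}
y = np.zeros(n)
for t in range(1, n):                      # AR(1) with controllable noise
    y[t] = 0.7 * y[t - 1] + rng.normal(0, noise_sd)
train, test = y[:-h], y[-h:]
fit = ARIMA(train, order=(1, 0, 0)).fit()
mse = np.mean((fit.forecast(h) - test) ** 2)
```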
Keywords
Time Series
Transformer
Temporal Fusion Transformer (TFT)
ARIMA
Simulated data
Prognostic models for older adults admitted to a skilled nursing facility (SNF) following hospitalization are needed to guide clinical decisions. Using the 20% Medicare sample (2017-2019), we developed models predicting 6-month mortality and community discharge post-SNF. A hybrid approach combined machine learning (Gradient Boosting, Random Forest, Neural Networks, SuperLearner) for feature selection with Bayesian logistic regression to estimate inclusion probabilities, quantify uncertainty, and compute credible intervals for individual risk predictions. Performance was assessed by discrimination (C-statistic) and calibration. Model outputs include Bayesian odds ratios with 95% credible intervals, which represent the range within which the true odds ratio lies with 95% probability. The final model provides interpretable effect estimates and predictive uncertainty, making it a valuable tool for SNF clinicians.
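A minimal sketch of the second stage only, assuming the ML step has already reduced the data to a few selected features in X (n x p) with binary outcome y: Bayesian logistic regression in PyMC yields posterior draws, odds ratios, and 95% credible intervals. Priors and sampler settings are illustrative.

```python
import numpy as np
import pymc as pm

with pm.Model():
    beta0 = pm.Normal("beta0", 0, 2.5)
    beta = pm.Normal("beta", 0, 2.5, shape=X.shape[1])
    p = pm.math.sigmoid(beta0 + pm.math.dot(X, beta))
    pm.Bernoulli("obs", p=p, observed=y)
    idata = pm.sample(1000, tune=1000)

draws = idata.posterior["beta"].values.reshape(-1, X.shape[1])
odds_ratios = np.exp(draws)                        # posterior odds ratios
ci = np.percentile(odds_ratios, [2.5, 97.5], 0)    # 95% credible intervals
```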
Keywords
Prognostic Models
Prediction
Bayesian
Machine Learning
Skilled Nursing Facility
Permutation tests provide exact finite-sample inference under minimal assumptions, making them invaluable for analyzing experimental data. Though these tests are widely used for hypothesis testing, existing methods for inverting them to obtain confidence intervals have focused largely on location-scale models, leaving a gap in our ability to construct exact, randomization-based confidence intervals for many common outcome models. In this paper, we present a new approach to inverting permutation tests applicable to a broad class of outcome types, such as binary, count, heavy-tailed, and censored outcomes, with natural extensions to common semiparametric models, such as the logistic partially linear model. Importantly, our method is computationally efficient, requiring only a one-dimensional grid search across the real line, while related approaches are generally more computationally intensive and conservative. Through an extensive simulation study, we demonstrate the efficacy of our confidence interval construction across diverse outcome types in both parametric and semiparametric settings.
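For orientation, a minimal sketch of test inversion in the simplest case, a constant additive treatment effect: for each tau on a grid, subtract tau from treated outcomes and run a permutation test of no effect; the confidence interval collects the taus not rejected. The paper's method covers far more general outcome models than this shift model.

```python
import numpy as np

def perm_pvalue(y, z, n_perm, rng):
    obs = y[z == 1].mean() - y[z == 0].mean()
    stats = np.array([np.abs(y[p == 1].mean() - y[p == 0].mean())
                      for p in (rng.permutation(z) for _ in range(n_perm))])
    return (1 + np.sum(stats >= abs(obs))) / (n_perm + 1)

def invert_ci(y, z, grid, alpha=0.05, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    # one-dimensional grid search: keep effects the test cannot reject
    keep = [tau for tau in grid
            if perm_pvalue(y - tau * z, z, n_perm, rng) > alpha]
    return min(keep), max(keep)
```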
Keywords
Randomization testing
Exact confidence intervals
Semiparametric inference
Finite-sample inference
Our objective is to estimate the green value of housing by focusing on energy performance labels, in order to understand how housing prices evolve when energy performance improves.
Instead of fitting a hedonic model, a particular kind of linear model used in previous work, we fit random forest or XGBoost models.
Unlike linear models, which directly reveal the relative importance of the variables through their coefficients, these complex models require alternative methods to quantify the impact of the input variables. Shapley values are often used for this purpose with random forest and XGBoost models, which do not provide explicit coefficients. Their calculation guarantees that each feature is fairly represented, taking into account all possible combinations of variables.
However, with non-linear and complex models such as random forests and XGBoost, the exact calculation of Shapley values becomes computationally prohibitive.
We therefore use more efficient approximation methods such as SHAP, KernelSHAP, and FastSHAP to interpret the models' predictions, and we propose an estimate of the "green value" of a home.
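A minimal sketch with the SHAP library, assuming a pandas feature matrix X with a hypothetical "energy_label" column and price vector y: TreeSHAP attributes each prediction to features, and averaging the label column's attributions gives a crude per-dwelling readout of the energy-label effect.

```python
import shap
import xgboost as xgb

# X: pandas DataFrame of dwelling features (includes "energy_label"); y: prices
model = xgb.XGBRegressor(n_estimators=300).fit(X, y)
explainer = shap.TreeExplainer(model)          # exact TreeSHAP for tree models
shap_values = explainer.shap_values(X)         # one attribution row per dwelling
# average contribution of the (hypothetical) energy-label feature
green_effect = shap_values[:, X.columns.get_loc("energy_label")].mean()
```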
Keywords
Machine Learning
Shapley Values
Green Effect
Hedonic Regression
Random Forests
XGBoost
Mixed spatial effects models are widely used in the analysis of geospatial data. Such models typically consist of a mean function, which depends on covariates, and spatial random effects. Various approaches are available to model the mean structure in a non-linear way, mostly non-parametric methods such as generalized additive models (GAM) or machine learning methods. A common assumption is that the flexible mean function accounts for all the spatial dependence, thus implying that there is no residual spatial dependence and the errors can be taken as independent. Recent work has sought to relax this assumption on the errors, while still retaining the hypothesis that the spatially dependent errors are second-order stationary.
In this talk, we relax the assumption that the spatial random effects arise as the realization of a stationary spatial process, and we highlight how the Bayesian additive regression trees (BART) model leads to systematic errors in this situation. To address this shortcoming, we propose a new BART-based approach that accommodates both stationary and nonstationary geospatial data. Specifically, our proposal addresses the tendency to overly assign locations to the same leaf nodes as their neighboring observations.
Keywords
Random Forest
Non-stationary spatial data
The research presents a comprehensive overview of feature selection methodologies in machine learning, addressing the challenges posed by high-dimensional data and the need to mitigate the curse of dimensionality. The project is built upon improving model performance via feature selection methods. Various approaches for feature selection, including supervised and unsupervised methods, have been outlined, and new strategies have been proposed to introduce robustness and sparsity in the feature selection process. Furthermore, the study highlights the importance of evaluating these methods within a multi-class classification framework using simulated and real-world datasets. The study's contributions include introducing an SNR-based feature selection technique, exploring feature recovery guarantees, proposing robust methods for outlier handling, incorporating per-class feature selection for multi-class classification, and conducting extensive experiments to validate the proposed methods' efficacy and robustness.
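One plausible reading of an SNR-based, per-class screen (the abstract does not give the exact statistic): rank features by between-class mean separation relative to within-class spread, one class against the rest, and keep the top-scoring features.

```python
import numpy as np

def snr_scores(X, y):
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):                    # one-vs-rest, per class
        mu1, mu0 = X[y == c].mean(0), X[y != c].mean(0)
        s1, s0 = X[y == c].std(0), X[y != c].std(0)
        scores = np.maximum(scores, np.abs(mu1 - mu0) / (s1 + s0 + 1e-12))
    return scores

# keep the 20 highest-SNR features (illustrative threshold)
# top = np.argsort(snr_scores(X, y))[::-1][:20]
```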
Keywords
Feature Selection
Latent Factor Models
Robust Loss Optimization
Multiclass Classification
Signal-to-Noise Ratio (SNR)
Surface-enhanced Raman spectroscopy (SERS) holds remarkable potential for the rapid and portable detection of trace molecules. However, the analysis and comparison of SERS spectra are challenging due to the diverse range of instruments used for data acquisition. A spectra instrument transformation framework based on the penalized functional regression model (SpectraFRM) is introduced for cross-instrument mapping with subsequent machine learning classification to compare transformed spectra with standard spectra. In particular, the nonparametric forms of the functional response, predictors, and coefficients employed in SpectraFRM allow for efficient modeling of the nonlinear relationship between target spectra and standard spectra. With an additional feature extraction step, the transformed spectra outperform the original spectra by 10% in analyte identification tasks. Overall, the proposed method is shown to be flexible, robust, accurate, and interpretable across a variety of analytes and instruments, making it a potentially powerful tool for the standardization of SERS spectra from various instruments.
Keywords
Functional Regression
spectrum transformation
surface-enhanced Raman scattering
The standard regression tree method applied to observations within clusters poses both methodological and implementation challenges. Effectively leveraging these data requires methods that account for both individual-level and sample-level effects. We propose the Generalized Tree-Informed Mixed Model (GTIMM), which replaces the linear fixed effect in a generalized linear mixed model (GLMM) with the output of a regression tree. Traditional parameter estimation and prediction techniques, such as the expectation-maximization algorithm, scale poorly in high-dimensional settings, creating a computational bottleneck. To address this, we employ a quasi-likelihood framework with stochastic gradient descent for optimized parameter estimation. Additionally, we establish a theoretical bound for the mean squared prediction error. The predictive performance of our method is evaluated through simulations and compared with existing approaches. Finally, we apply our model to predict country-level GDP based on trade, foreign direct investment, unemployment, inflation, and geographic region.
Keywords
tree-based regression
clustered data
mixed effects
penalized quasi-likelihood
stochastic gradient descent
prediction
Functional graphical models have recently been employed to capture the conditional independence structure of high-dimensional functional data and functional time series. However, real-world datasets often encounter challenges due to latent processes that can confound multiple variables. To address this issue, we propose a novel estimator for functional graphical models that accounts for the presence of unobserved confounders. Our estimator is constructed using the projection of the right singular functions of the multivariate random functions, and is the functional extension of Right Singular Vector Projection estimation for random variables. Unlike previous approaches that focus on "spiked" confounders and rely on removing principal components from multivariate functional PCA, our method is designed to accommodate a broader range of confounder effects, from strong and spiked to weak. We establish the consistency of our estimator in the high-dimensional setting, and demonstrate its effectiveness through synthetic simulations and fMRI data analysis.
Keywords
Functional Data Analysis
Graphical Models
High-dimensional Statistics
Structure Learning
Hidden Confounders
Covariance Estimator
Cleaning unstructured text data, particularly in large data contexts, presents a considerable challenge given the size of the data, the complexity of language, and the diversity of data sources. In this work, we propose a methodology that combines Large Language Models (LLMs) and Research-Based Prompts (RBP) to effectively streamline and improve text data cleaning for tabular data analysis. To evaluate the methodology's efficacy, we examine the impact of three key sources of variability on overall performance relative to a human label: LLM choice (e.g., ChatGPT, Llama), prompt choice, and text data type (e.g., nurse vs. physician text).
This poster outlines each phase of our methodology, including initial data collection, the use of specialized prompts, iterative LLM refinements, and automated code generation for advanced text processing. We demonstrate the effectiveness of these approaches on two types of datasets: clinical text and Reddit posts. Lastly, we address limitations such as potential biases in LLM outputs and the need for continuous human oversight to ensure accuracy and reliability.
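A minimal sketch of one cleaning call, assuming the OpenAI Python client; the prompt wording and extracted fields are hypothetical stand-ins for the study's research-based prompts.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def clean_note(raw_text: str) -> str:
    # hypothetical cleaning prompt: extract structured fields from free text
    prompt = (
        "Extract medication name, dose, and frequency from the note below "
        "and return them as one JSON object; use null for missing fields.\n\n"
        f"Note: {raw_text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```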
Keywords
Data management
Data cleaning
Large Language Model (LLM)
unstructured text data
Transfer learning is a critical part of real-world machine learning deployments and has been extensively studied in experimental works with overparameterized neural networks. However, even in the simplest setting of linear regression, a notable gap still exists in the theoretical understanding of transfer learning. In-distribution research on high-dimensional linear regression has led to the identification of a phenomenon known as benign overfitting, in which linear interpolators overfit to noisy training labels and yet still generalize well. This behavior occurs under specific conditions on the source covariance matrix and input data dimension. Therefore, it is natural to wonder how such high-dimensional linear models behave under transfer learning. We prove the first non-asymptotic excess risk bounds for benignly-overfit linear interpolators in the transfer learning setting. From our analysis, we propose a taxonomy of beneficial and malignant covariate shifts based on the degree of overparameterization. We follow our analysis with empirical studies showing these beneficial and malignant covariate shifts for linear interpolators and simple neural networks in certain settings.
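A small experiment in the spirit of the abstract: train a minimum-norm linear interpolator (via the pseudoinverse) on overparameterized source data, then compare excess risk on source versus shifted target covariates. The isotropic shift construction is illustrative, not the paper's taxonomy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 500                                  # d >> n: interpolation regime
beta = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ beta + 0.1 * rng.normal(size=n)
beta_hat = np.linalg.pinv(X) @ y                 # min-norm interpolator

def risk(cov_scale):
    Xt = cov_scale * rng.normal(size=(2000, d))  # (possibly shifted) target data
    return np.mean((Xt @ (beta_hat - beta)) ** 2)

print(risk(1.0), risk(2.0))                      # in- vs. out-of-distribution
```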
Keywords
high-dimensional statistics
statistical learning
distribution shift
generalization
deep learning theory
transfer learning
Credit card fraud poses a significant challenge and leads to substantial financial losses. Although machine learning and deep learning models have been extensively studied in this domain, few address the issue of data imbalance, which can bias predictions. In this paper, we explore techniques to address data imbalance, including Synthetic Minority Oversampling Technique (SMOTE), simple oversampling, and Variational Autoencoders (VAE). These methods are evaluated using metrics tailored for imbalanced datasets. In real-world scenarios, there is often a trade-off between recall and precision, both of which significantly impact revenue.
Our preliminary results show that SMOTE favors recall (0.897) over precision (0.098) but generates distributionally similar synthetic data, while the VAE achieves better precision (0.903) and generalizability. Combining VAE-generated data with a baseline logistic regression significantly improves performance (ROC-AUC 0.978), offering a computationally efficient solution for large-scale fraud detection on imbalanced datasets. This study highlights the trade-offs between the different techniques and provides a practical solution for fraud detection.
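A minimal sketch of the SMOTE-plus-logistic-regression baseline, assuming pre-split X_train/X_test arrays; the numbers in the abstract come from the authors' data, not from this toy setup.

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# oversample the minority (fraud) class in the training fold only
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
proba = clf.predict_proba(X_test)[:, 1]
print(precision_score(y_test, proba > 0.5),
      recall_score(y_test, proba > 0.5),
      roc_auc_score(y_test, proba))
```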
Keywords
Fraud Detection
Synthetic Data
Machine Learning
Neural Network
Deep Learning
Linear discriminant analysis (LDA) faces significant challenges when the number of features (p) exceeds the number of observations (n). While various methods have been proposed to address this issue, most assume n and p are comparable or impose restrictive structural assumptions on the population covariance matrix. In this study, we present a unified framework for LDA based on an optimal shrinkage method designed for ultra-high dimensional data, where p grows polynomially in n. As examples within our framework, we consider two types of shrinkage estimators: a linear shrinker, leading to a regularized LDA, and a nonlinear shrinker under the generalized spiked covariance matrix model. Leveraging recent advances in random matrix theory, we establish theoretical guarantees for our approach by analyzing the asymptotic behavior of outlier eigenvalues and eigenvectors, as well as deriving a quantum unique ergodicity estimate for non-outlier eigenvectors of the spiked sample covariance matrix. These results also reveal a phase transition phenomenon in LDA, allowing us to characterize the conditions under which LDA succeeds or fails based on the magnitude of the mean difference.
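As a point of reference for the linear-shrinker case, scikit-learn's LDA with Ledoit-Wolf shrinkage regularizes the covariance estimate when p is large relative to n; the paper's optimal shrinkers, and its nonlinear shrinker under the spiked model, go beyond this built-in.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n, p = 80, 400                             # p >> n regime
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, n)
X[y == 1] += 0.5                           # mean shift between classes
# shrinkage="auto" selects the Ledoit-Wolf shrinkage intensity
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)
```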
Keywords
Linear discriminant analysis
Optimal shrinkage estimation
Spiked model
Random matrix theory
The current landscape of functional data analysis (FDA) predominantly caters to single variables from one view of a dataset, overlooking the prevalence of multiview multivariate functional data in biomedical research. While canonical correlation analysis (CCA) stands out as a popular choice for integrative analysis, its applicability is limited to cross-sectional data, failing to address longitudinal or functional data scenarios. In response to these limitations, we propose an innovative integrative sparse functional canonical correlation analysis approach for multiview data. This novel framework aims to tackle the challenges posed by multiview functional datasets, seamlessly integrating both cross-sectional and longitudinal/functional data while accounting for sparsity. The method aims to identify linear combinations of variable functions for each view such that the correlation between the sets of linear combinations is maximized. Our method will also identify interpretable variables that maximize this association over time. We will conduct simulation studies to evaluate the effectiveness of our approach. We will use our method to investigate multi-omics biomarkers in inflammatory bowel disease (IBD).
Keywords
High-dimensional data
Longitudinal data
Microbiome data
Sparsity
Variable selection
The immunological nature of the tumor microenvironment (TME) in cancer is strongly associated with treatment response and clinical outcomes. The dynamics driving these associations are believed to be spatially dependent and to occur at various scales within the tissue. Thus, detection of the immunological features of the TME requires a holistic spatial analysis integrating information across scales. In this work, we develop a computational pipeline automating the discovery and spatial localization of immunological characteristics of the tumor microenvironment in melanoma. We use a novel deep learning-based framework implementing a self-supervised spatial segmentation of multiplexed immunofluorescence tissue samples into interpretable categories representing distinct immunological features and cell types.
Keywords
Deep Learning
Self Supervised Learning
Gigapixel Imaging
Cancer - Melanoma
Image Segmentation
Multiplexed Immunofluorescence
Latent space models (LSMs) provide a powerful framework for analyzing network data by embedding nodes in a latent space. Incorporating covariate information via edge covariates is a natural extension that enhances the practical utility of such models. Prior work has shown that effective estimation under this setting can be achieved using maximum likelihood estimators (MLEs). However, the asymptotic normality of the estimators for the edge effect remains unknown, making valid statistical inference challenging. In this work, we establish theoretical guarantees for MLEs under this setting, including consistency and asymptotic normality. Through extensive numerical simulations, we demonstrate that our proposed method enables valid statistical inference for the edge effect. These findings contribute to the statistical methodology for LSMs, providing a principled framework for parameter estimation and inference in network models with edge covariates.
Keywords
latent space models
maximum likelihood inference
network with edge covariates
Robust estimation of heterogeneous treatment effects is a fundamental challenge for optimal decision-making in domains ranging from personalized medicine to educational policy. In recent years, predictive machine learning has emerged as a valuable toolbox for causal estimation, enabling more flexible effect estimation. However, accurately estimating conditional average treatment effects (CATE) remains a major challenge, particularly in the presence of many covariates. In this article, we propose pretraining strategies that leverage a phenomenon common in real-world applications: factors that are prognostic of the outcome are frequently also predictive of treatment effect heterogeneity. In medicine, for example, components of the same biological signaling pathways frequently influence both baseline risk and treatment response. Specifically, we demonstrate our approach within the R-learner framework, which estimates the CATE by solving individual prediction problems based on a residualized loss. We use this structure to incorporate "side information" and develop models that can exploit synergies between risk prediction and causal effect estimation. In settings where these synergies are present, this cross-task learning enables more accurate signal detection, yielding lower estimation error, reduced false discovery rates, and higher power for detecting heterogeneity.
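A minimal sketch of the base R-learner's residualization step with cross-fitted nuisance estimates; the pretraining/shared-structure idea from the abstract is not shown, and the learners and the pseudo-outcome stabilization are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def r_learner(X, w, y):
    # cross-fitted outcome model m(x) and propensity model e(x)
    m_hat = cross_val_predict(GradientBoostingRegressor(), X, y, cv=5)
    e_hat = cross_val_predict(GradientBoostingClassifier(), X, w, cv=5,
                              method="predict_proba")[:, 1]
    y_res, w_res = y - m_hat, w - e_hat
    # weighted regression of pseudo-outcomes yields a CATE model tau(x);
    # in practice, w_res near zero needs clipping or regularization
    return Ridge().fit(X, y_res / w_res, sample_weight=w_res ** 2)
```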
Keywords
Causal inference
Heterogeneous treatment effects
statistical learning
The widespread deployment of smart meters in the residential and tertiary sectors has made it possible to collect high-frequency electricity consumption data at the consumer level (individuals, professionals, etc.). These data are the raw material for research on electricity consumption forecasting at this level, research largely aimed at meeting the needs of industry, such as smart-home applications and programs for managing and reducing consumption.
The objective of this work is to implement short-term (day-ahead, D+1) electrical load forecasting models at the consumer level. The difficulty of the problem lies in the fact that consumption data at this scale are highly volatile: they contain a large amount of noise and depend on the consumer's lifestyle and consumption habits.
In this study, RNN-LSTM models were deployed to predict household load, taking into account various industrial constraints.
The models were tested and evaluated on a large sample of disparate load curves in the residential sector. An approach was also proposed for the prediction of the most volatile load curves.
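A minimal day-ahead sketch with Keras, assuming half-hourly readings: sliding windows of the past week predict the next day's 48 values. The architecture and sizes are illustrative, not the study's tuned models.

```python
import numpy as np
import tensorflow as tf

def make_windows(load, past=336, horizon=48):      # 7 days in, 1 day out
    X, y = [], []
    for i in range(len(load) - past - horizon):
        X.append(load[i:i + past])
        y.append(load[i + past:i + past + horizon])
    return np.array(X)[..., None], np.array(y)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(336, 1)),
    tf.keras.layers.Dense(48),                     # one output per half-hour
])
model.compile(optimizer="adam", loss="mse")
# X, y = make_windows(household_load)              # household_load: 1-D array
# model.fit(X, y, epochs=20, batch_size=32)
```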
Keywords
Machine Learning
Deep Learning
RNN-LSTM
Electricity load curves
Short-term individual prediction