Advancing Distributed Learning for Complex and Heterogeneous Data

Ran Chen Chair
 
Keren Li Organizer
University of Alabama at Birmingham
 
Monday, Aug 4: 10:30 AM - 12:20 PM
0569 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-104D 
This session delves into the latest advancements in distributed learning, with a focus on tackling complexities arising from data heterogeneity, privacy constraints, and scalability challenges. As the fields of statistics, data science, and AI rapidly evolve, it is increasingly essential to develop techniques that enable effective learning across distributed systems while addressing communication constraints and data privacy. This session highlights innovative statistical methods and frameworks that address these issues, from collaborative knowledge sharing to high-dimensional data analysis and functional data processing.
Key Themes:
1. Distributed Collaborative Learning with Representative Knowledge Sharing: This talk emphasizes collaborative knowledge sharing and distillation on a representative dataset, without requiring a public dataset, and addresses heterogeneity through weighted teacher models.
2. Reinforcement Learning in Distributed Systems: This talk discusses online decision-making strategies that exploit both the similarities and differences among nodes in distributed systems. Applications include business and healthcare.
3. High-Dimensional Problems in Distributed Learning and Deep Learning: An examination of the statistical methods used to address the challenges posed by high-dimensional data, particularly in the context of distributed and deep learning environments.
4. Harnessing Deep Learning and Distributed Systems for Next-Generation Functional Data: This presentation explores the use of deep learning and distributed learning techniques to tackle the challenges posed by next-generation functional data, enabling more powerful and scalable analysis of large and complex functional datasets.
5. Generalized Information Criterion for Ensemble Kernel Learning: This talk introduces a computationally efficient method that uses a new information criterion to promote ensemble kernel learning for large-scale data analysis.
This session is timely, addressing critical challenges in distributed learning, such as managing heterogeneity and enhancing collaborative learning without compromising data privacy. The methods presented offer scalable, innovative solutions in the age of big data, appealing to statisticians, data scientists, and AI researchers alike by providing insights into both theoretical advancements and practical applications.
In line with the JSM 2025 theme, "Statistics, Data Science, and AI Enriching Society," this session showcases how statistical innovation in distributed learning contributes to the development of AI systems that are efficient, scalable, and socially beneficial. By emphasizing heterogeneity, privacy-preserving knowledge sharing, and advanced data integration, the session addresses crucial areas for the enrichment of data science and AI applications in society.

Keywords

Distributed Machine Learning

Reinforcement Learning

High-Dimensional Analysis

Functional Analysis

Ensemble Kernel Learning 

Applied

Yes

Main Sponsor

IMS

Co Sponsors

International Chinese Statistical Association
Section on Statistical Learning and Data Science

Presentations

Distributed Collaborative Learning with Representative Knowledge Sharing

Distributed Collaborative Learning (DCL) addresses critical challenges in privacy-aware machine learning by enabling indirect knowledge transfer across nodes with heterogeneous feature distributions. Unlike conventional federated learning approaches, DCL assumes non-IID data and prediction task distributions that span beyond local training data, requiring selective collaboration to achieve generalization. In this work, we propose a novel Collaborative Transfer Learning (CTL) framework that utilizes representative datasets and adaptive distillation weights to facilitate efficient and privacy-preserving collaboration. By quantifying node similarity via Distributed Energy Coefficients, approximated from Taylor-expanded energy distance, CTL dynamically selects optimal collaborators and refines local models through knowledge distillation on shared representative datasets. These representatives, locally constructed synthetic datasets that encode conditional information, serve as a common ground for knowledge exchange and model comparison. We highlight how Representative Learning enables quantification of model heterogeneity, facilitates transfer under non-IID task distributions, and supports scalable generalization. Simulations demonstrate the benefit of adaptive collaboration, with CTL achieving superior trade-offs between personalization and global coordination. We also discuss a taxonomy of data heterogeneity types, including newly defined model and representation divergences, and illustrate their relevance to node alignment and collaborative efficiency.  
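
As an illustration of the similarity-based collaboration described above, the following sketch computes plain empirical energy distances between a node's representative dataset and those of its peers, then converts them into collaboration weights. It is a minimal sketch only: the Distributed Energy Coefficients (Taylor-expanded approximation) and the adaptive distillation weights of the actual CTL framework are not reproduced, and all function names below are hypothetical.

    # Illustrative sketch only: plain empirical energy distance between
    # representative datasets, used to form collaboration weights.
    import numpy as np

    def energy_distance(x, y):
        """Empirical energy distance between samples x (n, d) and y (m, d)."""
        def mean_pdist(a, b):
            # mean Euclidean distance over all pairs (a_i, b_j)
            diff = a[:, None, :] - b[None, :, :]
            return np.sqrt((diff ** 2).sum(-1)).mean()
        return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)

    def collaboration_weights(local_rep, peer_reps, temperature=1.0):
        """Softmax of negative energy distances: closer peers get larger weights."""
        d = np.array([energy_distance(local_rep, r) for r in peer_reps])
        w = np.exp(-d / temperature)
        return w / w.sum()

    # Usage: three peers whose representative datasets drift away from the local one
    rng = np.random.default_rng(0)
    local = rng.normal(size=(50, 2))
    peers = [rng.normal(loc=mu, size=(50, 2)) for mu in (0.0, 0.5, 2.0)]
    print(collaboration_weights(local, peers))  # most weight on the closest peer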

Keywords

Collaborative Transfer Learning

Knowledge distillation

Contrastive Learning

Federated Learning

Representative Learning 

Speaker

Keren Li, University of Alabama at Birmingham

Interpretable Personalized Online Reinforcement Learning with Applications in Business and Healthcare

Reinforcement learning (RL) has achieved remarkable success in engineering-focused domains. However, its application to high-stakes, human-centered fields such as business and healthcare remains challenging due to unique barriers: significant heterogeneity among individuals, the continuity of state and action spaces, and heightened demands for interpretability and online algorithms. To address these challenges, we propose a personalized reinforcement learning framework that accounts for both individual heterogeneity and shared patterns across human subjects regarding their state-transition and reward-generating mechanisms through a novel personalized kernel embedding approach. Building on our model, we develop an efficient online RL algorithm. We demonstrate the efficacy of our approach through a rigorous regret analysis and showcase its interpretability through practical case studies. 
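
One simple way to picture "shared patterns plus individual heterogeneity" in a kernel is to combine a component common to all subjects with a component that is active only within a subject. The sketch below is an illustration under that assumption, not the personalized kernel embedding developed in the talk; all names and weights are hypothetical.

    # Minimal sketch: a kernel over (state, subject) pairs with a shared part
    # and a subject-specific part; not the speakers' method.
    import numpy as np

    def rbf(x, y, lengthscale=1.0):
        return np.exp(-np.sum((x - y) ** 2) / (2 * lengthscale ** 2))

    def personalized_kernel(x, i, y, j, shared_weight=1.0, individual_weight=0.5):
        """The shared component applies to all subjects; the individual
        component contributes only when comparing the same subject."""
        k = shared_weight * rbf(x, y)
        if i == j:
            k += individual_weight * rbf(x, y, lengthscale=0.5)
        return k

    # Usage: the same state compared within versus across subjects
    s = np.array([0.2, -1.0])
    print(personalized_kernel(s, i=1, y=s, j=1))  # larger: shared + individual
    print(personalized_kernel(s, i=1, y=s, j=2))  # smaller: shared only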

Keywords

Data-driven decision-making

Reinforcement Learning

Personalization

Interpretable Algorithms

Speaker

Ran Chen

Phase-Aware Federated Deep Neural Network Classification for Heterogeneous Functional Data

Multi-stage neuroimaging studies such as the Alzheimer's Disease Neuroimaging Initiative (ADNI) release data in phased cohorts that differ in scanner hardware, sampling frames, and population mix, yielding heterogeneous, high-dimensional functional observations. Deep neural networks (DNNs) can capture the resulting non-linear decision boundaries, but joint training is often infeasible owing to privacy constraints, phase-specific label scarcity, and limits on storing petabyte-scale archives. We propose a sequential distributed-learning framework that trains a DNN classifier across K heterogeneous agents without co-locating raw functional data. The learner visits each agent once, updates parameters locally, discards the data, and transfers only compressed weights; an adaptive, sequential gradient-weighting strategy progressively mitigates covariate and label shift to optimize classification accuracy on the target agent, while an embedded functional feature selector pinpoints informative functional covariates. We establish minimax-optimal excess-risk bounds, prove selection consistency, and identify a sharp phase-transition threshold that governs learnability for sparsely observed functional data. Simulations and an ADNI case study on three-year MCI-to-AD conversion show that the method matches the accuracy of a centralized DNN, recovers key brain regions, and reduces memory and communication costs by an order of magnitude, providing a scalable, privacy-preserving solution for heterogeneous functional data analysis.
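
The PyTorch sketch below illustrates the one-pass sequential scheme described above: the learner visits agents in order, trains on each agent's local data, discards it, and carries only the model weights forward. The fixed per-agent scaling factor merely stands in for the adaptive sequential gradient-weighting strategy; the actual weighting rule and the functional feature selector are not reproduced here, and all names are illustrative.

    # Minimal sketch of one-pass sequential distributed training (assumed setup).
    import torch
    from torch import nn

    def sequential_train(model, agents, agent_weights, epochs=1, lr=1e-2):
        """agents: list of (X, y) tensors held locally; never co-located."""
        loss_fn = nn.CrossEntropyLoss()
        for (X, y), w in zip(agents, agent_weights):
            opt = torch.optim.SGD(model.parameters(), lr=lr)
            for _ in range(epochs):
                opt.zero_grad()
                loss = w * loss_fn(model(X), y)  # per-agent weighting of the update
                loss.backward()
                opt.step()
            # the local data (X, y) is discarded here; only weights move on
        return model

    # Usage: three heterogeneous agents, 20-dimensional inputs, binary labels
    torch.manual_seed(0)
    agents = [(torch.randn(64, 20) + k, torch.randint(0, 2, (64,))) for k in range(3)]
    model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
    sequential_train(model, agents, agent_weights=[0.5, 0.8, 1.0])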

Keywords

Functional data analysis

Deep neural network classifier

Variable selection

Speaker

Shuoyang Wang, University of Louisville

Real-Time Model Synchronization: Decentralized and Asynchronous Strategies for Scalable Machine Learning

In distributed machine learning systems, the ability to update models dynamically across multiple nodes is critical for maintaining accuracy and responsiveness in environments with continuous data streams. Traditional batch-based or centralized training methods often struggle with scalability, latency, and synchronization bottlenecks. This talk explores cutting-edge techniques for online model updating in distributed settings, focusing on incremental learning (processing data sequentially without retraining), decentralized learning (node-specific updates with minimal coordination), consensus-based strategies (achieving global model coherence through local collaboration), and asynchronous updates (eliminating synchronization barriers to reduce latency). These approaches collectively address challenges such as dynamic data distribution shifts, communication overhead, and system heterogeneity. By enabling real-time adaptation, they enhance scalability, fault tolerance, and resource efficiency while preserving model performance. Practical applications span federated learning, IoT networks, and large-scale analytics, where timely insights depend on seamless coordination across nodes. The talk also discusses trade-offs between consistency and speed, robustness to node failures, and open challenges in balancing theoretical guarantees with real-world deployment constraints. This synthesis of strategies provides a roadmap for building agile, resilient machine learning systems capable of thriving in fast-evolving data landscapes. 
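
As a concrete picture of one consensus-based round, the sketch below performs gossip averaging with neighbors followed by a local gradient step on a toy linear-regression problem. The ring topology, equal mixing weights, and synchronous rounds are illustrative assumptions, not a specific system from the talk.

    # Minimal sketch: decentralized consensus (gossip) averaging + local gradient step.
    import numpy as np

    def gossip_round(params, neighbors, data, lr=0.1):
        """params: one parameter vector per node.
        neighbors[i]: indices (including i) averaged with equal weights."""
        mixed = [np.mean([params[j] for j in neighbors[i]], axis=0)
                 for i in range(len(params))]
        updated = []
        for i, (X, y) in enumerate(data):
            grad = X.T @ (X @ mixed[i] - y) / len(y)  # local least-squares gradient
            updated.append(mixed[i] - lr * grad)
        return updated

    # Usage: four nodes on a ring, each holding a private shard of a shared problem
    rng = np.random.default_rng(0)
    true_w = np.array([1.0, -2.0])
    data = []
    for _ in range(4):
        X = rng.normal(size=(30, 2))
        data.append((X, X @ true_w + 0.1 * rng.normal(size=30)))
    params = [np.zeros(2) for _ in range(4)]
    ring = {0: [3, 0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3, 0]}
    for _ in range(200):
        params = gossip_round(params, ring, data)
    print(np.round(params[0], 2))  # each node approaches the shared solution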

Speaker

Zhong Chen, Southern Illinois University

TorchSVM: A PyTorch-based Library for Large-scale Kernel SVM and Kernel Machines

In this talk, we introduce TorchSVM, a PyTorch-based library that trains kernel SVMs and other large-margin classifiers with exact cross-validation error computation. Traditional SVM solvers often encounter scalability and efficiency challenges, particularly when handling large datasets or performing multiple cross-validation runs. TorchSVM effectively enhances both speed and scalability through CUDA-accelerated matrix operations. By carefully designing the underlying algorithms, TorchSVM employs advanced strategies, such as spectral algorithms, to fully leverage parallel computing and optimize the use of computational resources. Benchmark experiments demonstrate that TorchSVM consistently outperforms existing kernel SVM solvers, in both CPU and GPU implementations, in terms of accuracy and speed.
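
The sketch below shows the kind of batched, CUDA-ready tensor computation such a solver builds on: an RBF Gram matrix computed with plain PyTorch operations that run on a GPU when one is available. It is illustrative only and does not use the TorchSVM API.

    # Illustrative sketch: GPU-accelerated RBF kernel (Gram) matrix in plain PyTorch.
    import torch

    def rbf_kernel_matrix(X, Z, gamma=0.5):
        """K[i, j] = exp(-gamma * ||X_i - Z_j||^2), computed as one batched op."""
        sq_dists = torch.cdist(X, Z, p=2) ** 2
        return torch.exp(-gamma * sq_dists)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    X = torch.randn(2000, 50, device=device)
    K = rbf_kernel_matrix(X, X)  # 2000 x 2000 Gram matrix on the selected device
    print(K.shape, K.device)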

Keywords

Support Vector Machines

PyTorch

GPU

Parallel Computing

Large-Margin Classification 

Speaker

Boxiang Wang, University of Iowa