Print Close

Heavy-tails and Robust Statistics in Learning Algorithms

Jose Blanchet Chair
Stanford University

Jose Blanchet Organizer
Stanford University

Wenhao Yang Organizer

Thursday, Aug 7: 8:35 AM - 10:20 AM
0565
Topic-Contributed Paper Session

Music City Center

Room: CC-101C

As machine learning becomes integral to solving real-world problems, the need for robustness in learning algorithms has grown significantly. Heavy-tailed data distributions, outliers, and adversarial perturbations frequently challenge the reliability of standard machine learning methods. Addressing these challenges requires innovative approaches rooted in robust statistics and distributionally robust optimization, ensuring that models maintain performance even under challenging data conditions.

This session explores recent advancements in statistical methodologies and algorithmic frameworks designed to handle heavy-tailed data and outliers. Topics include techniques for distributionally robust optimization that protect against worst-case data distributions, robust statistical methods that enhance model resilience to outliers, and the development of learning algorithms tailored to heavy-tailed and non-standard noise environments. By showcasing these cutting-edge approaches, this session highlights the pivotal role of robust statistics in advancing machine learning theory and practice.

Applied

Main Sponsor

IMS

Co Sponsors

Section on Statistical Computing

Section on Statistical Learning and Data Science

Presentations

A decade of algorithmic robust statistics

The study of robust statistics was initiated by work of Anscombe, Tukey, and Huber in the 60s and 70s. Yet until recently, efficient algorithms for these questions eluded us in many important yet basic high-dimensional settings. However, around a decade ago, starting with the work of Diakonikolas et. al and Lai-Rao-Vempala, there was a bona fide revolution in the way we think about and design algorithms for dealing with high-dimensional outliers. Now, not only do we have polynomial time algorithms for many robust statistical tasks, we now even have practical algorithms, and we have also discovered deep connections between these tasks and other well-studied settings in statistics such as differential privacy, and learning with heavy tails. In this talk, I will survey some of the progress that has been made in this field, and then I will then discuss some exciting future directions, including connections to other aspects of statistics, learning theory, theoretical computer science, and perhaps unexpectedly, quantum computation.

Speaker

Jerry Li, University of Washington

A Definition of Non-Stationary Bandits

Prior work characterizes a non-stationary bandit as one whose reward distribution changes over time. In the Bayesian setting, however, different choices of the underlying distribution sequence can yield conflicting classifications, making the notion of non-stationarity ambiguous. Likewise, popular metrics for non-stationarity or hardness—such as the entropy rate of the optimal action sequence or the temporal variation of mean rewards—also depend on a latent sequence of reward distributions, varying that sequence can produce inconsistent or misleading measures. In some cases, selecting a latent sequence that genuinely reflects the intuitive non-stationarity or difficulty of a problem can itself be challenging. Because these metrics drive both algorithm design and regret analyses, their ambiguity and undesirable properties often lead to algorithms that over-explore and to regret bounds that are harder to interpret. To address these issues, we introduce a definition of non-stationary bandits that makes no reference to any latent sequence, thus eliminating the classification ambiguity. Our definition further guarantees that any two bandits inducing indistinguishable interaction experiences are classified identically—either both stationary or both non-stationary. Finally, we propose a set of axiomatic criteria for non-stationarity or hardness metrics, agent design, and regret analysis.

Speaker

Yueyang Liu, Rice University

Learning optimal solutions to chance-constrained optimization models from limited data

We consider the challenge of provisioning resources to meet high reliability or service level requirements, which are typical in applications such as power distribution, emergency response, and regulatory risk management. With chance-constrained optimization serving as a natural starting point for modeling this class of problems, our main contribution is to characterize how the optimal costs and decisions scale for a generic joint chance-constrained model as the target probability of satisfying the reliability constraints approaches its maximal level. Besides providing insights into the behaviour of optimal costs and decisions based on large-deviations theory, the scaling result carries several algorithmic implications, beginning with the following observation on distributionally robust optimization (DRO) modeling of chance constraints: While widely used DRO approaches based on KL-divergences, Wasserstein distances, and moments distort the scaling properties of optimal decisions and lead to exponentially higher costs, incorporating marginal distributions or employing suitably chosen f -divergence balls preserves the correct scaling and ensures that decisions remain conservative by at most a constant or logarithmic factor. Further, given N data samples, we demonstrate how the scaling can serve as a basis for estimating approximately Pareto-optimal decisions whose constraint violation probabilities can even be smaller than the Ω(1/N)−estimation barrier that is fundamental in the absence of parametric assumptions.

Keywords

Distributionally Robust Optimization

Chance constrained optimization

Extreme value extrapolation

Pareto efficiency

Rare events

Large deviations

Co-Author

Anand Deo, Indian Institute of Management Bangalore

Speaker

Karthyek Murthy, Singapore University of Technology & Design

Limit Theorems for Stochastic Gradient Descent with Infinite Variance

Stochastic gradient descent is a classic algorithm that has gained great popularity especially in
the last decades as the most common approach for training models in machine learning. While the
algorithm has been well-studied when stochastic gradients are assumed to have a finite variance,
there is significantly less research addressing its theoretical properties in the case of infinite variance
gradients. In this paper, we establish the asymptotic behavior of stochastic gradient descent in the
context of infinite variance stochastic gradients, assuming that the stochastic gradient is regular
varying with index α ∈ (1, 2). The closest result in this context was established in 1969, in the
one-dimensional case and assuming that stochastic gradients belong to a more restrictive class of
distributions. We extend it to the multidimensional case, covering a broader class of infinite variance
distributions. As we show, the asymptotic distribution of the stochastic gradient descent algorithm
can be characterized as the stationary distribution of a suitably defined Ornstein-Uhlenbeck process
driven by an appropriate stable Lévy process. Additionally, we explore the applications of these
results in linear regression and logistic regression models.

Keywords

Levy process

Stochastic gradient descent

heavy tail

Co-Author(s)

Jose Blanchet, Stanford University
Aleksandar Mijatović, University of Warwick

Speaker

Wenhao Yang

Two Tales, One Resolution: Physics-Informed Test Time Scaling and Precondition

In this talk, I will introduce a novel framework for physics-informed debiasing of machine learning estimators, which we call Simulation-Calibrated Scientific Machine Learning (SCaSML). This approach leverages the structure of physical models to achieve two key objectives:

Unbiased Predictions: It produces unbiased predictions even when the underlying machine learning predictor is biased.
Overcoming Dimensionality Challenges: It mitigates the curse of dimensionality that often affects high-dimensional estimators.
The SCaSML paradigm integrates a (potentially) biased machine learning algorithm with a de-biasing procedure that is rigorously designed using numerical analysis and stochastic simulation. Our methodology aligns with recent advances in inference-time computation—similar to those seen in the large language model literature—demonstrating that additional computation can enhance ML estimates.

Furthermore, we establish a surprising equivalence between our framework and another research direction that utilizes approximate (linearized) solvers to precondition iterative methods. This connection not only bridges two distinct areas of study but also offers new insights into improving estimation accuracy in complex, high-dimensional settings.

Speaker

Yiping Lu