New Developments in Machine and Deep Learning

Reetam Majumder Chair
North Carolina State University
 
Sunday, Aug 4: 2:00 PM - 3:50 PM
5007 
Contributed Papers 
Oregon Convention Center 
Room: CC-C125 

Main Sponsor

IMS

Presentations

A Larger Stepsize Improves Gradient Descent in Classification Problems

Gradient Descent (GD) and Stochastic Gradient Descent (SGD) are pivotal in machine learning, particularly in neural network optimization. Conventional wisdom suggests smaller stepsizes for stability, yet in practice larger stepsizes often yield faster convergence and improved generalization, despite initial instability. This talk examines the dynamics of GD for logistic regression with linearly separable data in the setting where the stepsize η is constant but large, so that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly, within O(η) iterations, and subsequently achieves a risk of Õ(1 / (t η)). The analysis reveals that, without momentum techniques or variable stepsize schedules, GD can achieve an accelerated error rate of Õ(1/T^2) after T iterations with a stepsize of η = Θ(T). In contrast, if the stepsize is small enough that the loss does not oscillate, we prove an Ω(1/T) lower bound. Our results further extend to general classification loss functions, nonlinear models in the neural tangent kernel regime, and SGD with large stepsizes, and are validated with experiments on neural networks. 
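
A minimal NumPy sketch of the setting (not the authors' code; the data, stepsizes, and iteration counts are illustrative assumptions): full-batch gradient descent on the logistic loss with linearly separable data, run with a small and a large constant stepsize. With the large stepsize the loss typically oscillates early on before dropping quickly, which is the behavior analyzed in the talk.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(0)
n, d = 200, 20
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)                      # labels in {-1, +1}; data is linearly separable

def logistic_loss(w):
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def gradient(w):
    s = expit(-y * (X @ w))                  # sigmoid of the negative margin
    return -(X * (y * s)[:, None]).mean(axis=0)

def run_gd(stepsize, iters=500):
    w = np.zeros(d)
    losses = []
    for _ in range(iters):
        losses.append(logistic_loss(w))
        w -= stepsize * gradient(w)
    return losses

for eta in (0.5, 50.0):                      # "small" vs "large" stepsize (illustrative values)
    losses = run_gd(eta)
    print(f"eta={eta:5.1f}  loss@iter10={losses[10]:.4f}  loss@iter499={losses[-1]:.6f}")
```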

Keywords

logistic regression

gradient descent

optimization

neural network

acceleration

edge of stability 

View Abstract 3762

Co-Author(s)

Matus Telgarsky, New York University
Bin Yu, University of California, Berkeley
Peter Bartlett, University of California, Berkeley

First Author

Jingfeng Wu

Presenting Author

Jingfeng Wu

Conditional Independence with Deep Neural Network Based Binary Expansion Test (DeepBET)

This project focuses on testing conditional independence between two random variables (X and Y) given a set of high-dimensional confounding variables (Z). The high dimensionality of the confounding variables poses a challenge for many existing tests, leading to either inflated type I errors or insufficient power. To address this issue, we leverage the ability of deep neural networks (DNNs) to handle complex, high-dimensional data while circumventing the curse of dimensionality, and propose a novel DeepBET test procedure. First, we use a DNN model to estimate the conditional means of X and Y given Z on part of the data and obtain prediction errors on the remaining data. Then, we apply a novel binary expansion statistic to these prediction errors to construct our test statistic for dependence detection. Furthermore, we implement a multiple-split procedure to enhance power, utilizing the entire sample while minimizing randomness. Our results show that the proposed method adeptly controls the type I error and exhibits significant power against alternatives, making it a robust approach for testing conditional independence. 
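
An illustrative sketch of the workflow described above, not the authors' implementation: the sample is split, X and Y are regressed on Z with a small neural network (scikit-learn's MLPRegressor as a stand-in for the DNN) on one half, prediction errors are formed on the other half, and a depth-one sign-interaction statistic of the rank-transformed errors, a simplified stand-in for the full binary expansion statistic, is calibrated by permutation. The data-generating model, network sizes, and number of permutations are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n, p = 600, 10
Z = rng.normal(size=(n, p))
X = Z[:, 0] + 0.5 * rng.normal(size=n)
Y = Z[:, 0] ** 2 + 0.5 * rng.normal(size=n)          # X and Y are conditionally independent given Z

train, test = np.arange(n // 2), np.arange(n // 2, n)
res = {}
for name, target in (("X", X), ("Y", Y)):
    net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
    net.fit(Z[train], target[train])                 # estimate the conditional mean given Z
    res[name] = target[test] - net.predict(Z[test])  # prediction errors on the held-out half

def dyadic_sign(r):
    u = (np.argsort(np.argsort(r)) + 0.5) / len(r)   # ranks mapped into (0, 1)
    return np.where(u > 0.5, 1.0, -1.0)              # depth-1 binary expansion of the ranks

stat = abs(np.mean(dyadic_sign(res["X"]) * dyadic_sign(res["Y"])))
perm = [abs(np.mean(dyadic_sign(res["X"]) * dyadic_sign(rng.permutation(res["Y"]))))
        for _ in range(499)]
print("statistic:", round(stat, 4),
      "permutation p-value:", (1 + sum(s >= stat for s in perm)) / 500)
```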

Keywords

Conditional independence

Deep Neural Network

Non-parametric Statistics

Binary Expansion Testing

Multi-split method 

View Abstract 2456

Co-Author(s)

Kai Zhang, UNC Chapel Hill
Ping-Shou Zhong, University of Illinois at Chicago

First Author

Yang Yang

Presenting Author

Yang Yang

Deep Fréchet Regression

This paper addresses the challenge of modeling the relationship between non-Euclidean responses and Euclidean predictors. We propose a regression model capable of handling high-dimensional predictors without parametric assumptions. Two key challenges are addressed: the curse of dimensionality in nonparametric regression and the absence of linear structure in general metric spaces. The former is tackled using deep neural networks, while for the latter we establish the feasibility of mapping the metric space where the responses reside to a low-dimensional Euclidean space using manifold learning. We introduce a reverse mapping approach, based on local Fréchet regression, to map the low-dimensional representation back to the original metric space. To establish a comprehensive theoretical framework, we investigate the convergence rate of deep neural networks under dependent and biased sub-Gaussian noise. The convergence rate of the proposed regression model is then obtained by extending local Fréchet regression to accommodate multivariate predictors in the presence of errors in the predictors. Simulations and applications show that the proposed model outperforms existing methods. 
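
A schematic sketch of the pipeline under strong simplifying assumptions (not the authors' implementation): responses are one-dimensional distributions represented by quantile vectors, so the metric is the 2-Wasserstein distance; classical MDS stands in for the manifold-learning step, an MLP regresses the low-dimensional scores on the high-dimensional predictors, and a kernel-weighted Fréchet mean, a simplified substitute for local Fréchet regression, serves as the reverse map. All model and parameter choices below are illustrative.

```python
import numpy as np
from scipy.stats import norm
from sklearn.manifold import MDS
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
n, d, n_q = 300, 50, 101
X = rng.normal(size=(n, d))
mu = X[:, 0] + 0.5 * X[:, 1]                       # distribution parameter driven by the predictors
grid = np.linspace(0.01, 0.99, n_q)
Q = mu[:, None] + norm.ppf(grid)[None, :]          # responses: quantile functions of N(mu_i, 1)

# Step 1: manifold learning on pairwise 2-Wasserstein distances (L2 between quantile functions).
D = np.sqrt(((Q[:, None, :] - Q[None, :, :]) ** 2).mean(axis=2))
scores = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)

# Step 2: deep network mapping predictors to the low-dimensional representation.
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0)
net.fit(X, scores)

# Step 3: reverse map: a kernel-weighted Fréchet mean in Wasserstein space
# (a weighted average of quantile functions) around the predicted score.
def predict_distribution(x_new, bandwidth=1.0):
    s = net.predict(x_new[None, :])[0]
    w = np.exp(-((scores - s) ** 2).sum(axis=1) / (2 * bandwidth ** 2))
    return (w[:, None] * Q).sum(axis=0) / w.sum()   # predicted quantile function

print(predict_distribution(rng.normal(size=d))[:5])
```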

Keywords

Curse of Dimensionality

Deep Learning

Fréchet regression

Non-Euclidean data

Manifold learning 

View Abstract 2134

Co-Author(s)

Yidong Zhou
Hans-Georg Mueller, UC Davis

First Author

Su I Iao

Presenting Author

Su I Iao

Inference via Machine Learning

From proteomics to remote sensing, machine learning predictions are beginning to substitute for real data when collection of the latter is difficult, slow or costly. In this talk I will present recent and ongoing work that permits the use of predictions for the purpose of valid statistical inference. I will discuss the use of machine learning predictions as substitutes for high-quality data on one hand, and as a tool for guiding real data collection on the other. In both cases, machine learning allows for a significant boost in statistical power compared to "classical" baselines for inference that do not leverage prediction. Based on joint works with Anastasios Angelopoulos, Stephen Bates, Emmanuel Candes, John Duchi, Clara Fannjiang, and Michael Jordan. 
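
A minimal sketch of the prediction-powered inference construction for a population mean, in the spirit of the framework referenced above (not the speaker's code; the data-generating process and the predictive model are illustrative assumptions): a small labeled sample is used to debias predictions made on a large unlabeled sample, yielding a confidence interval that remains valid even when the predictive model is inaccurate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n_lab, n_unlab = 200, 10_000
predict = lambda x: 2.0 * x + 0.3                  # any fixed, possibly biased, ML predictor

x_lab = rng.normal(size=n_lab)
y_lab = 2.0 * x_lab + rng.normal(size=n_lab)       # gold-standard labels only on the small sample
x_unlab = rng.normal(size=n_unlab)

rectifier = y_lab - predict(x_lab)                 # measures the predictor's bias on labeled data
theta_pp = predict(x_unlab).mean() + rectifier.mean()
se = np.sqrt(rectifier.var(ddof=1) / n_lab + predict(x_unlab).var(ddof=1) / n_unlab)
z = norm.ppf(0.975)
print(f"95% CI for E[Y]: ({theta_pp - z * se:.3f}, {theta_pp + z * se:.3f})")
```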

Keywords

machine learning

prediction-powered inference

active inference 

View Abstract 3444

First Author

Tijana Zrnic, University of California

Presenting Author

Tijana Zrnic, University of California

Learning time-scales in two-layer neural networks

Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the rate of decrease of the empirical risk is non-monotone even after averaging over large batches. Long plateaus in which barely any progress is observed alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically 'simpler' or 'easier to learn', although in a way that is difficult to formalize.

Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the training dynamics of a wide two-layer neural network under a single-index model in high dimension. Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.
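
A small NumPy simulation in the spirit of the setting above (not the paper's experiments; the link function, dimensions, width, and stepsize are illustrative assumptions): a wide two-layer ReLU network is trained by full-batch gradient descent on data from a single-index model, and the empirical risk is printed along training so the plateau-then-drop pattern described in the abstract can be inspected.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, n = 40, 256, 1000                            # input dimension, width, sample size
w_star = np.zeros(d); w_star[0] = 1.0
g = lambda t: t ** 3 - 3 * t                       # single-index link (third Hermite polynomial)
X = rng.normal(size=(n, d))
y = g(X @ w_star)

W = rng.normal(size=(m, d)) / np.sqrt(d)           # first layer, trained below
a = rng.choice([-1.0, 1.0], size=m) / m            # second layer fixed, mean-field scaling
lr = 0.5 * m                                       # stepsize scaled with the width

for t in range(1201):
    H = np.maximum(X @ W.T, 0.0)                   # hidden-layer activations, shape (n, m)
    err = H @ a - y                                # residuals of the network's predictions
    if t % 200 == 0:
        print(f"iter {t:5d}   empirical risk {np.mean(err ** 2):.4f}")
    grad_W = ((err[:, None] * a[None, :]) * (H > 0)).T @ X / n
    W -= lr * grad_W                               # gradient step on the first layer only
```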
 



First Author

Kangjie Zhou, Stanford University

Presenting Author

Kangjie Zhou, Stanford University

Simultaneous Classification and Feature Selection for Complex Functional Data

Classification using high-dimensional features arises frequently in many contemporary statistical studies, such as imaging data classification for PET scans or other high-throughput data. The difficulty of high-dimensional functional data classification is intrinsically caused by the existence of many noise features that do not contribute to reducing the misclassification rate. There has been limited study of the impact of high dimensionality on functional data classification. We bridge this gap by proposing a deep neural network-based algorithm that performs penalized classification and feature selection simultaneously. Simulation studies and real data analysis support our theoretical results and convincingly demonstrate the advantage of the new classification procedure. 
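
A conceptual NumPy sketch of the idea, not the authors' algorithm: a small two-layer network for binary classification with a group-lasso penalty on the rows of the input-layer weight matrix, fitted by proximal gradient descent. Rows shrunk exactly to zero correspond to input features (e.g., grid points of a discretized curve) that are screened out; the data, penalty level, and network size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, h = 400, 50, 16                                       # samples, input grid points, hidden units
X = rng.normal(size=(n, p))
y = (X[:, 5] + X[:, 6] - X[:, 7] + 0.3 * rng.normal(size=n) > 0).astype(float)

W1 = 0.1 * rng.normal(size=(p, h)); b1 = np.zeros(h)
w2 = 0.1 * rng.normal(size=h); b2 = 0.0
relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr, lam = 0.1, 0.02                                         # stepsize and group-lasso level
for it in range(3000):
    H = relu(X @ W1 + b1)
    prob = sigmoid(H @ w2 + b2)
    err = prob - y                                          # gradient of the logistic loss w.r.t. logits
    g_w2 = H.T @ err / n
    g_b2 = err.mean()
    back = (err[:, None] * w2[None, :]) * (H > 0)           # backpropagate through the ReLU layer
    g_W1 = X.T @ back / n
    g_b1 = back.mean(axis=0)
    W1 -= lr * g_W1; b1 -= lr * g_b1; w2 -= lr * g_w2; b2 -= lr * g_b2
    # proximal (group soft-threshold) step: shrink whole rows of W1 toward zero
    norms = np.linalg.norm(W1, axis=1, keepdims=True)
    W1 *= np.maximum(0.0, 1.0 - lr * lam / np.maximum(norms, 1e-12))

selected = np.where(np.linalg.norm(W1, axis=1) > 1e-8)[0]
print("selected features:", selected)
print("training accuracy:",
      np.mean((sigmoid(relu(X @ W1 + b1) @ w2 + b2) > 0.5) == (y > 0.5)))
```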

Keywords

functional data analysis

deep neural network

feature selection

classification 

View Abstract 2115

Co-Author

Guanqun Cao, Michigan State University

First Author

Shuoyang Wang, Yale University

Presenting Author

Guanqun Cao, Michigan State University

Trade-off between dependence and complexity in Wasserstein distance learning

The Wasserstein distance is a powerful tool in modern machine learning to metrize the space of probability distributions in a way that takes into account the geometry of the domain.
Therefore, a lot of attention has been devoted in the literature to understanding rates of convergence for Wasserstein distances based on iid data. However, in machine learning applications, especially in reinforcement learning, object tracking, performative prediction, and other online learning problems, observations are often received sequentially, introducing inherent temporal dependence. Motivated by this observation, we study the problem of estimating Wasserstein distances with the natural plug-in estimator based on stationary beta-mixing sequences, a widely used assumption in the study of dependent processes. Our convergence rate results apply under both short- and long-range dependence. As expected, under short-range dependence the rates match those observed in the iid case. Interestingly, however, even under long-range dependence we show that the rates can match those in the iid case provided the (intrinsic) dimension is large enough.
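
An illustrative sketch, not the paper's analysis or experiments: the plug-in Wasserstein estimator applied to a stationary AR(1) sequence, a standard example of a geometrically beta-mixing process. The one-dimensional W1 distance between the empirical distribution of the dependent sample and a large iid proxy for the stationary marginal is computed for growing sample sizes; all parameter choices are assumptions for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(6)
phi = 0.8                                          # AR(1) coefficient; mixing slows as phi -> 1
sigma_stat = 1.0 / np.sqrt(1.0 - phi ** 2)         # stationary marginal is N(0, sigma_stat^2)

def ar1(n):
    x = np.empty(n)
    x[0] = sigma_stat * rng.normal()               # start the chain from stationarity
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

reference = sigma_stat * rng.normal(size=200_000)  # large iid proxy for the true marginal
for n in (500, 5_000, 50_000):
    est = wasserstein_distance(ar1(n), reference)  # plug-in W1 between empirical measures
    print(f"n = {n:6d}   plug-in W1 to stationary law: {est:.4f}")
```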
 

Keywords

Entropy regularized optimal transport

McKean-Vlasov diffusion

mirror descent

parabolic Monge-Ampère

Sinkhorn algorithm

Wasserstein mirror gradient flow 



Co-Author(s)

Young-Heon Kim, University of British Columbia
Soumik Pal, University of Washington, Seattle
Geoffrey Schiebinger, University of British Columbia

First Author

Nabarun Deb

Presenting Author

Nabarun Deb