Sunday, Aug 4: 2:00 PM - 3:50 PM
5007
Contributed Papers
Oregon Convention Center
Room: CC-C125
Main Sponsor
IMS
Presentations
Gradient Descent (GD) and Stochastic Gradient Descent (SGD) are pivotal in machine learning, particularly in neural network optimization. Conventional wisdom suggests smaller stepsizes for stability, yet in practice larger stepsizes often yield faster convergence and improved generalization, despite initial instability. This talk examines the dynamics of GD for logistic regression with linearly separable data in the setting where the stepsize η is constant but large, so that the loss initially oscillates. We show that GD exits the initial oscillatory phase rapidly, within O(η) iterations, and subsequently achieves a risk of Õ(1 / (t η)). This analysis reveals that, without momentum techniques or variable stepsize schedules, GD can achieve an accelerated error rate of Õ(1/T^2) after T iterations with a stepsize of η = Θ(T). In contrast, if the stepsize is small enough that the loss does not oscillate, we show an Ω(1/T) lower bound. Our results extend to general classification loss functions, nonlinear models in the neural tangent kernel regime, and SGD with large stepsizes, and are validated with experiments on neural networks.
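As a rough illustration of the two regimes, the following sketch runs GD on separable logistic regression with a large and a small constant stepsize. The data, stepsizes, and iteration counts are our own illustrative choices, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable two-class data: labels y in {-1, +1}, with the signal
# carried by the first coordinate (all values here are illustrative).
n, d = 100, 5
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d))
X[:, 0] = y * (0.5 + 0.5 * rng.random(n))  # guarantees a positive margin

def logistic_loss(w):
    # Numerically stable mean logistic loss: log(1 + exp(-margin)).
    return np.logaddexp(0.0, -y * (X @ w)).mean()

def gd(eta, T):
    w = np.zeros(d)
    losses = [logistic_loss(w)]
    for _ in range(T):
        s = np.exp(-np.logaddexp(0.0, y * (X @ w)))  # stable sigmoid(-margin)
        grad = -(X * (y * s)[:, None]).mean(axis=0)
        w -= eta * grad
        losses.append(logistic_loss(w))
    return np.array(losses)

large = gd(eta=10.0, T=500)  # loss may oscillate early, then decreases
small = gd(eta=0.5, T=500)   # stepsize below 1/smoothness: monotone descent
print(large[-1], small[-1])
```

The large stepsize here is several times the classical stability threshold 2/L for this problem, so the early iterations can be unstable before the loss settles.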
Keywords
logistic regression
gradient descent
optimization
neural network
acceleration
edge of stability
This project focuses on testing conditional independence between two random variables X and Y given a set of high-dimensional confounding variables Z. The high dimensionality of the confounders poses a challenge for many existing tests, leading to either inflated type-I errors or insufficient power. To address this issue, we leverage the ability of deep neural networks (DNNs) to handle complex, high-dimensional data while circumventing the curse of dimensionality. We propose a novel test procedure, DeepBET. First, we use a DNN model to estimate the conditional means of X and Y given Z on one part of the data and obtain predicted errors on the other part. Then, we apply novel binary expansion statistics to the predicted errors to construct our test statistic for dependence detection. Furthermore, we implement a multiple-split procedure to enhance power, utilizing the entire sample while minimizing randomness. Our results show that the proposed method controls the type-I error and exhibits substantial power against alternatives, making it a robust approach for testing conditional independence.
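A minimal sketch of the split-and-test scheme, with a ridge regression standing in for the DNN and a depth-one sign statistic standing in for the full binary expansion statistics (both substitutions are ours, for illustration only):

```python
import numpy as np

# Synthetic check under H0: X and Y conditionally independent given Z
# (dimensions and models are illustrative).
n, p = 400, 20
rng = np.random.default_rng(1)
Z = rng.normal(size=(n, p))
X = Z @ rng.normal(size=p) / np.sqrt(p) + rng.normal(size=n)
Y = Z @ rng.normal(size=p) / np.sqrt(p) + rng.normal(size=n)

def fit_ridge(Ztr, t):
    # Stand-in for the DNN conditional-mean estimator (hypothetical choice).
    return np.linalg.solve(Ztr.T @ Ztr + np.eye(Ztr.shape[1]), Ztr.T @ t)

def one_split_stat(seed):
    r = np.random.default_rng(seed)
    idx = r.permutation(n)
    tr, te = idx[: n // 2], idx[n // 2 :]
    ex = X[te] - Z[te] @ fit_ridge(Z[tr], X[tr])  # predicted errors for X
    ey = Y[te] - Z[te] @ fit_ridge(Z[tr], Y[tr])  # predicted errors for Y
    # Depth-one binary expansion of each error (sign relative to its median):
    # a crude stand-in for the full binary expansion statistics.
    sx = np.sign(ex - np.median(ex))
    sy = np.sign(ey - np.median(ey))
    return np.sqrt(len(te)) * abs(np.mean(sx * sy))

stats = [one_split_stat(s) for s in range(5)]  # multiple splits, then aggregate
print(np.mean(stats))
```

Under H0 each split statistic is approximately half-normal, so the aggregated value stays small; large values would indicate residual dependence.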
Keywords
Conditional independence
Deep Neural Network
Non-parametric Statistics
Binary Expansion Testing
Multi-split method
This paper addresses the challenge of modeling the relationship between non-Euclidean responses and Euclidean predictors. We propose a regression model capable of handling high-dimensional predictors without parametric assumptions. Two key challenges are addressed: the curse of dimensionality in nonparametric regression and the absence of linear structure in general metric spaces. The former is tackled using deep neural networks, while for the latter we establish the feasibility of mapping the metric space where the responses reside to a low-dimensional Euclidean space using manifold learning. We introduce a reverse mapping approach, based on local Fréchet regression, to map the low-dimensional representation back to the original metric space. To establish a comprehensive theoretical framework, we investigate the convergence rate of deep neural networks under dependent and biased sub-Gaussian noise. The convergence rate of the proposed regression model is then obtained by extending the scope of local Fréchet regression to accommodate multivariate predictors in the presence of errors in the predictors. Simulations and applications show that the proposed model outperforms existing methods.
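For intuition, local Fréchet regression replaces the kernel-weighted average of ordinary local regression with the minimizer of a kernel-weighted mean of squared metric distances. The sketch below restricts the minimization to the observed responses (our simplification for illustration, not the paper's estimator):

```python
import numpy as np

def local_frechet(x0, X, D, h):
    # D[i, j] = d(y_i, y_j): pairwise metric distances between responses.
    # The local Fréchet estimate at x0 minimizes the kernel-weighted mean
    # squared distance, restricted here to the observed responses.
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    w /= w.sum()
    objective = (w[None, :] * D**2).sum(axis=1)
    return int(np.argmin(objective))  # index of the estimated response

# Sanity check with scalar responses, where d(y_i, y_j) = |y_i - y_j| and the
# estimate should approximate the conditional mean.
X = np.linspace(-2.0, 2.0, 81)
Y = X.copy()                              # noiseless toy relationship
D = np.abs(Y[:, None] - Y[None, :])
idx = local_frechet(0.0, X, D, h=0.3)
print(Y[idx])
```

With a Euclidean response and absolute-value metric, the weighted objective is minimized at the response closest to the local weighted mean, matching ordinary kernel regression.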
Keywords
Curse of Dimensionality
Deep Learning
Fréchet regression
Non-Euclidean data
Manifold learning
From proteomics to remote sensing, machine learning predictions are beginning to substitute for real data when collection of the latter is difficult, slow or costly. In this talk I will present recent and ongoing work that permits the use of predictions for the purpose of valid statistical inference. I will discuss the use of machine learning predictions as substitutes for high-quality data on one hand, and as a tool for guiding real data collection on the other. In both cases, machine learning allows for a significant boost in statistical power compared to "classical" baselines for inference that do not leverage prediction. Based on joint works with Anastasios Angelopoulos, Stephen Bates, Emmanuel Candes, John Duchi, Clara Fannjiang, and Michael Jordan.
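For estimating a mean, the prediction-powered point estimate takes a particularly simple form: the average of the predictions on the unlabeled data plus a rectifier, the average prediction error on the labeled data. A minimal sketch with synthetic numbers of our choosing:

```python
import numpy as np

rng = np.random.default_rng(2)

# Goal: estimate E[Y].  Small labeled sample, large unlabeled sample, and a
# predictor f with a systematic bias (all values here are illustrative).
n_lab, n_unlab = 100, 10_000
x_lab = rng.normal(size=n_lab)
y_lab = 2.0 * x_lab + rng.normal(scale=0.5, size=n_lab)
x_unlab = rng.normal(size=n_unlab)
f = lambda x: 2.0 * x + 0.3          # accurate predictor, but biased by +0.3

# Prediction-powered estimate: predictions on the unlabeled data plus the
# rectifier, i.e. the mean prediction error measured on the labeled data.
theta_pp = f(x_unlab).mean() + (y_lab - f(x_lab)).mean()
theta_naive = f(x_unlab).mean()      # inherits the bias of f

print(theta_pp, theta_naive)         # true E[Y] is 0 here
```

The rectifier removes the predictor's bias while the large unlabeled sample keeps the variance low, which is the source of the power gain over labeled-data-only baselines.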
Keywords
machine learning
prediction-powered inference
active inference
Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the decrease rate of the empirical risk is non-monotone even after averaging over large batches. Long plateaus, in which one observes barely any progress, alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically 'simpler' or 'easier to learn', though in a way that is difficult to formalize.
Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the training dynamics of a wide two-layer neural network under a single-index model in high dimensions. Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.
Abstracts
Classification using high-dimensional features arises frequently in many contemporary statistical studies, such as imaging data classification for PET scans or other high-throughput data. The difficulty of high-dimensional functional data classification is intrinsically caused by the existence of many noise features that do not contribute to the reduction of the misclassification rate. There has been limited study of the impact of high dimensionality on functional data classification. We bridge the gap by proposing a deep neural network-based algorithm that performs penalized classification and feature selection simultaneously. Simulation studies and real data analysis support our theoretical results and demonstrate convincingly the advantage of our new classification procedure.
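A linear l1-penalized stand-in (our simplification; the abstract's method uses a penalized deep network) illustrates how a sparsity penalty performs classification and feature selection simultaneously:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy high-dimensional classification: only the first 3 of 50 features carry
# signal; the penalty should zero out most of the rest.
n, d, s = 300, 50, 3
X = rng.normal(size=(n, d))
beta = np.zeros(d)
beta[:s] = 2.0
y = (X @ beta + rng.normal(size=n) > 0).astype(float)

def soft_threshold(w, t):
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def penalized_logistic(lam=0.05, eta=0.1, T=1000):
    # Proximal gradient descent on l1-penalized logistic loss.
    w = np.zeros(d)
    for _ in range(T):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w = soft_threshold(w - eta * X.T @ (p - y) / n, eta * lam)
    return w

w_hat = penalized_logistic()
selected = np.flatnonzero(w_hat)
print(selected)  # informative features should be among those selected
```

The soft-thresholding step drives coefficients of noise features to exactly zero, so fitting the classifier and selecting features happen in the same optimization.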
Keywords
functional data analysis
deep neural network
feature selection
classification
The Wasserstein distance is a powerful tool in modern machine learning to metrize the space of probability distributions in a way that takes into account the geometry of the domain.
Accordingly, much attention has been devoted in the literature to understanding rates of convergence for Wasserstein distances based on i.i.d. data. However, in many machine learning applications, especially in reinforcement learning, object tracking, performative prediction, and other online learning problems, observations arrive sequentially, introducing inherent temporal dependence. Motivated by this observation, we study the problem of estimating Wasserstein distances using the natural plug-in estimator based on stationary beta-mixing sequences, a widely used assumption in the study of dependent processes. Our convergence rate results apply under both short- and long-range dependence. As expected, under short-range dependence the rates match those observed in the i.i.d. case. Interestingly, however, even under long-range dependence, we show that the rates can match those in the i.i.d. case provided the (intrinsic) dimension is large enough.
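For distributions on the line with equal sample sizes, the plug-in 1-Wasserstein estimate is simply the mean absolute difference of order statistics. The sketch below (our illustration, not from the talk) applies it to a geometrically beta-mixing AR(1) sample:

```python
import numpy as np

rng = np.random.default_rng(3)

def w1_plugin(x, y):
    # Exact 1-Wasserstein distance between two empirical measures on the line
    # with equal sample sizes: mean absolute difference of order statistics.
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def ar1(n, rho=0.8):
    # Stationary Gaussian AR(1): geometrically beta-mixing, i.e. a standard
    # example of short-range dependence.
    z = np.empty(n)
    z[0] = rng.normal(scale=1.0 / np.sqrt(1.0 - rho**2))
    for t in range(1, n):
        z[t] = rho * z[t - 1] + rng.normal()
    return z

# Plug-in estimate between a dependent sample and an i.i.d. sample drawn from
# the same stationary law N(0, 1/(1 - rho^2)); it shrinks as n grows.
n = 20_000
sigma = 1.0 / np.sqrt(1.0 - 0.8**2)
est = w1_plugin(ar1(n), rng.normal(scale=sigma, size=n))
print(est)
```

The dependence inflates the effective sample size constant but, consistent with the short-range-dependence result described above, does not change the order of the rate.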
Keywords
Entropy regularized optimal transport
McKean-Vlasov diffusion
mirror descent
parabolic Monge-Ampère
Sinkhorn algorithm
Wasserstein mirror gradient flow