A Larger Stepsize Improves Gradient Descent in Classification Problems

Matus Telgarsky, Co-Author
New York University

Bin Yu, Co-Author
University of California, Berkeley

Peter Bartlett, Co-Author
University of California, Berkeley

Jingfeng Wu, First Author and Presenting Author
 
Sunday, Aug 4: 2:05 PM - 2:20 PM
3762 
Contributed Papers 
Oregon Convention Center 
Gradient Descent (GD) and Stochastic Gradient Descent (SGD) are pivotal in machine learning, particularly in neural network optimization. Conventional wisdom suggests smaller stepsizes for stability, yet in practice, larger stepsizes often yield faster convergence and improved generalization, despite initial instability. This talk examines the dynamics of GD for logistic regression with linearly separable data in the setting where the stepsize η is constant but large, so that the loss initially oscillates. We show that GD exits this initial oscillatory phase within O(η) iterations and subsequently achieves a risk of Õ(1 / (t η)) at iteration t. This analysis reveals that, without employing momentum techniques or variable stepsize schedules, GD can achieve an accelerated error rate of Õ(1/T^2) after T iterations with a stepsize of η = Θ(T). In contrast, if the stepsize is small enough that the loss does not oscillate, we show an Ω(1/T) lower bound. Our results further extend to general classification loss functions, nonlinear models in the neural tangent kernel regime, and SGD with large stepsizes. Our results are validated with experiments on neural networks.
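
The following is a minimal sketch (not the authors' code or experiments) of the regime described above: GD on the logistic loss over synthetic linearly separable data, comparing a conventional small stepsize with a large one whose loss may oscillate early before decreasing. The data-generation scheme, stepsizes, and iteration counts are illustrative assumptions.

```python
# Sketch: large- vs small-stepsize gradient descent on logistic loss
# with linearly separable data. Settings here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable data: labels given by the sign of the first coordinate.
n, d = 200, 5
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])
y[y == 0] = 1.0

def logistic_loss(w):
    # Average logistic loss; logaddexp(0, -m) = log(1 + exp(-m)), computed stably.
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))

def gradient(w):
    # Gradient of the average logistic loss; sigmoid(-m) written via tanh for stability.
    margins = y * (X @ w)
    coeffs = -y * 0.5 * (1.0 - np.tanh(margins / 2.0))
    return (X.T @ coeffs) / n

def run_gd(stepsize, iters=300):
    # Plain constant-stepsize GD from the zero initialization; records the loss per step.
    w = np.zeros(d)
    losses = []
    for _ in range(iters):
        losses.append(logistic_loss(w))
        w = w - stepsize * gradient(w)
    return losses

small = run_gd(stepsize=1.0)    # conventional small stepsize
large = run_gd(stepsize=100.0)  # large stepsize, the regime discussed in the talk

print(f"small stepsize: final loss = {small[-1]:.2e}")
print(f"large stepsize: final loss = {large[-1]:.2e}")
print("large-stepsize loss increased at some step:",
      any(b > a for a, b in zip(large, large[1:])))
```

To mimic the accelerated regime in the abstract, where η = Θ(T) yields an Õ(1/T^2) rate after T iterations, one would choose the stepsize proportional to the iteration budget, e.g. run_gd(stepsize=c * iters) for some constant c.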

Keywords

logistic regression

gradient descent

optimization

neural network

acceleration

edge of stability 

Main Sponsor

IMS