A Larger Stepsize Improves Gradient Descent in Classification Problems

Matus Telgarsky, Co-Author
New York University

Bin Yu, Co-Author
University of California, Berkeley

Peter Bartlett, Co-Author
University of California, Berkeley

Jingfeng Wu, First Author and Presenting Author
 
Sunday, Aug 4: 2:05 PM - 2:20 PM
3762 
Contributed Papers 
Oregon Convention Center 
Gradient Descent (GD) and Stochastic Gradient Descent (SGD) are pivotal in machine learning, particularly in neural network optimization. Conventional wisdom suggests smaller stepsizes for stability, yet in practice, larger stepsizes often yield faster convergence and improved generalization, despite initial instability. This talk examines the dynamics of GD for logistic regression with linearly separable data in the setting where the stepsize η is constant but large, so that the loss initially oscillates. We show that GD exits this initial oscillatory phase within O(η) iterations and subsequently achieves a risk of Õ(1 / (t η)) at iteration t. This analysis reveals that, without employing momentum techniques or variable stepsize schedules, GD can achieve an accelerated error rate of Õ(1/T^2) after T iterations with a stepsize of η = Θ(T). In contrast, if the stepsize is small enough that the loss does not oscillate, we show an Ω(1/T) lower bound. Our results further extend to general classification loss functions, nonlinear models in the neural tangent kernel regime, and SGD with large stepsizes. Our results are validated with experiments on neural networks.
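
The following is a minimal sketch (not the authors' code or experiments) of the regime described above: GD on the logistic loss over synthetic linearly separable data, comparing a conventional small stepsize with a large one whose loss may oscillate early before decreasing. The data-generation scheme, stepsizes, and iteration counts are illustrative assumptions.

```python
# Sketch: large- vs small-stepsize gradient descent on logistic loss
# with linearly separable data. Settings here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable data: labels given by the sign of the first coordinate.
n, d = 200, 5
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])
y[y == 0] = 1.0

def logistic_loss(w):
    # Average logistic loss; logaddexp(0, -m) = log(1 + exp(-m)), computed stably.
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))

def gradient(w):
    # Gradient of the average logistic loss; sigmoid(-m) written via tanh for stability.
    margins = y * (X @ w)
    coeffs = -y * 0.5 * (1.0 - np.tanh(margins / 2.0))
    return (X.T @ coeffs) / n

def run_gd(stepsize, iters=300):
    # Plain constant-stepsize GD from the zero initialization; records the loss per step.
    w = np.zeros(d)
    losses = []
    for _ in range(iters):
        losses.append(logistic_loss(w))
        w = w - stepsize * gradient(w)
    return losses

small = run_gd(stepsize=1.0)    # conventional small stepsize
large = run_gd(stepsize=100.0)  # large stepsize, the regime discussed in the talk

print(f"small stepsize: final loss = {small[-1]:.2e}")
print(f"large stepsize: final loss = {large[-1]:.2e}")
print("large-stepsize loss increased at some step:",
      any(b > a for a, b in zip(large, large[1:])))
```

To mimic the accelerated regime in the abstract, where η = Θ(T) yields an Õ(1/T^2) rate after T iterations, one would choose the stepsize proportional to the iteration budget, e.g. run_gd(stepsize=c * iters) for some constant c.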

Keywords

logistic regression

gradient descent

optimization

neural network

acceleration

edge of stability 

Main Sponsor

IMS