Gradient Descent with Polyak's Momentum Finds Flatter Minima via Large Catapults

Jun 17, 2024 · Prin Phunyaphibarn (equal contribution), Junghyun Lee (equal contribution), Bohan Wang, Huishuai Zhang, Chulhee Yun
Abstract
Although gradient descent with Polyak’s momentum is widely used in modern machine and deep learning, a concrete understanding of its effects on the training trajectory remains elusive. In this work, we empirically show that for linear diagonal networks and nonlinear neural networks, momentum gradient descent with a large learning rate displays large catapults, driving the iterates towards much flatter minima than those found by gradient descent. We hypothesize that the large catapult is caused by momentum “prolonging” the self-stabilization effect (Damian et al., 2023). We support this hypothesis both theoretically and empirically in a simple toy example, and empirically for linear diagonal networks.
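For readers unfamiliar with the update rule named in the abstract, the following is a minimal sketch of gradient descent with Polyak's (heavy-ball) momentum in its standard textbook form. The function name `heavy_ball`, the quadratic toy objective, and the hyperparameter values are illustrative assumptions, not the paper's experimental setup, and the sketch does not attempt to reproduce the catapult behavior.

```python
import numpy as np

def heavy_ball(grad, x0, lr, beta, steps):
    """Standard heavy-ball iteration (hypothetical helper):
    x_{t+1} = x_t - lr * grad(x_t) + beta * (x_t - x_{t-1}).
    Setting beta = 0 recovers plain gradient descent."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()  # x_{-1} = x_0, so the first step is a plain GD step
    trajectory = [x.copy()]
    for _ in range(steps):
        x_next = x - lr * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
        trajectory.append(x.copy())
    return np.array(trajectory)

# Assumed toy objective f(x) = 0.5 * x^T A x (not taken from the paper).
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x

traj_gd = heavy_ball(grad, [1.0, 1.0], lr=0.05, beta=0.0, steps=200)
traj_hb = heavy_ball(grad, [1.0, 1.0], lr=0.05, beta=0.9, steps=200)
```

Each returned array holds the iterates row by row, so `traj_gd` and `traj_hb` can be compared directly to see how the momentum term `beta * (x - x_prev)` changes the trajectory.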
Publication
ICML 2024 - 2nd Workshop on High-dimensional Learning Dynamics (HiLD), NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning (M3L) - Oral Presentation

Previous title: Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study

Authors
Junghyun Lee
PhD Candidate in Artificial Intelligence
PhD candidate at KAIST AI, jointly advised by Se-Young Yun and Chulhee Yun. I work on interactive machine learning, theoretical aspects of LLMs, learning/optimization theory, and statistical analysis of large networks.