Gradient Descent with Polyak's Momentum Finds Flatter Minima via Large Catapults

Jun 17, 2024

Prin Phunyaphibarn

Equal contribution

Junghyun Lee

Equal contribution

Bohan Wang

Huishuai Zhang

Chulhee Yun


Abstract

Although gradient descent with Polyak’s momentum is widely used in modern machine learning and deep learning, a concrete understanding of its effects on the training trajectory remains elusive. In this work, we empirically show that for linear diagonal networks and nonlinear neural networks, momentum gradient descent with a large learning rate displays large catapults, driving the iterates towards much flatter minima than those found by gradient descent. We hypothesize that the large catapult is caused by momentum “prolonging” the self-stabilization effect (Damian et al., 2023). We provide theoretical and empirical support for our hypothesis in a simple toy example and empirical evidence supporting our hypothesis for linear diagonal networks.
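
For readers unfamiliar with the update rule, the following is a minimal sketch of Polyak's (heavy-ball) momentum, x_{t+1} = x_t - lr * grad(x_t) + beta * (x_t - x_{t-1}), together with a sharpness measure (largest Hessian eigenvalue). The two-parameter loss L(u, v) = 0.5 * (u * v - 1)^2, the function names, and all hyperparameters are assumptions chosen for illustration; they are not taken from the paper and do not reproduce its toy example or its experiments.

```python
import numpy as np


def heavy_ball(grad, x0, lr, beta, steps):
    """Run Polyak's heavy-ball momentum and return the full trajectory.

    Update: x_{t+1} = x_t - lr * grad(x_t) + beta * (x_t - x_{t-1}).
    Setting beta = 0 recovers plain gradient descent.
    """
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()  # x_{-1} = x_0, so the first step is a plain GD step
    trajectory = [x.copy()]
    for _ in range(steps):
        x_next = x - lr * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
        trajectory.append(x.copy())
    return np.stack(trajectory)


# Illustrative loss (an assumption, not the paper's example):
# L(u, v) = 0.5 * (u * v - 1)^2, whose minima form the curve u * v = 1.
def toy_grad(w):
    u, v = w
    residual = u * v - 1.0
    return np.array([residual * v, residual * u])


def toy_sharpness(w):
    """Largest eigenvalue of the Hessian of the illustrative loss at w."""
    u, v = w
    off_diag = 2.0 * u * v - 1.0
    hessian = np.array([[v * v, off_diag], [off_diag, u * u]])
    return np.linalg.eigvalsh(hessian)[-1]


if __name__ == "__main__":
    start = (1.5, 0.5)
    gd = heavy_ball(toy_grad, start, lr=0.1, beta=0.0, steps=200)
    hb = heavy_ball(toy_grad, start, lr=0.1, beta=0.5, steps=200)
    print("GD final sharpness:        ", toy_sharpness(gd[-1]))
    print("Heavy-ball final sharpness:", toy_sharpness(hb[-1]))
```

On this loss, the sharpness at a minimum with u * v = 1 equals u^2 + v^2, so comparing `toy_sharpness` at the final iterates of the two runs gives a simple way to see which minimum is flatter. The small learning rate used here is only meant to make the sketch converge; the large-learning-rate regime studied in the paper requires its own setup.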

Type

Conference paper

Publication

ICML 2024 - 2nd Workshop on High-dimensional Learning Dynamics (HiLD); NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning (M3L) - Oral Presentation

Previous title: Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study

Last updated on Jun 17, 2024

Deep Learning Theory Optimization

Authors

Junghyun Lee

PhD Candidate in AI

PhD candidate at KAIST AI, jointly advised by Se-Young Yun and Chulhee Yun. I work on interactive machine learning, theoretical aspects of LLMs, learning/optimization theory, and statistical analysis of large networks.

