Choice of step-size

Gradient descent is a first-order local optimization algorithm for finding a minimum of a differentiable function. It is based on the observation that a function decreases fastest in the direction of its negative gradient, so this direction can be followed to reach a minimum. Mathematically, this iterative process can be summarized as

\[ \textbf{w}_{k+1} = \textbf{w}_k - \alpha \nabla J(\textbf{w}_k). \]
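In code, the update rule amounts to a simple loop. The sketch below is one possible implementation, assuming NumPy is available; the function name `gradient_descent`, the quadratic cost \(J({\bf w}) = w_1^2 + w_2^2\), the starting point, and the step-size are illustrative choices, not prescribed by this chapter.

```python
import numpy as np

def gradient_descent(grad, w0, alpha, num_iters):
    """Iterate w_{k+1} = w_k - alpha * grad(w_k) and record every point."""
    w = np.asarray(w0, dtype=float)
    history = [w.copy()]
    for _ in range(num_iters):
        w = w - alpha * grad(w)          # the gradient descent step
        history.append(w.copy())
    return history

# Illustrative cost J(w) = w_1^2 + w_2^2, whose gradient is 2w.
grad_J = lambda w: 2.0 * w
points = gradient_descent(grad_J, w0=[3.0, -2.0], alpha=0.1, num_iters=25)
print(points[-1])   # close to the minimizer [0, 0]
```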

Gradient descent is equipped with a parameter \(\alpha\) that controls the size of the steps taken in the negative gradient direction. The choice of this step-size parameter is critical to ensure the convergence of gradient descent. Indeed, only a sufficiently small step-size guarantees that the new point \({\bf w}_{k+1}\) lies lower on the function than \({\bf w}_k\), that is, \(J({\bf w}_{k+1}) < J({\bf w}_k)\). If the step-size is too large, the sequence of points thereby generated can oscillate wildly and never reach a minimum. A step-size that is too small is also problematic, because the sequence then moves so slowly that far too many iterations are required to reach a minimum. Both of these scenarios are illustrated in the figure below.
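To make the effect of the step-size concrete, here is a small assumed experiment (not taken from the chapter) on the one-dimensional cost \(J(w) = w^2\), whose gradient is \(2w\): a tiny step-size barely makes progress, a moderate one converges quickly, and an overly large one makes the iterates oscillate with growing magnitude.

```python
# Assumed experiment on J(w) = w^2 (gradient 2w); step-sizes are illustrative.
def descend(alpha, w0=1.0, num_iters=20):
    w = w0
    for _ in range(num_iters):
        w = w - alpha * 2.0 * w      # w_{k+1} = w_k - alpha * J'(w_k)
    return w ** 2                    # final cost J(w)

print(descend(alpha=0.01))   # too small: cost is still around 0.45 after 20 steps
print(descend(alpha=0.40))   # well chosen: cost is essentially zero
print(descend(alpha=1.10))   # too large: the iterates oscillate and blow up
```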

Contents

This chapter presents the main strategies for choosing the step-size so that gradient descent converges.

  • You will learn how to choose a constant step-size by trial and error.

  • You will see that the step-size can be related to the maximum curvature of a function.

  • You will recognize when to use a schedule to diminish the step-size during the optimization.

  • You will discover some clever techniques to adapt the step-size during the optimization.

Acknowledgment

Part of this chapter is based on the supplementary material for the textbook Machine Learning Refined (1st ed., 2016) written by Jeremy Watt, Reza Borhani, and Aggelos K. Katsaggelos.