2.5. Local optimization#
Grid search works by probing a multitude of uncorrelated points to find an approximate minimum of a function. This is, however, just one way to solve an optimization problem. Other approaches are based on local optimization, where a single point is progressively refined by pulling it toward lower and lower values of the cost function, until a global or local minimum is reached. This progressive refinement can be designed to be extremely effective, making it more scalable than grid search.
2.5.1. Sequential refinement#
A generic sequential refinement scheme works as follows. Initially, a starting point \({\bf w}_0\) is selected, randomly or in any other way. Then, a “descent direction” at \({\bf w}_0\) is found. This is a vector \({\bf d}_0\) pointing in a direction that decreases the value taken by the function \(J({\bf w})\) at the point \({\bf w}_0\). By doing so, the first update can be literally computed as \({\bf w}_1 = {\bf w}_0 + {\bf d}_0\). Thereafter, a new descent direction \({\bf d}_1\) is found at \({\bf w}_1\), so as to compute the second update \({\bf w}_2 = {\bf w}_1 + {\bf d}_1\). This process is repeated several times, producing the following sequence of points:

\[
{\bf w}_0, \quad {\bf w}_1 = {\bf w}_0 + {\bf d}_0, \quad {\bf w}_2 = {\bf w}_1 + {\bf d}_1, \quad \dots, \quad {\bf w}_{k+1} = {\bf w}_{k} + {\bf d}_{k}.
\]
Here, it is important that the descent directions are selected so as to generate a decreasing sequence

\[
J({\bf w}_0) > J({\bf w}_1) > J({\bf w}_2) > \dots
\]
There are numerous ways of determining proper descent directions, which distinguish local optimization methods from each other. In this section, we will discuss one specific strategy in more detail, called random search.
2.5.2. Random search#
The laziest strategy to find a descent direction consists of looking around in random directions, until a point lower on the function is found. This amounts to continuously updating a point \({\bf w}_{k}\) and a step-size \(\alpha_k\) through the following steps.
Determine a set of random directions \({\bf d}_{k}^{(1)},\dots,{\bf d}_{k}^{(S)}\).
Compute the update \({\bf \bar{w}}_{k+1}^{(s)} = {\bf w}_{k}+\alpha_k \,{\bf d}_k^{(s)}\) for every direction \(s\in\{1,\dots,S\}\).
Choose the point \({\bf \bar{w}}_{k+1}^{(s^*)}\) with the smallest value among \(J({\bf \bar{w}}_{k+1}^{(1)}), \dots, J({\bf \bar{w}}_{k+1}^{(S)})\).
Set \({\bf w}_{k+1} = {\bf \bar{w}}_{k+1}^{(s^*)}\) if it attains a lower value than \({\bf w}_{k}\) on the function; otherwise, keep \({\bf w}_{k+1} = {\bf w}_{k}\).
Adjust the step-size \(\alpha_{k+1}\) and repeat the process.
This idea is illustrated in the pictures below on a two-dimensional function. At each iteration, \(S=4\) candidate directions (white arrows) are randomly sampled, and the best one is selected as a descent direction (strong white arrow). The latter is then used to update the current point (red circle), from which the process starts over.




2.5.2.1. Step 1 - Directions#
How do we choose a set of random directions at the \(k\)-th iteration? We can draw vectors from a normal distribution with zero mean and unit variance:

\[
{\bf d}_{k}^{(s)} \sim \mathcal{N}({\bf 0},{\bf I}), \qquad s\in\{1,\dots,S\}.
\]
The only issue with doing so is that each candidate direction would have a different length. We overcome this issue by normalizing them to have unit length:

\[
{\bf d}_{k}^{(s)} \leftarrow \frac{{\bf d}_{k}^{(s)}}{\|{\bf d}_{k}^{(s)}\|}, \qquad s\in\{1,\dots,S\}.
\]
This ensures that the random directions are sampled from a circle with unit radius (more generally, a unit hypersphere) centered at \({\bf w}_{k}\).
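As a quick illustration, here is a minimal NumPy sketch of this step, using \(S=4\) directions in \(N=2\) dimensions to match the figures (these values are illustrative choices, not requirements):

import numpy as np

S, N = 4, 2                                      # number of directions, dimension
directions = np.random.randn(S, N)               # rows drawn from a normal distribution
directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # rescale each row to unit length

print(np.linalg.norm(directions, axis=1))        # all equal to 1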


2.5.2.2. Step 2 - Selection#
Once the candidate directions are selected, they are scaled by the step-size \(\alpha_k\), which controls how far we travel along a given direction. Then, we test all the candidate directions and take the one providing the greatest decrease of the function:

\[
s^{*} = \arg\min_{s\in\{1,\dots,S\}} J\big({\bf w}_{k}+\alpha_k \,{\bf d}_k^{(s)}\big).
\]
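In code, the selection step could look like the minimal sketch below. The current point and step-size are arbitrary placeholders, and the cost function is the one used in the usage example further down:

import numpy as np

cost_fun = lambda w: (2*w[0]+1)**2 + (w[1]+2)**2   # example cost function
w_k = np.array([1.0, 3.0])                         # current point (placeholder)
alpha_k = 0.1                                      # step-size (placeholder)

S = 4
directions = np.random.randn(S, w_k.size)
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

candidates = w_k + alpha_k * directions            # one candidate point per direction
costs = [cost_fun(w) for w in candidates]          # cost at each candidate
s_star = np.argmin(costs)                          # index of the best direction
print("best direction:", directions[s_star])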


2.5.2.3. Step 3 - Update#
With the selected direction, we perform a tentative update

\[
{\bf \bar{w}}_{k+1} = {\bf w}_{k}+\alpha_k \,{\bf d}_k^{(s^*)}.
\]

We accept \({\bf \bar{w}}_{k+1}\) as the new point only if it decreases the function, otherwise we discard it and start over:

\[
{\bf w}_{k+1} =
\begin{cases}
{\bf \bar{w}}_{k+1} & \text{if } J({\bf \bar{w}}_{k+1}) < J({\bf w}_{k}),\\
{\bf w}_{k} & \text{otherwise}.
\end{cases}
\]

2.5.2.4. Step 4 - Step-size#
The step-size \(\alpha_k\) is a positive value that controls the distance travelled at each iteration, in the sense that

\[
\|{\bf \bar{w}}_{k+1}-{\bf w}_{k}\| = \alpha_k\,\|{\bf d}_k^{(s^*)}\| = \alpha_k,
\]

since the descent directions have unit length.
The choice of step-size is essential to guarantee that the sequence \({\bf w}_0,{\bf w}_1,\dots,{\bf w}_k\) converges to a (global or local) minimum of the function. While it may be tempting to select a constant step-size, this approach only works with sufficiently small values, at the expense of requiring many iterations before arriving at convergence. Even so, the algorithm can only get so close to a solution, because the update on the current point cannot be smaller than the step-size itself.
A better approach consists of adapting the step-size as the algorithm goes. The 1/5-th success rule is one of the oldest strategies designed for this purpose. It consists of increasing the step-size after a successful step and decreasing it after a failure, so as to keep a success probability of 1/5. Concretely, the step-size is updated at the end of each iteration as follows:

\[
\alpha_{k+1} =
\begin{cases}
1.5\,\alpha_k & \text{if the step is successful, i.e., } J({\bf w}_{k+1}) < J({\bf w}_{k}),\\
1.5^{-1/4}\,\alpha_k & \text{otherwise}.
\end{cases}
\]
The 1/5-th success rule ensures that \(\alpha_k\to0\) as \(k\to+\infty\), so the algorithm eventually stops (most likely at a global or local minimum). Note that the factor 1.5
may be replaced with any value strictly greater than one.
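To see why these two factors balance out around a success rate of 1/5, note that one successful step followed by four failures leaves the step-size unchanged:

\[
1.5 \cdot \big(1.5^{-1/4}\big)^{4}\,\alpha_k = \alpha_k.
\]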
2.5.3. Numerical implementation#
The following is a Python implementation of random search. Each step of the algorithm is translated into one or several lines of code. The NumPy library is used to vectorize operations whenever possible.
import numpy as np

def random_search(cost_fun, init, alpha, epochs, n_points, factor=1.5):
    """
    Finds the minimum of a function.

    Arguments:
    cost_fun -- cost function | callable that takes 'w' and returns 'J(w)'
    init     -- initial point | vector of shape (N,)
    alpha    -- step-size     | scalar
    epochs   -- n. iterations | integer
    n_points -- n. directions | integer
    factor   -- step adjuster | scalar greater than 1

    Returns:
    w -- minimum point | vector of shape (N,)
    """
    assert factor > 1, "The factor must be greater than one"

    # Initialization
    w = np.array(init).ravel()  # vector (1d array)
    J = cost_fun(w)

    # Refinement loop
    for k in range(epochs):

        # Step 1. Choose random directions
        directions = np.random.randn(n_points, w.size)
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)

        # Step 2. Select the best direction
        w_all = w + alpha * directions
        costs = [cost_fun(wk) for wk in w_all]
        i_min = np.argmin(costs)

        # Steps 3 & 4. Update point and step-size
        if costs[i_min] < J:
            w = w_all[i_min]
            J = costs[i_min]
            alpha *= factor
        else:
            alpha /= factor**0.25

    return w
The Python implementation requires five arguments:
the function to optimize,
a starting point,
the initial step-size that controls the distance travelled at each iteration,
the number of iterations to be performed,
the number of directions to test per iteration.
A usage example is presented next.
# cost function of two variables
cost_fun = lambda w: (2*w[0]+1)**2 + (w[1]+2)**2
init = [1, 3] # starting point
alpha = 0.01 # initial step-size
epochs = 20 # number of iterations
n_points = 10 # number of directions per iteration
# numerical optimization
w = random_search(cost_fun, init, alpha, epochs, n_points)
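For reference, this cost function attains its minimum at \({\bf w}^{*}=(-0.5,-2)\), so the output can be checked against it. With so few iterations and a small initial step-size, the returned point may only be an approximation:

print(w)   # should move toward the exact minimizer [-0.5, -2]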
The process of optimizing a function with random search is illustrated below. The white dots mark the sequence of points generated by the algorithm to find its final approximation of the minimum. While these points may seem far fewer than those probed by grid search, keep in mind that random search tests a bunch of directions at each iteration. These directions are not shown in the picture, but they definitely contribute to the overall complexity of the method.

2.5.4. Curse of dimensionality#
Random search is also crippled by the curse of dimensionality, although the phenomenon is somewhat less severe than for grid search. The issue here is that the chance of randomly selecting a descent direction gets lower and lower as the dimension \(N\) of the cost function increases. For instance, the descent probability falls below \(1\%\) when \(N=30\). Another big issue with random search lies in the highly inefficient way in which descent directions are found. If we had a computationally cheaper way of finding descent directions, then the local optimization paradigm would work extremely well. This is precisely what we will do in later chapters, which are dedicated to the pursuit of good and computationally cheap descent directions, leading to a powerful algorithm called gradient descent.
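To get a feel for this effect, the minimal experiment below estimates the fraction of random unit directions that actually decrease the simple quadratic \(J({\bf w})=\Vert{\bf w}\Vert^2\) as the dimension grows. The helper name, the unit-norm starting point, and the choice of a step equal to the distance from the minimizer are all illustrative assumptions, so the exact numbers differ with the setup, but the downward trend is the point:

import numpy as np

def estimate_descent_probability(N, alpha=1.0, trials=200_000, seed=0):
    """Estimate the probability that a random unit direction, scaled by a
    fixed step-size, decreases J(w) = ||w||^2 at a point w of unit norm."""
    rng = np.random.default_rng(seed)
    w = np.zeros(N)
    w[0] = 1.0                                     # any unit-norm point works by symmetry
    d = rng.standard_normal((trials, N))
    d /= np.linalg.norm(d, axis=1, keepdims=True)  # random unit directions
    costs = np.sum((w + alpha * d)**2, axis=1)     # J at the candidate points
    return np.mean(costs < 1.0)                    # fraction of successful steps

for N in [2, 5, 10, 30]:
    print(N, estimate_descent_probability(N))

Under these assumptions, the estimated success probability drops quickly as \(N\) grows, which is consistent with the trend described above.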