2.5. Local optimization#

Grid search works by probing a multitude of uncorrelated points to find an approximate minimum of a function. This is, however, just one way to solve an optimization problem. Other approaches are based on local optimization, where a single point is progressively refined by pulling it toward lower and lower values of the cost function, until a global or local minimum is reached. This progressive refinement can be made extremely effective, which makes local optimization far more scalable than grid search.

2.5.1. Sequential refinement#

A generic sequential refinement scheme works as follows. Initially, a starting point \({\bf w}_0\) is selected, either randomly or by some other rule. Then, a “descent direction” at \({\bf w}_0\) is found. This is a vector \({\bf d}_0\) pointing in a direction that decreases the value taken by the function \(J({\bf w})\) at the point \({\bf w}_0\). The first update is then simply computed as \({\bf w}_1 = {\bf w}_0 + {\bf d}_0\). Thereafter, a new descent direction \({\bf d}_1\) is found at \({\bf w}_1\), so as to compute the second update \({\bf w}_2 = {\bf w}_1 + {\bf d}_1\). This process is repeated several times, producing the following sequence of points:

\[\begin{split}\begin{aligned} {\bf w}_0 & \\ {\bf w}_1 &= {\bf w}_0 + {\bf d}_0 \\ {\bf w}_2 &= {\bf w}_1 + {\bf d}_1 \\ &\;\vdots\\ {\bf w}_{k+1} &= {\bf w}_k + {\bf d}_k. \end{aligned}\end{split}\]

Here, it is important that the descent directions be selected so as to generate a decreasing sequence

\[J({\bf w}_0) > J({\bf w}_1) > \dots > J({\bf w}_{k+1}).\]

There are numerous ways of determining proper descent directions, which distinguish local optimization methods from each other. In this section, we will discuss one specific strategy in more detail, called random search.
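To make the scheme concrete, here is a minimal sketch of the generic refinement loop in Python. The function find_descent_direction is a hypothetical placeholder, not part of the implementation presented later: each local optimization method supplies its own rule for producing a direction that decreases the cost at the current point.

import numpy as np

# Generic sequential refinement loop (sketch).
# 'find_descent_direction' is a hypothetical placeholder standing in for
# whatever rule a particular local method uses to pick d_k at w_k.
def sequential_refinement(cost_fun, find_descent_direction, init, epochs):
    w = np.array(init, dtype=float)
    for k in range(epochs):
        d = find_descent_direction(cost_fun, w)  # method-specific choice of d_k
        w = w + d                                # w_{k+1} = w_k + d_k
    return w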

2.5.3. Numerical implementation#

The following is a Python implementation of random search. Each step of the algorithm is translated into one or several lines of code. The NumPy library is used to vectorize operations whenever possible.

import numpy as np

def random_search(cost_fun, init, alpha, epochs, n_points, factor=1.5):
    """
    Finds the minimum of a function.
    
    Arguments:
    cost_fun -- cost function | callable that takes 'w' and returns 'J(w)'
    init     -- initial point | vector of shape (N,)
    alpha    -- step-size     | scalar
    epochs   -- n. iterations | integer
    n_points -- n. directions | integer
    factor   -- step adjuster | scalar greater than 1
    
    Returns:
    w -- minimum point | vector of shape (N,)
    """
    assert factor > 1, "The factor must be greater than one"
    
    # Initialization
    w = np.array(init).ravel()  # vector (1d array)
    J = cost_fun(w)
    
    # Refinement loop
    for k in range(epochs):
        
        # Step 1. Choose random directions
        directions  = np.random.randn(n_points, w.size)
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)

        # Step 2. Select the best direction
        w_all = w + alpha * directions
        costs = [cost_fun(wk) for wk in w_all]
        i_min = np.argmin(costs)

        # Steps 3 & 4. Update point and step-size
        if costs[i_min] < J:
            w = w_all[i_min]
            J = costs[i_min]
            alpha *= factor
        else:
            alpha /= factor**0.25
            
    return w

The Python implementation requires five arguments (plus an optional factor that controls how the step-size is adapted):

  • the function to optimize,

  • a starting point,

  • the initial step-size that controls the distance travelled at each iteration,

  • the number of iterations to be performed,

  • the number of directions to test per iteration.

A usage example is presented next.

# cost function of two variables
cost_fun = lambda w: (2*w[0]+1)**2 + (w[1]+2)**2

init = [1, 3]  # starting point
alpha  = 0.01  # initial step-size
epochs = 20    # number of iterations
n_points = 10  # number of directions per iteration 
    
# numerical optimization
w = random_search(cost_fun, init, alpha, epochs, n_points)
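
The exact minimizer of this cost function is \({\bf w} = (-0.5, -2)\). Since random search is stochastic, the returned point only approximates it; a quick check such as the one below (an illustrative addition, not part of the original example) shows how close the algorithm gets:

print("estimated minimum:", w)           # expected to lie near [-0.5, -2]
print("cost at estimate: ", cost_fun(w))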

The process of optimizing a function with random search is illustrated below. The white dots mark the sequence of points generated by the algorithm to find its final approximation of the minimum. While these points may seem far fewer than those probed by grid search, keep in mind that random search tests a batch of candidate directions at each iteration. These directions are not shown in the picture, but each of them requires a cost-function evaluation and thus contributes to the overall complexity of the method.

[Figure: the sequence of points (white dots) generated by random search while minimizing the cost function.]

2.5.4. Curse of dimensionality#

Random search is also crippled by the curse of dimensionality, although the effect is less severe than for grid search. The issue here is that the probability of randomly drawing a descent direction shrinks as the dimension \(N\) of the cost function's input grows. For instance, the descent probability falls below \(1\%\) when \(N=30\), a trend illustrated by the small experiment at the end of this section. Another major issue with random search is the highly inefficient way in which descent directions are found: many candidate directions must be evaluated just to find one that works. If we had a computationally cheaper way of finding descent directions, the local optimization paradigm would work extremely well. This is precisely what we will do in later chapters, which are dedicated to finding good and computationally cheap descent directions, leading to a powerful algorithm called gradient descent.
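
The shrinking descent probability can be checked empirically. The sketch below is a rough Monte Carlo experiment under illustrative assumptions (a simple quadratic cost \(J({\bf w}) = \|{\bf w}\|^2\), the point \({\bf w} = (1, 0, \dots, 0)\), and a step length of 1); the exact numbers depend on these choices, but the downward trend with \(N\) is the point.

import numpy as np

def estimate_descent_probability(N, alpha=1.0, n_trials=100_000, seed=0):
    # Fraction of random unit directions that decrease the quadratic cost
    # J(w) = ||w||^2 at w = (1, 0, ..., 0) when taking a step of length alpha.
    # Illustrative setup: results vary with the cost, the point, and alpha.
    rng = np.random.default_rng(seed)
    w = np.zeros(N)
    w[0] = 1.0
    directions = rng.standard_normal((n_trials, N))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    costs = np.sum((w + alpha * directions) ** 2, axis=1)
    return np.mean(costs < np.sum(w ** 2))

for N in (2, 5, 10, 30, 100):
    print(f"N = {N:3d}   descent probability ≈ {estimate_descent_probability(N):.3f}")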