7.2 Unconstrained Optimization

We turn now to minimization problems. The point x∗, the solution of (7.2), is called a global minimizer of f, while x∗ is a local minimizer of f if ∃R > 0 such that

f(x∗) ≤ f(x),  ∀x ∈ B(x∗; R).

Throughout this section we shall always assume that f ∈ C^1(R^n), and we refer to [Lem89] for the case in which f is non-differentiable. We shall denote by

∇f(x) = (∂f/∂x_1(x), . . . , ∂f/∂x_n(x))^T

the gradient of f at a point x. If d is a nonnull vector in R^n, then the directional derivative of f with respect to d is

∂f/∂d(x) = lim_{α→0} [f(x + αd) − f(x)] / α

and satisfies ∂f(x)/∂d = [∇f(x)]^T d. Moreover, denoting by (x, x + αd) the segment in R^n joining the points x and x + αd, with α ∈ R, Taylor's expansion ensures that ∃ξ ∈ (x, x + αd) such that

f(x + αd) − f(x) = α∇f(ξ)^T d.    (7.20)
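As a quick numerical illustration (an addition to the text, with an arbitrarily chosen test function), the identity ∂f(x)/∂d = [∇f(x)]^T d can be checked by comparing the limit definition, evaluated with a small α, against the gradient formula:

```python
import numpy as np

# Arbitrary smooth test function and its exact gradient (chosen for illustration)
f = lambda x: x[0]**2 + 3.0 * x[1]**2
grad_f = lambda x: np.array([2.0 * x[0], 6.0 * x[1]])

x = np.array([1.0, -0.5])
d = np.array([0.6, 0.8])                       # a nonnull direction

alpha = 1e-6
dd_limit = (f(x + alpha * d) - f(x)) / alpha   # the limit definition, small alpha
dd_grad = grad_f(x) @ d                        # [grad f(x)]^T d
print(dd_limit, dd_grad)                       # the two values agree up to O(alpha)
```

The two numbers differ by a term proportional to α, in agreement with the Taylor expansion (7.20).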

If f ∈ C^2(R^n), we shall denote by H(x) (or ∇^2 f(x)) the Hessian matrix of f evaluated at a point x, whose entries are

h_ij(x) = ∂^2 f(x) / (∂x_i ∂x_j),  i, j = 1, . . . , n.

In such a case it can be shown that, if d ≠ 0, the second-order directional derivative exists and we have

∂^2 f/∂d^2(x) = d^T H(x)d.

For a suitable ξ ∈ (x, x + d) we also have

f(x + d) − f(x) = ∇f(x)^T d + (1/2) d^T H(ξ)d.
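A quick numerical check (an added illustration with an arbitrarily chosen quadratic, not part of the original text): for a quadratic f the Hessian is constant, so the second-order expansion above holds exactly with any ξ in the segment.

```python
import numpy as np

# For a quadratic f the Hessian H is constant, so the expansion is exact
A = np.array([[2.0, 1.0], [1.0, 4.0]])    # symmetric Hessian of f(x) = x^T A x / 2
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

x = np.array([1.0, -0.5])
d = np.array([0.3, 0.2])
lhs = f(x + d) - f(x)
rhs = grad_f(x) @ d + 0.5 * d @ A @ d     # gradient term plus Hessian term
print(abs(lhs - rhs))                     # zero up to rounding
```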

Existence and uniqueness of solutions for (7.2) are not guaranteed in Rn .

Nevertheless, the following optimality conditions can be proved.

Property 7.4 Let x∗ ∈ R^n be a local minimizer of f and assume that f ∈ C^1(B(x∗; R)) for a suitable R > 0. Then ∇f(x∗) = 0. Moreover, if f ∈ C^2(B(x∗; R)) then H(x∗) is positive semidefinite. Conversely, if ∇f(x∗) = 0 and H(x∗) is positive definite, then x∗ is a local minimizer of f in B(x∗; R).

A point x∗ such that ∇f(x∗) = 0 is said to be a critical point for f. This condition is necessary for optimality to hold. However, this condition also becomes sufficient if f is a convex function on R^n, i.e., such that ∀x, y ∈ R^n and for any α ∈ [0, 1]

f[αx + (1 − α)y] ≤ αf(x) + (1 − α)f(y).    (7.21)

For further and more general existence results, see [Ber82].
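Inequality (7.21) can be probed numerically; the sketch below (an added illustration) samples random triples (x, y, α) for the convex function f(x) = ‖x‖^2, chosen here as an example.

```python
import numpy as np

# Random sampling check of inequality (7.21) for f(x) = ||x||^2 (convex on R^n)
f = lambda x: float(np.dot(x, x))

rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    a = rng.uniform()
    ok = ok and f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + 1e-12
print(ok)  # True: no sampled triple violates convexity
```

For this f the difference between the two sides equals a(1 − a)‖x − y‖^2 ≥ 0, so the inequality holds exactly (the 1e-12 slack only absorbs rounding).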

7.2.1 Direct Search Methods

In this section we deal with direct methods for solving problem (7.2), which

only require f to be continuous. In later sections, we shall introduce the

so-called descent methods, which also involve values of the derivatives of f

and have, in general, better convergence properties.

Direct methods are employed when f is not differentiable or if the computation of its derivatives is a nontrivial task. They can also be used to

provide an approximate solution to employ as an initial guess for a descent

method. For further details, we refer to [Wal75] and [Wol78].

The Hooke and Jeeves Method

Assume we are searching for the minimizer of f starting from a given initial point x(0) and requiring that the error on the residual is less than a certain fixed tolerance ε. The Hooke and Jeeves method computes a new point x(1) using the values of f at suitable points along the orthogonal coordinate directions around x(0). The method consists of two steps: an exploration step and an advancing step.

The exploration step starts by evaluating f(x(0) + h_1 e_1), where e_1 is the first vector of the canonical basis of R^n and h_1 is a positive real number to be suitably chosen.

If f(x(0) + h_1 e_1) < f(x(0)), then a success is recorded and the starting point is moved to x(0) + h_1 e_1, from which an analogous check is carried out at the point x(0) + h_1 e_1 + h_2 e_2, with h_2 ∈ R+.

296 7. Nonlinear Systems and Numerical Optimization

If, instead, f(x(0) + h_1 e_1) ≥ f(x(0)), then a failure is recorded and a similar check is performed at x(0) − h_1 e_1. If a success is registered, the method explores, as previously, the behavior of f in the direction e_2 starting from this new point, while, in case of a failure, the method passes directly to examining the direction e_2, keeping x(0) as the starting point for the exploration step.

To achieve a certain accuracy, the step lengths h_i must be selected in such a way that the quantities

|f(x(0) ± h_j e_j) − f(x(0))|,  j = 1, . . . , n    (7.22)

have comparable sizes.

The exploration step terminates as soon as all the n Cartesian directions have been examined. Therefore, the method generates a new point, y(0), after at most 2n + 1 functional evaluations. Only two possibilities may arise:

1. y(0) = x(0). In such a case, if max_{i=1,...,n} |h_i| ≤ ε the method terminates and yields the approximate solution x(0). Otherwise, the step lengths h_i are halved and another exploration step is performed starting from x(0);

2. y(0) ≠ x(0). If max_{i=1,...,n} |h_i| < ε, then the method terminates yielding y(0) as an approximate solution, otherwise the advancing step starts.

The advancing step consists of moving further from y(0) along the direction y(0) − x(0) (which is the direction that recorded the maximum decrease of f during the exploration step), rather than simply setting y(0) as the new starting point x(1). This new starting point is instead set equal to 2y(0) − x(0). From this point a new series of exploration moves is started. If this exploration leads to a point y(1) such that f(y(1)) < f(y(0)), then a new starting point for the next exploration step has been found; otherwise, the starting point for further explorations is set equal to x(1) = y(0).

The method is now ready to restart from the point x(1) just computed.

Program 60 provides an implementation of the Hooke and Jeeves method. The input parameters are the size n of the problem, the vector h of the initial steps along the Cartesian directions, the variable f containing the functional expression of f in terms of the components x(1), . . . , x(n), the initial point x0 and the stopping tolerance toll, equal to ε. In output, the code returns the approximate minimizer x of f, the value minf attained by f at x and the number nit of iterations needed to compute x up to the desired accuracy. The exploration step is performed by Program 61.


Program 60 - hookejeeves : The method of Hooke and Jeeves (HJ)

function [x,minf,nit]=hookejeeves(n,h,f,x0,toll)
x = x0; minf = eval(f); nit = 0;
while h > toll
   [y] = explore(h,n,f,x);
   if y == x, h = h/2; else
      x = 2*y-x; [z] = explore(h,n,f,x);
      if z == x, x = y; else, x = z; end
   end
   nit = nit + 1;
end
minf = eval(f);

Program 61 - explore : Exploration step in the HJ method

function [x]=explore(h,n,f,x0)
x = x0; f0 = eval(f);
for i=1:n
   x(i) = x(i) + h(i); fi = eval(f);
   if fi < f0, f0 = fi; else
      x(i) = x0(i) - h(i); fi = eval(f);
      if fi < f0, f0 = fi; else, x(i) = x0(i); end
   end
end
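For readers without MATLAB, the two programs above admit a direct transcription; the sketch below is an added adaptation, not one of the book's programs. It passes f as a callable instead of an expression string for eval, and uses the termination test on max_i h_i described in the text.

```python
import numpy as np

def explore(h, f, x0):
    # Exploration step (cf. Program 61): probe +h(i), then -h(i), per direction
    x = x0.copy()
    f0 = f(x)
    for i in range(len(x)):
        x[i] = x[i] + h[i]
        fi = f(x)
        if fi < f0:
            f0 = fi
        else:
            x[i] = x0[i] - h[i]
            fi = f(x)
            if fi < f0:
                f0 = fi
            else:
                x[i] = x0[i]      # both probes failed: restore the coordinate
    return x

def hooke_jeeves(h, f, x0, toll):
    # Main loop (cf. Program 60): exploration plus advancing (pattern) step
    x = np.asarray(x0, dtype=float)
    h = np.asarray(h, dtype=float)
    nit = 0
    while max(h) > toll:
        y = explore(h, f, x)
        if np.array_equal(y, x):
            h = h / 2             # exploration failed: halve the step lengths
        else:
            x = 2 * y - x         # advancing step along the direction y - x
            z = explore(h, f, x)
            x = y if np.array_equal(z, x) else z
        nit += 1
    return x, f(x), nit
```

For instance, on the convex quadratic f(x) = (x_1 − 1)^2 + (x_2 − 2)^2 with x0 = (0, 0)^T, h = (0.5, 0.5)^T and toll = 10^-6, the iterates reach the minimizer (1, 2)^T.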

The Method of Nelder and Mead

This method, proposed in [NM65], employs local linear approximants of f to generate a sequence of points x(k), approximations of x∗, starting from simple geometrical considerations. To explain the details of the algorithm, we begin by noticing that a plane in R^n is uniquely determined by fixing n + 1 points that do not all lie on the same hyperplane.

Denote such points by x(k), for k = 0, . . . , n. They could be generated as

x(k) = x(0) + h_k e_k,  k = 1, . . . , n    (7.23)

having selected the step lengths h_k ∈ R+ in such a way that the variations (7.22) are of comparable size.

Let us now denote by x(M), x(m) and x(µ) the points of the set {x(k)} at which f attains, respectively, its maximum value, its minimum value and the value immediately preceding the maximum. Moreover, denote by x_c^(k) the centroid of the point x(k), defined as

x_c^(k) = (1/n) Σ_{j=0, j≠k}^{n} x(j).


The method generates a sequence of approximations of x∗, starting from x(k), by employing only three possible transformations: reflections with respect to centroids, dilations and contractions. Let us examine the details of the algorithm, assuming that n + 1 initial points are available.

1. Determine the points x(M), x(m) and x(µ).

2. Compute as an approximation of x∗ the point

x̄ = [1/(n + 1)] Σ_{i=0}^{n} x(i)

and check if x̄ is sufficiently close (in a sense to be made precise) to x∗. Typically, one requires that the standard deviation of the values f(x(0)), . . . , f(x(n)) from

f̄ = [1/(n + 1)] Σ_{i=0}^{n} f(x(i))

is less than a fixed tolerance ε, that is

{ (1/n) Σ_{i=0}^{n} [f(x(i)) − f̄]^2 }^{1/2} < ε.

Otherwise, x(M) is reflected with respect to x_c^(M), that is, the following new point x(r) is computed

x(r) = (1 + α)x_c^(M) − αx(M),

where α ≥ 0 is a suitable reflection factor. Notice that the method has moved along the "opposite" direction to x(M). This statement has a geometrical interpretation in the case n = 2, since the points x(k) coincide with x(M), x(m) and x(µ). They thus define a plane whose slope points from x(M) towards x(m), and the method provides a step along this direction.

3. If f(x(m)) ≤ f(x(r)) ≤ f(x(µ)), the point x(M) is replaced by x(r) and the algorithm returns to step 2.

4. If f(x(r)) < f(x(m)), then the reflection step has produced a new minimizer. This means that the minimizer could lie outside the set defined by the convex hull of the considered points. Therefore, this set must be expanded by computing the new vertex

x(e) = βx(r) + (1 − β)x_c^(M),

where β > 1 is an expansion factor. Then, before coming back to step 2, two possibilities arise:

4a. if f(x(e)) < f(x(m)), then x(M) is replaced by x(e);

4b. if f(x(e)) ≥ f(x(m)), then x(M) is replaced by x(r), since f(x(r)) < f(x(m)).

5. If f(x(r)) > f(x(µ)), then the minimizer probably lies within a subset of the convex hull of the points x(k) and, therefore, two different approaches can be pursued to contract this set. If f(x(r)) < f(x(M)), the contraction generates a new point of the form

x(co) = γx(r) + (1 − γ)x_c^(M),  γ ∈ (0, 1),

otherwise

x(co) = γx(M) + (1 − γ)x_c^(M),  γ ∈ (0, 1).

Finally, before returning to step 2, if f(x(co)) < f(x(M)) and f(x(co)) < f(x(r)), the point x(M) is replaced by x(co), while if f(x(co)) ≥ f(x(M)) or f(x(co)) ≥ f(x(r)), then n new points x(k) are generated, with k = 1, . . . , n, by halving the distances between the original points and x(0).

As far as the choice of the parameters α, β and γ is concerned, the following values are empirically suggested in [NM65]: α = 1, β = 2 and γ = 1/2. The resulting scheme is known as the Simplex method (not to be confused with the method of the same name used in linear programming), since the set of the points x(k), together with their convex combinations, forms a simplex in R^n.

The convergence rate of the method is strongly affected by the orientation of the starting simplex. To address this concern, in the absence of information about the behavior of f, the initial choice (7.23) turns out to be satisfactory in most cases.

We finally mention that the Simplex method is the basic ingredient of the MATLAB function fmins for function minimization in n dimensions.

Example 7.5 Let us compare the performance of the Simplex method with that of the Hooke and Jeeves method in the minimization of the Rosenbrock function

f(x) = 100(x_2 − x_1^2)^2 + (1 − x_1)^2.    (7.24)

This function has a minimizer at (1, 1)^T and represents a severe benchmark for testing numerical methods in minimization problems. The starting point for both methods is set equal to x(0) = (−1.2, 1)^T, while the step sizes are taken equal to h_1 = 0.6 and h_2 = 0.5, in such a way that the variations (7.22) are of comparable size. The stopping tolerance on the residual is set equal to 10^-4. For the implementation of the Simplex method, we have used the MATLAB function fmins.

Figure 7.2 shows the iterates computed by the Hooke and Jeeves method (of which only one in every ten has been reported, for the sake of clarity) and by the Simplex method, superposed on the level curves of the Rosenbrock function.

FIGURE 7.2. Convergence histories of the Hooke and Jeeves method (crossed line) and the Simplex method (circled line). The level curves of the minimized function (7.24) are reported in dashed lines

The graph demonstrates the difficulty of this benchmark: the function is like a curved, narrow valley, which attains its minimum along the parabola of equation x_2 − x_1^2 = 0.

The Simplex method converges in only 165 iterations, while 935 are needed for the Hooke and Jeeves method to converge. The former scheme yields a solution equal to (0.999987, 0.999978)^T, while the latter gives the vector (0.9655, 0.9322)^T.

•
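The run can also be reproduced with a compact implementation of steps 1.-5. above. The sketch below is an added illustration, not the book's code: it uses the parameters α = 1, β = 2, γ = 1/2 suggested in [NM65], but adopts the common variants that accept the expansion point when it improves on the reflected point and that shrink the simplex toward the best vertex.

```python
import numpy as np

def nelder_mead(f, x0, h, eps=1e-9, alpha=1.0, beta=2.0, gamma=0.5, maxit=2000):
    # Initial simplex of n+1 vertices, built as in (7.23)
    n = len(x0)
    xs = [np.array(x0, dtype=float)]
    xs += [xs[0] + h[k] * np.eye(n)[k] for k in range(n)]
    for it in range(maxit):
        xs.sort(key=f)                        # xs[0] = x(m), xs[-1] = x(M)
        fs = [f(x) for x in xs]
        if np.std(fs) < eps:                  # stopping test on the f-values
            break
        xc = np.mean(xs[:-1], axis=0)         # centroid excluding the worst vertex
        xr = (1 + alpha) * xc - alpha * xs[-1]    # reflection (step 2)
        fr = f(xr)
        if fr < fs[0]:                        # step 4: expansion
            xe = beta * xr + (1 - beta) * xc
            xs[-1] = xe if f(xe) < fr else xr
        elif fr <= fs[-2]:                    # step 3: accept the reflected point
            xs[-1] = xr
        else:                                 # step 5: contraction
            xco = gamma * (xr if fr < fs[-1] else xs[-1]) + (1 - gamma) * xc
            if f(xco) < min(fs[-1], fr):
                xs[-1] = xco
            else:                             # shrink toward the best vertex
                xs = [xs[0]] + [0.5 * (x + xs[0]) for x in xs[1:]]
    return xs[0]

rosenbrock = lambda x: 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2
xmin = nelder_mead(rosenbrock, [-1.2, 1.0], [0.6, 0.5])
print(xmin)  # close to the minimizer (1, 1)
```

Starting from (−1.2, 1)^T with h_1 = 0.6 and h_2 = 0.5, the sketch reaches a small neighborhood of the minimizer (1, 1)^T of (7.24).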

7.2.2 Descent Methods

In this section we introduce iterative methods that are more sophisticated than those examined in Section 7.2.1. They can be formulated as follows: given an initial vector x(0) ∈ R^n, compute for k ≥ 0, until convergence,

x(k+1) = x(k) + α_k d(k),    (7.25)

where d(k) is a suitably chosen direction and α_k is a positive parameter (called the stepsize) that measures the step along the direction d(k). This direction d(k) is a descent direction if

d(k)^T ∇f(x(k)) < 0   if ∇f(x(k)) ≠ 0,
d(k) = 0              if ∇f(x(k)) = 0.    (7.26)

A descent method is a method of the form (7.25) in which the vectors d(k) are descent directions.

Property (7.20) ensures that there exists α_k > 0, sufficiently small, such that

f(x(k) + α_k d(k)) < f(x(k)),    (7.27)


provided that f is continuously differentiable. Actually, taking in (7.20) ξ = x(k) + ϑα_k d(k) with ϑ ∈ (0, 1), and employing the continuity of ∇f, we get

f(x(k) + α_k d(k)) − f(x(k)) = α_k ∇f(x(k))^T d(k) + ε,    (7.28)

where ε tends to zero as α_k tends to zero. As a consequence, if α_k > 0 is sufficiently small, the sign of the left-hand side of (7.28) coincides with the sign of ∇f(x(k))^T d(k), so that (7.27) is satisfied if d(k) is a descent direction.
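The sign argument above can be observed numerically; in this small sketch (an added illustration, with an arbitrarily chosen f, point x and direction d) the directional slope is negative and a small step indeed decreases f, as in (7.27):

```python
import numpy as np

# Arbitrary smooth test function, its gradient, and a candidate direction
f = lambda x: np.cos(x[0]) + x[1]**2
grad = lambda x: np.array([-np.sin(x[0]), 2.0 * x[1]])

x = np.array([1.0, 0.5])
d = np.array([1.0, -2.0])
slope = d @ grad(x)                      # d^T grad f(x)
print(slope < 0)                         # True: d is a descent direction at x
alpha = 1e-3
print(f(x + alpha * d) < f(x))           # True: (7.27) holds for this small alpha
```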

Different choices of d(k) correspond to different methods. In particular, we recall the following ones:

- Newton's method, in which

  d(k) = −H^{-1}(x(k))∇f(x(k)),

  provided that H is positive definite within a sufficiently large neighborhood of the point x∗;

- inexact Newton's methods, in which

  d(k) = −B_k^{-1}∇f(x(k)),

  where B_k is a suitable approximation of H(x(k));

- the gradient method or steepest descent method, corresponding to setting d(k) = −∇f(x(k)). This method is thus an inexact Newton's method in which B_k = I. It can also be regarded as a gradient-like method, since d(k)^T ∇f(x(k)) = −‖∇f(x(k))‖_2^2;

- the conjugate gradient method, for which

  d(k) = −∇f(x(k)) + β_k d(k−1),

  where β_k is a scalar to be suitably selected in such a way that the directions d(k) turn out to be mutually orthogonal with respect to a suitable scalar product.
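As a minimal illustration of the framework (7.25), the sketch below (an addition, with an arbitrarily chosen convex quadratic and a fixed small stepsize used only for simplicity; practical methods select α_k by a line search) runs the steepest-descent choice d(k) = −∇f(x(k)):

```python
import numpy as np

# Descent iteration (7.25) with the steepest-descent direction and fixed stepsize
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # SPD matrix: f(x) = x^T A x / 2, minimizer x* = 0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([4.0, -3.0])
alpha = 0.1
for k in range(200):
    d = -grad(x)            # descent direction: d^T grad f(x) = -||grad f(x)||^2
    x = x + alpha * d       # the step (7.25) with alpha_k = alpha
print(np.linalg.norm(x))    # essentially zero: the iterates approach x* = 0
```

For this quadratic the iteration map is x ↦ (I − αA)x, and α = 0.1 keeps both eigenvalues of I − αA inside the unit interval, so convergence to the minimizer is geometric.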