<< . .

. 43
( : 95)

. . >>

7.2 Unconstrained Optimization
We turn now to minimization problems. The point x— , the solution of (7.2),
is called a global minimizer of f , while x— is a local minimizer of f if ∃R > 0
such that

f (x— ) ¤ f (x), ∀x ∈ B(x— ; R).

Throughout this section we shall always assume that f ∈ C 1 (Rn ), and
we refer to [Lem89] for the case in which f is non di¬erentiable. We shall
denote by
‚f ‚f
∇f (x) = (x), . . . , (x) ,
‚x1 ‚xn

the gradient of f at a point x. If d is a non null vector in Rn , then the
directional derivative of f with respect to d is

f (x + ±d) ’ f (x)
(x) = lim
‚d ±

and satis¬es ‚f (x)/‚d = [∇f (x)] d. Moreover, denoting by (x, x + ±d)
the segment in Rn joining the points x and x + ±d, with ± ∈ R, Taylor™s
expansion ensures that ∃ξ ∈ (x, x + ±d) such that

f (x + ±d) ’ f (x) = ±∇f (ξ)T d. (7.20)

If f ∈ C 2 (Rn ), we shall denote by H(x) (or ∇2 f (x)) the Hessian matrix of
f evaluated at a point x, whose entries are

‚ 2 f (x)
hij (x) = , i, j = 1, . . . , n.
‚xi ‚xj

In such a case it can be shown that, if d = 0, the second-order directional
derivative exists and we have
(x) = dT H(x)d.
7.2 Unconstrained Optimization 295

For a suitable ξ ∈ (x, x + d) we also have
f (x + d) ’ f (x) = ∇f (x)T d + dT H(ξ)d.
Existence and uniqueness of solutions for (7.2) are not guaranteed in Rn .
Nevertheless, the following optimality conditions can be proved.

Property 7.4 Let x— ∈ Rn be a local minimizer of f and assume that
f ∈ C 1 (B(x— ; R)) for a suitable R > 0. Then ∇f (x— ) = 0. Moreover,
if f ∈ C 2 (B(x— ; R)) then H(x— ) is positive semide¬nite. Conversely, if
x— ∈ B(x— ; R) and H(x— ) is positive de¬nite, then x— is a local minimizer
of f in B(x— ; R).
A point x— such that ∇f (x— ) = 0, is said to be a critical point for f . This
condition is necessary for optimality to hold. However, this condition also
becomes su¬cient if f is a convex function on Rn , i.e., such that ∀x, y ∈ Rn
and for any ± ∈ [0, 1]
f [±x + (1 ’ ±)y] ¤ ±f (x) + (1 ’ ±)f (y). (7.21)
For further and more general existence results, see [Ber82].

7.2.1 Direct Search Methods
In this section we deal with direct methods for solving problem (7.2), which
only require f to be continuous. In later sections, we shall introduce the
so-called descent methods, which also involve values of the derivatives of f
and have, in general, better convergence properties.
Direct methods are employed when f is not di¬erentiable or if the com-
putation of its derivatives is a nontrivial task. They can also be used to
provide an approximate solution to employ as an initial guess for a descent
method. For further details, we refer to [Wal75] and [Wol78].

The Hooke and Jeeves Method
Assume we are searching for the minimizer of f starting from a given initial
point x(0) and requiring that the error on the residual is less than a certain
¬xed tolerance . The Hooke and Jeeves method computes a new point x(1)
using the values of f at suitable points along the orthogonal coordinate
directions around x(0) . The method consists of two steps: an exploration
step and an advancing step.
The exploration step starts by evaluating f (x(0) + h1 e1 ), where e1 is the
¬rst vector of the canonical basis of Rn and h1 is a positive real number to
be suitably chosen.
If f (x(0) + h1 e1 ) < f (x(0) ), then a success is recorded and the starting
point is moved in x(0) + h1 e1 , from which an analogous check is carried out
at point x(0) + h1 e1 + h2 e2 with h2 ∈ R+ .
296 7. Nonlinear Systems and Numerical Optimization

If, instead, f (x(0) + h1 e1 ) ≥ f (x(0) ), then a failure is recorded and a
similar check is performed at x(0) ’ h1 e1 . If a success is registered, the
method explores, as previously, the behavior of f in the direction e2 starting
from this new point, while, in case of a failure, the method passes directly
to examining direction e2 , keeping x(0) as starting point for the exploration
To achieve a certain accuracy, the step lengths hi must be selected in
such a way that the quantities

|f (x(0) ± hj ej ) ’ f (x(0) |, j = 1, . . . , n (7.22)

have comparable sizes.
The exploration step terminates as soon as all the n Cartesian directions
have been examined. Therefore, the method generates a new point, y(0) ,
after at most 2n+1 functional evaluations. Only two possibilities may arise:

1. y(0) = x(0) . In such a case, if max hi the method terminates
i=1,... ,n
and yields the approximate solution x . Otherwise, the step lengths
hi are halved and another exploration step is performed starting from
x(0) ;

max |hi | < , then the method terminates yielding
2. y(0) = x(0) . If
i=1,... ,n
y(0) as an approximate solution, otherwise the advancing step starts.
The advancing step consists of moving further from y(0) along the
direction y(0) ’ x(0) (which is the direction that recorded the maxi-
mum decrease of f during the exploration step), rather then simply
setting y(0) as a new starting point x(1) .
This new starting point is instead set equal to 2y(0) ’ x(0) . From this
point a new series of exploration moves is started. If this exploration
leads to a point y(1) such that f (y(1) ) < f (y(0) ’ x(0) ), then a new
starting point for the next exploration step has been found, otherwise
the initial guess for further explorations is set equal to y(1) = y(0) ’
x(0) .
The method is now ready to restart from the point x(1) just com-

Program 60 provides an implementation of the Hooke and Jeeves method.
The input parameters are the size n of the problem, the vector h of the
initial steps along the Cartesian directions, the variable f containing the
functional expression of f in terms of the components x(1), . . . , x(n), the
initial point x0 and the stopping tolerance toll equal to . In output, the
code returns the approximate minimizer of f , x, the value minf attained by
f at x and the number of iterations needed to compute x up to the desired
accuracy. The exploration step is performed by Program 61.
7.2 Unconstrained Optimization 297

Program 60 - hookejeeves : The method of Hooke and Jeeves (HJ)
function [x,minf,nit]=hookejeeves(n,h,f,x0,toll)
x = x0; minf = eval(f); nit = 0;
while h > toll
[y] = explore(h,n,f,x);
if y == x, h = h/2; else
x = 2*y-x; [z] = explore(h,n,f,x);
if z == x, x = y; else, x = z; end
nit = nit +1;
minf = eval(f);

Program 61 - explore : Exploration step in the HJ method
function [x]=explore(h,n,f,x0)
x = x0; f0 = eval(f);
for i=1:n
x(i) = x(i) + h(i); ¬ = eval(f);
if ¬ < f0, f0 = ¬; else
x(i) = x0(i) - h(i); ¬ = eval(f);
if ¬ < f0, f0 = ¬; else, x(i) = x0 (i); end

The Method of Nelder and Mead
This method, proposed in [NM65], employs local linear approximants of f
to generate a sequence of points x(k) , approximations of x— , starting from
simple geometrical considerations. To explain the details of the algorithm,
we begin by noticing that a plane in Rn is uniquely determined by ¬xing
n + 1 points that must not be lying on a hyperplane.
Denote such points by x(k) , for k = 0, . . . , n. They could be generated as

x(k) = x(0) + hk ek , k = 1, . . . , n (7.23)

having selected the steplengths hk ∈ R+ in such a way that the variations
(7.22) are of comparable size.
Let us now denote by x(M ) , x(m) and x(µ) those points of the set x(k)
at which f respectively attains its maximum and minimum value and the
value immediately preceding the maximum. Moreover, denote by xc the
centroid of point x(k) de¬ned as
x(k) x(j) .
298 7. Nonlinear Systems and Numerical Optimization

The method generates a sequence of approximations of x— , starting from
x(k) , by employing only three possible transformations: re¬‚ections with
respect to centroids, dilations and contractions. Let us examine the details
of the algorithm assuming that n + 1 initial points are available.
1. Determine the points x(M ) , x(m) and x(µ) .
2. Compute as an approximation of x— the point
n + 1 i=0

and check if x is su¬ciently close (in a sense to be made precise) to

x . Typically, one requires that the standard deviation of the values
f (x(0) ), . . . , f (x(n) ) from
¯ f (x(i) )
n + 1 i=0

are less than a ¬xed tolerance µ, that is
n 2
1 ¯
f (x(i) ) ’ f < µ.
n i=0

(M )
Otherwise, x(M ) is re¬‚ected with respect to xc , that is, the follow-
ing new point xr is computed
xr = (1 + ±)x(M ) ’ ±x(M ) ,

where ± ≥ 0 is a suitable re¬‚ection factor. Notice that the method
has moved along the “opposite” direction to x(M ) . This statement has
a geometrical interpretation in the case n = 2, since the points x(k)
coincide with x(M ) , x(m) and x(µ) . They thus de¬ne a plane whose
slope points from x(M ) towards x(m) and the method provides a step
along this direction.
3. If f (x(m) ) ¤ f (x(r) ) ¤ f (x(µ) ), the point x(M ) is replaced by x(r)
and the algorithm returns to step 2.
4. If f (x(r) ) < f (x(m) ) then the re¬‚ection step has produced a new
minimizer. This means that the minimizer could lie outside the set
de¬ned by the convex hull of the considered points. Therefore, this
set must be expanded by computing the new vertex
x(e) = βx(r) + (1 ’ β)x(M ) ,

where β > 1 is an expansion factor. Then, before coming back to step
2., two possibilities arise:
7.2 Unconstrained Optimization 299

4a. if f (x(e) ) < f (x(m) ) then x(M ) is replaced by x(e) ;
4b. f (x(e) ) ≥ f (x(m) ) then x(M ) is replaced by x(r) since f (x(r) ) <
f (x(m) ).
5. If f (x(r) ) > f (x(µ) ) then the minimizer probably lies within a subset
of the convex hull of points x(k) and, therefore, two di¬erent ap-
proaches can be pursued to contract this set. If f (x(r) ) < f (x(M ) ),
the contraction generates a new point of the form

x(co) = γx(r) + (1 ’ γ)x(M ) , γ ∈ (0, 1),


x(co) = γx(M ) + (1 ’ γ)x(M ) , γ ∈ (0, 1),

Finally, before returning to step 2., if f (x(co) ) < f (x(M ) ) and f (x(co) ) <
f (x(r) ), the point x(M ) is replaced by x(co) , while if f (x(co) ) ≥ f (x(M ) )
or if f (x(co) ) > f (x(r) ), then n new points x(k) are generated, with
k = 1, . . . , n, by halving the distances between the original points
and x(0) .

As far as the choice of the parameters ±, β and γ is concerned, the following
values are empirically suggested in [NM65]: ± = 1, β = 2 and γ = 1/2. The
resulting scheme is known as the Simplex method (that must not be con-
fused with a method sharing the same name used in linear programming),
since the set of the points x(k) , together with their convex combinations,
form a simplex in Rn .
The convergence rate of the method is strongly a¬ected by the orientation
of the starting simplex. To address this concern, in absence of information
about the behavior of f , the initial choice (7.23) turns out to be satisfactory
in most cases.
We ¬nally mention that the Simplex method is the basic ingredient of
the MATLAB function fmins for function minimization in n dimensions.

Example 7.5 Let us compare the performances of the Simplex method with the
Hooke and Jeeves method, in the minimization of the Rosembrock function

f (x) = 100(x2 ’ x2 )2 + (1 ’ x1 )2 . (7.24)

This function has a minimizer at (1, 1)T and represents a severe benchmark for
testing numerical methods in minimization problems. The starting point for both
methods is set equal to x(0) = (’1.2, 1)T , while the step sizes are taken equal
to h1 = 0.6 and h2 = 0.5, in such a way that (7.23) is satis¬ed. The stopping
tolerance on the residual is set equal to 10’4 . For the implementation of Simplex
method, we have used the MATLAB function fmins.
Figure 7.2 shows the iterates computed by the Hooke and Jeeves method (of
which one in every ten iterates have been reported, for the sake of clarity) and by
300 7. Nonlinear Systems and Numerical Optimization











’1.5 ’1 ’0.5 0 0.5 1 1.5

FIGURE 7.2. Convergence histories of the Hooke and Jeeves method
(crossed-line) and the Simplex method (circled-line). The level curves of the min-
imized function (7.24) are reported in dashed line

the Simplex method, superposed to the level curves of the Rosembrock function.
The graph demonstrates the di¬culty of this benchmark: actually, the function
is like a curved, narrow valley, which attains its minimum along the parabola of
equation x2 ’ x2 = 0.
The Simplex method converges in only 165 iterations, while 935 are needed for
the Hooke and Jeeves method to converge. The former scheme yields a solution
equal to (0.999987, 0.999978)T , while the latter gives the vector (0.9655, 0.9322)T .

7.2.2 Descent Methods
In this section we introduce iterative methods that are more sophisticated
than those examined in Section 7.2.1. They can be formulated as follows:

given an initial vector x(0) ∈ Rn , compute for k ≥ 0 until convergence
x(k+1) = x(k) + ±k d(k) , (7.25)
where d(k) is a suitably chosen direction and ±k is a positive parameter
(called stepsize) that measures the step along the direction d(k) . This di-
rection d(k) is a descent direction if
d(k) ∇f (x(k) ) < 0 if ∇f (x(k) ) = 0,
if ∇f (x
(k) (k)
d =0 ) = 0.
A descent method is a method like (7.25), in which the vectors d(k) are
descent directions.
Property (7.20) ensures that there exists ±k > 0, su¬ciently small, such
f (x(k) + ±k d(k) ) < f (x(k) ), (7.27)
7.2 Unconstrained Optimization 301

provided that f is continuously di¬erentiable. Actually, taking in (7.20)
ξ = x(k) + ‘±k d(k) with ‘ ∈ (0, 1), and employing the continuity of ∇f ,
we get

f (x(k) + ±k d(k) ) ’ f (x(k) ) = ±k ∇f (x(k) )T d(k) + µ, (7.28)

where µ tends to zero as ±k tends to zero. As a consequence, if ±k > 0 is
su¬ciently small, the sign of the left-side of (7.28) coincides with the sign
of ∇f (x(k) )T d(k) , so that (7.27) is satis¬ed if d(k) is a descent direction.
Di¬erent choices of d(k) correspond to di¬erent methods. In particular,
we recall the following ones:

- Newton™s method, in which

d(k) = ’H’1 (x(k) )∇f (x(k) ),

provided that H is positive de¬nite within a su¬ciently large neigh-
borhood of point x— ;
- inexact Newton™s methods, in which

d(k) = ’B’1 ∇f (x(k) ),

where Bk is a suitable approximation of H(x(k) );
- the gradient method or steepest descent method, corresponding to setting
d(k) = ’∇f (x(k) ). This method is thus an inexact Newton™s method,
in which Bk = I. It can also be regarded as a gradient-like method,
since d(k) ∇f (x(k) ) = ’ ∇f (x(k) ) 2 ;

- the conjugate gradient method, for which

d(k) = ’∇f (x(k) ) + βk d(k’1) ,

where βk is a scalar to be suitably selected in such a way that the
directions d(k) turn out to be mutually orthogonal with respect to
a suitable scalar product.

<< . .

. 43
( : 95)

. . >>