<< . .

. 44
( : 95)

. . >>

Selecting d(k) is not enough to completely identify a descent method,
since it remains an open problem how to determine ±k in such a way that
(7.27) is ful¬lled without resorting to excessively small stepsizes ±k (and,
thus, to methods with a slow convergence).
A method for computing ±k consists of solving the following minimiza-
tion problem in one dimension:

¬nd ± such that φ(±) = f (x(k) + ±d(k) ) is minimized. (7.29)

In such a case we have the following result.
302 7. Nonlinear Systems and Numerical Optimization

Theorem 7.3 Consider the descent method (7.25). If at the generic step
k, the parameter ±k is set equal to the exact solution of (7.29), then the
following orthogonality property holds

∇f (x(k+1) )T d(k) = 0.

Proof. Let ±k be a solution to (7.29). Then, the ¬rst derivative of φ, given by
‚f (k) ‚ (k) (k)
(xi + ±di ) = ∇f (x(k) + ±k d(k) )T d(k) ,
(x + ±k d(k) )
φ (±) =
‚xi ‚±

vanishes at ± = ±k . The thesis then follows, recalling the de¬nition of x(k+1) . 3

Unfortunately, except for in special cases (which are nevetherless quite
relevant, see Section 7.2.4), providing an exact solution of (7.29) is not fea-
sible, since this is a nonlinear problem. One possible strategy consists of
approximating f along the straight line x(k) + ±d(k) through an interpo-
lating polynomial and then minimizing this polynomial (see the quadratic
interpolation Powell methods and cubic interpolation Davidon methods in
Generally speaking, a process that leads to an approximate solution to
(7.29) is said to be a line search technique and is addressed in the next

7.2.3 Line Search Techniques
The methods that we are going to deal with in this section, are iterative
techniques that terminate as soon as some accuracy stopping criterion on
±k is satis¬ed. We shall assume that (7.26) holds.
Practical experience reveals that it is not necessary to solve accurately
for (7.29) in order to devise e¬cient methods, rather, it is crucial to enforce
some limitation on the step lengths (and, thus, on the admissible values for
±k ). Actually, without introducing any limitation, a reasonable request on
±k would seem be that the new iterate x(k+1) satis¬es the inequality

f (x(k+1) ) < f (x(k) ), (7.30)

where x(k) and d(k) have been ¬xed. For this purpose, the procedure based
on starting from a (su¬ciently large) value of the step length ±k and halve
this value until (7.30) is ful¬lled, can yield completely wrong results (see,

More stringent criteria than (7.30) should be adopted in the choice of
possible values for ±k . To this end, we notice that two kinds of di¬culties
arise with the above examples: a slow descent rate of the sequence and the
use of small stepsizes.
7.2 Unconstrained Optimization 303

The ¬rst di¬culty can be overcome by requiring that
0 ≥ vM (x(k+1) ) = f (x(k) ) ’ f (x(k) + ±k d(k) )
±k (7.31)
≥ ’σ∇f (x(k) )T d(k) ,

with σ ∈ (0, 1/2). This amounts to requiring that the average descent
rate vM of f along d(k) , evaluated at x(k+1) , be at least equal to a given
fraction of the initial descent rate at x(k) . To avoid the generation of too
small stepsizes, we require that the descent rate in the direction d(k) at
x(k+1) is not less than a given fraction of the descent rate at x(k)

|∇f (x(k) + ±k d(k) )T d(k) | ¤ β|∇f (x(k) )T d(k) |, (7.32)

with β ∈ (σ, 1) in such a way as to also satisfy (7.31). In computational
practice, σ ∈ [10’5 , 10’1 ] and β ∈ [10’1 , 1 ] are usual choices. Sometimes,
(7.32) is replaced by the milder condition

∇f (x(k) + ±k d(k) )T d(k) ≥ β∇f (x(k) )T d(k) (7.33)

(recall that ∇f (x(k) )T d(k) is negative, since d(k) is a descent direction).

The following property ensures that, under suitable assumptions, it is pos-
sible to ¬nd out values of ±k which satisfy (7.31)-(7.32) or (7.31)-(7.33).

Property 7.5 Assume that f (x) ≥ M for any x ∈ Rn . Then there exists
an interval I = [c, C] for the descent method, with 0 < c < C, such that
∀±k ∈ I, (7.31), (7.32) (or (7.31)-(7.33)) are satis¬ed, with σ ∈ (0, 1/2)
and β ∈ (σ, 1).
Under the constraint of ful¬lling conditions (7.31) and (7.32), several
choices for ±k are available. Among the most up-to-date strategies, we re-
call here the backtracking techniques: having ¬xed σ ∈ (0, 1/2), then start
with ±k = 1 and then keep on reducing its value by a suitable scale factor
ρ ∈ (0, 1) (backtrack step) until (7.31) is satis¬ed. This procedure is im-
plemented in Program 62, which requires as input parameters the vector x
containing x(k) , the macros f and J of the functional expressions of f and
its Jacobian, the vector d of the direction d(k) , and a value for σ (usually
of the order of 10’4 ) and the scale factor ρ. In output, the code returns
the vector x(k+1) , computed using a suitable value of ±k .

Program 62 - backtrackr : Backtraking for line search
function [xnew]= backtrackr(sigma,rho,x,f,J,d)
alphak = 1; fk = eval(f); Jfk = eval (J);
xx = x; x = x + alphak * d; fk1 = eval (f);
while fk1 > fk + sigma * alphak * Jfk™*d
304 7. Nonlinear Systems and Numerical Optimization

alphak = alphak*rho;
x = xx + alphak*d;
fk1 = eval(f);

Other commonly used strategies are those developed by Armijo and Gold-
stein (see [Arm66], [GP67]). Both use σ ∈ (0, 1/2). In the Armijo formula,
one takes ±k = β mk ±, where β ∈ (0, 1), ± > 0 and mk is the ¬rst non-
¯ ¯
negative integer such that (7.31) is satis¬ed. In the Goldstein formula, the
parameter ±k is determined in such a way that
f (x(k) + ±k d(k) ) ’ f (x(k) )
σ¤ ¤ 1 ’ σ. (7.34)
±k ∇f (x(k) )T d(k)
A procedure for computing ±k that satis¬es (7.34) is provided in [Ber82],
Chapter 1. Of course, one can even choose ±k = ± for any k, which is
clearly convenient when evaluating f is a costly task.
In any case, a good choice of the value ± is mandatory. In this respect,
one can proceed as follows. For a given value ±, the second degree poly-
nomial Π2 along the direction d is constructed, subject to the following
interpolation constraints
Π2 (x(k) ) = f (x(k) ),
Π2 (x(k) + ±d(k) ) = f (x(k) + ±d(k) ),
¯ ¯
Π2 (x(k) ) = ∇f (x(k) )T d(k) .
Next, the value ± is computed such that Π2 is minimized, then, we let
± = ±.

7.2.4 Descent Methods for Quadratic Functions
A case of remarkable interest, where the parameter ±k can be exactly com-
puted, is the problem of minimizing the quadratic function
x Ax ’ bT x,
f (x) = (7.35)
where A∈ Rn—n is a symmetric and positive de¬nite matrix and b ∈ Rn .
In such a case, as already seen in Section 4.3.3, a necessary condition for
x— to be a minimizer for f is that x— is the solution of the linear system
(3.2). Actually, it can be checked that if f is a quadratic function
∇f (x) = Ax ’ b = ’r, H(x) = A.
As a consequence, all gradient-like iterative methods developed in Section
4.3.3 for linear systems, can be extended tout-court to solve minimization
7.2 Unconstrained Optimization 305

In particular, having ¬xed a descent direction d(k) , we can determine
the optimal value of the acceleration parameter ±k that appears in (7.25),
in such a way as to ¬nd the point where the function f , restricted to the
direction d(k) , is minimized. Setting to zero the directional derivative, we
d T T
f (x(k) + ±k d(k) ) = ’d(k) r(k) + ±k d(k) Ad(k) = 0
from which the following expression for ±k is obtained
d(k) r(k)
±k = . (7.36)
d(k) Ad(k)
The error introduced by the iterative process (7.25) at the k-th step is
x(k+1) ’ x— = x(k+1) ’ x— A x(k+1) ’ x—
(k) T T
x— 2 —
’ ’x
(k) (k)
±k d(k) Ad(k) .
=x + 2±k d Ax +

On the other hand x(k) ’ x— = r(k) A’1 r(k) , so that from (7.37) it
follows that

x(k+1) ’ x— = ρk x(k) ’ x—
2 2

having denoted by ρk = 1 ’ σk , with
A’1 r(k) .
σk = (d(k) r(k) )2 / d(k) Ad(k) r(k)

Since A is symmetric and positive de¬nite, σk is always positive. Moreover,
it can be directly checked that ρk is strictly less than 1, except when d(k)
is orthogonal to r(k) , in which case ρk = 1.
The choice d(k) = r(k) , which leads to the steepest descent method, pre-
vents this last circumstance from arising. In such a case, from (7.38) we
»max ’ »min (k)
x(k+1) ’ x— x ’ x—
¤ (7.39)
»max + »min
having employed the following result.

Lemma 7.1 (Kantorovich inequality) Let A ∈ Rn—n be a symmetric
positive de¬nite matrix whose eigenvalues with largest and smallest module
are given by »max and »min , respectively. Then, ∀y ∈ Rn , y = 0,

(yT y)2 4»max »min
≥ .
(yT Ay)(yT A’1 y) (»max + »min )2
306 7. Nonlinear Systems and Numerical Optimization

It follows from (7.39) that, if A is ill-conditioned, the error reducing factor
for the steepest descent method is close to 1, yielding a slow convergence to
the minimizer x— . As done in Chapter 4, this drawback can be overcome
by introducing directions d(k) that are mutually A-conjugate, i.e.
d(k) Ad(m) = 0 if k = m.

The corresponding methods enjoy the following ¬nite termination property.

Property 7.6 A method for computing the minimizer x— of the quadratic
function (7.35) which employs A-conjugate directions terminates after at
most n steps if the acceleration parameter ±k is selected as in (7.36). More-
over, for any k, x(k+1) is the minimizer of f over the subspace generated
by the vectors x(0) , d(0) , . . . , d(k) and
r(k+1) d(m) = 0 ∀m ¤ k.

The A-conjugate directions can be determined by following the proce-
dure described in Section 4.3.4. Letting d(0) = r(0) , the conjugate gradient
method for function minimization is

d(k+1) = r(k) + βk d(k) ,
r(k+1) Ad(k) r(k+1) r(k+1)
βk = ’ = ,
d(k) Ad(k) r(k) r(k)
x(k+1) = x(k) + ±k d(k) .

It satis¬es the following error estimate
K2 (A) ’ 1
x(k) ’ x— x(0) ’ x—
¤2 A,
K2 (A) + 1

which can be improved by lowering the condition number of A, i.e., resort-
ing to the preconditioning techniques that have been dealt with in Section

Remark 7.2 (The nonquadratic case) The conjugate gradient method
can be extended to the case in which f is a non quadratic function. However,
in such an event, the acceleration parameter ±k cannot be exactly deter-
mined a priori, but requires the solution of a local minimization problem.
Moreover, the parameters βk can no longer be uniquely found. Among the
most reliable formulae, we recall the one due to Fletcher-Reeves,

∇f (x(k) ) 2
β1 = 0, βk = , for k > 1
∇f (x(k’1) ) 2
7.2 Unconstrained Optimization 307

and the one due to Polak-Ribi´re
∇f (x(k) ) (∇f (x(k) ) ’ ∇f (x(k’1) ))
β1 = 0, βk = , for k > 1.
∇f (x(k’1) ) 2

7.2.5 Newton-like Methods for Function Minimization
An alternative is provided by Newton™s method, which di¬ers from its ver-
sion for nonlinear systems in that now it is no longer applied to f , but to
its gradient.
Using the notation of Section 7.2.2, Newton™s method for function mini-
mization amounts to computing, for k = 0, 1, . . . , until convergence

d(k) = ’H’1 ∇f (x(k) ),
(k+1) (k) (k)
x =x +d ,

where x(0) ∈ Rn is a given initial vector and having set Hk = H(x(k) ). The
method can be derived by truncating Taylor™s expansion of f (x(k) ) at the
f (x(k) ) + ∇f (x(k) )T p + pT Hk p.
f (x(k) + p) (7.41)
Selecting p in (7.41) in such a way that the new vector x(k+1) = x(k) + p
satis¬es ∇f (xk+1 ) = 0, we end up with method (7.40), which thus con-
verges in one step if f is quadratic.
In the general case, a result analogous to Theorem 7.1 also holds for func-
tion minimization. Method (7.40) is therefore locally quadratically conver-
gent to the minimizer x— . However, it is not convenient to use Newton™s
method from the beginning of the computation, unless x(0) is su¬ciently
close to x— . Otherwise, indeed, Hk could not be invertible and the direc-
tions d(k) could fail to be descent directions. Moreover, if Hk is not positive
de¬nite, nothing prevents the scheme (7.40) from converging to a saddle
point or a maximizer, which are points where ∇f is equal to zero. All these
drawbacks, together with the high computational cost (recall that a linear
system with matrix Hk must be solved at each iteration), prompt suit-
ably modifying method (7.40), which leads to the so-called quasi-Newton

A ¬rst modi¬cation, which applies to the case where Hk is not posi-
tive de¬nite, yields the so-called Newton™s method with shift. The idea is
to prevent Newton™s method from converging to non-minimizers of f , by
applying the scheme to a new Hessian matrix Hk = Hk + µk In , where, as
308 7. Nonlinear Systems and Numerical Optimization

usual, In denotes the identity matrix of order n and µk is selected in such
a way that Hk is positive de¬nite. The problem is to determine the shift
µk with a reduced e¬ort. This can be done, for instance, by applying the
Gershgorin theorem to the matrix Hk (see Section 5.1). For further details
on the subject, see [DS83] and [GMW81].

7.2.6 Quasi-Newton Methods
At the generic k-th iteration, a quasi-Newton method for function mini-
mization performs the following steps:

1. compute the Hessian matrix Hk , or a suitable approximation Bk ;

2. ¬nd a descent direction d(k) (not necessarily coinciding with the di-
rection provided by Newton™s method), using Hk or Bk ;

3. compute the acceleration parameter ±k ;

4. update the solution, setting x(k+1) = x(k) + ±k d(k) , according to a
global convergence criterion.

In the particular case where d(k) = ’H’1 ∇f (x(k) ), the resulting scheme is
called the damped Newton™s method. To compute Hk or Bk , one can resort
to either Newton™s method or secant-like methods, which will be considered
in Section 7.2.7.
The criteria for selecting the parameter ±k , that have been discussed in
Section 7.2.3, can now be usefully employed to devise globally convergent
methods. Property 7.5 ensures that there exist values of ±k satisfying (7.31),
(7.33) or (7.31), (7.32).
Let us then assume that a sequence of iterates x(k) , generated by a
descent method for a given x(0) , converge to a vector x— . This vector will
not be, in general, a critical point for f . The following result gives some
conditions on the directions d(k) which ensure that the limit x— of the
sequence is also a critical point of f .

Property 7.7 (Convergence) Let f : Rn ’ R be a continuously di¬er-
entiable function, and assume that there exists L > 0 such that

∇f (x) ’ ∇f (y) ¤ L x ’ y 2.

Then, if x(k) is a sequence generated by a gradient-like method which
ful¬lls (7.31) and (7.33), then, one (and only one) of the following events
can occur:

1. ∇f (x(k) ) = 0 for some k;

2. lim f (x(k) ) = ’∞;
7.2 Unconstrained Optimization 309

∇f (x(k) )T d(k)
3. lim = 0.
d(k) 2

Thus, unless the pathological cases where the directions d(k) become too
large or too small with respect to ∇f (x(k) ) or, even, are orthogonal to
∇f (x(k) ), any limit of the sequence x(k) is a critical point of f .
The convergence result for the sequence x(k) can also be extended to the
sequence f (x(k) ). Indeed, the following result holds.

Property 7.8 Let x(k) be a convergent sequence generated by a gradient-
like method, i.e., such that any limit of the sequence is also a critical point

<< . .

. 44
( : 95)

. . >>