since it remains an open problem how to determine α_k in such a way that (7.27) is fulfilled without resorting to excessively small stepsizes α_k (and, thus, to methods with slow convergence).

A method for computing α_k consists of solving the following one-dimensional minimization problem:

   find α such that φ(α) = f(x^{(k)} + α d^{(k)}) is minimized.        (7.29)

In such a case we have the following result.

302 7. Nonlinear Systems and Numerical Optimization

Theorem 7.3 Consider the descent method (7.25). If at the generic step k the parameter α_k is set equal to the exact solution of (7.29), then the following orthogonality property holds:

   ∇f(x^{(k+1)})^T d^{(k)} = 0.

Proof. Let α_k be a solution to (7.29). Then the first derivative of φ, given by

   φ′(α) = Σ_{i=1}^{n} ∂f/∂x_i (x^{(k)} + α d^{(k)}) · ∂/∂α (x_i^{(k)} + α d_i^{(k)}) = ∇f(x^{(k)} + α d^{(k)})^T d^{(k)},

vanishes at α = α_k. The thesis then follows, recalling the definition of x^{(k+1)}. □
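Theorem 7.3 can be checked numerically. The sketch below (plain Python rather than the book's MATLAB; the function f and the starting data are illustrative choices, not from the text) minimizes φ(α) = f(x^{(k)} + α d^{(k)}) by a golden-section search and verifies that the gradient at the new iterate is orthogonal to d^{(k)}.

```python
import math

def f(x):
    # illustrative smooth convex function, not from the text
    return (x[0] - 1.0)**2 + 2.0*(x[1] + 0.5)**2 + 0.1*x[0]**4

def grad_f(x):
    return [2.0*(x[0] - 1.0) + 0.4*x[0]**3, 4.0*(x[1] + 0.5)]

def golden_min(phi, a, b, tol=1e-12):
    # golden-section search for the minimizer of a unimodal phi on [a, b]
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    while b - a > tol:
        c = b - invphi*(b - a)
        d = a + invphi*(b - a)
        if phi(c) < phi(d):
            b = d
        else:
            a = c
    return 0.5*(a + b)

xk = [3.0, 2.0]
g = grad_f(xk)
dk = [-g[0], -g[1]]                        # a descent direction (here: -grad)
phi = lambda alpha: f([xk[0] + alpha*dk[0], xk[1] + alpha*dk[1]])
alpha_k = golden_min(phi, 0.0, 1.0)        # exact solution of (7.29), numerically
x_next = [xk[0] + alpha_k*dk[0], xk[1] + alpha_k*dk[1]]
g_next = grad_f(x_next)
ortho = g_next[0]*dk[0] + g_next[1]*dk[1]  # ~0 by Theorem 7.3
```

Since f here is convex along any line, φ is unimodal on [0, 1] and the golden-section search applies.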

Unfortunately, except in special cases (which are nevertheless quite relevant, see Section 7.2.4), providing an exact solution of (7.29) is not feasible, since it is a nonlinear problem. One possible strategy consists of approximating f along the straight line x^{(k)} + α d^{(k)} by an interpolating polynomial and then minimizing this polynomial (see the quadratic interpolation methods of Powell and the cubic interpolation methods of Davidon in [Wal75]).

Generally speaking, a process that leads to an approximate solution to

(7.29) is said to be a line search technique and is addressed in the next

section.

7.2.3 Line Search Techniques

The methods we deal with in this section are iterative techniques that terminate as soon as some accuracy stopping criterion on α_k is satisfied. We shall assume that (7.26) holds.

Practical experience reveals that it is not necessary to solve (7.29) accurately in order to devise efficient methods; rather, it is crucial to enforce some limitation on the step lengths (and, thus, on the admissible values for α_k). Actually, without introducing any limitation, a reasonable request on α_k would seem to be that the new iterate x^{(k+1)} satisfies the inequality

   f(x^{(k+1)}) < f(x^{(k)}),        (7.30)

where x^{(k)} and d^{(k)} have been fixed. For this purpose, the procedure of starting from a (sufficiently large) value of the step length α_k and halving it until (7.30) is fulfilled can yield completely wrong results (see [DS83]).

More stringent criteria than (7.30) should be adopted in the choice of possible values for α_k. To this end, we notice that two kinds of difficulties arise with the above examples: a slow descent rate of the sequence and the use of small stepsizes.


The first difficulty can be overcome by requiring that

   0 ≥ v_M(x^{(k+1)}) = [f(x^{(k)} + α_k d^{(k)}) − f(x^{(k)})] / α_k ≤ σ ∇f(x^{(k)})^T d^{(k)},        (7.31)

with σ ∈ (0, 1/2). This amounts to requiring that the average descent rate v_M of f along d^{(k)}, evaluated at x^{(k+1)}, be at least equal to a given fraction of the initial descent rate at x^{(k)}. To avoid the generation of too small stepsizes, we also require that the descent rate in the direction d^{(k)} at x^{(k+1)} be not less than a given fraction of the descent rate at x^{(k)}:

   |∇f(x^{(k)} + α_k d^{(k)})^T d^{(k)}| ≤ β |∇f(x^{(k)})^T d^{(k)}|,        (7.32)

with β ∈ (σ, 1), in such a way as to also satisfy (7.31). In computational practice, σ ∈ [10^{-5}, 10^{-1}] and β ∈ [10^{-1}, 1/2] are usual choices. Sometimes, (7.32) is replaced by the milder condition

   ∇f(x^{(k)} + α_k d^{(k)})^T d^{(k)} ≥ β ∇f(x^{(k)})^T d^{(k)}        (7.33)

(recall that ∇f(x^{(k)})^T d^{(k)} is negative, since d^{(k)} is a descent direction). The following property ensures that, under suitable assumptions, it is possible to find values of α_k which satisfy (7.31)-(7.32) or (7.31)-(7.33).

Property 7.5 Assume that f(x) ≥ M for any x ∈ R^n. Then there exists an interval I = [c, C] for the descent method, with 0 < c < C, such that ∀ α_k ∈ I, (7.31)-(7.32) (or (7.31)-(7.33)) are satisfied, with σ ∈ (0, 1/2) and β ∈ (σ, 1).

Under the constraint of fulfilling conditions (7.31) and (7.32), several choices for α_k are available. Among the most up-to-date strategies, we recall here the backtracking techniques: having fixed σ ∈ (0, 1/2), one starts with α_k = 1 and then keeps reducing its value by a suitable scale factor ρ ∈ (0, 1) (backtrack step) until (7.31) is satisfied. This procedure is implemented in Program 62, which requires as input parameters the vector x containing x^{(k)}, the macros f and J of the functional expressions of f and its Jacobian, the vector d of the direction d^{(k)}, a value for σ (usually of the order of 10^{-4}) and the scale factor ρ. In output, the code returns the vector x^{(k+1)}, computed using a suitable value of α_k.

Program 62 - backtrackr : Backtracking for line search

function [xnew] = backtrackr(sigma,rho,x,f,J,d)
alphak = 1; fk = eval(f); Jfk = eval(J);
xx = x; x = x + alphak * d; fk1 = eval(f);
while fk1 > fk + sigma * alphak * Jfk'*d
   alphak = alphak*rho;
   x = xx + alphak*d;
   fk1 = eval(f);
end
xnew = x;
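For readers working outside MATLAB, the same backtracking strategy can be sketched in Python; here f and grad_f are passed as callables instead of eval-strings (an implementation choice for the sketch, not the book's interface), and the test data are illustrative.

```python
def backtrack(sigma, rho, x, f, grad_f, d):
    """Backtracking line search: shrink alpha by the factor rho until the
    sufficient-decrease condition (7.31) holds."""
    alpha = 1.0
    fk = f(x)
    slope = sum(gi*di for gi, di in zip(grad_f(x), d))  # grad_f(x)^T d, negative
    xnew = [xi + alpha*di for xi, di in zip(x, d)]
    while f(xnew) > fk + sigma*alpha*slope:
        alpha *= rho
        xnew = [xi + alpha*di for xi, di in zip(x, d)]
    return xnew, alpha

# usage on the quadratic f(x) = x1^2 + 4 x2^2, steepest descent direction
f = lambda x: x[0]**2 + 4.0*x[1]**2
grad_f = lambda x: [2.0*x[0], 8.0*x[1]]
x0 = [1.0, 1.0]
d0 = [-g for g in grad_f(x0)]
x1, alpha = backtrack(1e-4, 0.5, x0, f, grad_f, d0)
```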

Other commonly used strategies are those developed by Armijo and Goldstein (see [Arm66], [GP67]). Both use σ ∈ (0, 1/2). In the Armijo formula, one takes α_k = β^{m_k} ᾱ, where β ∈ (0, 1), ᾱ > 0 and m_k is the first non-negative integer such that (7.31) is satisfied. In the Goldstein formula, the parameter α_k is determined in such a way that

   σ ≤ [f(x^{(k)} + α_k d^{(k)}) − f(x^{(k)})] / [α_k ∇f(x^{(k)})^T d^{(k)}] ≤ 1 − σ.        (7.34)

A procedure for computing α_k that satisfies (7.34) is provided in [Ber82], Chapter 1. Of course, one can even choose α_k = ᾱ for any k, which is clearly convenient when evaluating f is a costly task.
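The Armijo formula α_k = β^{m_k} ᾱ can be sketched as follows (Python; the parameter values and the test function are illustrative, not prescribed by the text):

```python
def armijo(f, grad_f, x, d, sigma=0.25, beta=0.5, alpha_bar=1.0, m_max=60):
    """Return alpha_k = beta**m_k * alpha_bar, with m_k the first
    non-negative integer for which condition (7.31) holds."""
    fk = f(x)
    slope = sum(gi*di for gi, di in zip(grad_f(x), d))  # negative for a descent d
    for m in range(m_max):
        alpha = (beta**m) * alpha_bar
        xt = [xi + alpha*di for xi, di in zip(x, d)]
        if f(xt) <= fk + sigma*alpha*slope:
            return alpha
    raise RuntimeError("no acceptable step found")

f = lambda x: x[0]**2 + 4.0*x[1]**2
grad_f = lambda x: [2.0*x[0], 8.0*x[1]]
alpha_k = armijo(f, grad_f, [1.0, 1.0], [-2.0, -8.0])
```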

In any case, a good choice of the value ᾱ is mandatory. In this respect, one can proceed as follows. For a given value ᾱ, the second-degree polynomial Π_2 along the direction d^{(k)} is constructed, subject to the following interpolation constraints:

   Π_2(x^{(k)}) = f(x^{(k)}),
   Π_2(x^{(k)} + ᾱ d^{(k)}) = f(x^{(k)} + ᾱ d^{(k)}),
   Π_2′(x^{(k)}) = ∇f(x^{(k)})^T d^{(k)}.

Next, the value α̃ is computed such that Π_2 is minimized; then, we let ᾱ = α̃.
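In the notation above, writing φ(α) = f(x^{(k)} + α d^{(k)}), the three constraints fix the parabola Π_2(α) = φ(0) + φ′(0)α + cα² with c = [φ(ᾱ) − φ(0) − φ′(0)ᾱ]/ᾱ², whose minimizer has the closed form α̃ = −φ′(0) ᾱ² / (2[φ(ᾱ) − φ(0) − φ′(0)ᾱ]). A sketch (the sample data are illustrative):

```python
def quad_interp_step(f0, f1, g0, alpha_bar):
    """Minimizer of the parabola matching phi(0)=f0, phi(alpha_bar)=f1,
    phi'(0)=g0 (valid when the parabola opens upward, i.e. denom > 0)."""
    denom = 2.0*(f1 - f0 - g0*alpha_bar)
    return -g0*alpha_bar**2 / denom

# check against phi(a) = (a - 2)^2: phi(0) = 4, phi'(0) = -4, try alpha_bar = 1
alpha_tilde = quad_interp_step(4.0, 1.0, -4.0, 1.0)   # exact minimizer is a = 2
```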

7.2.4 Descent Methods for Quadratic Functions

A case of remarkable interest, where the parameter α_k can be exactly computed, is the problem of minimizing the quadratic function

   f(x) = (1/2) x^T A x − b^T x,        (7.35)

where A ∈ R^{n×n} is a symmetric positive definite matrix and b ∈ R^n. In such a case, as already seen in Section 4.3.3, a necessary condition for x* to be a minimizer of f is that x* is the solution of the linear system (3.2). Actually, it can be checked that if f is a quadratic function,

   ∇f(x) = Ax − b = −r,   H(x) = A.

As a consequence, all gradient-like iterative methods developed in Section 4.3.3 for linear systems can be extended tout court to solve minimization problems.


In particular, having fixed a descent direction d^{(k)}, we can determine the optimal value of the acceleration parameter α_k that appears in (7.25), in such a way as to find the point where the function f, restricted to the direction d^{(k)}, is minimized. Setting to zero the directional derivative, we get

   (d/dα_k) f(x^{(k)} + α_k d^{(k)}) = −d^{(k)T} r^{(k)} + α_k d^{(k)T} A d^{(k)} = 0,

from which the following expression for α_k is obtained:

   α_k = (d^{(k)T} r^{(k)}) / (d^{(k)T} A d^{(k)}).        (7.36)
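For the quadratic case, (7.36) and the resulting orthogonality r^{(k+1)T} d^{(k)} = 0 can be verified directly. A Python sketch with a small illustrative SPD matrix:

```python
def matvec(A, v):
    # matrix-vector product for a dense matrix stored as a list of rows
    return [sum(a*x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

A = [[4.0, 1.0], [1.0, 3.0]]          # symmetric positive definite (illustrative)
b = [1.0, 2.0]
xk = [2.0, 2.0]
rk = [bi - ai for bi, ai in zip(b, matvec(A, xk))]    # residual r = b - Ax
dk = rk[:]                                            # steepest descent choice
alpha_k = dot(dk, rk) / dot(dk, matvec(A, dk))        # formula (7.36)
x_next = [xi + alpha_k*di for xi, di in zip(xk, dk)]
r_next = [bi - ai for bi, ai in zip(b, matvec(A, x_next))]
```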

The error introduced by the iterative process (7.25) at the k-th step is

   ‖x^{(k+1)} − x*‖_A^2 = (x^{(k+1)} − x*)^T A (x^{(k+1)} − x*)
                        = ‖x^{(k)} − x*‖_A^2 + 2α_k d^{(k)T} A(x^{(k)} − x*) + α_k^2 d^{(k)T} A d^{(k)}.        (7.37)

On the other hand, ‖x^{(k)} − x*‖_A^2 = r^{(k)T} A^{-1} r^{(k)}, so that from (7.37) it follows that

   ‖x^{(k+1)} − x*‖_A^2 = ρ_k ‖x^{(k)} − x*‖_A^2,        (7.38)

having denoted by ρ_k = 1 − σ_k, with

   σ_k = (d^{(k)T} r^{(k)})^2 / [(d^{(k)T} A d^{(k)}) (r^{(k)T} A^{-1} r^{(k)})].

Since A is symmetric and positive definite, σ_k is always positive. Moreover, it can be directly checked that ρ_k is strictly less than 1, except when d^{(k)} is orthogonal to r^{(k)}, in which case ρ_k = 1.

The choice d^{(k)} = r^{(k)}, which leads to the steepest descent method, prevents this last circumstance from arising. In such a case, from (7.38) we get

   ‖x^{(k+1)} − x*‖_A ≤ [(λ_max − λ_min)/(λ_max + λ_min)] ‖x^{(k)} − x*‖_A,        (7.39)

having employed the following result.

Lemma 7.1 (Kantorovich inequality) Let A ∈ R^{n×n} be a symmetric positive definite matrix whose eigenvalues with largest and smallest modulus are λ_max and λ_min, respectively. Then, ∀ y ∈ R^n, y ≠ 0,

   (y^T y)^2 / [(y^T A y)(y^T A^{-1} y)] ≥ 4 λ_max λ_min / (λ_max + λ_min)^2.
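Lemma 7.1 can be spot-checked numerically. The sketch below uses a diagonal SPD matrix, for which the eigenvalues and A^{-1} are immediate (the data are illustrative):

```python
lam = [1.0, 2.0, 10.0]          # eigenvalues of a diagonal SPD matrix A
y = [1.0, -2.0, 0.5]            # an arbitrary nonzero test vector

yy = sum(v*v for v in y)                         # y^T y
yAy = sum(l*v*v for l, v in zip(lam, y))         # y^T A y
yAinvy = sum(v*v/l for l, v in zip(lam, y))      # y^T A^{-1} y

lhs = yy**2 / (yAy*yAinvy)
rhs = 4.0*max(lam)*min(lam) / (max(lam) + min(lam))**2
# Kantorovich inequality: lhs >= rhs
```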


It follows from (7.39) that, if A is ill-conditioned, the error reduction factor for the steepest descent method is close to 1, yielding slow convergence to the minimizer x*. As done in Chapter 4, this drawback can be overcome by introducing directions d^{(k)} that are mutually A-conjugate, i.e.,

   d^{(k)T} A d^{(m)} = 0   if k ≠ m.

The corresponding methods enjoy the following finite termination property.

Property 7.6 A method for computing the minimizer x* of the quadratic function (7.35) which employs A-conjugate directions terminates after at most n steps if the acceleration parameter α_k is selected as in (7.36). Moreover, for any k, x^{(k+1)} is the minimizer of f over the subspace generated by the vectors x^{(0)}, d^{(0)}, ..., d^{(k)}, and

   r^{(k+1)T} d^{(m)} = 0   ∀ m ≤ k.

The A-conjugate directions can be determined by following the procedure described in Section 4.3.4. Letting d^{(0)} = r^{(0)}, the conjugate gradient method for function minimization is

   d^{(k+1)} = r^{(k+1)} + β_k d^{(k)},
   β_k = −(r^{(k+1)T} A d^{(k)}) / (d^{(k)T} A d^{(k)}) = (r^{(k+1)T} r^{(k+1)}) / (r^{(k)T} r^{(k)}),
   x^{(k+1)} = x^{(k)} + α_k d^{(k)}.
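Property 7.6 and the recursion above can be illustrated on a small quadratic (Python sketch; the matrix is illustrative). With the exact steps (7.36), the method reaches the minimizer x* = A^{-1}b in at most n steps:

```python
def matvec(A, v):
    return [sum(a*x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]  # SPD (illustrative)
b = [1.0, 2.0, 3.0]
x = [0.0, 0.0, 0.0]
r = b[:]                    # r = b - Ax with x = 0
d = r[:]                    # d^{(0)} = r^{(0)}
for k in range(len(b)):                        # at most n = 3 steps
    Ad = matvec(A, d)
    alpha = dot(d, r) / dot(d, Ad)             # formula (7.36)
    x = [xi + alpha*di for xi, di in zip(x, d)]
    r_new = [ri - alpha*adi for ri, adi in zip(r, Ad)]
    beta = dot(r_new, r_new) / dot(r, r)       # conjugate gradient beta_k
    d = [rn + beta*di for rn, di in zip(r_new, d)]
    r = r_new

residual_norm = dot(r, r)**0.5                 # ~0 after n steps (finite termination)
```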

It satisfies the following error estimate:

   ‖x^{(k)} − x*‖_A ≤ 2 [(√K_2(A) − 1)/(√K_2(A) + 1)]^k ‖x^{(0)} − x*‖_A,

which can be improved by lowering the condition number of A, i.e., by resorting to the preconditioning techniques that have been dealt with in Section 4.3.2.

Remark 7.2 (The nonquadratic case) The conjugate gradient method can be extended to the case in which f is a nonquadratic function. However, in such an event, the acceleration parameter α_k cannot be exactly determined a priori, but requires the solution of a local minimization problem. Moreover, the parameters β_k can no longer be uniquely found. Among the most reliable formulae, we recall the one due to Fletcher-Reeves,

   β_1 = 0,   β_k = ‖∇f(x^{(k)})‖_2^2 / ‖∇f(x^{(k−1)})‖_2^2,   for k > 1,


and the one due to Polak-Ribière,

   β_1 = 0,   β_k = ∇f(x^{(k)})^T (∇f(x^{(k)}) − ∇f(x^{(k−1)})) / ‖∇f(x^{(k−1)})‖_2^2,   for k > 1.
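A minimal nonlinear conjugate gradient sketch with the Fletcher-Reeves choice of β_k follows (Python; the test function, the backtracking parameters, and the periodic restart every 10 steps are illustrative choices, not prescribed by the text). Since α_k cannot be computed a priori here, it comes from an Armijo-type line search.

```python
def f(x):
    # illustrative nonquadratic test function (a mild Rosenbrock-type valley)
    return (x[0] - 1.0)**2 + 10.0*(x[1] - x[0]**2)**2

def grad(x):
    return [2.0*(x[0] - 1.0) - 40.0*x[0]*(x[1] - x[0]**2),
            20.0*(x[1] - x[0]**2)]

x = [-1.2, 1.0]
g = grad(x)
d = [-gi for gi in g]
for k in range(2000):
    if sum(gi*gi for gi in g) < 1e-16:     # converged: gradient ~ 0
        break
    slope = sum(gi*di for gi, di in zip(g, d))
    if k % 10 == 0 or slope >= 0.0:        # periodic restart keeps d a descent direction
        d = [-gi for gi in g]
        slope = sum(gi*di for gi, di in zip(g, d))
    # Armijo backtracking for alpha_k
    alpha, fx = 1.0, f(x)
    while f([xi + alpha*di for xi, di in zip(x, d)]) > fx + 1e-4*alpha*slope:
        alpha *= 0.5
    x = [xi + alpha*di for xi, di in zip(x, d)]
    g_new = grad(x)
    beta = sum(a*a for a in g_new) / sum(a*a for a in g)   # Fletcher-Reeves
    d = [-gn + beta*di for gn, di in zip(g_new, d)]
    g = g_new
```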

7.2.5 Newton-like Methods for Function Minimization

An alternative is provided by Newton's method, which differs from its version for nonlinear systems in that it is now applied not to f but to its gradient.

Using the notation of Section 7.2.2, Newton's method for function minimization amounts to computing, for k = 0, 1, ..., until convergence,

   d^{(k)} = −H_k^{-1} ∇f(x^{(k)}),
   x^{(k+1)} = x^{(k)} + d^{(k)},        (7.40)

where x^{(0)} ∈ R^n is a given initial vector and having set H_k = H(x^{(k)}). The method can be derived by truncating the Taylor expansion of f around x^{(k)} at second order:

   f(x^{(k)} + p) ≈ f(x^{(k)}) + ∇f(x^{(k)})^T p + (1/2) p^T H_k p.        (7.41)

Selecting p in (7.41) in such a way that the new vector x^{(k+1)} = x^{(k)} + p satisfies ∇f(x^{(k+1)}) = 0, we end up with method (7.40), which thus converges in one step if f is quadratic.
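The one-step convergence on a quadratic can be checked directly. The sketch below applies (7.40) to f(x) = (1/2)x^T Ax − b^T x with a 2×2 illustrative SPD matrix; for such an f, H_k = A for every k and ∇f(x) = Ax − b:

```python
def solve2(H, rhs):
    # direct solution of a 2x2 linear system (Cramer's rule)
    det = H[0][0]*H[1][1] - H[0][1]*H[1][0]
    return [(rhs[0]*H[1][1] - rhs[1]*H[0][1]) / det,
            (H[0][0]*rhs[1] - H[1][0]*rhs[0]) / det]

A = [[3.0, 1.0], [1.0, 2.0]]           # Hessian of the quadratic (SPD, illustrative)
b = [1.0, 1.0]
x = [5.0, -4.0]                        # arbitrary starting point
g = [A[0][0]*x[0] + A[0][1]*x[1] - b[0],
     A[1][0]*x[0] + A[1][1]*x[1] - b[1]]          # grad f = Ax - b
d = solve2(A, [-g[0], -g[1]])          # Newton direction d = -H^{-1} grad f
x = [x[0] + d[0], x[1] + d[1]]         # one Newton step: x is now A^{-1} b
g_after = [A[0][0]*x[0] + A[0][1]*x[1] - b[0],
           A[1][0]*x[0] + A[1][1]*x[1] - b[1]]    # vanishes at the minimizer
```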

In the general case, a result analogous to Theorem 7.1 also holds for function minimization: method (7.40) is locally quadratically convergent to the minimizer x*. However, it is not convenient to use Newton's method from the beginning of the computation, unless x^{(0)} is sufficiently close to x*. Otherwise, indeed, H_k might fail to be invertible and the directions d^{(k)} could fail to be descent directions. Moreover, if H_k is not positive definite, nothing prevents the scheme (7.40) from converging to a saddle point or a maximizer, which are also points where ∇f vanishes. All these drawbacks, together with the high computational cost (recall that a linear system with matrix H_k must be solved at each iteration), prompt suitably modifying method (7.40), which leads to the so-called quasi-Newton methods.

A first modification, which applies to the case where H_k is not positive definite, yields the so-called Newton's method with shift. The idea is to prevent Newton's method from converging to non-minimizers of f by applying the scheme to a new Hessian matrix H̃_k = H_k + μ_k I_n, where, as usual, I_n denotes the identity matrix of order n and μ_k is selected in such a way that H̃_k is positive definite. The problem is to determine the shift μ_k with a reduced effort. This can be done, for instance, by applying the Gershgorin theorem to the matrix H̃_k (see Section 5.1). For further details on the subject, see [DS83] and [GMW81].
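One inexpensive choice suggested by the Gershgorin theorem: every eigenvalue of H_k lies in the union of the intervals [h_ii − R_i, h_ii + R_i] with R_i = Σ_{j≠i} |h_ij|, so taking μ_k slightly larger than max_i (R_i − h_ii) makes H_k + μ_k I_n strictly diagonally dominant with a positive diagonal, hence positive definite when symmetric. A Python sketch (the matrix is illustrative, and this μ_k may be far larger than the minimal admissible shift):

```python
def gershgorin_shift(H, margin=1e-8):
    """Choose mu so that H + mu*I is strictly diagonally dominant with a
    positive diagonal; for symmetric H this guarantees positive definiteness."""
    mu = 0.0
    for i, row in enumerate(H):
        radius = sum(abs(v) for j, v in enumerate(row) if j != i)
        mu = max(mu, radius - row[i] + margin)   # enforce H[i][i] + mu > radius
    return mu

H = [[1.0, 2.0], [2.0, -3.0]]                    # symmetric, indefinite (illustrative)
mu = gershgorin_shift(H)
H_shift = [[H[i][j] + (mu if i == j else 0.0) for j in range(2)] for i in range(2)]
```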

7.2.6 Quasi-Newton Methods

At the generic k-th iteration, a quasi-Newton method for function minimization performs the following steps:

1. compute the Hessian matrix H_k, or a suitable approximation B_k;

2. find a descent direction d^{(k)} (not necessarily coinciding with the direction provided by Newton's method), using H_k or B_k;

3. compute the acceleration parameter α_k;

4. update the solution, setting x^{(k+1)} = x^{(k)} + α_k d^{(k)}, according to a global convergence criterion.

In the particular case where d^{(k)} = −H_k^{-1} ∇f(x^{(k)}), the resulting scheme is called the damped Newton's method. To compute H_k or B_k, one can resort either to Newton's method or to secant-like methods, which will be considered in Section 7.2.7.

The criteria for selecting the parameter α_k that have been discussed in Section 7.2.3 can now be usefully employed to devise globally convergent methods. Property 7.5 ensures that there exist values of α_k satisfying (7.31), (7.33) or (7.31), (7.32).

Let us then assume that a sequence of iterates x^{(k)}, generated by a descent method for a given x^{(0)}, converges to a vector x*. This vector will not, in general, be a critical point of f. The following result gives conditions on the directions d^{(k)} which ensure that the limit x* of the sequence is also a critical point of f.

Property 7.7 (Convergence) Let f : R^n → R be a continuously differentiable function, and assume that there exists L > 0 such that

   ‖∇f(x) − ∇f(y)‖_2 ≤ L ‖x − y‖_2.

Then, if x^{(k)} is a sequence generated by a gradient-like method which fulfills (7.31) and (7.33), one (and only one) of the following events can occur:

1. ∇f(x^{(k)}) = 0 for some k;

2. lim_{k→∞} f(x^{(k)}) = −∞;

3. lim_{k→∞} ∇f(x^{(k)})^T d^{(k)} / ‖d^{(k)}‖_2 = 0.

Thus, except in the pathological cases where the directions d^{(k)} become too large or too small with respect to ∇f(x^{(k)}) or, even, are orthogonal to ∇f(x^{(k)}), any limit of the sequence x^{(k)} is a critical point of f. The convergence result for the sequence x^{(k)} can also be extended to the sequence f(x^{(k)}). Indeed, the following result holds.

Property 7.8 Let x(k) be a convergent sequence generated by a gradient-

like method, i.e., such that any limit of the sequence is also a critical point