Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

LOCAL CONVERGENCE 29

2.5.1 Convergence Rates

We will prove the simplest of the convergence results for Newton–CG, Theorem 2.5.1, from which Theorem 2.5.2 follows directly. We refer the reader to [74] and [154] for the proof of Theorem 2.5.3.

Theorem 2.5.1. Let the standard assumptions hold. Then there are $\delta$ and $K_I$ such that if $x_c \in \mathcal{B}(\delta)$, $s$ satisfies (2.47), and $x_+ = x_c + s$, then

(2.48) $\|e_+\| \le K_I (\|e_c\| + \eta_c)\,\|e_c\|.$

Proof. Let $\delta$ be small enough so that the conclusions of Lemma 2.3.1 and Theorem 2.3.2 hold. To prove the first assertion (2.48) note that if

$r = -\nabla^2 f(x_c)s - \nabla f(x_c)$

is the linear residual, then

$s + (\nabla^2 f(x_c))^{-1}\nabla f(x_c) = -(\nabla^2 f(x_c))^{-1} r$

and

(2.49) $e_+ = e_c + s = e_c - (\nabla^2 f(x_c))^{-1}\nabla f(x_c) - (\nabla^2 f(x_c))^{-1} r.$

Now, (2.47), (2.7), and (2.6) imply that

$\|s + (\nabla^2 f(x_c))^{-1}\nabla f(x_c)\| \le \|(\nabla^2 f(x_c))^{-1}\|\,\eta_c\,\|\nabla f(x_c)\| \le 4\kappa(\nabla^2 f(x^*))\,\eta_c\,\|e_c\|.$

Hence, using (2.49) and Theorem 2.3.2, we have that

$\|e_+\| \le \|e_c - (\nabla^2 f(x_c))^{-1}\nabla f(x_c)\| + 4\kappa(\nabla^2 f(x^*))\,\eta_c\,\|e_c\| \le K\|e_c\|^2 + 4\kappa(\nabla^2 f(x^*))\,\eta_c\,\|e_c\|,$

where $K$ is the constant from (2.8). If we set

$K_I = K + 4\kappa(\nabla^2 f(x^*)),$

the proof is complete.

Theorem 2.5.2. Let the standard assumptions hold. Then there are $\delta$ and $\bar\eta$ such that if $x_0 \in \mathcal{B}(\delta)$ and $\{\eta_n\} \subset [0, \bar\eta]$, then the inexact Newton iteration

$x_{n+1} = x_n + s_n,$

where

$\|\nabla^2 f(x_n)s_n + \nabla f(x_n)\| \le \eta_n \|\nabla f(x_n)\|,$

converges q-linearly to $x^*$. Moreover

• if $\eta_n \to 0$ the convergence is q-superlinear, and

• if $\eta_n \le K_\eta \|\nabla f(x_n)\|^p$ for some $K_\eta > 0$ the convergence is q-superlinear with q-order $1 + p$.


The similarity between Theorem 2.5.2 and Theorem 2.3.5, the convergence result for the chord method, should be clear. Rather than require that the approximate Hessian be accurate, we demand that the linear iteration produce a sufficiently small relative residual. Theorem 2.5.3 is the remarkable statement that any reduction in the relative linear residual will suffice for linear convergence in a certain norm. This statement implies [154] that $\|\nabla f(x_n)\|$ will converge to zero q-linearly, or, equivalently, that $x_n \to x^*$ q-linearly with respect to $\|\cdot\|_*$, which is defined by

$\|x\|_* = \|\nabla^2 f(x^*)\,x\|.$

Theorem 2.5.3. Let the standard assumptions hold. Then there is $\delta$ such that if $x_c \in \mathcal{B}(\delta)$, $s$ satisfies (2.47), $x_+ = x_c + s$, and $\eta_c \le \eta < \bar\eta < 1$, then

(2.50) $\|e_+\|_* \le \bar\eta\,\|e_c\|_*.$

Theorem 2.5.4. Let the standard assumptions hold. Then there is $\delta$ such that if $x_0 \in \mathcal{B}(\delta)$ and $\{\eta_n\} \subset [0, \eta]$ with $\eta < \bar\eta < 1$, then the inexact Newton iteration

$x_{n+1} = x_n + s_n,$

where

$\|\nabla^2 f(x_n)s_n + \nabla f(x_n)\| \le \eta_n \|\nabla f(x_n)\|,$

converges q-linearly with respect to $\|\cdot\|_*$ to $x^*$. Moreover

• if $\eta_n \to 0$ the convergence is q-superlinear, and

• if $\eta_n \le K_\eta \|\nabla f(x_n)\|^p$ for some $K_\eta > 0$ the convergence is q-superlinear with q-order $1 + p$.

Q-linear convergence of $\{x_n\}$ to a local minimizer with respect to $\|\cdot\|_*$ is equivalent to q-linear convergence of $\{\|\nabla f(x_n)\|\}$ to zero. We will use the rate of convergence of $\{\|\nabla f(x_n)\|\}$ in our computational examples to compare various methods.

2.5.2 Implementation of Newton–CG

Our implementation of Newton–CG approximately solves the equation for the Newton step with CG. We make the implicit assumption that $\nabla f$ has been computed sufficiently accurately for $D_h^2 f(x : w)$ to be a useful approximation to the Hessian–vector product $\nabla^2 f(x)w$.

Forward Difference CG

Algorithm fdcg is an implementation of the solution by CG of the equation for the Newton step (2.4). In this algorithm we take care to identify failure in CG (i.e., detection of a vector $p$ for which $p^T Hp \le 0$). This failure either means that $H$ is singular ($p^T Hp = 0$; see exercise 2.7.3) or that $p^T Hp < 0$, i.e., $p$ is a direction of negative curvature. The algorithms we will discuss in §3.3.7 make good use of directions of negative curvature. The initial iterate for the forward difference CG iteration should be the zero vector. In this way the first iterate will give a steepest descent step, a fact that is very useful. The inputs to Algorithm fdcg are the current point $x$, the objective $f$, the forcing term $\eta$, and a limit on the number of iterations kmax. The output is the inexact Newton direction $d$. Note that in step 2b $D_h^2 f(x : p)$ is used as an approximation to $\nabla^2 f(x)p$.

Algorithm 2.5.1. fdcg(x, f, $\eta$, kmax, d)

1. $r = -\nabla f(x)$, $\rho_0 = \|r\|^2$, $k = 1$, $d = 0$.

2. Do While $\sqrt{\rho_{k-1}} > \eta \|\nabla f(x)\|$ and $k <$ kmax

   (a) if $k = 1$ then $p = r$
       else
       $\beta = \rho_{k-1}/\rho_{k-2}$ and $p = r + \beta p$

   (b) $w = D_h^2 f(x : p)$
       If $p^T w = 0$ signal indefiniteness; stop.
       If $p^T w < 0$ signal negative curvature; stop.

   (c) $\alpha = \rho_{k-1}/p^T w$

   (d) $d = d + \alpha p$

   (e) $r = r - \alpha w$

   (f) $\rho_k = \|r\|^2$

   (g) $k = k + 1$
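For concreteness, Algorithm fdcg can be sketched in Python (the book's own codes are MATLAB; this translation is illustrative, not the distributed code). The sketch takes a gradient routine grad rather than f itself, since only gradient values are needed, and approximates the Hessian–vector product $D_h^2 f(x : p)$ by a forward difference of the gradient; the helper name hessvec_fd and the unscaled increment h are our simplifications.

```python
import math

def hessvec_fd(grad, x, p, h=1e-6):
    """Forward-difference Hessian-vector product, a simplified stand-in for
    the book's D_h^2 f(x : p): (grad(x + h*p) - grad(x)) / h."""
    if all(pi == 0.0 for pi in p):
        return [0.0] * len(p)
    g0 = grad(x)
    g1 = grad([xi + h * pi for xi, pi in zip(x, p)])
    return [(a - b) / h for a, b in zip(g1, g0)]

def fdcg(x, grad, eta, kmax, h=1e-6):
    """CG on the Newton step equation; returns (d, status)."""
    g = grad(x)
    r = [-gi for gi in g]                      # r = -grad f(x)
    rho = [sum(ri * ri for ri in r)]           # rho_0 = ||r||^2
    gnorm = math.sqrt(sum(gi * gi for gi in g))
    d = [0.0] * len(x)                         # initial iterate: zero vector
    p = None
    k = 1
    while math.sqrt(rho[k - 1]) > eta * gnorm and k < kmax:
        if k == 1:
            p = r[:]
        else:
            beta = rho[k - 1] / rho[k - 2]
            p = [ri + beta * pi for ri, pi in zip(r, p)]
        w = hessvec_fd(grad, x, p, h)          # stand-in for D_h^2 f(x : p)
        ptw = sum(pi * wi for pi, wi in zip(p, w))
        if ptw == 0.0:
            return d, "indefinite"
        if ptw < 0.0:
            return d, "negative curvature"
        alpha = rho[k - 1] / ptw
        d = [di + alpha * pi for di, pi in zip(d, p)]
        r = [ri - alpha * wi for ri, wi in zip(r, w)]
        rho.append(sum(ri * ri for ri in r))
        k += 1
    return d, "ok"
```

On a strictly convex quadratic the gradient is linear, the forward difference is exact, and the sketch reproduces the exact Newton step in at most $n$ CG iterations.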

Preconditioning can be incorporated into a Newton–CG algorithm by using a forward difference formulation, too. Here, as in [154], we denote the preconditioner by $M$. Aside from $M$, the inputs and output of Algorithm fdpcg are the same as those for Algorithm fdcg.

Algorithm 2.5.2. fdpcg(x, f, M, $\eta$, kmax, d)

1. $r = -\nabla f(x)$, $\rho_0 = \|r\|^2$, $k = 1$, $d = 0$.

2. Do While $\sqrt{\rho_{k-1}} > \eta \|\nabla f(x)\|$ and $k <$ kmax

   (a) $z = Mr$

   (b) $\tau_{k-1} = z^T r$

   (c) if $k = 1$ then $\beta = 0$ and $p = z$
       else
       $\beta = \tau_{k-1}/\tau_{k-2}$, $p = z + \beta p$

   (d) $w = D_h^2 f(x : p)$
       If $p^T w = 0$ signal indefiniteness; stop.
       If $p^T w < 0$ signal negative curvature; stop.

   (e) $\alpha = \tau_{k-1}/p^T w$

   (f) $d = d + \alpha p$

   (g) $r = r - \alpha w$

   (h) $\rho_k = r^T r$

   (i) $k = k + 1$
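A Python sketch of Algorithm fdpcg follows, again only illustrative of the MATLAB codes. Here hessvec(x, p) is a user-supplied stand-in for $D_h^2 f(x : p)$ and M_apply applies the preconditioner $M$ to a vector; both callable interfaces are our own, hypothetical conventions.

```python
import math

def fdpcg(x, grad, hessvec, M_apply, eta, kmax):
    """Preconditioned CG on the Newton step equation; returns (d, status).
    hessvec(x, p) stands in for D_h^2 f(x : p); M_apply(r) computes M r."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    g = grad(x)
    r = [-gi for gi in g]                # r = -grad f(x)
    gnorm = math.sqrt(dot(g, g))
    rho = dot(r, r)                      # rho_0 = ||r||^2
    d = [0.0] * len(x)
    p, tau_prev = None, None
    k = 1
    while math.sqrt(rho) > eta * gnorm and k < kmax:
        z = M_apply(r)                   # z = M r
        tau = dot(z, r)                  # tau_{k-1} = z^T r
        if k == 1:
            p = z[:]
        else:
            beta = tau / tau_prev
            p = [zi + beta * pi for zi, pi in zip(z, p)]
        w = hessvec(x, p)
        ptw = dot(p, w)
        if ptw == 0.0:
            return d, "indefinite"
        if ptw < 0.0:
            return d, "negative curvature"
        alpha = tau / ptw
        d = [di + alpha * pi for di, pi in zip(d, p)]
        r = [ri - alpha * wi for ri, wi in zip(r, w)]
        rho = dot(r, r)                  # rho_k = r^T r
        tau_prev = tau
        k += 1
    return d, "ok"
```

A convenient sanity check: with $M = (\nabla^2 f(x))^{-1}$ the iteration finds the exact Newton step of a quadratic in one pass through the loop.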

In our formulation of Algorithms fdcg and fdpcg, indefiniteness is a signal that we are not sufficiently near a minimum for the theory in this section to hold. In §3.3.7 we show how negative curvature can be exploited when far from the solution.

One view of preconditioning is that it is no more than a rescaling of the independent variables. Suppose, rather than (1.2), we seek to solve

(2.51) $\min_y \hat f(y),$

where $\hat f(y) = f(M^{1/2} y)$ and $M$ is spd. If $y^*$ is a local minimizer of $\hat f$, then $x^* = M^{1/2} y^*$ is a local minimizer of $f$ and the two problems are equivalent. Moreover, if $x = M^{1/2} y$ and $\nabla_x$ and $\nabla_y$ denote gradients in the $x$ and $y$ coordinates, then

$\nabla_y \hat f(y) = M^{1/2} \nabla_x f(x)$

and

$\nabla_y^2 \hat f(y) = M^{1/2} (\nabla_x^2 f(x)) M^{1/2}.$

Hence, the scaling matrix plays the role of the square root of the preconditioner for the preconditioned conjugate gradient algorithm.
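The first of these identities is easy to check numerically on a small example; the quadratic objective, the diagonal $M$, and the test point below are all illustrative choices of ours, not from the text.

```python
def f(x):
    # Illustrative quadratic: f(x) = x1^2 + 2*x2^2.
    return x[0] ** 2 + 2.0 * x[1] ** 2

def grad_f(x):
    # Analytic gradient of the quadratic above.
    return [2.0 * x[0], 4.0 * x[1]]

Msqrt = [2.0, 3.0]  # M^{1/2} for the diagonal spd choice M = diag(4, 9)

def fhat(y):
    # fhat(y) = f(M^{1/2} y), the rescaled objective of (2.51).
    return f([m * yi for m, yi in zip(Msqrt, y)])

def fd_grad(fun, y, h=1e-6):
    # Central-difference gradient, used as an independent check.
    g = []
    for i in range(len(y)):
        yp = y[:]; yp[i] += h
        ym = y[:]; ym[i] -= h
        g.append((fun(yp) - fun(ym)) / (2.0 * h))
    return g

y = [0.3, -0.7]
x = [m * yi for m, yi in zip(Msqrt, y)]
num = fd_grad(fhat, y)                               # numerical grad_y fhat(y)
pred = [m * g for m, g in zip(Msqrt, grad_f(x))]     # M^{1/2} grad_x f(x)
```

The two vectors num and pred agree to finite-difference accuracy, confirming $\nabla_y \hat f(y) = M^{1/2}\nabla_x f(x)$ on this example.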

Newton–CG

The theory guarantees that if $x_0$ is near enough to a local minimizer then $\nabla^2 f(x_n)$ will be spd for the entire iteration and $x_n$ will converge rapidly to $x^*$. Hence, Algorithm newtcg will not terminate with failure because of an increase in $f$ or an indefinite Hessian. Note that both the forcing term $\eta$ and the preconditioner $M$ can change as the iteration progresses.

Algorithm 2.5.3. newtcg(x, f, $\tau$, $\eta$)

1. $r_c = r_0 = \|\nabla f(x)\|$

2. Do while $\|\nabla f(x)\| > \tau_r r_0 + \tau_a$

   (a) Select $\eta$ and a preconditioner $M$.

   (b) fdpcg(x, f, M, $\eta$, kmax, d)
       If indefiniteness has been detected, terminate with failure.

   (c) $x = x + d$.

   (d) Evaluate $f$ and $\nabla f(x)$.
       If $f$ has not decreased, terminate with failure.

   (e) $r_+ = \|\nabla f(x)\|$, $\sigma = r_+/r_c$, $r_c = r_+$.

The implementation of Newton–CG is simple, but, as presented in Algorithm newtcg, incomplete. The algorithm requires substantial modification to be able to generate the good initial data that the local theory requires. We return to this issue in §3.3.7.
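The outer loop of Algorithm newtcg can be sketched as follows. This is a simplified skeleton: the inner solver is passed in as a callable standing in for fdpcg, the forcing term is frozen at $\eta = 0.1$ rather than selected adaptively, and the ratio $\sigma$ of step 2(e) is omitted.

```python
import math

def newtcg(x, f, grad, inner_solve, tau_r, tau_a, max_outer=50):
    """Sketch of the outer Newton-CG loop. inner_solve(x, eta) stands in for
    fdpcg and must return (d, status). Returns (x, gradient-norm history)."""
    norm = lambda v: math.sqrt(sum(vi * vi for vi in v))
    r0 = norm(grad(x))                  # r_c = r_0 = ||grad f(x)||
    fc = f(x)
    history = [r0]
    for _ in range(max_outer):
        if norm(grad(x)) <= tau_r * r0 + tau_a:
            break                       # relative/absolute termination test
        eta = 0.1                       # fixed forcing term (a simplification)
        d, status = inner_solve(x, eta)
        if status != "ok":
            raise RuntimeError("indefiniteness detected; terminate with failure")
        x = [xi + di for xi, di in zip(x, d)]
        fnew = f(x)
        if fnew >= fc:
            raise RuntimeError("f did not decrease; terminate with failure")
        fc = fnew
        history.append(norm(grad(x)))
    return x, history
```

With an inner solver that returns the exact Newton step of a quadratic, the loop terminates after a single outer iteration, matching the local theory.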

There is a subtle problem with Algorithm fdpcg in that the algorithm is equivalent to the application of the preconditioned conjugate gradient algorithm to the matrix $B$ that is determined by

$Bp_i = w_i = D_h^2 f(x : p_i), \quad 1 \le i \le N.$

However, since the map $p \to D_h^2 f(x : p)$ is not linear in $p$, the quality of $B$ as an approximation to $\nabla^2 f(x)$ may degrade as the linear iteration progresses. Usually this will not cause problems unless many iterations are needed to satisfy the inexact Newton condition. However, if one does not see the expected rate of convergence in a Newton–CG iteration, this could be a factor [128]. One partial remedy is to use a centered-difference Hessian–vector product [162], which reduces the error in $B$. In exercise 2.7.15 we discuss a more complex and imaginative way to compute accurate Hessians.
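The remedy is easy to illustrate: a centered difference of the gradient approximates $\nabla^2 f(x)p$ to second order in $h$, while a forward difference is only first order. The nonquadratic objective and evaluation point below are illustrative choices of ours.

```python
import math

def hessvec_forward(grad, x, p, h):
    """Forward-difference Hessian-vector product; error is O(h)."""
    g0 = grad(x)
    g1 = grad([xi + h * pi for xi, pi in zip(x, p)])
    return [(a - b) / h for a, b in zip(g1, g0)]

def hessvec_centered(grad, x, p, h):
    """Centered-difference Hessian-vector product; error is O(h^2)."""
    gp = grad([xi + h * pi for xi, pi in zip(x, p)])
    gm = grad([xi - h * pi for xi, pi in zip(x, p)])
    return [(a - b) / (2.0 * h) for a, b in zip(gp, gm)]

# Illustrative nonquadratic objective f(x) = exp(x1) + x1*x2^2,
# with its analytic gradient and Hessian-vector product.
grad = lambda x: [math.exp(x[0]) + x[1] ** 2, 2.0 * x[0] * x[1]]

def hessvec_exact(x, p):
    return [math.exp(x[0]) * p[0] + 2.0 * x[1] * p[1],
            2.0 * x[1] * p[0] + 2.0 * x[0] * p[1]]

x, p, h = [0.5, 1.5], [1.0, -2.0], 1e-4
err_f = max(abs(a - b) for a, b in
            zip(hessvec_forward(grad, x, p, h), hessvec_exact(x, p)))
err_c = max(abs(a - b) for a, b in
            zip(hessvec_centered(grad, x, p, h), hessvec_exact(x, p)))
```

At this point the centered difference is several orders of magnitude more accurate than the forward difference, at the cost of one extra gradient evaluation per product.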


2.6 Examples

In this section we used the collection of MATLAB codes but disabled the features (see Chapter 3) that assist in convergence when far from a minimizer. We took care to make certain that the initial iterates were near enough to the minimizer so that the observations of local convergence corresponded to the theory. In practical optimization problems, good initial data is usually not available and the globally convergent methods discussed in Chapter 3 must be used to start the iteration.

The plots in this section have the characteristics of local convergence in that both the gradient norms and function values are decreasing. The reader should contrast this with the examples in Chapter 3.

2.6.1 Parameter Identification

For this example, $M = 100$, and the observations are those of the exact solution with $c = k = 1$, which we computed analytically. We used $T = 10$ and $u_0 = 10$. We computed the displacement and solved the sensitivity equations with the stiff solver ode15s. These results could be obtained as well in a FORTRAN environment using, for example, the LSODE code [228]. The relative and absolute error tolerances for the integrator were both set to $10^{-8}$. In view of the expected accuracy of the gradient, we set the forward difference increment for the approximate Hessian to $h = 10^{-4}$. We terminated the iterations when $\|\nabla f\| < 10^{-4}$. Our reasons for this are that, for the zero residual problem considered here, the standard assumptions imply that $f(x) = O(\|\nabla f(x)\|^2)$ for $x$ near the solution. Hence, since we can only resolve $f$ to an accuracy of $10^{-8}$, iteration beyond the point where $\|\nabla f\| < 10^{-4}$ cannot be expected to lead to a further decrease in $f$. In fact we observed this in our computations.

The iterations are very sensitive to the initial iterate. We used $x_0 = (1.1, 1.05)^T$; initial iterates much worse than that caused Newton's method to fail. The more robust methods from Chapter 3 should be viewed as essential components of even a simple optimization code.

In Table 2.1 we tabulate the history of the iteration for both the Newton and Gauss–Newton methods. As expected for a small residual problem, Gauss–Newton performs well and, for this example, even converges in fewer iterations. The real benefit of Gauss–Newton is that computation of the Hessian can be avoided, saving considerable computational work by exploiting the structure of the problem. In the computation reported here, the MATLAB flops command indicates that the Newton iteration took roughly 1.9 million floating point operations and Gauss–Newton roughly 650 thousand. This difference would be much more dramatic if there were more than two parameters or the cost of an evaluation of $f$ depended on $N$ in a significant way (which it does not in this example).

Table 2.1: Parameter identification problem, locally convergent iterations.

           Newton                       Gauss–Newton
 n    $\|\nabla f(x_n)\|$   $f(x_n)$      $\|\nabla f(x_n)\|$   $f(x_n)$
 0    2.33e+01      7.88e-01     2.33e+01      7.88e-01
 1    6.87e+00      9.90e-02     1.77e+00      6.76e-03
 2    4.59e-01      6.58e-04     1.01e-02      4.57e-07
 3    2.96e-03      3.06e-08     9.84e-07      2.28e-14
 4    2.16e-06      4.15e-14

Figure 2.1 is a graphical representation of the convergence history from Table 2.1. We think that the plots are a more effective way to understand iteration statistics and will present mostly graphs for the remainder of the book. The concavity of the plots of the gradient norms is the signature of superlinear convergence.

[Figure 2.1 here: four semilog panels plotting Gradient Norm and Function Value against Iterations for Newton's Method (top row) and the Gauss–Newton Method (bottom row).]

Figure 2.1: Local Optimization for the Parameter ID Problem

We next illustrate the difference between Gauss–Newton and Newton on a nonzero residual problem. We use the same example as before with the observations randomly perturbed. We used the MATLAB rand function for this, perturbing the samples of the analytic solution by .5*rand(M,1). The least squares residual is about 3.6 and the plots in Figure 2.2 indicate that Newton's method is still converging quadratically, but the rate of Gauss–Newton is linear. The linear convergence of Gauss–Newton can be seen clearly from the linear semilog plot of the gradient norms. Even so, the Gauss–Newton iteration was more efficient, in terms of floating point operations, than Newton's method. The Gauss–Newton iteration took roughly 1 million floating point operations while the Newton iteration took 1.4 million.