So computing the gradient requires u and y, hence a solution of the state equation, and p, which requires a solution of (1.20), a final-value problem for the adjoint equation. In the general case, (1.17) is nonlinear, but (1.20) is a linear problem for p, which should be expected to be easier to solve. This is the motivation for our claim that a gradient evaluation costs little more than a function evaluation.

The discrete problems of interest here are constructed by solving (1.17) by numerical integration. After doing that, one can derive an adjoint variable and compute gradients using a discrete form of (1.19). However, as noted in [139], the equation for the adjoint variable of the discrete problem is usually not a discretization of (1.20). For the forward Euler method the two coincide: the discretization of the adjoint equation is the adjoint equation for the discrete problem, and we use that discretization here for that reason.

The fully discrete problem is min_u f, where u ∈ R^N and

f(u) = Σ_{j=1}^{N} L((y)_j, (u)_j, j),

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

BASIC CONCEPTS 11

and the states {(y)_j} are given by the Euler recursion

(y)_{j+1} = (y)_j + hφ((y)_j, (u)_j, j) for j = 0, . . . , N − 1,

where h = T/(N − 1) and (y)_0 is given. Then

(∇f(u))_j = (p)_j φ_u((y)_j, (u)_j, j) + L_u((y)_j, (u)_j, j),

where (p)_N = 0 and

(p)_{j−1} = (p)_j + h[(p)_j φ_y((y)_j, (u)_j, j) + L_y((y)_j, (u)_j, j)] for j = N, . . . , 1.
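For a concrete instance of this recursion, the sketch below applies forward Euler to a hypothetical scalar model φ(y, u) = −y + u with running cost L(y, u) = (y² + u²)/2 (neither is from the book), and computes the gradient by a reverse (adjoint) sweep. The arrays are zero-based, so the adjoint recursion takes a slightly different but equivalent form from the one above; the result is checked against central differences.

```python
import math

# Hypothetical scalar example (not from the book): phi(y, u) = -y + u,
# running cost L(y, u) = 0.5*(y*y + u*u), forward Euler with N steps.
N, T, y0 = 50, 1.0, 1.0
h = T / (N - 1)

def forward(u):
    """Forward Euler state recursion; returns trajectory and objective."""
    y = [y0] * N
    for j in range(N - 1):
        y[j + 1] = y[j] + h * (-y[j] + u[j])
    f = sum(0.5 * (y[j] ** 2 + u[j] ** 2) for j in range(N))
    return y, f

def gradient(u):
    """Adjoint (reverse) sweep: lam accumulates d f / d y[j]."""
    y, _ = forward(u)
    g = [0.0] * N
    lam = y[N - 1]                      # L_y at the final step
    g[N - 1] = u[N - 1]                 # L_u at the final step
    for j in range(N - 2, -1, -1):
        g[j] = u[j] + lam * h * 1.0     # L_u + lam_{j+1} * h * phi_u
        lam = y[j] + lam * (1.0 - h)    # L_y + lam_{j+1} * (1 + h*phi_y)
    return g

u = [math.sin(j * h) for j in range(N)]
g = gradient(u)

# Check a few components against central differences.
eps = 1e-6
for j in (0, N // 2, N - 1):
    up, um = list(u), list(u)
    up[j] += eps
    um[j] -= eps
    fd = (forward(up)[1] - forward(um)[1]) / (2 * eps)
    assert abs(fd - g[j]) < 1e-6, (j, fd, g[j])
print("adjoint gradient matches finite differences")
```

Note that the adjoint sweep costs one backward pass, regardless of N, which is the discrete counterpart of the claim above that a gradient costs little more than a function evaluation.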

1.6.2 Parameter Identification

This example, taken from [13], will appear throughout the book. The problem is small with N = 2. The goal is to identify the damping c and spring constant k of a linear spring by minimizing the difference between a numerical prediction and measured data. The experimental scenario is that the spring-mass system will be set into motion by an initial displacement from equilibrium and measurements of displacements will be taken at equally spaced increments in time.

The motion of an unforced harmonic oscillator satisfies the initial value problem

(1.21)    u'' + cu' + ku = 0; u(0) = u_0, u'(0) = 0,

on the interval [0, T]. We let x = (c, k)^T be the vector of unknown parameters and, when the dependence on the parameters needs to be explicit, we will write u(t : x) instead of u(t) for the solution of (1.21). If the displacement is sampled at {t_j}_{j=1}^M, where t_j = (j − 1)T/(M − 1), and the observations for u are {u_j}_{j=1}^M, then the objective function is

(1.22)    f(x) = (1/2) Σ_{j=1}^{M} |u(t_j : x) − u_j|².

This is an example of a nonlinear least squares problem.

u is differentiable with respect to x when c² − 4k ≠ 0. In that case, the gradient of f is

(1.23)    ∇f(x) = ( Σ_{j=1}^{M} (u(t_j : x) − u_j) ∂u(t_j : x)/∂c ,  Σ_{j=1}^{M} (u(t_j : x) − u_j) ∂u(t_j : x)/∂k )^T.

We can compute the derivatives of u(t : x) with respect to the parameters by solving the sensitivity equations. Differentiating (1.21) with respect to c and k and setting w_1 = ∂u/∂c and w_2 = ∂u/∂k we obtain

(1.24)    w_1'' + u' + cw_1' + kw_1 = 0; w_1(0) = w_1'(0) = 0,
          w_2'' + cw_2' + u + kw_2 = 0; w_2(0) = w_2'(0) = 0.
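As an illustration, the following sketch integrates (1.21) together with the sensitivity equations (1.24) as a single first-order system with a hand-written fixed-step RK4 integrator, and checks the gradient (1.23) against a finite difference. The book's own experiments use MATLAB's ode15s; the Python code, the parameter values, and the data u_j = 0 here are illustrative assumptions.

```python
# Illustrative check of (1.23)-(1.24): fixed-step RK4 on the combined state
# z = (u, u', w1, w1', w2, w2') with hypothetical data uj = 0.
def rhs(z, c, k):
    u, up, w1, w1p, w2, w2p = z
    return (up, -c * up - k * u,
            w1p, -up - c * w1p - k * w1,    # sensitivity w.r.t. c
            w2p, -u - c * w2p - k * w2)     # sensitivity w.r.t. k

def solve(c, k, u0=10.0, T=5.0, M=101):
    """Return samples of (u, w1, w2) at tj = (j-1)T/(M-1), j = 1..M."""
    n_sub = 10                              # RK4 substeps per sample interval
    h = T / ((M - 1) * n_sub)
    z = (u0, 0.0, 0.0, 0.0, 0.0, 0.0)
    out = [(z[0], z[2], z[4])]
    for _ in range((M - 1) * n_sub):
        k1 = rhs(z, c, k)
        k2 = rhs(tuple(zi + 0.5 * h * ki for zi, ki in zip(z, k1)), c, k)
        k3 = rhs(tuple(zi + 0.5 * h * ki for zi, ki in zip(z, k2)), c, k)
        k4 = rhs(tuple(zi + h * ki for zi, ki in zip(z, k3)), c, k)
        z = tuple(zi + h / 6 * (a + 2 * b + 2 * cc + d)
                  for zi, a, b, cc, d in zip(z, k1, k2, k3, k4))
        out.append((z[0], z[2], z[4]))
    return out[::n_sub]                     # keep only the M sample points

def f_and_grad(c, k, data):
    samples = solve(c, k)
    f = sum(0.5 * (u - d) ** 2 for (u, _, _), d in zip(samples, data))
    gc = sum((u - d) * w1 for (u, w1, _), d in zip(samples, data))
    gk = sum((u - d) * w2 for (u, _, w2), d in zip(samples, data))
    return f, (gc, gk)

data = [0.0] * 101                          # hypothetical observations
c, k = 1.0, 4.0
f, (gc, gk) = f_and_grad(c, k, data)

eps = 1e-5
fd_c = (f_and_grad(c + eps, k, data)[0]
        - f_and_grad(c - eps, k, data)[0]) / (2 * eps)
assert abs(fd_c - gc) < 1e-4 * max(1.0, abs(gc))
```

Because the sensitivity stages reuse the state stages, differentiating the RK4 step and applying RK4 to (1.24) commute, so the gradient from the sensitivities agrees with finite differences of the discrete solution to roundoff-level accuracy.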

If c is large, the initial value problems (1.21) and (1.24) will be stiff and one should use a good variable step stiff integrator. We refer the reader to [110], [8], [235] for details on this issue. In the numerical examples in this book we used the MATLAB code ode15s from [236]. Stiffness can also arise in the optimal control problem from §1.6.1 but does not in the specific examples we use in this book. We caution the reader that when one uses an ODE code the results may only be expected to be accurate to the tolerances input to the code. This limitation on the accuracy must be taken into account, for example, when approximating the Hessian by differences.


12 ITERATIVE METHODS FOR OPTIMIZATION

1.6.3 Convex Quadratics

While convex quadratic problems are, in a sense, the easiest of optimization problems, they present surprising challenges to the sampling algorithms presented in Part II and can illustrate fundamental problems with classical gradient-based methods like the steepest descent algorithm from §3.1. Our examples will all take N = 2, b = 0, and

H = ( λ_s   0
       0   λ_l ),

where 0 < λ_s ≤ λ_l. The function to be minimized is

f(x) = x^T Hx

and the minimizer is x* = (0, 0)^T.

As λ_l/λ_s becomes large, the level curves of f become elongated. When λ_s = λ_l = 1, min_x f is the easiest problem in optimization.

1.7 Exercises on Basic Concepts

1.7.1. Prove Theorem 1.2.2.

1.7.2. Consider the parameter identification problem for x = (c, k, ω, φ)^T ∈ R⁴ associated with the initial value problem

u'' + cu' + ku = sin(ωt + φ); u(0) = 10, u'(0) = 0.

For what values of x is u differentiable? Derive the sensitivity equations for those values of x for which u is differentiable.

1.7.3. Solve the system of sensitivity equations from exercise 1.7.2 numerically for c = 10,

k = 1, ω = π, and φ = 0 using the integrator of your choice. What happens if you use a

nonstiff integrator?

1.7.4. Let N = 2, d = (1, 1)^T, and let f(x) = x^T d + x^T x. Compute, by hand, the minimizer using conjugate gradient iteration.

1.7.5. For the same f as in exercise 1.7.4 solve the constrained optimization problem

min_{x∈U} f(x),

where U is the circle centered at (0, 0)^T of radius 1/3. You can solve this by inspection; no computer and very little mathematics is needed.


Chapter 2

Local Convergence of Newton's Method

By a local convergence method we mean one that requires that the initial iterate x_0 is close to a local minimizer x* at which the sufficient conditions hold.

2.1 Types of Convergence

We begin with the standard taxonomy of convergence rates [84], [154], [211].

Definition 2.1.1. Let {x_n} ⊂ R^N and x* ∈ R^N. Then

• x_n → x* q-quadratically if x_n → x* and there is K > 0 such that

‖x_{n+1} − x*‖ ≤ K‖x_n − x*‖².

• x_n → x* q-superlinearly with q-order α > 1 if x_n → x* and there is K > 0 such that

‖x_{n+1} − x*‖ ≤ K‖x_n − x*‖^α.

• x_n → x* q-superlinearly if

lim_{n→∞} ‖x_{n+1} − x*‖ / ‖x_n − x*‖ = 0.

• x_n → x* q-linearly with q-factor σ ∈ (0, 1) if

‖x_{n+1} − x*‖ ≤ σ‖x_n − x*‖

for n sufficiently large.

Definition 2.1.2. An iterative method for computing x* is said to be locally (q-quadratically, q-superlinearly, q-linearly, etc.) convergent if the iterates converge to x* (q-quadratically, q-superlinearly, q-linearly, etc.) given that the initial data for the iteration is sufficiently good.

We remind the reader that a q-superlinearly convergent sequence is also q-linearly convergent with q-factor σ for any σ ∈ (0, 1). A q-quadratically convergent sequence is q-superlinearly convergent with q-order of 2.
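The distinctions in Definition 2.1.1 can be seen in a few model error sequences (a hypothetical illustration, not from the book): a q-linear sequence loses a fixed fraction of its error each step, while a q-quadratic sequence roughly doubles the number of correct digits per step.

```python
# Hypothetical error sequences illustrating Definition 2.1.1: q-linear with
# q-factor 1/2, q-superlinear with q-order 3/2, and q-quadratic with K = 1.
def iterate(step, e0, n):
    errs = [e0]
    for _ in range(n):
        errs.append(step(errs[-1]))
    return errs

linear    = iterate(lambda e: 0.5 * e,  0.5, 6)   # e_{n+1} = e_n / 2
superlin  = iterate(lambda e: e ** 1.5, 0.5, 6)   # e_{n+1} = e_n^{3/2}
quadratic = iterate(lambda e: e ** 2,   0.5, 6)   # e_{n+1} = e_n^2

# q-linear: constant ratio; q-superlinear: ratio of successive errors -> 0.
ratios_lin = [b / a for a, b in zip(linear, linear[1:])]
ratios_sup = [b / a for a, b in zip(superlin, superlin[1:])]
assert all(abs(r - 0.5) < 1e-12 for r in ratios_lin)
assert ratios_sup[-1] < ratios_sup[0]

# q-quadratic: e_6 = 0.5^(2^6) = 2^-64, far below the linear sequence's 2^-7.
assert abs(quadratic[-1] - 0.5 ** 64) < 1e-30
```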




In some cases the accuracy of the iteration can be improved by means that are external to the algorithm, say, by evaluation of the objective function and its gradient with increasing accuracy as the iteration progresses. In such cases, one has no guarantee that the accuracy of the iteration is monotonically increasing but only that the accuracy of the results is improving at a rate determined by the improving accuracy in the function–gradient evaluations. The concept of r-type convergence captures this effect.

Definition 2.1.3. Let {x_n} ⊂ R^N and x* ∈ R^N. Then {x_n} converges to x* r-(quadratically, superlinearly, linearly) if there is a sequence {ξ_n} ⊂ R converging q-(quadratically, superlinearly, linearly) to 0 such that

‖x_n − x*‖ ≤ ξ_n.

We say that {x_n} converges r-superlinearly with r-order α > 1 if ξ_n → 0 q-superlinearly with q-order α.

2.2 The Standard Assumptions

We will assume that local minimizers satisfy the standard assumptions which, like the standard assumptions for nonlinear equations in [154], will guarantee that Newton's method converges q-quadratically to x*. We will assume throughout this book that f and x* satisfy Assumption 2.2.1.

Assumption 2.2.1.

1. f is twice differentiable and

(2.1)    ‖∇²f(x) − ∇²f(y)‖ ≤ γ‖x − y‖.

2. ∇f(x*) = 0.

3. ∇²f(x*) is positive definite.

We sometimes say that f is twice Lipschitz continuously differentiable with Lipschitz constant γ to mean that part 1 of the standard assumptions holds.

If the standard assumptions hold then Theorem 1.4.1 implies that x* is a local minimizer of f. Moreover, the standard assumptions for nonlinear equations [154] hold for the system ∇f(x) = 0. This means that all of the local convergence results for nonlinear equations can be applied to unconstrained optimization problems. In this chapter we will quote those results from nonlinear equations as they apply to unconstrained optimization. However, these statements must be understood in the context of optimization. We will use, for example, the fact that the Hessian (the Jacobian of ∇f) is positive definite at x* in our solution of the linear equation for the Newton step. We will also use this in our interpretation of the Newton iteration.

2.3 Newton's Method

As in [154] we will define iterative methods in terms of the transition from a current iteration x_c to a new one x_+. In the case of a system of nonlinear equations, for example, x_+ is the root of the local linear model of F about x_c,

M_c(x) = F(x_c) + F'(x_c)(x − x_c).


LOCAL CONVERGENCE 15

Solving M_c(x_+) = 0 leads to the standard formula for the Newton iteration

(2.2)    x_+ = x_c − F'(x_c)⁻¹F(x_c).

One could say that Newton's method for unconstrained optimization is simply the method for nonlinear equations applied to ∇f(x) = 0. While this is technically correct if x_c is near a minimizer, it is utterly wrong if x_c is near a maximum. A more precise way of expressing the idea is to say that x_+ is a minimizer of the local quadratic model of f about x_c,

m_c(x) = f(x_c) + ∇f(x_c)^T(x − x_c) + (1/2)(x − x_c)^T ∇²f(x_c)(x − x_c).

If ∇²f(x_c) is positive definite, then the minimizer x_+ of m_c is the unique solution of ∇m_c(x) = 0. Hence,

0 = ∇m_c(x_+) = ∇f(x_c) + ∇²f(x_c)(x_+ − x_c).

Therefore,

(2.3)    x_+ = x_c − (∇²f(x_c))⁻¹∇f(x_c),

which is the same as (2.2) with F replaced by ∇f and F' by ∇²f. Of course, x_+ is not computed by forming an inverse matrix. Rather, given x_c, ∇f(x_c) is computed and the linear equation

(2.4)    ∇²f(x_c)s = −∇f(x_c)

is solved for the step s. Then (2.3) simply says that x_+ = x_c + s.
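Since m_c is quadratic, its gradient at x_c + s is ∇f(x_c) + ∇²f(x_c)s, so the step from (2.4) makes x_+ a stationary point of m_c. A quick numeric check of this (the two-variable test function below is hypothetical, not from the book):

```python
# Small numeric check (hypothetical function f(x, y) = x^4 + x*y + (1+y)^2)
# that the step s from (2.4) makes x+ = xc + s a stationary point of the
# local quadratic model m_c.
def grad(x, y):
    return (4 * x ** 3 + y, x + 2 * (1 + y))

def hess(x, y):
    return ((12 * x ** 2, 1.0), (1.0, 2.0))

xc = (0.75, -1.25)                 # Hessian is positive definite here
g = grad(*xc)
H = hess(*xc)

# Solve the 2x2 system H s = -g directly by Cramer's rule.
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
s = ((-g[0] * H[1][1] + g[1] * H[0][1]) / det,
     (-g[1] * H[0][0] + g[0] * H[1][0]) / det)

# grad m_c(xc + s) = grad f(xc) + H s should vanish.
r = (g[0] + H[0][0] * s[0] + H[0][1] * s[1],
     g[1] + H[1][0] * s[0] + H[1][1] * s[1])
assert abs(r[0]) < 1e-12 and abs(r[1]) < 1e-12
```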

However, if x_c is far from a minimizer, ∇²f(x_c) could have negative eigenvalues and the quadratic model will not have local minimizers (see exercise 2.7.4), and M_c, the local linear model of ∇f about x_c, could have roots which correspond to local maxima or saddle points of m_c. Hence, we must take care when far from a minimizer in making a correspondence between Newton's method for minimization and Newton's method for nonlinear equations. In this chapter, however, we will assume that we are sufficiently near a local minimizer for the standard assumptions for local optimality to imply those for nonlinear equations (as applied to ∇f). Most of the proofs in this chapter are very similar to the corresponding results in [154] for nonlinear equations. We include them in the interest of completeness.

We begin with a lemma from [154], which we state without proof.

Lemma 2.3.1. Assume that the standard assumptions hold. Then there is δ > 0 so that for all x ∈ B(δ), the ball of radius δ about x*,

(2.5)    ‖∇²f(x)‖ ≤ 2‖∇²f(x*)‖,

(2.6)    ‖(∇²f(x))⁻¹‖ ≤ 2‖(∇²f(x*))⁻¹‖,

and

(2.7)    ‖(∇²f(x*))⁻¹‖⁻¹‖e‖/2 ≤ ‖∇f(x)‖ ≤ 2‖∇²f(x*)‖‖e‖,

where e = x − x* denotes the error.

As a first example, we prove the local convergence for Newton's method.

Theorem 2.3.2. Let the standard assumptions hold. Then there are K > 0 and δ > 0 such that if x_c ∈ B(δ), the Newton iterate from x_c given by (2.3) satisfies

(2.8)    ‖e_+‖ ≤ K‖e_c‖².

Proof. Let δ be small enough so that the conclusions of Lemma 2.3.1 hold. By Theorem 1.2.1

e_+ = e_c − (∇²f(x_c))⁻¹∇f(x_c) = (∇²f(x_c))⁻¹ ∫₀¹ (∇²f(x_c) − ∇²f(x* + t e_c)) e_c dt.



By Lemma 2.3.1 and the Lipschitz continuity of ∇²f,

‖e_+‖ ≤ (2‖(∇²f(x*))⁻¹‖)γ‖e_c‖²/2.

This completes the proof of (2.8) with K = γ‖(∇²f(x*))⁻¹‖.

As in the nonlinear equations setting, Theorem 2.3.2 implies that the complete iteration is locally q-quadratically convergent.

Theorem 2.3.3. Let the standard assumptions hold. Then there is δ > 0 such that if x_0 ∈ B(δ), the Newton iteration

x_{n+1} = x_n − (∇²f(x_n))⁻¹∇f(x_n)

converges q-quadratically to x*.

Proof. Let δ be small enough so that the conclusions of Theorem 2.3.2 hold. Reduce δ if needed so that Kδ = η < 1. Then if n ≥ 0 and x_n ∈ B(δ), Theorem 2.3.2 implies that

(2.9)    ‖e_{n+1}‖ ≤ K‖e_n‖² ≤ η‖e_n‖ < ‖e_n‖

and hence x_{n+1} ∈ B(ηδ) ⊂ B(δ). Therefore, if x_n ∈ B(δ) we may continue the iteration. Since x_0 ∈ B(δ) by assumption, the entire sequence {x_n} ⊂ B(δ). (2.9) then implies that x_n → x* q-quadratically.
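The q-quadratic convergence predicted by Theorem 2.3.3 is easy to observe numerically. The sketch below runs the iteration (2.3) on a hypothetical one-variable function f(x) = exp(x) − 2x, whose minimizer is x* = ln 2 and whose second derivative is positive everywhere; the errors satisfy ‖e_{n+1}‖ ≤ K‖e_n‖² with a modest K:

```python
import math

# Newton's method on a hypothetical smooth convex function f(x) = exp(x) - 2x
# with minimizer x* = ln 2; the ratios |e_{n+1}| / |e_n|^2 stay bounded,
# as Theorem 2.3.3 predicts.
xstar = math.log(2.0)
x = 0.0
errs = [abs(x - xstar)]
for _ in range(5):
    g = math.exp(x) - 2.0        # f'(x)
    H = math.exp(x)              # f''(x) > 0 everywhere
    x = x - g / H                # Newton step (2.3)
    errs.append(abs(x - xstar))

# q-quadratic decrease: e_{n+1} <= K e_n^2 with K about 0.7 here.
for e0, e1 in zip(errs, errs[1:]):
    if e0 > 1e-8:                # skip steps already at roundoff level
        assert e1 <= 0.7 * e0 ** 2 + 1e-15, (e0, e1)
assert errs[-1] < 1e-12
print("errors:", errs)
```

Five iterations drive an initial error of about 0.69 below 10⁻¹², with the number of correct digits roughly doubling each step.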

Newton's method, from the local convergence point of view, is exactly the same as that for nonlinear equations applied to the problem of finding a root of ∇f. We exploit the extra structure of positive definiteness of ∇²f with an implementation of Newton's method based on the Cholesky factorization [127], [249], [264]

(2.10)    ∇²f(x) = LL^T,

where L is lower triangular and has a positive diagonal.

We terminate the iteration when ∇f is sufficiently small (see [154]). A natural criterion is to demand a relative decrease in ∇f and terminate when

(2.11)    ‖∇f(x_n)‖ ≤ τ_r‖∇f(x_0)‖,

where τ_r ∈ (0, 1) is the desired reduction in the gradient norm. However, if ‖∇f(x_0)‖ is very small, it may not be possible to satisfy (2.11) in floating point arithmetic and an algorithm based entirely on (2.11) might never terminate. A standard remedy is to augment the relative error criterion and terminate the iteration using a combination of relative and absolute measures of ∇f, i.e., when

(2.12)    ‖∇f(x_n)‖ ≤ τ_r‖∇f(x_0)‖ + τ_a.

In (2.12) τ_a is an absolute error tolerance. Hence, the termination criterion input to many of the algorithms presented in this book will be in the form of a vector τ = (τ_r, τ_a) of relative and absolute residuals.

Algorithm 2.3.1. newton(x, f, τ)

1. r_0 = ‖∇f(x)‖

2. Do while ‖∇f(x)‖ > τ_r r_0 + τ_a

(a) Compute ∇²f(x)

(b) Factor ∇²f(x) = LL^T

(c) Solve LL^T s = −∇f(x)


(d) x = x + s

(e) Compute ∇f (x).
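A minimal sketch of Algorithm newton follows, in Python rather than the book's MATLAB. The test function f(x, y) = cosh(x + y) + x² + y², with minimizer (0, 0)^T, and the small pure-Python Cholesky routine are illustrative assumptions, not from the book:

```python
import math

# Sketch of Algorithm newton on a hypothetical strictly convex test function
# f(x, y) = cosh(x + y) + x^2 + y^2 with minimizer (0, 0); a pure-Python
# Cholesky factorization stands in for step 2b.
def grad(x, y):
    s = math.sinh(x + y)
    return [s + 2 * x, s + 2 * y]

def hess(x, y):
    c = math.cosh(x + y)
    return [[c + 2, c], [c, c + 2]]       # positive definite for all (x, y)

def cholesky(A):
    """Return lower triangular L with A = L L^T; sqrt fails if A is not SPD."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(A[i][i] - s)    # signals indefiniteness
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_chol(L, b):
    """Solve L L^T s = b by forward, then back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    s = [0.0] * n
    for i in range(n - 1, -1, -1):
        s[i] = (y[i] - sum(L[k][i] * s[k] for k in range(i + 1, n))) / L[i][i]
    return s

def newton(x, tau_r=1e-12, tau_a=1e-12, maxit=50):
    r0 = math.hypot(*grad(*x))            # step 1: r0 = ||grad f(x)||
    for _ in range(maxit):
        g = grad(*x)
        if math.hypot(*g) <= tau_r * r0 + tau_a:
            break                          # termination test (2.12)
        L = cholesky(hess(*x))             # step 2b
        s = solve_chol(L, [-gi for gi in g])       # step 2c
        x = [xi + si for xi, si in zip(x, s)]      # step 2d
    return x

x = newton([1.0, 0.5])
assert max(abs(v) for v in x) < 1e-8
```

As the text notes, `cholesky` raising a `math` domain error on a non-SPD Hessian is not merely a failure mode; it is a useful signal that the iterate is too far from a minimizer for this method to apply.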

Algorithm newton, as formulated above, is not completely satisfactory. The value of the objective function f is never used and step 2b will fail if ∇²f is not positive definite. This failure, in fact, could serve as a signal that one is too far from a minimizer for Newton's method to be directly applicable. However, if we are near enough (see exercise 2.7.8) to a local minimizer, as we assume in this chapter, all will be well and we may apply all the results from nonlinear equations.