
So computing the gradient requires u and y, hence a solution of the state equation, and p, which
requires a solution of (1.20), a final-value problem for the adjoint equation. In the general case,
(1.17) is nonlinear, but (1.20) is a linear problem for p, which should be expected to be easier
to solve. This is the motivation for our claim that a gradient evaluation costs little more than a
function evaluation.
The discrete problems of interest here are constructed by solving (1.17) by numerical in-
tegration. After doing that, one can derive an adjoint variable and compute gradients using a
discrete form of (1.19). However, as noted in [139], the equation for the adjoint variable of the
discrete problem is usually not a discretization of (1.20). For the forward Euler method, on the
other hand, the discretization of the adjoint equation is the adjoint equation for the discrete
problem, and we use that discretization here for that reason.
The fully discrete problem is min_u f, where u ∈ R^N and

f(u) = Σ_{j=1}^N L((y)_j, (u)_j, j),




Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.
Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.




BASIC CONCEPTS 11

and the states {y_j} are given by the Euler recursion

y_{j+1} = y_j + hφ((y)_j, (u)_j, j) for j = 0, . . . , N − 1,

where h = T/(N − 1) and y_0 is given. Then

(∇f(u))_j = (p)_j φ_u((y)_j, (u)_j, j) + L_u((y)_j, (u)_j, j),

where (p)_N = 0 and

(p)_{j−1} = (p)_j + h((p)_j φ_y((y)_j, (u)_j, j) + L_y((y)_j, (u)_j, j)) for j = N, . . . , 1.
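The forward sweep for {y_j}, the backward sweep for {p_j}, and the gradient assembly can be sketched in a few lines. The dynamics φ and running cost L below, and the zero-based indexing convention, are choices made for this sketch (the text's indexing differs slightly); the adjoint gradient can be checked against finite differences.

```python
def phi(y, u):                      # dynamics: an illustrative choice
    return -y + u

def phi_y(y, u):
    return -1.0

def phi_u(y, u):
    return 1.0

def L(y, u):                        # running cost: an illustrative choice
    return 0.5 * (y - 1.0) ** 2 + 0.05 * u ** 2

def L_y(y, u):
    return y - 1.0

def L_u(y, u):
    return 0.1 * u

def states(u, y0, h):
    """Forward Euler sweep: y[j+1] = y[j] + h*phi(y[j], u[j])."""
    y = [y0]
    for uj in u:
        y.append(y[-1] + h * phi(y[-1], uj))
    return y

def objective(u, y0, h):
    y = states(u, y0, h)
    return sum(L(y[j], u[j]) for j in range(len(u)))

def gradient(u, y0, h):
    """One forward sweep for y, one linear backward sweep for the adjoint p."""
    N = len(u)
    y = states(u, y0, h)
    p = [0.0] * (N + 1)             # final value p[N] = 0
    for j in range(N - 1, 0, -1):   # backward recursion for the adjoint
        p[j] = p[j + 1] + h * p[j + 1] * phi_y(y[j], u[j]) + L_y(y[j], u[j])
    return [h * p[j + 1] * phi_u(y[j], u[j]) + L_u(y[j], u[j])
            for j in range(N)]
```

Note that the adjoint sweep is linear in p even though the state recursion may be nonlinear, which is exactly why the gradient costs little more than a function evaluation.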


1.6.2 Parameter Identification
This example, taken from [13], will appear throughout the book. The problem is small, with
N = 2. The goal is to identify the damping c and spring constant k of a linear spring by
minimizing the difference between a numerical prediction and measured data. The experimental
scenario is that the spring-mass system will be set into motion by an initial displacement from
equilibrium and measurements of displacements will be taken at equally spaced increments in
time.
The motion of an unforced harmonic oscillator satisfies the initial value problem

u'' + cu' + ku = 0; u(0) = u0, u'(0) = 0, (1.21)
on the interval [0, T]. We let x = (c, k)^T be the vector of unknown parameters and, when the
dependence on the parameters needs to be explicit, we will write u(t : x) instead of u(t) for the
solution of (1.21). If the displacement is sampled at {t_j}_{j=1}^M, where t_j = (j − 1)T/(M − 1),
and the observations for u are {u_j}_{j=1}^M, then the objective function is

f(x) = (1/2) Σ_{j=1}^M |u(t_j : x) − u_j|². (1.22)

This is an example of a nonlinear least squares problem.
u is differentiable with respect to x when c² − 4k ≠ 0. In that case, the gradient of f is

∇f(x) = ( Σ_{j=1}^M (u(t_j : x) − u_j) ∂u(t_j : x)/∂c )
        ( Σ_{j=1}^M (u(t_j : x) − u_j) ∂u(t_j : x)/∂k ). (1.23)

We can compute the derivatives of u(t : x) with respect to the parameters by solving the
sensitivity equations. Differentiating (1.21) with respect to c and k and setting w1 = ∂u/∂c and
w2 = ∂u/∂k, we obtain

w1'' + u' + cw1' + kw1 = 0; w1(0) = w1'(0) = 0,
w2'' + cw2' + u + kw2 = 0; w2(0) = w2'(0) = 0. (1.24)
If c is large, the initial value problems (1.21) and (1.24) will be stiff and one should use a good
variable-step stiff integrator. We refer the reader to [110], [8], [235] for details on this issue. In
the numerical examples in this book we used the MATLAB code ode15s from [236]. Stiffness
can also arise in the optimal control problem from §1.6.1 but does not in the specific examples
we use in this book. We caution the reader that when one uses an ODE code the results may only
be expected to be accurate to the tolerances input to the code. This limitation on the accuracy
must be taken into account, for example, when approximating the Hessian by differences.
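For the nonstiff regime (small c), the coupled system (1.21), (1.24) can be integrated with a simple fixed-step method. The sketch below uses classical RK4 on the six-dimensional augmented system, with parameter values chosen purely for illustration; for large c one should use a variable-step stiff code such as ode15s, as discussed above.

```python
def rhs(z, c, k):
    # z = (u, u', w1, w1', w2, w2'), with w1 = du/dc and w2 = du/dk
    u, du, w1, dw1, w2, dw2 = z
    return (du,
            -c * du - k * u,             # u'' + c u' + k u = 0   (1.21)
            dw1,
            -du - c * dw1 - k * w1,      # w1'' + u' + c w1' + k w1 = 0
            dw2,
            -u - c * dw2 - k * w2)       # w2'' + c w2' + u + k w2 = 0

def rk4(z0, c, k, T, nsteps):
    """Classical fourth-order Runge-Kutta with fixed step h = T/nsteps."""
    h = T / nsteps
    z = list(z0)
    traj = [tuple(z)]
    for _ in range(nsteps):
        k1 = rhs(z, c, k)
        k2 = rhs([zi + 0.5 * h * ki for zi, ki in zip(z, k1)], c, k)
        k3 = rhs([zi + 0.5 * h * ki for zi, ki in zip(z, k2)], c, k)
        k4 = rhs([zi + h * ki for zi, ki in zip(z, k3)], c, k)
        z = [zi + h * (a + 2 * b + 2 * cc + d) / 6
             for zi, a, b, cc, d in zip(z, k1, k2, k3, k4)]
        traj.append(tuple(z))
    return traj

def solve(c, k, u0=1.0, T=5.0, nsteps=500):
    # initial data from (1.21)/(1.24): u(0) = u0, u'(0) = 0, sensitivities 0
    return rk4((u0, 0.0, 0.0, 0.0, 0.0, 0.0), c, k, T, nsteps)
```

A useful sanity check is to compare the computed sensitivities against finite differences of the displacement with respect to the parameters; the two agree to well within the integration accuracy in the nonstiff regime.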






12 ITERATIVE METHODS FOR OPTIMIZATION

1.6.3 Convex Quadratics
While convex quadratic problems are, in a sense, the easiest of optimization problems, they
present surprising challenges to the sampling algorithms presented in Part II and can illustrate
fundamental problems with classical gradient-based methods like the steepest descent algorithm
from §3.1. Our examples will all take N = 2, b = 0, and

H = ( λs  0
      0   λl ),

where 0 < λs ≤ λl. The function to be minimized is

f (x) = xT Hx

and the minimizer is x* = (0, 0)^T.
As λl/λs becomes large, the level curves of f become elongated. When λs = λl = 1,
min_x f is the easiest problem in optimization.
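The effect of the ratio λl/λs on steepest descent can be seen in a few lines. With exact line searches on f(x) = x^T Hx the error contracts by roughly (κ − 1)/(κ + 1) per step, where κ = λl/λs; the code below is an illustrative sketch (not from the text) using that classical bound as a guide.

```python
# Steepest descent with exact line search on f(x) = x^T H x, H = diag(ls, ll),
# 0 < ls <= ll. Convergence degrades as the level curves of f elongate.

def steepest_descent(ls, ll, x0, iters):
    x = list(x0)
    for _ in range(iters):
        g = [2 * ls * x[0], 2 * ll * x[1]]       # grad f = 2 H x
        gHg = ls * g[0] ** 2 + ll * g[1] ** 2
        if gHg == 0:
            break                                 # already at the minimizer
        t = (g[0] ** 2 + g[1] ** 2) / (2 * gHg)   # exact minimizer along -g
        x = [x[0] - t * g[0], x[1] - t * g[1]]
    return x
```

With λs = λl the first step lands exactly on x* = (0, 0)^T; with λl/λs = 100 the error shrinks by only about 0.98 per step from a worst-case starting point, so dozens of iterations leave a large residual.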


1.7 Exercises on Basic Concepts
1.7.1. Prove Theorem 1.2.2.
1.7.2. Consider the parameter identification problem for x = (c, k, ω, φ)^T ∈ R^4 associated with
the initial value problem

u'' + cu' + ku = sin(ωt + φ); u(0) = 10, u'(0) = 0.

For what values of x is u differentiable? Derive the sensitivity equations for those values
of x for which u is differentiable.
1.7.3. Solve the system of sensitivity equations from exercise 1.7.2 numerically for c = 10,
k = 1, ω = π, and φ = 0 using the integrator of your choice. What happens if you use a
nonstiff integrator?

1.7.4. Let N = 2, d = (1, 1)T , and let f (x) = xT d + xT x. Compute, by hand, the minimizer
using conjugate gradient iteration.
1.7.5. For the same f as in exercise 1.7.4 solve the constrained optimization problem

min_{x∈U} f(x),

where U is the circle centered at (0, 0)^T of radius 1/3. You can solve this by inspection;
no computer and very little mathematics are needed.








Chapter 2

Local Convergence of Newton's
Method

By a local convergence method we mean one that requires that the initial iterate x0 is close to a
local minimizer x* at which the sufficient conditions hold.


2.1 Types of Convergence
We begin with the standard taxonomy of convergence rates [84], [154], [211].
Definition 2.1.1. Let {xn} ⊂ R^N and x* ∈ R^N. Then
• xn → x* q-quadratically if xn → x* and there is K > 0 such that

‖xn+1 − x*‖ ≤ K‖xn − x*‖².

• xn → x* q-superlinearly with q-order α > 1 if xn → x* and there is K > 0 such that

‖xn+1 − x*‖ ≤ K‖xn − x*‖^α.

• xn → x* q-superlinearly if

lim_{n→∞} ‖xn+1 − x*‖ / ‖xn − x*‖ = 0.

• xn → x* q-linearly with q-factor σ ∈ (0, 1) if

‖xn+1 − x*‖ ≤ σ‖xn − x*‖

for n sufficiently large.


Definition 2.1.2. An iterative method for computing x* is said to be locally (q-quadratically,
q-superlinearly, q-linearly, etc.) convergent if the iterates converge to x* (q-quadratically, q-
superlinearly, q-linearly, etc.) given that the initial data for the iteration is sufficiently good.

We remind the reader that a q-superlinearly convergent sequence is also q-linearly conver-
gent with q-factor σ for any σ ∈ (0, 1). A q-quadratically convergent sequence is q-superlinearly
convergent with q-order of 2.
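The distinctions in Definition 2.1.1 can be checked numerically: for a q-linear sequence the ratios e_{n+1}/e_n settle at the q-factor, while for a q-quadratic sequence e_{n+1}/e_n² stays bounded by K. A small sketch, with model error sequences invented for illustration:

```python
# Model error sequences illustrating Definition 2.1.1.

def errors(rule, e0, n):
    """Generate the n error values e0, rule(e0), rule(rule(e0)), ..."""
    e, out = e0, []
    for _ in range(n):
        out.append(e)
        e = rule(e)
    return out

lin = errors(lambda e: 0.5 * e, 1.0, 8)    # q-linear with q-factor 1/2
quad = errors(lambda e: e * e, 0.5, 5)     # q-quadratic with K = 1

lin_ratios = [b / a for a, b in zip(lin, lin[1:])]          # -> 1/2
quad_ratios = [b / (a * a) for a, b in zip(quad, quad[1:])]  # bounded by K
```

After only five q-quadratic steps the error is already below 10^-4, while the q-linear sequence has barely reached 10^-2; this is the practical meaning of the taxonomy.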






In some cases the accuracy of the iteration can be improved by means that are external
to the algorithm, say, by evaluation of the objective function and its gradient with increasing
accuracy as the iteration progresses. In such cases, one has no guarantee that the accuracy of
the iteration is monotonically increasing but only that the accuracy of the results is improving at
a rate determined by the improving accuracy in the function–gradient evaluations. The concept
of r-type convergence captures this effect.
Definition 2.1.3. Let {xn} ⊂ R^N and x* ∈ R^N. Then {xn} converges to x* r-(quadrat-
ically, superlinearly, linearly) if there is a sequence {ξn} ⊂ R converging q-(quadratically,
superlinearly, linearly) to 0 such that

‖xn − x*‖ ≤ ξn.

We say that {xn} converges r-superlinearly with r-order α > 1 if ξn → 0 q-superlinearly with
q-order α.
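An r-linearly convergent sequence need not decrease monotonically; it only has to be dominated by a q-linearly convergent majorant. A tiny illustration (the particular sequence is invented for the example):

```python
# r-linear but not q-linear: errors dominated by xi_n = (1/2)^n, yet the
# error ratios e_{n+1}/e_n do not stay below 1.

err = [0.5 ** n * (1.0 if n % 2 else 0.01) for n in range(12)]
xi = [0.5 ** n for n in range(12)]

dominated = all(e <= x for e, x in zip(err, xi))   # r-linear: ||e_n|| <= xi_n
ratios = [b / a for a, b in zip(err, err[1:])]
monotone = all(r < 1 for r in ratios)              # fails: not q-linear
```

This is the kind of behavior produced by, say, evaluating the gradient with increasing accuracy: the overall trend is governed by the majorant {ξn}, not by each individual step.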



2.2 The Standard Assumptions
We will assume that local minimizers satisfy the standard assumptions which, like the standard
assumptions for nonlinear equations in [154], will guarantee that Newton's method converges
q-quadratically to x*. We will assume throughout this book that f and x* satisfy Assumption
2.2.1.
Assumption 2.2.1.

1. f is twice differentiable and

‖∇²f(x) − ∇²f(y)‖ ≤ γ‖x − y‖. (2.1)

2. ∇f(x*) = 0.

3. ∇²f(x*) is positive definite.

We sometimes say that f is twice Lipschitz continuously differentiable with Lipschitz constant
γ to mean that part 1 of the standard assumptions holds.
If the standard assumptions hold then Theorem 1.4.1 implies that x* is a local minimizer
of f. Moreover, the standard assumptions for nonlinear equations [154] hold for the system
∇f(x) = 0. This means that all of the local convergence results for nonlinear equations can be
applied to unconstrained optimization problems. In this chapter we will quote those results from
nonlinear equations as they apply to unconstrained optimization. However, these statements
must be understood in the context of optimization. We will use, for example, the fact that the
Hessian (the Jacobian of ∇f) is positive definite at x* in our solution of the linear equation for
the Newton step. We will also use this in our interpretation of the Newton iteration.


2.3 Newton's Method
As in [154] we will define iterative methods in terms of the transition from a current iteration xc
to a new one x+. In the case of a system of nonlinear equations, for example, x+ is the root of
the local linear model of F about xc,

Mc(x) = F(xc) + F'(xc)(x − xc).






LOCAL CONVERGENCE 15

Solving Mc(x+) = 0 leads to the standard formula for the Newton iteration

x+ = xc − F'(xc)^{-1}F(xc). (2.2)

One could say that Newton's method for unconstrained optimization is simply the method
for nonlinear equations applied to ∇f(x) = 0. While this is technically correct if xc is near a
minimizer, it is utterly wrong if xc is near a maximum. A more precise way of expressing the
idea is to say that x+ is a minimizer of the local quadratic model of f about xc,

mc(x) = f(xc) + ∇f(xc)^T(x − xc) + (1/2)(x − xc)^T ∇²f(xc)(x − xc).

If ∇²f(xc) is positive definite, then the minimizer x+ of mc is the unique solution of ∇mc(x) =
0. Hence,
0 = ∇mc(x+) = ∇f(xc) + ∇²f(xc)(x+ − xc).
Therefore,
x+ = xc − (∇²f(xc))^{-1}∇f(xc), (2.3)

which is the same as (2.2) with F replaced by ∇f and F' by ∇²f. Of course, x+ is not computed
by forming an inverse matrix. Rather, given xc, ∇f(xc) is computed and the linear equation

∇²f(xc)s = −∇f(xc) (2.4)

is solved for the step s. Then (2.3) simply says that x+ = xc + s.
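The transition from xc to x+ via (2.4) can be sketched in a few lines for N = 2. The objective below is a smooth, globally convex function chosen purely for illustration (it is not from the text); the 2×2 system is solved directly, and no inverse Hessian is ever formed.

```python
# One Newton step for f(x, y) = x^2 + y^2 + x^4 + y^4 + x*y, whose Hessian is
# positive definite everywhere and whose minimizer is (0, 0).

def grad(x, y):
    return (2 * x + 4 * x ** 3 + y, 2 * y + 4 * y ** 3 + x)

def hess(x, y):
    return ((2 + 12 * x * x, 1.0), (1.0, 2 + 12 * y * y))

def newton_step(x, y):
    """Solve hess(x, y) s = -grad(x, y) directly and return xc + s."""
    g = grad(x, y)
    (a, b), (c, d) = hess(x, y)
    det = a * d - b * c                   # positive: Hessian is SPD here
    s0 = (-g[0] * d + g[1] * b) / det     # Cramer's rule on the 2x2 system
    s1 = (-g[1] * a + g[0] * c) / det
    return x + s0, y + s1
```

Iterating newton_step from a moderate starting point drives the error to roundoff in a handful of steps, consistent with the q-quadratic convergence proved below.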
However, if xc is far from a minimizer, ∇²f(xc) could have negative eigenvalues and the
quadratic model will not have local minimizers (see exercise 2.7.4), and Mc, the local linear
model of ∇f about xc, could have roots which correspond to local maxima or saddle points
of mc. Hence, we must take care when far from a minimizer in making a correspondence
between Newton's method for minimization and Newton's method for nonlinear equations. In
this chapter, however, we will assume that we are sufficiently near a local minimizer for the
standard assumptions for local optimality to imply those for nonlinear equations (as applied to
∇f). Most of the proofs in this chapter are very similar to the corresponding results in [154] for
nonlinear equations. We include them in the interest of completeness.
We begin with a lemma from [154], which we state without proof. Here and in what follows,
e = x − x* denotes the error and B(δ) the ball of radius δ about x*.
Lemma 2.3.1. Assume that the standard assumptions hold. Then there is δ > 0 so that for
all x ∈ B(δ)

‖∇²f(x)‖ ≤ 2‖∇²f(x*)‖, (2.5)

‖(∇²f(x))^{-1}‖ ≤ 2‖(∇²f(x*))^{-1}‖, (2.6)

and

‖(∇²f(x*))^{-1}‖^{-1}‖e‖/2 ≤ ‖∇f(x)‖ ≤ 2‖∇²f(x*)‖‖e‖. (2.7)


As a first example, we prove the local convergence for Newton's method.
Theorem 2.3.2. Let the standard assumptions hold. Then there are K > 0 and δ > 0 such
that if xc ∈ B(δ), the Newton iterate from xc given by (2.3) satisfies

‖e+‖ ≤ K‖ec‖². (2.8)

Proof. Let δ be small enough so that the conclusions of Lemma 2.3.1 hold. By Theorem 1.2.1,

e+ = ec − ∇²f(xc)^{-1}∇f(xc) = (∇²f(xc))^{-1} ∫_0^1 (∇²f(xc) − ∇²f(x* + t ec)) ec dt.








By Lemma 2.3.1 and the Lipschitz continuity of ∇²f,

‖e+‖ ≤ (2‖(∇²f(x*))^{-1}‖) γ ‖ec‖²/2.

This completes the proof of (2.8) with K = γ‖(∇²f(x*))^{-1}‖.
As in the nonlinear equations setting, Theorem 2.3.2 implies that the complete iteration is
locally quadratically convergent.
Theorem 2.3.3. Let the standard assumptions hold. Then there is δ > 0 such that if
x0 ∈ B(δ), the Newton iteration

xn+1 = xn − (∇²f(xn))^{-1}∇f(xn)

converges q-quadratically to x*.
Proof. Let δ be small enough so that the conclusions of Theorem 2.3.2 hold. Reduce δ if
needed so that Kδ = η < 1. Then if n ≥ 0 and xn ∈ B(δ), Theorem 2.3.2 implies that

‖en+1‖ ≤ K‖en‖² ≤ η‖en‖ < ‖en‖ (2.9)

and hence xn+1 ∈ B(ηδ) ⊂ B(δ). Therefore, if xn ∈ B(δ) we may continue the iteration. Since
x0 ∈ B(δ) by assumption, the entire sequence {xn} ⊂ B(δ). (2.9) then implies that xn → x*
q-quadratically.
Newton's method, from the local convergence point of view, is exactly the same as that
for nonlinear equations applied to the problem of finding a root of ∇f. We exploit the extra
structure of positive definiteness of ∇²f with an implementation of Newton's method based on
the Cholesky factorization [127], [249], [264]

∇²f(x) = LL^T, (2.10)

where L is lower triangular and has a positive diagonal.
We terminate the iteration when ∇f is sufficiently small (see [154]). A natural criterion is
to demand a relative decrease in ∇f and terminate when

‖∇f(xn)‖ ≤ τr‖∇f(x0)‖, (2.11)

where τr ∈ (0, 1) is the desired reduction in the gradient norm. However, if ‖∇f(x0)‖ is very
small, it may not be possible to satisfy (2.11) in floating point arithmetic and an algorithm based
entirely on (2.11) might never terminate. A standard remedy is to augment the relative error
criterion and terminate the iteration using a combination of relative and absolute measures of
∇f, i.e., when

‖∇f(xn)‖ ≤ τr‖∇f(x0)‖ + τa. (2.12)

In (2.12) τa is an absolute error tolerance. Hence, the termination criterion input to many of the
algorithms presented in this book will be in the form of a vector τ = (τr, τa) of relative and
absolute residuals.
Algorithm 2.3.1. newton(x, f, τ)
1. r0 = ‖∇f(x)‖
2. Do while ‖∇f(x)‖ > τr r0 + τa
(a) Compute ∇²f(x)
(b) Factor ∇²f(x) = LL^T
(c) Solve LL^T s = −∇f(x)







(d) x = x + s
(e) Compute ∇f(x).

Algorithm newton, as formulated above, is not completely satisfactory. The value of the
objective function f is never used and step 2b will fail if ∇²f is not positive definite. This failure,
in fact, could serve as a signal that one is too far from a minimizer for Newton's method to be
directly applicable. However, if we are near enough (see exercise 2.7.8) to a local minimizer,
as we assume in this chapter, all will be well and we may apply all the results from nonlinear
equations.
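A sketch of Algorithm newton for small dense problems, with the Cholesky factorization (2.10) written out so that step 2b fails loudly when ∇²f is not positive definite, as the discussion above suggests it should. The gradient and Hessian are assumed to be supplied as functions; all names here are illustrative, and the termination test is the combined criterion (2.12).

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L L^T; raise if A is not SPD."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s
                if d <= 0:                      # step 2b fails: not SPD
                    raise ValueError("Hessian not positive definite")
                L[i][i] = math.sqrt(d)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_chol(L, b):
    """Solve L L^T x = b by a forward and a backward triangular solve."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                          # L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):              # L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k]
                           for k in range(i + 1, n))) / L[i][i]
    return x

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def newton(x, grad, hess, tau_r, tau_a, maxit=50):
    """Algorithm newton with the combined termination test (2.12)."""
    r0 = norm(grad(x))
    for _ in range(maxit):
        g = grad(x)
        if norm(g) <= tau_r * r0 + tau_a:       # test (2.12)
            break
        L = cholesky(hess(x))                   # step 2b
        s = solve_chol(L, [-gi for gi in g])    # step 2c
        x = [xi + si for xi, si in zip(x, s)]   # step 2d
    return x
```

Trying to factor an indefinite matrix raises an exception, which is precisely the signal, mentioned above, that one is too far from a minimizer for Newton's method to be directly applicable.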
