2.6.2 Discrete Control Problem

We solve the discrete control problem from §1.6.1 with N = 400, T = 1, y0 = 0,

L(y, u, t) = (y − 3)² + .5u², and φ(y, u, t) = uy + t²,

with Newton–CG and two different choices, η = .1, .0001, of the forcing term. The initial iterate was u0 = (10, 10, . . . , 10)ᵀ and the iteration was terminated when ‖∇f‖ < 10⁻⁸.

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

[Figure 2.2: Local Optimization for the Parameter ID Problem, Nonzero Residual. Four panels plot Gradient Norm and Function Value against Iterations for Newton's Method (top row) and the Gauss–Newton Method (bottom row).]

In Figure 2.3 one can see that the small forcing term produces an iteration history with the concavity of superlinear convergence. The limiting q-linear behavior of an iteration with constant η is not yet visible. The iteration with the larger value of η is in the q-linearly convergent stage, as the linear plot of ‖∇f‖ against the iteration counter shows.

The cost of the computation is not reflected by the number of nonlinear iterations. When η = .0001, the high accuracy of the linear solve is not rewarded. The computation with η = .0001 required 8 nonlinear iterations, a total of 32 CG iterations, roughly 1.25 million floating point operations, and 41 gradient evaluations. The optimization with η = .1 needed 10 nonlinear iterations, a total of 13 CG iterations, roughly 820 thousand floating point operations, and 24 gradient evaluations.
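The trade-off just described, cheaper inexact linear solves against possibly more nonlinear iterations, is the essence of Newton–CG. As a rough illustration (a sketch, not the book's code: the function names and the bare CG loop below are my own), an inexact Newton iteration that terminates CG at a relative residual of η might look like:

```python
import numpy as np

def newton_cg(grad, hess_vec, x0, eta=0.1, tol=1e-8, kmax=50):
    """Inexact Newton iteration: solve the Newton system H s = -g with CG,
    stopping the inner iteration once ||H s + g|| <= eta * ||g||.
    Assumes the Hessian is spd near the iterates (no negative-curvature check)."""
    x = x0.copy()
    for _ in range(kmax):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            return x
        # CG on H s = -g, terminated at relative residual eta
        s = np.zeros_like(x)
        r = -g
        p = r.copy()
        rho = r @ r
        while np.sqrt(rho) > eta * gnorm:
            Hp = hess_vec(x, p)
            a = rho / (p @ Hp)
            s = s + a * p
            r = r - a * Hp
            rho_new = r @ r
            p = r + (rho_new / rho) * p
            rho = rho_new
        x = x + s
    return x
```

Only Hessian-vector products are needed, which is what makes the method attractive for problems like the discrete control example, where those products can be had from sensitivity or adjoint solves.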

2.7 Exercises on Local Convergence

2.7.1. Apply Newton's method with (a) analytic first and second derivatives, (b) analytic first derivatives and forward difference second derivatives, and (c) forward difference first and second derivatives to find a local minimum of

1. f(x) = sin²(x),

2. f(x) = e^(x²), and


[Figure 2.3: Newton–CG for the Discrete Control Problem: η = .1, .0001. Four panels plot Gradient Norm and Function Value against Iterations for each forcing term.]

3. f(x) = x⁴.

Use difference steps of h = 10⁻¹, 10⁻², 10⁻⁴, and 10⁻⁸. Explain your results.
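As a starting point for part (c), here is a minimal sketch (my own, not from the text) of a one-dimensional Newton iteration built entirely from forward difference first and second derivatives:

```python
def fd_newton_1d(f, x0, h, kmax=100, tol=1e-6):
    """Newton's method in one variable with forward difference first
    and second derivatives, both using step h (exercise 2.7.1(c))."""
    x = x0
    for _ in range(kmax):
        fx = f(x)
        df = (f(x + h) - fx) / h                          # forward difference f'
        d2f = (f(x + 2 * h) - 2 * f(x + h) + fx) / h**2   # forward difference f''
        if abs(df) < tol:
            break
        x = x - df / d2f
    return x
```

Running this on the three test functions with the four values of h shows the interplay the exercise is after: truncation error for large h against cancellation error for small h.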

2.7.2. Repeat part (c) of exercise 2.7.1. Experiment with

f(x) = e^(x²) + 10⁻⁴ rand(x) and f(x) = x² + 10⁻⁴ rand(x),

where rand denotes the random number generator in your computing environment. Explain the differences in the results.

2.7.3. Show that if A is symmetric, p ≠ 0, and pᵀAp = 0, then A is either singular or indefinite.

2.7.4. Show that if b ∈ ℝᴺ and the N × N matrix A is symmetric and has a negative eigenvalue, then the quadratic functional

m(x) = xᵀAx + xᵀb

does not have a minimizer. Show that if u is an eigenvector corresponding to a negative eigenvalue of the Hessian, then u is a direction of negative curvature.

2.7.5. If N = 1, the local quadratic model could easily be replaced by a local quartic (i.e.,

fourth-degree) model (what would be wrong with a cubic model?). If a method is based

on minimization of the local quartic model, what kind of local convergence would you


expect? How would you extend this method to the case N > 1? Look at [30] for some

results on this.

2.7.6. Show that if the standard assumptions hold, h is sufficiently small, and x is sufficiently near x*, the difference Hessian defined by (2.19), ∇²ₕf(x), is spd.

2.7.7. Write a locally convergent Newton method code based on accurate function and gradient information and forward difference Hessians using (2.19). Be sure that your code tests for positivity of the Hessian so that you can avoid convergence to a local maximum. Is the test for positivity expensive? Apply your code to the parameter ID problem from §1.6.2. If you use an ODE solver that lets you control the accuracy of the integration, try values of the accuracy from 10⁻⁸ to 10⁻² and see how the iteration performs. Be sure that your difference Hessian reflects the accuracy in the gradient.
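One possible shape for such a code (a sketch under my own naming; (2.19) in the book is the forward difference Hessian built from gradient values): form the Hessian column by column, symmetrize it, and test positivity with a Cholesky factorization, whose O(N³/3) cost is the same order as the factorization a Newton step needs anyway:

```python
import numpy as np

def fd_hessian(grad, x, h=1e-6):
    """Forward difference Hessian from gradient values, column by
    column, then symmetrized (a sketch of the book's (2.19))."""
    n = len(x)
    g0 = grad(x)
    H = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        H[:, j] = (grad(x + e) - g0) / h
    return 0.5 * (H + H.T)

def is_spd(H):
    """Positivity test via Cholesky: succeeds iff H is spd."""
    try:
        np.linalg.cholesky(H)
        return True
    except np.linalg.LinAlgError:
        return False
```

The difference step h should reflect the accuracy of the gradient, which is the point of the last sentence of the exercise: with gradients only good to 10⁻², a step of 10⁻⁶ buries the Hessian in noise.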

2.7.8. Let the standard assumptions hold and let λs > 0 be the smallest eigenvalue of ∇²f(x*). Give the best (i.e., largest) bound you can for ρ such that ∇²f(x) is positive definite for all x ∈ B(ρ).

2.7.9. Use the definition of A† to prove (2.41).

2.7.10. Fill in the missing details in the proof of Theorem 2.4.2 by showing how the Kantorovich

theorem can be used to prove the existence of ξ(x).

2.7.11. Let f(x) = x² and g(x) = sin(100x)/10. Using an initial iterate of x0 = 1, try to find a local minimum of f + g using Newton's method with analytic gradients and Hessians. Repeat the experiment with difference gradients and Hessians (try forward differences with a step size of h = .2).

2.7.12. Solve the parameter ID problem from §2.6 with the observations perturbed randomly (for example, you could use the MATLAB rand function for this). Vary the amplitude of the perturbation and see how the performance of Newton and Gauss–Newton changes.

2.7.13. Derive sensitivity equations for the entries of the Hessian for the parameter ID objective

function. In general, if there are P parameters, how many sensitivity equations would

need to be solved for the gradient? How many for the Hessian?

2.7.14. Solve the discrete control problem from §2.6.2 using Newton–CG with forcing terms that depend on n. Consider ηₙ = .5/n, ηₙ = min(.1, ‖∇f(uₙ)‖), and some of the choices from [99]. Vary N and the termination criteria and compare the performance with the constant η choice in §2.6.2.

2.7.15. Let F : ℝᴺ → ℝᴹ (where M and N need not be the same) be sufficiently smooth (how smooth is that?) and be such that F can also be computed for complex arguments. Show that [181], [245]

Im(F(x + ihu))/h = F′(x)u + O(h²),

where Im denotes imaginary part. What happens if there is error in F? How can you use this fact to compute better difference gradients and Hessians?
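The complex-step formula is easy to try numerically. A minimal sketch (assuming, as the exercise does, that F is written so that it accepts complex arguments; the function name is my own):

```python
import numpy as np

def complex_step_dirder(F, x, u, h=1e-20):
    """Directional derivative F'(x)u via Im(F(x + i*h*u))/h.
    There is no subtraction of nearby function values, hence no
    cancellation error, so h can be taken near machine precision."""
    return np.imag(F(x + 1j * h * u)) / h
```

Compare this with a forward difference, where halving the error requires balancing truncation against cancellation; here h = 10⁻²⁰ is perfectly usable, which is the practical payoff the exercise points at.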


Chapter 3

Global Convergence

The locally convergent algorithms discussed in Chapter 2 can and do fail when the initial iterate is not near the root. The reasons for this failure, as we explain below, are that the Newton direction may fail to be a direction of descent for f and that even when a search direction is a direction of decrease of f, as −∇f is, the length of the step can be too long. Hence, taking a Newton (or Gauss–Newton, or inexact Newton) step can lead to an increase in the function and divergence of the iteration (see exercise 3.5.14 for two dramatic examples of this). The globally convergent algorithms developed in this chapter partially address this problem by either finding a local minimum or failing in one of a small number of easily detectable ways.

These are not algorithms for global optimization. When these algorithms are applied to problems with many local minima, the results of the iteration may depend in complex ways on the initial iterate.

3.1 The Method of Steepest Descent

The steepest descent direction from x is d = −∇f(x). The method of steepest descent [52] updates the current iteration xc by the formula

x+ = xc − λ∇f(xc).   (3.1)

If we take the simple choice λ = 1, then x+ is not guaranteed to be nearer a solution than xc, even if xc is very near a solution that satisfies the standard assumptions. The reason for this is that, unlike the Newton direction, the steepest descent direction scales with f. The Newton step, on the other hand, is the same for f as it is for cf for any c ≠ 0, but need not be a direction of decrease for f.

To make the method of steepest descent succeed, it is important to choose the steplength λ. One way to do this, which we analyze in §3.2, is to let λ = βᵐ, where β ∈ (0, 1) and m ≥ 0 is the smallest nonnegative integer such that there is sufficient decrease in f. In the context of the steepest descent algorithm, this means that

f(xc − λ∇f(xc)) − f(xc) < −αλ‖∇f(xc)‖².   (3.2)

This strategy, introduced in [7] and called the Armijo rule, is an example of a line search in which one searches on a ray from xc in a direction in which f is locally decreasing. In (3.2), α ∈ (0, 1) is a parameter, which we discuss after we fully specify the algorithm. This strategy of repeatedly testing for sufficient decrease and reducing the stepsize if the test is failed is called backtracking for obvious reasons.


The motivation for (3.2) is that if we approximate f by the linear model

mc(x) = f(xc) + ∇f(xc)ᵀ(x − xc),

then the reduction in the model (i.e., the predicted reduction in f) is

pred = mc(xc) − mc(x+) = λ‖∇f(xc)‖².

(3.2) says that the actual reduction in f,

ared = f(xc) − f(x+),

is at least as much as a fraction of the predicted reduction in the linear model. The parameter α is typically set to 10⁻⁴.

The reason we demand sufficient decrease instead of simple decrease (i.e., f(x+) < f(xc) or α = 0) is largely theoretical; a nonzero value of α is required within the proof to insure that the iteration does not stagnate before convergence.

Algorithm 3.1.1. steep(x, f, kmax)

1. For k = 1, . . . , kmax

(a) Compute f and ∇f ; test for termination.

(b) Find the least integer m ≥ 0 such that (3.2) holds for λ = βᵐ.

(c) x = x + λd.

2. If k = kmax and the termination test is failed, signal failure.

The termination criterion could be based on (2.12), for example.
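A direct transcription of Algorithm steep might look as follows (a sketch: the parameter defaults and the gradient-norm termination test are my own choices, not prescribed by the text):

```python
import numpy as np

def steep(x, f, grad, kmax=500, alpha=1e-4, beta=0.5, tol=1e-6):
    """Steepest descent with the Armijo rule (Algorithm 3.1.1):
    lambda = beta**m for the least m >= 0 satisfying (3.2)."""
    for _ in range(kmax):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x
        lam = 1.0
        # backtrack until f(x - lam*g) - f(x) < -alpha*lam*||g||^2
        while f(x - lam * g) - f(x) >= -alpha * lam * (g @ g):
            lam *= beta
        x = x - lam * g
    raise RuntimeError("steep: termination test failed after kmax iterations")
```

Note that the backtracking loop always terminates for smooth f with ∇f(x) ≠ 0, since the Armijo condition holds for all sufficiently small λ.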

3.2 Line Search Methods and the Armijo Rule

We introduce a few new concepts so that our proof of convergence of Algorithm steep will also apply to a significantly more general class of algorithms.

Definition 3.2.1. A vector d ∈ ℝᴺ is a descent direction for f at x if

(d/dt) f(x + td) |ₜ₌₀ = ∇f(x)ᵀd < 0.

Clearly the steepest descent direction d = −∇f(x) is a descent direction. A line search algorithm searches for decrease in f in a descent direction, using the Armijo rule for stepsize control, unless ∇f(x) = 0.

We will consider descent directions based on quadratic models of f of the form

m(x) = f(xc) + ∇f(xc)ᵀ(x − xc) + (1/2)(x − xc)ᵀHc(x − xc),

where Hc, which is sometimes called the model Hessian, is spd. We let d = x − xc be such that m(x) is minimized. Hence

∇m(x) = ∇f(xc) + Hc(x − xc) = 0

and hence

d = −Hc⁻¹∇f(xc).   (3.3)


The steepest descent direction satisfies (3.3) with Hc = I. However, the Newton direction d = −∇²f(x)⁻¹∇f(x) may fail to be a descent direction if x is far from a minimizer because ∇²f may not be spd. Hence, unlike the case for nonlinear equations [154], Newton's method is not a generally good global method, even with a line search, and must be modified (see [113], [117], [231], and [100]) to make sure that the model Hessians are spd.

The algorithm we analyze in this section is an extension of Algorithm steep that allows for descent directions that satisfy (3.3) for spd H. We modify (3.2) to account for H and the new descent direction d = −H⁻¹∇f(x). The general sufficient decrease condition is

f(xc + λd) − f(xc) < αλ∇f(xc)ᵀd.   (3.4)

Here, as in (3.2), α ∈ (0, 1) is an algorithmic parameter. Typically α = 10⁻⁴.

The stepsize reduction scheme in step 1b of Algorithm steep is crude. If β is too large, too many stepsize reductions may be needed before a step is accepted. If β is too small, the progress of the entire iteration may be retarded. We will address this problem in two ways. In §3.2.1 we will construct polynomial models of f along the descent direction to predict an optimal factor by which to reduce the step. In §3.3.3 we describe a method which remembers the steplength from the previous iteration.

Our proofs require only the following general line search strategy. If a steplength λc has been rejected (i.e., (3.4) fails with λ = λc), construct

λ+ ∈ [βlow λc, βhigh λc],   (3.5)

where 0 < βlow ≤ βhigh < 1. The choice β = βlow = βhigh is the simple rule in Algorithm steep. An exact line search, in which λ is the exact minimum of f(xc + λd), is not only not worth the extra expense but can degrade the performance of the algorithm.

Algorithm 3.2.1. optarm(x, f, kmax)

1. For k = 1, . . . , kmax

(a) Compute f and ∇f ; test for termination.

(b) Construct an spd matrix H and solve (3.3) to obtain a descent direction d.

(c) Beginning with λ = 1, repeatedly reduce λ using any strategy that satisfies (3.5) until (3.4) holds.

(d) x = x + λd.

2. If k = kmax and the termination test is failed, signal failure.
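One way to realize step 1b is to shift the true Hessian by a multiple of the identity until a Cholesky factorization succeeds; this is only one of the spd modifications cited above, and the shifting schedule below is my own illustrative choice, not the book's:

```python
import numpy as np

def optarm(x, f, grad, hess, kmax=100, alpha=1e-4, tol=1e-6):
    """Line search with quadratic-model descent directions (Algorithm 3.2.1).
    The model Hessian is H + mu*I with mu increased until Cholesky
    succeeds, so the direction (3.3) is guaranteed to be a descent direction."""
    n = len(x)
    for _ in range(kmax):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x
        H = hess(x)
        mu = 0.0
        while True:  # shift until H + mu*I is spd
            try:
                np.linalg.cholesky(H + mu * np.eye(n))
                break
            except np.linalg.LinAlgError:
                mu = max(2 * mu, 1e-3)
        d = -np.linalg.solve(H + mu * np.eye(n), g)  # (3.3)
        lam = 1.0
        # halving lam satisfies (3.5) with beta_low = beta_high = 1/2
        while f(x + lam * d) - f(x) >= alpha * lam * (g @ d):  # test (3.4)
            lam *= 0.5
        x = x + lam * d
    raise RuntimeError("optarm: termination test failed after kmax iterations")
```

When the iterate is near a minimizer satisfying the standard assumptions, no shift is needed, λ = 1 is accepted, and the iteration reduces to Newton's method, which is what preserves fast local convergence.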

In the remainder of this section we prove that if the sequence of model Hessians remains uniformly bounded and positive definite and the sequence of function values {f(xk)} is bounded from below, then any limit point of the sequence {xk} generated by Algorithm optarm converges to a point x* that satisfies the necessary conditions for optimality. We follow that analysis with a local convergence theory that is much less impressive than that for Newton's method.

We begin our analysis with a simple estimate that follows directly from the spectral theorem for spd matrices.

Lemma 3.2.1. Let H be spd with smallest and largest eigenvalues 0 < λs < λl. Then for all z ∈ ℝᴺ,