
2.6.2 Discrete Control Problem
We solve the discrete control problem from §1.6.1 with N = 400, T = 1, y0 = 0,

L(y, u, t) = (y − 3)^2 + .5u^2, and φ(y, u, t) = uy + t^2

with Newton–CG and two different choices, η = .1, .0001, of the forcing term. The initial
iterate was u0 = (10, 10, . . . , 10)^T and the iteration was terminated when ‖∇f‖ < 10^{−8}. In


Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.
Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.




[Figure 2.2 shows four panels: the gradient norm and the function value plotted
against the iteration counter, for Newton's method (top row) and the Gauss–Newton
method (bottom row).]

Figure 2.2: Local Optimization for the Parameter ID Problem, Nonzero Residual


Figure 2.3 one can see that the small forcing term produces an iteration history with the concavity
of superlinear convergence. The limiting q-linear behavior of an iteration with constant η is not
yet visible. The iteration with the larger value of η is in the q-linearly convergent stage, as the
linear plot of ‖∇f‖ against the iteration counter shows.
The cost of the computation is not reflected by the number of nonlinear iterations. When
η = .0001, the high accuracy of the linear solve is not rewarded. The computation with η = .0001
required 8 nonlinear iterations, a total of 32 CG iterations, roughly 1.25 million floating point
operations, and 41 gradient evaluations. The optimization with η = .1 needed 10 nonlinear
iterations, a total of 13 CG iterations, roughly 820 thousand floating point operations, and 24
gradient evaluations.
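The tradeoff above can be sketched in code. The following is a minimal Newton–CG iteration in Python, not the author's implementation: the inner CG loop for ∇²f(x)d = −∇f(x) is truncated as soon as the relative residual drops below the forcing term η, so a larger η buys cheaper inner solves at the price of more nonlinear iterations. The quadratic test problem and all names (`cg`, `newton_cg`) are illustrative, and no test for negative curvature is included.

```python
import numpy as np

def cg(H, g, eta):
    """Approximately solve H d = -g: stop when ||r|| <= eta * ||g||."""
    d = np.zeros_like(g)
    r = -g.copy()                      # residual of H d = -g at d = 0
    p = r.copy()
    tol = eta * np.linalg.norm(g)
    for _ in range(10 * len(g)):
        if np.linalg.norm(r) <= tol:
            break
        Hp = H @ p
        step = (r @ r) / (p @ Hp)      # assumes H is spd on this subspace
        d = d + step * p
        r_new = r - step * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d

def newton_cg(grad, hess, x, eta, tol=1e-8, kmax=50):
    """Inexact Newton iteration with a constant forcing term eta."""
    for _ in range(kmax):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x + cg(hess(x), g, eta)
    return x
```

For a convex quadratic f(x) = (1/2)x^T Ax − b^T x, the residual of the inner solve becomes the next gradient, so the outer iteration is q-linear with rate about η, consistent with the behavior described above.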


2.7 Exercises on Local Convergence
2.7.1. Apply Newton's method with (a) analytic first and second derivatives, (b) analytic first
derivatives and forward difference second derivatives, and (c) forward difference first and
second derivatives to find a local minimum of

1. f(x) = sin^2(x),
2. f(x) = e^{x^2}, and


[Figure 2.3 shows the gradient norm and the function value plotted against the
iteration counter, for the forcing terms η = .1 (top row) and η = .0001 (bottom row).]

Figure 2.3: Newton–CG for the Discrete Control Problem: η = .1, .0001


3. f(x) = x^4.
Use difference steps of h = 10^{−1}, 10^{−2}, 10^{−4}, and 10^{−8}. Explain your results.
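A starting point for part (c) can be sketched as below: a one-dimensional Newton iteration in which both derivatives are replaced by forward differences. The helper name `newton_fd` and the parameter choices are illustrative only; the exercise asks you to vary h and interpret what you see.

```python
import math

def newton_fd(f, x, h, tol=1e-10, kmax=100):
    """Newton's method with forward-difference first and second derivatives."""
    for _ in range(kmax):
        fp = (f(x + h) - f(x)) / h                        # ~ f'(x), error O(h)
        fpp = (f(x + 2 * h) - 2 * f(x + h) + f(x)) / h**2  # ~ f''(x)
        step = fp / fpp
        x = x - step
        if abs(step) < tol:
            break
    return x

# f(x) = sin^2(x) has a local minimum at x = 0; the computed minimizer
# is shifted by roughly O(h) because of the difference error in f'.
xstar = newton_fd(lambda t: math.sin(t)**2, 0.3, h=1e-4)
```

Comparing the limit point against the true minimizer for each h makes the O(h) accuracy of the forward difference visible directly.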
2.7.2. Repeat part (c) of exercise 2.7.1. Experiment with

f(x) = e^{x^2} + 10^{−4} rand(x) and f(x) = x^2 + 10^{−4} rand(x),

where rand denotes the random number generator in your computing environment. Explain
the differences in the results.
2.7.3. Show that if A is symmetric, p ≠ 0, and p^T Ap = 0, then A is either singular or indefinite.
2.7.4. Show that if b ∈ R^N and the N × N matrix A is symmetric and has a negative eigenvalue,
then the quadratic functional

m(x) = x^T Ax + x^T b

does not have a minimizer. Show that if u is an eigenvector corresponding to a negative
eigenvalue of the Hessian, then u is a direction of negative curvature.
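The phenomenon in exercise 2.7.4 is easy to observe numerically. The sketch below, with an illustrative 2 × 2 example (not from the text), evaluates m along the eigenvector u for the negative eigenvalue and watches the quadratic decrease without bound.

```python
import numpy as np

# Symmetric A with eigenvalues 1 and -2; m(x) = x^T A x + x^T b.
A = np.array([[1.0, 0.0], [0.0, -2.0]])
b = np.ones(2)
u = np.array([0.0, 1.0])   # eigenvector for the eigenvalue -2

def m(x):
    return x @ A @ x + x @ b

# u is a direction of negative curvature: u^T A u = -2 < 0, and
# m(t u) = -2 t^2 + t decreases without bound as t grows.
vals = [m(t * u) for t in (1.0, 10.0, 100.0)]
```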
2.7.5. If N = 1, the local quadratic model could easily be replaced by a local quartic (i.e.,
fourth-degree) model (what would be wrong with a cubic model?). If a method is based
on minimization of the local quartic model, what kind of local convergence would you







expect? How would you extend this method to the case N > 1? Look at [30] for some
results on this.
2.7.6. Show that if the standard assumptions hold, h is sufficiently small, and x is sufficiently
near x^*, the difference Hessian defined by (2.19), ∇²_h f(x), is spd.

2.7.7. Write a locally convergent Newton method code based on accurate function and gradient
information and forward difference Hessians using (2.19). Be sure that your code tests for
positivity of the Hessian so that you can avoid convergence to a local maximum. Is the
test for positivity expensive? Apply your code to the parameter ID problem from §1.6.2.
If you use an ODE solver that lets you control the accuracy of the integration, try values
of the accuracy from 10^{−8} to 10^{−2} and see how the iteration performs. Be sure that your
difference Hessian reflects the accuracy in the gradient.
2.7.8. Let the standard assumptions hold and let λ_s > 0 be the smallest eigenvalue of ∇²f(x^*).
Give the best (i.e., largest) bound you can for ρ such that ∇²f(x) is positive definite for
all x ∈ B(ρ).
2.7.9. Use the definition of A† to prove (2.41).

2.7.10. Fill in the missing details in the proof of Theorem 2.4.2 by showing how the Kantorovich
theorem can be used to prove the existence of ξ(x).
2.7.11. Let f(x) = x^2 and f_s(x) = sin(100x)/10. Using an initial iterate of x0 = 1, try to find
a local minimum of f + f_s using Newton's method with analytic gradients and Hessians.
Repeat the experiment with difference gradients and Hessians (try forward differences
with a step size of h = .2).
2.7.12. Solve the parameter ID problem from §2.6 with the observations perturbed randomly (for
example, you could use the MATLAB rand function for this). Vary the amplitude of the
perturbation and see how the performance of Newton and Gauss–Newton changes.

2.7.13. Derive sensitivity equations for the entries of the Hessian for the parameter ID objective
function. In general, if there are P parameters, how many sensitivity equations would
need to be solved for the gradient? How many for the Hessian?
2.7.14. Solve the discrete control problem from §2.6.2 using Newton–CG with forcing terms that
depend on n. Consider η_n = .5/n, η_n = min(.1, ‖∇f(u_n)‖), and some of the choices
from [99]. Vary N and the termination criteria and compare the performance with the
constant η choice in §2.6.2.
2.7.15. Let F : R^N → R^M (where M and N need not be the same) be sufficiently smooth (how
smooth is that?) and be such that F can also be computed for complex arguments. Show
that [181], [245]

Im(F(x + ihu))/h = F′(x)u + O(h^2),

where Im denotes imaginary part. What happens if there is error in F? How can you use
this fact to compute better difference gradients and Hessians?
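The identity in exercise 2.7.15 is the complex-step derivative. A minimal one-dimensional sketch follows (the name `complex_step` is illustrative): because no subtraction of nearby function values occurs, h can be taken extremely small with no cancellation error, unlike a forward difference.

```python
import cmath

def complex_step(f, x, h=1e-20):
    """Approximate f'(x) by Im(f(x + ih))/h; exact to O(h^2), no cancellation."""
    return f(x + 1j * h).imag / h

# For f = exp, the derivative at 1 is e; the tiny h costs nothing in accuracy.
d = complex_step(cmath.exp, 1.0)
```

The requirement is only that f accepts complex arguments and is real on the real axis; applying the formula with u ranging over coordinate directions recovers the directional derivatives F′(x)u of the exercise.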








Chapter 3

Global Convergence

The locally convergent algorithms discussed in Chapter 2 can and do fail when the initial iterate
is not near the root. The reasons for this failure, as we explain below, are that the Newton
direction may fail to be a direction of descent for f and that even when a search direction is a
direction of decrease of f, as −∇f is, the length of the step can be too long. Hence, taking a
Newton (or Gauss–Newton, or inexact Newton) step can lead to an increase in the function and
divergence of the iteration (see exercise 3.5.14 for two dramatic examples of this). The globally
convergent algorithms developed in this chapter partially address this problem by either finding
a local minimum or failing in one of a small number of easily detectable ways.
These are not algorithms for global optimization. When these algorithms are applied to
problems with many local minima, the results of the iteration may depend in complex ways on
the initial iterate.


3.1 The Method of Steepest Descent
The steepest descent direction from x is d = −∇f(x). The method of steepest descent [52]
updates the current iteration xc by the formula

x+ = xc − λ∇f(xc).                                      (3.1)

If we take the simple choice λ = 1, then x+ is not guaranteed to be nearer a solution than xc,
even if xc is very near a solution that satisfies the standard assumptions. The reason for this is
that, unlike the Newton direction, the steepest descent direction scales with f. The Newton step,
on the other hand, is the same for f as it is for cf for any c ≠ 0 but need not be a direction of
decrease for f.
To make the method of steepest descent succeed, it is important to choose the steplength λ.
One way to do this, which we analyze in §3.2, is to let λ = β^m, where β ∈ (0, 1) and m ≥ 0 is
the smallest nonnegative integer such that there is sufficient decrease in f. In the context of the
steepest descent algorithm, this means that

f(xc − λ∇f(xc)) − f(xc) < −αλ‖∇f(xc)‖^2.                (3.2)

This strategy, introduced in [7] and called the Armijo rule, is an example of a line search in
which one searches on a ray from xc in a direction in which f is locally decreasing. In (3.2),
α ∈ (0, 1) is a parameter, which we discuss after we fully specify the algorithm. This strategy
of repeatedly testing for sufficient decrease and reducing the stepsize if the test is failed is called
backtracking for obvious reasons.






The motivation for (3.2) is that if we approximate f by the linear model

mc(x) = f(xc) + ∇f(xc)^T (x − xc),

then the reduction in the model (i.e., the predicted reduction in f) is

pred = mc(xc) − mc(x+) = λ‖∇f(xc)‖^2.

(3.2) says that the actual reduction in f

ared = f(xc) − f(x+)

is at least as much as a fraction of the predicted reduction in the linear model. The parameter α
is typically set to 10^{−4}.
The reason we demand sufficient decrease instead of simple decrease (i.e., f(x+) < f(xc)
or α = 0) is largely theoretical; a nonzero value of α is required within the proof to ensure that
the iteration does not stagnate before convergence.
Algorithm 3.1.1. steep(x, f, kmax)
1. For k = 1, . . . , kmax
(a) Compute f and ∇f; test for termination.
(b) Find the least integer m ≥ 0 such that (3.2) holds for λ = β^m.
(c) x = x + λd.
2. If k = kmax and the termination test is failed, signal failure.

The termination criterion could be based on (2.12), for example.
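Algorithm steep can be sketched as follows in Python; this is a minimal illustration, not the author's code. The inner while loop finds the least m ≥ 0 with λ = β^m satisfying (3.2), with d = −∇f(x); the floor on λ is a practical safeguard, and the parameter values are the conventional ones (α = 10^{−4}, β = 1/2).

```python
import numpy as np

def steep(f, grad, x, alpha=1e-4, beta=0.5, tol=1e-6, kmax=200):
    """Steepest descent with the Armijo rule (3.2); terminate on small gradient."""
    for _ in range(kmax):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x
        lam = 1.0
        # Backtrack: least m >= 0 such that (3.2) holds for lam = beta**m.
        while f(x - lam * g) - f(x) >= -alpha * lam * (g @ g):
            lam *= beta
            if lam < 1e-12:        # safeguard against an endless reduction loop
                break
        x = x - lam * g
    return x
```

On a simple convex quadratic such as f(x) = (1/2)(x1^2 + 2 x2^2), the first full step is accepted and the iteration reaches the minimizer in a handful of steps; on badly scaled problems the slow q-linear behavior analyzed in §3.2 appears instead.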


3.2 Line Search Methods and the Armijo Rule
We introduce a few new concepts so that our proof of convergence of Algorithm steep will
also apply to a significantly more general class of algorithms.
Definition 3.2.1. A vector d ∈ R^N is a descent direction for f at x if

(d/dt) f(x + td) |_{t=0} = ∇f(x)^T d < 0.

Clearly the steepest descent direction d = −∇f(x) is a descent direction. A line search
algorithm searches for decrease in f in a descent direction, using the Armijo rule for stepsize
control, unless ∇f(x) = 0.
We will consider descent directions based on quadratic models of f of the form

m(x) = f(xc) + ∇f(xc)^T (x − xc) + (1/2)(x − xc)^T Hc (x − xc),

where Hc, which is sometimes called the model Hessian, is spd. We let d = x − xc be such that
m(x) is minimized. Hence,

∇m(x) = ∇f(xc) + Hc(x − xc) = 0,

and hence

d = −Hc^{−1} ∇f(xc).                                    (3.3)



The steepest descent direction satisfies (3.3) with Hc = I. However, the Newton direction
d = −∇²f(x)^{−1}∇f(x) may fail to be a descent direction if x is far from a minimizer because
∇²f may not be spd. Hence, unlike the case for nonlinear equations [154], Newton's method is
not a generally good global method, even with a line search, and must be modified (see [113],
[117], [231], and [100]) to make sure that the model Hessians are spd.
The algorithm we analyze in this section is an extension of Algorithm steep that allows for
descent directions that satisfy (3.3) for spd H. We modify (3.2) to account for H and the new
descent direction d = −H^{−1}∇f(x). The general sufficient decrease condition is

f(xc + λd) − f(xc) < αλ∇f(xc)^T d.                      (3.4)

Here, as in (3.2), α ∈ (0, 1) is an algorithmic parameter. Typically α = 10^{−4}.
The stepsize reduction scheme in step 1b of Algorithm steep is crude. If β is too large, too
many stepsize reductions may be needed before a step is accepted. If β is too small, the progress
of the entire iteration may be retarded. We will address this problem in two ways. In §3.2.1 we
will construct polynomial models of f along the descent direction to predict an optimal factor
by which to reduce the step. In §3.3.3 we describe a method which remembers the steplength
from the previous iteration.
Our proofs require only the following general line search strategy. If a steplength λc has
been rejected (i.e., (3.4) fails with λ = λc), construct

λ+ ∈ [β_low λc, β_high λc],                             (3.5)

where 0 < β_low ≤ β_high < 1. The choice β = β_low = β_high is the simple rule in Algo-
rithm steep. An exact line search, in which λ is the exact minimum of f(xc + λd), is not only
not worth the extra expense but can degrade the performance of the algorithm.
Algorithm 3.2.1. optarm(x, f, kmax)
1. For k = 1, . . . , kmax
(a) Compute f and ∇f; test for termination.
(b) Construct an spd matrix H and solve (3.3) to obtain a descent direction d.
(c) Beginning with λ = 1, repeatedly reduce λ using any strategy that satisfies (3.5)
until (3.4) holds.
(d) x = x + λd.
2. If k = kmax and the termination test is failed, signal failure.
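A minimal sketch of Algorithm optarm follows; it is illustrative, not the author's code. To make the model Hessian spd in step 1b it simply shifts the eigenvalues of ∇²f(x) when the smallest one is not safely positive; this is one simple choice, whereas the references above use modified factorizations. The backtracking uses the simple rule λ+ = βλc, which lies within (3.5).

```python
import numpy as np

def optarm(f, grad, hess, x, alpha=1e-4, beta=0.5, tol=1e-6, kmax=100):
    """Line search with spd model Hessians (3.3) and Armijo rule (3.4)."""
    for _ in range(kmax):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x
        H = hess(x)
        lam_min = np.linalg.eigvalsh(H).min()
        if lam_min <= 1e-2:
            # One simple way to force H spd: shift all eigenvalues up.
            H = H + (1e-2 - lam_min) * np.eye(len(x))
        d = np.linalg.solve(H, -g)                      # descent direction (3.3)
        lam = 1.0
        while f(x + lam * d) - f(x) >= alpha * lam * (g @ d):   # test (3.4)
            lam *= beta                                 # simple rule within (3.5)
            if lam < 1e-12:
                break
        x = x + lam * d
    return x
```

On a nonconvex test function such as the double well f(x) = x1^4 − x1^2 + x2^2, whose Hessian is indefinite near x1 = 0, the iteration still descends to one of the two minimizers at x1 = ±1/√2, x2 = 0.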

In the remainder of this section we prove that if the sequence of model Hessians remains
uniformly bounded and positive definite and the sequence of function values {f(xk)} is bounded
from below, then any limit point of the sequence {xk} generated by Algorithm optarm con-
verges to a point x^* that satisfies the necessary conditions for optimality. We follow that analysis
with a local convergence theory that is much less impressive than that for Newton's method.
We begin our analysis with a simple estimate that follows directly from the spectral theorem
for spd matrices.
Lemma 3.2.1. Let H be spd with smallest and largest eigenvalues 0 < λ_s ≤ λ_l. Then for
all z ∈ R^N,
