2.5.1 Convergence Rates
We will prove the simplest of the convergence results for Newton–CG, Theorem 2.5.1, from
which Theorem 2.5.2 follows directly. We refer the reader to [74] and [154] for the proof of
Theorem 2.5.3.
Theorem 2.5.1. Let the standard assumptions hold. Then there are δ and K_I such that if
x_c ∈ B(δ), s satisfies (2.47), and x_+ = x_c + s, then

(2.48)    ‖e_+‖ ≤ K_I (‖e_c‖ + η_c) ‖e_c‖.



Proof. Let δ be small enough so that the conclusions of Lemma 2.3.1 and Theorem 2.3.2
hold. To prove the first assertion (2.48), note that if

r = −∇²f(x_c)s − ∇f(x_c)

is the linear residual, then

s + (∇²f(x_c))⁻¹∇f(x_c) = −(∇²f(x_c))⁻¹r

and

(2.49)    e_+ = e_c + s = e_c − (∇²f(x_c))⁻¹∇f(x_c) − (∇²f(x_c))⁻¹r.

Now, (2.47), (2.7), and (2.6) imply that

‖s + (∇²f(x_c))⁻¹∇f(x_c)‖ ≤ ‖(∇²f(x_c))⁻¹‖ η_c ‖∇f(x_c)‖ ≤ 4κ(∇²f(x*)) η_c ‖e_c‖.

Hence, using (2.49) and Theorem 2.3.2, we have

‖e_+‖ ≤ ‖e_c − (∇²f(x_c))⁻¹∇f(x_c)‖ + 4κ(∇²f(x*)) η_c ‖e_c‖
      ≤ K‖e_c‖² + 4κ(∇²f(x*)) η_c ‖e_c‖,

where K is the constant from (2.8). If we set

K_I = K + 4κ(∇²f(x*)),

the proof is complete.
Theorem 2.5.2. Let the standard assumptions hold. Then there are δ and η̄ such that if
x_0 ∈ B(δ) and {η_n} ⊂ [0, η̄], then the inexact Newton iteration

x_{n+1} = x_n + s_n,

where
‖∇²f(x_n)s_n + ∇f(x_n)‖ ≤ η_n ‖∇f(x_n)‖,

converges q-linearly to x*. Moreover,

• if η_n → 0 the convergence is q-superlinear, and

• if η_n ≤ K_η ‖∇f(x_n)‖^p for some K_η > 0 the convergence is q-superlinear with q-order
1 + p.



The similarity between Theorem 2.5.2 and Theorem 2.3.5, the convergence result for the
chord method, should be clear. Rather than require that the approximate Hessian be accurate,
we demand that the linear iteration produce a sufficiently small relative residual. Theorem 2.5.3
is the remarkable statement that any reduction in the relative linear residual will suffice for linear
convergence in a certain norm. This statement implies [154] that ‖∇f(x_n)‖ will converge to
zero q-linearly, or, equivalently, that x_n → x* q-linearly with respect to ‖·‖_*, which is defined
by

‖x‖_* = ‖∇²f(x*)x‖.

Theorem 2.5.3. Let the standard assumptions hold. Then there is δ such that if x_c ∈ B(δ),
s satisfies (2.47), x_+ = x_c + s, and η_c ≤ η < η̄ < 1, then

(2.50)    ‖e_+‖_* ≤ η̄ ‖e_c‖_*.



Theorem 2.5.4. Let the standard assumptions hold. Then there is δ such that if x_0 ∈ B(δ)
and {η_n} ⊂ [0, η] with η < η̄ < 1, then the inexact Newton iteration

x_{n+1} = x_n + s_n,

where
‖∇²f(x_n)s_n + ∇f(x_n)‖ ≤ η_n ‖∇f(x_n)‖,

converges q-linearly with respect to ‖·‖_* to x*. Moreover,

• if η_n → 0 the convergence is q-superlinear, and

• if η_n ≤ K_η ‖∇f(x_n)‖^p for some K_η > 0 the convergence is q-superlinear with q-order
1 + p.

Q-linear convergence of {x_n} to a local minimizer with respect to ‖·‖_* is equivalent to
q-linear convergence of {‖∇f(x_n)‖} to zero. We will use the rate of convergence of {‖∇f(x_n)‖}
in our computational examples to compare various methods.
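
To see why, note that ∇f(x*) = 0, so a Taylor expansion of the gradient about x* gives

∇f(x_n) = ∇²f(x*) e_n + O(‖e_n‖²),

and hence ‖∇f(x_n)‖ = ‖e_n‖_* + O(‖e_n‖²); near x* the gradient norms and the errors measured in ‖·‖_* therefore contract at the same q-rate.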

2.5.2 Implementation of Newton–CG
Our implementation of Newton–CG approximately solves the equation for the Newton step with
CG. We make the implicit assumption that ∇f has been computed sufficiently accurately for
D²_h f(x : w) to be a useful approximation to the Hessian–vector product ∇²f(x)w.
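
A forward-difference Hessian–vector product can be built from two gradient evaluations. The MATLAB sketch below is only an illustration under our own assumptions: gradf is a hypothetical handle returning the gradient as a column vector, and the scaled difference used here is one common form of D²_h f(x : w); the routine supplied with the book's software collection may differ in detail.

function hv = fdhessvec(gradf, x, w, h)
% Forward-difference approximation to the Hessian-vector product
% (grad^2 f(x))*w from two gradient evaluations.
%   gradf : handle returning the gradient as a column vector (assumption)
%   h     : difference increment
nw = norm(w);
if nw == 0
    hv = zeros(size(x));   % the product with the zero vector is zero
    return
end
% Difference along the unit vector w/||w||, rescaled by ||w||.
hv = nw*(gradf(x + (h/nw)*w) - gradf(x))/h;
end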



Forward Difference CG
Algorithm fdcg is an implementation of the solution by CG of the equation for the Newton step
(2.4). In this algorithm we take care to identify failure in CG (i.e., detection of a vector p for
which pᵀHp ≤ 0). This failure either means that H is singular (pᵀHp = 0; see exercise 2.7.3)
or that pᵀHp < 0, i.e., p is a direction of negative curvature. The algorithms we will discuss
in §3.3.7 make good use of directions of negative curvature. The initial iterate for the forward
difference CG iteration should be the zero vector. In this way the first iterate will give a steepest
descent step, a fact that is very useful. The inputs to Algorithm fdcg are the current point x,
the objective f, the forcing term η, and a limit on the number of iterations kmax. The output is
the inexact Newton direction d. Note that in step 2b, D²_h f(x : p) is used as an approximation to
∇²f(x)p.

Algorithm 2.5.1. fdcg(x, f, η, kmax, d)



1. r = −∇f(x), ρ_0 = ‖r‖², k = 1, d = 0.

2. Do While √ρ_{k−1} > η ‖∇f(x)‖ and k < kmax

   (a) if k = 1 then p = r
       else β = ρ_{k−1}/ρ_{k−2} and p = r + βp
   (b) w = D²_h f(x : p)
       If pᵀw = 0 signal indefiniteness; stop.
       If pᵀw < 0 signal negative curvature; stop.
   (c) α = ρ_{k−1}/pᵀw
   (d) d = d + αp
   (e) r = r − αw
   (f) ρ_k = ‖r‖²
   (g) k = k + 1
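
In MATLAB the loop above can be rendered as follows. This is a minimal sketch under our own assumptions (a gradient handle gradf, an increment h, and a local copy of the forward-difference Hessian–vector product sketched earlier); it is not the fdcg.m from the software collection that accompanies this book.

function d = fdcg_sketch(x, gradf, eta, kmax, h)
% Minimal CG solve of the Newton step equation grad^2 f(x) d = -grad f(x)
% using forward-difference Hessian-vector products (cf. Algorithm fdcg).
g   = gradf(x);
r   = -g;                        % residual of the Newton equation at d = 0
d   = zeros(size(x));
rho = r'*r;  rho_old = rho;  k = 1;
while sqrt(rho) > eta*norm(g) && k < kmax
    if k == 1
        p = r;
    else
        beta = rho/rho_old;      % rho_{k-1}/rho_{k-2}
        p    = r + beta*p;
    end
    w    = fdhessvec(gradf, x, p, h);   % approximation to grad^2 f(x)*p
    curv = p'*w;
    if curv == 0
        warning('fdcg_sketch: indefiniteness detected'); return
    elseif curv < 0
        warning('fdcg_sketch: negative curvature detected'); return
    end
    alpha   = rho/curv;
    d       = d + alpha*p;
    r       = r - alpha*w;
    rho_old = rho;  rho = r'*r;
    k       = k + 1;
end
end

function hv = fdhessvec(gradf, x, w, h)
% Local copy of the forward-difference Hessian-vector product sketch.
nw = norm(w);
if nw == 0, hv = zeros(size(x)); return; end
hv = nw*(gradf(x + (h/nw)*w) - gradf(x))/h;
end

A call such as d = fdcg_sketch(x, @mygrad, 0.1, 20, 1e-6), with mygrad a hypothetical gradient routine, returns an inexact Newton direction satisfying (2.47) with η = 0.1 whenever CG succeeds within 20 iterations.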

Preconditioning can be incorporated into a Newton–CG algorithm by using a forward dif-
ference formulation, too. Here, as in [154], we denote the preconditioner by M. Aside from M,
the inputs and output of Algorithm fdpcg are the same as those for Algorithm fdcg.

Algorithm 2.5.2. fdpcg(x, f, M, η, kmax, d)

1. r = −∇f(x), ρ_0 = ‖r‖², k = 1, d = 0.

2. Do While √ρ_{k−1} > η ‖∇f(x)‖ and k < kmax

   (a) z = Mr
   (b) τ_{k−1} = zᵀr
   (c) if k = 1 then β = 0 and p = z
       else β = τ_{k−1}/τ_{k−2} and p = z + βp
   (d) w = D²_h f(x : p)
       If pᵀw = 0 signal indefiniteness; stop.
       If pᵀw < 0 signal negative curvature; stop.
   (e) α = τ_{k−1}/pᵀw
   (f) d = d + αp
   (g) r = r − αw
   (h) ρ_k = rᵀr
   (i) k = k + 1
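
A preconditioned version follows the same pattern; only the application of M and the use of τ_{k−1} = zᵀr change. The sketch below is again our own illustration, assuming the preconditioner is supplied as a function handle Mfun that applies M to a vector; the fdpcg.m in the book's software collection may pass M differently.

function d = fdpcg_sketch(x, gradf, Mfun, eta, kmax, h)
% Minimal preconditioned CG solve of the Newton step equation using
% forward-difference Hessian-vector products (cf. Algorithm fdpcg).
g   = gradf(x);
r   = -g;  d = zeros(size(x));
rho = r'*r;  tau_old = 0;  k = 1;
while sqrt(rho) > eta*norm(g) && k < kmax
    z   = Mfun(r);                    % apply the preconditioner
    tau = z'*r;
    if k == 1
        p = z;
    else
        p = z + (tau/tau_old)*p;      % beta = tau_{k-1}/tau_{k-2}
    end
    w    = fdhessvec(gradf, x, p, h);
    curv = p'*w;
    if curv == 0
        warning('fdpcg_sketch: indefiniteness detected'); return
    elseif curv < 0
        warning('fdpcg_sketch: negative curvature detected'); return
    end
    alpha   = tau/curv;
    d       = d + alpha*p;
    r       = r - alpha*w;
    rho     = r'*r;
    tau_old = tau;
    k       = k + 1;
end
end

function hv = fdhessvec(gradf, x, w, h)
% Local copy of the forward-difference Hessian-vector product sketch.
nw = norm(w);
if nw == 0, hv = zeros(size(x)); return; end
hv = nw*(gradf(x + (h/nw)*w) - gradf(x))/h;
end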

In our formulation of Algorithms fdcg and fdpcg, indefiniteness is a signal that we are
not sufficiently near a minimum for the theory in this section to hold. In §3.3.7 we show how
negative curvature can be exploited when far from the solution.
One view of preconditioning is that it is no more than a rescaling of the independent variables.
Suppose, rather than (1.2), we seek to solve

(2.51)    min_y f̂(y),





where f̂(y) = f(M^{1/2}y) and M is spd. If y* is a local minimizer of f̂, then x* = M^{1/2}y* is
a local minimizer of f and the two problems are equivalent. Moreover, if x = M^{1/2}y and ∇_x
and ∇_y denote gradients in the x and y coordinates, then

∇_y f̂(y) = M^{1/2} ∇_x f(x)

and

∇²_y f̂(y) = M^{1/2} (∇²_x f(x)) M^{1/2}.

Hence, the scaling matrix plays the role of the square root of the preconditioner for the precon-
ditioned conjugate gradient algorithm.
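
These identities are easy to verify numerically. The script below is a small check of our own (the quadratic test function, A, b, and M are illustrative choices, not part of the book's software); it compares a forward-difference gradient of f̂ with M^{1/2}∇_x f(x).

% Check grad_y fhat(y) = M^(1/2)*grad_x f(x), x = M^(1/2)*y, on the
% quadratic test function f(x) = 0.5*x'*A*x - b'*x.
n     = 4;
A     = gallery('lehmer', n);        % spd matrix, so f has a unique minimizer
b     = ones(n, 1);
M     = diag(1:n);                   % spd preconditioner
Mh    = sqrtm(M);                    % M^(1/2)
f     = @(x) 0.5*x'*A*x - b'*x;
gradf = @(x) A*x - b;
fhat  = @(y) f(Mh*y);                % fhat(y) = f(M^(1/2)*y)
y     = randn(n, 1);
h     = 1e-6;  g_fd = zeros(n, 1);
for i = 1:n                          % forward-difference gradient of fhat
    e       = zeros(n, 1);  e(i) = 1;
    g_fd(i) = (fhat(y + h*e) - fhat(y))/h;
end
g_id = Mh*gradf(Mh*y);               % right-hand side of the identity
disp(norm(g_fd - g_id))              % O(h) if the identity holds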


Newton–CG

The theory guarantees that if x_0 is near enough to a local minimizer then ∇²f(x_n) will be spd
for the entire iteration and x_n will converge rapidly to x*. Hence, Algorithm newtcg will not
terminate with failure because of an increase in f or an indefinite Hessian. Note that both the
forcing term η and the preconditioner M can change as the iteration progresses.

Algorithm 2.5.3. newtcg(x, f, τ, η)

1. r_c = r_0 = ‖∇f(x)‖

2. Do while ‖∇f(x)‖ > τ_r r_0 + τ_a

   (a) Select η and a preconditioner M.
   (b) fdpcg(x, f, M, η, kmax, d)
       If indefiniteness has been detected, terminate with failure.
   (c) x = x + d.
   (d) Evaluate f and ∇f(x).
       If f has not decreased, terminate with failure.
   (e) r_+ = ‖∇f(x)‖, σ = r_+/r_c, r_c = r_+.
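
A bare outer loop corresponding to Algorithm newtcg is sketched below. It is only an outline under our own assumptions: f and gradf are hypothetical handles for the objective and its gradient, fdpcg_sketch is the routine sketched earlier (assumed to be on the path), the forcing term and preconditioner are held fixed rather than selected anew at each step, and the indefiniteness signal from the inner iteration is not propagated. None of the globalization safeguards of Chapter 3 are present.

function x = newtcg_sketch(x, f, gradf, tau_r, tau_a, eta, Mfun, kmax, h)
% Minimal locally convergent Newton-CG outer iteration (cf. Algorithm newtcg).
fc = f(x);
g  = gradf(x);
r0 = norm(g);
while norm(g) > tau_r*r0 + tau_a
    % Inexact Newton step from preconditioned forward-difference CG.
    d  = fdpcg_sketch(x, gradf, Mfun, eta, kmax, h);
    x  = x + d;
    fp = f(x);
    if fp >= fc
        error('newtcg_sketch: f did not decrease; terminating with failure');
    end
    fc = fp;
    g  = gradf(x);
end
end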


The implementation of Newton–CG is simple, but, as presented in Algorithm newtcg,
incomplete. The algorithm requires substantial modification to be able to generate the good
initial data that the local theory requires. We return to this issue in §3.3.7.
There is a subtle problem with Algorithm fdpcg in that the algorithm is equivalent to the
application of the preconditioned conjugate gradient algorithm to the matrix B that is determined
by

Bp_i = w_i = D²_h f(x : p_i),  1 ≤ i ≤ N.

However, since the map p → D²_h f(x : p) is not linear in p, the quality of B as an approximation
to ∇²f(x) may degrade as the linear iteration progresses. Usually this will not cause problems
unless many iterations are needed to satisfy the inexact Newton condition. However, if one does
not see the expected rate of convergence in a Newton–CG iteration, this could be a factor [128].
One partial remedy is to use a centered-difference Hessian–vector product [162], which reduces
the error in B. In exercise 2.7.15 we discuss a more complex and imaginative way to compute
accurate Hessians.
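
For reference, a centered-difference Hessian–vector product costs one extra gradient evaluation per CG iteration but is second-order accurate in h. The sketch below is one common form, written under the same assumptions as the forward-difference sketch earlier in this section; it is our illustration, not the routine of [162].

function hv = cdhessvec(gradf, x, w, h)
% Centered-difference approximation to the Hessian-vector product
% (grad^2 f(x))*w; second order in h, two gradient evaluations per product.
nw = norm(w);
if nw == 0
    hv = zeros(size(x));
    return
end
u  = (h/nw)*w;                       % step of length h along w/||w||
hv = nw*(gradf(x + u) - gradf(x - u))/(2*h);
end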



2.6 Examples
In this section we used the collection of MATLAB codes but disabled the features (see Chapter 3)
that assist in convergence when far from a minimizer. We took care to make certain that the
initial iterates were near enough to the minimizer so that the observations of local convergence
corresponded to the theory. In practical optimization problems, good initial data is usually not
available and the globally convergent methods discussed in Chapter 3 must be used to start the
iteration.
The plots in this section have the characteristics of local convergence in that both the gradient
norms and function values are decreasing. The reader should contrast this with the examples in
Chapter 3.


2.6.1 Parameter Identification
For this example, M = 100, and the observations are those of the exact solution with c = k = 1,
which we computed analytically. We used T = 10 and u_0 = 10. We computed the displacement
and solved the sensitivity equations with the stiff solver ode15s. These results could be
obtained as well in a FORTRAN environment using, for example, the LSODE code [228]. The
relative and absolute error tolerances for the integrator were both set to 10⁻⁸. In view of the
expected accuracy of the gradient, we set the forward difference increment for the approximate
Hessian to h = 10⁻⁴. We terminated the iterations when ‖∇f‖ < 10⁻⁴. Our reason for this
is that, for the zero residual problem considered here, the standard assumptions imply that
f(x) = O(‖∇f(x)‖²) for x near the solution. Hence, since we can only resolve f to an accuracy
of 10⁻⁸, iteration beyond the point where ‖∇f‖ < 10⁻⁴ cannot be expected to lead to a further
decrease in f. In fact we observed this in our computations.
The iterations are very sensitive to the initial iterate. We used x_0 = (1.1, 1.05)ᵀ; initial
iterates much worse than that caused Newton's method to fail. The more robust methods from
Chapter 3 should be viewed as essential components of even a simple optimization code.
In Table 2.1 we tabulate the history of the iteration for both the Newton and Gauss–Newton
methods. As expected for a small residual problem, Gauss–Newton performs well and, for this
example, even converges in fewer iterations. The real benefit of Gauss–Newton is that com-
putation of the Hessian can be avoided, saving considerable computational work by exploiting
the structure of the problem. In the computation reported here, the MATLAB flops com-
mand indicates that the Newton iteration took roughly 1.9 million floating point operations and
Gauss–Newton roughly 650 thousand. This difference would be much more dramatic if there
were more than two parameters or the cost of an evaluation of f depended on N in a significant
way (which it does not in this example).


Table 2.1: Parameter identification problem, locally convergent iterations.

         Newton                       Gauss–Newton
  n    ‖∇f(x_n)‖    f(x_n)         ‖∇f(x_n)‖    f(x_n)
  0    2.33e+01     7.88e-01       2.33e+01     7.88e-01
  1    6.87e+00     9.90e-02       1.77e+00     6.76e-03
  2    4.59e-01     6.58e-04       1.01e-02     4.57e-07
  3    2.96e-03     3.06e-08       9.84e-07     2.28e-14
  4    2.16e-06     4.15e-14

Figure 2.1 is a graphical representation of the convergence history from Table 2.1. We think
that the plots are a more effective way to understand iteration statistics and will present mostly
graphs for the remainder of the book. The concavity of the plots of the gradient norms is the
signature of superlinear convergence.

[Figure 2.1 shows four semilog plots: gradient norm and function value against iterations for Newton's method (top row) and the Gauss–Newton method (bottom row).]

Figure 2.1: Local Optimization for the Parameter ID Problem

We next illustrate the difference between Gauss–Newton and Newton on a nonzero residual
problem. We use the same example as before with the observations randomly perturbed. We
used the MATLAB rand function for this, perturbing the samples of the analytic solution by
0.5 ∗ rand(M, 1). The least squares residual is about 3.6 and the plots in Figure 2.2 indicate
that Newton's method is still converging quadratically, but the rate of Gauss–Newton is linear.
The linear convergence of Gauss–Newton can be seen clearly from the linear semilog plot of the
gradient norms. Even so, the Gauss–Newton iteration was more efficient, in terms of floating
point operations, than Newton's method. The Gauss–Newton iteration took roughly 1 million
floating point operations while the Newton iteration took 1.4 million.
