2.3.1 Errors in Functions, Gradients, and Hessians
In the presence of errors in functions and gradients, however, the problem of unconstrained
optimization becomes more difficult than that of root finding. We discuss this difference only
briefly here and for the remainder of this chapter assume that gradients are computed exactly, or
at least as accurately as f, say, either analytically or with automatic differentiation [129], [130].
However, we must carefully study the effects of errors in the evaluation of the Hessian just as
we did those of errors in the Jacobian in [154].
A significant difference from the nonlinear equations case arises if only functions are available
and gradients and Hessians must be computed with differences. A simple one-dimensional
analysis will suffice to explain this. Assume that we can only compute f approximately. If we
compute f̂ = f + ε_f rather than f, then a forward difference gradient with difference increment h,

    D_h f(x) = (f̂(x + h) − f̂(x)) / h,

differs from f′ by O(h + ε_f/h), and this error is minimized if h = O(√ε_f). In that case the error
in the gradient is ε_g = O(h) = O(√ε_f). If a forward difference Hessian is computed using D_h
as an approximation to the gradient, then the error in the Hessian will be

(2.13)    ‖Δ‖ = O(√ε_g) = O(ε_f^{1/4})

and the accuracy in ∇²f will be much less than that of a Jacobian in the nonlinear equations
case.
If ε_f is significantly larger than machine roundoff, (2.13) indicates that using numerical
Hessians based on a second numerical differentiation of the objective function will not be very
accurate. Even in the best possible case, where ε_f is the same size as machine roundoff, finite
difference Hessians will not be very accurate and will be very expensive to compute if the Hessian
is dense. If, as on most computers today, machine roundoff is (roughly) 10⁻¹⁶, (2.13) indicates
that a forward difference Hessian will be accurate to roughly four decimal digits.
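
To see the estimate (2.13) at work, here is a minimal numerical sketch (ours, not from the book): a smooth one-dimensional f is polluted with synthetic noise of size ε_f and the forward difference error is tabulated for several increments h. The error behaves like O(h + ε_f/h) and is smallest near h = √ε_f.

```python
import numpy as np

# Minimal sketch (not from the book): forward difference differentiation of a
# noisy function.  eps_f is the assumed size of the error in evaluating f.
eps_f = 1e-10
rng = np.random.default_rng(0)

def f_hat(x):                      # computed f = exact f + noise of size eps_f
    return np.sin(x) + eps_f * rng.uniform(-1.0, 1.0)

x, exact = 1.0, np.cos(1.0)
for h in [1e-2, 1e-4, np.sqrt(eps_f), 1e-7, 1e-9]:
    dh = (f_hat(x + h) - f_hat(x)) / h
    print(f"h = {h:8.1e}   error = {abs(dh - exact):8.1e}")
# The error is O(h + eps_f/h); it is smallest near h = sqrt(eps_f) = 1e-5,
# where it is O(sqrt(eps_f)), as the text claims.
```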
One can obtain better results with centered differences, but at a cost of twice the number of
function evaluations. A centered difference approximation to ∇f is

    D_h f(x) = (f̂(x + h) − f̂(x − h)) / (2h)

and the error is O(h² + ε_f/h), which is minimized if h = O(ε_f^{1/3}), leading to an error in the
gradient of ε_g = O(ε_f^{2/3}). Therefore, a central difference Hessian will have an error of

    ‖Δ‖ = O(ε_g^{2/3}) = O(ε_f^{4/9}),

which is substantially better. We will find that accurate gradients are much more important than
accurate Hessians and one option is to compute gradients with central differences and Hessians
with forward differences.



If one does that, the centered difference gradient error is O(ε_f^{2/3}) and therefore the forward
difference Hessian error will be

    ‖Δ‖ = O(√ε_g) = O(ε_f^{1/3}).
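
A minimal sketch of this mixed strategy (ours; f is assumed smooth and x a NumPy vector): central differences for the gradient and a forward difference Hessian built by differencing that gradient approximation.

```python
import numpy as np

# Sketch of the mixed strategy described above: central difference gradient,
# forward difference Hessian obtained by differencing that gradient.
def central_grad(f, x, h):
    n = len(x)
    g = np.zeros(n)
    for j in range(n):
        e = np.zeros(n); e[j] = 1.0
        g[j] = (f(x + h * e) - f(x - h * e)) / (2.0 * h)
    return g

def forward_hessian(f, x, h_grad, h_hess):
    n = len(x)
    g0 = central_grad(f, x, h_grad)
    A = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = 1.0
        A[:, j] = (central_grad(f, x + h_hess * e, h_grad) - g0) / h_hess
    return 0.5 * (A + A.T)          # symmetrize the matrix of difference columns
```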

More elaborate schemes [22] compute a difference gradient and then reuse the function
evaluations in the Hessian computation.
In many optimization problems, however, accurate gradients are available. When that is the
case, numerical differentiation to compute Hessians is, like numerical computation of Jacobians
for nonlinear equations [154], a reasonable idea for many problems and the less expensive
forward differences work well.
Clever implementations of difference computation can exploit sparsity in the Hessian [67],
[59] to evaluate a forward difference approximation with far fewer than N evaluations of ∇f .
In the sparse case it is also possible [22], [23] to reuse the points from a centered difference
approximation to the gradient to create a second-order accurate Hessian.
Unless ‖ε_g(x_n)‖ → 0 as the iteration progresses, one cannot expect convergence. For this
reason estimates like (2.14) are sometimes called local improvement [88] results. Theorem 2.3.4
is a typical example.
Theorem 2.3.4. Let the standard assumptions hold. Then there are K̄ > 0, δ > 0, and
δ_1 > 0 such that if x_c ∈ B(δ) and ‖Δ(x_c)‖ < δ_1 then

    x_+ = x_c − (∇²f(x_c) + Δ(x_c))⁻¹ (∇f(x_c) + ε_g(x_c))

is defined (i.e., ∇²f(x_c) + Δ(x_c) is nonsingular) and satisfies

(2.14)    ‖e_+‖ ≤ K̄(‖e_c‖² + ‖Δ(x_c)‖ ‖e_c‖ + ‖ε_g(x_c)‖).


Proof. Let δ be small enough so that the conclusions of Lemma 2.3.1 and Theorem 2.3.2
hold. Let

    x_+^N = x_c − (∇²f(x_c))⁻¹ ∇f(x_c)

and note that

    x_+ = x_+^N + ((∇²f(x_c))⁻¹ − (∇²f(x_c) + Δ(x_c))⁻¹) ∇f(x_c) − (∇²f(x_c) + Δ(x_c))⁻¹ ε_g(x_c).

Lemma 2.3.1 and Theorem 2.3.2 imply

(2.15)    ‖e_+‖ ≤ K‖e_c‖² + 2 ‖(∇²f(x_c))⁻¹ − (∇²f(x_c) + Δ(x_c))⁻¹‖ ‖∇²f(x*)‖ ‖e_c‖
                  + ‖(∇²f(x_c) + Δ(x_c))⁻¹‖ ‖ε_g(x_c)‖.

If
    ‖Δ(x_c)‖ ≤ ‖(∇²f(x*))⁻¹‖⁻¹/4,
then Lemma 2.3.1 implies that
    ‖Δ(x_c)‖ ≤ ‖(∇²f(x_c))⁻¹‖⁻¹/2
and the Banach Lemma [12], [154] states that ∇²f(x_c) + Δ(x_c) is nonsingular and

    ‖(∇²f(x_c) + Δ(x_c))⁻¹‖ ≤ 2 ‖(∇²f(x_c))⁻¹‖ ≤ 4 ‖(∇²f(x*))⁻¹‖.

Hence,

    ‖(∇²f(x_c))⁻¹ − (∇²f(x_c) + Δ(x_c))⁻¹‖ ≤ 8 ‖(∇²f(x*))⁻¹‖² ‖Δ(x_c)‖.


We use these estimates and (2.15) to obtain

    ‖e_+‖ ≤ K‖e_c‖² + 16 ‖(∇²f(x*))⁻¹‖² ‖∇²f(x*)‖ ‖Δ(x_c)‖ ‖e_c‖ + 4 ‖(∇²f(x*))⁻¹‖ ‖ε_g(x_c)‖.

Setting

    K̄ = K + 16 ‖(∇²f(x*))⁻¹‖² ‖∇²f(x*)‖ + 4 ‖(∇²f(x*))⁻¹‖

completes the proof.
As is the case with equations, (2.14) implies that one cannot hope to find a minimizer with
more accuracy than one can evaluate ∇f. In most cases the iteration will stagnate once ‖e‖ is
(roughly) the same size as ε_g. The speed of convergence will be governed by the accuracy in the
Hessian.
The result for the chord method illustrates this latter point. In the chord method we form
and compute the Cholesky factorization of ∇²f(x_0) and use that factorization to compute all
subsequent Newton steps. Hence,

    x_+ = x_c − (∇²f(x_0))⁻¹ ∇f(x_c)

and

(2.16)    ‖Δ(x_c)‖ ≤ γ ‖x_0 − x_c‖ ≤ γ(‖e_0‖ + ‖e_c‖).

Algorithmically the chord iteration differs from the Newton iteration only in that the computation
and factorization of the Hessian is moved outside of the main loop.
Algorithm 2.3.2. chord(x, f, τ)
1. r_0 = ‖∇f(x)‖
2. Compute ∇²f(x)
3. Factor ∇²f(x) = LL^T
4. Do while ‖∇f(x)‖ > τ_r r_0 + τ_a
   (a) Solve LL^T s = −∇f(x)
   (b) x = x + s
   (c) Compute ∇f(x).
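
The following is offered only as a minimal sketch of Algorithm chord, not the author's code; grad, hess, and the keyword tolerances are our own names.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Sketch of Algorithm chord: factor the Hessian once and reuse the factors.
# grad(x) and hess(x) are assumed callables returning ∇f(x) and ∇²f(x).
def chord(x, grad, hess, tau_r=1e-6, tau_a=1e-9, max_it=100):
    g = grad(x)
    r0 = np.linalg.norm(g)
    c, lower = cho_factor(hess(x))        # factor ∇²f(x0) = L L^T once
    for _ in range(max_it):
        if np.linalg.norm(g) <= tau_r * r0 + tau_a:
            break
        s = cho_solve((c, lower), -g)     # solve L L^T s = -∇f(x)
        x = x + s
        g = grad(x)
    return x
```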

The convergence theory follows from Theorem 2.3.4 with ε_g = 0 and ‖Δ‖ = O(‖e_0‖).

Theorem 2.3.5. Let the standard assumptions hold. Then there are K_C > 0 and δ > 0
such that if x_0 ∈ B(δ) the chord iterates converge q-linearly to x* and

(2.17)    ‖e_{n+1}‖ ≤ K_C ‖e_0‖ ‖e_n‖.


Proof. Let δ be small enough so that the conclusions of Theorem 2.3.4 hold. Assume that
e_n, e_0 ∈ B(δ). Combining (2.16) and (2.14) implies

    ‖e_{n+1}‖ ≤ K̄(‖e_n‖(1 + γ) + γ‖e_0‖) ‖e_n‖ ≤ K̄(1 + 2γ)δ ‖e_n‖.

Hence, if δ is small enough so that

    K̄(1 + 2γ)δ = η < 1,

then the chord iterates converge q-linearly to x*. Q-linear convergence implies that ‖e_n‖ < ‖e_0‖
and hence (2.17) holds with K_C = K̄(1 + 2γ).



The Shamanskii method [233], [154], [211] is a generalization of the chord method that
updates Hessians after every m + 1 nonlinear iterations. Newton's method corresponds to
m = 1 and the chord method to m = ∞. The convergence result is a direct consequence of
Theorems 2.3.3 and 2.3.5.
Theorem 2.3.6. Let the standard assumptions hold and let m ≥ 1 be given. Then there are
K_S > 0 and δ > 0 such that if x_0 ∈ B(δ) the Shamanskii iterates converge q-superlinearly to
x* with q-order m + 1 and

(2.18)    ‖e_{n+1}‖ ≤ K_S ‖e_n‖^{m+1}.
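
A hedged sketch of the Shamanskii idea, in the same style as the chord sketch above: the Hessian is recomputed and refactored only after every block of m steps, so m = 1 behaves like Newton's method and large m approaches the chord method.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Sketch (not the author's code) of a Shamanskii-style iteration: reuse each
# Cholesky factorization of the Hessian for m consecutive steps.
def shamanskii(x, grad, hess, m=2, tau_r=1e-6, tau_a=1e-9, max_it=100):
    g = grad(x)
    r0 = np.linalg.norm(g)
    for _ in range(max_it):
        if np.linalg.norm(g) <= tau_r * r0 + tau_a:
            break
        c, lower = cho_factor(hess(x))            # recompute and refactor
        for _ in range(m):                        # reuse the factors m times
            x = x + cho_solve((c, lower), -grad(x))
        g = grad(x)
    return x
```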


As one more application of Theorem 2.3.4, we analyze the effects of a difference approximation
of the Hessian. We follow the notation of [154] where possible. For example, to construct
a Hessian matrix, whose columns are ∇²f(x)e_j, where e_j is the unit vector with jth component
1 and other components 0, we could approximate the matrix–vector products ∇²f(x)e_j by
forward differences and then symmetrize the resulting matrix. We define

(2.19)    ∇²_h f(x) = (A_h + A_h^T)/2,

where A_h is the matrix whose jth column is D²_h f(x : e_j). D²_h f(x : w), the difference
approximation of the action of the Hessian ∇²f(x) on a vector w, is defined to be the quotient

(2.20)    D²_h f(x : w) = 0 if w = 0,   and
          D²_h f(x : w) = (∇f(x + hw/‖w‖) − ∇f(x)) / (h/‖w‖) if w ≠ 0.

Note that we may also write

    D²_h f(x : w) = D_h(∇f)(x : w),

where the notation D_h, taken from [154], denotes numerical directional derivative. If ‖x‖ is very
large, then the error in computing the sum x + hw/‖w‖ will have to be taken into consideration
in the choice of h.
We warn the reader, as we did in [154], that D²_h f(x : w) is not a linear map and that
D²_h f(x : w) is not, in general, the same as ∇²_h f(x)w.
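
A minimal sketch of (2.19) and (2.20) (ours, assuming grad returns ∇f as a NumPy array):

```python
import numpy as np

# Sketch of the directional difference (2.20) and the symmetrized difference
# Hessian (2.19).  grad(x) is assumed to return ∇f(x) as a numpy array.
def d2h(grad, x, w, h):
    nw = np.linalg.norm(w)
    if nw == 0.0:
        return np.zeros_like(x)
    return (grad(x + h * w / nw) - grad(x)) / (h / nw)

def difference_hessian(grad, x, h):
    n = len(x)
    A = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = 1.0
        A[:, j] = d2h(grad, x, e, h)   # jth column of A_h is D²_h f(x : e_j)
    return 0.5 * (A + A.T)             # (2.19): symmetrize A_h
```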
If we compute ∇f(x) + ε_g(x) and the gradient errors satisfy an estimate of the form

    ‖ε_g(x)‖ ≤ ε̄

for all x, then the computed difference Hessian is ∇²_h(∇f + ε_g) and satisfies

(2.21)    ‖∇²f(x) − ∇²_h(∇f + ε_g)(x)‖ = O(h + ε̄/h).

So, as in [154], the choice h ≈ √ε̄ is optimal in the sense that it minimizes the quantity in the
O-term in (2.21).
The local convergence theorem in this case is [88], [154], [278], as follows.
Theorem 2.3.7. Let the standard assumptions hold. Then there are δ, ε̄, and K_D > 0 such
that if x_c ∈ B(δ), ‖ε_g(x)‖ ≤ ε̄_c for all x ∈ B(δ), and

    h ≥ M √‖ε_g(x_c)‖,

then

    x_+ = x_c − (∇²_h(∇f + ε_g)(x_c))⁻¹ (∇f(x_c) + ε_g(x_c))

satisfies

    ‖e_+‖ ≤ K_D(ε̄_c + (ε̄_c + h)‖e_c‖).





2.3.2 Termination of the Iteration
It is not safe to terminate the iteration when f(x_c) − f(x_+) is small, and no conclusions can
safely be drawn by examination of the differences of the objective function values at successive
iterations. While some of the algorithms for difficult problems in Part II of this book do indeed
terminate when successive function values are close, this is an act of desperation. For example,
if

    f(x_n) = − Σ_{j=1}^{n} j⁻¹,

then f(x_n) → −∞ but f(x_{n+1}) − f(x_n) = −1/(n + 1) → 0. The reader has been warned.

If the standard assumptions hold, then one may terminate the iteration when the norm of ∇f
is sufficiently small relative to ‖∇f(x_0)‖ (see [154]). We will summarize the key points here and
refer the reader to [154] for the details. The idea is that if ∇²f(x*) is well conditioned, then a
small gradient norm implies a small error norm. Hence, for any gradient-based iterative method,
termination on small gradients is reasonable. In the special case of Newton's method, the norm
of the step is a very good indicator of the error and if one is willing to incur the added cost of an
extra iteration, a very sharp bound on the error can be obtained, as we will see below.
Lemma 2.3.8. Assume that the standard assumptions hold. Let δ > 0 be small enough so
that the conclusions of Lemma 2.3.1 hold for x ∈ B(δ). Then for all x ∈ B(δ),

(2.22)    ‖e‖ / (4 ‖e_0‖ κ(∇²f(x*))) ≤ ‖∇f(x)‖ / ‖∇f(x_0)‖ ≤ 4 κ(∇²f(x*)) ‖e‖ / ‖e_0‖.


The meaning of (2.22) is that, up to a constant multiplier, the norm of the relative gradient
is the same as the norm of the relative error. This partially motivates the termination condition
(2.12).
In the special case of Newton's method, one can use the steplength as an accurate estimate
of the error because Theorem 2.3.2 implies that

(2.23)    ‖e_c‖ = ‖s‖ + O(‖e_c‖²).

Hence, near the solution s and e_c are essentially the same size. The cost of using (2.23) is that
all the information needed to compute x_+ = x_c + s has been computed. If we terminate the
iteration when ‖s‖ is smaller than our desired tolerance and then take x_+ as the final result, we
have attained more accuracy than we asked for. One possibility is to terminate the iteration when
‖s‖ = O(√τ_s) for some τ_s > 0. This, together with (2.23), will imply that ‖e_c‖ = O(√τ_s)
and hence, using Theorem 2.3.2, that

(2.24)    ‖e_+‖ = O(‖e_c‖²) = O(τ_s).
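
A hedged sketch (our names, not the book's) of this step-based test inside a Newton loop: accept x_+ = x_c + s and stop once ‖s‖ ≤ √τ_s, so that the returned point has error O(τ_s) by (2.24).

```python
import numpy as np

# Sketch of step-based termination for Newton's method: stop when the step
# norm falls below sqrt(tau_s) and return x + s, whose error is O(tau_s).
# grad and hess are assumed callables for ∇f and ∇²f.
def newton_with_step_termination(x, grad, hess, tau_s=1e-12, max_it=50):
    for _ in range(max_it):
        s = np.linalg.solve(hess(x), -grad(x))   # Newton step s
        x = x + s                                 # x_+ = x_c + s
        if np.linalg.norm(s) <= np.sqrt(tau_s):
            break
    return x
```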

For a superlinearly convergent method, termination on small steps is equally valid but one
cannot use (2.24). For a superlinearly convergent method we have

(2.25)    ‖e_c‖ = ‖s‖ + o(‖e_c‖) and ‖e_+‖ = o(‖e_c‖).

Hence, we can conclude that ‖e_+‖ < τ_s if ‖s‖ < τ_s. This is a weaker, but still very useful,
result.
For a q-linearly convergent method, such as the chord method, making termination decisions
based on the norms of the steps is much riskier. The relative error in estimating ‖e_c‖ by ‖s‖ is

    | ‖e_c‖ − ‖s‖ | / ‖e_c‖ ≤ ‖e_c + s‖ / ‖e_c‖ = ‖e_+‖ / ‖e_c‖.



Hence, estimation of errors by steps is worthwhile only if convergence is fast. One can go further
[156] if one has an estimate ρ of the q-factor that satisfies

    ‖e_+‖ ≤ ρ ‖e_c‖.

In that case,

    (1 − ρ) ‖e_c‖ ≤ ‖e_c‖ − ‖e_+‖ ≤ ‖e_c − e_+‖ = ‖s‖.

Hence

(2.26)    ‖e_+‖ ≤ ρ ‖e_c‖ ≤ (ρ/(1 − ρ)) ‖s‖.

So, if the q-factor can be estimated from above by ρ and

    ‖s‖ < (1 − ρ)τ_s/ρ,

then ‖e_+‖ < τ_s. This approach is used in ODE and DAE codes [32], [234], [228], [213],
but requires good estimates of the q-factor and we do not advocate it for q-linearly convergent
methods for optimization. The danger is that if the convergence is slow, the approximate q-factor
can be a gross underestimate and cause premature termination of the iteration.
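
For reference, the test derived from (2.26) is a single comparison; a minimal sketch, assuming an upper bound rho ∈ (0, 1) on the q-factor has somehow been supplied:

```python
# Sketch of the test based on (2.26): with an upper bound rho on the q-factor,
# accept x_+ once ||s|| < (1 - rho) * tau_s / rho, which guarantees
# ||e_+|| <= rho/(1 - rho) * ||s|| < tau_s.  A poor estimate of rho can cause
# premature termination, as the text warns.
def q_factor_step_test(step_norm, rho, tau_s):
    return step_norm < (1.0 - rho) * tau_s / rho
```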
It is not uncommon for evaluations of f and ∇f to be very expensive, and optimizations are,
therefore, usually allocated a fixed maximum number of iterations. Some algorithms, such as
the DIRECT algorithm [150] we discuss in §8.4.2, assign a limit to the number of function
evaluations and terminate the iteration in only this way.


2.4 Nonlinear Least Squares
Nonlinear least squares problems have objective functions of the form

(2.27)    f(x) = (1/2) Σ_{i=1}^{M} r_i(x)² = (1/2) R(x)^T R(x).

The vector R = (r_1, . . . , r_M) is called the residual. These problems arise in data fitting, for
example. In that case M is the number of observations and N is the number of parameters;
for these problems M > N and we say the problem is overdetermined. If M = N we have a
nonlinear equation and the theory and methods from [154] are applicable. If M < N the problem
is underdetermined. Overdetermined least squares problems arise most often in data fitting
applications like the parameter identification example in §1.6.2. Underdetermined problems are
less common, but are, for example, important in the solution of high-index differential algebraic
equations [48], [50].

The local convergence theory for underdetermined problems has the additional complexity
that the limit of the Gauss–Newton iteration is not uniquely determined by the distance of the
initial iterate to the set of points where R(x*) = 0. In §2.4.3 we describe the difficulties and
state a simple convergence result.

If x* is a local minimizer of f and f(x*) = 0, the problem min f is called a zero residual
problem (a remarkable and suspicious event in the data fitting scenario). If f(x*) is small, the
expected result in data fitting if the model (i.e., R) is good, the problem is called a small residual
problem. Otherwise one has a large residual problem.
Nonlinear least squares problems are an intermediate stage between nonlinear equations and
optimization problems and the methods for their solution reflect this. We define the M × N
Jacobian R′ of R by

(2.28)    (R′(x))_{ij} = ∂r_i/∂x_j,   1 ≤ i ≤ M, 1 ≤ j ≤ N.
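
As a concrete illustration of (2.27) and (2.28) (a toy example of ours, not from the book), consider fitting the model y ≈ x_1 e^{x_2 t} to M = 5 observations with N = 2 parameters:

```python
import numpy as np

# Toy overdetermined data-fitting example: residual R(x), objective (2.27),
# and Jacobian (2.28) for the model y ≈ x[0] * exp(x[1] * t).
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])     # M = 5 observations
y = np.array([1.0, 1.3, 1.7, 2.2, 2.9])

def residual(x):                 # R(x) in R^M
    return x[0] * np.exp(x[1] * t) - y

def objective(x):                # f(x) = (1/2) R(x)^T R(x), as in (2.27)
    R = residual(x)
    return 0.5 * R @ R

def jacobian(x):                 # (R'(x))_{ij} = dr_i/dx_j, as in (2.28)
    return np.column_stack((np.exp(x[1] * t),
                            x[0] * t * np.exp(x[1] * t)))
```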

