
With this notation it is easy to show that
$$\nabla f(x) = R'(x)^T R(x) \in \mathbb{R}^N. \tag{2.29}$$
The necessary conditions for optimality imply that at a minimizer $x^*$
$$R'(x^*)^T R(x^*) = 0. \tag{2.30}$$

In the underdetermined case, if $R'(x^*)$ has full row rank, (2.30) implies that $R(x^*) = 0$; this is
not the case for overdetermined problems.
The cost of a gradient is roughly that of a Jacobian evaluation, which, as is the case with
nonlinear equations, is the most one is willing to accept. Computation of the Hessian (an $N \times N$
matrix)
$$\nabla^2 f(x) = R'(x)^T R'(x) + \sum_{j=1}^{M} r_j(x)^T \nabla^2 r_j(x) \tag{2.31}$$
requires computation of the $M$ Hessians $\{\nabla^2 r_j\}$ for the second-order term
$$\sum_{j=1}^{M} r_j(x)^T \nabla^2 r_j(x)$$
and is too costly to be practical.
We will also express the second-order term as
$$\sum_{j=1}^{M} r_j(x)^T \nabla^2 r_j(x) = R''(x)^T R(x),$$
where the second derivative $R''$ is a tensor. The notation is to be interpreted in the following
way. For $v \in \mathbb{R}^M$, $R''(x)^T v$ is the $N \times N$ matrix
$$R''(x)^T v = \nabla^2 (R(x)^T v) = \sum_{i=1}^{M} (v)_i \nabla^2 r_i(x).$$
We will use the tensor notation when expanding $R$ about $x^*$ in some of the analysis to follow.
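For concreteness, the following minimal Python sketch evaluates the gradient (2.29) and the first-order part of the Hessian (2.31) for a small hypothetical residual; the functions R and Rp and the particular residuals are illustrative assumptions, not taken from the text.

```python
# Hypothetical example (not from the text): gradient (2.29) and the
# first-order Hessian term of (2.31) for a small residual R: R^2 -> R^3.
import numpy as np

def R(x):
    # An arbitrary choice of M = 3 residuals in N = 2 unknowns.
    return np.array([x[0]**2 - 1.0,
                     x[0] * x[1] - 2.0,
                     np.sin(x[1])])

def Rp(x):
    # Jacobian R'(x), an M x N matrix.
    return np.array([[2.0 * x[0], 0.0],
                     [x[1],       x[0]],
                     [0.0,        np.cos(x[1])]])

x = np.array([1.5, 0.5])
grad_f = Rp(x).T @ R(x)   # (2.29): gradient of f(x) = 0.5 * ||R(x)||^2
H_gn = Rp(x).T @ Rp(x)    # first-order term of (2.31); the Gauss-Newton model Hessian
```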

2.4.1 Gauss–Newton Iteration

The Gauss–Newton algorithm simply discards the second-order term in $\nabla^2 f$ and computes a
step
$$s = -(R'(x_c)^T R'(x_c))^{-1} \nabla f(x_c) = -(R'(x_c)^T R'(x_c))^{-1} R'(x_c)^T R(x_c). \tag{2.32}$$
The Gauss–Newton iterate is $x_+ = x_c + s$. One motivation for this approach is that $R''(x)^T R(x)$
vanishes for zero residual problems and therefore might be negligible for small residual problems.
Implicit in (2.32) is the assumption that $R'(x_c)^T R'(x_c)$ is nonsingular, which implies that
$M \ge N$. Another interpretation, which also covers underdetermined problems, is to say that the
Gauss–Newton iterate is the minimum norm solution of the local linear model of our nonlinear
least squares problem
$$\min \frac{1}{2} \|R(x_c) + R'(x_c)(x - x_c)\|^2. \tag{2.33}$$
Using (2.33) and linear least squares methods is a more accurate way to compute the step than
using (2.32), [115], [116], [127]. In the underdetermined case, the Gauss–Newton step can
be computed with the singular value decomposition [49], [127], [249]. (2.33) is an overdetermined,
square, or underdetermined linear least squares problem if the nonlinear problem is
overdetermined, square, or underdetermined.
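A minimal sketch of one Gauss–Newton step computed from the local linear model (2.33) by a linear least squares solve rather than from the normal equations in (2.32); the routines R and Rp are assumed to return the residual and Jacobian, and numpy.linalg.lstsq is used because it returns a minimum norm solution for overdetermined, square, and underdetermined Jacobians.

```python
# Sketch of one Gauss-Newton step via the local linear model (2.33).
# R and Rp are assumed to return the residual vector and the Jacobian matrix.
import numpy as np

def gauss_newton_step(R, Rp, xc):
    J = Rp(xc)                                  # R'(x_c)
    r = R(xc)                                   # R(x_c)
    # Minimum norm least squares solution of  min_s ||r + J s||; this covers
    # overdetermined, square, and underdetermined Jacobians.
    s, *_ = np.linalg.lstsq(J, -r, rcond=None)
    return xc + s                               # x_+ = x_c + s
```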



The standard assumptions for nonlinear least squares problems follow in Assumption 2.4.1.

Assumption 2.4.1. $x^*$ is a minimizer of $\|R\|_2^2$, $R$ is Lipschitz continuously differentiable
near $x^*$, and $R'(x^*)^T R'(x^*)$ has maximal rank. The rank assumption may also be stated as

• $R'(x^*)$ is nonsingular ($M = N$),

• $R'(x^*)$ has full column rank ($M > N$),

• $R'(x^*)$ has full row rank ($M < N$).

2.4.2 Overdetermined Problems

Theorem 2.4.1. Let $M > N$. Let Assumption 2.4.1 hold. Then there are $K > 0$ and $\delta > 0$
such that if $x_c \in \mathcal{B}(\delta)$ then the error in the Gauss–Newton iteration satisfies
$$\|e_+\| \le K\left(\|e_c\|^2 + \|R(x^*)\|\,\|e_c\|\right). \tag{2.34}$$


Proof. Let $\delta$ be small enough so that $\|x - x^*\| < \delta$ implies that $R'(x)^T R'(x)$ is nonsingular.
Let $\gamma$ be the Lipschitz constant for $R'$.
By (2.32),
$$e_+ = e_c - (R'(x_c)^T R'(x_c))^{-1} R'(x_c)^T R(x_c) = (R'(x_c)^T R'(x_c))^{-1} R'(x_c)^T (R'(x_c) e_c - R(x_c)). \tag{2.35}$$
Note that
$$R'(x_c) e_c - R(x_c) = R'(x_c) e_c - R(x^*) + R(x^*) - R(x_c) = -R(x^*) + (R'(x_c) e_c + R(x^*) - R(x_c)).$$
Now,
$$\|R'(x_c) e_c + R(x^*) - R(x_c)\| \le \gamma \|e_c\|^2/2$$
and, since $R'(x^*)^T R(x^*) = 0$,
$$-R'(x_c)^T R(x^*) = (R'(x^*) - R'(x_c))^T R(x^*).$$
Hence,
$$\begin{aligned}
\|e_+\| &\le \|(R'(x_c)^T R'(x_c))^{-1} (R'(x^*) - R'(x_c))^T R(x^*)\| + \|(R'(x_c)^T R'(x_c))^{-1} R'(x_c)^T\|\,\frac{\gamma \|e_c\|^2}{2} \\
&\le \|(R'(x_c)^T R'(x_c))^{-1}\|\,\gamma \|e_c\| \left( \|R(x^*)\| + \frac{\|R'(x_c)^T\|\,\|e_c\|}{2} \right).
\end{aligned} \tag{2.36}$$

Setting
$$K = \gamma \max_{x \in \mathcal{B}(\delta)} \left( \|(R'(x)^T R'(x))^{-1}\|\,\frac{1 + \|R'(x)^T\|}{2} \right)$$
completes the proof.
There are several important consequences of Theorem 2.4.1. The first is that for zero residual
problems, the local convergence rate is q-quadratic because the $\|R(x^*)\|\,\|e_c\|$ term on the right
side of (2.34) vanishes. For a problem other than a zero residual one, even q-linear convergence
is not guaranteed. In fact, if $x_c \in \mathcal{B}(\delta)$ then (2.35) will imply that $\|e_+\| \le r \|e_c\|$ for some
$0 < r < 1$ if
$$K(\delta + \|R(x^*)\|) \le r \tag{2.37}$$
and therefore the q-factor will be $K \|R(x^*)\|$. Hence, for small residual problems and accurate
initial data the convergence of Gauss–Newton will be fast. Gauss–Newton may not converge at
all for large residual problems.
Equation (2.36) exposes a more subtle issue when the term
$$(R'(x^*) - R'(x_c))^T R(x^*)$$
is considered as a whole, rather than estimated by
$$\gamma\,\|e_c\|\,\|R(x^*)\|.$$
Using Taylor's theorem and the necessary conditions ($R'(x^*)^T R(x^*) = 0$) we have
$$R'(x_c)^T R(x^*) = [R'(x^*) + R''(x^*) e_c + O(\|e_c\|^2)]^T R(x^*) = e_c^T R''(x^*)^T R(x^*) + O(\|e_c\|^2).$$

Recall that
$$R''(x^*)^T R(x^*) = \nabla^2 f(x^*) - R'(x^*)^T R'(x^*)$$
and hence
$$\|(R'(x^*) - R'(x_c))^T R(x^*)\| \le \|\nabla^2 f(x^*) - R'(x^*)^T R'(x^*)\|\,\|e_c\| + O(\|e_c\|^2). \tag{2.38}$$

In a sense (2.38) says that even for a large residual problem, convergence can be fast if the problem
is not very nonlinear (small $R''$). In the special case of a linear least squares problem (where
$R'' = 0$) Gauss–Newton becomes simply the solution of the normal equations and converges in
one iteration.
So, Gauss–Newton can be expected to work well for overdetermined small residual problems
and good initial iterates. For large residual problems and/or initial data far from the solution,
there is no reason to expect Gauss–Newton to give good results. We address these issues in
§3.2.3.
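As a small numerical illustration of the zero residual case (an arbitrary example, not from the text), the sketch below runs Gauss–Newton on a tiny overdetermined exponential fit with consistent data; the printed errors decrease roughly quadratically, as Theorem 2.4.1 predicts.

```python
# Hypothetical zero residual example: fit exp(t_i * x) to consistent data
# y_i = exp(t_i * x_star) with N = 1 unknown and M = 3 residuals.  The printed
# errors |x_n - x_star| decrease roughly q-quadratically, consistent with (2.34).
import numpy as np

t = np.array([0.0, 1.0, 2.0])
x_star = 0.5
y = np.exp(t * x_star)                    # consistent data, so R(x_star) = 0

x = 1.0                                   # starting point
for _ in range(5):
    r = np.exp(t * x) - y                 # residual R(x)
    J = (t * np.exp(t * x))[:, None]      # Jacobian R'(x), M x 1
    s, *_ = np.linalg.lstsq(J, -r, rcond=None)
    x = x + s[0]
    print(abs(x - x_star))                # error after each Gauss-Newton step
```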

2.4.3 Underdetermined Problems

We begin with the linear underdetermined least squares problem
$$\min \|Ax - b\|^2. \tag{2.39}$$
If $A$ is $M \times N$ with $M < N$ there will not be a unique minimizer but there will be a unique
minimizer with minimum norm. The minimum norm solution can be expressed in terms of the
singular value decomposition [127], [249] of $A$,
$$A = U \Sigma V^T. \tag{2.40}$$



In (2.40), $\Sigma$ is an $N \times N$ diagonal matrix. The diagonal entries of $\Sigma$, $\{\sigma_i\}$, are called the singular
values. $\sigma_i \ge 0$ and $\sigma_i = 0$ if $i > M$. The columns of the $M \times N$ matrix $U$ and the $N \times N$
matrix $V$ are called the left and right singular vectors. $U$ and $V$ have orthonormal columns and
hence the minimum norm solution of (2.39) is
$$x = A^\dagger b,$$
where
$$A^\dagger = V \Sigma^\dagger U^T, \qquad \Sigma^\dagger = \mathrm{diag}(\sigma_1^\dagger, \ldots, \sigma_N^\dagger),$$
and
$$\sigma_i^\dagger = \begin{cases} \sigma_i^{-1}, & \sigma_i \ne 0, \\ 0, & \sigma_i = 0. \end{cases}$$
$A^\dagger$ is called the Moore–Penrose inverse [49], [189], [212]. If $A$ is a square nonsingular matrix,
then $A^\dagger = A^{-1}$; if $M > N$ then the definition of $A^\dagger$ using the singular value decomposition is
still valid; and, if $A$ has full column rank, $A^\dagger = (A^T A)^{-1} A^T$.
Two simple properties of the Moore–Penrose inverse are that $A^\dagger A$ is a projection onto the
range of $A^\dagger$ and $A A^\dagger$ is a projection onto the range of $A$. This means that
$$A^\dagger A A^\dagger = A^\dagger, \quad (A^\dagger A)^T = A^\dagger A, \quad A A^\dagger A = A, \quad \text{and} \quad (A A^\dagger)^T = A A^\dagger. \tag{2.41}$$
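The following sketch builds $A^\dagger$ from an SVD as in (2.40) and checks the identities in (2.41); the matrix $A$ is an arbitrary randomly generated underdetermined example, and a thin SVD is used instead of the full $N \times N$ factor purely for brevity.

```python
# Sketch: Moore-Penrose inverse via the SVD (2.40) and a check of the
# identities in (2.41).  A is an arbitrary underdetermined example (M < N).
import numpy as np

M, N = 3, 5
rng = np.random.default_rng(0)
A = rng.standard_normal((M, N))

# Thin SVD: U is M x M, sigma has the M nonzero singular values, Vt is M x N.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
sigma_dag = np.array([1.0 / s if s > 1e-12 else 0.0 for s in sigma])
A_dag = Vt.T @ np.diag(sigma_dag) @ U.T     # A† = V Σ† U^T (same as np.linalg.pinv(A))

b = rng.standard_normal(M)
x = A_dag @ b                               # minimum norm solution of (2.39)

assert np.allclose(A_dag @ A @ A_dag, A_dag)
assert np.allclose(A @ A_dag @ A, A)
assert np.allclose((A_dag @ A).T, A_dag @ A)
assert np.allclose((A @ A_dag).T, A @ A_dag)
```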

So the minimum norm solution of the local linear model (2.33) of an underdetermined
nonlinear least squares problem can be written as [17], [102]
$$s = -R'(x_c)^\dagger R(x_c) \tag{2.42}$$
and the Gauss–Newton iteration [17] is
$$x_+ = x_c - R'(x_c)^\dagger R(x_c). \tag{2.43}$$
The challenge in formulating a local convergence result is that there is not a unique optimal point
that attracts the iterates.
In the linear case, where $R(x) = Ax - b$, one gets
$$x_+ = x_c - A^\dagger(Ax_c - b) = (I - A^\dagger A)x_c + A^\dagger b.$$
Set $e = x - A^\dagger b$ and note that
$$A^\dagger A A^\dagger b = A^\dagger b$$
by (2.41). Hence
$$e_+ = (I - A^\dagger A)e_c.$$
This does not imply that $x_+ = A^\dagger b$, the minimum norm solution, only that $x_+$ is a solution of
the problem and the iteration converges in one step. The Gauss–Newton iteration cannot correct
for errors that are not in the range of $A^\dagger$.
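A small sketch of this one-step behavior in the linear underdetermined case: the Gauss–Newton step produces a solution of $Ax = b$, but not, in general, the minimum norm solution $A^\dagger b$, because the component of $x_c$ in the null space of $A$ is never corrected. The data below are arbitrary.

```python
# Sketch: one Gauss-Newton step x_+ = x_c - A†(A x_c - b) on a linear
# underdetermined problem solves A x = b exactly, but x_+ is generally not
# the minimum norm solution A† b.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))           # M = 3 < N = 5
b = rng.standard_normal(3)
A_dag = np.linalg.pinv(A)

xc = rng.standard_normal(5)               # arbitrary starting point
xp = xc - A_dag @ (A @ xc - b)            # one Gauss-Newton step

print(np.linalg.norm(A @ xp - b))         # ~0: xp solves the problem
print(np.linalg.norm(xp - A_dag @ b))     # generally nonzero: not minimum norm
```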
Let
$$\mathcal{Z} = \{x \mid R(x) = 0\}.$$
We show in Theorem 2.4.2, a special case of the result in [92], that if the standard assumptions
hold at a point $x^* \in \mathcal{Z}$, then the iteration will converge q-quadratically to a point $z^* \in \mathcal{Z}$.
However, there is no reason to expect that $z^* = x^*$. In general $z^*$ will depend on $x_0$, a very
different situation from the overdetermined case. The hypotheses of Theorem 2.4.2, especially
that of full column rank in $R'(x)$, are less general than those in [24], [17], [25], [92], and [90].



Theorem 2.4.2. Let $M \le N$ and let Assumption 2.4.1 hold for some $x^* \in \mathcal{Z}$. Then there is
$\delta > 0$ such that if
$$\|x_0 - x^*\| \le \delta,$$
then the Gauss–Newton iterates
$$x_{n+1} = x_n - R'(x_n)^\dagger R(x_n)$$
exist and converge r-quadratically to a point $z^* \in \mathcal{Z}$.
Proof. Assumption 2.4.1 and results in [49], [126] imply that if $\delta$ is sufficiently small then
there is $\rho_1$ such that $R'(x)^\dagger$ is Lipschitz continuous in the set
$$\mathcal{B}_1 = \{x \mid \|x - x^*\| \le \rho_1\}$$
and the singular values of $R'(x)$ are bounded away from zero in $\mathcal{B}_1$. We may, reducing $\rho_1$ if
necessary, apply the Kantorovich theorem [154], [151], [211] to show that if $x \in \mathcal{B}_1$ and $w \in \mathcal{Z}$
is such that
$$\|x - w\| = \min_{z \in \mathcal{Z}} \|x - z\|,$$
then there is $\xi = \xi(x) \in \mathcal{Z}$ such that
$$\|w - \xi(x)\| = O(\|x - w\|^2) \le \|x - w\|/2$$

and $x - \xi(x)$ is in the range of $R'(w)^\dagger$, i.e.,
$$R'(w)^\dagger R'(w)(x - \xi(x)) = x - \xi(x).$$

The method of the proof is to adjust $\delta$ so that the Gauss–Newton iterates remain in $\mathcal{B}_1$ and
$\|R(x_n)\| \to 0$ sufficiently rapidly. We begin by requiring that $\delta < \rho_1/2$.
Let $x_c \in \mathcal{B}_1$ and let $e = x - \xi(x_c)$. Taylor's theorem, the fundamental theorem of calculus,
and (2.41) imply that
$$\begin{aligned}
e_+ &= e_c - R'(x_c)^\dagger R(x_c)\\
    &= e_c - (R'(x_c)^\dagger - R'(\xi(x_c))^\dagger) R(x_c) - R'(x^*)^\dagger R(x_c)\\
    &= e_c - R'(x^*)^\dagger R(x_c) + O(\|e_c\|^2)\\
    &= (I - R'(x^*)^\dagger R'(x^*)) e_c + O(\|e_c\|^2) = O(\|e_c\|^2).
\end{aligned}$$

If we define $d(x) = \min_{z \in \mathcal{Z}} \|x - z\|$ then there is $K_1$ such that
$$d(x_+) \le \|x_+ - \xi(x_c)\| \le K_1 \|x_c - \xi(x_c)\|^2 \le K_1 d(x_c)^2. \tag{2.44}$$
Now let
$$\rho_2 = \min(\rho_1, (2K_1)^{-1}).$$
So if
$$x_c \in \mathcal{B}_2 = \{x \mid \|x - x^*\| \le \rho_2\}$$
then
$$d(x_+) \le d(x_c)/2. \tag{2.45}$$
Finally, there is $K_2$ such that
$$\|x_+ - x^*\| \le \|x_c - x^*\| + \|x_+ - x_c\| = \|x_c - x^*\| + \|R'(x_c)^\dagger R(x_c)\| \le \|x_c - x^*\| + K_2 \|x_c - \xi(x_c)\|.$$



We now require that
$$\delta \le \frac{\rho_2}{2(1 + K_2)}. \tag{2.46}$$
We complete the proof by induction. If $\|x_0 - x^*\| \le \delta$ and the Gauss–Newton iterates $\{x_k\}_{k=0}^{n}$
are in $\mathcal{B}_2$, then $x_{n+1}$ is defined and, using (2.46) and (2.44),
$$\|x_{n+1} - x^*\| \le \|x_0 - x^*\| + K_3 \sum_{k=0}^{n+1} d(x_k) \le \delta + 2 K_3\, d(x_0) \le \rho_1.$$
Hence, the Gauss–Newton iterates exist, remain in $\mathcal{B}_1$, and $d(x_n) \to 0$.
To show that the sequence of Gauss–Newton iterates does in fact converge, we observe that
there is $K_3$ such that
$$\|x_+ - x_c\| = \|R'(x_c)^\dagger R(x_c)\| \le K_3 \|x_c - \xi(x_c)\| \le K_3\, d(x_c).$$

Therefore (2.45) implies that for any $m, n \ge 0$,
$$\|x_{n+m} - x_n\| \le \sum_{l=n}^{n+m-1} \|x_{l+1} - x_l\| \le K_3 \sum_{l=n}^{n+m-1} d(x_l) \le K_3\, d(x_n)\, \frac{1 - 2^{-m}}{1/2} \le 2 K_3\, d(x_n) \le K_3\, 2^{-n+1} d(x_0).$$
Hence, $\{x_k\}$ is a Cauchy sequence and therefore converges to a point $z^* \in \mathcal{Z}$. Since
$$\|x_n - z^*\| \le 2 K_3\, d(x_n),$$
(2.44) implies that the convergence is r-quadratic.
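As an illustrative sketch of Theorem 2.4.2 (an arbitrary example, not from the text), the code below applies the Gauss–Newton iteration (2.43) to the underdetermined problem $R(x) = x_1^2 + x_2^2 - 1$, for which $\mathcal{Z}$ is the unit circle; the iterates converge rapidly to a zero of $R$, and different starting points produce different limit points in $\mathcal{Z}$.

```python
# Sketch: Gauss-Newton (2.43) on an underdetermined problem R: R^2 -> R^1,
# R(x) = x_1^2 + x_2^2 - 1, so Z is the unit circle.  The limit point depends
# on the starting point x_0.
import numpy as np

def R(x):
    return np.array([x[0]**2 + x[1]**2 - 1.0])

def Rp(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]]])   # 1 x 2 Jacobian

def gauss_newton(x0, steps=6):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - np.linalg.pinv(Rp(x)) @ R(x)      # x_+ = x_c - R'(x_c)† R(x_c)
    return x

print(gauss_newton([2.0, 0.5]))   # converges to one point on the circle
print(gauss_newton([0.5, 2.0]))   # a different limit for a different x_0
```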


2.5 Inexact Newton Methods

An inexact Newton method [74] uses an approximate Newton step $s = x_+ - x_c$, requiring only
that
$$\|\nabla^2 f(x_c) s + \nabla f(x_c)\| \le \eta_c \|\nabla f(x_c)\|, \tag{2.47}$$
i.e., that the linear residual be small. We will refer to any vector $s$ that satisfies (2.47) with
$\eta_c < 1$ as an inexact Newton step. We will refer to the parameter $\eta_c$ on the right-hand side of
(2.47) as the forcing term [99].
Inexact Newton methods are also called truncated Newton methods [75], [198], [199] in the
context of optimization. In this book, we consider Newton–iterative methods. This is the class of
inexact Newton methods in which the linear equation (2.4) for the Newton step is also solved by
an iterative method and (2.47) is the termination criterion for that linear iteration. It is standard
to refer to the sequence of Newton steps $\{x_n\}$ as the outer iteration and the sequence of iterates
for the linear equation as the inner iteration. The naming convention (see [33], [154], [211])
is that Newton–CG, for example, refers to the Newton–iterative method in which the conjugate
gradient [141] algorithm is used to perform the inner iteration.
Newton–CG is particularly appropriate for optimization, as we expect positive definite Hessians
near a local minimizer. The results for inexact Newton methods from [74] and [154]
are sufficient to describe the local convergence behavior of Newton–CG, and we summarize
the relevant results from nonlinear equations in §2.5.1. We will discuss the implementation of
Newton–CG in §2.5.2.
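A minimal sketch of a Newton–CG inner iteration truncated by the criterion (2.47); the routines grad and hess_vec (returning $\nabla f$ and Hessian–vector products) are assumed to be supplied, the Hessian is assumed positive definite as discussed above, and a production code would also guard against negative curvature.

```python
# Sketch of an inexact Newton (Newton-CG) step: conjugate gradient on the
# Newton system  H s = -grad f  is truncated once the linear residual
# satisfies (2.47), ||H s + grad f|| <= eta_c ||grad f||.
import numpy as np

def newton_cg_step(grad, hess_vec, xc, eta_c=0.1, maxiter=50):
    g = grad(xc)
    s = np.zeros_like(g)
    r = -g - hess_vec(xc, s)          # residual of the Newton system H s = -g
    p = r.copy()
    tol = eta_c * np.linalg.norm(g)   # forcing-term tolerance from (2.47)
    for _ in range(maxiter):
        if np.linalg.norm(r) <= tol:  # inexact Newton condition (2.47)
            break
        Hp = hess_vec(xc, p)
        alpha = (r @ r) / (p @ Hp)    # assumes p' H p > 0 (H positive definite)
        s = s + alpha * p
        r_new = r - alpha * Hp
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return xc + s                     # inexact Newton iterate x_+ = x_c + s
```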
