LOCAL CONVERGENCE 23

With this notation it is easy to show that

\[
\nabla f(x) = R'(x)^T R(x) \in \mathbb{R}^N. \tag{2.29}
\]

The necessary conditions for optimality imply that at a minimizer x∗

\[
R'(x^*)^T R(x^*) = 0. \tag{2.30}
\]

In the underdetermined case, if R′(x∗) has full row rank, (2.30) implies that R(x∗) = 0; this is not the case for overdetermined problems.

The cost of a gradient is roughly that of a Jacobian evaluation, which, as is the case with nonlinear equations, is the most one is willing to accept. Computation of the Hessian (an N × N matrix)

\[
\nabla^2 f(x) = R'(x)^T R'(x) + \sum_{j=1}^{M} r_j(x) \nabla^2 r_j(x) \tag{2.31}
\]

requires computation of the M Hessians {∇²r_j} for the second-order term

\[
\sum_{j=1}^{M} r_j(x) \nabla^2 r_j(x)
\]

and is too costly to be practical.

We will also express the second-order term as

\[
\sum_{j=1}^{M} r_j(x) \nabla^2 r_j(x) = R''(x)^T R(x),
\]

where the second derivative R″ is a tensor. The notation is to be interpreted in the following way. For v ∈ R^M, R″(x)^T v is the N × N matrix

\[
R''(x)^T v = \nabla^2 \bigl(R(x)^T v\bigr) = \sum_{i=1}^{M} (v)_i \, \nabla^2 r_i(x).
\]

We will use the tensor notation when expanding R about x∗ in some of the analysis to follow.
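To make (2.29) concrete, here is a small numerical check, not from the text, on a made-up residual R: R² → R³: the gradient of f(x) = ½‖R(x)‖² agrees with R′(x)ᵀR(x) to finite-difference accuracy.

```python
import numpy as np

# Hypothetical residual (M = 3 > N = 2), chosen only for illustration.
def R(x):
    return np.array([x[0]**2 + x[1] - 1.0,
                     x[0] - x[1]**2,
                     x[0]*x[1] - 0.5])

def Rprime(x):  # M x N Jacobian, computed analytically
    return np.array([[2*x[0], 1.0],
                     [1.0, -2*x[1]],
                     [x[1], x[0]]])

def f(x):  # f(x) = (1/2) ||R(x)||^2
    r = R(x)
    return 0.5 * r @ r

x = np.array([0.7, 0.3])
grad = Rprime(x).T @ R(x)          # gradient via (2.29)

# Compare against a central finite-difference gradient of f.
h = 1e-5
fd = np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(2)])
err = np.linalg.norm(grad - fd)
```

The analytic and finite-difference gradients agree to roughly O(h²), confirming (2.29) for this example.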

2.4.1 Gauss–Newton Iteration

The Gauss–Newton algorithm simply discards the second-order term in ∇²f and computes a step

\[
\begin{aligned}
s &= -(R'(x_c)^T R'(x_c))^{-1} \nabla f(x_c) \\
  &= -(R'(x_c)^T R'(x_c))^{-1} R'(x_c)^T R(x_c).
\end{aligned} \tag{2.32}
\]

The Gauss–Newton iterate is x₊ = x_c + s. One motivation for this approach is that R″(x)^T R(x) vanishes for zero residual problems and therefore might be negligible for small residual problems.

Implicit in (2.32) is the assumption that R′(x_c)^T R′(x_c) is nonsingular, which implies that M ≥ N. Another interpretation, which also covers underdetermined problems, is to say that the Gauss–Newton iterate is the minimum norm solution of the local linear model of our nonlinear least squares problem

\[
\min_x \frac{1}{2} \bigl\| R(x_c) + R'(x_c)(x - x_c) \bigr\|^2. \tag{2.33}
\]

Using (2.33) and linear least squares methods is a more accurate way to compute the step than using (2.32) [115], [116], [127]. In the underdetermined case, the Gauss–Newton step can be computed with the singular value decomposition [49], [127], [249]. (2.33) is an overdetermined, square, or underdetermined linear least squares problem if the nonlinear problem is overdetermined, square, or underdetermined.
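The remark that (2.33) is the more accurate route can be illustrated with a sketch (the residual is the same hypothetical example as before, not from the text): the step from the normal equations (2.32) and the step from a least squares solve of (2.33) agree, but the latter avoids forming R′ᵀR′, which squares the condition number.

```python
import numpy as np

# Hypothetical overdetermined residual, for illustration only.
def R(x):
    return np.array([x[0]**2 + x[1] - 1.0,
                     x[0] - x[1]**2,
                     x[0]*x[1] - 0.5])

def Rprime(x):
    return np.array([[2*x[0], 1.0],
                     [1.0, -2*x[1]],
                     [x[1], x[0]]])

xc = np.array([0.7, 0.3])
J, r = Rprime(xc), R(xc)

# (2.32): solve the normal equations J^T J s = -J^T r.
s_normal = -np.linalg.solve(J.T @ J, J.T @ r)

# (2.33): minimize ||r + J s||_2 directly (QR/SVD inside lstsq).
s_lstsq = np.linalg.lstsq(J, -r, rcond=None)[0]

gap = np.linalg.norm(s_normal - s_lstsq)
```

On this well-conditioned toy problem the two steps coincide to machine precision; the difference shows up when J is ill conditioned.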

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.


The standard assumptions for nonlinear least squares problems follow in Assumption 2.4.1.

Assumption 2.4.1. x∗ is a minimizer of ‖R‖₂², R is Lipschitz continuously differentiable near x∗, and R′(x∗)^T R′(x∗) has maximal rank. The rank assumption may also be stated as

• R′(x∗) is nonsingular (M = N),

• R′(x∗) has full column rank (M > N),

• R′(x∗) has full row rank (M < N).

2.4.2 Overdetermined Problems

Theorem 2.4.1. Let M > N. Let Assumption 2.4.1 hold. Then there are K > 0 and δ > 0 such that if x_c ∈ B(δ) then the error in the Gauss–Newton iteration satisfies

\[
\|e_+\| \le K \bigl( \|e_c\|^2 + \|R(x^*)\| \, \|e_c\| \bigr). \tag{2.34}
\]

Proof. Let δ be small enough so that ‖x − x∗‖ < δ implies that R′(x)^T R′(x) is nonsingular. Let γ be the Lipschitz constant for R′.

By (2.32)

\[
\begin{aligned}
e_+ &= e_c - (R'(x_c)^T R'(x_c))^{-1} R'(x_c)^T R(x_c) \\
    &= (R'(x_c)^T R'(x_c))^{-1} R'(x_c)^T \bigl( R'(x_c) e_c - R(x_c) \bigr).
\end{aligned} \tag{2.35}
\]

Note that

\[
\begin{aligned}
R'(x_c) e_c - R(x_c) &= R'(x_c) e_c - R(x^*) + R(x^*) - R(x_c) \\
&= -R(x^*) + \bigl( R'(x_c) e_c + R(x^*) - R(x_c) \bigr).
\end{aligned}
\]

Now,

\[
\| R'(x_c) e_c + R(x^*) - R(x_c) \| \le \gamma \|e_c\|^2 / 2
\]

and, since R′(x∗)^T R(x∗) = 0,

\[
-R'(x_c)^T R(x^*) = \bigl( R'(x^*) - R'(x_c) \bigr)^T R(x^*).
\]

Hence,

\[
\begin{aligned}
\|e_+\| &\le \bigl\| (R'(x_c)^T R'(x_c))^{-1} (R'(x^*) - R'(x_c))^T R(x^*) \bigr\|
+ \bigl\| (R'(x_c)^T R'(x_c))^{-1} R'(x_c)^T \bigr\| \, \frac{\gamma \|e_c\|^2}{2} \\
&\le \bigl\| (R'(x_c)^T R'(x_c))^{-1} \bigr\| \, \gamma \|e_c\|
\left( \|R(x^*)\| + \frac{\|R'(x_c)^T\| \, \|e_c\|}{2} \right).
\end{aligned} \tag{2.36}
\]

Setting

\[
K = \gamma \max_{x \in B(\delta)} \left\{ \bigl\| (R'(x)^T R'(x))^{-1} \bigr\|
\left( 1 + \frac{\|R'(x)^T\|}{2} \right) \right\}
\]



completes the proof.

There are several important consequences of Theorem 2.4.1. The first is that for zero residual problems the local convergence rate is q-quadratic, because the ‖R(x∗)‖‖e_c‖ term on the right side of (2.34) vanishes. For a problem other than a zero residual one, even q-linear convergence is not guaranteed. In fact, if x_c ∈ B(δ) then (2.34) will imply that ‖e₊‖ ≤ r‖e_c‖ for some 0 < r < 1 if

\[
K \bigl( \delta + \|R(x^*)\| \bigr) \le r, \tag{2.37}
\]

and therefore the q-factor will be K‖R(x∗)‖. Hence, for small residual problems and accurate initial data the convergence of Gauss–Newton will be fast. Gauss–Newton may not converge at all for large residual problems.

Equation (2.36) exposes a more subtle issue when the term

\[
\bigl( R'(x^*) - R'(x_c) \bigr)^T R(x^*)
\]

is considered as a whole, rather than estimated by

\[
\gamma \|e_c\| \, \|R(x^*)\|.
\]

Using Taylor's theorem and the necessary conditions (R′(x∗)^T R(x∗) = 0) we have

\[
\begin{aligned}
R'(x_c)^T R(x^*) &= \bigl[ R'(x^*) + R''(x^*) e_c + O(\|e_c\|^2) \bigr]^T R(x^*) \\
&= e_c^T R''(x^*)^T R(x^*) + O(\|e_c\|^2).
\end{aligned}
\]

Recall that

\[
R''(x^*)^T R(x^*) = \nabla^2 f(x^*) - R'(x^*)^T R'(x^*)
\]

and hence

\[
\bigl\| (R'(x^*) - R'(x_c))^T R(x^*) \bigr\|
\le \bigl\| \nabla^2 f(x^*) - R'(x^*)^T R'(x^*) \bigr\| \, \|e_c\| + O(\|e_c\|^2). \tag{2.38}
\]

In a sense (2.38) says that even for a large residual problem, convergence can be fast if the problem is not very nonlinear (small ‖R″‖). In the special case of a linear least squares problem (where R″ = 0) Gauss–Newton becomes simply the solution of the normal equations and converges in one iteration.
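The linear special case can be sketched in a few lines (the matrix and data here are random placeholders, not from the text): for R(x) = Ax − b the Gauss–Newton step solves the normal equations exactly, so the iteration lands on the minimizer in one step and then stays there.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))    # overdetermined, full column rank a.s.
b = rng.standard_normal(5)

def gn_step(x):
    # R(x) = Ax - b, R'(x) = A: the Gauss-Newton step (2.32) solves
    # the normal equations A^T A s = -A^T (Ax - b).
    return x - np.linalg.solve(A.T @ A, A.T @ (A @ x - b))

x0 = rng.standard_normal(3)
x1 = gn_step(x0)                   # one iteration reaches the minimizer
x2 = gn_step(x1)                   # further iterations do not move
xstar = np.linalg.lstsq(A, b, rcond=None)[0]
```

Here x1 already equals the least squares solution to machine precision, matching the one-iteration claim.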

So, Gauss–Newton can be expected to work well for overdetermined small residual problems and good initial iterates. For large residual problems and/or initial data far from the solution, there is no reason to expect Gauss–Newton to give good results. We address these issues in §3.2.3.

2.4.3 Underdetermined Problems

We begin with the linear underdetermined least squares problem

\[
\min_x \| Ax - b \|^2. \tag{2.39}
\]

If A is M × N with M < N there will not be a unique minimizer, but there will be a unique minimizer with minimum norm. The minimum norm solution can be expressed in terms of the singular value decomposition [127], [249] of A,

\[
A = U \Sigma V^T. \tag{2.40}
\]



In (2.40), Σ is an N × N diagonal matrix. The diagonal entries of Σ, {σ_i}, are called the singular values. σ_i ≥ 0 and σ_i = 0 if i > M. The columns of the M × N matrix U and the N × N matrix V are called the left and right singular vectors. U and V have orthonormal columns, and hence the minimum norm solution of (2.39) is

\[
x = A^\dagger b,
\]

where

\[
A^\dagger = V \Sigma^\dagger U^T, \qquad
\Sigma^\dagger = \mathrm{diag}(\sigma_1^\dagger, \ldots, \sigma_N^\dagger),
\]

and

\[
\sigma_i^\dagger =
\begin{cases}
\sigma_i^{-1}, & \sigma_i \ne 0, \\
0, & \sigma_i = 0.
\end{cases}
\]

A† is called the Moore–Penrose inverse [49], [189], [212]. If A is a square nonsingular matrix, then A† = A⁻¹; if M > N then the definition of A† using the singular value decomposition is still valid; and, if A has full column rank, A† = (AᵀA)⁻¹Aᵀ.

Two simple properties of the Moore–Penrose inverse are that A†A is a projection onto the range of A† and AA† is a projection onto the range of A. This means that

\[
A^\dagger A A^\dagger = A^\dagger, \quad (A^\dagger A)^T = A^\dagger A, \quad
A A^\dagger A = A, \quad \text{and} \quad (A A^\dagger)^T = A A^\dagger. \tag{2.41}
\]
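As a quick numerical illustration, not part of the text, the four identities in (2.41) can be verified with NumPy's SVD-based pseudoinverse on a random underdetermined matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))    # M < N, full row rank a.s.
Ad = np.linalg.pinv(A)             # Moore-Penrose inverse via the SVD

# The four Moore-Penrose identities of (2.41):
ok1 = np.allclose(Ad @ A @ Ad, Ad)      # A† A A† = A†
ok2 = np.allclose((Ad @ A).T, Ad @ A)   # (A† A)^T = A† A
ok3 = np.allclose(A @ Ad @ A, A)        # A A† A = A
ok4 = np.allclose((A @ Ad).T, A @ Ad)   # (A A†)^T = A A†
```

All four hold to machine precision, consistent with A†A and AA† being orthogonal projections.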

So the minimum norm solution of the local linear model (2.33) of an underdetermined nonlinear least squares problem can be written as [17], [102]

\[
s = -R'(x_c)^\dagger R(x_c) \tag{2.42}
\]

and the Gauss–Newton iteration [17] is

\[
x_+ = x_c - R'(x_c)^\dagger R(x_c). \tag{2.43}
\]

The challenge in formulating a local convergence result is that there is not a unique optimal point

that attracts the iterates.

In the linear case, where R(x) = Ax − b, one gets

\[
x_+ = x_c - A^\dagger (A x_c - b) = (I - A^\dagger A) x_c + A^\dagger b.
\]

Set e = x − A†b and note that

\[
A^\dagger A A^\dagger b = A^\dagger b
\]

by (2.41). Hence

\[
e_+ = (I - A^\dagger A) e_c.
\]

This does not imply that x₊ = A†b, the minimum norm solution, only that x₊ is a solution of the problem and the iteration converges in one step. The Gauss–Newton iteration cannot correct for errors that are not in the range of A†.
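A short sketch, again with random placeholder data, makes the point: one step of (2.43) in the linear case produces a solution (zero residual), but the error component outside the range of A† survives, exactly as e₊ = (I − A†A)e_c predicts.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 5))    # underdetermined, full row rank a.s.
b = rng.standard_normal(3)
Ad = np.linalg.pinv(A)

xc = rng.standard_normal(5)
xplus = xc - Ad @ (A @ xc - b)     # one Gauss-Newton step, (2.43)

# x+ solves the problem in one step ...
residual = np.linalg.norm(A @ xplus - b)

# ... but the error relative to the minimum norm solution A†b
# is the projection of e_c onto the nullspace component:
eplus = xplus - Ad @ b
ec = xc - Ad @ b
pred = (np.eye(5) - Ad @ A) @ ec
```

The residual vanishes, eplus matches (I − A†A)e_c, and eplus is generically nonzero: the iterate depends on where it started.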

Let

\[
Z = \{ x \mid R(x) = 0 \}.
\]

We show in Theorem 2.4.2, a special case of the result in [92], that if the standard assumptions hold at a point x∗ ∈ Z, then the iteration will converge r-quadratically to a point z∗ ∈ Z. However, there is no reason to expect that z∗ = x∗. In general z∗ will depend on x₀, a very different situation from the overdetermined case. The hypotheses of Theorem 2.4.2, especially that of full row rank in R′(x), are less general than those in [24], [17], [25], [92], and [90].



Theorem 2.4.2. Let M ≤ N and let Assumption 2.4.1 hold for some x∗ ∈ Z. Then there is δ > 0 such that if

\[
\| x_0 - x^* \| \le \delta,
\]

then the Gauss–Newton iterates

\[
x_{n+1} = x_n - R'(x_n)^\dagger R(x_n)
\]

exist and converge r-quadratically to a point z∗ ∈ Z.

Proof. Assumption 2.4.1 and results in [49], [126] imply that if δ is sufficiently small then there is ρ₁ such that R′(x)† is Lipschitz continuous in the set

\[
B_1 = \{ x \mid \|x - x^*\| \le \rho_1 \}
\]

and the singular values of R′(x) are bounded away from zero in B₁. We may, reducing ρ₁ if necessary, apply the Kantorovich theorem [154], [151], [211] to show that if x ∈ B₁ and w ∈ Z is such that

\[
\|x - w\| = \min_{z \in Z} \|x - z\|,
\]

then there is ξ = ξ(x) ∈ Z such that

\[
\|w - \xi(x)\| = O(\|x - w\|^2) \le \|x - w\|/2
\]

and x − ξ(x) is in the range of R′(w)†, i.e.,

\[
R'(w)^\dagger R'(w) \bigl( x - \xi(x) \bigr) = x - \xi(x).
\]

The method of the proof is to adjust δ so that the Gauss–Newton iterates remain in B₁ and R(x_n) → 0 sufficiently rapidly. We begin by requiring that δ < ρ₁/2.

Let x_c ∈ B₁ and let e = x − ξ(x_c). Taylor's theorem, the fundamental theorem of calculus, and (2.41) imply that

\[
\begin{aligned}
e_+ &= e_c - R'(x_c)^\dagger R(x_c) \\
&= e_c - \bigl( R'(x_c)^\dagger - R'(\xi(x_c))^\dagger \bigr) R(x_c)
      - R'(\xi(x_c))^\dagger R(x_c) \\
&= e_c - R'(\xi(x_c))^\dagger R(x_c) + O(\|e_c\|^2) \\
&= \bigl( I - R'(\xi(x_c))^\dagger R'(\xi(x_c)) \bigr) e_c + O(\|e_c\|^2)
 = O(\|e_c\|^2).
\end{aligned}
\]

If we define d(x) = min_{z∈Z} ‖x − z‖ then there is K₁ such that

\[
d(x_+) \le \|x_+ - \xi(x_c)\| \le K_1 \|x_c - \xi(x_c)\|^2 \le K_1 d(x_c)^2. \tag{2.44}
\]

Now let

\[
\rho_2 = \min\bigl( \rho_1, (2K_1)^{-1} \bigr).
\]

So if

\[
x_c \in B_2 = \{ x \mid \|x - x^*\| \le \rho_2 \}
\]

then

\[
d(x_+) \le d(x_c)/2. \tag{2.45}
\]

Finally, there is K₂ such that

\[
\begin{aligned}
\|x_+ - x^*\| &\le \|x_c - x^*\| + \|x_+ - x_c\|
 = \|x_c - x^*\| + \|R'(x_c)^\dagger R(x_c)\| \\
&\le \|x_c - x^*\| + K_2 \|x_c - \xi(x_c)\|.
\end{aligned}
\]



We now require that

\[
\delta \le \frac{\rho_2}{2(1 + K_2)}. \tag{2.46}
\]

We complete the proof by induction. If ‖x₀ − x∗‖ ≤ δ and the Gauss–Newton iterates {x_k}_{k=0}^{n} are in B₂, then x_{n+1} is defined and, using (2.45) and (2.46),

\[
\|x_{n+1} - x^*\| \le \|x_0 - x^*\| + K_2 \sum_{k=0}^{n} d(x_k)
\le \delta + 2 K_2 d(x_0) \le \rho_2.
\]

Hence, the Gauss–Newton iterates exist, remain in B₂, and d(x_n) → 0.

To show that the sequence of Gauss–Newton iterates does in fact converge, we observe that there is K₃ such that

\[
\|x_+ - x_c\| = \|R'(x_c)^\dagger R(x_c)\| \le K_3 \|x_c - \xi(x_c)\| \le K_3 d(x_c).
\]

Therefore (2.45) implies that for any m, n ≥ 0,

\[
\begin{aligned}
\|x_{n+m} - x_n\| &\le \sum_{l=n}^{n+m-1} \|x_{l+1} - x_l\|
\le K_3 \sum_{l=n}^{n+m-1} d(x_l) \\
&\le K_3 d(x_n) \sum_{j=0}^{m-1} 2^{-j}
\le 2 K_3 d(x_n) \le 2^{-n+1} K_3 d(x_0).
\end{aligned}
\]

Hence, {x_k} is a Cauchy sequence and therefore converges to a point z∗ ∈ Z. Since

\[
\|x_n - z^*\| \le 2 K_3 d(x_n),
\]

(2.44) implies that the convergence is r-quadratic.

2.5 Inexact Newton Methods

An inexact Newton method [74] uses an approximate Newton step s = x₊ − x_c, requiring only that

\[
\| \nabla^2 f(x_c) s + \nabla f(x_c) \| \le \eta_c \| \nabla f(x_c) \|, \tag{2.47}
\]

i.e., that the linear residual be small. We will refer to any vector s that satisfies (2.47) with η_c < 1 as an inexact Newton step. We will refer to the parameter η_c on the right-hand side of (2.47) as the forcing term [99].

Inexact Newton methods are also called truncated Newton methods [75], [198], [199] in the context of optimization. In this book, we consider Newton–iterative methods. This is the class of inexact Newton methods in which the linear equation (2.4) for the Newton step is also solved by an iterative method and (2.47) is the termination criterion for that linear iteration. It is standard to refer to the sequence of Newton steps {x_n} as the outer iteration and the sequence of iterates for the linear equation as the inner iteration. The naming convention (see [33], [154], [211]) is that Newton–CG, for example, refers to the Newton–iterative method in which the conjugate gradient [141] algorithm is used to perform the inner iteration.
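The inner iteration with the termination test (2.47) can be sketched as follows. This is a minimal illustration, not the book's implementation (which is discussed in §2.5.2); the Hessian and gradient are random placeholders for a quadratic model.

```python
import numpy as np

def newton_cg_step(hess, grad, eta):
    """Approximately solve hess @ s = -grad by conjugate gradient,
    stopping when the linear residual satisfies the inexact Newton
    condition (2.47): ||hess @ s + grad|| <= eta * ||grad||."""
    n = grad.size
    s = np.zeros(n)
    r = -grad.copy()               # residual of hess @ s = -grad at s = 0
    p = r.copy()
    tol = eta * np.linalg.norm(grad)
    for _ in range(n):             # CG terminates in <= n steps exactly
        if np.linalg.norm(r) <= tol:
            break                  # forcing-term test met: truncate
        Hp = hess @ p
        alpha = (r @ r) / (p @ Hp)
        s = s + alpha * p
        r_new = r - alpha * Hp
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return s

# Hypothetical SPD Hessian and gradient near a local minimizer.
rng = np.random.default_rng(3)
B = rng.standard_normal((6, 6))
H = B @ B.T + 6 * np.eye(6)        # symmetric positive definite
g = rng.standard_normal(6)

s = newton_cg_step(H, g, eta=1e-2)
lin_res = np.linalg.norm(H @ s + g)
```

The returned step satisfies (2.47) with η_c = 10⁻², so it is an inexact Newton step; tightening η_c makes the inner iteration run longer and the step closer to the exact Newton step.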

Newton–CG is particularly appropriate for optimization, as we expect positive definite Hessians near a local minimizer. The results for inexact Newton methods from [74] and [154] are sufficient to describe the local convergence behavior of Newton–CG, and we summarize the relevant results from nonlinear equations in §2.5.1. We will discuss the implementation of Newton–CG in §2.5.2.