The first step in the analysis is to use Lemma 3.2.1 to obtain a lower bound for the steplength.

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

42 ITERATIVE METHODS FOR OPTIMIZATION

Lemma 3.2.2. Assume that ∇f is Lipschitz continuous with Lipschitz constant L. Let α ∈ (0, 1), x ∈ R^N, and H be an spd matrix. Let λ_s > 0 be the smallest and λ_l ≥ λ_s the largest eigenvalues of H. Let d be given by (3.3). Assume that ∇f(x) ≠ 0. Then (3.4) holds for any λ such that

(3.6)    0 < λ ≤ 2λ_s(1 − α) / (Lκ(H)).

Proof. Let d = −H^{-1}∇f(x). By the fundamental theorem of calculus

    f(x + λd) − f(x) = ∫_0^1 ∇f(x + tλd)^T λd dt.

Hence

(3.7)    f(x + λd) = f(x) + λ∇f(x)^T d + λ ∫_0^1 (∇f(x + tλd) − ∇f(x))^T d dt.

Therefore,

    f(x + λd) = f(x − λH^{-1}∇f(x)) ≤ f(x) + λ∇f(x)^T d + (λ^2 L/2)‖d‖^2.

Positivity of H, Lemma 3.2.1, and κ(H) = λ_l λ_s^{-1} imply that

    ‖d‖^2 = ‖H^{-1}∇f(x)‖^2 ≤ λ_s^{-2} ∇f(x)^T ∇f(x)
          ≤ −λ_l λ_s^{-2} ∇f(x)^T d = −κ(H) λ_s^{-1} ∇f(x)^T d.

Hence

    f(x + λd) ≤ f(x) + (λ − λ^2 L λ_s^{-1} κ(H)/2) ∇f(x)^T d,

which implies (3.4) if

    α ≤ 1 − λL λ_s^{-1} κ(H)/2.

This is equivalent to (3.6).

Lemma 3.2.3. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {x_k} be the iteration given by Algorithm optarm with spd matrices H_k that satisfy

(3.8)    κ(H_k) ≤ κ̄

for all k. Then the steps

    s_k = x_{k+1} − x_k = λ_k d_k = −λ_k H_k^{-1} ∇f(x_k)

satisfy

(3.9)    λ_k ≥ λ̄ = 2β_low λ_s (1 − α) / (Lκ̄)

and at most

(3.10)   m = log(2λ_s(1 − α)/(Lκ̄)) / log(β_high)

stepsize reductions will be required.

Proof. In the context of Algorithm optarm, Lemma 3.2.2 implies that the line search will terminate when

    λ ≤ 2λ_s(1 − α) / (Lκ(H_k)),


if not before. The most that one can overshoot this is by a factor of β_low, which proves (3.9).

The line search will require at most m stepsize reductions, where m is the least nonnegative integer such that

    2λ_s(1 − α) / (Lκ(H_k)) > β_high^m.

This implies (3.10).
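The bounds (3.9) and (3.10) can be evaluated directly for concrete data. The following sketch uses illustrative values; L, κ̄, λ_s, α, β_low, and β_high here are assumptions for the example, not values from the text.

```python
import math

# Illustrative values -- assumptions for this sketch, not from the text.
L, kappa_bar, lam_s = 10.0, 4.0, 1.0   # Lipschitz constant, bound on kappa(H_k), lambda_s
alpha, beta_low, beta_high = 1e-4, 0.1, 0.5

# (3.9): lower bound on any steplength accepted by the line search.
lam_bar = 2.0 * beta_low * lam_s * (1.0 - alpha) / (L * kappa_bar)

# (3.10): the least nonnegative integer m with beta_high**m below the
# termination threshold 2*lam_s*(1 - alpha)/(L*kappa_bar).
threshold = 2.0 * lam_s * (1.0 - alpha) / (L * kappa_bar)
m = 0
while beta_high**m >= threshold:
    m += 1

# Closed form of (3.10); agrees with the loop except at exact ties.
m_formula = math.ceil(math.log(threshold) / math.log(beta_high))
```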

The convergence theorem for Algorithm optarm says that if the condition numbers of the matrices H_k and the norms of the iterates remain bounded, then every limit point of the iteration is a stationary point. Boundedness of the sequence of iterates implies that there will be limit points, but there is no guarantee that there is a unique limit point.

Theorem 3.2.4. Let ∇f be Lipschitz continuous with Lipschitz constant L. Assume that the matrices H_k are spd and that there are κ̄ and λ̄_l such that κ(H_k) ≤ κ̄ and ‖H_k‖ ≤ λ̄_l for all k. Then either f(x_k) is unbounded from below or

(3.11)   lim_{k→∞} ∇f(x_k) = 0

and hence any limit point of the sequence of iterates produced by Algorithm optarm is a stationary point.

In particular, if f(x_k) is bounded from below and x_{k_l} → x^* is any convergent subsequence of {x_k}, then ∇f(x^*) = 0.

Proof. By construction, f(x_k) is a decreasing sequence. Therefore, if f(x_k) is bounded from below, then lim_{k→∞} f(x_k) = f^* exists and

(3.12)   lim_{k→∞} f(x_{k+1}) − f(x_k) = 0.

By (3.4) and Lemma 3.2.3 we have

    f(x_{k+1}) − f(x_k) < −αλ_k ∇f(x_k)^T H_k^{-1} ∇f(x_k)
                        ≤ −αλ̄ λ̄_l^{-1} ‖∇f(x_k)‖^2 ≤ 0.

Hence, by (3.12),

    ‖∇f(x_k)‖^2 ≤ (λ̄_l / (αλ̄)) (f(x_k) − f(x_{k+1})) → 0

as k → ∞. This completes the proof.

The analysis of the Armijo rule is valid for other line search methods [84], [125], [272], [273]. The key points are that the sufficient decrease condition can be satisfied in finitely many steps and that the stepsizes are bounded away from zero.
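Algorithm optarm itself is not reproduced in this excerpt. The following is a minimal sketch, in the spirit of that algorithm, of an Armijo backtracking line search with a fixed reduction factor (rather than the polynomial models of the next section); the function names and parameter values are illustrative assumptions.

```python
import numpy as np

def armijo_step(f, grad_f, x, d, alpha=1e-4, beta=0.5, max_reductions=50):
    """Backtrack lam = 1, beta, beta**2, ... until the sufficient
    decrease condition (3.4) holds:
        f(x + lam*d) <= f(x) + alpha * lam * grad_f(x)^T d."""
    fx = f(x)
    slope = grad_f(x) @ d          # must be negative for a descent direction
    lam = 1.0
    for _ in range(max_reductions):
        if f(x + lam * d) <= fx + alpha * lam * slope:
            return lam
        lam *= beta
    raise RuntimeError("line search failed")

def steepest_descent(f, grad_f, x0, tol=1e-6, max_iter=500):
    """Gradient iteration with H_k = I, so d_k = -grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:
            break
        lam = armijo_step(f, grad_f, x, -g)
        x = x - lam * g
    return x
```

For example, on the convex quadratic f(x) = x^T A x / 2 with A = diag(1, 10), the iteration drives the gradient norm below the tolerance.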

3.2.1 Stepsize Control with Polynomial Models

Having computed a descent direction d from x_c, one must decide on a stepsize reduction scheme for iterations in which (3.4) fails for λ = 1. A common approach [73], [84], [114], [197], [117] is to model

    ξ(λ) = f(x_c + λd)

by a cubic polynomial. The data on hand initially are

    ξ(0) = f(x_c),  ξ′(0) = ∇f(x_c)^T d < 0,  and  ξ(1) = f(x_c + d),

which is enough to form a quadratic model of ξ. So, if (3.4) does not hold with λ = λ_0 = 1, i.e.,

    ξ(1) = f(x_c + d) ≥ f(x_c) + α∇f(x_c)^T d = ξ(0) + αξ′(0),


we approximate ξ by the quadratic polynomial

    q(λ) = ξ(0) + ξ′(0)λ + (ξ(1) − ξ(0) − ξ′(0))λ^2

and let λ_1 be the minimum of q on the interval [β_low, β_high] ⊂ (0, 1). This minimum can be computed directly since α ∈ (0, 1) and failure of (3.4) imply

    q″(λ) = 2(ξ(1) − ξ(0) − ξ′(0)) > 2(α − 1)ξ′(0) > 0.

Therefore, the global minimum of q is

    λ_t = −ξ′(0) / (2(ξ(1) − ξ(0) − ξ′(0))).

So

(3.13)   λ_+ = { β_low,   λ_t ≤ β_low,
                 λ_t,     β_low < λ_t < β_high,
                 β_high,  λ_t ≥ β_high.
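The quadratic-model step together with the safeguard (3.13) is a few lines of code. This is a sketch; the function name and the choices β_low = 0.1, β_high = 0.5 are illustrative assumptions.

```python
def quadratic_step(xi0, dxi0, xi1, beta_low=0.1, beta_high=0.5):
    """Minimizer of q(lam) = xi0 + dxi0*lam + (xi1 - xi0 - dxi0)*lam**2,
    safeguarded onto [beta_low, beta_high] as in (3.13).

    xi0  = f(x_c)             = xi(0)
    dxi0 = grad f(x_c)^T d    = xi'(0), negative for a descent direction
    xi1  = f(x_c + d)         = xi(1)
    """
    curvature = xi1 - xi0 - dxi0   # half of q''(lam); positive when (3.4) fails
    lam_t = -dxi0 / (2.0 * curvature)
    # Safeguard (3.13): clip lam_t onto [beta_low, beta_high].
    return min(max(lam_t, beta_low), beta_high)
```

For instance, with ξ(0) = 0, ξ′(0) = −1, and ξ(1) = 1, the model minimizer is λ_t = 1/4, which lies inside the safeguard interval and is returned unchanged.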

If our first reduced value of λ does not satisfy (3.4), we base additional reductions on the data

    ξ(0) = f(x_c),  ξ′(0) = ∇f(x_c)^T d,  ξ(λ_−),  ξ(λ_c),

where λ_c < λ_− are the most recent values of λ to fail to satisfy (3.4). This is sufficient data to approximate ξ with a cubic polynomial

    q(λ) = ξ(0) + ξ′(0)λ + c_2 λ^2 + c_3 λ^3,

where c_2 and c_3 can be determined by the equations

    q(λ_c) = ξ(λ_c) = f(x_c + λ_c d),
    q(λ_−) = ξ(λ_−) = f(x_c + λ_− d),

which form the nonsingular linear system for c_2 and c_3

(3.14)   [ λ_c^2  λ_c^3 ] [ c_2 ]   [ ξ(λ_c) − ξ(0) − ξ′(0)λ_c ]
         [ λ_−^2  λ_−^3 ] [ c_3 ] = [ ξ(λ_−) − ξ(0) − ξ′(0)λ_− ].

As with the quadratic case, q has a local minimum [84] at

(3.15)   λ_t = (−c_2 + √(c_2^2 − 3c_3 ξ′(0))) / (3c_3).

With λ_t in hand, we compute λ_+ using (3.13). The application of (3.13) is called safeguarding and is important for the theory, as one can see from the proof of Theorem 3.2.4. Safeguarding is also important in practice because, if the cubic model is poor, the unsafeguarded model can make steplength reductions that are so small that the iteration can stagnate or so large (i.e., near 1) that too many reductions are needed before (3.4) holds.
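The cubic-model step can be sketched by solving the 2×2 system (3.14) by hand and applying (3.15) with the same safeguard. The function name and default safeguard bounds are illustrative assumptions, and the sketch assumes c_3 ≠ 0 (otherwise one would fall back to the quadratic step).

```python
import math

def cubic_step(xi0, dxi0, lam_c, xi_c, lam_m, xi_m, beta_low=0.1, beta_high=0.5):
    """Safeguarded minimizer of the cubic model
        q(lam) = xi0 + dxi0*lam + c2*lam**2 + c3*lam**3
    fit to xi(lam_c) = xi_c and xi(lam_m) = xi_m, where lam_c < lam_m are
    the two most recent steplengths that failed (3.4)."""
    # Right-hand side of the 2x2 system (3.14).
    r_c = xi_c - xi0 - dxi0 * lam_c
    r_m = xi_m - xi0 - dxi0 * lam_m
    # Solve [[lam_c^2, lam_c^3], [lam_m^2, lam_m^3]] [c2, c3]^T = [r_c, r_m]^T.
    det = lam_c**2 * lam_m**3 - lam_m**2 * lam_c**3
    c2 = (lam_m**3 * r_c - lam_c**3 * r_m) / det
    c3 = (lam_c**2 * r_m - lam_m**2 * r_c) / det
    # Local minimum (3.15), assuming c3 != 0.
    lam_t = (-c2 + math.sqrt(c2**2 - 3.0 * c3 * dxi0)) / (3.0 * c3)
    # Safeguard (3.13).
    return min(max(lam_t, beta_low), beta_high)
```

As a sanity check, fitting data sampled from the exact cubic ξ(λ) = −λ + 2λ^2 + λ^3 recovers c_2 = 2, c_3 = 1 and returns its interior minimizer (−2 + √7)/3 ≈ 0.215.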


3.2.2 Slow Convergence of Steepest Descent

Unfortunately, methods based on steepest descent do not enjoy good local convergence properties, even for very simple functions. To illustrate this point we consider the special case of convex quadratic objective functions

    f(x) = (1/2) x^T A x − b^T x + a,

where A is spd, b ∈ R^N, and a is a scalar. We will look at a very simple example, using the method of steepest descent with H_k = I (so λ_l = λ_s = 1) and show how the speed of convergence depends on conditioning and scaling.

Lemma 3.2.5. Let f be a convex quadratic and let H_k = I for all k. Then the sequence {x_k} generated by Algorithm optarm converges to the unique minimizer of f.

Proof. In exercise 3.5.4 you are asked to show that f is bounded from below and that ∇f(x) = Ax − b. Hence ∇f vanishes only at x^* = A^{-1}b. Since ∇^2 f(x) = A is spd, the second-order sufficient conditions hold and x^* is the unique minimizer of f.

Theorem 3.2.4 implies that

    lim_{k→∞} ∇f(x_k) = lim_{k→∞} (Ax_k − b) = lim_{k→∞} A(x_k − x^*) = 0,

and hence x_k → x^*.

Since the steepest descent iteration converges to the unique minimizer of a convex quadratic, we can investigate the rate of convergence without concern about the initial iterate. We do this in terms of the A-norm. The problems can be illustrated with the simplest case a = 0 and b = 0.

Proposition 3.2.6. Let f(x) = x^T Ax/2 and let {x_k} be given by Algorithm optarm with H_k = I for all k. Then the sequence {x_k} satisfies

(3.16)   ‖x_{k+1}‖_A = (1 − O(κ(A)^{-2})) ‖x_k‖_A.

Proof. The sufficient decrease condition, (3.4), implies that for all k

(3.17)   x_{k+1}^T A x_{k+1} − x_k^T A x_k = 2(f(x_{k+1}) − f(x_k))
                  ≤ 2α ∇f(x_k)^T (x_{k+1} − x_k)
                  = 2αλ_k ∇f(x_k)^T d = −2αλ_k (Ax_k)^T (Ax_k).

The Lipschitz constant of ∇f is simply λ_l = ‖A‖; hence we may write (3.9) as

(3.18)   λ_k ≥ λ̄ = 2β_low(1 − α) / (λ_l κ(A)).

In terms of the A-norm, (3.17) can be written as

    ‖x_{k+1}‖_A^2 − ‖x_k‖_A^2 ≤ −2αλ̄ λ_s ‖x_k‖_A^2,

where we use the fact that

    ‖Az‖^2 = (Az)^T (Az) ≥ λ_s z^T A z = λ_s ‖z‖_A^2.


Hence,

    ‖x_{k+1}‖_A^2 ≤ (1 − 2αλ̄λ_s) ‖x_k‖_A^2 ≤ (1 − 4α(1 − α)β_low κ(A)^{-2}) ‖x_k‖_A^2.

This completes the proof.
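The monotone decrease of the A-norm established in the proof is easy to observe numerically. The following hand-rolled backtracking loop is an illustration, not the book's Algorithm optarm; the matrix A = diag(1, 100), the starting point, and all parameters are assumed values.

```python
import numpy as np

# Ill-conditioned spd quadratic f(x) = x^T A x / 2, kappa(A) = 100 (assumed values).
A = np.diag([1.0, 100.0])
alpha, beta = 1e-4, 0.5

def f(v):
    return 0.5 * float(v @ A @ v)

x = np.array([1.0, 1.0])
norms = [float(np.sqrt(x @ A @ x))]      # ||x_k||_A along the iteration
for _ in range(100):
    g = A @ x                            # gradient; descent direction d = -g
    fx = f(x)
    lam = 1.0
    # Armijo backtracking: sufficient decrease (3.4) with d = -g.
    while f(x - lam * g) > fx - alpha * lam * float(g @ g):
        lam *= beta
    x = x - lam * g
    norms.append(float(np.sqrt(x @ A @ x)))

# Since ||x||_A^2 = 2 f(x) here, sufficient decrease forces the A-norm
# to shrink strictly at every step, as in the proof of Proposition 3.2.6.
assert all(b < a for a, b in zip(norms, norms[1:]))
```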

Now we consider two specific examples. Let N = 1 and define

    f(x) = ωx^2/2,

where

(3.19)   ω < 2(1 − α).

In this case x^* = 0. We have ∇f(x) = f′(x) = ωx and hence for all x ∈ R,

    f(x − ∇f(x)) − f(x) = (ωx^2/2)((1 − ω)^2 − 1)
                        = (ω^2 x^2/2)(ω − 2)
                        < −α|f′(x)|^2 = −αω^2 x^2

because (3.19) implies that

    ω − 2 < −2α.

Hence (3.4) holds with d = −∇f(x) and λ = 1 for all x ∈ R. The rate of convergence can be computed directly since

    x_+ = (1 − ω)x_c

for all x_c. The convergence is q-linear with q-factor 1 − ω. So if ω is very small, the convergence will be extremely slow.
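The q-linear rate x_+ = (1 − ω)x_c is easy to see numerically; the value ω = 0.01 below is an assumed illustrative choice, well inside (3.19) for small α.

```python
# Steepest descent on f(x) = omega*x**2/2 with lambda = 1:
# each step multiplies the error by the q-factor (1 - omega).
omega = 0.01      # illustrative value; satisfies (3.19) for small alpha
x = 1.0
for k in range(100):
    x -= omega * x    # x_+ = x - f'(x) = (1 - omega) * x

# After 100 steps the error has only shrunk by (1 - omega)**100, about 0.366.
```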

Similarly, if ω is large, we see that

    f(x − λ∇f(x)) − f(x) = (λω^2 x^2/2)(λω − 2) < −αλω^2 x^2

only if

    λ < 2(1 − α)/ω.

So

    β · 2(1 − α)/ω < β^m = λ < 2(1 − α)/ω.

If ω is very large, many steplength reductions will be required with each iteration and the line search will be very inefficient.
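To make the inefficiency concrete, one can count the reductions needed at x = 1 for a large ω; the values ω = 10^4 and β = 1/2 are assumed for illustration.

```python
# Count Armijo reductions for f(x) = omega*x**2/2 at x = 1 (illustrative values).
omega, alpha, beta = 1.0e4, 1e-4, 0.5
x = 1.0
fx = 0.5 * omega * x * x
g = omega * x                       # f'(x); search direction d = -g
lam, m = 1.0, 0
while 0.5 * omega * (x - lam * g)**2 > fx - alpha * lam * g * g:
    lam *= beta
    m += 1
# The loop stops at the first beta**m below 2*(1 - alpha)/omega,
# roughly 2e-4 here, so thirteen reductions are needed for one step.
```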

These are examples of poor scaling, where a change in f by a multiplicative factor can dramatically improve the efficiency of the line search or the convergence speed. In fact, if ω = 1, steepest descent and Newton's method are the same and only one iteration is required.

The case for a general convex quadratic is similar. Let λ_l and λ_s be the largest and smallest eigenvalues of the spd matrix A. We assume that b = 0 and a = 0 for this example. We let u_l and u_s be unit eigenvectors corresponding to the eigenvalues λ_l and λ_s. If

    λ_s < 2(1 − α)

is small and the initial iterate is in the direction of u_s, convergence will require a very large number of iterations (slow). If λ_l is large and the initial iterate is in the direction of u_l, the line search will be inefficient (many stepsize reductions at each iteration).


Newton's method does not suffer from poor scaling of f and converges rapidly with no need for a line search when the initial iterate is near the solution. However, when far away from the solution, the Newton direction may not be a descent direction at all and the line search may fail. Making the transition from steepest descent, which is a good algorithm when far from the solution, to Newton's or some other superlinearly convergent method as the iteration moves toward the solution is the central problem in the design of line search algorithms. The scaling problems discussed above must also be addressed, even when far from the solution.