λ_l^{-1} ‖z‖^2 ≤ z^T H^{-1} z ≤ λ_s^{-1} ‖z‖^2.



The first step in the analysis is to use Lemma 3.2.1 to obtain a lower bound for the steplength.



Lemma 3.2.2. Assume that ∇f is Lipschitz continuous with Lipschitz constant L. Let α ∈ (0, 1), x ∈ R^N, and H be an spd matrix. Let λ_s > 0 be the smallest and λ_l ≥ λ_s the largest eigenvalues of H. Let d be given by (3.3). Assume that ∇f(x) ≠ 0. Then (3.4) holds for any λ such that

(3.6)    0 < λ ≤ 2λ_s(1 − α) / (L κ(H)).

Proof. Let d = −H^{-1}∇f(x). By the fundamental theorem of calculus,

f(x + λd) − f(x) = ∫_0^1 ∇f(x + tλd)^T λd dt.

Hence

(3.7)    f(x + λd) = f(x) + λ∇f(x)^T d + λ ∫_0^1 (∇f(x + tλd) − ∇f(x))^T d dt.

Therefore, since ∇f is Lipschitz continuous with constant L,

f(x + λd) = f(x − λH^{-1}∇f(x)) ≤ f(x) + λ∇f(x)^T d + (λ^2 L/2) ‖d‖^2.
Positivity of H, Lemma 3.2.1, and κ(H) = λ_l λ_s^{-1} imply that

‖d‖^2 = ‖H^{-1}∇f(x)‖^2 ≤ λ_s^{-2} ∇f(x)^T ∇f(x) ≤ −λ_l λ_s^{-2} ∇f(x)^T d = −κ(H) λ_s^{-1} ∇f(x)^T d.

Hence

f(x + λd) ≤ f(x) + (λ − λ^2 L λ_s^{-1} κ(H)/2) ∇f(x)^T d,

which implies (3.4) if

α ≤ 1 − λ L λ_s^{-1} κ(H)/2.

This is equivalent to (3.6).
Lemma 3.2.3. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {x_k} be the iteration given by Algorithm optarm with spd matrices H_k that satisfy

(3.8)    κ(H_k) ≤ κ̄

for all k. Then the steps

s_k = x_{k+1} − x_k = λ_k d_k = −λ_k H_k^{-1} ∇f(x_k)

satisfy

(3.9)    λ_k ≥ λ̄ = 2 β_low λ_s (1 − α) / (L κ̄)

and at most

(3.10)    m = log( 2λ_s(1 − α) / (L κ̄) ) / log(β_high)

stepsize reductions will be required.
Proof. In the context of Algorithm optarm, Lemma 3.2.2 implies that the line search will terminate when

λ ≤ 2λ_s(1 − α) / (L κ(H_k)),



if not before. The most that one can overshoot this is by a factor of β_low, which proves (3.9). The line search will require at most m stepsize reductions, where m is the least nonnegative integer such that

2λ_s(1 − α) / (L κ(H_k)) > β_high^m.

This implies (3.10).
The convergence theorem for Algorithm optarm says that if the condition numbers of the matrices H_k and the norms of the iterates remain bounded, then every limit point of the iteration is a stationary point. Boundedness of the sequence of iterates implies that there will be limit points, but there is no guarantee that there is a unique limit point.
Theorem 3.2.4. Let ∇f be Lipschitz continuous with Lipschitz constant L. Assume that the matrices H_k are spd and that there are κ̄ and λ̄_l such that κ(H_k) ≤ κ̄ and ‖H_k‖ ≤ λ̄_l for all k. Then either f(x_k) is unbounded from below or

(3.11)    lim_{k→∞} ∇f(x_k) = 0

and hence any limit point of the sequence of iterates produced by Algorithm optarm is a stationary point.

In particular, if f(x_k) is bounded from below and x_{k_l} → x* is any convergent subsequence of {x_k}, then ∇f(x*) = 0.
Proof. By construction, f(x_k) is a decreasing sequence. Therefore, if f(x_k) is bounded from below, then lim_{k→∞} f(x_k) = f* exists and

(3.12)    lim_{k→∞} (f(x_{k+1}) − f(x_k)) = 0.

By (3.4) and Lemma 3.2.3 we have

f(x_{k+1}) − f(x_k) < −α λ_k ∇f(x_k)^T H_k^{-1} ∇f(x_k) ≤ −α λ̄ λ̄_l^{-1} ‖∇f(x_k)‖^2 ≤ 0.

Hence, by (3.12),

‖∇f(x_k)‖^2 ≤ λ̄_l (f(x_k) − f(x_{k+1})) / (α λ̄) → 0

as k → ∞. This completes the proof.
The analysis of the Armijo rule is valid for other line search methods [84], [125], [272], [273]. The key points are that the sufficient decrease condition can be satisfied in finitely many steps and that the stepsizes are bounded away from zero.
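To make these two properties concrete, the following Python sketch implements a backtracking search that enforces the sufficient decrease condition (3.4). It is only an illustration, not the text's Algorithm optarm: the names f, grad_f, and the fixed reduction factor beta are assumptions, whereas optarm allows any reduction factor in [β_low, β_high].

import numpy as np

def armijo_backtrack(f, grad_f, x, d, alpha=1e-4, beta=0.5, lam=1.0, max_reductions=50):
    # Backtracking search enforcing the sufficient decrease condition (3.4):
    #   f(x + lam*d) <= f(x) + alpha*lam*grad_f(x)^T d.
    # Sketch only: a fixed reduction factor beta is used in place of the
    # interval [beta_low, beta_high] allowed by Algorithm optarm.
    fx = f(x)
    slope = np.dot(grad_f(x), d)          # negative for a descent direction
    if slope >= 0:
        raise ValueError("d is not a descent direction")
    for _ in range(max_reductions):
        if f(x + lam * d) <= fx + alpha * lam * slope:   # condition (3.4)
            return lam
        lam *= beta                                      # stepsize reduction
    raise RuntimeError("too many stepsize reductions")

For the search directions d = −H^{-1}∇f(x) considered in the text, Lemma 3.2.2 guarantees that the loop terminates once λ ≤ 2λ_s(1 − α)/(Lκ(H)), and Lemma 3.2.3 turns that into the lower bound (3.9) on the accepted stepsize.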

3.2.1 Stepsize Control with Polynomial Models
Having computed a descent direction d from x_c, one must decide on a stepsize reduction scheme for iterations in which (3.4) fails for λ = 1. A common approach [73], [84], [114], [197], [117] is to model

ξ(λ) = f(x_c + λd)

by a cubic polynomial. The data on hand initially are

ξ(0) = f(x_c),   ξ'(0) = ∇f(x_c)^T d < 0,   and   ξ(1) = f(x_c + d),

which is enough to form a quadratic model of ξ. So, if (3.4) does not hold with λ = λ_0 = 1, i.e.,

ξ(1) = f(x_c + d) ≥ f(x_c) + α∇f(x_c)^T d = ξ(0) + αξ'(0),


we approximate ξ by the quadratic polynomial

q(λ) = ξ(0) + ξ'(0)λ + (ξ(1) − ξ(0) − ξ'(0))λ^2

and let λ_1 be the minimum of q on the interval [β_low, β_high] ⊂ (0, 1). This minimum can be computed directly since α ∈ (0, 1) and failure of (3.4) imply

q''(λ) = 2(ξ(1) − ξ(0) − ξ'(0)) > 2(α − 1)ξ'(0) > 0.

Therefore, the global minimum of q is

λ_t = −ξ'(0) / (2(ξ(1) − ξ(0) − ξ'(0))).

So

(3.13)    λ_+ =  β_low,   if λ_t ≤ β_low,
                 λ_t,     if β_low < λ_t < β_high,
                 β_high,  if λ_t ≥ β_high.
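A minimal Python sketch of this first safeguarded reduction, assuming ξ(0), ξ'(0), and ξ(1) have already been computed; the routine name and the default safeguard bounds are illustrative choices, not prescriptions from the text.

def quadratic_backtrack(xi0, dxi0, xi1, beta_low=0.1, beta_high=0.5):
    # xi0 = f(x_c), dxi0 = grad f(x_c)^T d < 0, xi1 = f(x_c + d).
    # Failure of (3.4) with alpha in (0, 1) gives xi1 - xi0 - dxi0 > 0, so the
    # quadratic model q is convex and lam_t below is its global minimizer.
    lam_t = -dxi0 / (2.0 * (xi1 - xi0 - dxi0))
    # Safeguard (3.13): restrict lambda_+ to [beta_low, beta_high].
    return min(max(lam_t, beta_low), beta_high)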

If our first reduced value of λ does not satisfy (3.4), we base additional reductions on the data

ξ(0) = f(x_c),   ξ'(0) = ∇f(x_c)^T d,   ξ(λ_−),   ξ(λ_c),

where λ_c < λ_− are the most recent values of λ to fail to satisfy (3.4). This is sufficient data to approximate ξ with a cubic polynomial

q(λ) = ξ(0) + ξ'(0)λ + c_2 λ^2 + c_3 λ^3,

where c_2 and c_3 can be determined by the equations

q(λ_c) = ξ(λ_c) = f(x_c + λ_c d),

q(λ_−) = ξ(λ_−) = f(x_c + λ_− d),

which form the nonsingular linear system for c_2 and c_3

(3.14)    ( λ_c^2   λ_c^3 ) ( c_2 )   =   ( ξ(λ_c) − ξ(0) − ξ'(0)λ_c )
          ( λ_−^2   λ_−^3 ) ( c_3 )       ( ξ(λ_−) − ξ(0) − ξ'(0)λ_− ).

As with the quadratic case, q has a local minimum [84] at

(3.15)    λ_t = ( −c_2 + √(c_2^2 − 3 c_3 ξ'(0)) ) / (3 c_3).

With λ_t in hand, we compute λ_+ using (3.13). The application of (3.13) is called safeguarding and is important for the theory, as one can see from the proof of Theorem 3.2.4. Safeguarding is also important in practice because, if the cubic model is poor, the unsafeguarded model can make steplength reductions that are so small that the iteration can stagnate or so large (i.e., near 1) that too many reductions are needed before (3.4) holds.
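The sketch below assembles (3.14), (3.15), and the safeguard into one reduction step. The names are illustrative, and the safeguard here is applied relative to the most recently rejected steplength λ_c (which reduces to (3.13) on the first reduction from λ_0 = 1); this is one common reading of the requirement that each reduction be by a factor in [β_low, β_high].

import numpy as np

def cubic_backtrack(xi0, dxi0, lam_c, xi_c, lam_minus, xi_minus,
                    beta_low=0.1, beta_high=0.5):
    # lam_c < lam_minus are the two most recent steplengths to fail (3.4),
    # with xi_c = f(x_c + lam_c*d) and xi_minus = f(x_c + lam_minus*d);
    # xi0 = f(x_c) and dxi0 = grad f(x_c)^T d.
    # Solve the 2x2 system (3.14) for the cubic coefficients c2, c3.
    M = np.array([[lam_c**2, lam_c**3],
                  [lam_minus**2, lam_minus**3]])
    rhs = np.array([xi_c - xi0 - dxi0 * lam_c,
                    xi_minus - xi0 - dxi0 * lam_minus])
    c2, c3 = np.linalg.solve(M, rhs)
    # Local minimizer (3.15) of q(lam) = xi0 + dxi0*lam + c2*lam^2 + c3*lam^3.
    # (A production code would guard against c3 near zero and a negative
    # discriminant; this sketch does not.)
    lam_t = (-c2 + np.sqrt(c2**2 - 3.0 * c3 * dxi0)) / (3.0 * c3)
    # Safeguard in the spirit of (3.13), taken relative to lam_c.
    return min(max(lam_t, beta_low * lam_c), beta_high * lam_c)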



3.2.2 Slow Convergence of Steepest Descent
Unfortunately, methods based on steepest descent do not enjoy good local convergence properties, even for very simple functions. To illustrate this point we consider the special case of convex quadratic objective functions

f(x) = (1/2) x^T A x − b^T x + a,

where A is spd, b ∈ R^N, and a is a scalar. We will look at a very simple example, using the method of steepest descent with H_k = I (so λ_l = λ_s = 1), and show how the speed of convergence depends on conditioning and scaling.
Lemma 3.2.5. Let f be a convex quadratic and let H_k = I for all k. Then the sequence {x_k} generated by Algorithm optarm converges to the unique minimizer of f.

Proof. In exercise 3.5.4 you are asked to show that f is bounded from below and that ∇f(x) = Ax − b. Hence ∇f(x) vanishes only at x* = A^{-1}b. Since ∇^2 f(x) = A is spd, the second-order sufficient conditions hold and x* is the unique minimizer of f.

Theorem 3.2.4 implies that

lim_{k→∞} ∇f(x_k) = lim_{k→∞} (Ax_k − b) = lim_{k→∞} A(x_k − x*) = 0,

and hence x_k → x*.
Since the steepest descent iteration converges to the unique minimizer of a convex quadratic, we can investigate the rate of convergence without concern about the initial iterate. We do this in terms of the A-norm. The problems can be illustrated with the simplest case a = 0 and b = 0.

Proposition 3.2.6. Let f(x) = x^T A x/2 and let {x_k} be given by Algorithm optarm with H_k = I for all k. Then the sequence {x_k} satisfies

(3.16)    ‖x_{k+1}‖_A = (1 − O(κ(A)^{−2})) ‖x_k‖_A.


Proof. The sufficient decrease condition, (3.4), implies that for all k

(3.17)    x_{k+1}^T A x_{k+1} − x_k^T A x_k = 2(f(x_{k+1}) − f(x_k))
                                            ≤ 2α ∇f(x_k)^T (x_{k+1} − x_k)
                                            = 2α λ_k ∇f(x_k)^T d = −2α λ_k (Ax_k)^T (Ax_k).

The Lipschitz constant of ∇f is simply λ_l = ‖A‖; hence we may write (3.9) as

(3.18)    λ_k ≥ λ̄ = 2 β_low (1 − α) / (λ_l κ(A)).

In terms of the A-norm, (3.17) can be written as

‖x_{k+1}‖_A^2 − ‖x_k‖_A^2 ≤ −2α λ̄ λ_s ‖x_k‖_A^2,

where we use the fact that

‖Az‖^2 = (Az)^T (Az) ≥ λ_s z^T A z = λ_s ‖z‖_A^2.





Hence,

‖x_{k+1}‖_A^2 ≤ (1 − 2α λ̄ λ_s) ‖x_k‖_A^2 ≤ (1 − 4α(1 − α) β_low κ(A)^{−2}) ‖x_k‖_A^2.

This completes the proof.
Now we consider two specific examples. Let N = 1 and define

f(x) = ωx^2/2,

where

(3.19)    ω < 2(1 − α).

In this case x* = 0. We have ∇f(x) = f'(x) = ωx and hence for all x ∈ R,

f(x − ∇f(x)) − f(x) = (ωx^2/2)((1 − ω)^2 − 1) = (ω^2 x^2/2)(ω − 2) < −α|f'(x)|^2 = −αω^2 x^2

because (3.19) implies that

ω − 2 < −2α.

Hence (3.4) holds with d = −∇f(x) and λ = 1 for all x ∈ R. The rate of convergence can be computed directly since

x_+ = (1 − ω)x_c

for all x_c. The convergence is q-linear with q-factor 1 − ω. So if ω is very small, the convergence will be extremely slow.
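A few lines of Python make the q-factor visible; the value of ω below is an illustrative choice satisfying (3.19).

omega = 0.01      # small omega: (3.4) accepts lambda = 1, q-factor 1 - omega = 0.99
x = 1.0
for k in range(5):
    x = (1.0 - omega) * x      # steepest descent step x_+ = (1 - omega) * x_c
    print(k, x)                # |x| shrinks by only 1% per iteration

After 100 such iterations the error has only been reduced to roughly 0.99^100 ≈ 0.37 of its initial value.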
Similarly, if ω is large, we see that

f(x − λ∇f(x)) − f(x) = (λω^2 x^2/2)(λω − 2) < −αλω^2 x^2

only if

λ < 2(1 − α)/ω.

So

β · 2(1 − α)/ω < β^m = λ < 2(1 − α)/ω.

If ω is very large, many steplength reductions will be required with each iteration and the line search will be very inefficient.
These are examples of poor scaling, where a change in f by a multiplicative factor can dramatically improve the efficiency of the line search or the convergence speed. In fact, if ω = 1, steepest descent and Newton's method are the same and only one iteration is required.

The case for a general convex quadratic is similar. Let λ_l and λ_s be the largest and smallest eigenvalues of the spd matrix A. We assume that b = 0 and a = 0 for this example. We let u_l and u_s be unit eigenvectors corresponding to the eigenvalues λ_l and λ_s. If

λ_s < 2(1 − α)

is small and the initial iterate is in the direction of u_s, convergence will require a very large number of iterations (slow). If λ_l is large and the initial iterate is in the direction of u_l, the line search will be inefficient (many stepsize reductions at each iteration).
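The following self-contained sketch illustrates both effects on a diagonal quadratic; the eigenvalues, the Armijo parameter, and the fixed reduction factor are illustrative assumptions rather than values from the text.

import numpy as np

# f(x) = x^T A x / 2 with A = diag(lambda_s, lambda_l), steepest descent (H = I).
lambda_s, lambda_l = 1e-2, 1e2
A = np.diag([lambda_s, lambda_l])
alpha, beta = 1e-4, 0.5

def sd_step(x):
    # One steepest descent step with a fixed-factor backtracking line search.
    # Returns the new iterate and the number of stepsize reductions needed.
    g = A @ x
    d = -g
    fx = 0.5 * x @ A @ x
    lam, reductions = 1.0, 0
    while 0.5 * (x + lam * d) @ A @ (x + lam * d) > fx + alpha * lam * (g @ d):
        lam *= beta
        reductions += 1
    return x + lam * d, reductions

# Start along u_s: lambda = 1 is accepted, but the error contracts only by
# the factor 1 - lambda_s per step (slow convergence).
print(sd_step(np.array([1.0, 0.0])))

# Start along u_l: several reductions are needed before (3.4) holds.
print(sd_step(np.array([0.0, 1.0])))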



Newton's method does not suffer from poor scaling of f and converges rapidly with no need for a line search when the initial iterate is near the solution. However, when far away from the solution, the Newton direction may not be a descent direction at all and the line search may fail. Making the transition from steepest descent, which is a good algorithm when far from the solution, to Newton's or some other superlinearly convergent method as the iteration moves toward the solution, is the central problem in the design of line search algorithms. The scaling problems discussed above must also be addressed, even when far from the solution.
