While some of the methods, such as the Nelder–Mead [204] and Hooke–Jeeves [145] algorithms, are
classic, most of the convergence analysis in this part of the book was done after 1990.
The algorithms and theoretical results that we present in this part of the book are for objective
functions that are perturbations of simple, smooth functions. The surfaces in Figure 6.1 illustrate
this problem. The optimization landscape on the left of Figure 6.1, taken from [271], arose in
a problem in semiconductor design. The landscape on the right is a simple perturbation of a
convex quadratic.

Figure 6.1: Optimization Landscapes

We do not discuss algorithms that explicitly smooth the objective function or apply a filter,
such as the ones in [168] and [187]. For general problems, these must sample the variable
space in some way, for example by performing high-dimensional integration, and are too costly.
However, in some special cases these integrals can be performed analytically and impressive
results for special-purpose filtering algorithms for computational chemistry have been reported
in, for example, [196] and [277]. Nor do we discuss analog methods (see [149] for a well-known
example).
We also omit stochastic methods like the special-purpose methods discussed in [38] and [39],
or more radical general-purpose global optimization algorithms, such as simulated annealing
[166] (see [1] and [265] for surveys of recent work), interval methods [152], or genetic algorithms
[143], [144] (see [246] or [123] for a survey), which are random to some extent or are outright
random search algorithms. These probabilistic methods, however, should be considered when the
more conservative algorithms, such as the ones in this part of the book, fail.


6.1 Problem Statement
Consider an objective function $f$ that is a perturbation of a smooth function $f_s$ by a small
function $\phi$,
\[
  f(x) = f_s(x) + \phi(x). \tag{6.1}
\]
Small oscillations in $\phi$ could cause $f$ to have several local minima that would trap any conventional
gradient-based algorithm. The perturbation $\phi$ can, in general, be random or based on the
output of an experiment [250], and may not return the same value when called twice with the
same argument. Hence $\phi$ need not even be a function. We assume that $\phi$ is everywhere defined
and bounded to make the statement of the results simpler.
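The sketch below is one concrete, entirely made-up instance of (6.1): a convex quadratic smooth part plus a small, bounded, nondeterministic perturbation. Nothing about the particular choices of $f_s$ or $\phi$ comes from the text; they only serve as a test problem for the code fragments later in this section.

```python
import numpy as np

# A hypothetical objective of the form (6.1): smooth part f_s plus a small
# bounded perturbation phi.  The perturbation is deliberately allowed to be
# random, so f may not return the same value twice for the same argument.
rng = np.random.default_rng()

def f_s(x):
    return 0.5 * np.dot(x, x)        # smooth part, unique critical point x* = 0

def phi(x, amplitude=1e-3):
    return amplitude * rng.uniform(-1.0, 1.0)   # bounded "noise", |phi| <= amplitude

def f(x):
    return f_s(x) + phi(x)           # the objective that is actually sampled
```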


6.2 The Simplex Gradient
Most of the algorithms in this part of the book examine a simplex of points in $\mathbb{R}^N$ at each
iteration and then change the simplex in response. In this section we develop the tools needed to
describe and analyze these algorithms. The fundamental idea is that many sampling algorithms
require enough information to approximate the gradient by differences and that the accuracy in
that difference approximation can be used to analyze the convergence. However, for problems
of the form (6.1), one must take care not to make the difference increments so small as to attempt
to differentiate the noise.
The ideas in this section were originally used in [155] to analyze the Nelder–Mead [204]
algorithm, which we discuss in §8.1. However, the ideas can be applied to several classes of
algorithms, and we follow the development in [29] in this section.
Definition 6.2.1. A simplex $S$ in $\mathbb{R}^N$ is the convex hull of $N+1$ points $\{x_j\}_{j=1}^{N+1}$; $x_j$ is
the $j$th vertex of $S$. We let $V$ (or $V(S)$) denote the $N \times N$ matrix of simplex directions
\[
  V(S) = (x_2 - x_1, x_3 - x_1, \dots, x_{N+1} - x_1) = (v_1, \dots, v_N).
\]
We say $S$ is nonsingular if $V$ is nonsingular. The simplex diameter $\operatorname{diam}(S)$ is
\[
  \operatorname{diam}(S) = \max_{1 \le i, j \le N+1} \|x_i - x_j\|.
\]

We will refer to the $l^2$ condition number $\kappa(V)$ of $V$ as the simplex condition.
We let $\delta(f : S)$ denote the vector of objective function differences
\[
  \delta(f : S) = (f(x_2) - f(x_1), f(x_3) - f(x_1), \dots, f(x_{N+1}) - f(x_1))^T.
\]
We will not use the simplex diameter directly in our estimates or algorithms. Rather we will use
two oriented lengths
\[
  \sigma_+(S) = \max_{2 \le j \le N+1} \|x_1 - x_j\| \quad \text{and} \quad
  \sigma_-(S) = \min_{2 \le j \le N+1} \|x_1 - x_j\|.
\]
Clearly,
\[
  \sigma_+(S) \le \operatorname{diam}(S) \le 2\sigma_+(S).
\]
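The following sketch collects these quantities for a simplex stored as an $(N+1) \times N$ array whose rows are the vertices $x_1, \dots, x_{N+1}$. It is illustrative code only, not part of the text.

```python
import numpy as np

def simplex_directions(X):
    """V(S) = (x_2 - x_1, ..., x_{N+1} - x_1), the N x N matrix with columns v_j."""
    return (X[1:] - X[0]).T

def objective_differences(f, X):
    """delta(f : S) = (f(x_2) - f(x_1), ..., f(x_{N+1}) - f(x_1))^T."""
    return np.array([f(x) - f(X[0]) for x in X[1:]])

def oriented_lengths(X):
    """sigma_+(S) and sigma_-(S): largest and smallest edge length from x_1."""
    d = np.linalg.norm(X[1:] - X[0], axis=1)
    return d.max(), d.min()

def simplex_condition(X):
    """l2 condition number kappa(V) of the matrix of simplex directions."""
    return np.linalg.cond(simplex_directions(X))
```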




6.2.1 Forward Difference Simplex Gradient

Definition 6.2.2. Let $S$ be a nonsingular simplex with vertices $\{x_j\}_{j=1}^{N+1}$. The simplex gradient
$D(f : S)$ is
\[
  D(f : S) = V^{-T}\delta(f : S).
\]


Note that the matrix of simplex directions and the vector of objective function differences
depend on which of the vertices is labeled x1 . Most of the algorithms we consider in this part
of the book use a vertex ordering or sample on a regular stencil. In this way the algorithms, in
one way or another, use a simplex gradient.
This definition of simplex gradient is motivated by the first-order estimate in Lemma 6.2.1.
Lemma 6.2.1. Let $S$ be a simplex. Let $\nabla f$ be Lipschitz continuous in a neighborhood of $S$
with Lipschitz constant $2K_f$. Then there is $K > 0$, depending only on $K_f$, such that
\[
  \|\nabla f(x_1) - D(f : S)\| \le K\kappa(V)\,\sigma_+(S). \tag{6.2}
\]
Proof. Our smoothness assumptions on $f$ and Taylor's theorem imply that for all $2 \le j \le N+1$,
\[
  |f(x_1) - f(x_j) + v_j^T\nabla f(x_1)| \le K_f\|v_j\|^2 \le K_f\,\sigma_+(S)^2.
\]
Hence
\[
  \|\delta(f : S) - V^T\nabla f(x_1)\| \le N^{1/2}K_f\,\sigma_+(S)^2
\]
and hence, setting $K = N^{1/2}K_f$,
\[
  \|\nabla f(x_1) - D(f : S)\| \le K\|V^{-T}\|\,\sigma_+(S)^2.
\]
The conclusion follows from the fact that $\sigma_+(S) \le \|V\|$.
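Below is a small sketch of Definition 6.2.2 together with a numerical check of the first-order estimate (6.2). The test function and the right-angled simplices are made up for illustration, and the simplex directions and differences are rebuilt inline so the fragment runs on its own; with $\kappa(V) = 1$ the error should shrink like $\sigma_+(S)$.

```python
import numpy as np

def simplex_gradient(f, X):
    """D(f : S) = V^{-T} delta(f : S) for vertices stored as the rows of X."""
    V = (X[1:] - X[0]).T                               # simplex directions
    delta = np.array([f(x) - f(X[0]) for x in X[1:]])  # delta(f : S)
    return np.linalg.solve(V.T, delta)                 # solves V^T y = delta

if __name__ == "__main__":
    f = lambda x: np.sin(x[0]) + x[1] ** 2             # smooth test function
    grad = lambda x: np.array([np.cos(x[0]), 2.0 * x[1]])
    x1 = np.array([0.3, -0.7])
    for h in [1e-1, 1e-2, 1e-3]:
        X = np.vstack([x1, x1 + h * np.eye(2)])        # sigma_+(S) = h, kappa(V) = 1
        err = np.linalg.norm(grad(x1) - simplex_gradient(f, X))
        print(f"sigma_+ = {h:.0e}   error = {err:.2e}")   # error = O(sigma_+)
```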
Search algorithms are not intended, of course, for smooth problems. Minimization of objective
functions of the form in (6.1) is one of the applications of these methods. A first-order
estimate that takes perturbations into account is our next result.
We will need to measure the perturbations on each simplex. To that end we define for any
set $T$
\[
  \|\phi\|_T = \sup_{x \in T} |\phi(x)|.
\]

A first-order estimate also holds for the simplex gradient of an objective function that satisfies
(6.1).
Lemma 6.2.2. Let $S$ be a nonsingular simplex. Let $f$ satisfy (6.1) and let $\nabla f_s$ be Lipschitz
continuous in a neighborhood of $S$ with Lipschitz constant $2K_s$. Then there is $K > 0$, depending
only on $K_s$, such that
\[
  \|\nabla f_s(x_1) - D(f : S)\| \le K\kappa(V)\left(\sigma_+(S) + \frac{\|\phi\|_S}{\sigma_+(S)}\right). \tag{6.3}
\]
Proof. Lemma 6.2.1 (applied to $f_s$) implies
\[
  \|\nabla f_s(x_1) - D(f_s : S)\| \le K_s N^{1/2}\kappa(V)\,\sigma_+(S).
\]
Now, since $\|\delta(\phi : S)\| \le 2N^{1/2}\|\phi\|_S$ and $\sigma_+(S) \le \|V\|$,
\[
  \|D(f : S) - D(f_s : S)\| = \|V^{-T}\bigl(\delta(f : S) - \delta(f_s : S)\bigr)\| = \|V^{-T}\delta(\phi : S)\|
  \le 2N^{1/2}\|V^{-T}\|\,\|\phi\|_S \le 2N^{1/2}\kappa(V)\,\frac{\|\phi\|_S}{\sigma_+(S)}.
\]
This completes the proof with $K = N^{1/2}K_s + 2N^{1/2}$.
The constants $K$ in (6.2) and (6.3) depend on $S$ only through the Lipschitz constants of $\nabla f$
and $\nabla f_s$ in a neighborhood of $S$. We will express that dependence as $K = K(S)$ when needed.
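One remark, not part of the lemma but worth making explicit: if one treats $\|\phi\|_S$ as fixed (a reasonable model when the noise amplitude does not shrink with the simplex), the bracketed quantity in (6.3) is minimized by balancing its two terms,
\[
  \min_{\sigma > 0}\left(\sigma + \frac{\|\phi\|_S}{\sigma}\right) = 2\sqrt{\|\phi\|_S},
  \qquad \text{attained at } \sigma = \sqrt{\|\phi\|_S},
\]
so the forward simplex gradient cannot be expected to approximate $\nabla f_s(x_1)$ to better than $O(\kappa(V)\|\phi\|_S^{1/2})$, and making $\sigma_+(S)$ much smaller than $\|\phi\|_S^{1/2}$ only amplifies the noise term. This is the precise sense in which one should not try to differentiate the noise.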
The algorithms in this section are most profitably applied to problems of the form (6.1), and
the goal is to extract as much information as possible from the smooth part fs of f without
wasting effort in a futile attempt to minimize the noise. In order to formulate our goal for
convergence clearly, we explore the consequences of a small simplex gradient in the special (and
not uncommon) case that the amplitude of the noise is small in Lemma 6.2.3.
Lemma 6.2.3. Let $f$ satisfy (6.1) and let $\nabla f_s$ be continuously differentiable in a compact set
$\Omega \subset \mathbb{R}^N$. Assume that $f_s$ has a unique critical point $x^*$ in $\Omega$. Then there is $K_\Omega > 0$ such that
for any simplex $S \subset \Omega$ with vertices $\{x_j\}_{j=1}^{N+1}$,
\[
  \|x_1 - x^*\| \le K_\Omega\left(\|D(f : S)\| + \kappa(V)\left(\sigma_+(S) + \frac{\|\phi\|_S}{\sigma_+(S)}\right)\right).
\]
Proof. The compactness of $\Omega$ and our smoothness assumptions on $f_s$ imply that there is $\beta_0 > 0$
such that
\[
  \|\nabla f_s(x)\| \ge \beta_0\|x - x^*\|
\]
for all $x \in \Omega$. We apply (6.3) to obtain
\[
  \|x_1 - x^*\| \le \beta_0^{-1}\|\nabla f_s(x_1)\|
  \le \beta_0^{-1}\left(\|D(f : S)\| + K\kappa(V)\left(\sigma_+(S) + \frac{\|\phi\|_S}{\sigma_+(S)}\right)\right).
\]
This completes the proof with $K_\Omega = \beta_0^{-1}\max(1, K)$.
By sampling in an organized way, simplex-based algorithms attempt, some directly and some implicitly,
to drive the simplex gradient to a small value by changing the size of the simplices
over which f is sampled. The motion of the simplices and the scheme for changing the size
(especially the reduction in size) account for the differences in the algorithms. Theorem 6.2.4,
a direct consequence of Lemma 6.2.3, quantifies this. We will consider a sequence of uniformly
well-conditioned simplices. Such simplices are generated by several of the algorithms we will
study later.
Theorem 6.2.4. Let $f$ satisfy (6.1) and let $\nabla f_s$ be continuously differentiable in a compact
set $\Omega \subset \mathbb{R}^N$. Assume that $f_s$ has a unique critical point $x^*$ in $\Omega$. Let $S^k$ be a sequence of
simplices having vertices $\{x_j^k\}_{j=1}^{N+1}$. Assume that there is $M$ such that
\[
  S^k \subset \Omega \ \text{and} \ \kappa(V(S^k)) \le M \ \text{for all } k.
\]
Then,

1. if
\[
  \lim_{k\to\infty}\sigma_+(S^k) = 0, \qquad \lim_{k\to\infty}\frac{\|\phi\|_{S^k}}{\sigma_+(S^k)} = 0,
\]
and $\limsup_{k\to\infty}\|D(f : S^k)\| = \epsilon$ for some $\epsilon > 0$, then there is $K_S > 0$ such that
\[
  \limsup_{k\to\infty}\|x^* - x_1^k\| \le K_S\,\epsilon;
\]

2. if, for some $\epsilon > 0$,
\[
  \limsup_{k\to\infty}\|\phi\|_{S^k} \le \epsilon^2, \qquad \liminf_{k\to\infty}\sigma_+(S^k) \ge \epsilon,
  \qquad \text{and} \qquad \liminf_{k\to\infty}\|D(f : S^k)\| \le \epsilon,
\]
then there is $K_S > 0$ such that
\[
  \limsup_{k\to\infty}\|x^* - x_1^k\| \le K_S\Bigl(\epsilon + \limsup_{k\to\infty}\sigma_+(S^k)\Bigr).
\]
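The following sketch, which is not from the text, illustrates the behavior Theorem 6.2.4 describes on a hypothetical noisy quadratic of the form (6.1): as right-angled simplices of size $h$ shrink around a fixed point, the forward simplex gradient tracks $\nabla f_s$ until $h$ falls below roughly $\|\phi\|_S^{1/2}$, after which the noise term in (6.3) takes over. The simplex gradient routine is repeated here so the fragment runs on its own.

```python
import numpy as np

# Hypothetical experiment in the spirit of Theorem 6.2.4: sample a noisy
# quadratic f = f_s + phi on shrinking right simplices at a fixed point x1
# and watch the error in the forward simplex gradient.
def simplex_gradient(f, X):
    V = (X[1:] - X[0]).T
    delta = np.array([f(x) - f(X[0]) for x in X[1:]])
    return np.linalg.solve(V.T, delta)

rng = np.random.default_rng(0)
amp = 1e-6                                           # noise amplitude, ||phi||_S <= amp
f_s = lambda x: 0.5 * np.dot(x, x)                   # grad f_s(x) = x
f = lambda x: f_s(x) + amp * rng.uniform(-1.0, 1.0)  # nondeterministic noise

x1 = np.array([0.2, -0.1])
for h in [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]:
    X = np.vstack([x1, x1 + h * np.eye(2)])          # sigma_+(S) = h, kappa(V) = 1
    err = np.linalg.norm(x1 - simplex_gradient(f, X))
    print(f"sigma_+ = {h:.0e}   ||grad f_s(x1) - D(f:S)|| = {err:.2e}")
# The error is typically smallest for h near sqrt(amp) = 1e-3 and grows for
# smaller h, consistent with the sigma_+ + ||phi||_S/sigma_+ term in (6.3).
```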




6.2.2 Centered Difference Simplex Gradient
In this section we define the centered difference simplex gradient and prove a second-order
estimate. We will then prove two variants of Theorem 6.2.4, one to show how the role of the
noise φ differs from that in the one-sided derivative case and a second to quantify how the values
of f on the stencil can be used to terminate an iteration.
Definition 6.2.3. Let $S$ be a nonsingular simplex in $\mathbb{R}^N$ with vertices $\{x_j\}_{j=1}^{N+1}$ and simplex
directions $v_j = x_{j+1} - x_1$. The reflected simplex $R = R(S)$ is the simplex with vertices $x_1$ and
\[
  r_j = x_1 - v_j \quad \text{for } j = 1, \dots, N.
\]
The central simplex gradient $D_C(f : S)$ is
\[
  D_C(f : S) = \frac{D(f : S) + D(f : R)}{2} = \frac{V^{-T}\bigl(\delta(f : S) - \delta(f : R)\bigr)}{2}.
\]

For example, if $N = 1$ and $x_2 = x_1 + h$, then $r_1 = x_1 - h$. Hence
\[
  D(f : S) = \frac{f(x_1 + h) - f(x_1)}{h} \quad \text{and} \quad
  D(f : R) = \frac{f(x_1 - h) - f(x_1)}{-h}.
\]
Therefore,
\[
  D_C(f : S) = D_C(f : R) = \frac{f(x_1 + h) - f(x_1 - h)}{2h}
\]
is the usual central difference.
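A sketch of the central simplex gradient as just defined; the reflected simplex is built explicitly and the two one-sided simplex gradients are combined. The test function and step size are made up, and the final line simply checks the $N = 1$ reduction to the usual central difference.

```python
import numpy as np

def central_simplex_gradient(f, X):
    """D_C(f : S) = V^{-T} (delta(f : S) - delta(f : R)) / 2 for vertices in rows of X."""
    x1 = X[0]
    V = (X[1:] - x1).T                          # simplex directions v_j
    R = np.vstack([x1, x1 - (X[1:] - x1)])      # reflected simplex R(S): x1 and r_j = x1 - v_j
    delta_S = np.array([f(x) - f(x1) for x in X[1:]])
    delta_R = np.array([f(r) - f(x1) for r in R[1:]])
    return np.linalg.solve(V.T, delta_S - delta_R) / 2.0

if __name__ == "__main__":
    # With N = 1 and x_2 = x_1 + h this is (f(x_1 + h) - f(x_1 - h)) / (2h).
    f = lambda x: np.sin(x[0])
    x1, h = np.array([0.3]), 1e-3
    X = np.vstack([x1, x1 + h])
    print(central_simplex_gradient(f, X)[0], np.cos(0.3))   # second-order accurate
```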
Lemmas 6.2.5 and 6.2.6 are the second-order analogues of Lemmas 6.2.1 and 6.2.2.
Lemma 6.2.5. Let $S$ be a nonsingular simplex and let $\nabla^2 f$ be Lipschitz continuous in a
neighborhood of $S \cup R(S)$ with Lipschitz constant $3K_C$. Then there is $K > 0$ such that
\[
  \|\nabla f(x_1) - D_C(f : S)\| \le K\kappa(V)\,\sigma_+(S)^2. \tag{6.4}
\]
Proof. The Lipschitz continuity assumption implies that for all $2 \le j \le N+1$,
\[
  |f(x_j) - f(r_j) - 2\nabla f(x_1)^T v_j| \le K_C\|v_j\|^3 \le K_C\,\sigma_+(S)^3.
\]
As in the proof of Lemma 6.2.1 we have
\[
  \|\delta(f : S) - \delta(f : R) - 2V^T\nabla f(x_1)\| \le N^{1/2}K_C\,\sigma_+(S)^3,
\]
and hence the result follows with $K = N^{1/2}K_C$.
Lemma 6.2.6. Let $S$ be a nonsingular simplex. Let $f$ satisfy (6.1) and let $\nabla^2 f_s$ be Lipschitz
continuous in a neighborhood of $S \cup R(S)$ with Lipschitz constant $3K_{Cs}$. Then there is $K > 0$,
depending only on $K_{Cs}$, such that
\[
  \|\nabla f_s(x_1) - D_C(f : S)\| \le K\kappa(V)\left(\sigma_+(S)^2 + \frac{\|\phi\|_S}{\sigma_+(S)}\right). \tag{6.5}
\]
Proof. This proof is very similar to that of Lemma 6.2.2 and is left to the reader.
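As with (6.3), a short calculation (not part of the text) shows what accuracy (6.5) permits when $\|\phi\|_S$ is treated as fixed: the bracketed term is minimized at $\sigma_+(S) = (\|\phi\|_S/2)^{1/3}$, where
\[
  \min_{\sigma > 0}\left(\sigma^2 + \frac{\|\phi\|_S}{\sigma}\right) = 3\left(\frac{\|\phi\|_S}{2}\right)^{2/3},
\]
so the attainable accuracy improves from $O(\kappa(V)\|\phi\|_S^{1/2})$ for the forward simplex gradient to $O(\kappa(V)\|\phi\|_S^{2/3})$ for the central one.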
The quality of the information that can be obtained from the central simplex gradient is
higher than that obtained from the forward difference simplex gradient. The difference in practice can be dramatic, as the examples
in §7.6 illustrate. The consequences of a small central simplex gradient follow directly from
Lemma 6.2.6.
Lemma 6.2.7. Let $f$ satisfy (6.1) and let $\nabla^2 f_s$ be continuously differentiable in a compact
set $\Omega \subset \mathbb{R}^N$. Assume that $f_s$ has a unique critical point $x^*$ in $\Omega$. Then there is $K_\Omega > 0$ such
that if a simplex $S$ and its reflection $R(S)$ are both contained in $\Omega$, then
\[
  \|x_1 - x^*\| \le K_\Omega\left(\|D_C(f : S)\| + \kappa(V)\left(\sigma_+(S)^2 + \frac{\|\phi\|_S}{\sigma_+(S)}\right)\right).
\]

Lemma 6.2.7 is all one needs to conclude convergence from a sequence of small central
simplex gradients.
Theorem 6.2.8. Let $f$ satisfy (6.1) and let $\nabla^2 f_s$ be continuously differentiable in a compact
set $\Omega \subset \mathbb{R}^N$. Assume that $f_s$ has a unique critical point $x^*$ in $\Omega$. Let $S^k$ be a sequence of
simplices having vertices $\{x_j^k\}_{j=1}^{N+1}$. Assume that there is $M$ such that
\[
  S^k, R(S^k) \subset \Omega \ \text{and} \ \kappa(V(S^k)) \le M \ \text{for all } k.
\]
Then,

1. if
\[
  \lim_{k\to\infty}\sigma_+(S^k) = 0, \qquad \lim_{k\to\infty}\frac{\|\phi\|_{S^k}}{\sigma_+(S^k)} = 0,
\]
and $\limsup_{k\to\infty}\|D_C(f : S^k)\| = \epsilon$ for some $\epsilon > 0$, then there is $K_S > 0$ such that
\[
  \limsup_{k\to\infty}\|x^* - x_1^k\| \le K_S\,\epsilon;
\]

2. if, for some $\epsilon > 0$,
\[
  \limsup_{k\to\infty}\|\phi\|_{S^k} \le \epsilon^3, \qquad \liminf_{k\to\infty}\sigma_+(S^k) \ge \epsilon,
  \qquad \text{and} \qquad \liminf_{k\to\infty}\|D_C(f : S^k)\| \le \epsilon^2,
\]
then there is $K_S > 0$ such that
\[
  \limsup_{k\to\infty}\|x^* - x_1^k\| \le K_S\Bigl(\epsilon + \limsup_{k\to\infty}\sigma_+(S^k)\Bigr)^2.
\]



