classic, most of the convergence analysis in this part of the book was done after 1990.

The algorithms and theoretical results that we present in this part of the book are for objective

functions that are perturbations of simple, smooth functions. The surfaces in Figure 6.1 illustrate

this problem. The optimization landscape on the left of Figure 6.1, taken from [271], arose in

a problem in semiconductor design. The landscape on the right is a simple perturbation of a

convex quadratic.


Figure 6.1: Optimization Landscapes

We do not discuss algorithms that explicitly smooth the objective function or apply a filter, such as the ones in [168] and [187]. For general problems, these must sample the variable space in some way, for example by performing high-dimensional integration, and are too costly. However, in some special cases these integrals can be performed analytically, and impressive results for special-purpose filtering algorithms in computational chemistry have been reported in, for example, [196] and [277]. Nor do we discuss analog methods (see [149] for a well-known example).

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

We also omit stochastic methods, such as the special-purpose methods discussed in [38] and [39], and more radical general-purpose global optimization algorithms that are random, or are random-search methods, to some extent: simulated annealing [166] (see [1] and [265] for surveys of recent work), interval methods [152], and genetic algorithms [143], [144] (see [246] or [123] for a survey). These probabilistic methods should be considered, however, when the more conservative algorithms such as the ones in this part of the book fail.

6.1 Problem Statement

Consider an objective function f that is a perturbation of a smooth function f_s by a small function φ:

    f(x) = f_s(x) + φ(x).    (6.1)

Small oscillations in φ could cause f to have several local minima that would trap any conventional gradient-based algorithm. The perturbation φ can, in general, be random or based on the output of an experiment [250], and may not return the same value when called twice with the same argument; hence φ need not even be a function. We assume that φ is everywhere defined and bounded to make the statement of the results simpler.
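As a concrete, purely illustrative instance of (6.1), the sketch below builds a perturbed convex quadratic in the spirit of the right panel of Figure 6.1. The noise amplitude and oscillation frequency are our own assumptions, not values from the text:

```python
import numpy as np

EPS = 1e-3  # assumed noise amplitude; not a value from the text

def f_s(x):
    """Smooth part: a convex quadratic with minimizer x* = 0."""
    return 0.5 * np.dot(x, x)

def phi(x):
    """Bounded, rapidly oscillating perturbation, |phi(x)| <= EPS."""
    return EPS * np.sin(50.0 * np.sum(x))

def f(x):
    """Objective of the form (6.1): f = f_s + phi."""
    return f_s(x) + phi(x)
```

The oscillation in φ creates many shallow local minima near the minimizer of f_s, which is exactly what defeats a gradient-based method applied naively.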

6.2 The Simplex Gradient

Most of the algorithms in this part of the book examine a simplex of points in R^N at each iteration and then change the simplex in response. In this section we develop the tools needed to describe and analyze these algorithms. The fundamental idea is that many sampling algorithms require enough information to approximate the gradient by differences, and that the accuracy of that difference approximation can be used to analyze the convergence. However, for problems of the form (6.1), one must take care not to make the difference increments so small as to attempt to differentiate the noise.

The ideas in this section were originally used in [155] to analyze the Nelder–Mead [204] algorithm, which we discuss in §8.1. However, the ideas can be applied to several classes of algorithms, and we follow the development in [29] in this section.

Definition 6.2.1. A simplex S in R^N is the convex hull of N + 1 points {x_j}_{j=1}^{N+1}; x_j is the jth vertex of S. We let V (or V(S)) denote the N × N matrix of simplex directions

    V(S) = (x_2 − x_1, x_3 − x_1, . . . , x_{N+1} − x_1) = (v_1, . . . , v_N).

We say S is nonsingular if V is nonsingular. The simplex diameter diam(S) is

    diam(S) = max_{1 ≤ i,j ≤ N+1} ∥x_i − x_j∥.

We will refer to the l^2 condition number κ(V) of V as the simplex condition.

We let δ(f : S) denote the vector of objective function differences

    δ(f : S) = (f(x_2) − f(x_1), f(x_3) − f(x_1), . . . , f(x_{N+1}) − f(x_1))^T.

We will not use the simplex diameter directly in our estimates or algorithms. Rather we will use two oriented lengths

    σ_+(S) = max_{2 ≤ j ≤ N+1} ∥x_1 − x_j∥  and  σ_−(S) = min_{2 ≤ j ≤ N+1} ∥x_1 − x_j∥.

Clearly,

    σ_+(S) ≤ diam(S) ≤ 2 σ_+(S).
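The quantities in Definition 6.2.1 and the oriented lengths are straightforward to compute. The sketch below (the function and variable names are ours, not the book's) collects them for a simplex stored as rows of vertices:

```python
import numpy as np

def simplex_quantities(X):
    """X: (N+1, N) array whose rows are the vertices x_1, ..., x_{N+1}.

    Returns the matrix of simplex directions V(S), diam(S), the oriented
    lengths sigma_+(S) and sigma_-(S), and the l2 condition number kappa(V).
    """
    x1, rest = X[0], X[1:]
    V = (rest - x1).T                      # columns are v_j = x_{j+1} - x_1
    lengths = np.linalg.norm(rest - x1, axis=1)
    diam = max(np.linalg.norm(xi - xj) for xi in X for xj in X)
    return V, diam, lengths.max(), lengths.min(), np.linalg.cond(V)
```

For the unit right-angle simplex in R^2 this returns V = I, diam = √2, σ_+ = σ_− = 1, and κ(V) = 1, consistent with the inequality σ_+(S) ≤ diam(S) ≤ 2σ_+(S).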


6.2.1 Forward Difference Simplex Gradient

Definition 6.2.2. Let S be a nonsingular simplex with vertices {x_j}_{j=1}^{N+1}. The simplex gradient D(f : S) is

    D(f : S) = V^{−T} δ(f : S).

Note that the matrix of simplex directions and the vector of objective function differences depend on which of the vertices is labeled x_1. Most of the algorithms we consider in this part of the book use a vertex ordering or sample on a regular stencil. In this way the algorithms, in one way or another, use a simplex gradient.
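A minimal NumPy sketch of Definition 6.2.2 (our own illustration, not code from the text); we solve with V^T rather than forming V^{−1} explicitly:

```python
import numpy as np

def simplex_gradient(f, X):
    """Forward difference simplex gradient D(f : S) = V^{-T} delta(f : S).

    X is an (N+1, N) array of vertices; row 0 is x_1.
    """
    x1 = X[0]
    V = (X[1:] - x1).T                        # simplex directions as columns
    delta = np.array([f(xj) - f(x1) for xj in X[1:]])
    return np.linalg.solve(V.T, delta)        # V^{-T} delta, without V^{-1}
```

For the quadratic f(x) = x·x/2, whose gradient is x, a small simplex at x_1 recovers ∇f(x_1) up to an error of order σ_+(S).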

This definition of the simplex gradient is motivated by the first-order estimate in Lemma 6.2.1.

Lemma 6.2.1. Let S be a nonsingular simplex. Let ∇f be Lipschitz continuous in a neighborhood of S with Lipschitz constant 2K_f. Then there is K > 0, depending only on K_f, such that

    ∥∇f(x_1) − D(f : S)∥ ≤ K κ(V) σ_+(S).    (6.2)

Proof. Our smoothness assumptions on f and Taylor's theorem imply that for all 2 ≤ j ≤ N + 1,

    |f(x_1) − f(x_j) + v_j^T ∇f(x_1)| ≤ K_f ∥v_j∥² ≤ K_f σ_+(S)².

Hence

    ∥δ(f : S) − V^T ∇f(x_1)∥ ≤ N^{1/2} K_f σ_+(S)²

and hence, setting K = N^{1/2} K_f,

    ∥∇f(x_1) − D(f : S)∥ ≤ K ∥V^{−T}∥ σ_+(S)².

The conclusion follows from the fact that σ_+(S) ≤ ∥V∥.
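The first-order estimate (6.2) is easy to check numerically. In the sketch below (our own construction, assuming a smooth quadratic with no noise), shrinking the simplex by a factor of ten shrinks the gradient error by the same factor:

```python
import numpy as np

def simplex_gradient(f, X):
    """D(f : S) = V^{-T} delta(f : S) for vertices stored in the rows of X."""
    x1 = X[0]
    V = (X[1:] - x1).T
    delta = np.array([f(x) - f(x1) for x in X[1:]])
    return np.linalg.solve(V.T, delta)

f = lambda x: 0.5 * np.dot(x, x)           # smooth; grad f(x) = x
x1 = np.array([1.0, 2.0])

def error(h):
    """Gradient error on a right-angle simplex with sigma_+(S) = h."""
    X = np.vstack([x1, x1 + [h, 0.0], x1 + [0.0, h]])
    return np.linalg.norm(simplex_gradient(f, X) - x1)

# First-order estimate (6.2): the error is O(sigma_+(S)).
ratio = error(1e-3) / error(1e-2)          # roughly 0.1
```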

Search algorithms are not intended, of course, for smooth problems. Minimization of objective functions of the form in (6.1) is one of the applications of these methods. A first-order estimate that takes perturbations into account is our next result.

We will need to measure the perturbations on each simplex. To that end we define, for any set T,

    ∥φ∥_T = sup_{x ∈ T} |φ(x)|.

A first-order estimate also holds for the simplex gradient of an objective function that satisfies (6.1).

Lemma 6.2.2. Let S be a nonsingular simplex. Let f satisfy (6.1) and let ∇f_s be Lipschitz continuous in a neighborhood of S with Lipschitz constant 2K_s. Then there is K > 0, depending only on K_s, such that

    ∥∇f_s(x_1) − D(f : S)∥ ≤ K κ(V) ( σ_+(S) + ∥φ∥_S / σ_+(S) ).    (6.3)

Proof. Lemma 6.2.1 (applied to f_s) implies

    ∥∇f_s(x_1) − D(f_s : S)∥ ≤ K_s N^{1/2} κ(V) σ_+(S).

Now, since ∥δ(φ : S)∥ ≤ 2√N ∥φ∥_S and σ_+(S) ≤ ∥V∥,

    ∥D(f : S) − D(f_s : S)∥ = ∥V^{−T} (δ(f : S) − δ(f_s : S))∥ = ∥V^{−T} δ(φ : S)∥
                            ≤ 2 N^{1/2} ∥V^{−T}∥ ∥φ∥_S ≤ 2 N^{1/2} κ(V) ∥φ∥_S / σ_+(S).

This completes the proof with K = N^{1/2} K_s + 2 N^{1/2}.

The constants K in (6.2) and (6.3) depend on S only through the Lipschitz constants of f_s and ∇f_s in a neighborhood of S. We will express that dependence as K = K(S) when needed.

The algorithms in this section are most profitably applied to problems of the form (6.1), and the goal is to extract as much information as possible from the smooth part f_s of f without wasting effort in a futile attempt to minimize the noise. In order to formulate our goal for convergence clearly, we explore the consequences of a small simplex gradient in the special (and not uncommon) case that the amplitude of the noise is small in Lemma 6.2.3.
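The trade-off in (6.3) is worth seeing numerically: the bracketed term σ_+(S) + ∥φ∥_S/σ_+(S) is smallest when σ_+(S) is near √(∥φ∥_S), so the difference increment should never be driven far below the noise scale. A small sketch, with an assumed noise level of our own choosing:

```python
import numpy as np

noise = 1e-8                                  # assumed value of ||phi||_S
bound = lambda sigma: sigma + noise / sigma   # shape of the bound in (6.3)

# Scan simplex sizes over many orders of magnitude; the bound is minimized
# near sqrt(noise) = 1e-4: smaller simplices differentiate the noise,
# larger ones lose first-order accuracy.
sigmas = np.logspace(-12, 0, 200)
best = sigmas[np.argmin(bound(sigmas))]
```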

Lemma 6.2.3. Let f satisfy (6.1) and let ∇f_s be continuously differentiable in a compact set Ω ⊂ R^N. Assume that f_s has a unique critical point x* in Ω. Then there is K_Ω > 0 such that for any simplex S ⊂ Ω with vertices {x_j}_{j=1}^{N+1},

    ∥x_1 − x*∥ ≤ K_Ω ( ∥D(f : S)∥ + κ(V) ( σ_+(S) + ∥φ∥_S / σ_+(S) ) ).

Proof. The compactness of Ω and our smoothness assumptions on f_s imply that there is β_0 > 0 such that

    ∥∇f_s(x)∥ ≥ β_0 ∥x − x*∥

for all x ∈ Ω. We apply (6.3) to obtain

    ∥x_1 − x*∥ ≤ β_0^{−1} ∥∇f_s(x_1)∥ ≤ β_0^{−1} ( ∥D(f : S)∥ + K κ(V) ( σ_+(S) + ∥φ∥_S / σ_+(S) ) ).

This completes the proof with K_Ω = β_0^{−1} max(1, K).

By sampling in an organized way, simplex-based algorithms attempt, some directly and some implicitly, to drive the simplex gradient to a small value by changing the size of the simplices over which f is sampled. The motion of the simplices and the scheme for changing their size (especially the reduction in size) account for the differences among the algorithms. Theorem 6.2.4, a direct consequence of Lemma 6.2.3, quantifies this. We will consider a sequence of uniformly well-conditioned simplices; such simplices are generated by several of the algorithms we will study later.

Theorem 6.2.4. Let f satisfy (6.1) and let ∇f_s be continuously differentiable in a compact set Ω ⊂ R^N. Assume that f_s has a unique critical point x* in Ω. Let S^k be a sequence of simplices having vertices {x_j^k}_{j=1}^{N+1}. Assume that there is M such that

    S^k ⊂ Ω and κ(V(S^k)) ≤ M for all k.

Then,

1. if

    lim_{k→∞} σ_+(S^k) = 0,  lim_{k→∞} ∥φ∥_{S^k} / σ_+(S^k) = 0,

and lim sup_{k→∞} ∥D(f : S^k)∥ = ε for some ε > 0, then there is K_S > 0 such that

    lim sup_{k→∞} ∥x* − x_1^k∥ ≤ K_S ε;

2. if, for some ε > 0,

    lim sup_{k→∞} ∥φ∥_{S^k} ≤ ε²,  lim inf_{k→∞} σ_+(S^k) ≥ ε,  and  lim inf_{k→∞} ∥D(f : S^k)∥ ≤ ε,

then there is K_S > 0 such that

    lim sup_{k→∞} ∥x* − x_1^k∥ ≤ K_S ( ε + lim sup_{k→∞} σ_+(S^k) ).

6.2.2 Centered Difference Simplex Gradient

In this section we define the centered difference simplex gradient and prove a second-order estimate. We will then prove two variants of Theorem 6.2.4, one to show how the role of the noise φ differs from that in the one-sided derivative case and a second to quantify how the values of f on the stencil can be used to terminate an iteration.

Definition 6.2.3. Let S be a nonsingular simplex in R^N with vertices {x_j}_{j=1}^{N+1} and simplex directions v_j = x_{j+1} − x_1. The reflected simplex R = R(S) is the simplex with vertices x_1 and

    r_j = x_1 − v_j for j = 1, . . . , N.

The central simplex gradient D_C(f : S) is

    D_C(f : S) = ( D(f : S) + D(f : R) ) / 2 = V^{−T} ( δ(f : S) − δ(f : R) ) / 2.

For example, if N = 1 and x_2 = x_1 + h, then r_2 = x_1 − h. Hence

    D(f : S) = ( f(x_1 + h) − f(x_1) ) / h  and  D(f : R) = ( f(x_1 − h) − f(x_1) ) / (−h).

Therefore,

    D_C(f : S) = D_C(f : R) = ( f(x_1 + h) − f(x_1 − h) ) / (2h)

is the usual central difference.
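A sketch of the central simplex gradient, using the reflected vertices r_j = x_1 − v_j = 2x_1 − x_j; as with the forward version, this is our own illustration, not code from the text:

```python
import numpy as np

def central_simplex_gradient(f, X):
    """D_C(f : S) = V^{-T} (delta(f : S) - delta(f : R)) / 2,
    where R = R(S) is the reflected simplex of Definition 6.2.3.

    X is an (N+1, N) array of vertices; row 0 is x_1.
    """
    x1 = X[0]
    V = (X[1:] - x1).T
    delta_S = np.array([f(x) - f(x1) for x in X[1:]])
    delta_R = np.array([f(2 * x1 - x) - f(x1) for x in X[1:]])  # r_j = x_1 - v_j
    return np.linalg.solve(V.T, (delta_S - delta_R) / 2.0)
```

Because the second-order Taylor terms cancel between S and R(S), this gradient is exact for quadratics in exact arithmetic, even for large simplices.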

Lemmas 6.2.5 and 6.2.6 are the second-order analogues of Lemmas 6.2.1 and 6.2.2.

Lemma 6.2.5. Let S be a nonsingular simplex and let ∇²f be Lipschitz continuous in a neighborhood of S ∪ R(S) with Lipschitz constant 3K_C. Then there is K > 0 such that

    ∥∇f(x_1) − D_C(f : S)∥ ≤ K κ(V) σ_+(S)².    (6.4)

Proof. The Lipschitz continuity assumption implies that for all 2 ≤ j ≤ N + 1,

    |f(x_j) − f(r_j) − 2 v_j^T ∇f(x_1)| ≤ K_C ∥v_j∥³ ≤ K_C σ_+(S)³.

As in the proof of Lemma 6.2.1 we have

    ∥δ(f : S) − δ(f : R) − 2 V^T ∇f(x_1)∥ ≤ N^{1/2} K_C σ_+(S)³,

and hence the result follows with K = N^{1/2} K_C.

Lemma 6.2.6. Let S be a nonsingular simplex. Let f satisfy (6.1) and let ∇²f_s be Lipschitz continuous in a neighborhood of S ∪ R(S) with Lipschitz constant 3K_{Cs}. Then there is K > 0, depending only on K_{Cs}, such that

    ∥∇f_s(x_1) − D_C(f : S)∥ ≤ K κ(V) ( σ_+(S)² + ∥φ∥_S / σ_+(S) ).    (6.5)

Proof. This proof is very similar to that of Lemma 6.2.2 and is left to the reader.

The quality of the information that can be obtained from the central simplex gradient is higher than that from the forward one. The difference in practice can be dramatic, as the examples in §7.6 illustrate. The consequences of a small central simplex gradient follow directly from Lemma 6.2.6.

Lemma 6.2.7. Let f satisfy (6.1) and let ∇²f_s be continuously differentiable in a compact set Ω ⊂ R^N. Assume that f_s has a unique critical point x* in Ω. Then there is K_Ω > 0 such that if a simplex S and its reflection R(S) are both contained in Ω, then

    ∥x_1 − x*∥ ≤ K_Ω ( ∥D_C(f : S)∥ + κ(V) ( σ_+(S)² + ∥φ∥_S / σ_+(S) ) ).

Lemma 6.2.7 is all one needs to conclude convergence from a sequence of small central

simplex gradients.

Theorem 6.2.8. Let f satisfy (6.1) and let ∇²f_s be continuously differentiable in a compact set Ω ⊂ R^N. Assume that f_s has a unique critical point x* in Ω. Let S^k be a sequence of simplices having vertices {x_j^k}_{j=1}^{N+1}. Assume that there is M such that

    S^k, R(S^k) ⊂ Ω and κ(V(S^k)) ≤ M for all k.

Then,

1. if

    lim_{k→∞} σ_+(S^k) = 0,  lim_{k→∞} ∥φ∥_{S^k} / σ_+(S^k) = 0,

and lim sup_{k→∞} ∥D_C(f : S^k)∥ = ε for some ε > 0, then there is K_S > 0 such that

    lim sup_{k→∞} ∥x* − x_1^k∥ ≤ K_S ε;

2. if, for some ε > 0,

    lim sup_{k→∞} ∥φ∥_{S^k} ≤ ε³,  lim inf_{k→∞} σ_+(S^k) ≥ ε,  and  lim inf_{k→∞} ∥D_C(f : S^k)∥ ≤ ε²,

then there is K_S > 0 such that

    lim sup_{k→∞} ∥x* − x_1^k∥ ≤ K_S ( ε + lim sup_{k→∞} σ_+(S^k) )².
