[Figure: two panels of forward-difference implicit filtering results, plotting function value and simplex gradient norm against function evaluations.]

Figure 7.9: Perturbed Quadratic, N = 32

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.


If a3 ≠ 0 then f may not return the same value when called with the same argument twice. The reader is invited to explore the consequences of this in exercise 7.7.3.

The performance of the algorithms in this part of the book sometimes depends on the size of the problem much more strongly than the Newton-based methods in Part I. In the case of implicit filtering, that dependence is mostly a result of the cost of evaluation of the simplex gradient. To illustrate this we consider our quadratic problems for N = 4 (Figures 7.6 and 7.8) and N = 32 (Figures 7.7 and 7.9).

For all the quadratic examples the initial iterate was

x0 = (1, 2, . . . , N)^T / (10N)

and the sequence of scales was {2^{-k}}_{k=0}^{10}.
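As a concrete illustration, the setup above can be written in a few lines of Python (the function name is ours; this is a sketch, not code from the book):

```python
def quadratic_test_setup(N):
    """Initial iterate x0 = (1, 2, ..., N)^T / (10 N) and the
    scale sequence {2^{-k}}, k = 0, ..., 10, for the quadratic examples."""
    x0 = [i / (10.0 * N) for i in range(1, N + 1)]
    scales = [2.0 ** (-k) for k in range(11)]
    return x0, scales
```

For N = 4 this gives x0 = (0.025, 0.05, 0.075, 0.1)^T and eleven scales from 1 down to 2^{-10}.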

7.7 Exercises on Implicit Filtering

7.7.1. Let S be a nonsingular simplex. Show that D_C(f : S) = ∇f(x_1) if f is a quadratic function.

7.7.2. How would you expect forward and centered difference implicit filtering to perform when applied to f(x) = x^T x? Would the performance be independent of dimension? Test your expectation with numerical experimentation.

7.7.3. Use implicit filtering to minimize the perturbed quadratic function with nonzero values of a3.

7.7.4. Try to solve the Lennard–Jones problem with implicit filtering for various values of M and various initial iterates. Compare your best results with those in [142], [40], and [210]. Are you doing any better than you did in exercise 6.4.3?

7.7.5. Show that Theorems 5.5.4 and 5.5.5 hold if the projected BFGS update is implemented using (7.9) and (7.10). How would these formulas affect an implementation like bfgsrecb, which is designed for problems in which full matrices cannot be stored?


Chapter 8

Direct Search Algorithms

In this chapter we discuss the class of direct search algorithms. These methods use values of f taken from a set of sample points and use that information to continue the sampling. Unlike implicit filtering, these methods do not explicitly use approximate gradient information. We will focus on three such methods: the Nelder–Mead simplex algorithm [204], the multidirectional search method [85], [261], [262], and the Hooke–Jeeves algorithm [145]. Each of these can be analyzed using the simplex gradient techniques from Chapter 6. We will not discuss the very general results based on the taxonomies of direct search methods from [263], [174], and [179] or the recent research on the application of these methods to bound [173] or linear [175] constraints.

We include at the end of this chapter a short discussion of methods based on surrogate models and a brief account of a very different search method, the DIRECT algorithm [150]. These two final topics do not lead to algorithms that are easy to implement, and our discussions will be very general with pointers to the literature.

8.1 The Nelder–Mead Algorithm

8.1.1 Description and Implementation

The Nelder–Mead [204] simplex algorithm maintains a simplex S of approximations to an optimal point. In this algorithm the vertices {x_j}_{j=1}^{N+1} are sorted according to the objective function values

f(x_1) ≤ f(x_2) ≤ · · · ≤ f(x_{N+1}).    (8.1)

x_1 is called the best vertex and x_{N+1} the worst. If several vertices have the same objective value as x_1, the best vertex is not uniquely defined, but this ambiguity has little effect on the performance of the algorithm.

The algorithm attempts to replace the worst vertex x_{N+1} with a new point of the form

x(µ) = (1 + µ) x̄ − µ x_{N+1},    (8.2)

where x̄ is the centroid of the convex hull of {x_i}_{i=1}^{N},

x̄ = (1/N) Σ_{i=1}^{N} x_i.    (8.3)

The value of µ is selected from a sequence

−1 < µ_ic < 0 < µ_oc < µ_r < µ_e


by rules that we formally describe in Algorithm nelder. Our formulation of the algorithm allows for termination if either f(x_{N+1}) − f(x_1) is sufficiently small or a user-specified number of function evaluations has been expended.
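In code, the centroid (8.3) and the candidate point (8.2) used by the algorithm below might be sketched as follows (Python; the helper names are ours):

```python
def centroid(vertices):
    """Centroid of the best N vertices (all but the worst), equation (8.3).

    vertices: list of N+1 points in R^N, sorted so the worst vertex is last."""
    n = len(vertices) - 1
    dim = len(vertices[0])
    return [sum(v[i] for v in vertices[:n]) / n for i in range(dim)]

def trial_point(vertices, mu):
    """Candidate point x(mu) = (1 + mu) * xbar - mu * x_{N+1}, equation (8.2)."""
    xbar = centroid(vertices)
    worst = vertices[-1]
    return [(1.0 + mu) * c - mu * w for c, w in zip(xbar, worst)]
```

For the simplex with vertices (0, 0), (1, 0), (0, 1), sorted with the worst vertex last, trial_point with µ = 1 produces the reflected point (1, −1).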

Algorithm 8.1.1. nelder(S, f, τ, kmax)

1. Evaluate f at the vertices of S and sort the vertices of S so that (8.1) holds.

2. Set fcount = N + 1.

3. While f(x_{N+1}) − f(x_1) > τ

(a) Compute x̄, (8.3), x(µ_r), (8.2), and f_r = f(x(µ_r)). fcount = fcount + 1.

(b) Reflect: If fcount = kmax then exit. If f(x_1) ≤ f_r < f(x_N), replace x_{N+1} with x(µ_r) and go to step 3g.

(c) Expand: If fcount = kmax then exit. If f_r < f(x_1) then compute f_e = f(x(µ_e)). fcount = fcount + 1. If f_e < f_r, replace x_{N+1} with x(µ_e); otherwise replace x_{N+1} with x(µ_r). Go to step 3g.

(d) Outside Contraction: If fcount = kmax then exit. If f(x_N) ≤ f_r < f(x_{N+1}), compute f_c = f(x(µ_oc)). fcount = fcount + 1. If f_c ≤ f_r replace x_{N+1} with x(µ_oc) and go to step 3g; otherwise go to step 3f.

(e) Inside Contraction: If fcount = kmax then exit. If f_r ≥ f(x_{N+1}) compute f_c = f(x(µ_ic)). fcount = fcount + 1. If f_c < f(x_{N+1}), replace x_{N+1} with x(µ_ic) and go to step 3g; otherwise go to step 3f.

(f) Shrink: If fcount ≥ kmax − N, exit. For 2 ≤ i ≤ N + 1: set x_i = x_1 − (x_i − x_1)/2; compute f(x_i).

(g) Sort: Sort the vertices of S so that (8.1) holds.

A typical sequence [169] of candidate values for µ is

{µ_r, µ_e, µ_oc, µ_ic} = {1, 2, 1/2, −1/2}.
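Putting the pieces together, the steps of Algorithm nelder can be condensed into a short Python sketch using the typical candidate values above (a sketch under our own naming, not a reference implementation; the shrink follows the formula in step 3f verbatim):

```python
def nelder_mead(f, simplex, tau=1e-8, kmax=200,
                mu_r=1.0, mu_e=2.0, mu_oc=0.5, mu_ic=-0.5):
    """One condensation of Algorithm nelder for a simplex of N+1 points in R^N."""
    S = [list(x) for x in simplex]
    fvals = [f(x) for x in S]
    order = sorted(range(len(S)), key=lambda i: fvals[i])
    S = [S[i] for i in order]; fvals = [fvals[i] for i in order]
    fcount = len(S)
    n = len(S) - 1

    def x_of(mu):
        # x(mu) = (1 + mu) * xbar - mu * x_{N+1}, with xbar the centroid (8.3).
        xbar = [sum(v[i] for v in S[:n]) / n for i in range(n)]
        return [(1.0 + mu) * c - mu * w for c, w in zip(xbar, S[-1])]

    def shrink():
        # Step 3f: x_i = x_1 - (x_i - x_1)/2 for i = 2, ..., N+1.
        for i in range(1, n + 1):
            S[i] = [a - (b - a) / 2.0 for a, b in zip(S[0], S[i])]
            fvals[i] = f(S[i])

    while fvals[-1] - fvals[0] > tau and fcount < kmax:
        xr = x_of(mu_r); fr = f(xr); fcount += 1
        if fvals[0] <= fr < fvals[-2]:                  # (b) reflect
            S[-1], fvals[-1] = xr, fr
        elif fr < fvals[0]:                             # (c) expand
            xe = x_of(mu_e); fe = f(xe); fcount += 1
            S[-1], fvals[-1] = (xe, fe) if fe < fr else (xr, fr)
        elif fvals[-2] <= fr < fvals[-1]:               # (d) outside contraction
            xc = x_of(mu_oc); fc = f(xc); fcount += 1
            if fc <= fr:
                S[-1], fvals[-1] = xc, fc
            else:
                shrink(); fcount += n                   # (f) shrink
        else:                                           # (e) inside contraction
            xc = x_of(mu_ic); fc = f(xc); fcount += 1
            if fc < fvals[-1]:
                S[-1], fvals[-1] = xc, fc
            else:
                shrink(); fcount += n                   # (f) shrink
        order = sorted(range(n + 1), key=lambda i: fvals[i])  # (g) sort
        S[:] = [S[i] for i in order]; fvals[:] = [fvals[i] for i in order]
    return S[0], fvals[0]
```

On a smooth convex quadratic such as f(x) = x_1² + x_2² this sketch converges quickly from a small initial simplex; the stagnation examples of §8.1.3 show that such behavior is not guaranteed in general.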

Figure 8.1 is an illustration of the options in two dimensions. The vertices labeled 1, 2, and 3 are those of the original simplex.

The Nelder–Mead algorithm is not guaranteed to converge, even for smooth problems [89], [188]. The failure mode is stagnation at a nonoptimal point. In §8.1.3 we will present some examples from [188] that illustrate this failure. However, the performance of the Nelder–Mead algorithm in practice is generally good [169], [274]. The shrink step is rare in practice and we will assume in the analysis in §8.1.2 that shrinks do not occur. In that case, while a Nelder–Mead iterate may not result in a reduction in the best function value, the average value

f̄ = (1/(N+1)) Σ_{j=1}^{N+1} f(x_j)

will be reduced.


[Figure: a two-dimensional simplex with vertices labeled 1, 2, 3 and the candidate points e (expansion), r (reflection), oc (outside contraction), and ic (inside contraction).]

Figure 8.1: Nelder–Mead Simplex and New Points

8.1.2 Sufficient Decrease and the Simplex Gradient

Our study of the Nelder–Mead algorithm is based on the simple ideas in §3.1. We will denote the vertices of the simplex S^k at the kth iteration by {x_j^k}_{j=1}^{N+1}. We will simplify notation by suppressing explicit mention of S^k in what follows by denoting

V^k = V(S^k), δ^k = δ(f : S^k), K^k = K(S^k), and D^k(f) = D(f : S^k).

If V^0 is nonsingular then V^k is nonsingular for all k > 0 [169]. Hence if S^0 is nonsingular so is S^k for all k and hence D^k(f) is defined for all k.
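For readers experimenting numerically, the Chapter 6 objects can be sketched in a few lines of Python. Here V has columns x_{j+1} − x_1, δ(f : S) has entries f(x_{j+1}) − f(x_1), and the simplex gradient D(f : S) = V^{-T} δ(f : S) solves V^T D = δ (the helper names and the hand-rolled linear solver are ours, chosen to keep the sketch self-contained):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def simplex_gradient(f, vertices):
    """Simplex gradient D(f : S) = V^{-T} delta(f : S).

    vertices: list of N+1 points in R^N with the best vertex first."""
    x1, f1 = vertices[0], f(vertices[0])
    n = len(x1)
    # Rows of V^T are the edge directions x_{j+1} - x_1.
    VT = [[vertices[j + 1][i] - x1[i] for i in range(n)] for j in range(n)]
    delta = [f(vertices[j + 1]) - f1 for j in range(n)]
    return solve(VT, delta)
```

For a linear function the simplex gradient reproduces the gradient exactly, which makes a convenient sanity check for an implementation.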

We formalize this by assuming that our sequence of simplices satisfies the following assumption.

Assumption 8.1.1. For all k,

• S^k is nonsingular.

• The vertices satisfy (8.1).

• f̄^{k+1} < f̄^k.

Assumption 8.1.1 is satisfied by the Nelder–Mead sequence if no shrink steps are taken and the initial simplex directions are linearly independent [169]. The Nelder–Mead algorithm demands that the average function value improve, but no control is possible on which value is improved, and the simplex condition number can become unbounded.


We can define a sufficient decrease condition for search algorithms that is analogous to the sufficient decrease condition for steepest descent and related algorithms (3.2). We will ask that the (k+1)st iteration satisfy

f̄^{k+1} − f̄^k < −α ‖D^k f‖².    (8.4)

Here α > 0 is a small parameter. Our choice of sufficient decrease condition is motivated by the smooth case and steepest descent, where (3.2) and the lower bound λ̄ on λ from Lemma 3.2.3 lead to

f(x^{k+1}) − f(x^k) ≤ −λ̄ α ‖∇f(x^k)‖²,

which is a smooth form of (8.4). Unlike the smooth case, however, we have no descent direction and must incorporate λ̄ into α. This leads to the possibility that if the simplex diameter is much smaller than ‖D^k f‖, (8.4) could fail on the first iterate. We address this problem with the scaling

α = α_0 σ_+(S^0) / ‖D^0 f‖.

A typical choice in line search methods, which we use in our numerical results, is α_0 = 10^{−4}.
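A minimal sketch of the test (8.4) and of the scaling of α (Python; the function names are ours):

```python
def sufficient_decrease(fbar_new, fbar_old, Dkf_norm, alpha):
    """Test (8.4): fbar^{k+1} - fbar^k < -alpha * ||D^k f||^2."""
    return fbar_new - fbar_old < -alpha * Dkf_norm ** 2

def scaled_alpha(alpha0, sigma_plus_S0, D0f_norm):
    """Scaling alpha = alpha0 * sigma_+(S^0) / ||D^0 f||, which guards
    against an initial simplex much smaller than the initial simplex
    gradient."""
    return alpha0 * sigma_plus_S0 / D0f_norm
```

With α_0 = 10^{−4}, σ_+(S^0) = 0.5, and ‖D^0 f‖ = 2, the scaled parameter is α = 2.5 × 10^{−5}.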

The convergence result for smooth functions follows easily from Lemma 6.2.1.

Theorem 8.1.1. Let a sequence of simplices satisfy Assumption 8.1.1 and let the assumptions of Lemma 6.2.1 hold, with the Lipschitz constants K^k uniformly bounded. Assume that {f̄^k} is bounded from below. Then if (8.4) holds for all but finitely many k and

lim_{k→∞} σ_+(S^k) κ(V^k) = 0,

then any accumulation point of the simplices is a critical point of f.

Proof. The boundedness from below of {f̄^k} and (8.4) imply that f̄^{k+1} − f̄^k → 0. Assumption 8.1.1 and (8.4) imply that lim_{k→∞} ‖D^k f‖ = 0. Hence (6.2) implies

lim_{k→∞} ‖∇f(x_1^k)‖ ≤ lim_{k→∞} ( K κ(V^k) σ_+(S^k) + ‖D^k f‖ ) = 0.

Hence, if x* is any accumulation point of the sequence {x_1^k} then ∇f(x*) = 0. This completes the proof since κ(V^k) ≥ 1 and therefore σ_+(S^k) → 0.

The result for the noisy functions that satisfy (6.1) with f_s smooth reflects the fact that the resolution is limited by the size of φ. In fact, if σ_+(S^k) is much smaller than ‖φ‖_{S^k}, no information on f_s can be obtained by evaluating f at the vertices of S^k, and once σ_+(S^k) is smaller than ‖φ‖_{S^k}^{1/2} no conclusions on ∇f_s can be drawn. If, however, the noise decays to zero sufficiently rapidly near the optimal point, the conclusions of Theorem 8.1.1 still hold.

Theorem 8.1.2. Let a sequence of simplices satisfy Assumption 8.1.1 and let the assumptions of Lemma 6.2.2 hold with the Lipschitz constants K_s^k uniformly bounded. Assume that {f̄^k} is bounded from below. Then if (8.4) holds for all but finitely many k and if

lim_{k→∞} κ(V^k) ( σ_+(S^k) + ‖φ‖_{S^k} / σ_+(S^k) ) = 0,    (8.5)

then any accumulation point of the simplices is a critical point of f_s.

Proof. Our assumptions, as in the proof of Theorem 8.1.1, imply that ‖D^k f‖ → 0. Recall that Lemma 6.2.2 implies that

‖D^k f_s‖ ≤ ‖D^k f‖ + K^k κ(V^k) ( σ_+(S^k) + ‖φ‖_{S^k} / σ_+(S^k) ),    (8.6)

and the sequence {K^k} is bounded because {K_s^k} is. Hence, by (8.5), ‖D^k f_s‖ → 0 as k → ∞.


[Figure: four panels plotting function differences, maximum oriented length, simplex gradient norm, and simplex condition number against iterations.]

Figure 8.2: Unmodified Nelder–Mead, (τ, θ, φ) = (1, 15, 10)

8.1.3 McKinnon's Examples

In this set of three examples from [188], N = 2, and

f(x) = θφ|(x)_1|^τ + (x)_2 + (x)_2²,   (x)_1 ≤ 0,