can be used to advantage by triggering a reduction in h. The line search parameters α, β, and the parameter τ in the termination criterion (7.2) do not affect the convergence analysis that we present here but can affect performance.

Algorithm 7.1.1. fdsteep(x, f, pmax, τ, h, amax)

1. For p = 1, . . . , pmax

(a) Compute f and ∇_h f; terminate if (6.7) or (7.2) holds.

(b) Find the least integer 0 ≤ m ≤ amax such that (7.1) holds for λ = β^m. If no such m exists, terminate.

(c) x = x − λ∇_h f(x).

Algorithm fdsteep will terminate after finitely many iterations because of the limits on the number of iterations and the number of backtracks. If the set {x | f(x) ≤ f(x_0)} is bounded, then the iterations will remain in that set. Implicit filtering calls fdsteep repeatedly, reducing h after each termination of fdsteep. Aside from the data needed by fdsteep, one must provide a sequence of difference increments, called scales in [120].

Algorithm 7.1.2. imfilter1(x, f, pmax, τ, {h_k}, amax)

1. For k = 0, . . .

Call fdsteep(x, f, pmax, τ, h_k, amax)
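Algorithms 7.1.1 and 7.1.2 can be sketched in Python. This is only a sketch, not the book's MATLAB implementation: the helper name grad_h (standing in for the centered-difference gradient D_C(f : S)), the defaults α = 10⁻⁴ and β = 1/2, and the omission of the stencil-failure test (6.7) are all simplifications assumed here.

```python
import numpy as np

def grad_h(f, x, h):
    """Centered-difference (simplex) gradient at scale h."""
    g = np.zeros(len(x))
    for i in range(len(x)):
        e = np.zeros(len(x)); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def fdsteep(x, f, pmax, tau, h, amax, alpha=1e-4, beta=0.5):
    """Algorithm 7.1.1: steepest descent at a fixed scale h.
    Stops when ||grad_h f(x)|| <= tau*h (the analogue of (7.2))
    or when backtracking fails after amax reductions."""
    for _ in range(pmax):
        g = grad_h(f, x, h)
        if np.linalg.norm(g) <= tau * h:       # termination test (7.2)
            return x
        fx, accepted = f(x), False
        for m in range(amax + 1):              # least m satisfying (7.1)
            lam = beta ** m
            if f(x - lam * g) - fx < -alpha * lam * np.dot(g, g):
                accepted = True
                break
        if not accepted:
            return x                           # line search failure
        x = x - lam * g
    return x

def imfilter1(x, f, pmax, tau, scales, amax):
    """Algorithm 7.1.2: call fdsteep once for each scale h_k."""
    for h in scales:
        x = fdsteep(x, f, pmax, tau, h, amax)
    return x
```

On a smooth quadratic the centered difference is exact, so the sketch drives x to the minimizer well before the scales are exhausted.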

The convergence result follows from the second-order estimate (6.5), the consequences of a stencil failure, Theorem 6.2.9, and the equalities h_k = σ_+(S^k) and κ(V^k) = 1. A similar result for forward differences would follow from (6.3).

Theorem 7.1.1. Let f satisfy (6.1) and let ∇f_s be Lipschitz continuous. Let h_k → 0, let {x_k} be the implicit filtering sequence, and let S^k = S(x_k, h_k). Assume that (7.1) holds (i.e., there is no line search failure) for all but finitely many k. Then if

(7.3)    lim_{k→∞} ( h_k + h_k^{−1} ‖φ‖_{S^k} ) = 0,

any limit point of the sequence {x_k} is a critical point of f_s.

Proof. If either (7.2) or (6.7) holds for all but finitely many k then, as is standard,

∇_{h_k} f(x_k) = D_C(f : S^k) → 0.

Hence, using (7.3) and Lemma 6.2.2,

∇f_s(x_k) → 0,

as asserted.

7.2 Quasi-Newton Methods and Implicit Filtering

The unique feature of implicit filtering is the possibility, for problems that are sufficiently smooth near a minimizer, of obtaining faster convergence in the terminal phase of the iteration by using a quasi-Newton update of a model Hessian. This idea was first proposed in [250] and [120].

We begin with a quasi-Newton form of Algorithm fdsteep. In this algorithm a quasi-Newton approximation to the Hessian is maintained and the line search is based on the quasi-Newton direction

d = −H^{−1} ∇_h f(x),

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.


terminating when either

(7.4)    f(x + λd) − f(x) < αλ∇_h f(x)^T d

or too many stepsize reductions have been taken. With the application to implicit filtering in mind, Algorithm fdquasi replaces the quasi-Newton H with the identity matrix when the line search fails.

Algorithm 7.2.1. fdquasi(x, f, H, pmax, τ, h, amax)

1. For p = 1, . . . , pmax

(a) Compute f, ∇_h f, and d = −H^{−1}∇_h f; terminate if (7.2) holds.

(b) Find the least integer 0 ≤ m ≤ amax such that (7.4) holds for λ = β^m.

(c) x = x + λd.

(d) Update H with a quasi-Newton formula.

In the context of implicit filtering, where N is small, the full quasi-Newton Hessian or its inverse is maintained throughout the iteration. Our MATLAB codes store the model Hessian.

Algorithm 7.2.2. imfilter2(x, f, pmax, τ, {h_k}, amax)

1. H = I.

2. For k = 0, . . .

Call fdquasi(x, f, H, pmax, τ, h_k, amax).

In [250] and [120] the SR1 method was used because it performed somewhat better than the

BFGS method in the context of a particular application. The examples in §7.6 show the opposite

effect, and both methods have been successfully used in practice.

7.3 Implementation Considerations

Implicit filtering has several iterative parameters and requires some algorithmic decisions in its implementation. The parameters pmax, amax, and β play the same role that they do in any line search algorithm. In our MATLAB code imfil.m, which we used for all the computations reported in this book, we set pmax = 200 × n, amax = 10, and β = 1/2.

The performance of implicit filtering can be sensitive to the value of τ [250], with small values of τ leading to stagnation and values of τ that are too large leading to premature termination of fdquasi. Using stencil failure as a termination criterion reduces the sensitivity to small values of τ, and we use τ = .01 in the computations.

The sequence of scales is at best a guess at the level of the noise in the problem. If several of the scales are smaller than the level of the noise, the line search will fail immediately and the work at these scales will be wasted. Our implementation attempts to detect this by terminating the optimization if x is unchanged for three consecutive scales.

The simplex gradient may be a very poor approximation to the gradient. In some such cases the function evaluation at a trial point may fail to return a value [250], and one must either trap this failure and return an artificially large value, impose bound constraints, or impose a limit on the size of the step. In our computations we take the last approach and limit the stepsize to 10h by setting

(7.5)
    d = −H^{−1}∇_h f(x)                          if ‖H^{−1}∇_h f(x)‖ ≤ 10h,
    d = −10h H^{−1}∇_h f(x) / ‖H^{−1}∇_h f(x)‖   otherwise.
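The step limit (7.5) is simple to implement. The sketch below is an illustration only; the function name limited_direction is an assumption, not the book's code.

```python
import numpy as np

def limited_direction(H, g, h):
    """Quasi-Newton direction with the step capped at 10h, as in (7.5).
    H is the model Hessian, g the difference gradient, h the scale."""
    d = -np.linalg.solve(H, g)
    nd = np.linalg.norm(d)
    if nd > 10 * h:
        d = (10 * h / nd) * d   # rescale so that ||d|| = 10h
    return d
```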


The choice of a quasi-Newton method to use with implicit filtering is an area of active research [56], [55]. Both SR1 and BFGS have been used, with SR1 performing modestly better in some applications with bound constraints [270], [251], [271], [250], [55]. The implementation of implicit filtering in the collection of MATLAB codes imfil.m uses BFGS as the default but has SR1 as an option. We found BFGS with central differences to be consistently better in the preparation of the (unconstrained!) computational examples in this book.

7.4 Implicit Filtering for Bound Constrained Problems

Implicit filtering was initially designed as an algorithm for bound constrained problems [250], [120]. The bound constrained version we present here is simply a projected quasi-Newton algorithm like the one presented in §5.5.3. There are other approaches to the implementation, and no best approach has emerged. We refer the reader to [120] and [55] for discussions of the options.

We begin with scaling and the difference gradient. Central differences perform better, but we do not evaluate f outside of the feasible region. Hence, if a point on the centered difference stencil is outside of the feasible region, we use a one-sided difference in that direction. In order to guarantee that at least one point in each direction is feasible, we scale the variables so that L_i = 0, U_i = 1, and h_0 ≤ 1/2.

The sufficient decrease condition is (compare with (5.31))

(7.6)    f(x(λ)) − f(x) ≤ α∇_h f(x)^T (x(λ) − x),

where

x(λ) = P(x − λ∇_h f(x)).

One could terminate the iteration at a given scale when the analogue to (7.2),

(7.7)    ‖x − x(1)‖ ≤ τh,

holds or when

(7.8)    f(x_c) < f(x ± r_j) for all feasible x ± r_j,

which is the analogue to (6.7) for bound constrained problems.
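A projected backtracking line search built on (7.6) might look like the following sketch. The scaled bounds L_i = 0, U_i = 1 from the scaling above are assumed, and the names proj and projected_step, as well as the defaults α = 10⁻⁴ and β = 1/2, are hypothetical.

```python
import numpy as np

def proj(x, lo=0.0, hi=1.0):
    """Projection P onto the scaled bound constraints [0, 1]^n."""
    return np.clip(x, lo, hi)

def projected_step(f, x, g, alpha=1e-4, beta=0.5, amax=10):
    """Backtracking projected line search using the sufficient decrease
    condition (7.6): f(x(lam)) - f(x) <= alpha * g^T (x(lam) - x),
    with x(lam) = P(x - lam*g).  Returns the accepted point, or x on
    line search failure."""
    fx = f(x)
    for m in range(amax + 1):
        lam = beta ** m
        xl = proj(x - lam * g)
        if f(xl) - fx <= alpha * np.dot(g, xl - x):
            return xl
    return x
```

Note that the decrease is measured against the projected step x(λ) − x rather than against −λg, which is what distinguishes (7.6) from the unconstrained condition.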

Quasi-Newton methods for bound constraints can be constructed more simply for small problems, like the ones to which implicit filtering is applied, where it is practical to store the model of the inverse of the reduced Hessian as a full matrix. By using full matrix storage, the complexity of bfgsrecb is avoided. One such alternative [53], [54], [55] to the updates in §5.5.3 is to update the complete reduced Hessian and then correct it with information from the new active set. This results in a two-stage update in which a model for the inverse of the reduced Hessian is updated with (4.5) to obtain

(7.9)    R_{1/2}^{−1} = (I − s y^T / y^T s) R_c^{−1} (I − y s^T / y^T s) + s s^T / y^T s.

Then the new reduced Hessian is computed using the active set information at the new point:

(7.10)    R_+^{−1} = P_{A+} + P_{I+} R_{1/2}^{−1} P_{I+}.

It is easy to show that Theorems 5.5.4 and 5.5.5 hold for this form of the update.
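The two-stage update (7.9)–(7.10) of the inverse reduced Hessian can be sketched as follows. The boolean mask active_new marking the new active set A+ and the function name are assumptions made for illustration.

```python
import numpy as np

def two_stage_update(Rinv, s, y, active_new):
    """Two-stage update of the inverse reduced Hessian.
    Stage 1 is the inverse-BFGS update (7.9); stage 2 corrects it with
    the projections onto the new active (A+) and inactive (I+) sets,
    as in (7.10)."""
    n = len(s)
    sy = np.dot(y, s)          # y^T s; assumed positive for the update
    I = np.eye(n)
    # (7.9): R_{1/2}^{-1} = (I - s y^T/y^T s) R_c^{-1} (I - y s^T/y^T s) + s s^T/y^T s
    V = I - np.outer(s, y) / sy
    Rhalf = V @ Rinv @ V.T + np.outer(s, s) / sy
    # (7.10): R_+^{-1} = P_{A+} + P_{I+} R_{1/2}^{-1} P_{I+}
    PA = np.diag(active_new.astype(float))
    PI = I - PA
    return PA + PI @ Rhalf @ PI
```

When no constraints are active, P_{A+} = 0 and P_{I+} = I, and the update reduces to the ordinary inverse-BFGS formula.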

A FORTRAN implementation [119] of implicit filtering for bound constrained problems is in the software collection. In the original version of that implementation a projected SR1 update was used and a Cholesky factorization of the matrix R_+ was performed to verify positivity. The model Hessian was reinitialized to the identity whenever the scale or the active set changed.


7.5 Restarting and Minima at All Scales

No algorithm in this part of the book is guaranteed to find even a local minimum, much less a global one. One approach to improving the robustness of these algorithms is to restart the iteration after one sweep through the scales. A point x that is not changed after a call to Algorithm imfilter1 (or imfilter2 or the bound constrained form of either) is called a minimum at all scales.

If f satisfies (6.1), f_s has a unique critical point that is also a local minimum that satisfies the standard assumptions (and hence is a global minimum for f_s), and certain (strong!) technical assumptions on the decay of φ near the minimum hold, then [120] a minimum at all scales is near that global minimum of f_s. In the unconstrained case this statement follows from the termination criteria ((7.2) and (6.7)) for implicit filtering, Lemma 6.2.3 (or 6.2.7), and, if central differences are used, Theorem 6.2.9. The analysis in [120] of the bound constrained case is more technical.

In practice, restarts are expensive and need not be done for most problems. However, restarts have been reported to make a difference in some cases [178]. It is also comforting to know that one has a minimum at all scales, and the author of this book recommends testing potential optima with restarts before one uses the results in practice, but not at the stage where one is tuning the optimizer or doing preliminary evaluation of the results.

7.6 Examples

Many of these examples are from [56]. For all the examples we report results with and without a quasi-Newton Hessian. We report results for both forward and central differences. In the figures the solid line corresponds to the BFGS Hessian, the dashed-dotted line to the SR1 Hessian, and the dashed line to H = I, the steepest descent form of implicit filtering.

Unlike the smooth problems considered earlier, where convergence of the gradient to zero was supported by theory, convergence of the simplex gradient to zero is limited by the noise in the objective. We illustrate performance by plotting both the objective function value and the norm of the simplex gradient. From these examples it is clear that the graph of function value against the count of function evaluations is a better indicator of the performance of the optimizer.

In all cases we terminated the iteration when either fdquasi had been called for each scale

or a budget of function evaluations had been exhausted. Once the code completes an iteration

and the number of function evaluations is greater than or equal to the budget, the iteration is

terminated.

The examples include both smooth and nonsmooth problems, with and without noise. A

serious problem for some algorithms of this type is their failure on very easy problems. For

most of the algorithms covered in this part of the book, we will present examples that illustrate

performance on this collection of problems.

7.6.1 Weber's Problem

The three Weber's function examples all have minimizers at points at which the objective is nondifferentiable. For the computations we used an initial iterate of (10, −10)^T, a budget of 200 function evaluations, and {10 × 2^{−n}}_{n=−2}^{8} as the sequence of scales.

In each of the examples the performance of the two quasi-Newton methods was virtually identical and far better than that without a quasi-Newton model Hessian. Forward and central differences for the first two problems (Figures 7.1 and 7.2) perform almost equally well, with forward having a slight edge. In Figure 7.3, however, the forward difference version of implicit filtering finds a local minimum different from the global minimum that is located by central differences.


[Figure 7.1: First Weber Example. Four panels plot the simplex gradient norm and the function value against function evaluations, for central differences (top row) and forward differences (bottom row).]

[Figure 7.2: Second Weber Example. Four panels plot the simplex gradient norm and the function value against function evaluations, for central differences (top row) and forward differences (bottom row).]


[Figure 7.3: panels plotting the simplex gradient norm and the function value against function evaluations for central differences.]