
can be used to advantage by triggering a reduction in h. The line search parameters α, β and
the parameter τ in the termination criterion (7.2) do not affect the convergence analysis that we
present here but can affect performance.
Algorithm 7.1.1. fdsteep(x, f, pmax, τ, h, amax)
1. For p = 1, . . . , pmax
   (a) Compute f and ∇_h f; terminate if (6.7) or (7.2) hold.
   (b) Find the least integer 0 ≤ m ≤ amax such that (7.1) holds for λ = β^m. If no such
       m exists, terminate.
   (c) x = x − λ∇_h f(x).
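A minimal MATLAB sketch of this loop is given below. It is not the imfil.m implementation; the names fdsteep_sketch and fdgrad and the values alpha = 1e-4 and beta = 1/2 are assumptions, (7.2) is read as the test ‖∇_h f(x)‖ ≤ τh, (7.1) as the sufficient decrease condition f(x − λ∇_h f(x)) − f(x) < −αλ‖∇_h f(x)‖², and (6.7) as stencil failure, i.e., the center value is smaller than every value on the stencil.

function x = fdsteep_sketch(x, f, pmax, tau, h, amax)
% Sketch of Algorithm fdsteep: steepest descent with the difference
% gradient and a backtracking line search (alpha, beta are assumed values).
alpha = 1e-4; beta = 0.5;
for p = 1:pmax
    fc = f(x);
    [g, fbest] = fdgrad(f, x, h);
    if fc < fbest || norm(g) <= tau*h   % stencil failure (6.7) or test (7.2)
        return;
    end
    lambda = 1; success = 0;
    for m = 0:amax
        if f(x - lambda*g) - fc < -alpha*lambda*(g'*g)  % condition (7.1)
            success = 1; break;
        end
        lambda = beta*lambda;
    end
    if ~success, return; end            % line search failure
    x = x - lambda*g;
end
end

function [g, fbest] = fdgrad(f, x, h)
% Central difference gradient of f at x with increment h; fbest is the
% smallest value of f seen on the stencil, used for the stencil test.
n = length(x); g = zeros(n, 1); fbest = Inf;
for i = 1:n
    e = zeros(n, 1); e(i) = 1;
    fp = f(x + h*e); fm = f(x - h*e);
    g(i) = (fp - fm)/(2*h);
    fbest = min([fbest; fp; fm]);
end
end

A call such as fdsteep_sketch(x0, @myfun, 200*length(x0), 0.01, h, 10) runs the loop once at the scale h.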

Algorithm fdsteep will terminate after finitely many iterations because of the limits on
the number of iterations and the number of backtracks. If the set {x | f(x) ≤ f(x_0)} is bounded
then the iterations will remain in that set. Implicit filtering calls fdsteep repeatedly, reducing
h after each termination of fdsteep. Aside from the data needed by fdsteep, one must
provide a sequence of difference increments, called scales in [120].
Algorithm 7.1.2. imfilter1(x, f, pmax, τ, {h_k}, amax)
1. For k = 0, . . .
   Call fdsteep(x, f, pmax, τ, h_k, amax)
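Implicit filtering itself is then a single loop over the scales; the sketch below assumes the hypothetical fdsteep_sketch above and passes the scales {h_k} as a vector.

function x = imfilter1_sketch(x, f, pmax, tau, scales, amax)
% One sweep of implicit filtering: call the steepest descent loop once
% for each difference increment (scale), reusing the final iterate.
for k = 1:length(scales)
    x = fdsteep_sketch(x, f, pmax, tau, scales(k), amax);
end
end

For example, imfilter1_sketch(x0, @myfun, 100, 0.01, [1 0.5 0.25 0.125], 10) calls fdsteep_sketch once for each of the four scales.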

The convergence result follows from the second-order estimate, (6.5), the consequences of a
stencil failure, Theorem 6.2.9, and the equalities h_k = σ_+(S^k) and κ(V^k) = 1. A similar result
for forward differences would follow from (6.3).
Theorem 7.1.1. Let f satisfy (6.1) and let ∇f_s be Lipschitz continuous. Let h_k → 0, let {x_k}
be the implicit filtering sequence, and let S^k = S(x_k, h_k). Assume that (7.1) holds (i.e., there is no
line search failure) for all but finitely many k. Then if

(7.3)    lim_{k→∞} ( h_k + h_k^{-1} ‖φ‖_{S^k} ) = 0,

then any limit point of the sequence {x_k} is a critical point of f_s.
Proof. If either (7.2) or (6.7) hold for all but finitely many k then, as is standard,

    ∇_{h_k} f(x_k) = D_C(f : S^k) → 0.

Hence, using (7.3) and Lemma 6.2.2,

    ∇f_s(x_k) → 0,

as asserted.


7.2 Quasi-Newton Methods and Implicit Filtering
The unique feature of implicit filtering is the possibility, for problems that are sufficiently smooth
near a minimizer, to obtain faster convergence in the terminal phase of the iteration by using a
quasi-Newton update of a model Hessian. This idea was first proposed in [250] and [120].
We begin with a quasi-Newton form of Algorithm fdsteep. In this algorithm a quasi-
Newton approximation to the Hessian is maintained and the line search is based on the quasi-
Newton direction

    d = −H^{-1}∇_h f(x)



terminating when either

(7.4)    f(x + λd) − f(x) < αλ∇_h f(x)^T d

or too many stepsize reductions have been taken. With the application to implicit filtering in
mind, Algorithm fdquasi replaces the quasi-Newton H with the identity matrix when the line
search fails.
Algorithm 7.2.1. fdquasi(x, f, H, pmax, τ, h, amax)
1. For p = 1, . . . , pmax
   (a) Compute f, ∇_h f, and d = −H^{-1}∇_h f; terminate if (7.2) holds.
   (b) Find the least integer 0 ≤ m ≤ amax such that (7.4) holds for λ = β^m.
   (c) x = x + λd.
   (d) Update H with a quasi-Newton formula.
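The sketch below is one possible MATLAB realization of this loop with a BFGS update; it is not the imfil.m implementation. The helper fdgrad is the hypothetical difference gradient from the earlier sketch, alpha and beta are again assumed values, (7.2) is again read as ‖∇_h f(x)‖ ≤ τh, and the choice to reset H to the identity and continue after a line search failure is one reading of the description above.

function [x, H] = fdquasi_sketch(x, f, H, pmax, tau, h, amax)
% Sketch of Algorithm fdquasi: quasi-Newton direction d = -H\g with a
% backtracking line search on condition (7.4) and a BFGS update of H.
alpha = 1e-4; beta = 0.5;
for p = 1:pmax
    fc = f(x);
    g = fdgrad(f, x, h);
    if norm(g) <= tau*h, return; end        % termination test (7.2)
    d = -H\g;                               % quasi-Newton direction
    lambda = 1; success = 0;
    for m = 0:amax
        if f(x + lambda*d) - fc < alpha*lambda*(g'*d)   % condition (7.4)
            success = 1; break;
        end
        lambda = beta*lambda;
    end
    if ~success
        H = eye(length(x)); continue;       % reset H on line search failure
    end
    xp = x + lambda*d;
    gp = fdgrad(f, xp, h);
    s = xp - x; y = gp - g;
    if y'*s > 0                             % BFGS update of the model Hessian
        H = H - (H*(s*s')*H)/(s'*(H*s)) + (y*y')/(y'*s);
    end
    x = xp;
end
end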

In the context of implicit filtering, where N is small, the full quasi-Newton Hessian or its
inverse is maintained throughout the iteration. Our MATLAB codes store the model Hessian.
Algorithm 7.2.2. imfilter2(x, f, pmax, τ, {h_k}, amax)
1. H = I.
2. For k = 0, . . .
   Call fdquasi(x, f, H, pmax, τ, h_k, amax).
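In MATLAB the driver differs from the steepest descent version only in that the model Hessian is initialized once and carried from scale to scale; the sketch assumes fdquasi_sketch from above.

function x = imfilter2_sketch(x, f, pmax, tau, scales, amax)
% Quasi-Newton implicit filtering: initialize H = I and keep the model
% Hessian as the scale is reduced.
H = eye(length(x));
for k = 1:length(scales)
    [x, H] = fdquasi_sketch(x, f, H, pmax, tau, scales(k), amax);
end
end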

In [250] and [120] the SR1 method was used because it performed somewhat better than the
BFGS method in the context of a particular application. The examples in §7.6 show the opposite
effect, and both methods have been successfully used in practice.


7.3 Implementation Considerations
Implicit filtering has several iterative parameters and requires some algorithmic decisions in its
implementation. The parameters pmax, amax, and β play the same role that they do in any line
search algorithm. In our MATLAB code imfil.m, which we used for all the computations
reported in this book, we set pmax = 200n, amax = 10, and β = 1/2.
The performance of implicit filtering can be sensitive to the value of τ [250], with small values
of τ leading to stagnation and values of τ that are too large leading to premature termination of
fdquasi. Using stencil failure as a termination criterion reduces the sensitivity to small values
of τ, and we use τ = .01 in the computations.
The sequence of scales is at best a guess at the level of the noise in the problem. If several of
the scales are smaller than the level of the noise, the line search will fail immediately and work
at these scales will be wasted. Our implementation attempts to detect this by terminating the
optimization if x is unchanged for three consecutive scales.
The simplex gradient may be a very poor approximation to the gradient. In some such cases
the function evaluation at a trial point may fail to return a value [250], and one must either trap
this failure and return an artificially large value, impose bound constraints, or impose a limit on
the size of the step. In our computations we take the last approach and limit the stepsize to
10h by setting
(7.5)    d = −H^{-1}∇_h f(x)                         if ‖H^{-1}∇_h f(x)‖ ≤ 10h,
         d = −10h H^{-1}∇_h f(x) / ‖H^{-1}∇_h f(x)‖   otherwise.
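In MATLAB the cap in (7.5) is a two-line test; here g stands for ∇_h f(x) and H for the model Hessian, both assumed to already be in the workspace.

% Limit the quasi-Newton step to length 10h, as in (7.5).
d = -H\g;
if norm(d) > 10*h
    d = (10*h/norm(d))*d;
end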



The choice of a quasi-Newton method to use with implicit filtering is an area of active research
[56], [55]. Both SR1 and BFGS have been used, with SR1 performing modestly better in some
applications with bound constraints [270], [251], [271], [250], [55]. The implementation of
implicit filtering in the collection of MATLAB codes imfil.m uses BFGS as the default but
has SR1 as an option. We found BFGS with central differences to be consistently better in the
preparation of the (unconstrained!) computational examples in this book.


7.4 Implicit Filtering for Bound Constrained Problems
Implicit filtering was initially designed as an algorithm for bound constrained problems [250],
[120]. The bound constrained version we present here is simply a projected quasi-Newton
algorithm like the one presented in §5.5.3. There are other approaches to the implementation
and no best approach has emerged. We refer the reader to [120] and [55] for discussions of the
options.
We begin with scaling and the difference gradient. Central differences perform better, but
we do not evaluate f outside of the feasible region. Hence, if a point on the centered difference
stencil is outside of the feasible region, we use a one-sided difference in that direction. In order
to guarantee that at least one point in each direction is feasible, we scale the variables so that
L_i = 0, U_i = 1, and h_0 ≤ 1/2.
The sufficient decrease condition is (compare with (5.31))

(7.6)    f(x(λ)) − f(x) ≤ α∇_h f(x)^T (x(λ) − x),

where
    x(λ) = P(x − λ∇_h f(x)).

One could terminate the iteration at a given scale when the analogue to (7.2),

(7.7)    ‖x − x(1)‖ ≤ τh,

holds or when

(7.8)    f(x_c) < f(x ± r_j) for all x ± r_j feasible,

which is the analogue to (6.7) for bound constrained problems.
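A small MATLAB fragment for the projected trial point and the two tests might look as follows; it assumes the variables have been scaled to the box 0 ≤ x ≤ 1 and that g = ∇_h f(x), lambda, alpha, tau, and h are already in the workspace.

% Projection onto the scaled box [0,1]^n and the projected trial point.
P = @(z) min(max(z, 0), 1);
xl = P(x - lambda*g);                          % x(lambda)
suffdec = f(xl) - f(x) <= alpha*(g'*(xl - x)); % sufficient decrease (7.6)
done = norm(x - P(x - g)) <= tau*h;            % termination test (7.7)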
Quasi-Newton methods for bound constraints can be constructed more simply for small
problems, like the ones to which implicit filtering is applied, where it is practical to store the
model of the inverse of the reduced Hessian as a full matrix. By using full matrix storage, the
complexity of bfgsrecb is avoided. One such alternative [53], [54], [55] to the updates in
§5.5.3 is to update the complete reduced Hessian and then correct it with information from the
new active set. This results in a two-stage update in which a model for the inverse of the reduced
Hessian is updated with (4.5) to obtain

(7.9)    R_{1/2}^{-1} = ( I − s y^T/(y^T s) ) R_c^{-1} ( I − y s^T/(y^T s) ) + s s^T/(y^T s).

Then the new reduced Hessian is computed using the active set information at the new point:

(7.10)    R_+^{-1} = P_{A+} + P_{I+} R_{1/2}^{-1} P_{I+}.

It is easy to show that Theorems 5.5.4 and 5.5.5 hold for this form of the update.
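When the inverse reduced Hessian model is stored as a full matrix, the two-stage update is only a few lines of MATLAB. The sketch below is an illustration rather than any of the implementations cited above; the function name and the use of a logical vector to encode the new active set are assumptions.

function Rinv = reduced_bfgs_sketch(Rinv, s, y, active)
% Two-stage update of the inverse reduced Hessian model: the BFGS inverse
% update (7.9) followed by the active set correction (7.10).
n = length(s);
if y'*s > 0                                % keep the model positive definite
    V = eye(n) - (s*y')/(y'*s);
    Rinv = V*Rinv*V' + (s*s')/(y'*s);      % update (7.9)
end
PA = diag(double(active(:)));              % projection onto the new active set
PI = eye(n) - PA;                          % projection onto the inactive set
Rinv = PA + PI*Rinv*PI;                    % correction (7.10)
end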
A FORTRAN implementation [119] of implicit filtering for bound constrained problems is
in the software collection. In the original version of that implementation a projected SR1 update
was used and a Cholesky factorization of the matrix R_+ was performed to verify positivity. The
model Hessian was reinitialized to the identity whenever the scale or the active set changed.



7.5 Restarting and Minima at All Scales
No algorithm in this part of the book is guaranteed to find even a local minimum, much less
a global one. One approach to improving the robustness of these algorithms is to restart the
iteration after one sweep through the scales. A point x that is not changed after a call to
Algorithm imfilter1 (or imfilter2, or the bound constrained form of either) is called a
minimum at all scales.
If f satisfies (6.1), f_s has a unique critical point that is also a local minimum that satisfies the
standard assumptions (and hence is a global minimum for f_s), and certain (strong!) technical
assumptions on the decay of φ near the minimum hold, then [120] a minimum at all scales is near
that global minimum of f_s. In the unconstrained case this statement follows from the termination
criteria ((7.2) and (6.7)) for implicit filtering, Lemma 6.2.3 (or 6.2.7), and, if central differences
are used, Theorem 6.2.9. The analysis in [120] of the bound constrained case is more technical.
In practice, restarts are expensive and need not be done for most problems. However, restarts
have been reported to make a difference in some cases [178]. It is also comforting to know that
one has a minimum at all scales, and the author of this book recommends testing potential optima
with restarts before one uses the results in practice, but not at the stage where one is tuning the
optimizer or doing preliminary evaluation of the results.


7.6 Examples
Many of these examples are from [56]. For all the examples we report results with and without a
quasi-Newton Hessian. We report results for both forward and central differences. In the figures
the solid line corresponds to the BFGS Hessian, the dashed-dotted line to the SR1 Hessian, and
the dashed line to H = I, the steepest descent form of implicit filtering.
Unlike the smooth problems considered earlier, where convergence of the gradient to zero
was supported by theory, convergence of the simplex gradient to zero is limited by the noise in
the objective. We illustrate performance by plotting both the objective function value and the
norm of the simplex gradient. From these examples it is clear that the graphs of function
value against the count of function evaluations are a better indicator of the performance of the
optimizer.
In all cases we terminated the iteration when either fdquasi had been called for each scale
or a budget of function evaluations had been exhausted. Once the code completes an iteration
and the number of function evaluations is greater than or equal to the budget, the iteration is
terminated.
The examples include both smooth and nonsmooth problems, with and without noise. A
serious problem for some algorithms of this type is their failure on very easy problems. For
most of the algorithms covered in this part of the book, we will present examples that illustrate
performance on this collection of problems.

7.6.1 Weber's Problem
The three Weber's function examples all have minimizers at points at which the objective is
nondifferentiable. For the computations we used an initial iterate of (10, −10)^T, a budget of
200 function evaluations, and {10 · 2^{−n}}_{n=−2}^{8} as the sequence of scales.
In each of the examples the performance of the two quasi-Newton methods was virtually
identical and far better than that without a quasi-Newton model Hessian. Forward and central
differences for the first two problems (Figures 7.1 and 7.2) perform almost equally well, with
forward having a slight edge. In Figure 7.3, however, the forward difference version of implicit
filtering finds a local minimum different from the global minimum that is located by central
differences.






[Figure 7.1: First Weber Example. Panels for central and forward differences plot the simplex gradient norm and the function value against the number of function evaluations.]




[Figure 7.2: Second Weber Example. Panels for central and forward differences plot the simplex gradient norm and the function value against the number of function evaluations.]





[Figure 7.3: central difference panels plot the simplex gradient norm and the function value against the number of function evaluations.]