
In optimization, definiteness or semidefiniteness of the Hessian plays an important role in
the necessary and sufficient conditions for optimality that we discuss in §1.3 and §1.4 and in our
choice of algorithms throughout this book.
Definition 1.2.1. An $N \times N$ matrix $A$ is positive semidefinite if $x^T A x \ge 0$ for all $x \in \mathbb{R}^N$.
$A$ is positive definite if $x^T A x > 0$ for all $x \in \mathbb{R}^N$, $x \neq 0$. If $A$ has both positive and negative
eigenvalues, we say $A$ is indefinite. If $A$ is symmetric and positive definite, we will say $A$ is spd.
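Definition 1.2.1 can be checked numerically through the eigenvalues of a symmetric matrix. The following sketch is in Python/NumPy rather than the book's MATLAB, and the helper name `classify_definiteness` is my own:

```python
import numpy as np

def classify_definiteness(A, tol=1e-12):
    """Classify a symmetric matrix as spd, positive semidefinite,
    or indefinite by inspecting its eigenvalues."""
    eig = np.linalg.eigvalsh(A)          # eigenvalues of a symmetric matrix
    if eig.min() > tol:
        return "positive definite"
    if eig.min() >= -tol:
        return "positive semidefinite"
    if eig.max() > tol and eig.min() < -tol:
        return "indefinite"
    return "negative (semi)definite"

print(classify_definiteness(np.array([[2.0, 0.0], [0.0, 3.0]])))   # positive definite
print(classify_definiteness(np.array([[1.0, 0.0], [0.0, -1.0]])))  # indefinite
```

The tolerance `tol` guards against roundoff; for large matrices an attempted Cholesky factorization (see §1.5.1) is a cheaper test for the spd case.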

We will use two forms of the fundamental theorem of calculus, one for the function–gradient
pair and one for the gradient–Hessian pair.

Theorem 1.2.1. Let $f$ be twice continuously differentiable in a neighborhood of a line
segment between points $x^*$, $x = x^* + e \in \mathbb{R}^N$; then

$$f(x) = f(x^*) + \int_0^1 \nabla f(x^* + te)^T e \, dt$$



Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.
Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.




BASIC CONCEPTS 5

and

$$\nabla f(x) = \nabla f(x^*) + \int_0^1 \nabla^2 f(x^* + te)\, e \, dt.$$


A direct consequence (see Exercise 1.7.1) of Theorem 1.2.1 is the following form of Taylor's
theorem, which we will use throughout this book.

Theorem 1.2.2. Let $f$ be twice continuously differentiable in a neighborhood of a point
$x^* \in \mathbb{R}^N$. Then for $e \in \mathbb{R}^N$ with $\|e\|$ sufficiently small,

$$f(x^* + e) = f(x^*) + \nabla f(x^*)^T e + e^T \nabla^2 f(x^*) e/2 + o(\|e\|^2). \tag{1.7}$$
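Equation (1.7) can be verified numerically: the error of the quadratic model should vanish faster than $\|e\|^2$. The test function, direction, and step sizes in this Python/NumPy sketch are my own choices, not from the book:

```python
import numpy as np

# A smooth test function with hand-computed gradient and Hessian:
# f(x) = exp(x1) + x1 * x2^2
f = lambda x: np.exp(x[0]) + x[0] * x[1] ** 2
grad = lambda x: np.array([np.exp(x[0]) + x[1] ** 2, 2 * x[0] * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 2 * x[1]],
                           [2 * x[1], 2 * x[0]]])

x = np.array([0.3, -0.7])
d = np.array([1.0, 2.0])            # fixed direction; e = t*d with t -> 0

ratios = []
for t in [1e-1, 1e-2, 1e-3]:
    e = t * d
    model = f(x) + grad(x) @ e + 0.5 * e @ hess(x) @ e
    ratios.append(abs(f(x + e) - model) / np.linalg.norm(e) ** 2)

# the remainder is o(||e||^2): the ratios should tend to zero with t
print(ratios)
assert ratios[0] > ratios[1] > ratios[2]
```

For this $f$, which has bounded third derivatives, the ratio in fact decreases linearly in $t$, consistent with an $O(\|e\|^3)$ remainder.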




1.3 Necessary Conditions

Let $f$ be twice continuously differentiable. We will use Taylor's theorem in a simple way to
show that the gradient of $f$ vanishes at a local minimizer and the Hessian is positive semidefinite.
These are the necessary conditions for optimality.

The necessary conditions relate (1.1) to a nonlinear equation and allow one to use fast al-
gorithms for nonlinear equations [84], [154], [211] to compute minimizers. Therefore, the
necessary conditions for optimality will be used in a critical way in the discussion of local con-
vergence in Chapter 2. A critical first step in the design of an algorithm for a new optimization
problem is the formulation of necessary conditions. Of course, the gradient vanishes at a maxi-
mum, too, and the utility of the nonlinear equations formulation is restricted to a neighborhood
of a minimizer.
Theorem 1.3.1. Let $f$ be twice continuously differentiable and let $x^*$ be a local minimizer
of $f$. Then
$$\nabla f(x^*) = 0.$$
Moreover $\nabla^2 f(x^*)$ is positive semidefinite.

Proof. Let $u \in \mathbb{R}^N$ be given. Taylor's theorem states that for all real $t$ sufficiently small

$$f(x^* + tu) = f(x^*) + t \nabla f(x^*)^T u + \frac{t^2}{2} u^T \nabla^2 f(x^*) u + o(t^2).$$

Since $x^*$ is a local minimizer we must have for $t$ sufficiently small $0 \le f(x^* + tu) - f(x^*)$ and
hence

$$\nabla f(x^*)^T u + \frac{t}{2} u^T \nabla^2 f(x^*) u + o(t) \ge 0 \tag{1.8}$$

for all $t$ sufficiently small and all $u \in \mathbb{R}^N$. So if we set $t = 0$ and $u = -\nabla f(x^*)$ we obtain

$$\|\nabla f(x^*)\|^2 = 0.$$

Setting $\nabla f(x^*) = 0$, dividing by $t$, and setting $t = 0$ in (1.8), we obtain

$$\frac{1}{2} u^T \nabla^2 f(x^*) u \ge 0$$

for all $u \in \mathbb{R}^N$. This completes the proof.

The condition that $\nabla f(x^*) = 0$ is called the first-order necessary condition, and a point
satisfying that condition is called a stationary point or a critical point.






6 ITERATIVE METHODS FOR OPTIMIZATION

1.4 Sufficient Conditions

A stationary point need not be a minimizer. For example, the function $\phi(t) = -t^4$ satisfies the
necessary conditions at 0, which is a maximizer of $\phi$. To obtain a minimizer we must require that
the second derivative be nonnegative. This alone is not sufficient (think of $\phi(t) = t^3$) and only
if the second derivative is strictly positive can we be completely certain. These are the sufficient
conditions for optimality.

Theorem 1.4.1. Let $f$ be twice continuously differentiable in a neighborhood of $x^*$. Assume
that $\nabla f(x^*) = 0$ and that $\nabla^2 f(x^*)$ is positive definite. Then $x^*$ is a local minimizer of $f$.

Proof. Let $0 \neq u \in \mathbb{R}^N$. For sufficiently small $t$ we have

$$f(x^* + tu) = f(x^*) + t \nabla f(x^*)^T u + \frac{t^2}{2} u^T \nabla^2 f(x^*) u + o(t^2)
= f(x^*) + \frac{t^2}{2} u^T \nabla^2 f(x^*) u + o(t^2).$$

Hence, if $\lambda > 0$ is the smallest eigenvalue of $\nabla^2 f(x^*)$ we have

$$f(x^* + tu) - f(x^*) \ge \frac{\lambda}{2} \|tu\|^2 + o(t^2) > 0$$

for $t$ sufficiently small. Hence $x^*$ is a local minimizer.



1.5 Quadratic Objective Functions

The simplest optimization problems are those with quadratic objective functions. Here

$$f(x) = -x^T b + \frac{1}{2} x^T H x. \tag{1.9}$$

We may, without loss of generality, assume that $H$ is symmetric because

$$x^T H x = x^T \left( \frac{H + H^T}{2} \right) x. \tag{1.10}$$

Quadratic functions form the basis for most of the algorithms in Part I, which approximate an
objective function $f$ by a quadratic model and minimize that model. In this section we discuss
some elementary issues in quadratic optimization.

Clearly,
$$\nabla^2 f(x) = H$$
for all $x$. The symmetry of $H$ implies that
$$\nabla f(x) = -b + Hx.$$

Definition 1.5.1. The quadratic function $f$ in (1.9) is convex if $H$ is positive semidefinite.








1.5.1 Positive Definite Hessian

The necessary conditions for optimality imply that if a quadratic function $f$ has a local minimum
$x^*$, then $H$ is positive semidefinite and

$$Hx^* = b. \tag{1.11}$$

In particular, if $H$ is spd (and hence nonsingular), the unique global minimizer is the solution of
the linear system (1.11).

If $H$ is a dense matrix and $N$ is not too large, it is reasonable to solve (1.11) by computing
the Cholesky factorization [249], [127] of $H$,

$$H = LL^T,$$

where $L$ is a nonsingular lower triangular matrix with positive diagonal, and then solving (1.11)
by two triangular solves. If $H$ is indefinite the Cholesky factorization will not exist and the
standard implementation [127], [249], [264] will fail because the computation of the diagonal
of $L$ will require a real square root of a negative number or a division by zero.
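A sketch of the factor-and-solve approach, in Python/NumPy rather than the book's MATLAB (the helper name `solve_spd` is mine); `np.linalg.cholesky` indeed raises an error for a matrix that is not positive definite, matching the failure mode described above:

```python
import numpy as np

def solve_spd(H, b):
    """Solve Hx = b for spd H: factor H = L L^T (Cholesky), then do a
    forward solve L y = b followed by a back solve L^T x = y."""
    L = np.linalg.cholesky(H)   # raises LinAlgError if H is not positive definite
    n = len(b)
    y = np.empty(n)
    for i in range(n):                      # forward substitution: L y = b
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    x = np.empty(n)
    for i in range(n - 1, -1, -1):          # back substitution: L^T x = y
        x[i] = (y[i] - L[i + 1:, i] @ x[i + 1:]) / L[i, i]
    return x

H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = solve_spd(H, b)
print(np.allclose(H @ x, b))   # True
```

In practice one would call a library triangular solver rather than the explicit substitution loops shown here; the loops only make the "two triangular solves" visible.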
If $N$ is very large, $H$ is sparse, or a matrix representation of $H$ is not available, a more
efficient approach is the conjugate gradient iteration [154], [141]. This iteration requires only
matrix–vector products, a feature which we will use in a direct way in §§2.5 and 3.3.7. Our
formulation of the algorithm uses $x$ as both an input and output variable. On input $x$ contains
$x_0$, the initial iterate, and on output the approximate solution. We terminate the iteration if the
relative residual is sufficiently small, i.e.,

$$\|b - Hx\| \le \epsilon \|b\|,$$

or if too many iterations have been taken.

Algorithm 1.5.1. $\mathrm{cg}(x, b, H, \epsilon, kmax)$

1. $r = b - Hx$, $\rho_0 = \|r\|^2$, $k = 1$.

2. Do While $\sqrt{\rho_{k-1}} > \epsilon \|b\|$ and $k < kmax$

   (a) if $k = 1$ then $p = r$
       else $\beta = \rho_{k-1}/\rho_{k-2}$ and $p = r + \beta p$
   (b) $w = Hp$
   (c) $\alpha = \rho_{k-1}/p^T w$
   (d) $x = x + \alpha p$
   (e) $r = r - \alpha w$
   (f) $\rho_k = \|r\|^2$
   (g) $k = k + 1$

Note that if $H$ is not spd, the denominator in $\alpha = \rho_{k-1}/p^T w$ may vanish, resulting in
breakdown of the iteration.
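Algorithm 1.5.1 translates almost line for line into Python/NumPy (a sketch of mine, not the book's MATLAB code). Allowing $H$ to be a callable reflects the fact that only matrix-vector products are needed:

```python
import numpy as np

def cg(x, b, H, eps, kmax):
    """Conjugate gradient iteration for Hx = b, following Algorithm 1.5.1.
    H may be a matrix or a callable returning the product H @ v, since the
    iteration needs only matrix-vector products."""
    Hv = H if callable(H) else (lambda v: H @ v)
    r = b - Hv(x)
    rho_old, rho = None, np.dot(r, r)       # rho_0 = ||r||^2
    k = 1
    p = np.zeros_like(x)
    while np.sqrt(rho) > eps * np.linalg.norm(b) and k < kmax:
        if k == 1:
            p = r.copy()
        else:
            beta = rho / rho_old            # beta = rho_{k-1} / rho_{k-2}
            p = r + beta * p
        w = Hv(p)
        alpha = rho / np.dot(p, w)          # may break down if H is not spd
        x = x + alpha * p
        r = r - alpha * w
        rho_old, rho = rho, np.dot(r, r)    # rho_k = ||r||^2
        k += 1
    return x

H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(np.zeros(2), b, H, 1e-10, 100)
print(np.linalg.norm(b - H @ x))
```

On this $2 \times 2$ spd example the iteration reaches the solution in at most two steps, as the finite termination property below predicts.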
The conjugate gradient iteration minimizes $f$ over an increasing sequence of nested subspaces
of $\mathbb{R}^N$ [127], [154]. We have that

$$f(x_k) \le f(x) \quad \text{for all } x \in x_0 + \mathcal{K}_k,$$






where $\mathcal{K}_k$ is the Krylov subspace

$$\mathcal{K}_k = \operatorname{span}(r_0, H r_0, \ldots, H^{k-1} r_0)$$

for $k \ge 1$.
While in principle the iteration must converge after $N$ iterations and conjugate gradient can
be regarded as a direct solver, $N$ is, in practice, far too many iterations for the large problems to
which conjugate gradient is applied. As an iterative method, the performance of the conjugate
gradient algorithm depends both on $b$ and on the spectrum of $H$ (see [154] and the references
cited therein). A general convergence estimate [68], [60], which will suffice for the discussion
here, is

$$\|x_k - x^*\|_H \le 2 \, \|x_0 - x^*\|_H \left( \frac{\sqrt{\kappa(H)} - 1}{\sqrt{\kappa(H)} + 1} \right)^k. \tag{1.12}$$

In (1.12), the $H$-norm of a vector is defined by

$$\|u\|_H^2 = u^T H u$$

for an spd matrix $H$. $\kappa(H)$ is the $l^2$ condition number

$$\kappa(H) = \|H\| \, \|H^{-1}\|.$$

For spd $H$,
$$\kappa(H) = \lambda_l \lambda_s^{-1},$$
where $\lambda_l$ and $\lambda_s$ are the largest and smallest eigenvalues of $H$. Geometrically, $\kappa(H)$ is large if
the ellipsoidal level surfaces of $f$ are very far from spherical.
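As a concrete illustration (the diagonal example is my own), $\kappa(H)$ for an spd matrix can be computed from the eigenvalues and agrees with NumPy's 2-norm condition number:

```python
import numpy as np

# H = diag(100, 1) has condition number 100; the elliptical level sets of f
# have axes whose lengths differ by a factor sqrt(kappa) = 10.
H = np.diag([100.0, 1.0])
lam = np.linalg.eigvalsh(H)          # eigenvalues in ascending order
kappa = lam[-1] / lam[0]             # lambda_l / lambda_s
print(kappa, np.linalg.cond(H, 2))   # both approximately 100.0
```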
The conjugate gradient iteration will perform well if $\kappa(H)$ is near 1 and may perform very
poorly if $\kappa(H)$ is large. The performance can be improved by preconditioning, which transforms
(1.11) into a system with a coefficient matrix having eigenvalues near 1. Suppose that $M$ is spd and
a sufficiently good approximation to $H^{-1}$ so that

$$\kappa(M^{1/2} H M^{1/2})$$

is much smaller than $\kappa(H)$. In that case, (1.12) would indicate that far fewer conjugate gradient
iterations might be needed to solve

$$M^{1/2} H M^{1/2} z = M^{1/2} b \tag{1.13}$$

than to solve (1.11). Moreover, the solution $x^*$ of (1.11) could be recovered from the solution
$z^*$ of (1.13) by

$$x^* = M^{1/2} z^*. \tag{1.14}$$

In practice, the square root of the preconditioning matrix $M$ need not be computed. The algo-
rithm, using the same conventions that we used for cg, is
Algorithm 1.5.2. $\mathrm{pcg}(x, b, H, M, \epsilon, kmax)$

1. $r = b - Hx$, $\rho_0 = \|r\|^2$, $k = 1$

2. Do While $\sqrt{\rho_{k-1}} > \epsilon \|b\|$ and $k < kmax$

   (a) $z = Mr$
   (b) $\tau_{k-1} = z^T r$
   (c) if $k = 1$ then $\beta = 0$ and $p = z$
       else $\beta = \tau_{k-1}/\tau_{k-2}$, $p = z + \beta p$
   (d) $w = Hp$
   (e) $\alpha = \tau_{k-1}/p^T w$
   (f) $x = x + \alpha p$
   (g) $r = r - \alpha w$
   (h) $\rho_k = r^T r$
   (i) $k = k + 1$

Note that only products of $M$ with vectors in $\mathbb{R}^N$ are needed and that a matrix representation
of $M$ need not be stored. We refer the reader to [11], [15], [127], and [154] for more discussion
of preconditioners and their construction.
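Algorithm 1.5.2 transcribed into Python/NumPy under the same conventions (a sketch of mine; the Jacobi preconditioner in the example is my own choice, made only so the code is self-contained):

```python
import numpy as np

def pcg(x, b, H, M, eps, kmax):
    """Preconditioned conjugate gradient for Hx = b, following Algorithm
    1.5.2. Both H and M are applied only through matrix-vector products."""
    Hv = H if callable(H) else (lambda v: H @ v)
    Mv = M if callable(M) else (lambda v: M @ v)
    r = b - Hv(x)
    rho = np.dot(r, r)                  # rho_0 = ||r||^2
    k = 1
    tau_old = None
    p = np.zeros_like(x)
    while np.sqrt(rho) > eps * np.linalg.norm(b) and k < kmax:
        z = Mv(r)
        tau = np.dot(z, r)              # tau_{k-1} = z^T r
        if k == 1:
            p = z.copy()
        else:
            beta = tau / tau_old        # beta = tau_{k-1} / tau_{k-2}
            p = z + beta * p
        w = Hv(p)
        alpha = tau / np.dot(p, w)
        x = x + alpha * p
        r = r - alpha * w
        rho = np.dot(r, r)              # rho_k
        tau_old = tau
        k += 1
    return x

H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
M = np.diag(1.0 / np.diag(H))           # Jacobi preconditioner, M ~ H^{-1}
x = pcg(np.zeros(2), b, H, M, 1e-10, 100)
```

Note that, as in the text, only the product $Mr$ is formed; a matrix representation of $M$ is used here purely for convenience in the example.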


1.5.2 Indefinite Hessian

If $H$ is indefinite, the necessary conditions, Theorem 1.3.1, imply that there will be no local
minimum. Even so, it will be important to understand some properties of quadratic problems
with indefinite Hessians when we design algorithms with initial iterates far from local minimizers,
and we discuss some of the issues here.

If
$$u^T H u < 0,$$
we say that $u$ is a direction of negative curvature. If $u$ is a direction of negative curvature, then
$f(x + tu)$ will decrease to $-\infty$ as $t \to \infty$.
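A two-dimensional illustration (the choice of $H$ is my own): along a direction of negative curvature the quadratic objective decreases without bound.

```python
import numpy as np

# Indefinite example: H = diag(1, -1), b = 0, and u = (0, 1) is a direction
# of negative curvature since u^T H u = -1 < 0.
H = np.diag([1.0, -1.0])
b = np.zeros(2)
f = lambda x: -b @ x + 0.5 * x @ H @ x        # the quadratic (1.9)
u = np.array([0.0, 1.0])
assert u @ H @ u < 0                           # negative curvature
vals = [float(f(t * u)) for t in (1.0, 10.0, 100.0)]
print(vals)                                    # [-0.5, -50.0, -5000.0]
```

Here $f(tu) = -t^2/2$, so $f$ decreases quadratically in $t$, exactly the unboundedness the text describes.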


1.6 Examples
It will be useful to have some example problems to solve as we develop the algorithms. The
examples here are included to encourage the reader to experiment with the algorithms and play
with the MATLAB codes. The codes for the problems themselves are included with the set of
MATLAB codes. The author of this book does not encourage the reader to regard the examples
as anything more than examples. In particular, they are not real-world problems, and should not
be used as an exhaustive test suite for a code. While there are documented collections of test
problems (for example, [10] and [26]), the reader should always evaluate and compare algorithms
in the context of his/her own problems.
Some of the problems are directly related to applications. When that is the case we will cite
some of the relevant literature. Other examples are included because they are small, simple, and
illustrate important effects that can be hidden by the complexity of more serious problems.


1.6.1 Discrete Optimal Control
This is a classic example of a problem in which gradient evaluations cost little more than function
evaluations.
We begin with the continuous optimal control problems and discuss how gradients are com-
puted and then move to the discretizations. We will not dwell on the functional analytic issues
surrounding the rigorous definition of gradients of maps on function spaces, but the reader should
be aware that control problems require careful attention to this. The most important results can







be found in [151]. The function space setting for the particular control problems of interest in this
section can be found in [170], [158], and [159], as can a discussion of more general problems.
The infinite-dimensional problem is

$$\min_u f, \tag{1.15}$$

where

$$f(u) = \int_0^T L(y(t), u(t), t) \, dt, \tag{1.16}$$

and we seek an optimal point $u \in L^\infty[0, T]$. $u$ is called the control variable or simply the
control. The function $L$ is given and $y$, the state variable, satisfies the initial value problem
(with $\dot{y} = dy/dt$)

$$\dot{y}(t) = \phi(y(t), u(t), t), \quad y(0) = y_0. \tag{1.17}$$

One could view the problem (1.15)–(1.17) as a constrained optimization problem or, as we
do here, think of the evaluation of $f$ as requiring the solution of (1.17) before the integral on the
right side of (1.16) can be evaluated. This means that evaluation of $f$ requires the solution of
(1.17), which is called the state equation.

$\nabla f(u)$, the gradient of $f$ at $u$ with respect to the $L^2$ inner product, is uniquely determined,
if it exists, by

$$f(u + w) - f(u) - \int_0^T (\nabla f(u))(t) \, w(t) \, dt = o(\|w\|) \tag{1.18}$$

as $w \to 0$, uniformly in $w$. If $\nabla f(u)$ exists then

$$\int_0^T (\nabla f(u))(t) \, w(t) \, dt = \left. \frac{d f(u + \xi w)}{d\xi} \right|_{\xi = 0}.$$

If $L$ and $\phi$ are continuously differentiable, then $\nabla f(u)$, as a function of $t$, is given by

$$\nabla f(u)(t) = p(t) \, \phi_u(y(t), u(t), t) + L_u(y(t), u(t), t). \tag{1.19}$$

In (1.19) $p$, the adjoint variable, satisfies the final-value problem on $[0, T]$

$$-\dot{p}(t) = p(t) \, \phi_y(y(t), u(t), t) + L_y(y(t), u(t), t), \quad p(T) = 0. \tag{1.20}$$
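The computation that (1.19)–(1.20) describe can be sketched for a discretized problem. The forward Euler discretization, the rectangle rule for the integral, and the example dynamics and cost below are my own choices for illustration, not the book's discretization; the corresponding discrete adjoint recursion then gives the exact gradient of the discrete objective, which we check against a centered difference:

```python
import numpy as np

# Example data (assumptions, not from the book): dynamics phi(y,u) = -y + u
# and running cost L(y,u) = (y^2 + u^2)/2 on [0, T], scalar state.
T, N, y0 = 1.0, 200, 1.0
h = T / N
phi   = lambda y, u: -y + u
phi_y = lambda y, u: -1.0
phi_u = lambda y, u: 1.0
L   = lambda y, u: 0.5 * (y ** 2 + u ** 2)
L_y = lambda y, u: y
L_u = lambda y, u: u

def objective_and_gradient(u):
    # state solve by forward Euler, objective by the rectangle rule
    y = np.empty(N + 1)
    y[0] = y0
    for i in range(N):
        y[i + 1] = y[i] + h * phi(y[i], u[i])
    f = h * sum(L(y[i], u[i]) for i in range(N))
    # adjoint solve backward in time with p_N = 0, the discrete (1.20)
    p = np.zeros(N + 1)
    for i in range(N - 1, 0, -1):
        p[i] = p[i + 1] * (1.0 + h * phi_y(y[i], u[i])) + h * L_y(y[i], u[i])
    # discrete analogue of (1.19): gradient w.r.t. each control value u_i
    g = np.array([h * (L_u(y[i], u[i]) + p[i + 1] * phi_u(y[i], u[i]))
                  for i in range(N)])
    return f, g

u = np.linspace(0.0, 1.0, N)
f0, g = objective_and_gradient(u)

# sanity check: compare one gradient component with a centered difference
i, d = 7, 1e-6
up, um = u.copy(), u.copy()
up[i] += d
um[i] -= d
fd = (objective_and_gradient(up)[0] - objective_and_gradient(um)[0]) / (2 * d)
print(abs(g[i] - fd))   # small; the adjoint gradient matches the objective
```

As the section's opening remark promises, the gradient here costs little more than a function evaluation: one forward sweep for the state, one backward sweep for the adjoint.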
