In optimization, definiteness or semidefiniteness of the Hessian plays an important role in the necessary and sufficient conditions for optimality that we discuss in §1.3 and §1.4 and in our choice of algorithms throughout this book.

Definition 1.2.1. An N × N matrix A is positive semidefinite if x^T A x ≥ 0 for all x ∈ R^N. A is positive definite if x^T A x > 0 for all x ∈ R^N, x ≠ 0. If A has both positive and negative eigenvalues, we say A is indefinite. If A is symmetric and positive definite, we will say A is spd.

We will use two forms of the fundamental theorem of calculus, one for the function–gradient pair and one for the gradient–Hessian pair.

Theorem 1.2.1. Let f be twice continuously differentiable in a neighborhood of a line segment between points x*, x = x* + e ∈ R^N; then

f(x) = f(x*) + ∫_0^1 ∇f(x* + te)^T e dt

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

BASIC CONCEPTS 5

and

∇f(x) = ∇f(x*) + ∫_0^1 ∇²f(x* + te) e dt.

A direct consequence (see Exercise 1.7.1) of Theorem 1.2.1 is the following form of Taylor's theorem, which we will use throughout this book.

Theorem 1.2.2. Let f be twice continuously differentiable in a neighborhood of a point x* ∈ R^N. Then for e ∈ R^N and ‖e‖ sufficiently small

(1.7)    f(x* + e) = f(x*) + ∇f(x*)^T e + e^T ∇²f(x*) e / 2 + o(‖e‖²).
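A quick numerical check of (1.7) can be illuminating: after subtracting the quadratic model, the remainder should vanish faster than ‖e‖². The Python sketch below uses an arbitrary smooth function and expansion point (both made up for the illustration, not taken from the text); for this f the remainder is in fact O(‖e‖³), so halving ‖e‖ shrinks it by roughly a factor of 8.

```python
import numpy as np

# Illustrative choices (not from the text): a smooth f with known
# gradient and Hessian, and an arbitrary expansion point x*.
def f(x):
    return np.exp(x[0]) + x[1] ** 4

def grad_f(x):
    return np.array([np.exp(x[0]), 4 * x[1] ** 3])

def hess_f(x):
    return np.array([[np.exp(x[0]), 0.0], [0.0, 12 * x[1] ** 2]])

x_star = np.array([0.3, 0.7])

# Remainder of (1.7) for a sequence of shrinking perturbations e.
remainders = []
for h in [1e-1, 5e-2, 2.5e-2]:
    e = h * np.array([1.0, -1.0])
    model = f(x_star) + grad_f(x_star) @ e + 0.5 * e @ hess_f(x_star) @ e
    remainders.append(abs(f(x_star + e) - model))
```

Each halving of h should cut the remainder by much more than the factor of 4 that an O(‖e‖²) term would give.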

1.3 Necessary Conditions

Let f be twice continuously differentiable. We will use Taylor's theorem in a simple way to show that the gradient of f vanishes at a local minimizer and the Hessian is positive semidefinite. These are the necessary conditions for optimality.

The necessary conditions relate (1.1) to a nonlinear equation and allow one to use fast algorithms for nonlinear equations [84], [154], [211] to compute minimizers. Therefore, the necessary conditions for optimality will be used in a critical way in the discussion of local convergence in Chapter 2. A critical first step in the design of an algorithm for a new optimization problem is the formulation of necessary conditions. Of course, the gradient vanishes at a maximum, too, and the utility of the nonlinear equations formulation is restricted to a neighborhood of a minimizer.

Theorem 1.3.1. Let f be twice continuously differentiable and let x* be a local minimizer of f. Then

∇f(x*) = 0.

Moreover, ∇²f(x*) is positive semidefinite.

Proof. Let u ∈ R^N be given. Taylor's theorem states that for all real t sufficiently small

f(x* + tu) = f(x*) + t ∇f(x*)^T u + (t²/2) u^T ∇²f(x*) u + o(t²).

Since x* is a local minimizer we must have for t sufficiently small 0 ≤ f(x* + tu) − f(x*) and hence

(1.8)    ∇f(x*)^T u + (t/2) u^T ∇²f(x*) u + o(t) ≥ 0

for all t sufficiently small and all u ∈ R^N. So if we set t = 0 and u = −∇f(x*) we obtain

‖∇f(x*)‖² = 0.

Setting ∇f(x*) = 0, dividing by t, and setting t = 0 in (1.8), we obtain

(1/2) u^T ∇²f(x*) u ≥ 0

for all u ∈ R^N. This completes the proof.

The condition that ∇f(x*) = 0 is called the first-order necessary condition and a point satisfying that condition is called a stationary point or a critical point.


6 ITERATIVE METHODS FOR OPTIMIZATION

1.4 Sufficient Conditions

A stationary point need not be a minimizer. For example, the function φ(t) = −t⁴ satisfies the necessary conditions at 0, which is a maximizer of φ. To obtain a minimizer we must require that the second derivative be nonnegative. This alone is not sufficient (think of φ(t) = t³) and only if the second derivative is strictly positive can we be completely certain. These are the sufficient conditions for optimality.

Theorem 1.4.1. Let f be twice continuously differentiable in a neighborhood of x*. Assume that ∇f(x*) = 0 and that ∇²f(x*) is positive definite. Then x* is a local minimizer of f.

Proof. Let 0 ≠ u ∈ R^N. For sufficiently small t we have

f(x* + tu) = f(x*) + t ∇f(x*)^T u + (t²/2) u^T ∇²f(x*) u + o(t²)
           = f(x*) + (t²/2) u^T ∇²f(x*) u + o(t²).

Hence, if λ > 0 is the smallest eigenvalue of ∇²f(x*) we have

f(x* + tu) − f(x*) ≥ (λ/2) ‖tu‖² + o(t²) > 0

for t sufficiently small. Hence x* is a local minimizer.

1.5 Quadratic Objective Functions

The simplest optimization problems are those with quadratic objective functions. Here

(1.9)    f(x) = −x^T b + (1/2) x^T H x.

We may, without loss of generality, assume that H is symmetric because

(1.10)    x^T H x = x^T ((H + H^T)/2) x.
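Numerically, (1.10) says the quadratic form is blind to the skew-symmetric part of H. A quick check in Python (the matrix and vector below are arbitrary examples, not from the text):

```python
import numpy as np

# Check of (1.10): a quadratic form only sees the symmetric part of H.
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 5))   # a generic, nonsymmetric matrix
H_sym = (H + H.T) / 2             # its symmetric part
x = rng.standard_normal(5)

q_full = x @ H @ x
q_sym = x @ H_sym @ x             # equal to q_full, by (1.10)
```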

Quadratic functions form the basis for most of the algorithms in Part I, which approximate an

objective function f by a quadratic model and minimize that model. In this section we discuss

some elementary issues in quadratic optimization.

Clearly,

∇²f(x) = H

for all x. The symmetry of H implies that

∇f(x) = −b + Hx.

Definition 1.5.1. The quadratic function f in (1.9) is convex if H is positive semidefinite.


1.5.1 Positive Definite Hessian

The necessary conditions for optimality imply that if a quadratic function f has a local minimum x*, then H is positive semidefinite and

(1.11)    H x* = b.

In particular, if H is spd (and hence nonsingular), the unique global minimizer is the solution of the linear system (1.11).

If H is a dense matrix and N is not too large, it is reasonable to solve (1.11) by computing the Cholesky factorization [249], [127] of H,

H = L L^T,

where L is a nonsingular lower triangular matrix with positive diagonal, and then solving (1.11) by two triangular solves. If H is indefinite the Cholesky factorization will not exist and the standard implementation [127], [249], [264] will fail because the computation of the diagonal of L will require a real square root of a negative number or a division by zero.
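A sketch of the factor-and-solve approach in Python with NumPy (the substitution routines and the test matrix below are illustrative, not from the book):

```python
import numpy as np

def forward_solve(L, b):
    # Solve L y = b for lower triangular L by forward substitution.
    y = np.zeros_like(b)
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def backward_solve(U, y):
    # Solve U x = y for upper triangular U by back substitution.
    n = len(y)
    x = np.zeros_like(y)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

# A made-up spd test problem: A A^T is positive semidefinite,
# and adding a multiple of the identity makes H positive definite.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
H = A @ A.T + 4.0 * np.eye(4)
b = rng.standard_normal(4)

L = np.linalg.cholesky(H)       # H = L L^T, L lower triangular
x = backward_solve(L.T, forward_solve(L, b))
```

Consistent with the remark above, `np.linalg.cholesky` raises an error when H is not positive definite, since no real factor L exists.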

If N is very large, H is sparse, or a matrix representation of H is not available, a more efficient approach is the conjugate gradient iteration [154], [141]. This iteration requires only matrix–vector products, a feature which we will use in a direct way in §§2.5 and 3.3.7. Our formulation of the algorithm uses x as both an input and output variable. On input x contains x_0, the initial iterate, and on output the approximate solution. We terminate the iteration if the relative residual is sufficiently small, i.e.,

‖b − Hx‖ ≤ ε‖b‖

or if too many iterations have been taken.

Algorithm 1.5.1. cg(x, b, H, ε, kmax)

1. r = b − Hx, ρ_0 = ‖r‖², k = 1.

2. Do While √ρ_{k−1} > ε‖b‖ and k < kmax

   (a) if k = 1 then p = r
       else
       β = ρ_{k−1}/ρ_{k−2} and p = r + βp

   (b) w = Hp

   (c) α = ρ_{k−1}/p^T w

   (d) x = x + αp

   (e) r = r − αw

   (f) ρ_k = ‖r‖²

   (g) k = k + 1

Note that if H is not spd, the denominator in α = ρ_{k−1}/p^T w may vanish, resulting in breakdown of the iteration.
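Algorithm cg can be transcribed almost line for line; the Python sketch below (not the book's MATLAB code, and with a made-up spd test matrix) follows the same conventions, with x serving as both initial iterate and returned approximate solution.

```python
import numpy as np

def cg(x, b, H, eps, kmax):
    # Conjugate gradient for Hx = b with spd H (Algorithm 1.5.1).
    # H need only support H @ p, so a sparse matrix or linear
    # operator works as well as a dense array.
    x = x.copy()
    r = b - H @ x
    rho_prev = None             # rho_{k-2}
    rho = np.dot(r, r)          # rho_{k-1}, starting with rho_0
    bnorm = np.linalg.norm(b)
    k = 1
    p = None
    while np.sqrt(rho) > eps * bnorm and k < kmax:
        if k == 1:
            p = r.copy()
        else:
            beta = rho / rho_prev
            p = r + beta * p
        w = H @ p
        alpha = rho / np.dot(p, w)   # breaks down if p^T w = 0 (H not spd)
        x = x + alpha * p
        r = r - alpha * w
        rho_prev, rho = rho, np.dot(r, r)
        k += 1
    return x

# Made-up spd test problem for the demonstration.
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
Hmat = A @ A.T + 6.0 * np.eye(6)
bvec = rng.standard_normal(6)
xk = cg(np.zeros(6), bvec, Hmat, 1e-10, 100)
```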

The conjugate gradient iteration minimizes f over an increasing sequence of nested subspaces of R^N [127], [154]. We have that

f(x_k) ≤ f(x) for all x ∈ x_0 + K_k,


where K_k is the Krylov subspace

K_k = span(r_0, H r_0, . . . , H^{k−1} r_0)

for k ≥ 1.

While in principle the iteration must converge after N iterations and conjugate gradient can

be regarded as a direct solver, N is, in practice, far too many iterations for the large problems to

which conjugate gradient is applied. As an iterative method, the performance of the conjugate

gradient algorithm depends both on b and on the spectrum of H (see [154] and the references

cited therein). A general convergence estimate [68], [60], which will suffice for the discussion

here, is

(1.12)    ‖x_k − x*‖_H ≤ 2 ‖x_0 − x*‖_H ( (√κ(H) − 1) / (√κ(H) + 1) )^k.

In (1.12), the H-norm of a vector is defined by

‖u‖_H² = u^T H u

for an spd matrix H. κ(H) is the l² condition number

κ(H) = ‖H‖ ‖H^{−1}‖.

For spd H,

κ(H) = λ_l λ_s^{−1},

where λ_l and λ_s are the largest and smallest eigenvalues of H. Geometrically, κ(H) is large if the ellipsoidal level surfaces of f are very far from spherical.

The conjugate gradient iteration will perform well if κ(H) is near 1 and may perform very poorly if κ(H) is large. The performance can be improved by preconditioning, which transforms (1.11) into one with a coefficient matrix having eigenvalues near 1. Suppose that M is spd and a sufficiently good approximation to H^{−1} so that

κ(M^{1/2} H M^{1/2})

is much smaller than κ(H). In that case, (1.12) would indicate that far fewer conjugate gradient iterations might be needed to solve

(1.13)    M^{1/2} H M^{1/2} z = M^{1/2} b

than to solve (1.11). Moreover, the solution x* of (1.11) could be recovered from the solution z* of (1.13) by

(1.14)    x* = M^{1/2} z*.

In practice, the square root of the preconditioning matrix M need not be computed. The algorithm, using the same conventions that we used for cg, is

Algorithm 1.5.2. pcg(x, b, H, M, ε, kmax)

1. r = b − Hx, ρ_0 = ‖r‖², k = 1

2. Do While √ρ_{k−1} > ε‖b‖ and k < kmax

   (a) z = Mr

   (b) τ_{k−1} = z^T r


   (c) if k = 1 then β = 0 and p = z
       else
       β = τ_{k−1}/τ_{k−2}, p = z + βp

   (d) w = Hp

   (e) α = τ_{k−1}/p^T w

   (f) x = x + αp

   (g) r = r − αw

   (h) ρ_k = r^T r

   (i) k = k + 1

Note that only products of M with vectors in RN are needed and that a matrix representation

of M need not be stored. We refer the reader to [11], [15], [127], and [154] for more discussion

of preconditioners and their construction.
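Algorithm pcg admits the same direct transcription as cg. In the Python sketch below (illustrative, not the book's code), the preconditioner is a Jacobi (diagonal) scaling of a badly scaled spd matrix; both the matrix and the choice of M are made up for the demonstration.

```python
import numpy as np

def pcg(x, b, H, M, eps, kmax):
    # Preconditioned conjugate gradient (Algorithm 1.5.2).
    # Only products with H and M are required; M approximates H^{-1}.
    x = x.copy()
    r = b - H @ x
    rho = np.dot(r, r)
    bnorm = np.linalg.norm(b)
    k = 1
    tau_prev = None
    p = None
    while np.sqrt(rho) > eps * bnorm and k < kmax:
        z = M @ r                 # apply the preconditioner
        tau = np.dot(z, r)
        if k == 1:
            p = z.copy()
        else:
            beta = tau / tau_prev
            p = z + beta * p
        w = H @ p
        alpha = tau / np.dot(p, w)
        x = x + alpha * p
        r = r - alpha * w
        rho = np.dot(r, r)
        tau_prev = tau
        k += 1
    return x

# Hypothetical example: a badly scaled spd matrix with Jacobi preconditioning.
rng = np.random.default_rng(3)
d = 10.0 ** rng.uniform(0.0, 4.0, size=8)     # diagonal spanning 4 orders of magnitude
Hmat = np.diag(d) + 0.1 * np.ones((8, 8))     # spd: pd diagonal plus a psd rank-one part
bvec = rng.standard_normal(8)
M = np.diag(1.0 / np.diag(Hmat))              # Jacobi preconditioner, spd
xk = pcg(np.zeros(8), bvec, Hmat, M, 1e-10, 200)
```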

1.5.2 Indefinite Hessian

If H is indefinite, the necessary conditions, Theorem 1.3.1, imply that there will be no local minimum. Even so, it will be important to understand some properties of quadratic problems with indefinite Hessians when we design algorithms with initial iterates far from local minimizers, and we discuss some of the issues here.

If

u^T H u < 0,

we say that u is a direction of negative curvature. If u is a direction of negative curvature, then f(x + tu) will decrease to −∞ as t → ∞.
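A small illustration in Python (the indefinite matrix below is a made-up example): along a direction of negative curvature the quadratic (1.9) is unbounded below.

```python
import numpy as np

# An indefinite H and a direction of negative curvature for it.
H = np.array([[1.0, 0.0],
              [0.0, -2.0]])
b = np.zeros(2)

def f(x):
    # The quadratic objective (1.9).
    return -b @ x + 0.5 * x @ H @ x

u = np.array([0.0, 1.0])          # u^T H u = -2 < 0: negative curvature
vals = [f(t * u) for t in [1.0, 10.0, 100.0]]   # f(t u) = -t^2, unbounded below
```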

1.6 Examples

It will be useful to have some example problems to solve as we develop the algorithms. The

examples here are included to encourage the reader to experiment with the algorithms and play

with the MATLAB codes. The codes for the problems themselves are included with the set of

MATLAB codes. The author of this book does not encourage the reader to regard the examples

as anything more than examples. In particular, they are not real-world problems, and should not

be used as an exhaustive test suite for a code. While there are documented collections of test

problems (for example, [10] and [26]), the reader should always evaluate and compare algorithms

in the context of his/her own problems.

Some of the problems are directly related to applications. When that is the case we will cite

some of the relevant literature. Other examples are included because they are small, simple, and

illustrate important effects that can be hidden by the complexity of more serious problems.

1.6.1 Discrete Optimal Control

This is a classic example of a problem in which gradient evaluations cost little more than function

evaluations.

We begin with the continuous optimal control problems and discuss how gradients are computed and then move to the discretizations. We will not dwell on the functional analytic issues surrounding the rigorous definition of gradients of maps on function spaces, but the reader should be aware that control problems require careful attention to this. The most important results can


be found in [151]. The function space setting for the particular control problems of interest in this

section can be found in [170], [158], and [159], as can a discussion of more general problems.

The infinite-dimensional problem is

(1.15)    min_u f,

where

(1.16)    f(u) = ∫_0^T L(y(t), u(t), t) dt,

and we seek an optimal point u ∈ L^∞[0, T]. u is called the control variable or simply the control. The function L is given and y, the state variable, satisfies the initial value problem (with ẏ = dy/dt)

(1.17)    ẏ(t) = φ(y(t), u(t), t),    y(0) = y_0.

One could view the problem (1.15)–(1.17) as a constrained optimization problem or, as we do here, think of the evaluation of f as requiring the solution of (1.17) before the integral on the right side of (1.16) can be evaluated. This means that evaluation of f requires the solution of (1.17), which is called the state equation.

∇f(u), the gradient of f at u with respect to the L² inner product, is uniquely determined, if it exists, by

(1.18)    f(u + w) − f(u) − ∫_0^T (∇f(u))(t) w(t) dt = o(‖w‖)

as w → 0, uniformly in w. If ∇f(u) exists then

∫_0^T (∇f(u))(t) w(t) dt = (d/dξ) f(u + ξw) |_{ξ=0}.

If L and φ are continuously differentiable, then ∇f(u), as a function of t, is given by

(1.19)    ∇f(u)(t) = p(t) φ_u(y(t), u(t), t) + L_u(y(t), u(t), t).

In (1.19) p, the adjoint variable, satisfies the final-value problem on [0, T]

(1.20)    −ṗ(t) = p(t) φ_y(y(t), u(t), t) + L_y(y(t), u(t), t),    p(T) = 0.
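As an illustration of (1.16)–(1.20), the Python sketch below discretizes a hypothetical scalar control problem, with L = (y² + u²)/2 and ẏ = −y + u (arbitrary choices, not from the text, giving φ_y = −1, φ_u = 1, L_y = y, L_u = u). The state equation is integrated by forward Euler, the adjoint equation (1.20) backward in time, and the gradient is assembled from (1.19). Since the continuous adjoint is only discretized, this gradient matches a finite-difference derivative of the discrete objective to O(Δt), not exactly.

```python
import numpy as np

# Hypothetical problem data: minimize f(u) = ∫_0^T (y^2 + u^2)/2 dt
# subject to y' = -y + u, y(0) = 1, discretized by forward Euler.
T, n = 1.0, 2000
dt = T / n
y0 = 1.0

def state(u):
    # Forward Euler for the state equation (1.17).
    y = np.empty(n + 1)
    y[0] = y0
    for i in range(n):
        y[i + 1] = y[i] + dt * (-y[i] + u[i])
    return y

def objective(u):
    # Discretization of (1.16).
    y = state(u)
    return 0.5 * dt * np.sum(y[:-1] ** 2 + u ** 2)

def gradient(u):
    # Adjoint equation (1.20): -p' = -p + y, p(T) = 0, integrated
    # backward in time; then (1.19) gives grad f(u) = p phi_u + L_u = p + u.
    y = state(u)
    p = np.empty(n + 1)
    p[n] = 0.0
    for i in range(n, 0, -1):
        p[i - 1] = p[i] + dt * (-p[i] + y[i])
    return p[:-1] + u

u = np.linspace(0.0, 1.0, n)   # an arbitrary control at which to evaluate
g = gradient(u)
```

Note that one gradient costs one state solve and one adjoint solve, which is the point of the remark that gradients here cost little more than function evaluations.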