partial derivatives. Here are a couple.

(1) If f is a multinomial, i.e.
\[
f(x, y) = \sum_{p=0}^{P} \sum_{q=0}^{Q} a_{p,q} x^p y^q,
\]
then f_{,12} = f_{,21}. But smooth functions are very close to being polynomial, so we would expect the result to be true in general.

(2) Although we cannot interchange limits in general, it is plausible that, if f is well behaved, then
\[
\begin{aligned}
f_{,12}(x, y) &= \lim_{h \to 0} \lim_{k \to 0} h^{-1} k^{-1} \bigl( f(x+h, y+k) - f(x+h, y) - f(x, y+k) + f(x, y) \bigr) \\
&= \lim_{k \to 0} \lim_{h \to 0} h^{-1} k^{-1} \bigl( f(x+h, y+k) - f(x+h, y) - f(x, y+k) + f(x, y) \bigr) \\
&= f_{,21}(x, y).
\end{aligned}
\]
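The plausibility of this interchange can be checked numerically. The following Python sketch is illustrative only: the test function sin(x)e^y is an arbitrary choice (not taken from the text) whose mixed partial cos(x)e^y is easy to compute by hand, and the symmetric second difference above is compared against it.

```python
import math

# Hypothetical smooth test function; here f,12 = f,21 = cos(x)*exp(y).
def f(x, y):
    return math.sin(x) * math.exp(y)

def mixed_difference(x, y, h, k):
    # The symmetric second difference from the display above.
    return (f(x + h, y + k) - f(x + h, y) - f(x, y + k) + f(x, y)) / (h * k)

x, y = 0.7, 0.3
exact = math.cos(x) * math.exp(y)      # the common value of f,12 and f,21

approx = mixed_difference(x, y, 1e-5, 1e-5)
print(abs(approx - exact) < 1e-4)      # True
```

For this well behaved f the order in which h and k shrink makes no difference, exactly as the heuristic suggests.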

However, these are merely plausible arguments. They do not make clear the rôle of the continuity of the second derivative (in Example 7.3.18 we shall see that the result may fail for discontinuous second partial derivatives). More fundamentally, they are algebraic arguments and, as the use of the mean value theorem indicates, the result is one of analysis. The same kind of argument which shows that the local Taylor theorem fails over Q (see Example 7.1.8) shows that it fails over Q² and that the symmetry of partial derivatives fails with it (see [33]).

If we use the D notation, Theorem 7.2.6 states that (under appropriate

conditions)

D1 D2 f = D2 D1 f.

If we write Dij = Di Dj , as is often done, we get

D12 f = D21 f.

What happens if a function has higher partial derivatives? It is not hard

to guess and prove the appropriate theorem.


Please send corrections however trivial to twk@dpmms.cam.ac.uk

Exercise 7.2.7. Suppose δ > 0, x ∈ R^m, B(x, δ) ⊆ E ⊆ R^m and that f : E → R. Show that, if all the partial derivatives f_{,j}, f_{,jk}, f_{,jkl}, ..., up to the nth order exist in B(x, δ) and are continuous at x, then, writing
\[
f(x + h) = f(x) + \sum_{j=1}^{m} f_{,j}(x) h_j + \frac{1}{2!} \sum_{j=1}^{m} \sum_{k=1}^{m} f_{,jk}(x) h_j h_k + \frac{1}{3!} \sum_{j=1}^{m} \sum_{k=1}^{m} \sum_{l=1}^{m} f_{,jkl}(x) h_j h_k h_l + \cdots + \text{sum up to nth powers} + \epsilon(h) \|h\|^n,
\]
we have ε(h) → 0 as h → 0.

Notice that you do not have to prove results like

f,jkl (x) = f,ljk (x) = f,klj (x) = f,lkj (x) = f,jlk (x) = f,kjl (x)

since they follow directly from Theorem 7.2.6.

Applying Exercise 7.2.7 to the components fi of a function f , we obtain

our full many dimensional Taylor theorem.

Theorem 7.2.8 (The local Taylor's theorem). Suppose δ > 0, x ∈ R^m, B(x, δ) ⊆ E ⊆ R^m and that f : E → R^p. If all the partial derivatives f_{i,j}, f_{i,jk}, f_{i,jkl}, ... exist in B(x, δ) and are continuous at x, then, writing
\[
f_i(x + h) = f_i(x) + \sum_{j=1}^{m} f_{i,j}(x) h_j + \frac{1}{2!} \sum_{j=1}^{m} \sum_{k=1}^{m} f_{i,jk}(x) h_j h_k + \frac{1}{3!} \sum_{j=1}^{m} \sum_{k=1}^{m} \sum_{l=1}^{m} f_{i,jkl}(x) h_j h_k h_l + \cdots + \text{sum up to nth powers} + \epsilon_i(h) \|h\|^n,
\]
we have ε(h) → 0 as h → 0.
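The theorem can be illustrated numerically. In the sketch below the function sin(x)e^y and the expansion point are arbitrary choices for illustration; the second order Taylor polynomial is built from hand-computed partial derivatives, and the ratio of the error to ‖h‖² shrinks as h → 0, as the theorem predicts.

```python
import math

def f(x, y):
    return math.sin(x) * math.exp(y)

x0, y0 = 0.5, -0.2
# First and second partial derivatives of f at (x0, y0), computed by hand.
fx  = math.cos(x0) * math.exp(y0)
fy  = math.sin(x0) * math.exp(y0)
fxx = -math.sin(x0) * math.exp(y0)
fxy = math.cos(x0) * math.exp(y0)
fyy = math.sin(x0) * math.exp(y0)

def taylor2(h, k):
    # Second order Taylor polynomial about (x0, y0).
    return (f(x0, y0) + fx * h + fy * k
            + 0.5 * (fxx * h * h + 2 * fxy * h * k + fyy * k * k))

# epsilon(h) = (f(x0 + h) - Taylor polynomial) / ||h||^2 should shrink.
eps = []
for t in (1e-1, 1e-2, 1e-3):
    h, k = t, -t
    n2 = h * h + k * k
    eps.append(abs(f(x0 + h, y0 + k) - taylor2(h, k)) / n2)
print(eps[0] > eps[1] > eps[2])   # True: the error term is o(||h||^2)
```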

The reader will remark that Theorem 7.2.8 bristles with subscripts, contrary to our announced intention of seeking a geometric, coordinate free view. However, it is very easy to restate the main formula of Theorem 7.2.8 in a coordinate free way as
\[
f(x + h) = f(x) + \alpha_1(h) + \alpha_2(h, h) + \cdots + \alpha_n(h, h, \ldots, h) + \epsilon(h) \|h\|^n,
\]
where α_k : R^m × R^m × ⋯ × R^m → R^p is linear in each variable (i.e. a k-linear function) and symmetric (i.e. interchanging any two variables leaves the value of α_k unchanged).

152 A COMPANION TO ANALYSIS

Anyone who feels that the higher derivatives are best studied using coordinates should reflect that, if f : R^3 → R^3 is well behaved, then the 'third derivative behaviour' of f at a single point is apparently given by the 3 × 3 × 3 × 3 = 81 numbers f_{i,jkl}(x). By symmetry (see Theorem 7.2.6) only 30 of the numbers are distinct, but these 30 numbers are independent (consider polynomials in three variables for which the total degree of each term is 3). How can we understand the information carried by an array of 30 real numbers?
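The count of 30 distinct numbers can be confirmed mechanically: for each of the 3 output indices i, the symmetry theorem says f_{i,jkl}(x) depends only on the multiset {j, k, l}, so there are as many distinct entries as there are multisets of size 3 drawn from 3 symbols. A short Python check:

```python
from itertools import combinations_with_replacement

# Distinct entries f_{i,jkl}(x) for f: R^3 -> R^3: the output index i is
# free, while (j, k, l) matters only up to reordering, by Theorem 7.2.6.
m = 3                      # dimension of domain and range
symmetric_triples = list(combinations_with_replacement(range(m), 3))
distinct = m * len(symmetric_triples)
print(distinct)            # 30, out of 3*3*3*3 = 81 raw entries
```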

Exercise 7.2.9. (i) Verify the statements in the last paragraph. How large an array is required to give the 'third derivative behaviour' of a well behaved function f : R^4 → R^4 at a point? How large an array is required to give the 'fourth derivative behaviour' of a well behaved function f : R^3 → R^3 at a point?

(ii) (Ignore this if the notation is not familiar.) Consider a well behaved function f : R^3 → R^3. How large an array is required to give curl f = ∇ × f and div f = ∇ · f? How large an array is required to give Df?

In many circumstances curl f and div f give the physically interesting part of Df, but physicists also use
\[
(\mathbf{a} \cdot \nabla)\mathbf{f} = \Bigl( \sum_{j=1}^{3} a_j f_{1,j}, \; \sum_{j=1}^{3} a_j f_{2,j}, \; \sum_{j=1}^{3} a_j f_{3,j} \Bigr).
\]
How large an array is required to give (a · ∇)f for all a ∈ R^3?
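The formula can be made concrete: since (a · ∇)f has components Σⱼ aⱼ f_{i,j}, the 3 × 3 Jacobian array (f_{i,j}) determines (a · ∇)f for every a at once. A small numerical sketch, using a linear map as an arbitrary illustrative choice of well behaved f (its Jacobian is then the matrix itself):

```python
import numpy as np

# Hypothetical well behaved f: R^3 -> R^3 with an easy Jacobian: f(x) = A x.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 3.0, 1.0],
              [4.0, 0.0, 5.0]])

def f(x):
    return A @ x

# For linear f the Jacobian (f_{i,j}) is A itself, so (a . grad) f = A a:
# component i is sum_j a_j f_{i,j}, exactly the displayed formula.
a = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.7, -1.1])

h = 1e-6
numeric = (f(x + h * a) - f(x)) / h     # directional derivative along a
print(np.allclose(numeric, A @ a))      # True
```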

In subjects like elasticity the description of nature requires the full Jacobian matrix (f_{i,j}) and the treatment of differentiation used is closer to that of the pure mathematician.

Most readers will be happy to finish this section here². However, some of them³ will observe that in our coordinate free statement of the local Taylor's theorem the 'second derivative behaviour' is given by a bilinear map α₂ : R^m × R^m → R^p and we defined derivatives in terms of linear maps.

Let us be more precise. We suppose f is a well behaved function on an open set U ⊆ R^m taking values in R^p. If we write L(E, F) for the space of linear maps from a finite dimensional vector space E to a vector space F then, for each fixed x ∈ U, we have Df(x) ∈ L(R^m, R^p). Thus, allowing x to vary freely, we see that we have a function
Df : U → L(R^m, R^p).

² The rest of this section is marked with a ♥.

3

Boas notes that ˜There is a test for identifying some of the future professional math-

ematicians at an early age. These are students who instantly comprehend a sentence

beginning “Let X be an ordered quintuple (a, T, π, σ, B) where . . . ”. They are even more

promising if they add, “I never really understood it before.” ™ ([8] page 231.)


We now observe that L(R^m, R^p) is a finite dimensional vector space over R of dimension mp; in other words, L(R^m, R^p) can be identified with R^{mp}. We know how to define the derivative of a well behaved function g : U → R^{mp} at x as a function
Dg(x) ∈ L(R^m, R^{mp}),
so we know how to define the derivative of Df at x as a function
D(Df)(x) ∈ L(R^m, L(R^m, R^p)).
We have thus shown how to define the second derivative D²f(x) = D(Df)(x). But D²f(x) lies in L(R^m, L(R^m, R^p)) and α₂ lies in the space E(R^m, R^m; R^p) of bilinear maps from R^m × R^m to R^p. How, the reader may ask, can we identify L(R^m, L(R^m, R^p)) with E(R^m, R^m; R^p)? Fortunately this question answers itself with hardly any outside intervention.

Exercise 7.2.10. Let E, F and G be finite dimensional vector spaces over R. We write E(E, F; G) for the space of bilinear maps α : E × F → G. Define
(Θ(α)(u))(v) = α(u, v)
for all α ∈ E(E, F; G), u ∈ E and v ∈ F.
(i) Show that Θ(α)(u) ∈ L(F, G).
(ii) Show that, if v is fixed,
(Θ(α)(λ₁u₁ + λ₂u₂))(v) = (λ₁Θ(α)(u₁) + λ₂Θ(α)(u₂))(v)
and deduce that
Θ(α)(λ₁u₁ + λ₂u₂) = λ₁Θ(α)(u₁) + λ₂Θ(α)(u₂)
for all λ₁, λ₂ ∈ R and u₁, u₂ ∈ E. Conclude that Θ(α) ∈ L(E, L(F, G)).
(iii) By arguments similar in spirit to those of (ii), show that Θ : E(E, F; G) → L(E, L(F, G)) is linear.
(iv) Show that if (Θ(α)(u))(v) = 0 for all u ∈ E, v ∈ F, then α = 0. Deduce that Θ is injective.
(v) By computing the dimensions of E(E, F; G) and L(E, L(F, G)), show that Θ is an isomorphism.
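The currying map of this exercise is easy to realise concretely: storing a bilinear map α as a 3-index array, applying it to u in the first slot (a contraction over the first index) yields the linear map α(u, ·). A numerical sketch, with the dimensions 2, 3, 4 and the random tensor chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A bilinear map alpha: R^2 x R^3 -> R^4 stored as a 3-tensor T, with
# alpha(u, v)_g = sum_{e,f} T[e, f, g] u_e v_f.
T = rng.standard_normal((2, 3, 4))

def alpha(u, v):
    return np.einsum('efg,e,f->g', T, u, v)

def curried(u):
    # The linear map v -> alpha(u, v), represented as a 3x4 array.
    return np.einsum('efg,e->fg', T, u)

u = rng.standard_normal(2)
v = rng.standard_normal(3)

# Currying loses nothing: applying the curried map to v recovers alpha(u, v).
# Both spaces have dimension 2 * 3 * 4 = 24, matching part (v).
print(np.allclose(np.einsum('fg,f->g', curried(u), v), alpha(u, v)))  # True
```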

Since our definition of Θ does not depend on a choice of basis, we say that Θ gives a natural isomorphism of E(E, F; G) and L(E, L(F, G)). If we use this isomorphism to identify E(E, F; G) and L(E, L(F, G)), then D²f(x) ∈ E(R^m, R^m; R^p). If we treat the higher derivatives in the same manner, the central formula of the local Taylor theorem takes the satisfying form
\[
f(x + h) = f(x) + Df(x)(h) + \frac{1}{2!} D^2 f(x)(h, h) + \cdots + \frac{1}{n!} D^n f(x)(h, h, \ldots, h) + \epsilon(h) \|h\|^n.
\]

For more details, consult sections 11 and 13 of chapter VIII of Dieudonné's Foundations of Modern Analysis [13], where the higher derivatives are dealt with in a coordinate free way. Like Hardy's book [23], Dieudonné's is a masterpiece but in a very different tradition⁴.

7.3 Critical points

In this section we mix informal and formal argument, deliberately using words like 'well behaved' without defining them. Our object is to use the local Taylor formula to produce results about maxima, minima and related objects.

Let U be an open subset of R^m containing 0. We are interested in the behaviour of a well behaved function f : U → R near 0.

Since f is well behaved, the first order local Taylor theorem (which reduces to the definition of differentiation) gives
f(h) = f(0) + αh + ε(h)‖h‖,
where ε(h) → 0 as h → 0 and α = Df(0) is a linear map from R^m to R. By a very simple result of linear algebra, we can choose a set of orthogonal coordinates so that α(x₁, x₂, ..., x_m) = ax₁ with a ≥ 0.

Exercise 7.3.1. If α : R^m → R is linear, show that, with respect to any particular chosen orthogonal coordinates,
α(x₁, x₂, ..., x_m) = a₁x₁ + a₂x₂ + ⋯ + a_mx_m
for some a_j ∈ R. Deduce that there is a vector a such that αx = a · x for all x ∈ R^m. Conclude that we can choose a set of orthogonal coordinates so that α(x₁, x₂, ..., x_m) = ax₁ with a ≥ 0.
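The conclusion of this exercise can be realised computationally: completing a/‖a‖ to an orthonormal basis (here via a QR factorisation, one convenient way among many) gives coordinates in which α(x) = a x₁ with a = ‖a‖ ≥ 0. A sketch in Python; the dimension and the random vectors are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4
a_vec = rng.standard_normal(m)          # alpha(x) = a_vec . x

# Build an orthogonal matrix Q whose first column is a_vec / ||a_vec||;
# np.linalg.qr completes it to an orthonormal basis of R^m.
M = np.column_stack([a_vec, rng.standard_normal((m, m - 1))])
Q, R = np.linalg.qr(M)
Q = Q * np.sign(R[0, 0])                # force first column = +a_vec/||a_vec||

x = rng.standard_normal(m)
y = Q.T @ x                             # coordinates of x in the new basis

# In the new coordinates alpha(x) = a * y_1 with a = ||a_vec|| >= 0.
a = np.linalg.norm(a_vec)
print(np.isclose(a_vec @ x, a * y[0]))  # True
```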

In applied mathematics we write a = ∇f. A longer, but very instructive, proof of the result of this exercise is given in Exercise K.31.

In the coordinate system just chosen
f(h₁, h₂, ..., h_m) = f(0) + ah₁ + ε(h)‖h‖,


Figure 7.1: Contour lines when the derivative is not zero.

where ε(h) → 0 as h → 0. Thus, speaking informally, if a ≠ 0 the 'contour lines' f(h) = c close to 0 will look like parallel 'hyperplanes' perpendicular to the x₁ axis. Figure 7.1 illustrates the case m = 2. In particular, our contour lines look like those describing a side of a hill but not its peak.

Using our informal insight we can prove a formal lemma.

Lemma 7.3.2. Let U be an open subset of R^m containing x. Suppose that f : U → R is differentiable at x. If f(x) ≥ f(y) for all y ∈ U, then Df(x) = 0 (more precisely, Df(x)h = 0 for all h ∈ R^m).

Proof. There is no loss in generality in supposing x = 0. Suppose that Df(0) ≠ 0. Then we can find an orthogonal coordinate system and a strictly positive real number a such that Df(0)(h₁, h₂, ..., h_m) = ah₁. Thus, from the definition of the derivative,
f(h₁, h₂, ..., h_m) = f(0) + ah₁ + ε(h)‖h‖,
where ε(h) → 0 as h → 0.

Choose η > 0 such that, whenever ‖h‖ < η, we have h ∈ U and |ε(h)| < a/2. Now choose any real h with 0 < h < η. If we set h = (h, 0, 0, ..., 0), we have
f(h) = f(0) + ah + ε(h)h > f(0) + ah − ah/2 = f(0) + ah/2 > f(0),
contradicting the hypothesis that f(0) ≥ f(y) for all y ∈ U.

The distinctions made in the following definition are probably familiar to the reader.

Definition 7.3.3. Let E be a subset of R^m containing x and let f be a function from E to R.

⁴ See the quotation from Boas in the previous footnote.


(i) We say that f has a global maximum at x if f(x) ≥ f(y) for all y ∈ E.
(ii) We say that f has a strict global maximum at x if f(x) > f(y) for all y ∈ E with x ≠ y.
(iii) We say that f has a local maximum (respectively a strict local maximum) at x if there exists an η > 0 such that the restriction of f to E ∩ B(x, η) has a global maximum (respectively a strict global maximum) at x.
(iv) If we can find an η > 0 such that E ⊇ B(x, η) and f is differentiable at x with Df(x) = 0, we say that x is a critical or stationary point⁵ of f.

It is usual to refer to the point x where f takes a (global or local) maximum as a (global or local) maximum, and this convention rarely causes confusion. When mathematicians omit the words local or global in referring to a maximum they usually mean the local version (but this convention, which I shall follow, is not universal).

Here are some easy exercises involving these ideas.

Exercise 7.3.4. (i) Let U be an open subset of R^m containing x. Suppose that f : U → R is differentiable on U and that Df is continuous at x. Show that, if f has a local maximum at x, then Df(x) = 0.

(ii) Suppose that f : R^m → R is differentiable everywhere and E is a closed subset of R^m containing x. Show that, even if x is a global maximum of the restriction of f to E, it need not be true that Df(x) = 0. [Hint: We have already met this fact when we thought about Rolle's theorem.] Explain informally why the proof of Lemma 7.3.2 fails in this case.

(iii) State the definitions corresponding to Definition 7.3.3 that we need to deal with minima.
(iv) Let E be any subset of R^m containing y and let f be a function from E to R. If y is both a global maximum and a global minimum for f, show that f is constant. What can you say if we replace the word 'global' by 'local'?

We saw above how f behaved locally near 0 if Df(0) ≠ 0. What can we say if Df(0) = 0? In this case, the second order Taylor expansion gives
\[
f(h) = f(0) + \beta(h, h) + \epsilon(h) \|h\|^2,
\]
where
\[
\beta(h, h) = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} f_{,ij}(0) h_i h_j
\]

⁵ In other words, a stationary point is one where the ground is flat. Since flat ground drains badly, the stationary points we meet in hill walking tend to be boggy. Thus we encounter boggy ground at the top of hills and when crossing passes as well as at lowest points (at least in the UK; other countries may be drier or have better draining soils).


Figure 7.2: Contour lines when the derivative is zero but the second derivative is non-singular.

and ε(h) → 0 as h → 0. We write β = ½D²f and call the matrix K = (f_{,ij}(0)) the Hessian matrix. As we noted in the previous section, the symmetry of the second partial derivatives (Theorem 7.2.6) tells us that the Hessian matrix is a symmetric matrix and the associated bilinear map D²f is symmetric. It follows from a well known result in linear algebra (see e.g. Exercise K.30) that R^m has an orthonormal basis of eigenvectors of K. Choosing coordinate axes along those vectors, we obtain
\[
D^2 f(h, h) = \sum_{i=1}^{m} \lambda_i h_i^2,
\]
where the λ_i are the eigenvalues associated with the eigenvectors.
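The diagonalisation step can be checked numerically: for a symmetric K, an orthonormal eigenbasis reduces the quadratic form hᵀKh to Σ λᵢ(h′ᵢ)², where h′ is h expressed in the eigenbasis. A sketch with an arbitrary illustrative symmetric matrix standing in for a Hessian:

```python
import numpy as np

# Hypothetical symmetric Hessian K = (f_{,ij}(0)) for some m = 3 function.
K = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

lam, V = np.linalg.eigh(K)      # eigenvalues lam, orthonormal eigenvectors V

h = np.array([0.2, -0.5, 0.1])
hp = V.T @ h                    # coordinates of h along the eigenvectors

# D^2 f(h, h) = h^T K h equals sum_i lambda_i (h'_i)^2 in the new coordinates.
quad_form = h @ K @ h
diagonal  = np.sum(lam * hp**2)
print(np.isclose(quad_form, diagonal))   # True
```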

In the coordinate system just chosen
\[
f(h_1, h_2, \ldots, h_m) = f(0) + \frac{1}{2} \sum_{i=1}^{m} \lambda_i h_i^2 + \epsilon(h) \|h\|^2,
\]

where ε(h) → 0 as h → 0. Thus, speaking informally, if all the λ_i are non-zero, the 'contour lines' f(h) = c close to 0 will look like 'quadratic hypersurfaces' (that is, m dimensional versions of conics). Figure 7.2 illustrates the two possible contour patterns when m = 2. The first type of pattern is that of a summit (if the contour lines are for increasing heights as we approach 0) or a bottom (lowest point)⁶ (if the contour lines are for decreasing heights as we approach 0). The second is that of a pass (often called a saddle). Notice that, for merchants wishing to get from one valley to another, the pass is the highest point in their journey but, for mountaineers wishing to get from one mountain to another, the pass is the lowest point.

⁶ The English language is rich in synonyms for highest points (summits, peaks, crowns, . . . ) but has few for lowest points. This may be because the English climate ensures that most lowest points are under water.


When looking at Figure 7.2 it is important to realise that the difference in heights of successive contour lines is not constant. In effect we have drawn contour lines at heights f(0), f(0) + η, f(0) + 2²η, f(0) + 3²η, ..., f(0) + n²η.

Exercise 7.3.5. (i) Redraw Figure 7.2 with contour lines at heights f(0), f(0) + η, f(0) + 2η, f(0) + 3η, ..., f(0) + nη.
(ii) What (roughly speaking) can you say about the difference in heights of successive contour lines in Figure 7.1?

Using our informal insight we can prove a formal lemma.

Lemma 7.3.6. Let U be an open subset of R^m containing x. Suppose that f : U → R has second order partial derivatives on U and these partial derivatives are continuous at x. If Df(x) = 0 and D²f(x) is non-singular, then
(i) f has a minimum at x if and only if D²f(x) is positive definite;
(ii) f has a maximum at x if and only if D²f(x) is negative definite.

The conditions of the second sentence of the hypothesis ensure that we have a local second order Taylor expansion. In most applications f will be much better behaved than this. We say that D²f(x) is positive definite if all the associated eigenvalues (that is, all the eigenvalues of the Hessian matrix) are strictly positive, and that D²f(x) is negative definite if all the associated eigenvalues are strictly negative.

Exercise 7.3.7. Prove Lemma 7.3.6 following the style of the proof of Lemma 7.3.2.

It is a non-trivial task to tell whether a given Hessian is positive or negative definite.

Exercise 7.3.8. Let f(x, y) = x² + 6xy + y². Show that Df(0, 0) = 0, that all the entries in the Hessian matrix K at (0, 0) are positive and that K is non-singular, but that D²f(0, 0) is neither positive definite nor negative definite. (So (0, 0) is a saddle point.)
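The computation behind this exercise can be confirmed numerically: the Hessian of x² + 6xy + y² at the origin has entries 2 and 6, all positive, yet its eigenvalues have mixed signs, so the entries alone do not settle definiteness.

```python
import numpy as np

# Hessian of f(x, y) = x^2 + 6xy + y^2 at (0, 0): all entries positive.
K = np.array([[2.0, 6.0],
              [6.0, 2.0]])

lam = np.linalg.eigvalsh(K)              # eigenvalues in ascending order
print(lam)                               # [-4.  8.]

# Nonsingular (no zero eigenvalue) yet neither positive nor negative
# definite, so (0, 0) is a saddle point despite the positive entries.
print(lam.min() < 0 < lam.max())         # True
```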

Exercise K.105 gives one method of resolving the problem.

Because it is non-trivial to use the Hessian to determine whether a singular point, that is a point x where Df(x) = 0, is a maximum, a minimum or neither, mathematicians frequently seek short cuts.

Exercise 7.3.9. Suppose that f : R^m → R is continuous, that f(x) → 0 as ‖x‖ → ∞ and that f(x) > 0 for all x ∈ R^m.
(i) Explain why there exists an R > 0 such that f(x) < f(0) for all ‖x‖ ≥ R.


(ii) Explain why there exists an x₀ with ‖x₀‖ ≤ R and f(x₀) ≥ f(x) for all ‖x‖ ≤ R.
(iii) Explain why f(x₀) ≥ f(x) for all x ∈ R^m.

(iv) If f is everywhere differentiable and has exactly one singular point y₀, show that f attains a global maximum at y₀.

(v) In statistics we frequently wish to maximise functions of the form