f(t) = b + a(t − x) + small error

close to x. If we think a little harder about the nature of the 'smallest error' possible, we see that it 'ought to decrease faster than linearly', that is,

f(t) = b + a(t − x) + E(t)|t − x|

with E(t) → 0 as t → x.

Exercise 6.1.1. Suppose that f : R → R. Show that the following two statements are equivalent.

(i) (f(t) − f(x))/(t − x) → a as t → x.

(ii) f(t) = f(x) + a(t − x) + E(t)|t − x| with E(t) → 0 as t → x.
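The equivalence in Exercise 6.1.1 is easy to check numerically. The sketch below is my own illustration (not from the text): it takes f(t) = t² at x = 1, where a = f′(1) = 2, solves (ii) for the error term E, and watches it shrink as t approaches x.

```python
def f(t):
    return t * t

x, a = 1.0, 2.0  # a = f'(1) = 2 for this example

def E(t):
    # Solve (ii) for E(t): E(t) = (f(t) - f(x) - a*(t - x)) / |t - x|.
    return (f(t) - f(x) - a * (t - x)) / abs(t - x)

for h in [0.1, 0.01, 0.001]:
    print(E(x + h))  # for this f, E(x + h) = h exactly, so it tends to 0
```

For this particular f the error term works out to E(x + h) = h, so it visibly decreases faster than a constant, as (ii) demands.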

Rewriting our equations slightly, we see that f is differentiable at x if

f(t) − f(x) = a(t − x) + E(t)|t − x|

with E(t) → 0 as t → x. A final rewrite now gives: f is differentiable at x if

f(x + h) − f(x) = ah + ε(h)|h|,

where ε(h) → 0 as h → 0. The derivative f′(x) = a is the slope of the tangent at x.

Please send corrections however trivial to twk@dpmms.cam.ac.uk

The obvious extension to well behaved functions f : Rm → R is to consider the tangent plane at (x, f(x)). Just as the equation of a non-vertical line through the origin in R × R is y = bt, so the equation of an appropriate plane (or 'hyperplane' if the reader prefers) in Rm × R is y = α(x), where α : Rm → R is linear. This suggests that we say that f is differentiable at x if

f(x + h) − f(x) = α(h) + ε(h)‖h‖,

where ε(h) → 0 as h → 0. It is natural to call α the derivative of f at x.

Finally, if we consider f : Rm → Rp, the natural flow of our argument suggests that we say that f is differentiable at x if we can find a linear map α : Rm → Rp such that

f(x + h) = f(x) + α(h) + ε(h)‖h‖,

where ε(h) → 0 as h → 0. It is natural to call α the derivative of f at x.

Important note: It is indeed natural to call α the derivative of f at x. Unfortunately, it is not consistent with our old definition in the case m = p = 1. If f : R → R then, with our new definition, the derivative is the map t ↦ f′(x)t but, with our old, the derivative is the number f′(x).

From the point of view we have adopted, the key observation of the one dimensional differential calculus is that well behaved curves, however complicated they may be globally, behave locally like straight lines, i.e. like the simplest curves we know. The key observation of multidimensional calculus is that well behaved functions, however complicated they may be globally, behave locally like linear maps, i.e. like the simplest functions we know. It is this observation, above all, which justifies the immense amount of time spent studying linear algebra, that is to say, studying the behaviour of linear maps.

I shall assume that the reader has done a course on linear algebra and is familiar with the definition and lemma that follow. (Indeed, I have already assumed familiarity with the notion of a linear map.)

Definition 6.1.2. We say that a function (or map) α : Rm → Rp is linear if

α(λx + μy) = λα(x) + μα(y)

for all x, y ∈ Rm and λ, μ ∈ R.

We shall often write αx = α(x).

Lemma 6.1.3. Each linear map α : Rm → Rp is associated with a unique p × m real matrix A = (a_ij) such that, if αx = y, then

y_i = Σ_{j=1}^m a_ij x_j.   (†)

Conversely, each p × m real matrix A = (a_ij) is associated with a unique linear map α : Rm → Rp by the equation (†).

We shall call A the matrix of α with respect to the standard bases. The point to notice is that, if we take different coordinate axes, we get different matrices associated with the same linear map.
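In code, (†) is just a double loop. A minimal sketch with plain lists (the matrix entries below are illustrative, not from the text):

```python
def apply_matrix(A, x):
    # y_i = sum over j of a_ij * x_j, for a p x m matrix A and x in R^m.
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]          # p = 3, m = 2
x = [1.0, -1.0]
y = apply_matrix(A, x)
print(y)                  # [-1.0, -1.0, -1.0]
```

Linearity of the map defined by (†) follows at once from the distributive law inside the sum, as a quick check with λx confirms.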

From time to time, particularly in some of the exercises, we shall use other

facts about linear maps. The reader should not worry too much if some of

these facts are unfamiliar but she should worry if all of them are.

We now repeat the discussion of differentiation with marginally more generality and precision.

A function is continuous if it is locally approximately constant. A function is differentiable if it is locally approximately linear. More precisely, a function is continuous at a point x if it is locally approximately constant, with an error which decreases to zero, as we approach x. A function is differentiable at a point x if it is locally approximately linear, with an error which decreases to zero faster than linearly, as we approach x.

Definition 6.1.4. Suppose that E is a subset of Rm and x a point such that there exists a δ > 0 with the ball B(x, δ) ⊆ E. We say that f : E → Rp is differentiable at x if we can find a linear map α : Rm → Rp such that, when ‖h‖ < δ,

f(x + h) = f(x) + αh + ε(x, h)‖h‖,

where ε(x, h) → 0 as h → 0. We write α = Df(x) or α = f′(x).

If E is open and f is differentiable at each point of E, we say that f is differentiable on E.

Needless to say, the centre of the definition is the displayed formula, and the reader should concentrate on understanding the rôle of each term in that formula. The rest of the definition is just supporting waffle. The formula is sometimes written in the form

(f(x + h) − f(x) − αh)/‖h‖ → 0

as h → 0.
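This rewritten criterion is easy to test numerically. Here is a sketch for an assumed, illustrative function f(u, v) = u² + uv at x = (1, 1), whose derivative there is the linear map αh = 3h₁ + h₂ (the coefficients 3 and 1 are the partial derivatives at that point):

```python
import math

def f(u, v):
    return u * u + u * v

def alpha(h1, h2):
    return 3.0 * h1 + 1.0 * h2   # candidate derivative at (1, 1)

def ratio(h1, h2):
    # |f(x + h) - f(x) - alpha(h)| / ||h||, which should tend to 0.
    return abs(f(1 + h1, 1 + h2) - f(1, 1) - alpha(h1, h2)) / math.hypot(h1, h2)

for s in [0.1, 0.01, 0.001]:
    print(ratio(s, s))   # shrinks roughly in proportion to s
```

The ratio decreases like ‖h‖ itself, i.e. faster than a constant, which is exactly what the criterion demands.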

Of course, we need to complete Definition 6.1.4 by showing that α is unique.

Lemma 6.1.5. (i) Let γ : Rm → Rp be a linear map and ε : Rm → Rp a function with ε(h) → 0 as h → 0. If

γh = ε(h)‖h‖,

then γ = 0, the zero map.

(ii) There is at most one α satisfying the conditions of Definition 6.1.4.

Proof. (i) There are many different ways of setting out this simple proof. Here is one. Let x ∈ Rm. If η > 0, we have

γx = η⁻¹γ(ηx) = η⁻¹ε(ηx)‖ηx‖ = ε(ηx)‖x‖

and so

‖γx‖ = ‖ε(ηx)‖ ‖x‖ → 0

as η → 0 through values η > 0. Thus ‖γx‖ = 0 and γx = 0 for all x ∈ Rm. In other words, γ = 0.

(ii) Suppose that we can find linear maps α_j : Rm → Rp such that, when ‖h‖ < δ,

f(x + h) = f(x) + α_j h + ε_j(x, h)‖h‖,

where ε_j(x, h) → 0 as h → 0 [j = 1, 2]. Subtracting, we see that

(α_1 − α_2)h = ε(x, h)‖h‖,

where

ε(x, h) = ε_2(x, h) − ε_1(x, h)

for ‖h‖ < δ. Since

‖ε(x, h)‖ ≤ ‖ε_1(x, h)‖ + ‖ε_2(x, h)‖ → 0

as h → 0, we can apply part (i) to obtain α_1 = α_2.

The coordinate free approach can be taken only so far, and to calculate we need to know the matrix A of α = Df(x) with respect to the standard bases. To find A we have recourse to directional derivatives.

Definition 6.1.6. Suppose that E is a subset of Rm and that we have a function g : E → R. Suppose further that x ∈ E and u is a unit vector such that there exists a δ > 0 with x + hu ∈ E for all |h| < δ. We can now define a function G from the open interval (−δ, δ) to R by setting G(t) = g(x + tu). If G is differentiable at 0, we say that g has a directional derivative at x in the direction u of value G′(0).
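A numerical sketch of Definition 6.1.6, using an assumed example g(u, v) = uv at x = (1, 2) in the unit direction u = (1, 1)/√2. One-variable calculus applied to G gives the exact value G′(0) = 3/√2, and a difference quotient recovers it:

```python
import math

def g(u, v):
    return u * v

x = (1.0, 2.0)
u = (1 / math.sqrt(2), 1 / math.sqrt(2))   # a unit vector

def G(t):
    # G(t) = g(x + t u), a function of one real variable.
    return g(x[0] + t * u[0], x[1] + t * u[1])

t = 1e-6
estimate = (G(t) - G(-t)) / (2 * t)   # central-difference estimate of G'(0)
print(estimate)                        # close to 3/sqrt(2) = 2.1213...
```

The point of the definition is visible in the code: all the multidimensional structure is packed into G, after which only one-variable differentiation is needed.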


Exercise 6.1.7. Suppose that E is a subset of Rm and that we have a function g : E → R. Suppose further that x ∈ E and u is a unit vector such that there exists a δ > 0 with x + hu ∈ E for all |h| < δ. Show that g has a directional derivative at x in the direction u of value a if and only if

(g(x + tu) − g(x))/t → a

as t → 0.

We are interested in the directional derivatives along the unit vectors e_j in the directions of the coordinate axes. The reader is almost certainly familiar with these under the name of 'partial derivatives'.

Definition 6.1.8. Suppose that E is a subset of Rm and that we have a function g : E → R. If we give Rm the standard basis e_1, e_2, ..., e_m (where e_j is the vector with jth entry 1 and all other entries 0), then the directional derivative of g at x in the direction e_j is called a partial derivative and written g,j(x).

The recipe for computing g,j(x) is thus: 'differentiate g(x_1, x_2, ..., x_j, ..., x_m) with respect to x_j, treating all the x_i with i ≠ j as constants'.
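The recipe translates directly into a forward-difference approximation. Everything below (the helper and the example function) is my own illustration, not from the text:

```python
def partial_derivative(g, x, j, h=1e-6):
    # Vary coordinate j only, holding all the other coordinates constant.
    xp = list(x)
    xp[j] += h
    return (g(xp) - g(x)) / h

def g(x):
    return x[0] ** 2 + 3 * x[0] * x[1]   # assumed example function

print(partial_derivative(g, [1.0, 2.0], 0))  # approx 2*1 + 3*2 = 8
print(partial_derivative(g, [1.0, 2.0], 1))  # approx 3*1 = 3
```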

The reader would probably prefer me to say that g,j(x) is the partial derivative of g with respect to x_j and write

g,j(x) = ∂g/∂x_j (x).

I shall use this notation from time to time, but, as I point out in Appendix E, there are cultural differences between the way that applied mathematicians and pure mathematicians think of partial derivatives, so I prefer to use a different notation.

The reader should also know a third notation for partial derivatives:

D_j g = g,j.

This 'D' notation is more common than the 'comma' notation and is to be preferred if you only use partial derivatives occasionally or if you only deal with functions f : Rn → R. The 'comma' notation is used in Tensor Analysis and is convenient in the kind of formulae which appear in Section 7.2.

If E is a subset of Rm and we have a function g : E → Rp, then we can write

g(t) = (g_1(t), g_2(t), ..., g_p(t))^T

and obtain functions g_i : E → R with partial derivatives (if they exist) g_i,j(x) (or, in more standard notation, ∂g_i/∂x_j (x)). The proof of the next lemma just consists of dismantling the notation so laboriously constructed in the last few paragraphs.

Lemma 6.1.9. Let f be as in Definition 6.1.4. If we use standard coordinates then, if f is differentiable at x, its partial derivatives f_i,j(x) exist and the matrix of the derivative Df(x) is the Jacobian matrix (f_i,j(x)) of partial derivatives.

Proof. Left as a strongly recommended but simple exercise for the reader.

Notice that, if f : R → R, the matrix of Df(x) is the 1 × 1 Jacobian matrix (f′(x)). Notice also that Lemma 6.1.9 provides an alternative proof of the uniqueness of the derivative (Lemma 6.1.5 (ii)).
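Lemma 6.1.9 can be illustrated by assembling the matrix of partial derivatives numerically. The function below is an assumed example, f(u, v) = (uv, u + v²), whose Jacobian at (1, 2) is [[2, 1], [1, 4]]:

```python
def f(u, v):
    return (u * v, u + v * v)

def jacobian(f, u, v, h=1e-6):
    # Entry (i, j) is a difference-quotient estimate of f_i,j at (u, v).
    f0, fu, fv = f(u, v), f(u + h, v), f(u, v + h)
    return [[(fu[i] - f0[i]) / h, (fv[i] - f0[i]) / h] for i in range(2)]

J = jacobian(f, 1.0, 2.0)
print(J)   # approximately [[2, 1], [1, 4]]
```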

It is customary to point out that the existence of the partial derivatives does not imply the differentiability of the function (see Example 7.3.14 below), but the main objections to over-reliance on partial derivatives are that it makes formulae cumbersome and stifles geometric intuition. Let your motto be 'coordinates and matrices for calculation, vectors and linear maps for understanding'.

6.2 The operator norm and the chain rule

We shall need some method of measuring the 'size' of a linear map. The reader is unlikely to have come across this in a standard 'abstract algebra' course, since algebraists dislike using 'metric notions' which do not generalise from R to more general fields.

Our first idea might be to use some sort of measure like

|||α||| = max_{i,j} |a_ij|,

where (a_ij) is the matrix of α with respect to the standard bases. However, |||α||| has no geometric meaning.

Exercise 6.2.1. Show by example that |||α||| may depend on the coordinate axes chosen.

Even if we insist that our method of measuring the size of a linear map

shall have a geometric meaning, this does not give a unique method. The

following chain of ideas gives one method which is natural and standard.


Lemma 6.2.2. If α : Rm → Rp is linear, there exists a constant K(α) such that

‖αx‖ ≤ K(α)‖x‖

for all x ∈ Rm.

Proof. Since our object is merely to show that some K(α) exists, and not to find a 'good' value, we can use the crudest inequalities. If we write y = αx, we have

‖αx‖ = ‖y‖ ≤ Σ_{i=1}^p |y_i| ≤ Σ_{i=1}^p Σ_{j=1}^m |a_ij||x_j| ≤ ( Σ_{i=1}^p Σ_{j=1}^m |a_ij| ) ‖x‖.

The required result follows on putting K(α) = Σ_{i=1}^p Σ_{j=1}^m |a_ij|.
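The crude bound of Lemma 6.2.2 is easy to check numerically; the matrix and test vectors below are illustrative choices of my own:

```python
import math

def apply(A, x):
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

def norm(v):
    # Euclidean norm of a vector given as a list.
    return math.sqrt(sum(t * t for t in v))

A = [[1.0, -2.0],
     [0.5, 3.0]]
K = sum(abs(a) for row in A for a in row)   # K(alpha) = sum of |a_ij| = 6.5

for x in [[1.0, 0.0], [0.3, -0.7], [2.0, 2.0]]:
    print(norm(apply(A, x)) <= K * norm(x))   # True in each case
```

As the proof warns, K(α) is far from the best possible constant; it merely has to exist.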

Exercise 6.2.3. Use Lemma 6.2.2 to estimate ‖αx − αy‖ and hence deduce that every linear map α : Rm → Rp is continuous. (This exercise takes longer to pose than to do.)

Lemma 6.2.2 tells us that {‖αx‖ : ‖x‖ ≤ 1} is a non-empty subset of R bounded above by K(α) and so has a supremum.

Definition 6.2.4. If α : Rm → Rp is a linear map, then

‖α‖ = sup_{‖x‖ ≤ 1} ‖αx‖.

Exercise 6.2.5. If α is as in Definition 6.2.4, show that the three quantities

sup_{‖x‖ ≤ 1} ‖αx‖,   sup_{‖x‖ = 1} ‖αx‖,   and   sup_{x ≠ 0} ‖αx‖/‖x‖

are well defined and equal.
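The supremum in Definition 6.2.4 can be estimated by sampling unit vectors. For the assumed map α : R2 → R with matrix (3 4), Exercise 6.2.8 (ii) below will show that ‖α‖ = 5, and sampling gets close:

```python
import math

def alpha(x1, x2):
    return 3.0 * x1 + 4.0 * x2   # matrix (3 4); operator norm is 5

# Sample |alpha x| over unit vectors x = (cos t, sin t).
estimate = max(abs(alpha(math.cos(t), math.sin(t)))
               for t in (2 * math.pi * k / 1000 for k in range(1000)))
print(estimate)   # just below the true value 5
```

Sampling always under-estimates a supremum, which is why the estimate approaches 5 from below.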

The 'operator norm' just defined in Definition 6.2.4 has many pleasant properties.


Lemma 6.2.6. Let α, β : Rm → Rp be linear maps.

(i) If x ∈ Rm then ‖αx‖ ≤ ‖α‖‖x‖.
(ii) ‖α‖ ≥ 0.
(iii) If ‖α‖ = 0 then α = 0.
(iv) If λ ∈ R then ‖λα‖ = |λ|‖α‖.
(v) (The triangle inequality) ‖α + β‖ ≤ ‖α‖ + ‖β‖.
(vi) If γ : Rp → Rq is linear, then ‖γα‖ ≤ ‖γ‖‖α‖.

Proof. I will prove parts (i) and (vi), leaving the equally easy remaining parts as an essential exercise for the reader.

(i) If x = 0, we observe that α0 = 0 and so

‖α0‖ = ‖0‖ = 0 ≤ 0 = ‖α‖ · 0 = ‖α‖‖0‖,

as required. If x ≠ 0, we set u = ‖x‖⁻¹x. Since

‖u‖ = ‖x‖⁻¹‖x‖ = 1,

we have ‖αu‖ ≤ ‖α‖ and so

‖αx‖ = ‖α(‖x‖u)‖ = ‖ ‖x‖αu ‖ = ‖x‖‖αu‖ ≤ ‖α‖‖x‖,

as required.

(vi) If ‖x‖ ≤ 1 then, using part (i) twice,

‖γα(x)‖ = ‖γ(α(x))‖ ≤ ‖γ‖‖α(x)‖ ≤ ‖γ‖‖α‖‖x‖ ≤ ‖γ‖‖α‖.

It follows that

‖γα‖ = sup_{‖x‖ ≤ 1} ‖γα(x)‖ ≤ ‖γ‖‖α‖.

Exercise 6.2.7. (i) Write down a linear map α : R2 → R2 such that α ≠ 0 but α² = 0.

(ii) Show that we cannot replace the inequality (vi) in Lemma 6.2.6 by an equality.

(iii) Show that we cannot replace the inequality (v) in Lemma 6.2.6 by an equality.
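For part (i) of Exercise 6.2.7, one candidate (a spoiler, so skip this if you want to find your own) is the shift matrix below: it is nonzero, yet composing it with itself gives the zero map. It also makes the inequality in (vi) strict, since ‖α²‖ = 0 < ‖α‖².

```python
def matmul(A, B):
    # Product of two square matrices given as nested lists.
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[0.0, 1.0],
     [0.0, 0.0]]          # nonzero map alpha
print(matmul(A, A))        # [[0.0, 0.0], [0.0, 0.0]], i.e. alpha^2 = 0
```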


Exercise 6.2.8. (i) Suppose that α : R → R is a linear map and that its matrix with respect to the standard bases is (a). Show that

‖α‖ = |a|.

(ii) Suppose that α : Rm → R is a linear map and that its matrix with respect to the standard bases is (a_1 a_2 ... a_m). By using the Cauchy-Schwarz inequality (Lemma 4.1.2) and the associated conditions for equality (Exercise 4.1.5 (i)), show that

‖α‖ = ( Σ_{j=1}^m a_j² )^{1/2}.

Although the operator norm is, in principle, calculable (see Exercises K.98 to K.101), the reader is warned that, except in special cases, there is no simple formula for the operator norm and it is mainly used as a theoretical tool. Should we need to have some idea of its size, extremely rough estimates will often suffice.

Exercise 6.2.9. Suppose that α : Rm → Rp is a linear map and that its matrix with respect to the standard bases is A = (a_ij). Show that

max_{i,j} |a_ij| ≤ ‖α‖ ≤ pm max_{i,j} |a_ij|.

By using the Cauchy-Schwarz inequality, show that

‖α‖ ≤ ( Σ_{i=1}^p Σ_{j=1}^m a_ij² )^{1/2}.

Show that this inequality implies the corresponding inequality in the previous paragraph.
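A numerical check of the estimates in Exercise 6.2.9, with ‖α‖ itself approximated by sampling unit vectors (the matrix entries are illustrative):

```python
import math

A = [[1.0, -2.0],
     [0.5, 3.0]]
p, m = 2, 2

def apply(A, x):
    return [sum(A[i][j] * x[j] for j in range(m)) for i in range(p)]

def norm(v):
    return math.sqrt(sum(t * t for t in v))

# Estimate ||alpha|| by sampling ||alpha x|| over unit vectors x.
op_est = max(norm(apply(A, [math.cos(t), math.sin(t)]))
             for t in (2 * math.pi * k / 1000 for k in range(1000)))
biggest = max(abs(a) for row in A for a in row)                    # max |a_ij|
root_sum_sq = math.sqrt(sum(a * a for row in A for a in row))      # Cauchy-Schwarz bound

print(biggest <= op_est <= root_sum_sq <= p * m * biggest)   # True
```

This also illustrates the closing remark above: the rough bounds bracket the operator norm well enough for most theoretical purposes, with no exact formula needed.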

We now return to differentiation. Suppose that f : Rm → Rp and g : Rp → Rq are differentiable. What can we say about their composition g ∘ f? To simplify the algebra, let us suppose that f(0) = 0 and g(0) = 0 (so g ∘ f(0) = 0), and ask about the differentiability of g ∘ f at 0. Suppose that the derivative of f at 0 is α and the derivative of g at 0 is β. Then

f(h) ≈ αh

when h is small (h ∈ Rm), and

g(k) ≈ βk

when k is small (k ∈ Rp). It ought, therefore, to be true that

g(f(h)) ≈ β(αh),

i.e. that