f(t) = b + a(t − x) + small error

close to x. If we think a little harder about the nature of the 'smallest error' possible, we see that it 'ought to decrease faster than linearly', that is,

f(t) = b + a(t − x) + E(t)|t − x|

with E(t) → 0 as t → x.

Exercise 6.1.1. Suppose that f : R → R. Show that the following two statements are equivalent.
(i) (f(t) − f(x))/(t − x) → a as t → x.
(ii) f(t) = f(x) + a(t − x) + E(t)|t − x| with E(t) → 0 as t → x.
Rewriting our equations slightly, we see that f is differentiable at x if

f(t) − f(x) = a(t − x) + E(t)|t − x|

with E(t) → 0 as t → x. A final rewrite now gives: f is differentiable at x if

f(x + h) − f(x) = ah + ε(h)|h|,

where ε(h) → 0 as h → 0. The derivative f′(x) = a is the slope of the tangent at x.
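For readers who like to experiment, here is a small numerical sketch (my own illustration; the choice of f and x is arbitrary) of the statement just made: for f(t) = t² at x = 3, the error ε(h) = (f(x + h) − f(x) − ah)/|h| with a = f′(x) = 6 does indeed tend to 0 with h.

```python
# Numerical sketch: for f(t) = t**2 at x = 3 we have a = f'(x) = 6, and
# epsilon(h) = (f(x + h) - f(x) - a*h)/|h| should tend to 0 as h does.

def f(t):
    return t * t

x, a = 3.0, 6.0

def epsilon(h):
    return (f(x + h) - f(x) - a * h) / abs(h)

# Here epsilon(h) equals h exactly (up to floating-point rounding), so the
# total error epsilon(h)*|h| = h**2 decreases faster than linearly.
errors = [abs(epsilon(10.0 ** -k)) for k in range(1, 6)]
print(errors)
```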
The obvious extension to well behaved functions f : R^m → R is to consider the tangent plane at (x, f(x)). Just as the equation of a non-vertical
123
Please send corrections however trivial to twk@dpmms.cam.ac.uk

line through the origin in R × R is y = bt, so the equation of an appropriate plane (or 'hyperplane' if the reader prefers) in R^m × R is y = α(x) where α : R^m → R is linear. This suggests that we say that f is differentiable at x if

f(x + h) − f(x) = α(h) + ε(h)‖h‖,

where ε(h) → 0 as h → 0. It is natural to call α the derivative of f at x.
Finally, if we consider f : R^m → R^p, the natural flow of our argument suggests that we say that f is differentiable at x if we can find a linear map α : R^m → R^p such that

f(x + h) = f(x) + α(h) + ε(h)‖h‖,

where ε(h) → 0 as h → 0. It is natural to call α the derivative of f at x.

Important note: It is indeed natural to call α the derivative of f at x. Unfortunately, it is not consistent with our old definition in the case m = p = 1. If f : R → R, then, with our new definition, the derivative is the map t ↦ f′(x)t but, with our old, the derivative is the number f′(x).
From the point of view we have adopted, the key observation of the one dimensional differential calculus is that well behaved curves, however complicated they may be globally, behave locally like straight lines, i.e. like the simplest curves we know. The key observation of multidimensional calculus is that well behaved functions, however complicated they may be globally, behave locally like linear maps, i.e. like the simplest functions we know. It is this observation, above all, which justifies the immense amount of time spent studying linear algebra, that is to say, studying the behaviour of linear maps.
I shall assume that the reader has done a course on linear algebra and is familiar with the definition and lemma that follow. (Indeed, I have already assumed familiarity with the notion of a linear map.)
Definition 6.1.2. We say that a function (or map) α : R^m → R^p is linear if

α(λx + μy) = λα(x) + μα(y)

for all x, y ∈ R^m and λ, μ ∈ R.
We shall often write αx = α(x).
Lemma 6.1.3. Each linear map α : R^m → R^p is associated with a unique p × m real matrix A = (a_ij) such that if αx = y then

y_i = Σ_{j=1}^m a_ij x_j.   (†)
124 A COMPANION TO ANALYSIS

Conversely, each p × m real matrix A = (a_ij) is associated with a unique linear map α : R^m → R^p by the equation (†).
We shall call A the matrix of α with respect to the standard bases. The point to notice is that, if we take different coordinate axes, we get different matrices associated with the same linear map.
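The correspondence (†) between matrices and linear maps can be sketched in a few lines of code (my own illustration; the matrix and vectors are arbitrary). The first check applies A to a vector by the formula (†); the second confirms linearity in the sense of Definition 6.1.2.

```python
# Sketch of (†): a p x m matrix A acts on x in R^m by y_i = sum_j a_ij * x_j.
# Plain lists, no libraries.

def apply_linear(A, x):
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]          # p = 2, m = 3

x = [1.0, 1.0, 1.0]
y = apply_linear(A, x)
print(y)                        # [3.0, 4.0]

# Linearity check: alpha(lam*x + mu*z) = lam*alpha(x) + mu*alpha(z).
z = [2.0, 0.0, -1.0]
lam, mu = 2.0, -1.0
lhs = apply_linear(A, [lam * xi + mu * zi for xi, zi in zip(x, z)])
rhs = [lam * yi + mu * wi for yi, wi in zip(y, apply_linear(A, z))]
print(lhs == rhs)               # True
```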
From time to time, particularly in some of the exercises, we shall use other facts about linear maps. The reader should not worry too much if some of these facts are unfamiliar, but she should worry if all of them are.
We now repeat the discussion of differentiation with marginally more generality and precision.
A function is continuous if it is locally approximately constant. A function is differentiable if it is locally approximately linear. More precisely, a function is continuous at a point x if it is locally approximately constant, with an error which decreases to zero, as we approach x. A function is differentiable at a point x if it is locally approximately linear, with an error which decreases to zero faster than linearly, as we approach x.
Definition 6.1.4. Suppose that E is a subset of R^m and x a point such that there exists a δ > 0 with the ball B(x, δ) ⊆ E. We say that f : E → R^p is differentiable at x if we can find a linear map α : R^m → R^p such that, when ‖h‖ < δ,

f(x + h) = f(x) + αh + ε(x, h)‖h‖,

where ε(x, h) → 0 as h → 0. We write α = Df(x) or α = f′(x).
If E is open and f is differentiable at each point of E, we say that f is differentiable on E.
Needless to say, the centre of the definition is the displayed formula, and the reader should concentrate on understanding the rôle of each term in that formula. The rest of the definition is just supporting waffle. The formula is sometimes written in the form

‖f(x + h) − f(x) − αh‖/‖h‖ → 0

as h → 0.
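This limit can be watched numerically. In the sketch below (my own illustration; the function, the point, and the candidate matrix are arbitrary choices) we take f(x, y) = (x² + y, xy) at the point (1, 2), use the matrix of partial derivatives as α, and observe the ratio ‖f(p + h) − f(p) − αh‖/‖h‖ shrink.

```python
# Sketch: for f(x, y) = (x**2 + y, x*y) at p = (1, 2), the candidate linear
# map alpha has matrix [[2, 1], [2, 1]] (the partial derivatives at p), and
# ||f(p + h) - f(p) - alpha h|| / ||h|| should tend to 0 with h.
import math

def f(x, y):
    return (x * x + y, x * y)

p = (1.0, 2.0)
A = [[2.0, 1.0],   # d(x**2 + y)/dx, d(x**2 + y)/dy at p
     [2.0, 1.0]]   # d(x*y)/dx,      d(x*y)/dy      at p

def ratio(h1, h2):
    fx, fy = f(p[0] + h1, p[1] + h2)
    gx, gy = f(*p)
    rx = fx - gx - (A[0][0] * h1 + A[0][1] * h2)
    ry = fy - gy - (A[1][0] * h1 + A[1][1] * h2)
    return math.hypot(rx, ry) / math.hypot(h1, h2)

ratios = [ratio(t, -t) for t in (1e-1, 1e-2, 1e-3, 1e-4)]
print(ratios)
```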
Of course, we need to complete Definition 6.1.4 by showing that α is unique.

Lemma 6.1.5. (i) Let γ : R^m → R^p be a linear map and ε : R^m → R^p a function with ε(h) → 0 as h → 0. If

γh = ε(h)‖h‖

then γ = 0, the zero map.
(ii) There is at most one α satisfying the conditions of Definition 6.1.4.

Proof. (i) There are many different ways of setting out this simple proof. Here is one. Let x ∈ R^m. If η > 0, we have

γx = η^{−1}γ(ηx) = η^{−1}ε(ηx)‖ηx‖ = ε(ηx)‖x‖

and so

‖γx‖ = ‖ε(ηx)‖‖x‖ → 0

as η → 0 through values η > 0. Thus ‖γx‖ = 0 and γx = 0 for all x ∈ R^m. In other words, γ = 0.
(ii) Suppose that we can find linear maps α_j : R^m → R^p such that, when ‖h‖ < δ,

f(x + h) = f(x) + α_j h + ε_j(x, h)‖h‖,

where ε_j(x, h) → 0 as h → 0 [j = 1, 2]. Subtracting, we see that

(α_1 − α_2)h = ε(x, h)‖h‖,

where

ε(x, h) = ε_2(x, h) − ε_1(x, h)

for ‖h‖ < δ. Since

‖ε(x, h)‖ ≤ ‖ε_1(x, h)‖ + ‖ε_2(x, h)‖ → 0

as h → 0, we can apply part (i) to obtain α_1 = α_2.
The coordinate free approach can be taken only so far, and to calculate we need to know the matrix A of α = Df(x) with respect to the standard bases. To find A we have recourse to directional derivatives.

Definition 6.1.6. Suppose that E is a subset of R^m and that we have a function g : E → R. Suppose further that x ∈ E and u is a unit vector such that there exists a δ > 0 with x + hu ∈ E for all |h| < δ. We can now define a function G from the open interval (−δ, δ) to R by setting G(t) = g(x + tu). If G is differentiable at 0, we say that g has a directional derivative at x in the direction u of value G′(0).

Exercise 6.1.7. Suppose that E is a subset of R^m and that we have a function g : E → R. Suppose further that x ∈ E and u is a unit vector such that there exists a δ > 0 with x + hu ∈ E for all |h| < δ. Show that g has a directional derivative at x in the direction u of value a if and only if

(g(x + tu) − g(x))/t → a

as t → 0.
We are interested in the directional derivatives along the unit vectors e_j in the directions of the coordinate axes. The reader is almost certainly familiar with these under the name of 'partial derivatives'.

Definition 6.1.8. Suppose that E is a subset of R^m and that we have a function g : E → R. If we give R^m the standard basis e_1, e_2, ..., e_m (where e_j is the vector with jth entry 1 and all other entries 0), then the directional derivative of g at x in the direction e_j is called a partial derivative and written g,j(x).

The recipe for computing g,j(x) is thus: 'differentiate g(x_1, x_2, ..., x_j, ..., x_m) with respect to x_j, treating all the x_i with i ≠ j as constants'.
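The recipe can be checked numerically against difference quotients along the coordinate directions. In this sketch (my own illustration; the function and point are arbitrary) g(x_1, x_2) = x_1² x_2, so the recipe gives g,1(x) = 2x_1x_2 and g,2(x) = x_1².

```python
# Sketch of the recipe: for g(x1, x2) = x1**2 * x2, treating x2 as a constant
# gives g,1(x) = 2*x1*x2, and treating x1 as a constant gives g,2(x) = x1**2.
# We compare with centred difference quotients along e_1 and e_2.

def g(x1, x2):
    return x1 * x1 * x2

x = (3.0, 5.0)
h = 1e-6

g1_numeric = (g(x[0] + h, x[1]) - g(x[0] - h, x[1])) / (2 * h)
g2_numeric = (g(x[0], x[1] + h) - g(x[0], x[1] - h)) / (2 * h)

g1_exact = 2 * x[0] * x[1]   # 30.0 by the recipe
g2_exact = x[0] * x[0]       # 9.0 by the recipe
print(g1_numeric, g2_numeric)
```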
The reader would probably prefer me to say that g,j(x) is the partial derivative of g with respect to x_j and write

g,j(x) = ∂g/∂x_j (x).

I shall use this notation from time to time, but, as I point out in Appendix E, there are cultural differences between the way that applied mathematicians and pure mathematicians think of partial derivatives, so I prefer to use a different notation.
The reader should also know a third notation for partial derivatives:

D_j g = g,j.

This 'D' notation is more common than the 'comma' notation and is to be preferred if you only use partial derivatives occasionally or if you only deal with functions f : R^n → R. The 'comma' notation is used in Tensor Analysis and is convenient in the kind of formulae which appear in Section 7.2.
If E is a subset of R^m and we have a function g : E → R^p, then we can write

g(t) = (g_1(t), g_2(t), ..., g_p(t))^T

and obtain functions g_i : E → R with partial derivatives (if they exist) g_i,j(x) (or, in more standard notation, ∂g_i/∂x_j (x)). The proof of the next lemma just consists of dismantling the notation so laboriously constructed in the last few paragraphs.

Lemma 6.1.9. Let f be as in Definition 6.1.4. If we use standard coordinates, then, if f is differentiable at x, its partial derivatives f_i,j(x) exist and the matrix of the derivative Df(x) is the Jacobian matrix (f_i,j(x)) of partial derivatives.

Proof. Left as a strongly recommended but simple exercise for the reader.

Notice that, if f : R → R, the matrix of Df(x) is the 1 × 1 Jacobian matrix (f′(x)). Notice also that Lemma 6.1.9 provides an alternative proof of the uniqueness of the derivative (Lemma 6.1.5 (ii)).
It is customary to point out that the existence of the partial derivatives does not imply the differentiability of the function (see Example 7.3.14 below), but the main objections to over-reliance on partial derivatives are that it makes formulae cumbersome and stifles geometric intuition. Let your motto be 'coordinates and matrices for calculation, vectors and linear maps for understanding'.


6.2 The operator norm and the chain rule
We shall need some method of measuring the 'size' of a linear map. The reader is unlikely to have come across this in a standard 'abstract algebra' course, since algebraists dislike using 'metric notions' which do not generalise from R to more general fields.
Our first idea might be to use some sort of measure like

‖α‖ = max_{i,j} |a_ij|,

where (a_ij) is the matrix of α with respect to the standard bases. However, ‖α‖ so defined has no geometric meaning.

Exercise 6.2.1. Show by example that ‖α‖, so defined, may depend on the coordinate axes chosen.

Even if we insist that our method of measuring the size of a linear map
shall have a geometric meaning, this does not give a unique method. The
following chain of ideas gives one method which is natural and standard.

Lemma 6.2.2. If α : R^m → R^p is linear, there exists a constant K(α) such that

‖αx‖ ≤ K(α)‖x‖

for all x ∈ R^m.

Proof. Since our object is merely to show that some K(α) exists and not to find a 'good' value, we can use the crudest inequalities. If we write y = αx, we have

‖αx‖ = ‖y‖ ≤ Σ_{i=1}^p |y_i| ≤ Σ_{i=1}^p Σ_{j=1}^m |a_ij||x_j| ≤ (Σ_{i=1}^p Σ_{j=1}^m |a_ij|) ‖x‖.

The required result follows on putting K(α) = Σ_{i=1}^p Σ_{j=1}^m |a_ij|.
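The crude constant K(α) = Σ|a_ij| is easy to check by experiment. The sketch below (my own illustration; the matrix and the sample vectors are arbitrary) verifies ‖αx‖ ≤ K(α)‖x‖ on random vectors for one choice of A.

```python
# Sketch of Lemma 6.2.2: with K = sum of |a_ij| over all entries, we check
# ||alpha x|| <= K * ||x|| on random sample vectors (Euclidean norm).
import math, random

A = [[1.0, -2.0, 0.5],
     [3.0, 0.0, -1.0]]

K = sum(abs(a) for row in A for a in row)   # crude but sufficient constant

def norm(v):
    return math.sqrt(sum(t * t for t in v))

def apply(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

random.seed(0)
ok = all(norm(apply(A, x)) <= K * norm(x)
         for x in ([random.uniform(-1, 1) for _ in range(3)] for _ in range(100)))
print(ok)   # True
```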

Exercise 6.2.3. Use Lemma 6.2.2 to estimate ‖αx − αy‖ and hence deduce that every linear map α : R^m → R^p is continuous. (This exercise takes longer to pose than to do.)

Lemma 6.2.2 tells us that {‖αx‖ : ‖x‖ ≤ 1} is a non-empty subset of R bounded above by K(α) and so has a supremum.

Definition 6.2.4. If α : R^m → R^p is a linear map, then

‖α‖ = sup_{‖x‖≤1} ‖αx‖.


Exercise 6.2.5. If α is as in Definition 6.2.4, show that the three quantities

sup_{‖x‖≤1} ‖αx‖,   sup_{‖x‖=1} ‖αx‖,   and   sup_{x≠0} ‖αx‖/‖x‖

are well defined and equal.

The 'operator norm' just defined in Definition 6.2.4 has many pleasant properties.
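The scaling fact behind Exercise 6.2.5 can be observed directly: for x ≠ 0, linearity gives ‖αx‖/‖x‖ = ‖α(x/‖x‖)‖, so the supremum of the ratio is a supremum over unit vectors. A quick check (my own illustration; matrix and vector arbitrary):

```python
# Sketch behind Exercise 6.2.5: for x != 0, linearity gives
# ||alpha x|| / ||x|| = ||alpha(x / ||x||)||, so the supremum of the ratio
# equals the supremum over the unit sphere.
import math

A = [[2.0, 1.0],
     [0.0, 3.0]]

def norm(v):
    return math.sqrt(sum(t * t for t in v))

def apply(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

x = (0.6, -2.2)
u = tuple(t / norm(x) for t in x)          # x rescaled to the unit sphere

ratio = norm(apply(A, x)) / norm(x)
on_sphere = norm(apply(A, u))
print(abs(ratio - on_sphere) < 1e-9)       # True
```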
129
Please send corrections however trivial to twk@dpmms.cam.ac.uk

Lemma 6.2.6. Let α, β : R^m → R^p be linear maps.
(i) If x ∈ R^m then ‖αx‖ ≤ ‖α‖‖x‖.
(ii) ‖α‖ ≥ 0.
(iii) If ‖α‖ = 0 then α = 0.
(iv) If λ ∈ R then ‖λα‖ = |λ|‖α‖.
(v) (The triangle inequality) ‖α + β‖ ≤ ‖α‖ + ‖β‖.
(vi) If γ : R^p → R^q is linear, then ‖γα‖ ≤ ‖γ‖‖α‖.

Proof. I will prove parts (i) and (vi), leaving the equally easy remaining parts as an essential exercise for the reader.
(i) If x = 0, we observe that α0 = 0 and so

‖α0‖ = ‖0‖ = 0 ≤ ‖α‖‖0‖

as required. If x ≠ 0, we set u = ‖x‖^{−1}x. Since

‖u‖ = ‖x‖^{−1}‖x‖ = 1,

we have ‖αu‖ ≤ ‖α‖ and so

‖αx‖ = ‖α(‖x‖u)‖ = ‖x‖‖αu‖ ≤ ‖α‖‖x‖

as required.
(vi) If ‖x‖ ≤ 1 then, using part (i) twice,

‖γα(x)‖ = ‖γ(α(x))‖ ≤ ‖γ‖‖α(x)‖ ≤ ‖γ‖‖α‖‖x‖ ≤ ‖γ‖‖α‖.

It follows that

‖γα‖ = sup_{‖x‖≤1} ‖γα(x)‖ ≤ ‖γ‖‖α‖.
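Properties (v) and (vi) can be tested concretely in the 2 × 2 case, where the operator norm has a closed form (the square root of the largest eigenvalue of AᵀA, a standard fact not proved in this section). The sketch below (my own illustration; the matrices are arbitrary) checks both inequalities for one example.

```python
# Sketch: for 2 x 2 matrices the operator norm is the square root of the
# largest eigenvalue of A^T A, which has a closed form; we use it to check
# the triangle inequality (v) and submultiplicativity (vi) on an example.
import math

def op_norm_2x2(A):
    (a, b), (c, d) = A
    # entries of the symmetric matrix M = A^T A
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d
    # largest eigenvalue of [[p, q], [q, r]]
    lam = (p + r) / 2 + math.sqrt(((p - r) / 2) ** 2 + q * q)
    return math.sqrt(lam)

def add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[1.0, 2.0], [0.0, 1.0]]
B = [[0.0, -1.0], [3.0, 0.5]]

print(op_norm_2x2(add(A, B)) <= op_norm_2x2(A) + op_norm_2x2(B) + 1e-12)
print(op_norm_2x2(mul(A, B)) <= op_norm_2x2(A) * op_norm_2x2(B) + 1e-12)
```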




Exercise 6.2.7. (i) Write down a linear map α : R^2 → R^2 such that α ≠ 0 but α^2 = 0.
(ii) Show that we cannot replace the inequality in (vi) of Lemma 6.2.6 by an equality.
(iii) Show that we cannot replace the inequality in (v) of Lemma 6.2.6 by an equality.

Exercise 6.2.8. (i) Suppose that α : R → R is a linear map and that its matrix with respect to the standard bases is (a). Show that

‖α‖ = |a|.

(ii) Suppose that α : R^m → R is a linear map and that its matrix with respect to the standard bases is (a_1 a_2 ... a_m). By using the Cauchy-Schwarz inequality (Lemma 4.1.2) and the associated conditions for equality (Exercise 4.1.5 (i)), show that

‖α‖ = (Σ_{j=1}^m a_j^2)^{1/2}.
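The content of part (ii) can be seen numerically (my own illustration; the vector a and the sampling are arbitrary): by Cauchy-Schwarz, |a · u| ≤ ‖a‖ for every unit vector u, and equality is attained at u = a/‖a‖.

```python
# Sketch of Exercise 6.2.8 (ii): for alpha(x) = a . x, Cauchy-Schwarz gives
# |a . u| <= ||a|| for every unit vector u, with equality at u = a / ||a||,
# so the operator norm of alpha is the Euclidean norm of a.
import math, random

a = [3.0, -4.0, 12.0]
norm_a = math.sqrt(sum(t * t for t in a))     # 13.0

def alpha(x):
    return sum(s * t for s, t in zip(a, x))

random.seed(1)
samples = []
for _ in range(1000):
    v = [random.gauss(0, 1) for _ in range(3)]
    n = math.sqrt(sum(t * t for t in v))
    samples.append(abs(alpha([t / n for t in v])))

u = [t / norm_a for t in a]                   # the maximising unit vector
print(max(samples) <= norm_a + 1e-9, abs(alpha(u) - norm_a) < 1e-9)
```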

Although the operator norm is, in principle, calculable (see Exercises K.98 to K.101), the reader is warned that, except in special cases, there is no simple formula for the operator norm and it is mainly used as a theoretical tool. Should we need to have some idea of its size, extremely rough estimates will often suffice.
Exercise 6.2.9. Suppose that α : R^m → R^p is a linear map and that its matrix with respect to the standard bases is A = (a_ij). Show that

max_{i,j} |a_ij| ≤ ‖α‖ ≤ pm max_{i,j} |a_ij|.

By using the Cauchy-Schwarz inequality, show that

‖α‖ ≤ (Σ_{i=1}^p Σ_{j=1}^m a_ij^2)^{1/2}.

Show that this inequality implies the corresponding inequality in the previous paragraph.
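The rough estimates of Exercise 6.2.9 can be watched on an example (my own illustration; the matrix is arbitrary, and I again use the closed form for the 2 × 2 operator norm, an assumption not established in this section).

```python
# Sketch of Exercise 6.2.9 for a 2 x 2 example: the operator norm (computed
# here from the largest eigenvalue of A^T A) is sandwiched between
# max|a_ij| and p*m*max|a_ij|, and bounded by the root-sum-of-squares.
import math

A = [[1.0, -2.0], [3.0, 0.5]]
p = m = 2

(a, b), (c, d) = A
pp, q, r = a * a + c * c, a * b + c * d, b * b + d * d
op = math.sqrt((pp + r) / 2 + math.sqrt(((pp - r) / 2) ** 2 + q * q))

mx = max(abs(t) for row in A for t in row)              # max |a_ij|
frob = math.sqrt(sum(t * t for row in A for t in row))  # root sum of squares

print(mx <= op <= p * m * mx)   # True
print(op <= frob)               # True
```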
We now return to differentiation. Suppose that f : R^m → R^p and g : R^p → R^q are differentiable. What can we say about their composition g ∘ f? To simplify the algebra, let us suppose that f(0) = 0 and g(0) = 0 (so g ∘ f(0) = 0) and ask about the differentiability of g ∘ f at 0. Suppose that the derivative of f at 0 is α and the derivative of g at 0 is β. Then

f(h) ≈ αh

when h is small (h ∈ R^m) and

g(k) ≈ βk

when k is small (k ∈ R^p). It ought, therefore, to be true that

g(f(h)) ≈ β(αh),

i.e. that
