2.5 Machine Representation of Numbers 51

of any given real number. In this regard, notice that, even if x and y were two numbers in F, the result of an operation on them does not necessarily belong to F. Therefore, we must define an arithmetic also on F.

The simplest approach to solving the first problem consists of rounding x ∈ R in such a way that the rounded number belongs to F. Among all the possible rounding operations, let us consider the following one. Given x ∈ R in the normalized positional notation (2.29), let us substitute x by its representative fl(x) in F, defined as

\[
fl(x) = (-1)^s \, (0.\,a_1 a_2 \ldots a_{t-1} \widetilde{a}_t) \cdot \beta^e,
\qquad
\widetilde{a}_t =
\begin{cases}
a_t & \text{if } a_{t+1} < \beta/2, \\
a_t + 1 & \text{if } a_{t+1} \ge \beta/2.
\end{cases}
\tag{2.32}
\]

The mapping fl : R → F is the most commonly used and is called rounding (in chopping one would take, more trivially, ã_t = a_t). Clearly, fl(x) = x if x ∈ F and, moreover, fl(x) ≤ fl(y) if x ≤ y ∀x, y ∈ R (monotonicity property).
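The rounding map (2.32) can be simulated in a few lines. The sketch below is ours (Python, for illustration only; the book's own snippets use MATLAB): it rounds a positive real to t significant base-β digits by inspecting the first discarded digit a_{t+1}.

```python
import math

def fl(x, beta=10, t=2):
    """Round x > 0 to t significant base-beta digits, as in (2.32)."""
    if x == 0:
        return 0.0
    e = math.floor(math.log(x, beta)) + 1   # exponent: mantissa lies in [1/beta, 1)
    m = x / beta ** e                       # mantissa 0.a1 a2 a3 ...
    scaled = m * beta ** t                  # a1 ... at . a(t+1) ...
    digits = math.floor(scaled)             # chopping would stop here
    if scaled - digits >= 0.5:              # a(t+1) >= beta/2: round up
        digits += 1
    return digits * beta ** (e - t)

print(fl(0.987), fl(0.984))   # 0.99 0.98
```

Note the monotonicity property can also be observed: rounding a smaller number never overtakes the rounding of a larger one.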

Remark 2.2 (Overflow and underflow) Everything written so far holds only for the numbers that in (2.29) have exponent e within the range of F. If, indeed, x ∈ (−∞, −x_max) ∪ (x_max, ∞), the value fl(x) is not defined, while if x ∈ (−x_min, x_min) the operation of rounding is defined anyway (even in the absence of de-normalized numbers). In the first case, if x is the result of an operation on numbers of F, we speak of overflow, in the second case of underflow (or graceful underflow if de-normalized numbers are accounted for). Overflow is handled by the system through an interrupt of the executing program.

Apart from these exceptional situations, we can easily quantify the error, absolute and relative, that is made by substituting fl(x) for x. The following result can be shown (see, for instance, [Hig96], Theorem 2.2).

Property 2.1 If x ∈ R is such that x_min ≤ |x| ≤ x_max, then

\[
fl(x) = x(1 + \delta) \quad \text{with } |\delta| \le u,
\tag{2.33}
\]

where

\[
u = \frac{1}{2} \beta^{1-t} = \frac{1}{2} \epsilon_M
\tag{2.34}
\]

is the so-called roundoff unit (or machine precision).

As a consequence of (2.33), the following bound holds for the relative error

\[
E_{rel}(x) = \frac{|x - fl(x)|}{|x|} \le u,
\tag{2.35}
\]

while, for the absolute error, one gets

\[
E(x) = |x - fl(x)| \le \beta^{e-t} \, |(a_1 \ldots a_t . a_{t+1} \ldots) - (a_1 \ldots \widetilde{a}_t)|.
\]


From (2.32), it follows that

\[
|(a_1 \ldots a_t . a_{t+1} \ldots) - (a_1 \ldots \widetilde{a}_t)| \le \frac{\beta}{2}\,\beta^{-1},
\]

from which

\[
E(x) \le \frac{1}{2} \beta^{-t+e}.
\]

Remark 2.3 In the MATLAB environment the value of ε_M is immediately available through the system variable eps.
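For readers working outside MATLAB: in IEEE double precision β = 2 and t = 53, so ε_M = 2⁻⁵² and u = 2⁻⁵³. A quick check (Python here, purely for illustration; the MATLAB equivalent is the eps variable above):

```python
import sys

eps = sys.float_info.epsilon   # machine epsilon: beta**(1 - t) = 2**-52 for doubles
u = eps / 2                    # roundoff unit (2.34)

assert eps == 2.0 ** -52
assert 1.0 + eps > 1.0         # 1 + eps_M is the next machine number after 1
assert 1.0 + u == 1.0          # adding exactly u rounds back to 1 (ties to even)
print(eps, u)
```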

2.5.6 Machine Floating-point Operations

As previously stated, it is necessary to define on the set of machine numbers an arithmetic which is analogous, as far as possible, to the arithmetic in R. Thus, given any arithmetic operation ∘ : R × R → R on two operands in R (the symbol ∘ may denote sum, subtraction, multiplication or division), we shall denote by ⊚ the corresponding machine operation

⊚ : R × R → F,   x ⊚ y = fl(fl(x) ∘ fl(y)).

From the properties of floating-point numbers one could expect that, for the operations on two operands, whenever well defined, the following property holds: ∀x, y ∈ F, ∃δ ∈ R such that

x ⊚ y = (x ∘ y)(1 + δ)  with |δ| ≤ u.  (2.36)

In order for (2.36) to be satisfied when ∘ is the subtraction operator, an additional assumption on the structure of the numbers in F will be required, namely the presence of the so-called round digit (which is addressed at the end of this section). In particular, when ∘ is the sum operator, it follows that for all x, y ∈ F (see Exercise 11)

\[
\frac{|x + y - (x ⊚ y)|}{|x + y|} \le u(1 + u)\,\frac{|x| + |y|}{|x + y|} + u,
\tag{2.37}
\]

so that the relative error associated with every machine operation will be small, unless x + y is itself small. The sum of two numbers close in modulus but opposite in sign deserves a separate comment: in that case x + y can be quite small, which generates the so-called cancellation errors (as evidenced in Example 2.6).
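Cancellation is easy to reproduce. In the sketch below (Python, purely illustrative; the value of δ and the tolerance are our choices) the computed difference (1 + δ) − 1 should equal δ, yet almost all significant digits are lost:

```python
delta = 1e-15
computed = (1.0 + delta) - 1.0        # two numbers close in modulus, opposite sign
rel_err = abs(computed - delta) / delta
print(computed, rel_err)              # rel_err is about 0.1, not about 1e-16
```

The subtraction itself is exact; the damage was done when 1 + δ was rounded, and the cancellation merely exposes it.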

It is important to notice that, while some properties of standard arithmetic are preserved when passing to floating-point arithmetic (for instance, the commutativity of the sum of two addends, or of the product of two factors), other properties are lost. An example is given by the associativity of the sum: it can indeed be shown (see Exercise 12) that in general

x ⊚ (y ⊚ z) ≠ (x ⊚ y) ⊚ z.
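The loss of associativity shows up already with innocuous decimal constants; a quick check in Python (the example values are ours):

```python
x, y, z = 0.1, 0.2, 0.3

left = (x + y) + z
right = x + (y + z)
print(left, right)                  # 0.6000000000000001 vs 0.6
assert left != right                # associativity is lost in F ...
assert abs(left - right) < 1e-15    # ... though only at roundoff level
assert x + y == y + x               # commutativity, by contrast, survives
```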

We shall denote by flop the single elementary floating-point operation (sum, subtraction, multiplication or division); the reader is warned that in some texts flop identifies an operation of the form a + b · c. According to this convention, a scalar product between two vectors of length n requires 2n − 1 flops, a matrix-vector product (2m − 1)n flops if the matrix is n × m and, finally, a matrix-matrix product (2r − 1)mn flops if the two matrices are m × r and r × n respectively.
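These counts simply total one multiplication per pair of entries plus the additions inside each inner product. A small Python tally (function names `dot_flops` and `matvec_flops` are ours) makes the per-row accounting explicit:

```python
def dot_flops(u, v):
    """Return (dot product, flop count): n multiplications + (n - 1) additions."""
    n = len(u)
    s = u[0] * v[0]
    flops = 1
    for i in range(1, n):
        s += u[i] * v[i]        # one multiply + one add
        flops += 2
    return s, flops

def matvec_flops(A, v):
    """Matrix-vector product: one length-m dot product per row of A."""
    result, total = [], 0
    for row in A:
        r, f = dot_flops(row, v)
        result.append(r)
        total += f
    return result, total

u = [1.0, 2.0, 3.0, 4.0]
_, f = dot_flops(u, u)
assert f == 2 * len(u) - 1              # 2n - 1 flops for a scalar product
A = [[1.0] * 5 for _ in range(3)]       # 3 x 5 matrix
_, f = matvec_flops(A, [1.0] * 5)
print(f)                                # 3 rows, 2*5 - 1 flops each
assert f == 3 * (2 * 5 - 1)
```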

Remark 2.4 (IEC559 arithmetic) The IEC559 standard also defines a closed arithmetic on F, meaning that any operation on it produces a result that can be represented within the system itself, although not necessarily the one expected from a pure mathematical standpoint. As an example, in Table 2.3 we report the results that are obtained in exceptional situations.

exception           examples      result
invalid operation   0/0, 0 · ∞    NaN
overflow                          ±∞
division by zero    1/0           ±∞
underflow                         subnormal numbers

TABLE 2.3. Results for some exceptional operations

The presence of a NaN (Not a Number) in a sequence of operations automatically implies that the result is a NaN. General acceptance of this standard is still ongoing.
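The closed-arithmetic behaviour of Table 2.3 can be observed directly in any IEEE environment, for instance in Python, whose float is an IEC559/IEEE 754 double:

```python
import math, sys

big = 1.0e308
inf = big * 10.0                     # overflow -> +-inf
assert math.isinf(inf) and inf > 0

nan = 0.0 * inf                      # invalid operation 0 * inf -> NaN
assert math.isnan(nan)
assert math.isnan(nan + 1.0)         # a NaN propagates through any further operation

tiny = 1.0e-308 / 1.0e10             # underflow -> subnormal (de-normalized) number
assert 0.0 < tiny < sys.float_info.min
print(inf, nan, tiny)
```

One caveat: for the division 1/0 Python raises a ZeroDivisionError at the language level rather than returning ±∞; the IEEE result of Table 2.3 is what the underlying hardware produces and is visible, e.g., through NumPy floating-point scalars.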

We mention that not all floating-point systems satisfy (2.36). One of the main reasons is the absence of the round digit in subtraction, that is, an extra bit that comes into play at the mantissa level when the subtraction between two floating-point numbers is performed. To demonstrate the importance of the round digit, let us consider a system F with β = 10 and t = 2, and let us subtract 0.99 from 1. Aligning the second operand to the exponent of the first and dropping the shifted-out digit, we get

    10^1 · 0.10
  − 10^1 · 0.09
  -------------
    10^1 · 0.01  →  10^0 · 0.10

that is, the result differs from the exact one by a factor of 10. If we now execute the same subtraction using the round digit, we obtain the exact result. Indeed

    10^1 · 0.10 0
  − 10^1 · 0.09 9
  ---------------
    10^1 · 0.00 1  →  10^{-1} · 0.10

In fact, it can be shown that addition and subtraction, if executed without the round digit, do not satisfy the property

fl(x ± y) = (x ± y)(1 + δ)  with |δ| ≤ u,

but the following one

fl(x ± y) = x(1 + α) ± y(1 + β)  with |α| + |β| ≤ u.

An arithmetic for which this latter property holds is called aberrant. In some computers the round digit does not exist, most of the care being devoted to computational speed. Nowadays, however, the trend is to use even two round digits (see [HP94] for technical details about the subject).
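The effect of the round digit can be reproduced exactly with integer mantissas. The sketch below (Python; the representation and the function name `toy_subtract` are ours) redoes the subtraction 1 − 0.99 in the toy system β = 10, t = 2, first discarding the shifted-out digit and then keeping one guard digit:

```python
from fractions import Fraction

def toy_subtract(a_mant, a_exp, b_mant, b_exp, t=2, guard=0):
    """Compute 0.a_mant * 10**a_exp - 0.b_mant * 10**b_exp in a toy decimal
    system with t digits, keeping `guard` extra digits while aligning."""
    shift = a_exp - b_exp                            # align b to a's exponent
    b_digits = ("0" * shift + b_mant)[: t + guard]   # digits beyond t+guard are lost
    a_digits = a_mant + "0" * guard
    diff = int(a_digits) - int(b_digits)             # exact integer subtraction
    return Fraction(diff, 10 ** (t + guard)) * Fraction(10) ** a_exp

# 1 = 10^1 * 0.10 and 0.99 = 10^0 * 0.99, as in the worked example
no_guard = toy_subtract("10", 1, "99", 0, guard=0)
with_guard = toy_subtract("10", 1, "99", 0, guard=1)
print(no_guard, with_guard)      # 1/10 vs 1/100: only the guard digit recovers 0.01
```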

2.6 Exercises

1. Use (2.7) to compute the condition number K(d) of the following expressions

   (1)  x − a^d = 0, a > 0;    (2)  d − x + 1 = 0,

   d being the datum, a a parameter and x the "unknown".
   [Solution : (1) K(d) ≃ |d| |log a|, (2) K(d) = |d|/|d + 1|.]

2. Study the well-posedness and the conditioning in the infinity norm of the following problem as a function of the datum d: find x and y such that

   x + dy = 1,
   dx + y = 0.

   [Solution : the given problem is a linear system whose matrix is
   A = \begin{pmatrix} 1 & d \\ d & 1 \end{pmatrix}.
   It is well-posed if A is nonsingular, i.e., if d ≠ ±1. In such a case, K∞(A) = |(|d| + 1)/(|d| − 1)|.]
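The stated K∞(A) can be verified numerically. The helper below (Python; function names ours) builds the 2 × 2 inverse explicitly and uses the fact that the induced infinity norm is the maximum absolute row sum:

```python
def inf_norm(M):
    """Induced infinity norm of a 2x2 matrix: maximum absolute row sum."""
    return max(abs(r[0]) + abs(r[1]) for r in M)

def cond_inf(d):
    A = [[1.0, d], [d, 1.0]]
    det = 1.0 - d * d                       # nonsingular iff d != +-1
    Ainv = [[1.0 / det, -d / det], [-d / det, 1.0 / det]]
    return inf_norm(A) * inf_norm(Ainv)

for d in (0.5, 0.9, -0.5):
    predicted = abs((abs(d) + 1.0) / (abs(d) - 1.0))
    assert abs(cond_inf(d) - predicted) < 1e-10 * predicted
print(cond_inf(0.5))
```

As d approaches ±1 the matrix tends to singularity and K∞(A) blows up, matching the formula.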

3. Study the conditioning of the solution formula x± = −p ± √(p² + q) for the second-degree equation x² + 2px − q = 0 with respect to changes in the parameters p and q separately.
   [Solution : K(p) = |p|/√(p² + q), K(q) = |q|/(2|x±| √(p² + q)).]

4. Consider the following Cauchy problem

\[
\begin{cases}
x'(t) = x_0 e^{at} (a \cos(t) - \sin(t)), & t > 0, \\
x(0) = x_0,
\end{cases}
\tag{2.38}
\]

   whose solution is x(t) = x₀ e^{at} cos(t) (a is a given real number). Study the conditioning of (2.38) with respect to the choice of the initial datum and check that on unbounded intervals it is well conditioned if a < 0, while it is ill conditioned if a > 0.
   [Hint : consider the definition of K_abs(a).]

5. Let x̃ ≠ 0 be an approximation of a non-null quantity x. Find the relation between the relative error ε = |x − x̃|/|x| and E = |x − x̃|/|x̃|.
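One route to the answer (a sketch of ours, not the book's printed solution): since |x̃| ≥ |x| − |x − x̃| = |x|(1 − ε) and |x̃| ≤ |x|(1 + ε), one obtains for ε < 1

```latex
\frac{\varepsilon}{1+\varepsilon} \;\le\; E \;\le\; \frac{\varepsilon}{1-\varepsilon},
```

so the two measures of relative error coincide at first order whenever either of them is small.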

6. Find a stable formula for evaluating the square root of a complex number.

7. Determine all the elements of the set F(10, 6, −9, 9), in both the normalized and de-normalized cases.

8. Consider the set F_D of the de-normalized numbers and study the behavior of the absolute distance and of the relative distance between two of these numbers. Does the wobbling precision effect arise again?
   [Hint : for these numbers, uniformity in the relative density is lost. As a consequence, the absolute distance remains constant (equal to β^{L−t}), while the relative one rapidly grows as x tends to zero.]

9. What is the value of 0⁰ in IEEE arithmetic?
   [Solution : ideally, the outcome should be NaN. In practice, IEEE systems recover the value 1. A motivation of this result can be found in [Gol91].]

10. Show that, due to cancellation errors, the sequence

\[
I_0 = \log\frac{6}{5}, \qquad I_k + 5 I_{k-1} = \frac{1}{k}, \quad k = 1, 2, \ldots, n,
\tag{2.39}
\]

    is not well suited to finite arithmetic computation of the integral I_n = \int_0^1 \frac{x^n}{x+5}\,dx when n is sufficiently large, although it works in infinite arithmetic.
    [Hint : consider the perturbed initial datum Ĩ₀ = I₀ + μ₀ and study the propagation of the error μ₀ within (2.39).]
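The hint quantifies the disaster: the perturbation obeys μ_k = −5 μ_{k−1}, so μ_n = (−5)ⁿ μ₀, and in double precision an initial μ₀ of order u already destroys the result for moderate n. A Python illustration (the choice n = 40 is ours):

```python
import math

I = math.log(6.0 / 5.0)          # I_0, already carrying an O(u) rounding error
for k in range(1, 41):
    I = 1.0 / k - 5.0 * I        # I_k = 1/k - 5 I_{k-1}: multiplies the error by -5

# The true value satisfies 0 < I_40 < 1/(5*41), but the inherited error,
# amplified by 5**40 (about 1e28), completely swamps it:
print(I)
assert not (0.0 < I < 1.0)
```

Running the recurrence backwards, from a rough guess for a large index down to small k, divides the error by 5 at each step instead and is the standard cure.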

11. Prove (2.37).
    [Solution : notice that

\[
\frac{|x + y - (x ⊚ y)|}{|x + y|} \le \frac{|x ⊚ y - (fl(x) + fl(y))|}{|x + y|} + \frac{|fl(x) - x + fl(y) - y|}{|x + y|}.
\]

    Then, use (2.36) and (2.35).]

12. Given x, y, z ∈ F with x + y, y + z, x + y + z falling within the range of F, show that

    |(x ⊚ y) ⊚ z − (x + y + z)| ≤ C₁ (2|x + y| + |z|) u,
    |x ⊚ (y ⊚ z) − (x + y + z)| ≤ C₂ (|x| + 2|y + z|) u.

13. Which among the following approximations of π,

\[
\pi = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \ldots\right),
\]
\[
\pi = 6\left(0.5 + \frac{(0.5)^3}{2 \cdot 3} + \frac{3\,(0.5)^5}{2 \cdot 4 \cdot 5} + \frac{3 \cdot 5\,(0.5)^7}{2 \cdot 4 \cdot 6 \cdot 7} + \ldots\right),
\tag{2.40}
\]

    better limits the propagation of rounding errors? Compare, using MATLAB, the results obtained as a function of the number of terms in each sum in (2.40).
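The comparison can also be run outside MATLAB; here is a sketch of ours in Python (the term counts are our choice). The first series alternates with slowly decaying terms, while the second is 6 arcsin(0.5), whose positive terms decay geometrically:

```python
import math

def pi_leibniz(n):
    s = 0.0
    for k in range(n):
        s += (-1) ** k / (2 * k + 1)
    return 4.0 * s

def pi_arcsin(n):
    # 6 * arcsin(0.5); term_k = c_k * x**(2k+1) / (2k+1), c_{k+1} = c_k*(2k+1)/(2k+2)
    x, c, s = 0.5, 1.0, 0.0
    for k in range(n):
        s += c * x ** (2 * k + 1) / (2 * k + 1)
        c *= (2 * k + 1) / (2 * k + 2)
    return 6.0 * s

print(abs(pi_leibniz(1000) - math.pi))   # still about 1e-3 after 1000 terms
print(abs(pi_arcsin(30) - math.pi))      # near machine precision after 30 terms
```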

14. Analyze the stability, with respect to propagation of rounding errors, of the following two MATLAB codes to evaluate f(x) = (eˣ − 1)/x for |x| ≪ 1:

    % Algorithm 1
    if x == 0
      f = 1;
    else
      f = (exp(x) - 1) / x;
    end

    % Algorithm 2
    y = exp(x);
    if y == 1
      f = 1;
    else
      f = (y - 1) / log(y);
    end

    [Solution : the first algorithm is inaccurate due to cancellation errors, while the second one (in the presence of the round digit) is stable and accurate.]
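The same experiment transcribed to Python (tolerances are our choice): y − 1 and log(y) carry nearly identical rounding errors that cancel in the quotient, which is why Algorithm 2 works; the standard-library routine math.expm1 computes eˣ − 1 without cancellation and is the idiomatic fix.

```python
import math

x = 1e-15
f_true = 1.0 + x / 2.0                  # (e^x - 1)/x = 1 + x/2 + x^2/6 + ...

f1 = (math.exp(x) - 1.0) / x            # Algorithm 1: cancellation in exp(x) - 1
y = math.exp(x)
f2 = 1.0 if y == 1.0 else (y - 1.0) / math.log(y)    # Algorithm 2
f3 = math.expm1(x) / x                  # library routine for e^x - 1

print(f1, f2, f3)
assert abs(f1 - f_true) > 1e-2          # Algorithm 1 has lost most digits
assert abs(f2 - f_true) < 1e-9          # Algorithm 2 stays accurate
assert abs(f3 - f_true) < 1e-9
```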

15. In binary arithmetic one can show [Dek71] that the rounding error in the sum of two numbers a and b, with a ≥ b, can be computed as

    ((a + b) − a) − b.

    Based on this property, a method called Kahan compensated sum has been proposed to compute the sum of n addends a_i in such a way that the rounding errors are compensated. In practice, letting the initial rounding error e₁ = 0 and s₁ = a₁, at the i-th step, with i ≥ 2, the algorithm evaluates y_i = a_i − e_{i−1}, updates the sum as s_i = s_{i−1} + y_i, and computes the new rounding error as e_i = (s_i − s_{i−1}) − y_i. Implement this algorithm in MATLAB and check its accuracy by evaluating again the second expression in (2.40).
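A direct transcription of the algorithm (Python here, although the exercise asks for MATLAB; the adversarial data set below is our choice), checked against math.fsum, which returns the correctly rounded sum:

```python
import math

def kahan_sum(a):
    """Kahan compensated summation following the recursion in the text:
    y_i = a_i - e_{i-1}, s_i = s_{i-1} + y_i, e_i = (s_i - s_{i-1}) - y_i."""
    s, e = a[0], 0.0
    for x in a[1:]:
        y = x - e                # re-inject the rounding error of the previous step
        s_new = s + y
        e = (s_new - s) - y      # rounding error committed by this addition
        s = s_new
    return s

# 10000 tiny addends after a large one: naive summation loses every one of them
data = [1.0] + [1e-16] * 10000
naive = sum(data)                # stays at 1.0: each 1e-16 is rounded away
compensated = kahan_sum(data)
exact = math.fsum(data)          # correctly rounded sum, about 1 + 1e-12

print(naive, compensated, exact)
```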

16. The area A(T) of a triangle T with sides a, b and c can be computed using the following formula

    A(T) = √(p(p − a)(p − b)(p − c)),

    where p is half the perimeter of T. Show that, in the case of strongly deformed triangles (a ≃ b + c), this formula lacks accuracy, and check this experimentally.
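A Python sketch of the experiment (the triangle data are our choice, in the spirit of W. Kahan's needle-triangle examples, and the alternative formula below, which sorts the sides a ≥ b ≥ c and parenthesizes so that no catastrophic difference is formed, is due to Kahan, not to the book):

```python
import math

def heron(a, b, c):
    p = (a + b + c) / 2.0
    return math.sqrt(p * (p - a) * (p - b) * (p - c))   # cancellation in p - a

def stable_area(a, b, c):
    a, b, c = sorted((a, b, c), reverse=True)           # a >= b >= c
    # Kahan's rearrangement: the parentheses are essential
    return 0.25 * math.sqrt((a + (b + c)) * (c - (a - b))
                            * (c + (a - b)) * (a + (b - c)))

# sanity check on a well-shaped triangle: the (3, 4, 5) area is exactly 6
assert heron(3.0, 4.0, 5.0) == 6.0
assert stable_area(3.0, 4.0, 5.0) == 6.0

# needle triangle with a close to b + c: compare the two results
a, b, c = 100000.0, 99999.99979, 0.00029
print(heron(a, b, c), stable_area(a, b, c))
```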

3
Direct Methods for the Solution of Linear Systems

A system of m linear equations in n unknowns consists of a set of algebraic relations of the form

\[
\sum_{j=1}^{n} a_{ij} x_j = b_i, \qquad i = 1, \ldots, m,
\tag{3.1}
\]

where the x_j are the unknowns, the a_{ij} are the coefficients of the system and the b_i are the components of the right-hand side. System (3.1) can be more conveniently written in matrix form as

Ax = b,  (3.2)

where we have denoted by A = (a_{ij}) ∈ C^{m×n} the coefficient matrix, by b = (b_i) ∈ C^m the right-hand side vector and by x = (x_i) ∈ C^n the unknown vector, respectively. We call a solution of (3.2) any n-tuple of values x_i which satisfies (3.1).

In this chapter we shall mainly be dealing with real-valued square systems of order n, that is, systems of the form (3.2) with A ∈ R^{n×n} and b ∈ R^n. In such cases existence and uniqueness of the solution of (3.2) are ensured if one of the following (equivalent) hypotheses holds:

1. A is invertible;

2. rank(A) = n;

3. the homogeneous system Ax = 0 admits only the null solution.


The solution of system (3.2) is formally provided by Cramer's rule

\[
x_j = \frac{\Delta_j}{\det(A)}, \qquad j = 1, \ldots, n,
\tag{3.3}
\]

where Δ_j is the determinant of the matrix obtained by substituting the j-th column of A with the right-hand side b. This formula is, however, of little practical use. Indeed, if the determinants are evaluated by the recursive relation (1.4), the computational effort of Cramer's rule is of the order of (n + 1)! flops and therefore turns out to be unacceptable even for small dimensions of A (for instance, a computer able to perform 10⁹ flops per second would take 9.6 · 10⁴⁷ years to solve a linear system of only 50 equations).
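For small systems Cramer's rule is at least easy to state in code. The toy Python version below (ours; determinants by recursive Laplace expansion, in the spirit of relation (1.4)) solves a 3 × 3 system, and its run time makes the factorial growth of the work evident as n increases:

```python
def det(M):
    """Determinant by recursive Laplace expansion along the first row."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(minor)
    return total

def cramer(A, b):
    """Solve Ax = b via x_j = det(A_j) / det(A), A_j = A with column j <- b."""
    d = det(A)
    xs = []
    for j in range(len(A)):
        Aj = [row[:j] + [bi] + row[j + 1:] for row, bi in zip(A, b)]
        xs.append(det(Aj) / d)
    return xs

A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [3.0, 5.0, 3.0]
print(cramer(A, b))     # the exact solution is (1, 1, 1)
```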

For this reason, numerical methods that are alternatives to Cramer's rule have been developed. They are called direct methods if they yield the solution of the system in a finite number of steps, iterative if they require (theoretically) an infinite number of steps. Iterative methods will be addressed in the next chapter. We notice from now on that the choice between a direct and an iterative method does not depend only on the theoretical efficiency of the scheme, but also on the particular type of matrix, on memory storage requirements and, finally, on the architecture of the computer.

3.1 Stability Analysis of Linear Systems

Solving a linear system by a numerical method invariably leads to the introduction of rounding errors. Only the use of stable numerical methods can prevent the propagation of such errors from polluting the accuracy of the solution. In this section two aspects of stability analysis will be addressed. Firstly, we will analyze the sensitivity of the solution of (3.2) to changes in the data A and b (forward a priori analysis). Secondly, assuming that an approximate solution x̂ of (3.2) is available, we shall quantify the perturbations on the data A and b for which x̂ is the exact solution of a perturbed system (backward a priori analysis). The size of these perturbations will in turn allow us to measure the accuracy of the computed solution x̂ by the use of a posteriori analysis.

3.1.1 The Condition Number of a Matrix

The condition number of a matrix A ∈ C^{n×n} is defined as

K(A) = ‖A‖ ‖A⁻¹‖,  (3.4)

where ‖·‖ is an induced matrix norm. In general K(A) depends on the choice of the norm; this will be made clear by introducing a subscript into the notation: for instance, K∞(A) = ‖A‖∞ ‖A⁻¹‖∞. More generally, K_p(A) will denote the condition number of A in the p-norm. Remarkable instances are p = 1, p = 2 and p = ∞ (we refer to Exercise 1 for the relations among K₁(A), K₂(A) and K∞(A)).

As already noticed in Example 2.3, an increase in the condition number produces a higher sensitivity of the solution of the linear system to changes in the data. Let us start by noticing that K(A) ≥ 1, since

1 = ‖AA⁻¹‖ ≤ ‖A‖ ‖A⁻¹‖ = K(A).

Moreover, K(A⁻¹) = K(A) and, ∀α ∈ C with α ≠ 0, K(αA) = K(A). Finally, if A is orthogonal, K₂(A) = 1, since ‖A‖₂ = √(ρ(AᵀA)) = √(ρ(I)) = 1 and A⁻¹ = Aᵀ. The condition number of a singular matrix is set equal to infinity.

For p = 2, K₂(A) can be characterized as follows. Starting from (1.21), it can be proved that

\[
K_2(A) = \|A\|_2 \|A^{-1}\|_2 = \frac{\sigma_1(A)}{\sigma_n(A)},
\]

where σ₁(A) and σₙ(A) are the maximum and minimum singular values of A (see Property 1.7). As a consequence, in the case of symmetric positive definite matrices we have

λ_max