
the error x − x̂ in terms of A − Â and b − b̂, while backward analysis estimates the perturbations δA = (δa_ij) and δb = (δb_i) which should be impressed to the entries of A and b in order to get (A + δA)x̂ = b + δb, x̂ being the solution of the linear system (computed somehow). Finally, a posteriori error analysis looks for an estimate of the error x − x̂ as a function of the residual r = b − Ax̂. We will develop this analysis in Section 3.1.
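The a posteriori idea can be illustrated on a tiny linear system. The sketch below is ours, not the book's: since the error of a computed solution satisfies x − x̂ = A⁻¹r with r = b − Ax̂, it is bounded by ‖A⁻¹‖·‖r‖, a bound computable from the residual alone.

```python
# Hypothetical 2x2 example illustrating the a posteriori, residual-based
# error estimate: x - xh = A^{-1} r, hence ||x - xh|| <= ||A^{-1}|| * ||r||.

def mat_vec(A, v):
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

# Exact solution of this 2x2 system (det A = 11): x = A^{-1} b.
x = [1.0 / 11.0, 7.0 / 11.0]

# A perturbed "computed" solution xh and its residual r = b - A xh.
xh = [x[0] + 1e-8, x[1] + 1e-8]
r = [bi - yi for bi, yi in zip(b, mat_vec(A, xh))]

# Infinity norms; for A^{-1} = (1/11) [[3, -1], [-1, 4]] the norm is 5/11.
err = max(abs(xi - xhi) for xi, xhi in zip(x, xh))
bound = (5.0 / 11.0) * max(abs(ri) for ri in r)
print(err <= bound)  # the a posteriori bound holds
```

Note that the bound is computable without knowing x: only A and the residual of x̂ enter it, which is what makes the a posteriori viewpoint practical.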

It is important to point out the role played by the a posteriori analysis in devising strategies for adaptive error control. These strategies, by suitably changing the discretization parameters (for instance, the spacing between nodes in the numerical integration of a function or a differential equation), employ the a posteriori analysis in order to ensure that the error does not exceed a fixed tolerance.

A numerical method that makes use of an adaptive error control is called an adaptive numerical method. In practice, a method of this kind applies the idea of feedback in the computational process, by activating, on the grounds of a computed solution, a convergence test which ensures the control of the error within a fixed tolerance. In case the convergence test fails, a suitable strategy for modifying the discretization parameters is automatically adopted in order to enhance the accuracy of the solution to be newly computed, and the overall procedure is iterated until the convergence check is passed.
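The feedback loop just described can be sketched for the numerical integration of a function: the spacing is halved until an a posteriori error estimate falls below the tolerance. The code below is a minimal illustration of ours, not an algorithm from the book; the Richardson-type estimate (new − old)/3 for the trapezoidal rule is the standard one.

```python
import math

# Sketch of adaptive error control: the spacing h is halved, on the
# grounds of a computed a posteriori estimate, until the estimated
# error falls below a fixed tolerance.

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

def adaptive_trapezoid(f, a, b, tol):
    n, old = 1, trapezoid(f, a, b, 1)
    while True:
        n *= 2
        new = trapezoid(f, a, b, n)
        # For the trapezoidal rule, the error of `new` is close to
        # (new - old) / 3 (Richardson extrapolation): the a posteriori
        # estimate driving the feedback loop.
        if abs(new - old) / 3.0 < tol:
            return new
        old = new

approx = adaptive_trapezoid(math.sin, 0.0, math.pi, 1e-8)
print(abs(approx - 2.0) < 1e-7)  # the exact integral of sin on [0, pi] is 2
```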

2.4 Sources of Error in Computational Models

Whenever the numerical problem (2.12) is an approximation to the mathematical problem (2.1) and this latter is in turn a model of a physical problem (which will be shortly denoted by PP), we shall say that (2.12) is a computational model for PP.

In this process the global error, denoted by e, is expressed by the difference between the actually computed solution, x̂_n, and the physical solution, x_ph, of which x provides a model. The global error e can thus be interpreted as being the sum of the error e_m of the mathematical model, given by x − x_ph, and the error e_c of the computational model, x̂_n − x, that is e = e_m + e_c (see Figure 2.1).

FIGURE 2.1. Errors in computational models. [Diagram: the physical solution x_ph of PP, the solution x of the mathematical model F(x, d) = 0 and the computed solution x̂_n of the numerical problem F_n(x_n, d_n) = 0, connected by the errors e_m, e_n, e_a, e_c and the global error e.]

The error e_m will in turn take into account the error of the mathematical model in a strict sense (that is, the extent to which the functional equation (2.1) does realistically describe the problem PP) and the error on the data (that is, how accurately d provides a measure of the real physical data). In the same way, e_c turns out to be the combination of the numerical discretization error e_n = x_n − x, the error e_a introduced by the numerical algorithm and the roundoff error introduced by the computer during the actual solution of problem (2.12) (see Section 2.5).

In general, we can thus outline the following sources of error:

1. errors due to the model, that can be controlled by a proper choice of the mathematical model;

2. errors in the data, that can be reduced by enhancing the accuracy in the measurement of the data themselves;

3. truncation errors, arising from having replaced in the numerical model limits by operations that involve a finite number of steps;

4. rounding errors.

The errors in items 3 and 4 give rise to the computational error. A numerical method will thus be convergent if this error can be made arbitrarily small by increasing the computational effort. Of course, convergence is the primary, albeit not unique, goal of a numerical method, the others being accuracy, reliability and efficiency.

Accuracy means that the errors are small with respect to a fixed tolerance. It is usually quantified by the order of infinitesimal of the error e_n with respect to the discretization characteristic parameter (for instance the largest grid spacing between the discretization nodes). By the way, we notice that machine precision does not limit, on theoretical grounds, the accuracy.

Reliability means it is likely that the global error can be guaranteed to be below a certain tolerance. Of course, a numerical model can be considered to be reliable only if suitably tested, that is, successfully applied to several test cases.

Efficiency means that the computational complexity that is needed to control the error (that is, the amount of operations and the size of the memory required) is as small as possible.

Having encountered the term algorithm several times in this section, we cannot refrain from providing an intuitive description of it. By algorithm we mean a directive that indicates, through elementary operations, all the passages that are needed to solve a specific problem. An algorithm can in turn contain sub-algorithms and must have the feature of terminating after a finite number of elementary operations. As a consequence, the executor of the algorithm (machine or human being) must find within the algorithm itself all the instructions to completely solve the problem at hand (provided that the necessary resources for its execution are available).

For instance, the statement that a polynomial of second degree surely admits two roots in the complex plane does not characterize an algorithm, whereas the formula yielding the roots is an algorithm (provided that the sub-algorithms needed to correctly execute all the operations have been defined in turn).
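As an illustration (ours, not from the book), the quadratic formula written as such a directive, with the complex square root playing the role of a sub-algorithm:

```python
import cmath

# The quadratic formula as an algorithm: a finite sequence of elementary
# operations producing the two complex roots of a*z^2 + b*z + c = 0 (a != 0).
def quadratic_roots(a, b, c):
    d = cmath.sqrt(b * b - 4 * a * c)  # sub-algorithm: complex square root
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

print(quadratic_roots(1, 0, 1))   # roots of z^2 + 1 are +-i
print(quadratic_roots(1, -3, 2))  # roots of z^2 - 3z + 2 are 2 and 1
```

Note that the directive terminates after finitely many operations and needs no insight from its executor, which is exactly what the existence statement above lacks.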

Finally, the complexity of an algorithm is a measure of its executing time. Calculating the complexity of an algorithm is therefore a part of the analysis of the efficiency of a numerical method. Since several algorithms, with different complexities, can be employed to solve the same problem P, it is useful to introduce the concept of complexity of a problem, this latter meaning the complexity of the algorithm that has minimum complexity among those solving P. The complexity of a problem is typically measured by a parameter directly associated with P. For instance, in the case of the product of two square matrices, the computational complexity can be expressed as a function of a power of the matrix size n (see [Str69]).
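A small sketch of ours makes the matrix-product example concrete: the naive algorithm performs exactly n³ scalar multiplications, while Strassen's algorithm [Str69] lowers the exponent from 3 to log₂7 ≈ 2.81.

```python
# Naive matrix product: three nested loops performing exactly n^3 scalar
# multiplications, so its complexity is a power of the matrix size n.
def mat_mul(A, B):
    n = len(A)
    mults = 0
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                mults += 1
    return C, mults

I2 = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0], [3.0, 4.0]]
C, mults = mat_mul(A, I2)
print(C == A, mults == 2 ** 3)  # A * I = A, with n^3 = 8 multiplications
```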

2.5 Machine Representation of Numbers

Any machine operation is affected by rounding errors or roundoff. They are due to the fact that on a computer only a finite subset of the set of real numbers can be represented. In this section, after recalling the positional notation of real numbers, we introduce their machine representation.

2.5.1 The Positional System

Let a base β ∈ N be fixed with β ≥ 2, and let x be a real number with a finite number of digits x_k with 0 ≤ x_k < β for k = −m, . . . , n. The notation (conventionally adopted)

x_β = (−1)^s [x_n x_{n−1} . . . x_1 x_0 . x_{−1} x_{−2} . . . x_{−m}],   x_n ≠ 0   (2.26)

is called the positional representation of x with respect to the base β. The point between x_0 and x_{−1} is called decimal point if the base is 10, binary point if the base is 2, while s depends on the sign of x (s = 0 if x is positive, 1 if negative). Relation (2.26) actually means

x_β = (−1)^s Σ_{k=−m}^{n} x_k β^k.

Example 2.10 The conventional writing x_10 = 425.33 denotes the number x = 4·10^2 + 2·10 + 5 + 3·10^{−1} + 3·10^{−2}, while x_6 = 425.33 would denote the real number x = 4·6^2 + 2·6 + 5 + 3·6^{−1} + 3·6^{−2}. A rational number can of course have a finite number of digits in a base and an infinite number of digits in another base. For example, the fraction 1/3 has infinite digits in base 10, being x_10 = 0.3̄, while it has only one digit in base 3, being x_3 = 0.1. •
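The positional representation (2.26) can be evaluated directly from its digits. A small sketch of ours, using exact rational arithmetic, reproduces the two readings of 425.33 from Example 2.10:

```python
from fractions import Fraction

# Evaluate the positional representation x_beta = sum_k x_k * beta^k
# for digits x_n ... x_0 . x_{-1} ... x_{-m}, using exact rationals.
def positional_value(int_digits, frac_digits, beta):
    x = Fraction(0)
    for k, d in enumerate(reversed(int_digits)):
        x += d * Fraction(beta) ** k
    for k, d in enumerate(frac_digits, start=1):
        x += d * Fraction(beta) ** -k
    return x

# The digit string 425.33 read in base 10 and in base 6 (Example 2.10).
x10 = positional_value([4, 2, 5], [3, 3], 10)
x6 = positional_value([4, 2, 5], [3, 3], 6)
print(x10, x6)  # 42533/100 and 1939/12 (= 161 + 7/12)
```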


Any real number can be approximated by numbers having a finite representation. Indeed, having fixed the base β, the following property holds

∀ε > 0, ∀x_β ∈ R, ∃y_β ∈ R such that |y_β − x_β| < ε,

where y_β has finite positional representation.

In fact, given the positive number x_β = x_n x_{n−1} . . . x_0 . x_{−1} . . . x_{−m} . . . with a number of digits, finite or infinite, for any r ≥ 1 one can build the two numbers

x_β^(l) = Σ_{k=0}^{r−1} x_{n−k} β^{n−k},   x_β^(u) = x_β^(l) + β^{n−r+1},

having r digits, such that x_β^(l) ≤ x_β ≤ x_β^(u) and x_β^(u) − x_β^(l) = β^{n−r+1}. If r is chosen in such a way that β^{n−r+1} < ε, then taking y_β equal to x_β^(l) or x_β^(u) yields the desired inequality. This result legitimates the representation of real numbers on a computer (and thus by a finite number of digits).
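The two r-digit neighbours can be computed explicitly. The sketch below (ours; the function name and digit-extraction loop are illustrative) builds x^(l) and x^(u) for x = 1/3 in base 10, whose leading digit sits at position n = −1:

```python
from fractions import Fraction

# Build the r-digit truncation x^(l) and its companion
# x^(u) = x^(l) + beta^(n-r+1), for a positive x whose leading
# digit is at position n.
def truncations(x, beta, n, r):
    xl = Fraction(0)
    y = Fraction(x)
    for k in range(n, n - r, -1):       # keep the r leading digits
        d = int(y / Fraction(beta) ** k)
        xl += d * Fraction(beta) ** k
        y -= d * Fraction(beta) ** k
    return xl, xl + Fraction(beta) ** (n - r + 1)

x = Fraction(1, 3)
xl, xu = truncations(x, 10, -1, 4)      # r = 4 digits: 0.3333 and 0.3334
print(xl <= x <= xu, xu - xl == Fraction(10) ** -4)  # both True
```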

Although all bases are theoretically equivalent, in computational practice three bases are generally employed: base 2 or binary, base 10 or decimal (the most natural) and base 16 or hexadecimal. Almost all modern computers use base 2, apart from a few which traditionally employ base 16. In what follows, we will assume that β is an even integer.

In the binary representation, digits reduce to the two symbols 0 and 1, called bits (binary digits), while in the hexadecimal case the symbols used for the representation of the digits are 0, 1, ..., 9, A, B, C, D, E, F. Clearly, the smaller the adopted base, the longer the string of characters needed to represent the same number.

To simplify notations, we shall write x instead of x_β, leaving the base β understood.

2.5.2 The Floating-point Number System

Assume a given computer has N memory positions in which to store any number. The most natural way to make use of these positions in the representation of a real number x different from zero is to fix one of them for its sign, N − k − 1 for the integer digits and k for the digits beyond the point, in such a way that

x = (−1)^s · [a_{N−2} a_{N−3} . . . a_k . a_{k−1} . . . a_0]   (2.27)

s being equal to 1 or 0. Notice that one memory position is equivalent to one bit of storage only when β = 2. The set of numbers of this kind is called the fixed-point system. Equation (2.27) stands for

x = (−1)^s · β^{−k} Σ_{j=0}^{N−2} a_j β^j   (2.28)

and therefore this representation amounts to fixing a scaling factor for all the representable numbers.
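A fixed-point number is thus just an integer together with the implicit scaling β^{−k} of (2.28). A brief sketch of ours, with β = 2 and k = 16 fractional bits (the so-called "Q16" convention; the names are illustrative, not from the book):

```python
# Fixed-point sketch: every number is an integer scaled by the constant
# factor beta^(-k) of (2.28); here beta = 2 and k = 16 fractional bits.
K = 16
SCALE = 1 << K                   # beta^k = 2^16

def to_fixed(x):
    return round(x * SCALE)      # nearest representable multiple of 2^-16

def from_fixed(i):
    return i / SCALE

a, b = to_fixed(1.5), to_fixed(0.25)
s = a + b                        # fixed-point addition is integer addition
p = (a * b) >> K                 # products need one rescaling by 2^-k
print(from_fixed(s), from_fixed(p))  # 1.75 0.375
```

The single rescaling after the product is the fixed scaling factor at work: it is the same for every representable number, which is precisely the limitation the floating-point system removes.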

The use of fixed point strongly limits the value of the minimum and maximum numbers that can be represented on the computer, unless a very large number N of memory positions is employed. This drawback can easily be overcome if the scaling in (2.28) is allowed to vary. In such a case, given a non vanishing real number x, its floating-point representation is given by

x = (−1)^s · (0.a_1 a_2 . . . a_t) · β^e = (−1)^s · m · β^{e−t}   (2.29)

where t ∈ N is the number of allowed significant digits a_i (with 0 ≤ a_i ≤ β − 1), m = a_1 a_2 . . . a_t is an integer called the mantissa, such that 0 ≤ m ≤ β^t − 1, and e is an integer called the exponent. Clearly, the exponent can vary within a finite interval of admissible values: we let L ≤ e ≤ U (typically L < 0 and U > 0). The N memory positions are now distributed among the sign (one position), the significant digits (t positions) and the digits for the exponent (the remaining N − t − 1 positions). The number zero has a separate representation.

Typically, two formats are available on the computer for the floating-point number representation: single and double precision. In the case of binary representation, these formats correspond, in the standard version, to the representation with N = 32 bits (single precision)

  s (1 bit) | e (8 bits) | m (23 bits)

and with N = 64 bits (double precision)

  s (1 bit) | e (11 bits) | m (52 bits)
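The three fields of the single-precision layout can be inspected directly. A sketch of ours using Python's struct module to read back the 32 bits of a float:

```python
import struct

# Extract the s/e/m fields of the standard 32-bit single-precision
# layout: 1 sign bit, 8 exponent bits, 23 mantissa bits.
def float32_fields(x):
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31                 # 1 sign bit
    e = (bits >> 23) & 0xFF        # 8 exponent bits (stored with a bias)
    m = bits & 0x7FFFFF            # 23 mantissa bits (hidden bit not stored)
    return s, e, m

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.5))   # (1, 128, 2097152), mantissa 0x200000
```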

Let us denote by

F(β, t, L, U) = {0} ∪ { x ∈ R : x = (−1)^s β^e Σ_{i=1}^{t} a_i β^{−i} }

the set of floating-point numbers with t significant digits, base β ≥ 2, 0 ≤ a_i ≤ β − 1, and range (L, U) with L ≤ e ≤ U.

In order to enforce uniqueness in a number representation, it is typically assumed that a_1 ≠ 0 and m ≥ β^{t−1}. In such an event a_1 is called the principal significant digit, while a_t is the last significant digit, and the representation of x is called normalized. The mantissa m now varies between β^{t−1} and β^t − 1.

For instance, in the case β = 10, t = 4, L = −1 and U = 4, without the assumption that a_1 ≠ 0, the number 1 would admit the following representations

0.1000 · 10^1,   0.0100 · 10^2,   0.0010 · 10^3,   0.0001 · 10^4.


To always have uniqueness in the representation, it is assumed that the number zero also has its own sign (typically s = 0 is assumed).

It can be immediately noticed that if x ∈ F(β, t, L, U) then also −x ∈ F(β, t, L, U). Moreover, the following lower and upper bounds hold for the absolute value of x

x_min = β^{L−1} ≤ |x| ≤ β^U (1 − β^{−t}) = x_max.   (2.30)

The cardinality of F(β, t, L, U) (henceforth shortly denoted by F) is

card F = 2(β − 1)β^{t−1}(U − L + 1) + 1.

From (2.30) it turns out that it is not possible to represent any number (apart from zero) whose absolute value is less than x_min. This limitation can be overcome by completing F with the set F_D of the floating-point de-normalized numbers, obtained by removing the assumption that a_1 is non null, only for the numbers that are referred to the minimum exponent L. In such a way the uniqueness of the representation is not lost, and it is possible to generate numbers that have mantissa between 1 and β^{t−1} − 1 and belong to the interval (−β^{L−1}, β^{L−1}). The smallest number in this set has absolute value equal to β^{L−t}.

Example 2.11 The positive numbers in the set F(2, 3, −1, 2) are

(0.111) · 2^2 = 7/2,     (0.110) · 2^2 = 3,      (0.101) · 2^2 = 5/2,     (0.100) · 2^2 = 2,
(0.111) · 2 = 7/4,       (0.110) · 2 = 3/2,      (0.101) · 2 = 5/4,       (0.100) · 2 = 1,
(0.111) = 7/8,           (0.110) = 3/4,          (0.101) = 5/8,           (0.100) = 1/2,
(0.111) · 2^{−1} = 7/16, (0.110) · 2^{−1} = 3/8, (0.101) · 2^{−1} = 5/16, (0.100) · 2^{−1} = 1/4.

They are included between x_min = β^{L−1} = 2^{−2} = 1/4 and x_max = β^U(1 − β^{−t}) = 2^2(1 − 2^{−3}) = 7/2. As a whole, we have (β − 1)β^{t−1}(U − L + 1) = (2 − 1)2^{3−1}(2 + 1 + 1) = 16 strictly positive numbers. Their opposites must be added to them, as well as the number zero. We notice that when β = 2, the first significant digit in the normalized representation is necessarily equal to 1 and thus it may not be stored in the computer (in such an event, we call it the hidden bit).

When considering also the positive de-normalized numbers, we should complete the above set by adding the following numbers

(.011)_2 · 2^{−1} = 3/16,   (.010)_2 · 2^{−1} = 1/8,   (.001)_2 · 2^{−1} = 1/16.

According to what was previously stated, the smallest de-normalized number is β^{L−t} = 2^{−1−3} = 1/16. •
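The counts and bounds of Example 2.11 can be verified by brute force. A sketch of ours enumerates F(2, 3, −1, 2) with exact rational arithmetic, writing each normalized number as m·β^{e−t} as in (2.29):

```python
from fractions import Fraction

# Enumerate F(beta, t, L, U) of Example 2.11 and check the counts and
# bounds stated in the text.
beta, t, L, U = 2, 3, -1, 2

# Normalized numbers: beta^(t-1) <= m <= beta^t - 1, L <= e <= U.
normalized = {Fraction(m) * Fraction(beta) ** (e - t)
              for e in range(L, U + 1)
              for m in range(beta ** (t - 1), beta ** t)}

# De-normalized numbers: 1 <= m <= beta^(t-1) - 1, exponent fixed at L.
denormalized = {Fraction(m) * Fraction(beta) ** (L - t)
                for m in range(1, beta ** (t - 1))}

assert len(normalized) == (beta - 1) * beta ** (t - 1) * (U - L + 1)  # 16
assert min(normalized) == Fraction(1, 4)    # x_min = beta^(L-1)
assert max(normalized) == Fraction(7, 2)    # x_max = beta^U (1 - beta^-t)
assert min(denormalized) == Fraction(1, 16) # beta^(L-t)
print(len(normalized), sorted(denormalized))
```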


2.5.3 Distribution of Floating-point Numbers

The floating-point numbers are not equally spaced along the real line, but they get dense close to the smallest representable number. It can be checked that the spacing between a number x ∈ F and its next nearest y ∈ F, where both x and y are assumed to be non null, is at least β^{−1} ε_M |x| and at most ε_M |x|, ε_M = β^{1−t} being the machine epsilon. This latter represents the distance between the number 1 and the nearest floating-point number, and therefore it is the smallest number of F such that 1 + ε_M > 1.

Having instead fixed an interval of the form [β^e, β^{e+1}], the numbers of F that belong to such an interval are equally spaced and have distance equal to β^{e−t}. Decreasing (or increasing) the exponent by one gives rise to a decrement (or increment) by a factor β of the distance between consecutive numbers.
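These properties are easy to observe for Python's floats, which form F(2, 53, −1021, 1024): a sketch of ours using math.ulp (Python 3.9+), which returns the spacing from a number to the next one.

```python
import math

# For Python's float (beta = 2, t = 53) the machine epsilon is
# eps_M = beta^(1-t) = 2^-52: the distance from 1.0 to the next
# floating-point number, and the spacing is constant on [2^e, 2^(e+1)].
eps = 2.0 ** -52

print(math.ulp(1.0) == eps)          # spacing just above 1 is eps_M
print(1.0 + eps > 1.0)               # eps_M alters the last digit of 1
print(1.0 + eps / 4 == 1.0)          # smaller perturbations are rounded away
print(math.ulp(2.0) == 2 * math.ulp(1.5))  # spacing doubles at x = 2
```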

Unlike the absolute distance, the relative distance between two consecutive numbers has a periodic behavior which depends only on the mantissa m. Indeed, denoting by (−1)^s m(x) β^{e−t} one of the two numbers, the distance Δx from the successive one is equal to (−1)^s β^{e−t}, which implies that the relative distance is

Δx/x = (−1)^s β^{e−t} / ((−1)^s m(x) β^{e−t}) = 1/m(x).   (2.31)

Within the interval [β^e, β^{e+1}], the ratio in (2.31) is decreasing as x increases, since in the normalized representation the mantissa varies from β^{t−1} to β^t − 1 (not included). However, as soon as x = β^{e+1}, the relative distance jumps back to the value β^{−t+1} and starts decreasing again on the successive intervals, as shown in Figure 2.2. This oscillatory phenomenon is called wobbling precision, and the greater the base β, the more pronounced the effect. This is another reason why small bases are preferably employed in computers.
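The wobbling of (2.31) can be observed directly. In a sketch of ours for Python's doubles (β = 2, t = 53), the relative spacing ulp(x)/x decreases from 2^{−52} towards 2^{−53} within [1, 2), then jumps back at x = 2:

```python
import math

# Wobbling precision for Python floats: the relative distance ulp(x)/x
# decreases within each interval [2^e, 2^(e+1)), then jumps back to
# beta^(1-t) = 2^-52 at the next power of 2.
def rel_dist(x):
    return math.ulp(x) / x

print(rel_dist(1.0) == 2.0 ** -52)      # largest, at the left endpoint
print(rel_dist(1.875) < rel_dist(1.0))  # decreases within [1, 2)
print(rel_dist(2.0) == 2.0 ** -52)      # jumps back at x = 2
```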

2.5.4 IEC/IEEE Arithmetic

The possibility of building sets of floating-point numbers that differ in base, number of significant digits and range of the exponent has prompted in the past the development, for almost any computer, of a particular system F. In order to avoid this proliferation of numerical systems, a standard has been fixed that is nowadays almost universally accepted. This standard was developed in 1985 by the Institute of Electrical and Electronics Engineers (shortly, IEEE) and was approved in 1989 by the International Electronical Commission (IEC) as the international standard IEC559, and it is now known by this name (IEC is an organization analogous to the International Standardization Organization (ISO) in the field of electronics). The standard IEC559 endorses two formats for the floating-point numbers: a basic format, made by the system F(2, 24, −125, 128) for the single precision, and by F(2, 53, −1021, 1024) for the double precision, both including the de-normalized numbers, and an extended format, for which only the main limitations are fixed (see Table 2.1).

FIGURE 2.2. Variation of relative distance for the set of numbers F(2, 24, −125, 128) of the IEC/IEEE single precision. [Plot: the relative spacing oscillates between 2^{−24} and 2^{−23} over each interval [2^e, 2^{e+1}], shown here for x between 2^{−126} and 2^{−123}.]

       single      double
N      ≥ 43 bits   ≥ 79 bits
t      ≥ 32        ≥ 64
L      ≤ −1021     ≤ −16381
U      ≥ 1024      ≥ 16384

TABLE 2.1. Lower or upper limits in the standard IEC559 for the extended format of floating-point numbers

Almost all computers nowadays satisfy the requirements above. We summarize in Table 2.2 the special codings that are used in IEC559 to deal with the values ±0, ±∞ and with the so-called non numbers (shortly, NaN, that is, not a number), which correspond for instance to 0/0 or to other exceptional operations.

value   exponent   mantissa
±0      L − 1      0
±∞      U + 1      0
NaN     U + 1      ≠ 0

TABLE 2.2. IEC559 codings of some exceptional values
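These special codings are visible in any IEC559-conforming language. A sketch of ours for Python's double precision: signed zeros, infinities and NaN all exist, and the last line reads back the exponent field to confirm that ±∞ and NaN share the largest exponent value.

```python
import math
import struct

# IEC559 special values in Python's double precision.
pos_zero, neg_zero = 0.0, -0.0
print(pos_zero == neg_zero)                  # equal as numbers...
print(math.copysign(1.0, neg_zero) == -1.0)  # ...but the sign bit is stored
print(math.inf > 1e308, math.isnan(math.nan))

# The stored exponent field (11 bits) of a 64-bit double.
def exponent_field(x):
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF

# Infinity and NaN both carry the all-ones (largest) exponent field;
# they differ in the mantissa, zero for infinity and non-zero for NaN.
print(exponent_field(math.inf) == 0x7FF == exponent_field(math.nan))
```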

2.5.5 Rounding of a Real Number in its Machine Representation

The fact that on any computer only a subset F(β, t, L, U ) of R is actually