
For the perturbed system Ãx̃ = b̃, forward analysis provides an estimate of
the error x̃ − x in terms of Ã − A and b̃ − b, while backward analysis estimates
the perturbations δA = (δaij) and δb = (δbi) which should be impressed on the
entries of A and b in order to get (A + δA)xn = b + δb, xn being the solution of
the linear system (computed somehow). Finally, a posteriori error analysis looks
for an estimate of the error x − xn as a function of the residual rn = b − Axn.

We will develop this analysis in Section 3.1.

It is important to point out the role played by the a posteriori analysis in
devising strategies for adaptive error control. These strategies, by suitably
changing the discretization parameters (for instance, the spacing between
nodes in the numerical integration of a function or a differential equation),
employ the a posteriori analysis in order to ensure that the error does not
exceed a fixed tolerance.
A numerical method that makes use of an adaptive error control is called an
adaptive numerical method. In practice, a method of this kind applies the idea
of feedback to the computational process, by activating, on the grounds of a
computed solution, a convergence test which ensures the control of the error
within a fixed tolerance. If the convergence test fails, a suitable strategy
for modifying the discretization parameters is automatically adopted in order
to enhance the accuracy of the solution to be newly computed, and the overall
procedure is iterated until the convergence check is passed.

2.4 Sources of Error in Computational Models
Whenever the numerical problem (2.12) is an approximation to the mathematical
problem (2.1) and this latter is in turn a model of a physical problem
(which will be shortly denoted by PP), we shall say that (2.12) is a
computational model for PP.
In this process the global error, denoted by e, is expressed by the difference
between the actually computed solution, xn, and the physical solution, xph, of
which x provides a model. The global error e can thus be interpreted as being
the sum of the error em of the mathematical model, given by x − xph, and the
error ec of the computational model, xn − x, that is e = em + ec (see
Figure 2.1).

[FIGURE 2.1. Errors in computational models: the physical problem PP with
solution xph; the mathematical model F(x, d) = 0 with solution x (model error
em); the numerical problem Fn(xn, dn) = 0 with solution xn (errors en and ea);
e denotes the global error.]

The error em will in turn take into account the error of the mathematical
model in a strict sense (that is, the extent to which the functional equation
(2.1) realistically describes the problem PP) and the error in the data
(that is, how accurately d provides a measure of the real physical data).
In the same way, ec turns out to be the combination of the numerical
discretization error en = xn − x, the error ea introduced by the numerical
algorithm and the roundoff error introduced by the computer during the
actual solution of problem (2.12) (see Section 2.5).
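As a purely illustrative sketch (the numerical values below are made up, not taken from the text), the splitting e = em + ec can be checked directly:

```python
import math

# Hypothetical values for one quantity: a "physical" solution, the exact
# solution of the mathematical model, and the computed solution.
x_ph = 2.00700   # physical solution x_ph (e.g. a measured value)
x    = 2.00649   # exact solution x of the mathematical model
x_n  = 2.00652   # actually computed solution x_n of the numerical model

e_m = x - x_ph   # error of the mathematical model
e_c = x_n - x    # error of the computational model
e   = x_n - x_ph # global error

# The global error is the sum of the two contributions: e = e_m + e_c.
assert math.isclose(e, e_m + e_c)
```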
In general, we can thus outline the following sources of error:
1. errors due to the model, that can be controlled by a proper choice of
the mathematical model;
2. errors in the data, that can be reduced by enhancing the accuracy in
the measurement of the data themselves;
3. truncation errors, arising from having replaced limiting processes in the
numerical model by operations that involve a finite number of steps;
4. rounding errors.
The errors in items 3 and 4 give rise to the computational error. A
numerical method will thus be convergent if this error can be made arbi-
trarily small by increasing the computational effort. Of course, convergence
is the primary, albeit not unique, goal of a numerical method, the others
being accuracy, reliability and efficiency.
Accuracy means that the errors are small with respect to a fixed tol-
erance. It is usually quantified by the order of infinitesimal of the error
en with respect to the characteristic discretization parameter (for instance
the largest grid spacing between the discretization nodes). By the way, we
notice that machine precision does not limit, on theoretical grounds, the
accuracy.
Reliability means it is likely that the global error can be guaranteed to be
below a certain tolerance. Of course, a numerical model can be considered
to be reliable only if suitably tested, that is, successfully applied to several
test cases.
Efficiency means that the computational complexity that is needed to
control the error (that is, the amount of operations and the size of the
memory required) is as small as possible.

Having encountered the term algorithm several times in this section, we
cannot refrain from providing an intuitive description of it. By algorithm
we mean a directive that indicates, through elementary operations, all the
passages that are needed to solve a specific problem. An algorithm can in
turn contain sub-algorithms and must have the feature of terminating after
a finite number of elementary operations. As a consequence, the executor
of the algorithm (machine or human being) must find within the algorithm
itself all the instructions to completely solve the problem at hand (provided
that the necessary resources for its execution are available).
For instance, the statement that a polynomial of second degree surely
admits two roots in the complex plane does not characterize an algorithm,
whereas the formula yielding the roots is an algorithm (provided that the
sub-algorithms needed to correctly execute all the operations have been
defined in turn).
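The root formula can indeed be cast as an executable algorithm; the sketch below (our own naming) uses Python's cmath so that complex roots are covered, as the statement above requires:

```python
import cmath

def quadratic_roots(a, b, c):
    """Roots of a*x**2 + b*x + c = 0 (a != 0) in the complex plane."""
    if a == 0:
        raise ValueError("not a second-degree polynomial")
    d = cmath.sqrt(b * b - 4 * a * c)   # square root of the discriminant
    return (-b + d) / (2 * a), (-b - d) / (2 * a)
```

For example, `quadratic_roots(1, 0, 1)` yields the complex pair ±i for x² + 1 = 0.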
Finally, the complexity of an algorithm is a measure of its execution
time. Calculating the complexity of an algorithm is therefore a part of the
analysis of the efficiency of a numerical method. Since several algorithms,
with different complexities, can be employed to solve the same problem P,
it is useful to introduce the concept of complexity of a problem, this latter
meaning the complexity of the algorithm that has minimum complexity
among those solving P. The complexity of a problem is typically measured
by a parameter directly associated with P. For instance, in the case of
the product of two square matrices, the computational complexity can be
expressed as a function of a power of the matrix size n (see [Str69]).
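As an illustration, the naive product of two n-by-n matrices performs on the order of n³ multiplications, which is the complexity parameter alluded to above (Strassen's algorithm [Str69] lowers the exponent); this is a sketch, not the book's code:

```python
def matmul(A, B):
    """Naive product of two n-by-n matrices: n**3 scalar multiplications."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

The triple loop makes the cubic cost explicit: each of the n² entries of C requires n multiplications.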

2.5 Machine Representation of Numbers
Any machine operation is affected by rounding errors or roundoff. They are
due to the fact that on a computer only a finite subset of the set of real
numbers can be represented. In this section, after recalling the positional
notation of real numbers, we introduce their machine representation.

2.5.1 The Positional System
Let a base β ∈ N be fixed with β ≥ 2, and let x be a real number with
a finite number of digits xk with 0 ≤ xk < β for k = −m, . . . , n. The
notation (conventionally adopted)

xβ = (−1)^s [xn xn−1 . . . x1 x0 . x−1 x−2 . . . x−m],   xn ≠ 0        (2.26)

is called the positional representation of x with respect to the base β. The
point between x0 and x−1 is called the decimal point if the base is 10, the
binary point if the base is 2, while s depends on the sign of x (s = 0 if x is
positive, 1 if negative). Relation (2.26) actually means

xβ = (−1)^s Σ_{k=−m}^{n} xk β^k.

Example 2.10 The conventional writing x10 = 425.33 denotes the number x =
4·10^2 + 2·10 + 5 + 3·10^−1 + 3·10^−2, while x6 = 425.33 would denote the
real number x = 4·6^2 + 2·6 + 5 + 3·6^−1 + 3·6^−2. A rational number can of
course have a finite number of digits in one base and an infinite number of
digits in another base. For example, the fraction 1/3 has infinitely many
digits in base 10, being x10 = 0.3̄, while it has only one digit in base 3,
being x3 = 0.1.
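Example 2.10 can be reproduced mechanically; the helper below (our own naming, not from the text) evaluates a positional representation in an arbitrary base β:

```python
def positional_value(digits_int, digits_frac, beta):
    """Value of a positional representation in base beta, as in (2.26)."""
    v = 0.0
    for d in digits_int:                    # integer digits x_n ... x_0
        v = v * beta + d
    for k, d in enumerate(digits_frac, 1):  # fractional digits x_-1 ... x_-m
        v += d * beta ** (-k)
    return v
```

With digits 4, 2, 5 before the point and 3, 3 after it, base 10 gives 425.33 while base 6 gives 4·36 + 2·6 + 5 + 3/6 + 3/36, exactly as in the example.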

Any real number can be approximated by numbers having a finite repre-
sentation. Indeed, having fixed the base β, the following property holds:

∀ε > 0, ∀xβ ∈ R, ∃yβ ∈ R such that |yβ − xβ| < ε,

where yβ has a finite positional representation.
In fact, given the positive number xβ = xn xn−1 . . . x0 . x−1 . . . x−m . . .
with a number of digits, finite or infinite, for any r ≥ 1 one can build the
two numbers

xβ^(l) = Σ_{k=0}^{r−1} xn−k β^{n−k},   xβ^(u) = xβ^(l) + β^{n−r+1},

having r digits, such that xβ^(l) ≤ xβ < xβ^(u) and xβ^(u) − xβ^(l) = β^{n−r+1}.
If r is chosen in such a way that β^{n−r+1} < ε, then taking yβ equal to xβ^(l)
or xβ^(u) yields the desired inequality. This result legitimates the computer
representation of real numbers (and thus by a finite number of digits).
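The construction of xβ^(l) and xβ^(u) can be sketched as follows; the function name and the argument convention (n is the weight of the leading digit) are our own choices:

```python
import math

def truncation_bounds(x, beta, r, n):
    """r-digit lower/upper neighbours of x > 0, whose leading digit has
    weight beta**n: returns (x_lo, x_hi) with x_lo <= x < x_hi."""
    step = beta ** (n - r + 1)           # weight of the last kept digit
    x_lo = math.floor(x / step) * step   # keep the first r digits, drop the rest
    return x_lo, x_lo + step             # the two bounds differ by beta**(n-r+1)
```

For x = π (so n = 0 in base 10) and r = 4, the bounds are 3.141 and 3.142, and choosing r large enough makes their distance 10^(n−r+1) smaller than any prescribed ε.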

Although theoretically speaking all the bases are equivalent, in computa-
tional practice three bases are generally employed: base 2 or binary,
base 10 or decimal (the most natural) and base 16 or hexadecimal. Almost
all modern computers use base 2, apart from a few which traditionally
employ base 16. In what follows, we will assume that β is an even integer.
In the binary representation, digits reduce to the two symbols 0 and 1,
called bits (binary digits), while in the hexadecimal case the symbols used
for the representation of the digits are 0, 1, . . . , 9, A, B, C, D, E, F.
Clearly, the smaller the adopted base, the longer the string of characters
needed to represent the same number.
To simplify notation, we shall write x instead of xβ, leaving the base β
understood.

2.5.2 The Floating-point Number System
Assume a given computer has N memory positions in which to store any
number. The most natural way to make use of these positions in the rep-
resentation of a real number x di¬erent from zero is to ¬x one of them for
its sign, N ’ k ’ 1 for the integer digits and k for the digits beyond the
point, in such a way that
x = (’1)s · [aN ’2 aN ’3 . . . ak . ak’1 . . . a0 ] (2.27)
s being equal to 1 or 0. Notice that one memory position is equivalent to
one bit storage only when β = 2. The set of numbers of this kind is called
¬xed-point system. Equation (2.27) stands for
N ’2
x = (’1) · β
aj β j (2.28)

and therefore this representation amounts to fixing a scaling factor for all
the representable numbers.
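A minimal sketch of the fixed-point idea: every number is stored as an integer with one global, implicit scaling factor (here β = 2 and k = 16 are arbitrary illustrative choices, not prescribed by the text):

```python
def to_fixed(x, beta=2, k=16):
    """Encode x as an integer with an implicit scaling factor beta**(-k)."""
    return round(x * beta ** k)

def from_fixed(i, beta=2, k=16):
    """Recover the real value represented by the stored integer i."""
    return i * beta ** (-k)
```

Since the scaling factor is the same for all numbers, the round trip is accurate only to within half the grid spacing 2^(−16), illustrating why fixed point limits the representable range.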
The use of fixed point strongly limits the values of the minimum and maxi-
mum numbers that can be represented on the computer, unless a very large
number N of memory positions is employed. This drawback can be easily
overcome if the scaling in (2.28) is allowed to vary. In such a case,
given a nonvanishing real number x, its floating-point representation is
given by

x = (−1)^s · (0.a1 a2 . . . at) · β^e = (−1)^s · m · β^{e−t}        (2.29)

where t ∈ N is the number of allowed significant digits ai (with
0 ≤ ai ≤ β − 1), m = a1 a2 . . . at is an integer called the mantissa such that
0 ≤ m ≤ β^t − 1 and e is an integer called the exponent. Clearly, the exponent
can vary within a finite interval of admissible values: we let L ≤ e ≤ U
(typically L < 0 and U > 0). The N memory positions are now distributed
among the sign (one position), the significant digits (t positions) and the
digits for the exponent (the remaining N − t − 1 positions). The number
zero has a separate representation.
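The decomposition (2.29) can be inspected on an actual machine; assuming IEC/IEEE double precision (β = 2, t = 53) and Python's standard math.frexp:

```python
import math

# Decompose x = (-1)**s * m * beta**(e - t) with beta = 2, t = 53.
x = -6.25
s = 0 if x > 0 else 1
frac, e = math.frexp(abs(x))   # abs(x) = frac * 2**e with 0.5 <= frac < 1
m = int(frac * 2 ** 53)        # integer mantissa: 2**52 <= m < 2**53

# Reassembling sign, mantissa and exponent recovers x exactly.
assert (-1) ** s * m * 2.0 ** (e - 53) == x
```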
Typically, on the computer there are two formats available for the floating-
point number representation: single and double precision. In the case of bi-
nary representation, these formats correspond in the standard version to
the representation with N = 32 bits (single precision)

s (1 bit) | e (8 bits) | m (23 bits)

and with N = 64 bits (double precision)

s (1 bit) | e (11 bits) | m (52 bits)
Let us denote by

F(β, t, L, U) = {0} ∪ { x ∈ R : x = (−1)^s β^e Σ_{i=1}^{t} ai β^{−i} }

the set of floating-point numbers with t significant digits, base β ≥ 2,
0 ≤ ai ≤ β − 1, and range (L, U) with L ≤ e ≤ U.
In order to enforce uniqueness in a number representation, it is typi-
cally assumed that a1 ≠ 0 and m ≥ β^{t−1}. In such an event a1 is called
the principal significant digit, while at is the last significant digit and the
representation of x is called normalized. The mantissa m now varies
between β^{t−1} and β^t − 1.
For instance, in the case β = 10, t = 4, L = −1 and U = 4, without
the assumption that a1 ≠ 0, the number 1 would admit the following
representations: 0.1000 · 10^1, 0.0100 · 10^2, 0.0010 · 10^3, 0.0001 · 10^4.

To always have uniqueness in the representation, it is assumed that the
number zero also has its own sign (typically s = 0 is assumed).
It can be immediately noticed that if x ∈ F(β, t, L, U) then also −x ∈
F(β, t, L, U). Moreover, the following lower and upper bounds hold for the
absolute value of x:

xmin = β^{L−1} ≤ |x| ≤ β^U (1 − β^{−t}) = xmax.        (2.30)

The cardinality of F(β, t, L, U) (henceforth shortly denoted by F) is

card F = 2(β − 1)β^{t−1}(U − L + 1) + 1.

From (2.30) it turns out that it is not possible to represent any number
(apart from zero) whose absolute value is less than xmin. This latter limi-
tation can be overcome by completing F with the set FD of the floating-point
de-normalized numbers, obtained by removing the assumption that a1 is
nonnull, only for the numbers that are referred to the minimum exponent
L. In such a way the uniqueness in the representation is not lost and it is
possible to generate numbers that have mantissa between 1 and β^{t−1} − 1
and belong to the interval (−β^{L−1}, β^{L−1}). The smallest number in this
set has absolute value equal to β^{L−t}.
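The gap below xmin and its de-normalized filling can be observed in IEC/IEEE double precision, where β^{L−1} = 2^{−1022} and β^{L−t} = 2^{−1074}:

```python
import sys

# Smallest normalized double: beta**(L-1) with beta = 2, L = -1021.
x_min = sys.float_info.min
assert x_min == 2.0 ** -1022

# Smallest de-normalized double: beta**(L-t) with t = 53.
tiny = 5e-324                  # decimal literal that rounds to 2**-1074
assert tiny == 2.0 ** -1074

# De-normalized numbers fill the gap between 0 and x_min.
assert 0.0 < tiny < x_min
```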

Example 2.11 The positive numbers in the set F(2, 3, −1, 2) are

(0.111) · 2^2 = 7/2,   (0.110) · 2^2 = 3,    (0.101) · 2^2 = 5/2,   (0.100) · 2^2 = 2,
(0.111) · 2 = 7/4,     (0.110) · 2 = 3/2,    (0.101) · 2 = 5/4,     (0.100) · 2 = 1,
(0.111) = 7/8,         (0.110) = 3/4,        (0.101) = 5/8,         (0.100) = 1/2,
(0.111) · 2^−1 = 7/16, (0.110) · 2^−1 = 3/8, (0.101) · 2^−1 = 5/16, (0.100) · 2^−1 = 1/4.

They are included between xmin = β^{L−1} = 2^{−2} = 1/4 and xmax = β^U (1 − β^{−t}) =
2^2 (1 − 2^{−3}) = 7/2. As a whole, we have (β − 1)β^{t−1}(U − L + 1) = (2 − 1)2^{3−1}(2 +
1 + 1) = 16 strictly positive numbers. Their opposites must be added to them,
as well as the number zero. We notice that when β = 2, the first significant digit
in the normalized representation is necessarily equal to 1 and thus it may not be
stored in the computer (in such an event, we call it the hidden bit).
When considering also the positive de-normalized numbers, we should complete
the above set by adding the following numbers:

(.011)2 · 2^−1 = 3/16,   (.010)2 · 2^−1 = 1/8,   (.001)2 · 2^−1 = 1/16.

According to what was previously stated, the smallest de-normalized number is
β^{L−t} = 2^{−1−3} = 1/16. •
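The counting in Example 2.11 can be verified by brute force; the enumeration below (our own sketch) generates all strictly positive normalized numbers of F(2, 3, −1, 2) as m · β^{e−t}:

```python
beta, t, L, U = 2, 3, -1, 2

# Normalized positive numbers: integer mantissa m in [beta**(t-1), beta**t - 1],
# exponent e in [L, U], value m * beta**(e - t).
normalized = sorted(m * beta ** (e - t)
                    for e in range(L, U + 1)
                    for m in range(beta ** (t - 1), beta ** t))

# Cardinality and bounds match (2.30) and the count in Example 2.11.
assert len(normalized) == (beta - 1) * beta ** (t - 1) * (U - L + 1)  # 16
assert normalized[0] == 0.25 and normalized[-1] == 3.5                # x_min, x_max
```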

2.5.3 Distribution of Floating-point Numbers
The floating-point numbers are not equally spaced along the real line, but
they get dense close to the smallest representable number. It can be checked
that the spacing between a number x ∈ F and its nearest neighbour y ∈ F,
where both x and y are assumed to be nonnull, is at least β^{−1} ε_M |x| and
at most ε_M |x|, ε_M = β^{1−t} being the machine epsilon. This latter
represents the distance between the number 1 and the nearest floating-point
number, and therefore it is the smallest number of F such that 1 + ε_M > 1.
Having instead fixed an interval of the form [β^e, β^{e+1}], the numbers of F
that belong to such an interval are equally spaced and have distance equal
to β^{e−t}. Decreasing (or increasing) the exponent by one gives rise to a
decrement (or increment) by a factor β of the distance between consecutive
numbers.
Unlike the absolute distance, the relative distance between two consecu-
tive numbers has a periodic behavior which depends only on the mantissa
m. Indeed, denoting by (−1)^s m(x) β^{e−t} one of the two numbers, the dis-
tance Δx from the successive one is equal to (−1)^s β^{e−t}, which implies that
the relative distance is

Δx / x = (−1)^s β^{e−t} / ((−1)^s m(x) β^{e−t}) = 1 / m(x).        (2.31)

Within the interval [β^e, β^{e+1}], the ratio in (2.31) is decreasing as x
increases, since in the normalized representation the mantissa varies from
β^{t−1} to β^t − 1 (not included). However, as soon as x = β^{e+1}, the relative
distance gets back to the value β^{−t+1} and starts decreasing again on the
successive intervals, as shown in Figure 2.2. This oscillatory phenomenon is
called wobbling precision, and the greater the base β, the more pronounced
the effect. This is another reason why small bases are preferably employed
in computers.
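Both behaviors can be observed in double precision (β = 2, t = 53) with Python's math.ulp and math.nextafter, available from Python 3.9 onward:

```python
import math

# Absolute spacing is constant within a binade [2**e, 2**(e+1)) and equals
# 2**(e + 1 - t): for e = 2, t = 53 that is 2**-50.
assert math.ulp(4.0) == math.ulp(7.0) == 2.0 ** -50
assert math.ulp(8.0) == 2.0 ** -49      # next binade: spacing grows by beta

# Relative spacing decreases across a binade, then jumps back up at the
# next power of the base (wobbling precision).
x_below = math.nextafter(2.0, 0.0)      # largest float strictly below 2.0
assert math.ulp(2.0) / 2.0 > math.ulp(x_below) / x_below
```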

2.5.4 IEC/IEEE Arithmetic
The possibility of building sets of floating-point numbers that differ in base,
number of significant digits and range of the exponent has prompted in the
past the development, for almost any computer, of a particular system F. In
order to avoid this proliferation of numerical systems, a standard has been
fixed that is nowadays almost universally accepted. This standard was de-
veloped in 1985 by the Institute of Electrical and Electronics Engineers
(shortly, IEEE) and was approved in 1989 by the International Electro-
technical Commission (IEC) as the international standard IEC559, and it is
now known by this name (IEC is an organization analogous to the International
Organization for Standardization (ISO) in the field of electronics). The stan-
dard IEC559 endorses two formats for the floating-point numbers: a basic
format, made by the system F(2, 24, −125, 128) for the single precision
and by F(2, 53, −1021, 1024) for the double precision, both including the
de-normalized numbers, and an extended format, for which only the main
limitations are fixed (see Table 2.1).

[FIGURE 2.2. Variation of the relative distance for the set of numbers
F(2, 24, −125, 128) of IEC/IEEE single precision]

        single       double
N       ≥ 43 bits    ≥ 79 bits
t       ≥ 32         ≥ 64
L       ≤ −1021      ≤ −16381
U       ≥ 1024       ≥ 16384

TABLE 2.1. Lower or upper limits in the standard IEC559 for the extended
format of floating-point numbers
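The double-precision parameters of the basic format can be checked directly through Python's sys.float_info:

```python
import sys

fi = sys.float_info
# Double precision matches F(2, 53, -1021, 1024) of the IEC559 basic format:
assert fi.mant_dig == 53           # t, the number of significant digits
assert fi.min_exp == -1021         # L, the minimum exponent
assert fi.max_exp == 1024          # U, the maximum exponent
assert fi.epsilon == 2.0 ** -52    # machine epsilon, beta**(1 - t)
```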

Almost all computers nowadays satisfy the requirements above. We
summarize in Table 2.2 the special codings that are used in IEC559 to
deal with the values ±0, ±∞ and with the so-called non-numbers (shortly,
NaN, that is, not a number), which correspond for instance to 0/0 or to
other exceptional operations.

value    exponent    mantissa
±0       L − 1       0
±∞       U + 1       0
NaN      U + 1       ≠ 0

TABLE 2.2. IEC559 codings of some exceptional values
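These special values behave as follows in an IEC559-conforming arithmetic (in Python, dividing by zero raises an exception, so the NaN below is produced by ∞ − ∞ instead of 0/0):

```python
import math

inf = float("inf")             # coded with exponent U + 1, zero mantissa
nan = float("nan")             # coded with exponent U + 1, nonzero mantissa
assert math.isinf(inf) and math.isnan(nan)

# An exceptional operation on infinities produces a NaN ...
assert math.isnan(inf - inf)
# ... and a NaN compares unequal even to itself.
assert nan != nan
```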

2.5.5 Rounding of a Real Number in its Machine Representation
The fact that on any computer only a subset F(β, t, L, U ) of R is actually
