2.5 Machine Representation of Numbers

available poses several practical problems, first of all the representation in F of any given real number. In this respect, notice that even if x and y were two numbers in F, the result of an operation on them does not necessarily belong to F. Therefore, we must also define an arithmetic on F.
The simplest approach for solving the first problem consists of rounding x ∈ R in such a way that the rounded number belongs to F. Among all the possible rounding operations, let us consider the following one. Given x ∈ R in the normalized positional notation (2.29), let us substitute x by its representative fl(x) in F, defined as

fl(x) = (−1)^s (0.a_1 a_2 … ã_t) · β^e,    (2.32)

with ã_t = a_t if a_{t+1} < β/2 and ã_t = a_t + 1 if a_{t+1} ≥ β/2. The mapping fl : R → F is the most commonly used and is called rounding (in chopping one would more trivially take ã_t = a_t). Clearly, fl(x) = x
if x ∈ F and, moreover, fl(x) ≤ fl(y) if x ≤ y ∀x, y ∈ R (monotonicity property).
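As an illustration of (2.32), here is a small Python sketch of the rounding map (the function name, the defaults β = 10 and t = 4, and the helper logic are ours, for illustration only):

```python
import math

def fl(x, beta=10, t=4):
    # Keep the first t base-beta digits of the normalized mantissa of x,
    # rounding digit t up when digit t+1 is >= beta/2, as in (2.32).
    if x == 0.0:
        return 0.0
    s = -1.0 if x < 0 else 1.0
    m = abs(x)
    e = math.floor(math.log(m, beta)) + 1   # mantissa m/beta**e lies in [1/beta, 1)
    scaled = (m / beta ** e) * beta ** t    # digits a_1 ... a_t . a_{t+1} ...
    return s * math.floor(scaled + 0.5) * beta ** (e - t)
```

For instance, fl(3.14159, 10, 3) gives 3.14 and fl(-2.71828, 10, 2) gives -2.7, up to the double-precision rounding of the result itself.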

Remark 2.2 (Overflow and underflow) Everything written so far holds only for the numbers that in (2.29) have exponent e within the range of F. If, indeed, x ∈ (−∞, −x_max) ∪ (x_max, ∞) the value fl(x) is not defined, while if x ∈ (−x_min, x_min) the rounding operation is defined anyway (even in the absence of de-normalized numbers). In the first case, if x is the result of an operation on numbers of F, we speak of overflow, in the second case of underflow (or graceful underflow if de-normalized numbers are accounted for). Overflow is handled by the system through an interrupt of the executing program.
Apart from these exceptional situations, we can easily quantify the absolute and relative errors made by substituting fl(x) for x. The following result can be shown (see for instance [Hig96], Theorem 2.2).

Property 2.1 If x ∈ R is such that x_min ≤ |x| ≤ x_max, then

fl(x) = x(1 + δ) with |δ| ≤ u,    (2.33)

where

u = (1/2) β^(1−t) = (1/2) ε_M    (2.34)

is the so-called roundoff unit (or machine precision).

As a consequence of (2.33), the following bound holds for the relative error

E_rel(x) = |x − fl(x)| / |x| ≤ u,    (2.35)

while, for the absolute error, one gets

E(x) = |x − fl(x)| ≤ β^(e−t) |(a_1 … a_t . a_{t+1} …) − (a_1 … ã_t)|.
52 2. Principles of Numerical Mathematics

From (2.32), it follows that

|(a_1 … a_t . a_{t+1} …) − (a_1 … ã_t)| ≤ 1/2,

from which

E(x) ≤ (1/2) β^(−t+e).

Remark 2.3 In the MATLAB environment the value of ε_M is immediately available through the system variable eps.
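For IEEE double precision (β = 2, t = 53) the quantities in (2.33)–(2.34) can be checked directly. The following Python sketch uses exact rational arithmetic to verify the bound on one sample value:

```python
import sys
from fractions import Fraction

# IEEE double precision: beta = 2, t = 53, hence
# eps_M = beta**(1 - t) = 2**-52 (MATLAB's eps) and u = eps_M / 2 = 2**-53.
eps_M = 2.0 ** -52
assert eps_M == sys.float_info.epsilon

# Bound (2.33): fl(x) = x(1 + delta) with |delta| <= u. The literal 0.1 is
# not representable in base 2; Fraction(0.1) recovers the exact value of
# fl(0.1), so delta can be computed with no rounding at all.
exact = Fraction(1, 10)
delta = abs((Fraction(0.1) - exact) / exact)
assert delta <= Fraction(1, 2 ** 53)   # |delta| <= u
```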

2.5.6 Machine Floating-point Operations
As previously stated, it is necessary to define on the set of machine numbers an arithmetic which is analogous, as far as possible, to the arithmetic in R. Thus, given any arithmetic operation ◦ : R × R → R on two operands in R (the symbol ◦ may denote sum, subtraction, multiplication or division), we shall denote by ⊚ the corresponding machine operation

⊚ : R × R → F,   x ⊚ y = fl(fl(x) ◦ fl(y)).

From the properties of floating-point numbers one could expect that, for the operations on two operands, whenever well defined, the following property holds: ∀x, y ∈ F, ∃δ ∈ R such that

x ⊚ y = (x ◦ y)(1 + δ) with |δ| ≤ u.    (2.36)

In order for (2.36) to be satisfied when ◦ is the subtraction operator, an additional assumption on the structure of the numbers in F is required, namely the presence of the so-called round digit (which is addressed at the end of this section). In particular, when ◦ is the sum operator, it follows that for all x, y ∈ F (see Exercise 11)

|x ⊕ y − (x + y)| / |x + y| ≤ u(1 + u) (|x| + |y|) / |x + y| + u,    (2.37)

so that the relative error associated with every machine operation is small, unless x + y is itself small. The sum of two numbers that are close in modulus but opposite in sign deserves a separate comment: in that case x + y can be quite small, generating the so-called cancellation errors (as evidenced in Example 2.6).
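The effect can be reproduced in a few lines. The following Python sketch (our own example, not the book's Example 2.6) evaluates f(x) = (1 − cos x)/x² for small x, whose exact value is close to 1/2:

```python
import math

x = 1.0e-8
# cos(x) is so close to 1 that fl(cos(x)) = 1: the subtraction cancels
# every significant digit and the naive quotient is completely wrong.
naive = (1.0 - math.cos(x)) / x ** 2
# Rewriting via 1 - cos(x) = 2 sin^2(x/2) avoids the subtraction.
stable = 2.0 * math.sin(x / 2) ** 2 / x ** 2
print(naive, stable)
```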

It is important to notice that, while some properties of standard arithmetic are preserved when passing to floating-point arithmetic (for instance, the commutativity of the sum of two addends, or of the product of two factors), other properties are lost. An example is given by the associativity of the sum: it can indeed be shown (see Exercise 12) that in general

x ⊕ (y ⊕ z) ≠ (x ⊕ y) ⊕ z.
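This failure is easy to observe in IEEE double precision; a minimal Python check, with values chosen by us so that the two groupings round differently:

```python
x, y, z = 1.0, 1e-16, 1e-16

# y + z = 2e-16 is large enough to survive the addition to 1.0,
# while adding the two tiny terms to 1.0 one at a time loses both.
left = x + (y + z)     # 1.0000000000000002
right = (x + y) + z    # 1.0
print(left == right)   # False
```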

We shall denote by flop the single elementary floating-point operation (sum, subtraction, multiplication or division) (the reader is warned that in some texts flop identifies an operation of the form a + b · c). According to this convention, a scalar product between two vectors of length n requires 2n − 1 flops, a matrix-vector product (2m − 1)n flops if the matrix is n × m and, finally, a matrix-matrix product (2r − 1)mn flops if the two matrices are m × r and r × n respectively.
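Counting each entry of the result as one scalar product of the appropriate length, these flop counts can be encoded directly (a trivial Python sketch; the function names are ours):

```python
def dot_flops(n):
    # scalar product of length n: n multiplications + (n - 1) additions
    return 2 * n - 1

def matvec_flops(n, m):
    # n x m matrix times m-vector: one length-m dot product per row
    return n * dot_flops(m)

def matmat_flops(m, r, n):
    # (m x r) times (r x n): one length-r dot product per result entry
    return m * n * dot_flops(r)

print(dot_flops(3), matvec_flops(2, 3), matmat_flops(2, 3, 4))  # 5 10 40
```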

Remark 2.4 (IEC559 arithmetic) The IEC559 standard also defines a closed arithmetic on F, meaning that any operation on it produces a result that can be represented within the system itself, although not necessarily the one expected from a pure mathematical standpoint. As an example, in Table 2.3 we report the results that are obtained in exceptional situations.

exception              examples      result
non valid operation    0/0, 0·∞      NaN
overflow                             ±∞
division by zero       1/0           ±∞
underflow                            subnormal numbers

TABLE 2.3. Results for some exceptional operations

The presence of a NaN (Not a Number) in a sequence of operations automatically implies that the result is a NaN. General acceptance of this standard is still ongoing.
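The behavior summarized in Table 2.3 can be probed from Python, whose floats follow the same standard (note that Python raises an exception on float division by zero instead of returning ∞, so ∞ is built directly here):

```python
import math
import sys

inf = float("inf")

# non valid operations produce NaN, which then propagates
nan = inf - inf                  # same exception class as 0/0 and 0*inf
assert math.isnan(nan)
assert math.isnan(nan + 1.0)     # a NaN in a sequence of operations -> NaN
assert nan != nan                # NaN even compares unequal to itself

# overflow produces an infinity
assert 1e308 * 10.0 == inf

# graceful underflow: results below x_min become subnormal, not zero
tiny = 5e-324                    # smallest positive subnormal double
assert 0.0 < tiny < sys.float_info.min
```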

We mention that not all floating-point systems satisfy (2.36). One of the main reasons is the absence of the round digit in subtraction, that is, an extra bit that comes into play at the mantissa level when the subtraction between two floating-point numbers is performed. To demonstrate the importance of the round digit, let us consider a system F with β = 10 and t = 2, and let us subtract 1 and 0.99. Aligning the exponents without a round digit, the last digit of 0.99 is lost:

   10^1 · 0.10          10^1 · 0.10
 − 10^0 · 0.99   →    − 10^1 · 0.09
                        10^1 · 0.01 → 10^0 · 0.10

that is, the result differs from the exact one by a factor 10. If we now execute the same subtraction using the round digit, we obtain the exact result. Indeed

   10^1 · 0.10          10^1 · 0.10
 − 10^0 · 0.99   →    − 10^1 · 0.09|9
                        10^1 · 0.00|1 → 10^0 · 0.01

In fact, it can be shown that addition and subtraction, if executed without the round digit, do not satisfy the property

fl(x ± y) = (x ± y)(1 + δ) with |δ| ≤ u,

but the following one

fl(x ± y) = x(1 + α) ± y(1 + β) with |α| + |β| ≤ u.

An arithmetic for which this latter property holds is called aberrant. In some computers the round digit does not exist, most of the care being devoted to computational speed. Nowadays, however, the trend is to use even two round digits (see [HP94] for technical details on the subject).

2.6 Exercises
1. Use (2.7) to compute the condition number K(d) of the following expressions

(1) x − a^d = 0, a > 0;    (2) d − x + 1 = 0,

d being the datum, a a parameter and x the “unknown”.
[Solution: (1) K(d) ≃ |d| |log a|, (2) K(d) = |d|/|d + 1|.]
2. Study the well-posedness and the conditioning in the infinity norm of the following problem as a function of the datum d: find x and y such that

x + dy = 1,
dx + y = 0.

[Solution: the given problem is a linear system whose matrix is A = [1 d; d 1]. It is well posed if A is nonsingular, i.e., if d ≠ ±1. In such a case, K∞(A) = |(|d| + 1)/(|d| − 1)|.]
3. Study the conditioning of the solving formula x± = −p ± √(p² + q) for the second degree equation x² + 2px − q = 0 with respect to changes in the parameters p and q separately.
[Solution: K(p) = |p|/√(p² + q), K(q) = |q|/(2|x±| √(p² + q)).]
4. Consider the following Cauchy problem

x′(t) = x_0 e^{at} (a cos(t) − sin(t)),  t > 0,    (2.38)
x(0) = x_0,

whose solution is x(t) = x_0 e^{at} cos(t) (a is a given real number). Study the conditioning of (2.38) with respect to the choice of the initial datum and check that on unbounded intervals it is well conditioned if a < 0, while it is ill conditioned if a > 0.
[Hint: consider the definition of K_abs(a).]
5. Let x̃ ≠ 0 be an approximation of a non-null quantity x. Find the relation between the relative errors ε = |x − x̃|/|x| and Ẽ = |x − x̃|/|x̃|.
6. Find a stable formula for evaluating the square root of a complex number.
7. Determine all the elements of the set F = F(10, 6, −9, 9), in both the normalized and de-normalized cases.
8. Consider the set of the de-normalized numbers F_D and study the behavior of the absolute distance and of the relative distance between two of these numbers. Does the wobbling precision effect arise again?
[Hint: for these numbers, uniformity in the relative density is lost. As a consequence, the absolute distance remains constant (equal to β^(L−t)), while the relative one rapidly grows as x tends to zero.]
9. What is the value of 0^0 in IEEE arithmetic?
[Solution: ideally, the outcome should be NaN. In practice, IEEE systems recover the value 1. A motivation for this result can be found in [Gol91].]
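Python's floating point inherits this convention, so the claim can be checked in one line:

```python
import math

# IEEE systems return 1 for 0^0 (cf. [Gol91])
print(0.0 ** 0.0)          # 1.0
print(math.pow(0.0, 0.0))  # 1.0
```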
10. Show that, due to cancellation errors, the recursion

I_0 = log(6/5),   I_k + 5 I_{k−1} = 1/k,   k = 1, 2, …, n,    (2.39)

is not well suited to finite arithmetic computation of the integral I_n = ∫_0^1 x^n/(x + 5) dx when n is sufficiently large, although it works in infinite arithmetic.
[Hint: consider the initial perturbed datum Ĩ_0 = I_0 + μ_0 and study the propagation of the error μ_0 within (2.39).]
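The instability is easy to reproduce: any perturbation μ_0 of I_0 (for instance its rounding to double precision) is multiplied by (−5)^k after k steps. A Python sketch:

```python
import math

def unstable_I(n):
    # forward recursion I_k = 1/k - 5*I_{k-1}, started from fl(log(6/5))
    I = math.log(6 / 5)
    for k in range(1, n + 1):
        I = 1 / k - 5 * I
    return I

# The exact values satisfy 0 < I_n < 1/(5(n+1)), but the rounding error of
# I_0 (about 1e-17) is amplified by 5**30 ~ 1e21, swamping the true value.
print(unstable_I(2))    # still accurate
print(unstable_I(30))   # garbage: far outside (0, 1/155)
```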
11. Prove (2.37).
[Solution: notice that

|x ⊕ y − (x + y)|/|x + y| ≤ |x ⊕ y − (fl(x) + fl(y))|/|x + y| + |fl(x) − x + fl(y) − y|/|x + y|.

Then use (2.36) and (2.35).]
12. Given x, y, z ∈ F with x + y, y + z, x + y + z falling within the range of F, show that

|(x ⊕ y) ⊕ z − (x + y + z)| ≤ C_1 (2|x + y| + |z|) u,
|x ⊕ (y ⊕ z) − (x + y + z)| ≤ C_2 (|x| + 2|y + z|) u.

13. Which among the following approximations of π,

π = 4 (1 − 1/3 + 1/5 − 1/7 + 1/9 − …),

π = 6 (0.5 + (0.5)^3/(2·3) + 3(0.5)^5/(2·4·5) + 3·5(0.5)^7/(2·4·6·7) + …),    (2.40)

better limits the propagation of rounding errors? Compare using MATLAB the obtained results as a function of the number of terms in each sum in (2.40).
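The exercise asks for a MATLAB comparison; an equivalent Python sketch of the two partial sums (the function names and term recurrence are ours):

```python
import math

def pi_leibniz(n):
    # 4*(1 - 1/3 + 1/5 - 1/7 + ...): converges very slowly
    return 4.0 * sum((-1) ** k / (2 * k + 1) for k in range(n))

def pi_arcsin(n):
    # 6*arcsin(1/2) = 6*(x + x^3/(2*3) + 3x^5/(2*4*5) + ...), x = 0.5
    x, s, term = 0.5, 0.5, 0.5
    for k in range(1, n):
        term *= x * x * (2 * k - 1) / (2 * k)   # (2k-1)!!/(2k)!! * x^(2k+1)
        s += term / (2 * k + 1)
    return 6.0 * s

print(abs(pi_leibniz(100) - math.pi))  # still ~1e-2 after 100 terms
print(abs(pi_arcsin(20) - math.pi))    # far more accurate
```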
14. Analyze the stability, with respect to propagation of rounding errors, of the following two MATLAB codes to evaluate f(x) = (e^x − 1)/x for |x| ≪ 1

    % Algorithm 1
    if x == 0
        f = 1;
    else
        f = (exp(x) - 1) / x;
    end

    % Algorithm 2
    y = exp(x);
    if y == 1
        f = 1;
    else
        f = (y - 1) / log(y);
    end

[Solution: the first algorithm is inaccurate due to cancellation errors, while the second one (in the presence of the round digit) is stable and accurate.]
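The two algorithms can be compared directly in Python (a transcription of the MATLAB codes above; for small x the exact value is f(x) = 1 + x/2 + O(x²)):

```python
import math

def f_naive(x):
    # Algorithm 1: exp(x) - 1 cancels almost completely for |x| << 1
    return 1.0 if x == 0 else (math.exp(x) - 1.0) / x

def f_stable(x):
    # Algorithm 2: the rounding error of y - 1 is compensated by log(y),
    # which is computed from the same rounded value y
    y = math.exp(x)
    return 1.0 if y == 1.0 else (y - 1.0) / math.log(y)

x = 1e-15
print(f_naive(x))   # off by roughly 10%
print(f_stable(x))  # accurate
```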
15. In binary arithmetic one can show [Dek71] that the rounding error in the sum of two numbers a and b, with a ≥ b, can be computed as

((a + b) − a) − b.

Based on this property, a method called Kahan compensated sum has been proposed to compute the sum of n addends a_i in such a way that the rounding errors are compensated. In practice, letting the initial rounding error be e_1 = 0 and s_1 = a_1, at the i-th step, with i ≥ 2, the algorithm evaluates y_i = a_i − e_{i−1}, the sum is updated setting s_i = s_{i−1} + y_i and the new rounding error is computed as e_i = (s_i − s_{i−1}) − y_i. Implement this algorithm in MATLAB and check its accuracy by evaluating again the second expression in (2.40).
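The exercise asks for a MATLAB implementation; the algorithm reads the same in Python (a sketch):

```python
def kahan_sum(a):
    # compensated summation: e holds the rounding error of each step,
    # which is subtracted from the next addend
    s, e = a[0], 0.0
    for x in a[1:]:
        y = x - e           # addend corrected by the previous error
        t = s + y
        e = (t - s) - y     # rounding error of this addition, cf. [Dek71]
        s = t
    return s

# 1 + 10000 * 1e-16 = 1 + 1e-12 exactly; a naive sum never leaves 1.0
vals = [1.0] + [1e-16] * 10000
print(sum(vals), kahan_sum(vals))
```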
16. The area A(T ) of a triangle T with sides a, b and c, can be computed using
the following formula

p(p ’ a)(p ’ b)(p ’ c),
A(T ) =

where p is half the perimeter of T . Show that in the case of strongly de-
formed triangles (a b + c), this formula lacks accuracy and check this
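A numerically robust alternative for the experiment is Kahan's rearrangement of Heron's formula, which avoids the cancellation in p − a; a Python sketch of both (function names are ours):

```python
import math

def heron(a, b, c):
    # classical formula: p - a cancels badly when a ~ b + c
    p = (a + b + c) / 2.0
    return math.sqrt(p * (p - a) * (p - b) * (p - c))

def area_kahan(a, b, c):
    # Kahan's version: sort so that a >= b >= c, then group the sums and
    # differences so that no catastrophic cancellation occurs
    a, b, c = sorted((a, b, c), reverse=True)
    return 0.25 * math.sqrt((a + (b + c)) * (c - (a - b))
                            * (c + (a - b)) * (a + (b - c)))

print(heron(3.0, 4.0, 5.0), area_kahan(3.0, 4.0, 5.0))  # 6.0 6.0
```

On well-conditioned triangles the two agree; on needle-like triangles the classical formula loses digits while the rearranged one does not.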
3 Direct Methods for the Solution of Linear Systems

A system of m linear equations in n unknowns consists of a set of algebraic relations of the form

Σ_{j=1}^{n} a_ij x_j = b_i,   i = 1, …, m,    (3.1)

where the x_j are the unknowns, the a_ij are the coefficients of the system and the b_i are the components of the right-hand side. System (3.1) can be more conveniently written in matrix form as

Ax = b,    (3.2)

where we have denoted by A = (a_ij) ∈ C^(m×n) the coefficient matrix, by b = (b_i) ∈ C^m the right-hand side vector and by x = (x_i) ∈ C^n the unknown vector, respectively. We call a solution of (3.2) any n-tuple of values x_i which satisfies (3.1).
In this chapter we shall mainly deal with real-valued square systems of order n, that is, systems of the form (3.2) with A ∈ R^(n×n) and b ∈ R^n. In such cases existence and uniqueness of the solution of (3.2) are ensured if one of the following (equivalent) hypotheses holds:

1. A is invertible;

2. rank(A) = n;

3. the homogeneous system Ax = 0 admits only the null solution.

The solution of system (3.2) is formally provided by Cramer's rule

x_j = ∆_j / det(A),   j = 1, …, n,    (3.3)

where ∆_j is the determinant of the matrix obtained by substituting the j-th column of A with the right-hand side b. This formula is, however, of little practical use. Indeed, if the determinants are evaluated by the recursive relation (1.4), the computational effort of Cramer's rule is of the order of (n + 1)! flops and therefore turns out to be unacceptable even for small dimensions of A (for instance, a computer able to perform 10^9 flops per second would take 9.6 · 10^47 years to solve a linear system of only 50 equations).
For this reason, numerical methods that are alternatives to Cramer's rule have been developed. They are called direct methods if they yield the solution of the system in a finite number of steps, iterative if they require (theoretically) an infinite number of steps. Iterative methods will be addressed in the next chapter. We notice from now on that the choice between a direct and an iterative method does not depend only on the theoretical efficiency of the scheme, but also on the particular type of matrix, on memory storage requirements and, finally, on the architecture of the computer.

3.1 Stability Analysis of Linear Systems

Solving a linear system by a numerical method invariably leads to the introduction of rounding errors. Only the use of stable numerical methods can prevent the propagation of such errors from polluting the accuracy of the solution. In this section, two aspects of stability analysis will be addressed. Firstly, we will analyze the sensitivity of the solution of (3.2) to changes in the data A and b (forward a priori analysis). Secondly, assuming that an approximate solution x̂ of (3.2) is available, we shall quantify the perturbations on the data A and b in order for x̂ to be the exact solution of a perturbed system (backward a priori analysis). The size of these perturbations will in turn allow us to measure the accuracy of the computed solution x̂ by the use of a posteriori analysis.

3.1.1 The Condition Number of a Matrix

The condition number of a matrix A ∈ C^(n×n) is defined as

K(A) = ‖A‖ ‖A^(−1)‖,    (3.4)

where ‖·‖ is an induced matrix norm. In general K(A) depends on the choice of the norm; this will be made clear by introducing a subscript into the notation, for instance, K∞(A) = ‖A‖∞ ‖A^(−1)‖∞. More generally, K_p(A) will denote the condition number of A in the p-norm. Remarkable instances are p = 1, p = 2 and p = ∞ (we refer to Exercise 1 for the relations among K_1(A), K_2(A) and K∞(A)).
As already noticed in Example 2.3, an increase in the condition number produces a higher sensitivity of the solution of the linear system to changes in the data. Let us start by noticing that K(A) ≥ 1, since

1 = ‖AA^(−1)‖ ≤ ‖A‖ ‖A^(−1)‖ = K(A).

Moreover, K(A^(−1)) = K(A) and, ∀α ∈ C with α ≠ 0, K(αA) = K(A). Finally, if A is orthogonal, K_2(A) = 1 since ‖A‖_2 = √(ρ(AᵀA)) = √(ρ(I)) = 1 and A^(−1) = Aᵀ. The condition number of a singular matrix is set equal to infinity.
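For a 2 × 2 matrix, K∞(A) from (3.4) can be computed by hand. The following self-contained Python sketch (helper names are ours) reproduces the value K∞(A) = (|d| + 1)/||d| − 1| quoted in Exercise 2 of the previous chapter:

```python
def norm_inf(M):
    # induced infinity norm: maximum absolute row sum
    return max(sum(abs(v) for v in row) for row in M)

def inv2(M):
    # inverse of a nonsingular 2 x 2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def cond_inf(M):
    # K_inf(A) = ||A||_inf * ||A^-1||_inf, cf. (3.4)
    return norm_inf(M) * norm_inf(inv2(M))

d = 0.5
A = [[1.0, d], [d, 1.0]]
print(cond_inf(A))  # formula value (|d| + 1)/(1 - |d|) = 3
```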
For p = 2, K_2(A) can be characterized as follows. Starting from (1.21), it can be proved that

K_2(A) = ‖A‖_2 ‖A^(−1)‖_2 = σ_1(A)/σ_n(A),

where σ_1(A) and σ_n(A) are the maximum and minimum singular values of A (see Property 1.7). As a consequence, in the case of symmetric positive definite matrices we have
