
and hence
H(E, X|Y) = H(X|Y). (2.1)
Applying the chain rule for conditional entropies (Shannon entropy, property 8) again, but for different variables,

H(E, X|Y) = H(X|E, Y) + H(E|Y), (2.2)

and further, because conditioning reduces entropy (Shannon entropy, property 7), H(E|Y) ≤ H(E) = H(p_e),
whence by (2.1) and (2.2)

H(X|Y) = H(E, X|Y) ≤ H(X|E, Y) + H(p_e). (2.3)

Finally, H(X|E, Y) is bounded as follows, after some algebra:

H(X|E, Y) = p(E = 0) H(X|E = 0, Y) + p(E = 1) H(X|E = 1, Y)
≤ p(E = 0) · 0 + p_e log(|X| − 1) = p_e log(|X| − 1).

This relation, with the help of (2.3), is known as Fano's inequality:

H(p_e) + p_e log(|X| − 1) ≥ H(X|Y). (2.4)
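Fano's inequality (2.4) is easy to check numerically. The sketch below (plain numpy; all function names are ours, and p_e is taken as the error probability of the optimal guess x̂(y) = argmax_x p(x|y)) draws a random joint distribution and compares H(X|Y) with the Fano bound:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in bits, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def fano_terms(pxy):
    """Given a joint distribution p(x, y), return (H(X|Y), p_e, bound),
    where p_e is the error of the optimal guess and
    bound = H(p_e) + p_e * log2(|X| - 1)."""
    py = pxy.sum(axis=0)
    # conditional entropy H(X|Y) = H(X, Y) - H(Y)
    h_x_given_y = entropy(pxy.ravel()) - entropy(py)
    # optimal guess: pick the most likely x for each y
    p_correct = sum(pxy[:, y].max() for y in range(pxy.shape[1]))
    pe = 1.0 - p_correct
    k = pxy.shape[0]  # |X|
    bound = entropy([pe, 1 - pe]) + pe * np.log2(k - 1)
    return h_x_given_y, pe, bound

# random joint distribution over a 4-letter X and a 4-letter Y
rng = np.random.default_rng(0)
pxy = rng.random((4, 4))
pxy /= pxy.sum()
h, pe, bound = fano_terms(pxy)
```

Since the guess above is the optimal (maximum a posteriori) estimator, (2.4) must hold for it; any other estimator has a larger error probability, so the bound holds a fortiori.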

2.1.2 Accessible quantum information: quantum Fano inequality and the Holevo bound
Quantum Fano inequality
There exists an analogous relation to (2.4) in quantum information theory, named the quantum Fano inequality:

S(ρ, E) ≤ H(F(ρ, E)) + (1 − F(ρ, E)) log(d² − 1). (2.5)

Here F(ρ, E) is the entanglement fidelity of a quantum operation, defined in (B.1); for more details refer to appendix B.2. In the above equation the entropy exchange of the operation E upon ρ was introduced,

S(ρ, E) ≡ S(R′, Q′),

which is a measure of the noise caused by E on a quantum system Q (ρ ≡ ρ^Q), purified by R. The prime notation indicates the states after the application of E. Note that the entropy exchange does not depend upon the way in which the initial state of Q is purified by R. This is because any two purifications of Q into RQ are related by a unitary operation on the system R [1, p. 111], and because of von Neumann entropy, property 2.
The quantum Fano inequality (2.5) is proven by taking an orthonormal basis |i⟩ for the system RQ, chosen so that the first state in the set is |1⟩ = |RQ⟩. Forming the quantities p_i ≡ ⟨i| ρ^{R′Q′} |i⟩, it then follows from (1.39) that S(R′, Q′) ≤ H(p_1, …, p_{d²}), and with some simple algebra

H(p_1, …, p_{d²}) = H(p_1) + (1 − p_1) H(p_2/(1 − p_1), …, p_{d²}/(1 − p_1)) ≤ H(p_1) + (1 − p_1) log(d² − 1),

and since p_1 = F(ρ, E), q.e.d.
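Both sides of (2.5) can be evaluated directly from a set of operation elements {E_k}: the entanglement fidelity satisfies F(ρ, E) = Σ_k |tr(ρ E_k)|², and the entropy exchange equals the von Neumann entropy of the matrix W_kl = tr(E_k ρ E_l†). A minimal numerical check for a qubit depolarizing channel (the channel choice, the value p = 0.3 and the input state are our own illustrative assumptions):

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0]).astype(complex)

def vneumann(m):
    """von Neumann entropy in bits of a positive matrix with unit trace."""
    w = np.linalg.eigvalsh(m)
    w = w[w > 1e-12]
    return float(-np.sum(w * np.log2(w)))

p = 0.3  # depolarizing strength (hypothetical value)
kraus = [np.sqrt(1 - 3*p/4) * I2, np.sqrt(p/4) * X,
         np.sqrt(p/4) * Y, np.sqrt(p/4) * Z]
rho = np.array([[0.6, 0.2], [0.2, 0.4]], dtype=complex)  # arbitrary qubit state

# entanglement fidelity F(rho, E) = sum_k |tr(rho E_k)|^2
F = float(sum(abs(np.trace(rho @ Ek))**2 for Ek in kraus))
# entropy exchange S(rho, E) = S(W), with W_kl = tr(E_k rho E_l†)
W = np.array([[np.trace(Ek @ rho @ El.conj().T) for El in kraus]
              for Ek in kraus])
S_exch = vneumann(W)

d = 2
HF = -F*np.log2(F) - (1-F)*np.log2(1-F) if 0 < F < 1 else 0.0
bound = HF + (1 - F) * np.log2(d**2 - 1)   # right-hand side of (2.5)
```

For the maximally mixed input ρ = I/2 the two sides of (2.5) coincide, which is a useful sanity check on the implementation.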

The Holevo bound
Another result giving an upper bound on accessible quantum information is the Holevo bound [26],

H(X : Y) ≤ S(ρ) − Σ_x p_x S(ρ_x), (2.6)

where ρ = Σ_x p_x ρ_x. The right-hand side of this inequality is so useful in quantum information theory that it is given a special name: the Holevo χ quantity. Concerning its proof, assume that someone, named P, prepares a quantum system Q in states ρ_x, where x = 0, …, n, having probabilities p_0, …, p_n. The system Q is going to be measured by another person, M, using POVM elements {E_y} = {E_0, …, E_m} on the state, and the measurement will have an outcome Y. The state of the total system before the measurement will then be

ρ^{PQM} = Σ_x p_x |x⟩⟨x| ⊗ ρ_x ⊗ |0⟩⟨0|,

where the tensor product is with respect to the order P, Q, M. The matrix |0⟩⟨0| represents the initial state of the measurement system, which holds before any information is obtained. The measurement is described by an operation E that acts on each state σ of Q by measuring it with the POVM elements {E_y} and storing the outcome on M. This can be expressed by

E(σ ⊗ |0⟩⟨0|) = Σ_y √E_y σ √E_y ⊗ |y⟩⟨y|.

The quantum operation E is trace preserving. To see this, first notice that it is made up of operation elements √E_y ⊗ U_y, where U_y |y′⟩ ≡ |y′ + y⟩, with + the modulo m + 1 addition. Of course U_y is unitary, since it is a map taking the basis vector |y′⟩ to another basis vector |y′ + y⟩, and hence it is a change of basis from one basis to a cyclic permutation of the same basis, U_y† U_y = I. Now because the E_y form a POVM, Σ_y E_y = I, whence

Σ_y (√E_y ⊗ U_y)† (√E_y ⊗ U_y) = Σ_y E_y ⊗ U_y† U_y = I, q.e.d.
Subsequently primes are used to denote states after the application of E, and unprimed notation for states before its application. Note now that S(P : Q) = S(P : Q, M), since M is initially uncorrelated with P and Q, and because applying the operation E cannot increase mutual information (von Neumann entropy, property 10), S(P : Q, M) ≥ S(P′ : Q′, M′). Putting these results together,

S(P′ : M′) ≤ S(P : Q). (2.7)

The last equation, with a little algebra, is understood to be the Holevo bound. The right-hand side of (2.7) is found by noting that ρ^{PQ} = Σ_x p_x |x⟩⟨x| ⊗ ρ_x, hence S(P) = H(p_x), S(Q) = S(ρ), and S(P, Q) = H(p_x) + Σ_x p_x S(ρ_x) (von Neumann entropy, property 6), thus

S(P : Q) = S(P) + S(Q) − S(P, Q) = S(ρ) − Σ_x p_x S(ρ_x).

The left-hand side of (2.7) is found by noting that after the measurement

ρ′^{PQM} = Σ_{xy} p_x |x⟩⟨x| ⊗ √E_y ρ_x √E_y ⊗ |y⟩⟨y|;

tracing out the system Q and using the observation that the joint distribution p(x, y) for the pair (X, Y) satisfies p(x, y) = p_x p(y|x) = p_x tr(ρ_x E_y) = p_x tr(√E_y ρ_x √E_y), it is straightforward to see that ρ′^{PM} = Σ_{xy} p(x, y) |x⟩⟨x| ⊗ |y⟩⟨y|, whence

S(P′ : M′) = S(P′) + S(M′) − S(P′, M′) = H(X) + H(Y) − H(X, Y) = H(X : Y),

completing the proof.
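As a sanity check of (2.6), the sketch below computes the Holevo χ quantity and the mutual information H(X : Y) for a small hypothetical ensemble (two pure qubit states with equal probabilities, measured in the computational basis); all names are ours:

```python
import numpy as np

def vn_entropy(m):
    """von Neumann entropy in bits."""
    w = np.linalg.eigvalsh(m)
    w = w[w > 1e-12]
    return float(-np.sum(w * np.log2(w)))

def shannon(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log2(p)))

# ensemble: |0> and |+> with equal probability (an illustrative choice)
ket0 = np.array([1.0, 0.0])
ketp = np.array([1.0, 1.0]) / np.sqrt(2)
states = [np.outer(ket0, ket0), np.outer(ketp, ketp)]
probs = [0.5, 0.5]

rho = sum(p * s for p, s in zip(probs, states))
chi = vn_entropy(rho) - sum(p * vn_entropy(s) for p, s in zip(probs, states))

# accessible information for a measurement in the computational basis
proj = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
pxy = np.array([[px * np.trace(E @ s).real for E in proj]
                for px, s in zip(probs, states)])
mutual = shannon(pxy.sum(axis=1)) + shannon(pxy.sum(axis=0)) - shannon(pxy)
```

For this ensemble χ ≈ 0.60 bits while the computational-basis measurement extracts only about 0.31 bits, illustrating that the Holevo bound need not be achievable by a fixed measurement.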


2.2 Data processing
As is widely known, information, besides being stored and transmitted, is also processed. In subsection 2.2.1 data processing is defined for the classical case and the homonymous inequality is proven. An analogous definition and inequality are demonstrated for the quantum case in subsection 2.2.2.

2.2.1 Classical data processing
Classical data processing can be described in mathematical terms by a chain of random variables

X_1 → X_2 → X_3 → ⋯ → X_n, (2.8)

where X_i is the i-th step of processing. Of course each step depends only on the information gained from the previous one, that is, p(X_{n+1} = x_{n+1} | X_n = x_n, …, X_1 = x_1) = p(X_{n+1} = x_{n+1} | X_n = x_n), which defines a Markov chain. But, as already accentuated, information is a physical entity which can be distorted by noise. Thus if X → Y → Z is a Markov chain representing an information process, one can prove

H(X) ≥ H(X : Y) ≥ H(X : Z), (2.9)

which is known as the data processing inequality. This inequality gives mathematical insight into the following physical truth: if a system described by a random variable X is subjected to noise, producing Y, then further data processing cannot increase the amount of mutual information between the output of the process and the original information X; once information is lost, it cannot be restored. It is worth mentioning that in a data processing chain X → Y → Z, the information a system Z shares with X must be information which Z also shares with Y; the information is 'pipelined' from X through Y to Z. This is described by the data pipelining inequality

H(Z : Y) ≥ H(Z : X).

This is derived from (2.9) by noting that X → Y → Z is a Markov chain if and only if

p(Z = z | Y = y, X = x) = p(Z = z | Y = y)
⇔ p(X = x, Y = y, Z = z) / p(X = x, Y = y) = p(Y = y, Z = z) / p(Y = y)
⇔ p(X = x, Y = y, Z = z) / p(Y = y, Z = z) = p(X = x, Y = y) / p(Y = y)
⇔ p(X = x | Y = y, Z = z) = p(X = x | Y = y),

if and only if Z → Y → X is a Markov chain too. In the above proof there is no problem with null probabilities in the denominators, because then every probability involved would be null, and the proven result would still hold.
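The inequality (2.9) is easy to probe numerically: draw a random distribution for X, push it through two random stochastic matrices to produce Y and Z, and compare the mutual informations. A minimal sketch (all helper names are ours):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log2(p)))

def mutual(pab):
    """Mutual information of a joint distribution given as a 2-D array."""
    return H(pab.sum(axis=1)) + H(pab.sum(axis=0)) - H(pab)

rng = np.random.default_rng(1)
px = rng.random(3); px /= px.sum()                             # p(x)
T1 = rng.random((3, 3)); T1 /= T1.sum(axis=1, keepdims=True)   # p(y|x)
T2 = rng.random((3, 3)); T2 /= T2.sum(axis=1, keepdims=True)   # p(z|y)

pxy = px[:, None] * T1       # joint p(x, y)
pxz = pxy @ T2               # joint p(x, z); valid because X -> Y -> Z is Markov

i_xy = mutual(pxy)
i_xz = mutual(pxz)
```

Every random draw satisfies H(X) ≥ I(X : Y) ≥ I(X : Z), as (2.9) demands.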

2.2.2 Quantum data processing
The quantum analogue of data processing (2.8) is described by a chain of quantum operations

ρ → E_1(ρ) → (E_2 ∘ E_1)(ρ) → ⋯ → (E_n ∘ ⋯ ∘ E_2 ∘ E_1)(ρ).

In this model each step of the process is obtained by the application of a quantum operation. Defining the coherent information,

I(ρ, E) ≡ S(E(ρ)) − S(ρ, E), (2.10)

it can be proven that

S(ρ) ≥ I(ρ, E_1) ≥ I(ρ, E_2 ∘ E_1), (2.11)

which corresponds to the classical data processing inequality (2.9).
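Both sides of (2.11) can be evaluated numerically, using the facts that the entropy exchange equals the entropy of the matrix W_ij = tr(E_i ρ E_j†) and that operation elements of a composition E_2 ∘ E_1 are the pairwise products of the individual elements. A sketch for qubit depolarizing channels (the channel family and parameters are our own illustrative choices):

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0]).astype(complex)

def S(m):
    """von Neumann entropy in bits."""
    w = np.linalg.eigvalsh(m)
    w = w[w > 1e-12]
    return float(-np.sum(w * np.log2(w)))

def depolarizing(p):
    """Operation elements of the qubit depolarizing channel."""
    return [np.sqrt(1 - 3*p/4) * I2] + [np.sqrt(p/4) * P for P in (X, Y, Z)]

def apply(kraus, rho):
    return sum(E @ rho @ E.conj().T for E in kraus)

def entropy_exchange(kraus, rho):
    W = np.array([[np.trace(Ei @ rho @ Ej.conj().T) for Ej in kraus]
                  for Ei in kraus])
    return S(W)

def coherent_info(kraus, rho):
    return S(apply(kraus, rho)) - entropy_exchange(kraus, rho)

rho = np.diag([0.8, 0.2]).astype(complex)
E1 = depolarizing(0.1)
E2 = depolarizing(0.2)
E21 = [B @ A for B in E2 for A in E1]   # operation elements of E2 ∘ E1

i1 = coherent_info(E1, rho)
i12 = coherent_info(E21, rho)
```

The asserts below simply restate (2.11) for this particular ρ and pair of channels.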

Chapter 3

Real-life applications of information

After a theoretical presentation of information theory, one should look over its real-life applications. Two extremely useful applications of information theory are data compression, discussed in section 3.1, and transmission of information over noisy channels, the main topic of section 3.2.

3.1 Data compression
Nowadays data compression is a widely applied procedure; everybody uses .zip archives, listens to .mp3 music, watches videos in .mpeg format and exchanges photos in .jpg files. Although in all these cases special techniques depending on the type of data are used, the general philosophy underlying data compression is inherited from Shannon's noiseless channel coding theorem [7, 8], discussed in subsection 3.1.1. In the quantum case data compression was theoretically found to be possible in 1995 by Schumacher [27], and is presented in subsection 3.1.2.

3.1.1 Shannon™s noiseless channel coding theorem
Shannon's main idea was to estimate the physical resources needed to represent an information content, which, as already seen in section 1.1, are related to entropy. As a simple example of how the theorem works, consider a binary alphabet to be compressed. In this alphabet assume that 0 occurs with probability 1 − p and 1 with probability p. Then if n-bit strings are formed, they will contain (1 − p)n zero bits and pn one bits with very high probability, related to the magnitude of n according to the law of large numbers. Since, of all the possible strings that can be formed, these are the most likely, they are usually named typical sequences. Counting all the arrangements of the ones and zeros, there are in total (n choose np) such strings. Using Stirling's approximation n! ≈ (n/e)^n, one finds that

(n choose np) = n! / ((np)! [n(1 − p)]!) ≈ n^n / ((np)^{np} [n(1 − p)]^{n(1−p)})
= 2^{n log n − np log(np) − n(1−p) log(n(1−p))}
= 2^{n[−p log p − (1−p) log(1−p)]} = 2^{nH(p)}. (3.1)
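The quality of the approximation (3.1) can be checked directly: for moderate n, log₂(n choose np) already agrees with nH(p) to within a percent. A minimal sketch (the parameters n and p are illustrative):

```python
import math

def H2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n, p = 1000, 0.3
k = int(n * p)
exact = math.comb(n, k)            # exact count of strings with np ones
approx_exponent = n * H2(p)        # Stirling estimate: 2^{nH(p)}

# compare exponents: log2 C(n, np) = n H(p) up to O(log n) corrections
rel_err = abs(math.log2(exact) - approx_exponent) / approx_exponent
```

The residual discrepancy is the O(log n) term dropped by the crude form of Stirling's formula; it vanishes relative to nH(p) as n grows.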
Generalizing the above argument for an alphabet of k letters x_i ∈ X with occurrence probabilities p(x_i), the number of typical sequences can be easily calculated using combinatorics as before, and is found to be

n! / ∏_{x∈X} (np(x))! ≈ 2^{nH(X)}. (3.2)

The letters need not be single symbols; they can also be sequences of symbols, like words of the English language. Obviously the probability of such a typical sequence occurring is approximately 2^{−nH(X)}. This approximate probability gives a very intuitive way of understanding the mathematical terminology of Shannon's noiseless theorem, where a sequence y = x_{i1} x_{i2} x_{i3} ⋯ x_{in} is defined to be ε-typical if the probability of its occurrence is

2^{−n(H(X)+ε)} ≤ p(y) ≤ 2^{−n(H(X)−ε)}.

The set of all such sequences is denoted T(n, ε). Another very useful way of writing this result is

| (1/n) log (1 / p(x_{i1} x_{i2} x_{i3} ⋯ x_{in})) − H(X) | ≤ ε. (3.3)
Now the theorem of typical sequences can be stated.

Theorem of typical sequences:

1. Fix ε > 0; then for any δ > 0, for sufficiently large n, the probability that a sequence is ε-typical is at least 1 − δ.
2. For any fixed ε > 0 and δ > 0, for sufficiently large n, (1 − δ) 2^{n(H(X)−ε)} ≤ |T(n, ε)| ≤ 2^{n(H(X)+ε)}.
3. Let S(n) be a collection of size at most 2^{nR} of length-n sequences, where R < H(X) is fixed. Then for any δ > 0 and for sufficiently large n, Σ_{y∈S(n)} p(y) ≤ δ.
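Part 1 of the theorem can be illustrated by sampling: for a three-letter alphabet of our own choosing, almost every sampled sequence of length n = 2000 satisfies the typicality condition (3.3) with ε = 0.1:

```python
import math
import random

probs = {'a': 0.7, 'b': 0.2, 'c': 0.1}   # an illustrative three-letter source
H = -sum(p * math.log2(p) for p in probs.values())

def is_typical(seq, eps):
    """Check condition (3.3): |-(1/n) log2 p(seq) - H(X)| <= eps."""
    logp = sum(math.log2(probs[s]) for s in seq)
    return abs(-logp / len(seq) - H) <= eps

random.seed(2)
letters, weights = zip(*probs.items())
eps, n, trials = 0.1, 2000, 300
hits = sum(is_typical(random.choices(letters, weights, k=n), eps)
           for _ in range(trials))
frac_typical = hits / trials   # close to 1, as part 1 of the theorem predicts
```

This is just the law of large numbers at work: −(1/n) log p(y) is a sample mean of −log p(x_i), whose expectation is H(X).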
The above theorem is proven by using the law of large numbers. Moreover, Shannon's noiseless channel coding theorem is just an application of the theorem just stated. Shannon considered a compression scheme which is just a map of an n-bit sequence y = x_{i1} x_{i2} x_{i3} ⋯ x_{in} to another one of nR bits, denoted by C_n(y). Of course in such a compression scheme an inverse map D_n (D_n ∘ C_n = id_{X^n}) should exist, which is naturally named the decompression scheme. However the set of typical sequences, in non-trivial cases, is only a subset of all the possible sequences, and this leads to failure of the schemes when they are asked to map the complementary subset, known as the subset of atypical sequences. Further nomenclature may thus be added by saying that a compression-decompression scheme (C_n, D_n) is said to be reliable if the probability that D_n(C_n(y)) = y approaches one as n approaches infinity. It is time to state the theorem.

Shannon's noiseless channel coding theorem:
Assume an alphabet X; then if R > H(X) is chosen, there exists a reliable compression-decompression scheme of rate R. Conversely, if R < H(X), any such scheme will not be reliable.

This theorem reveals a remarkable operational interpretation of the entropy rate H(X): it is just the minimal physical resource necessary and sufficient to reliably transmit data. Finally it should be stressed that one can have a perfectly reliable compression-decompression scheme just by extending the maps to atypical sequences; in this case there is just high probability that no more than nR resources are needed to carry the information.
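A toy version of Shannon's scheme can be built by brute force for a small biased bit source: enumerate the ε-typical set, index it, and decompress by table lookup (all parameters and names below are illustrative assumptions, and n is far too small for the asymptotic statements to bite):

```python
import math
import itertools

# Toy compression for a biased bit source (illustrative parameters).
p, n, eps = 0.2, 12, 0.25
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # H(X) ≈ 0.722 bits

def prob(seq):
    return math.prod(p if b else 1 - p for b in seq)

def typical(seq):
    return abs(-math.log2(prob(seq)) / n - H) <= eps

# enumerate the eps-typical set and index it with about nR bits, R = H + eps
typical_set = [s for s in itertools.product((0, 1), repeat=n) if typical(s)]
R = H + eps
code = {s: i for i, s in enumerate(typical_set)}      # compression C_n
decode = {i: s for s, i in code.items()}              # decompression D_n

s = (1,) + (0,) * 11        # a string with a single 1 is typical here
roundtrip = decode[code[s]] == s
```

The scheme fails exactly on atypical inputs, which is the failure mode described above; by part 1 of the typical sequences theorem their total probability vanishes as n grows.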

3.1.2 Schumacher™s quantum noiseless channel coding theorem
It is quite surprising that quantum information can be compressed, as was proven by Schumacher [27]. Assume that the quantum information transmitted can be in states |x_i⟩ ∈ H with probability p(x_i). This is described by the density matrix ρ = Σ_i p(x_i) |x_i⟩⟨x_i|. A compression-decompression scheme of rate R consists of two quantum operations C_n and D_n, analogous to the maps defined for the classical case. The compression operation C_n takes states from H^{⊗n} to H^{⊗nR} and the decompression D_n returns them back, as Figure 3.1 demonstrates. One can define a sequence x_{i1} x_{i2} x_{i3} ⋯ x_{in} as ε-typical by a relation resembling the classical (3.3):

| (1/n) log (1 / (p(x_{i1}) p(x_{i2}) ⋯ p(x_{in}))) − S(ρ) | ≤ ε.
A state |x_{i1}⟩ |x_{i2}⟩ ⋯ |x_{in}⟩ is said to be ε-typical if the sequence x_{i1} x_{i2} x_{i3} ⋯ x_{in} is ε-typical. The ε-typical subspace is denoted T(n, ε), and the projector onto this subspace is

P(n, ε) = Σ_{x_{i1} x_{i2} ⋯ x_{in} ∈ T(n,ε)} |x_{i1}⟩⟨x_{i1}| ⊗ |x_{i2}⟩⟨x_{i2}| ⊗ ⋯ ⊗ |x_{in}⟩⟨x_{in}|.

Now the quantum analogue of the typical sequences theorem can be stated.

Typical subspace theorem:

1. Fix ε > 0; then for any δ > 0, for sufficiently large n, tr(P(n, ε) ρ^{⊗n}) ≥ 1 − δ.
2. For any fixed ε > 0 and δ > 0, for sufficiently large n, (1 − δ) 2^{n(S(ρ)−ε)} ≤ dim T(n, ε) ≤ 2^{n(S(ρ)+ε)}.
3. Let S(n) be a projector onto any subspace of H^{⊗n} of dimension at most 2^{nR}, where R < S(ρ) is fixed. Then for any δ > 0 and for sufficiently large n, tr(S(n) ρ^{⊗n}) ≤ δ.

Following the same principles, the quantum version of Shannon's theorem, as proved by Schumacher, is:

Schumacher's noiseless channel coding theorem:
Let ρ be an information source on a Hilbert space H; then if R > S(ρ), there exists a reliable compression scheme of rate R. Conversely, if R < S(ρ), any compression scheme is not reliable.

Figure 3.1: Quantum data compression. The compression operation C_n compresses a quantum source ρ stored in n log d qubits into nS(ρ) qubits; the decompression operation D_n maps the compressed state back into n log d qubits, accurately recovering the source.

The compression scheme found by Schumacher is

C_n(σ) ≡ P(n, ε) σ P(n, ε) + Σ_i |0⟩⟨i| σ |i⟩⟨0|,

where the |i⟩ form an orthonormal basis for the orthocomplement of the typical subspace, and |0⟩ is some standard state. As one can see, this quantum operation takes any state σ from H^{⊗n} into H^{⊗nR}, the subspace of ε-typical sequences, if σ can be compressed; if not, its outcome is the standard state |0⟩, which is meant to signal a failure. Finally, D_n was found to be the identity map on H^{⊗nR}, which obviously maps any compressed state back into H^{⊗nR} ⊆ H^{⊗n}.
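Why this scheme is reliable can be seen numerically: for a source ρ diagonal in a known basis, tr(P(n, ε) ρ^{⊗n}) reduces to a classical sum over eigenvalue strings, so the weight of the typical subspace (part 1 of the typical subspace theorem) can be checked by brute force (the eigenvalue q and the values of n are illustrative):

```python
import math
from itertools import product

q = 0.9   # larger eigenvalue of a diagonal qubit state rho = diag(q, 1 - q)
S = -q * math.log2(q) - (1 - q) * math.log2(1 - q)   # S(rho) ≈ 0.469 bits

def typical_weight(n, eps):
    """tr(P(n, eps) rho^{⊗n}) for a diagonal rho: the total probability of
    eigenvalue strings whose sample entropy is within eps of S(rho)."""
    total = 0.0
    for bits in product((0, 1), repeat=n):
        pr = math.prod(q if b == 0 else 1 - q for b in bits)
        if abs(-math.log2(pr) / n - S) <= eps:
            total += pr
    return total

w8 = typical_weight(8, 0.2)    # still far from 1 at n = 8
w16 = typical_weight(16, 0.2)  # noticeably closer to 1
```

The weight of the typical subspace grows toward 1 with n, which is exactly why C_n almost never triggers the failure branch for large n.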

3.2 Information over noisy channels
It is an everyday fact that communication channels are imperfect and always subject to noise, which distorts transmitted information. This of course prevents reliable communication without some special control of the information transmitted and received. One can use error correction in order to achieve such control, which is summarized in subsection 3.2.1. However, there are some general theoretical results concerning such transmissions, which help calculate the capacity of noisy channels. The cases of classical information over noisy classical and quantum channels are presented in subsections 3.2.2 and 3.2.3 respectively. A summary of the results to date for quantum information over noisy quantum channels is given in subsection 3.2.4.

3.2.1 Error correction
Error correction is a practical procedure for the transmission of information over noisy channels. The intuitive idea behind it is common in everyday life. As an example, recall a typical telephone conversation. If the connection is of low quality, people communicating often need to repeat their words in order to protect their talk against the noise. Moreover, sometimes during a telephone conversation one is asked to spell a word. Then, by saying words whose initials are the letters to be spelled, misunderstanding is minimized. If someone wants to spell the word "phone", he can say "Parents", "Hotel", "Oracle", "None" and "Evangelist". If instead of saying "None" he said "New", the person at the other side of the line could possibly hear "Mew". This example demonstrates why words should be carefully selected. One can see that the last two words differ by only one letter, i.e. their Hamming distance is small (refer to appendix B.1 for a definition); hence one should select words whose pairwise Hamming distance is large.

The error correction procedure, as presented in the last paragraph, is the encoding and transmission of a message longer than the one to be communicated, containing enough information to reconstruct the initial message up to some probability. If one wishes to encode a k-bit word x into an n-bit word y (k ≤ n), then this encoding-decoding scheme is named [n, k], and is represented by a function C(x) = y; in this case it is often written y ∈ C. One can avoid misunderstandings similar to "New" and "Mew", as found in the above paragraph, by asking each pair of codewords to have Hamming distance at least d. Then after receiving a word, one tries to find to which Hamming sphere Sph(x, d) it belongs, and then identifies the received word with the center of the sphere, x. Such a code is denoted by [n, k, d].
The basic notions of classical and quantum error correction are summarized in the next paragraphs.

Classical error correction
Of all the possible error correction codes, a subset is going to be presented here: the linear ones. A member of this subset, namely an [n, k] linear code, is modeled by an n × k matrix G, often called the generator. The k-bit message x is treated as a column vector, and the encoded n-bit message is Gx, where the entries of both G and x are numbers in Z_2, that is zeros and ones, and all operations are performed modulo 2.
Linear codes are used because, in the case of an [n, k] code, only nk bits are needed to represent the encoding. In a general code C, each of the 2^k words would correspond to an n-bit string, and storing this correspondence requires a table of n2^k bits. In contrast, by using linear codes much memory is saved and the encoding program is more efficient.
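As a minimal sketch of generator-matrix encoding, take the [3, 1] repetition code (our own choice of example):

```python
import numpy as np

# Generator of the [3, 1] repetition code: encodes 1 bit into 3 bits.
# Only n*k = 3 bits describe it, versus n * 2^k = 6 for a lookup table.
G = np.array([[1],
              [1],
              [1]])        # n x k = 3 x 1 matrix over Z_2

def encode(G, x):
    """Encode the column vector x (k bits) into n bits, arithmetic mod 2."""
    return (G @ x) % 2

c0 = encode(G, np.array([0]))
c1 = encode(G, np.array([1]))
```

The same `encode` works unchanged for any [n, k] linear code; only the matrix G changes.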
In order to perform error correction one takes an (n − k) × n matrix H, named the parity check matrix, having the property HG = 0. Then for every codeword y = Gx it is obvious that Hy = 0. Now if noise was present during the transmission, one receives a word y′ = y + e, where e is the error that occurred. Hence Hy′ = Hy + He = He. Usually Hy′ is called the error syndrome. From the error syndrome one can identify the initial y if t ≥ d(y′, y) = d(y + e, y) = d(e, 0), by checking in which sphere Sph(y, t) the received word lies. For this to work, the distance of the code, defined by d ≡ d(C) ≡ min_{x,y∈C, x≠y} d(x, y), must be large enough that the spheres of radius t around distinct codewords are disjoint; then if d ≥ 2t + 1, up to t flipped bits can be corrected. All this is under the assumption that the probability that the channel flips a bit is less than 1/2.
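Syndrome decoding can be sketched with the [7, 4, 3] Hamming code, whose parity check matrix has as columns the binary representations of 1 through 7, so the syndrome of a single-bit error literally spells out the error position (all helper names are ours):

```python
import numpy as np

# Parity check of the [7, 4, 3] Hamming code: column j is the binary
# representation of j + 1, so a single-bit error points at its own position.
H = np.array([[(j >> b) & 1 for j in range(1, 8)] for b in (2, 1, 0)])

def syndrome(y):
    return tuple(H @ y % 2)

def correct(y):
    """Correct up to t = 1 flipped bit (t = (d - 1) // 2 with d = 3)."""
    s = syndrome(y)
    if s == (0, 0, 0):
        return y                      # already a codeword
    # find the column of H equal to the syndrome and flip that bit
    j = next(j for j in range(7) if tuple(H[:, j]) == s)
    y = y.copy()
    y[j] ^= 1
    return y

codeword = np.array([1, 0, 1, 0, 1, 0, 1])   # satisfies H @ codeword = 0 mod 2
received = codeword.copy()
received[4] ^= 1                             # the channel flips bit 4
recovered = correct(received)
```

Note that the decoder never needs the table of codewords: the syndrome alone locates the error, which is the memory saving discussed above.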
It is easy to check that linear codes [n, k, d] must satisfy the Singleton bound,

n − k ≥ d − 1. (3.4)

One can further prove that, for large n, there exists an [n, k] error-correcting code protecting against t flipped bits for some k, such that

k/n ≥ 1 − H_bin(2t/n). (3.5)

This is known as the Gilbert-Varshamov bound.
Some further definitions are needed. Suppose an [n, k] code C is given; then its dual, denoted C⊥, has generator matrix H^T and parity check matrix G^T. Thus the words in C⊥ are orthogonal to those in C. A code is said to be weakly self-dual if C ⊆ C⊥, and strictly self-dual if C = C⊥.

Quantum error correction
Concerning quantum information theory, the errors that occur are not of the same nature as in the classical case: one has to deal not only with bit flip errors but also with phase flip errors. The first codes found to be able to correct both of them are named Calderbank-Shor-Steane (CSS) codes after their inventors [28, 29]. Assume C_1 and C_2 are [n, k_1] and [n, k_2] classical linear codes such that C_2 ⊂ C_1, and both C_1 and C_2⊥ can correct t errors. Then an [n, k_1 − k_2] quantum code CSS(C_1, C_2), capable of correcting errors on t qubits and named the CSS code of C_1 over C_2, is defined via the following construction. Suppose x ∈ C_1; then define

|x + C_2⟩ ≡ (1/√|C_2|) Σ_{y∈C_2} |x + y⟩,

where the addition is modulo 2. If now x′ is an element of C_1 such that x − x′ ∈ C_2, then it is easy to verify that |x + C_2⟩ = |x′ + C_2⟩; the state depends only on the coset of C_1/C_2 to which x belongs.
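The coset states |x + C_2⟩ can be enumerated explicitly for a toy pair of codes (our own choice: C_1 = Z_2², C_2 the [2, 1] repetition code {00, 11} ⊂ C_1), confirming that representatives of the same coset give the same state while different cosets give orthogonal states:

```python
import numpy as np

# Toy example: C1 = Z_2^2 (the [2, 2] code), C2 = {00, 11} ⊂ C1.
C2 = [(0, 0), (1, 1)]

def coset_state(x, C2, n=2):
    """|x + C2> = (1/sqrt(|C2|)) * sum_{y in C2} |x + y>, addition mod 2."""
    amp = np.zeros(2 ** n)
    for y in C2:
        z = tuple((xi + yi) % 2 for xi, yi in zip(x, y))
        amp[int(''.join(map(str, z)), 2)] += 1   # computational-basis index
    return amp / np.sqrt(len(C2))

s1 = coset_state((0, 1), C2)
s2 = coset_state((1, 0), C2)   # (0,1) - (1,0) = (1,1) ∈ C2: same coset
s3 = coset_state((0, 0), C2)   # a representative of the other coset
```

Here C_1/C_2 has |C_1|/|C_2| = 2 cosets, matching the 2^{k_1 − k_2} orthonormal codewords the CSS construction promises.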
