H(E, X|Y) = H(X|Y). (2.1)

Applying the chain rule for conditional entropies (Shannon entropy, property 8) again, but for different variables,

H (E, X|Y ) = H (X|E, Y ) + H (E|Y ) , (2.2)

and further, because conditioning reduces entropy (Shannon entropy, property 7), H(E|Y) ≤ H(E) = H(p_e),

whence by (2.1) and (2.2)

H(X|Y) = H(E, X|Y) ≤ H(X|E, Y) + H(p_e). (2.3)

Finally, H(X|E, Y) is bounded as follows, after some algebra:

H(X|E, Y) = p(E = 0) H(X|E = 0, Y) + p(E = 1) H(X|E = 1, Y)
≤ p(E = 0) · 0 + p_e log(|X| − 1) = p_e log(|X| − 1).

This relation, with the help of (2.3), is known as Fano's inequality:

H(p_e) + p_e log(|X| − 1) ≥ H(X|Y). (2.4)
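As a numerical sanity check of (2.4), the following sketch (plain numpy; the size of the distribution is an arbitrary illustrative choice) draws a random joint distribution p(x, y), guesses X from Y by the most likely symbol, and verifies Fano's inequality:

```python
import numpy as np

rng = np.random.default_rng(0)

def H(p):
    """Shannon entropy in bits; 0 log 0 = 0 by convention."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A random joint distribution over |X| = 4 and |Y| = 3 symbols.
pxy = rng.random((4, 3))
pxy /= pxy.sum()

H_X_given_Y = H(pxy) - H(pxy.sum(axis=0))   # H(X|Y) = H(X,Y) - H(Y)

# Guess X from Y by the most likely symbol; the error probability is then
# p_e = 1 - sum_y max_x p(x, y).
p_e = 1.0 - pxy.max(axis=0).sum()

fano_bound = H([p_e, 1.0 - p_e]) + p_e * np.log2(4 - 1)
assert fano_bound >= H_X_given_Y            # inequality (2.4)
print(f"H(X|Y) = {H_X_given_Y:.3f} <= {fano_bound:.3f}")
```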


2.1.2 Accessible quantum information: quantum Fano inequality and the Holevo bound

Quantum Fano inequality

There exists an analogous relation to (2.4) in quantum information theory, named the quantum Fano inequality:

S(ρ, E) ≤ H(F(ρ, E)) + (1 − F(ρ, E)) log(d² − 1). (2.5)

Here F(ρ, E) is the entanglement fidelity of a quantum operation, defined in (B.1); for more details refer to appendix B.2. In the above equation the entropy exchange of the operation E upon ρ was introduced,

S(ρ, E) ≡ S(R′, Q′),

which is a measure of the noise caused by E on a quantum system Q (ρ ≡ ρ^Q), purified by R. The prime notation is used to indicate the states after the application of E. Note that the entropy exchange does not depend upon the way in which the initial state of Q is purified by R. This is because any two purifications of Q into RQ are related by a unitary operation on the system R [1, p.111], and because of von Neumann entropy, property 2.

The quantum Fano inequality (2.5) is proven by taking an orthonormal basis |i⟩ for the system RQ, chosen so that the first state in the set is |1⟩ = |RQ⟩. Forming the quantities p_i ≡ ⟨i| ρ^{R′Q′} |i⟩, it then follows from (1.39) that S(R′, Q′) ≤ H(p_1, …, p_{d²}), and with some simple algebra

H(p_1, …, p_{d²}) = H(p_1) + (1 − p_1) H(p_2/(1 − p_1), …, p_{d²}/(1 − p_1)) ≤ H(p_1) + (1 − p_1) log(d² − 1),

and since p_1 = F(ρ, E), q.e.d.
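Inequality (2.5) can be checked numerically. The sketch below (an illustrative numpy computation, assuming the standard operator-sum expressions W_{kl} = tr(E_k ρ E_l†) for the entropy exchange and F(ρ, E) = Σ_k |tr(ρ E_k)|², as in appendix B) does so for a single-qubit depolarizing channel acting on the maximally mixed state; for this channel the bound happens to be saturated:

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log2(p)))

# Single-qubit depolarizing channel acting on rho = I/2.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0]).astype(complex)
p = 0.3
kraus = [np.sqrt(1 - 3 * p / 4) * I2, np.sqrt(p / 4) * X,
         np.sqrt(p / 4) * Y, np.sqrt(p / 4) * Z]
rho = I2 / 2

# Entropy exchange S(rho, E) = S(W), with W_kl = tr(E_k rho E_l^dagger).
W = np.array([[np.trace(Ek @ rho @ El.conj().T) for El in kraus]
              for Ek in kraus])
S_exchange = H(np.linalg.eigvalsh(W))

# Entanglement fidelity F(rho, E) = sum_k |tr(rho E_k)|^2.
F = float(sum(abs(np.trace(rho @ Ek)) ** 2 for Ek in kraus))

d = 2
bound = H([F, 1 - F]) + (1 - F) * np.log2(d ** 2 - 1)
assert S_exchange <= bound + 1e-9   # quantum Fano inequality (2.5)
print(f"S(rho,E) = {S_exchange:.4f}, bound = {bound:.4f}")
```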

The Holevo bound

Another result, giving an upper bound on accessible quantum information, is the Holevo bound [26]:

H(X : Y) ≤ S(ρ) − Σ_x p_x S(ρ_x), (2.6)

where ρ = Σ_x p_x ρ_x. Moreover the right-hand side of this inequality is useful in quantum information theory,

and hence it is given a special name: the Holevo χ quantity. Concerning its proof, assume that someone, named P, prepares a quantum information system Q in states ρ_X, where X = 0, …, n, having probabilities p_0, …, p_n. The quantum system Q is going to be measured by another person, M, using POVM elements {E_y} = {E_0, …, E_m} on the state, with outcome Y. The state of the total system before measurement will then be

ρ^{PQM} = Σ_x p_x |x⟩⟨x| ⊗ ρ_x ⊗ |0⟩⟨0|,

where the tensor product is with respect to the order PQM. The matrix |0⟩⟨0| represents the initial state of the measurement system, which holds before getting any information. The measurement is described by an operation E, which acts on each state σ of Q by measuring it with the POVM elements {E_y} and storing the outcome on M. This can be expressed by

E(σ ⊗ |0⟩⟨0|) = Σ_y √E_y σ √E_y ⊗ |y⟩⟨y|.

The quantum operation E is trace preserving. To see this, first notice that it is made up of operation elements √E_y ⊗ U_y, where U_y |y′⟩ ≡ |y′ + y⟩, with + the modulo n + 1 addition. Of course U_y is unitary, since it is a map taking the basis vector |y′⟩ to another basis vector |y′ + y⟩, and hence it is a change of basis from one basis to a cyclic permutation of the same basis, so U_y† U_y = I. Now, because the E_y form a POVM, I = Σ_y E_y = Σ_y √E_y† √E_y ⇒ Σ_y √E_y† √E_y ⊗ U_y† U_y = I, q.e.d.

Subsequently, primes are used to denote states after the application of E, and unprimed notation for states before its application. Note now that S(P : Q) = S(P : Q, M), since M is initially uncorrelated with P and Q, and because applying a quantum operation cannot increase mutual information (von Neumann entropy, property 10), S(P : Q, M) ≥ S(P′ : Q′, M′) ≥ S(P′ : M′), where the last step discards the system Q′. Putting these results together,

S(P′ : M′) ≤ S(P : Q). (2.7)


The last equation, with a little algebra, is understood to be the Holevo bound. The right-hand side of (2.7) is found by noting that ρ^{PQ} = Σ_x p_x |x⟩⟨x| ⊗ ρ_x, hence S(P) = H(p_x), S(Q) = S(ρ), and S(P, Q) = H(p_x) + Σ_x p_x S(ρ_x) (von Neumann entropy, property 6), thus

S(P : Q) = S(P) + S(Q) − S(P, Q) = S(ρ) − Σ_x p_x S(ρ_x).

The left-hand side of (2.7) is found by noting that after the measurement

ρ^{P′Q′M′} = Σ_{x,y} p_x |x⟩⟨x| ⊗ √E_y ρ_x √E_y ⊗ |y⟩⟨y|;

tracing out the system Q and using the observation that the joint distribution p(x, y) for the pair (X, Y) satisfies p(x, y) = p_x p(y|x) = p_x tr(ρ_x E_y) = p_x tr(√E_y ρ_x √E_y), it is straightforward to see that ρ^{P′M′} = Σ_{x,y} p(x, y) |x⟩⟨x| ⊗ |y⟩⟨y|, whence

S(P′ : M′) = S(P′) + S(M′) − S(P′, M′) = H(X) + H(Y) − H(X, Y) = H(X : Y),

q.e.d.
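A small numerical illustration of (2.6), assuming a two-state qubit ensemble and a projective measurement chosen for simplicity:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log2(p)))

def S(rho):
    """von Neumann entropy in bits."""
    return H(np.linalg.eigvalsh(rho))

# P prepares |0> or |+> with probability 1/2 each.
ket0 = np.array([1.0, 0.0])
ketp = np.array([1.0, 1.0]) / np.sqrt(2)
states = [np.outer(ket0, ket0), np.outer(ketp, ketp)]
probs = [0.5, 0.5]
rho = sum(p * r for p, r in zip(probs, states))

chi = S(rho) - sum(p * S(r) for p, r in zip(probs, states))   # Holevo chi

# M measures in the computational basis: E_0 = |0><0|, E_1 = |1><1|.
E = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
pxy = np.array([[px * np.trace(Ey @ rx).real for Ey in E]
                for px, rx in zip(probs, states)])
I_XY = H(pxy.sum(1)) + H(pxy.sum(0)) - H(pxy)                 # H(X:Y)

assert I_XY <= chi + 1e-9                                     # the Holevo bound (2.6)
print(f"H(X:Y) = {I_XY:.3f} <= chi = {chi:.3f}")
```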

2.2 Data processing

As is widely known, information, besides being stored and transmitted, is also processed. In subsection 2.2.1 data processing is defined for the classical case and the homonymous inequality is proven. An analogous definition and inequality are demonstrated for the quantum case in subsection 2.2.2.

2.2.1 Classical data processing

Classical data processing can be described in mathematical terms by a chain of random variables

X_1 → X_2 → X_3 → ··· → X_n, (2.8)

where X_i is the i-th step of processing. Of course each step depends only on the information gained from the previous one, that is, p(X_{n+1} = x_{n+1} | X_n = x_n, …, X_1 = x_1) = p(X_{n+1} = x_{n+1} | X_n = x_n), which defines a Markov chain. But as already accentuated, information is a physical entity which can be distorted by noise. Thus if X → Y → Z is a Markov chain representing an information process, one can prove

H(X) ≥ H(X : Y) ≥ H(X : Z), (2.9)

which is known as the data processing inequality. This inequality reveals a mathematical insight into the following physical truth: if a system described by a random variable X is subjected to noise, producing Y, then further data processing cannot increase the amount of mutual information between the output process and the original information X; once information is lost, it cannot be restored. It is worth mentioning that in a data processing chain X → Y → Z, the information a system Z shares with X must be information which Z also shares with Y; the information is 'pipelined' from X through Y to Z. This is described by the data pipelining inequality

H(Z : Y) ≥ H(Z : X).

This is derived from (2.9) by noting that X → Y → Z is a Markov chain if and only if

p(Z = z|Y = y, X = x) = p(Z = z|Y = y)
⇔ p(X = x, Y = y, Z = z)/p(X = x, Y = y) = p(Y = y, Z = z)/p(Y = y)
⇔ p(X = x, Y = y, Z = z)/p(Y = y, Z = z) = p(X = x, Y = y)/p(Y = y)
⇔ p(X = x|Y = y, Z = z) = p(X = x|Y = y),

if and only if Z → Y → X is a Markov chain too. In the above proof there is no problem with null probabilities in the denominators, because then every probability involved would be null, and the proven result would still hold.
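The inequality (2.9) can also be illustrated numerically. In the sketch below (illustrative Python, with both noise steps modeled as binary symmetric channels) each arrow of the chain X → Y → Z loses information:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_info(pxy):
    """H(X:Y) = H(X) + H(Y) - H(X,Y) from a joint distribution."""
    return H(pxy.sum(axis=1)) + H(pxy.sum(axis=0)) - H(pxy)

def bsc(f):
    """Transition matrix of a binary symmetric channel with flip probability f."""
    return np.array([[1 - f, f], [f, 1 - f]])

px = np.array([0.5, 0.5])
pxy = px[:, None] * bsc(0.1)   # X -> Y through a BSC(0.1)
pxz = pxy @ bsc(0.2)           # Y -> Z through a BSC(0.2): p(x,z) = sum_y p(x,y) p(z|y)

I_XY, I_XZ = mutual_info(pxy), mutual_info(pxz)
assert H(px) >= I_XY >= I_XZ   # the data processing inequality (2.9)
print(f"H(X) = {H(px):.3f} >= H(X:Y) = {I_XY:.3f} >= H(X:Z) = {I_XZ:.3f}")
```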


2.2.2 Quantum data processing

The quantum analogue of data processing (2.8) is described by a chain of quantum operations

ρ → E_1(ρ) → (E_2 ∘ E_1)(ρ) → ··· → (E_n ∘ ··· ∘ E_2 ∘ E_1)(ρ).

In the above model each step of the process is obtained by the application of a quantum operation. Defining the coherent information,

I(ρ, E) ≡ S(E(ρ)) − S(ρ, E), (2.10)

it can be proven that

S(ρ) ≥ I(ρ, E_1) ≥ I(ρ, E_2 ∘ E_1), (2.11)

which corresponds to the classical data processing inequality (2.9).


Chapter 3

Real-life applications of information theory

After a theoretical presentation of information theory one should look over its real-life applications. Two extremely useful applications of information theory are data compression, discussed in section 3.1, and transmission of information over noisy channels, the main topic of section 3.2.

3.1 Data compression

Nowadays data compression is a widely applied procedure; everybody uses .zip archives, listens to .mp3 music, watches videos in .mpeg format and exchanges photos in .jpg files. Although in all these cases special techniques depending on the type of data are used, the general philosophy underlying data compression is inherited from Shannon's noiseless channel coding theorem [7, 8], discussed in subsection 3.1.1. In the quantum case data compression was theoretically found to be possible in 1995 by Schumacher [27] and is presented in subsection 3.1.2.

3.1.1 Shannon's noiseless channel coding theorem

Shannon's main idea was to estimate the physical resources needed to represent an information content, which, as already seen in section 1.1, are related to entropy. As a simple example of how the theorem works, consider a binary alphabet to be compressed. In this alphabet assume that 0 occurs with probability 1 − p and 1 with probability p. Then if n-bit strings are formed, they will contain approximately (1 − p)n zero bits and pn one bits, with very high probability related to the magnitude of n, according to the law of large numbers. Since, of all the possible strings that can be formed, these are the most likely, they are usually named typical sequences. Calculating

all the combinations of ones and zeros, there are in total n!/[(np)! (n(1 − p))!] such strings. Using Stirling's approximation n! ≈ (n/e)^n, one finds that

n!/[(np)! (n(1 − p))!] ≈ (n/e)^n / [(np/e)^{np} (n(1 − p)/e)^{n(1−p)}]
= n^n / [(np)^{np} (n(1 − p))^{n(1−p)}]
= 2^{n log n − np log(np) − n(1−p) log(n(1−p))}
= 2^{n[−p log p − (1−p) log(1−p)]} = 2^{nH(p)}. (3.1)
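The approximation (3.1) can be checked directly; in the following sketch (parameters chosen arbitrarily for illustration) the exact count of n-bit strings with np ones is compared with the 2^{nH(p)} estimate:

```python
import math

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n, p = 1000, 0.2
exact = math.log2(math.comb(n, int(n * p)))   # log2 of the exact string count
approx = n * H2(p)                            # the 2^{nH(p)} estimate from (3.1)

assert abs(exact - approx) / approx < 0.02    # agreement within 2% already at n = 1000
print(f"log2 C(n, np) = {exact:.1f} vs nH(p) = {approx:.1f}")
```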

Generalizing the above argument for an alphabet of k letters x_i ∈ X with occurrence probabilities p(x_i), the number of typical sequences can be easily calculated using combinatorics as before, and is found to be the multinomial coefficient

n!/∏_{x∈X} (np(x))! ≈ 2^{nH(X)}, (3.2)

where the letters need not be single symbols but can also be sequences of symbols, like the words of the English language. Obviously the probability of such a sequence occurring is approximately 2^{−nH(X)}. This approximate probability


gives a very intuitive way of understanding the mathematical terminology of Shannon's noiseless theorem, where a sequence y = x_{i1} x_{i2} x_{i3} ··· x_{in} is defined to be ε-typical if the probability of its occurrence is

2^{−n(H(X)+ε)} ≤ p(y) ≤ 2^{−n(H(X)−ε)}.

The set of all such sequences is denoted T(n, ε). Another very useful way of writing this result is

|(1/n) log(1/p(x_{i1} x_{i2} x_{i3} ··· x_{in})) − H(X)| ≤ ε. (3.3)

Now the theorem of typical sequences can be stated.

Theorem of typical sequences:

1. Fix ε > 0. Then for any δ > 0, for sufficiently large n, the probability that a sequence is ε-typical is at least 1 − δ.

2. For any fixed ε > 0 and δ > 0, for sufficiently large n, (1 − δ) 2^{n(H(X)−ε)} ≤ |T(n, ε)| ≤ 2^{n(H(X)+ε)}.

3. Let S(n) be a collection of size at most 2^{nR} of length-n sequences, where R < H(X) is fixed. Then for any δ > 0 and for sufficiently large n, Σ_{y∈S(n)} p(y) ≤ δ.
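Property 1 of the theorem can be observed empirically. The sketch below (illustrative Python; the source parameters are arbitrary) samples n-bit strings and estimates the probability of ε-typicality for two block lengths:

```python
import numpy as np

rng = np.random.default_rng(1)
p, eps = 0.2, 0.05                  # source: a 1 occurs with probability p
Hp = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def typical_fraction(n, trials=2000):
    """Estimate the probability that an n-bit string is eps-typical."""
    ones = rng.binomial(n, p, size=trials)   # number of 1s in each sampled string
    # -(1/n) log2 p(y) depends on the sample only through its count of ones
    surprise = -(ones * np.log2(p) + (n - ones) * np.log2(1 - p)) / n
    return float(np.mean(np.abs(surprise - Hp) <= eps))

f_small, f_large = typical_fraction(100), typical_fraction(2000)
assert f_large > f_small and f_large > 0.9   # approaches 1 as n grows
print(f_small, f_large)
```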

The above theorem is proven by using the law of large numbers. Moreover, Shannon's noiseless channel coding theorem is just an application of the last stated theorem. Shannon implemented a compression scheme which is just a map of an n-bit sequence y = x_{i1} x_{i2} x_{i3} ··· x_{in} to another one of nR bits, denoted by C_n(y). Of course in such a compression scheme an inverse map D_n (D_n ∘ C_n = id_{X^n}) should exist, which naturally is named the decompression scheme. However, the set of typical sequences is, in non-trivial cases, only a subset of all the possible sequences, and this leads to failure of the schemes when they are required to map the complementary subset, known as the subset of atypical sequences. Further nomenclature may thus be added by saying that a compression-decompression scheme (C_n, D_n) is reliable if the probability that D_n(C_n(y)) = y approaches one as n approaches infinity. It is time to state the theorem.

Shannon's noiseless channel coding theorem:

Assume an alphabet X; then if R > H(X) is chosen, there exists a reliable compression-decompression scheme of rate R. Conversely, if R < H(X), any such scheme will not be reliable.

This theorem reveals a remarkable operational interpretation of the entropy rate H(X): it is just the minimal physical resource necessary and sufficient to reliably transmit data. Finally it should be stressed that one can have a perfectly reliable compression-decompression scheme just by extending the maps to atypical sequences; in this case there is just a high probability that no more than nR resources are needed to carry the information.

3.1.2 Schumacher's quantum noiseless channel coding theorem

It is quite surprising that quantum information can also be compressed, as was proven by Schumacher [27]. Assume that the quantum information transmitted can be in states |x_i⟩ with probabilities p(x_i); this is described by the density matrix ρ = Σ_i p(x_i) |x_i⟩⟨x_i|. A compression-decompression scheme of rate R consists of two quantum operations C_n and D_n, analogous to the maps defined for the classical case. The compression operation C_n takes states from H^{⊗n} to H^{⊗nR} and the decompression D_n returns them back,

as Figure 3.1 demonstrates. One can define a sequence x_{i1} x_{i2} x_{i3} ··· x_{in} as ε-typical by a relation resembling the classical (3.3):

|(1/n) log(1/(p(x_{i1}) p(x_{i2}) ··· p(x_{in}))) − S(ρ)| ≤ ε.

A state |x_{i1}⟩|x_{i2}⟩ ··· |x_{in}⟩ is said to be ε-typical if the sequence x_{i1} x_{i2} x_{i3} ··· x_{in} is ε-typical. The ε-typical subspace is denoted T(n, ε), and the projector onto this subspace is

P(n, ε) = Σ_{x_{i1} x_{i2} ··· x_{in} ∈ T(n,ε)} |x_{i1}⟩⟨x_{i1}| ⊗ |x_{i2}⟩⟨x_{i2}| ⊗ ··· ⊗ |x_{in}⟩⟨x_{in}|.

Now the quantum typical sequences theorem can be stated.


Typical subspace theorem:

1. Fix ε > 0. Then for any δ > 0, for sufficiently large n, tr(P(n, ε) ρ^{⊗n}) ≥ 1 − δ.

2. For any fixed ε > 0 and δ > 0, for sufficiently large n, (1 − δ) 2^{n(S(ρ)−ε)} ≤ dim T(n, ε) ≤ 2^{n(S(ρ)+ε)}.

3. Let S(n) be a projector onto any subspace of H^{⊗n} of dimension at most 2^{nR}, where R < S(ρ) is fixed. Then for any δ > 0 and for sufficiently large n, tr(S(n) ρ^{⊗n}) ≤ δ.

Following the same principles, the quantum version of Shannon's theorem, as proved by Schumacher, is:

Schumacher's noiseless channel coding theorem:

Let ρ be information belonging to some Hilbert space H. Then if R > S(ρ) there exists a reliable compression scheme of rate R. Conversely, if R < S(ρ), any compression scheme is not reliable.

Figure 3.1: Quantum data compression. The compression operation C_n compresses a quantum source ρ stored in n log d qubits into nS(ρ) qubits. The source is accurately recovered via the decompression operation D_n.

The compression scheme found by Schumacher is

C_n(σ) ≡ P(n, ε) σ P(n, ε) + Σ_i |0⟩⟨i| σ |i⟩⟨0|,

where the |i⟩ form an orthonormal basis for the orthocomplement of the typical subspace, and |0⟩ is some standard state. As one can see, this quantum operation takes any state σ from H^{⊗n} to H^{⊗nR}, the subspace of ε-typical sequences, if σ can be compressed; if not, it gives as outcome the standard state |0⟩, which is meant to signal a failure. Finally, D_n was found to be the identity map on H^{⊗nR}, which obviously maps any compressed state back to H^{⊗nR} ≤ H^{⊗n}.
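The gain promised by the theorem is visible already at the single-letter level; the sketch below (illustrative numpy, with an arbitrarily chosen two-state ensemble) computes the Schumacher rate S(ρ) and shows it is strictly below log d = 1 qubit per letter:

```python
import numpy as np

def S(rho):
    """von Neumann entropy in bits, i.e. qubits per letter."""
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > 1e-12]
    return float(-np.sum(lam * np.log2(lam)))

# A source emitting the non-orthogonal letters |0> and |+> with equal probability.
ket0 = np.array([1.0, 0.0])
ketp = np.array([1.0, 1.0]) / np.sqrt(2)
rho = 0.5 * np.outer(ket0, ket0) + 0.5 * np.outer(ketp, ketp)

rate = S(rho)             # Schumacher compression rate
assert 0.0 < rate < 1.0   # strictly fewer than n log d = n qubits suffice
print(f"{rate:.3f} qubits per letter instead of 1")
```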

3.2 Information over noisy channels

It is an everyday fact that communication channels are imperfect and always subject to noise, which distorts transmitted information. This of course prevents reliable communication without some special control over the information transmitted and received. One can use error correction in order to achieve such control, which is summarized in subsection 3.2.1. However, there are some general theoretical results concerning such transmissions, which help calculate the capacity of noisy channels. The cases of classical information over noisy classical and quantum channels are presented in subsections 3.2.2 and 3.2.3, respectively. A summary of the results to date for quantum information over noisy quantum channels is given in subsection 3.2.4.

3.2.1 Error correction

Error correction is a practical procedure for the transmission of information over noisy channels. The intuitive idea behind it is common in everyday life. As an example recall a typical telephone conversation. If the connection is of low quality, the people communicating often need to repeat their words in order to protect their talk against the noise. Moreover, sometimes during a telephone conversation one is asked to spell a word. Then, by saying words whose initials are the letters to be spelled, misunderstanding is minimized. If someone wants to spell the word "phone", he can say "Parents", "Hotel", "Oracle", "None" and "Evangelist". If instead of saying "None" he said "New", the person at the other side of the line could possibly hear "Mew". This example demonstrates why words should be carefully selected. One can see that the last two words differ by only one letter; their Hamming distance is small (refer to appendix B.1 for a definition), hence one should select words with greater distances.
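The distance of the example above can be computed with a few lines (an illustrative helper matching the definition in appendix B.1):

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length words differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# "new" and "mew" differ in a single position, so noise confuses them easily,
assert hamming_distance("new", "mew") == 1
# while better-separated codewords survive more errors:
assert hamming_distance("10110", "10011") == 2
```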


The error correction procedure, as presented in the last paragraph, is the encoding and transmission of a message longer than the one one wishes to communicate, containing enough information to reconstruct the initial message up to a probability. If one wishes to encode a k-bit word x into an n-bit word y (k ≤ n), then this encoding-decoding scheme is named [n, k] and represented by a function C(x) = y. In the last case it is often written x ∈ C. One can avoid misunderstandings similar to "New" and "Mew", as found in the above paragraph, by asking each pair of codewords to be of Hamming distance greater than d. Then after receiving a codeword y′, one tries to find to which Hamming sphere Sph(x, d) it belongs, and then identifies the received codeword with the center of the sphere, x. Such a code is denoted by [n, k, d].

The basic notions of classical and quantum error correction are summarized in the next paragraphs.

Classical error correction

Of all the possible error correction codes, a subset is going to be presented here: the linear ones. A member of this subset, namely an [n, k] linear code, is modeled by an n × k matrix G, often called the generator. The k-bit message x is treated as a column vector, and the encoded n-bit message is Gx, where the entries of both G and x are numbers of Z_2, that is zeros and ones, and all operations are performed modulo 2.

Linear codes are used because, in the case of [n, k], only nk bits are needed to represent the code. In a general code C, an n-bit string would correspond to one of 2^k words, and to do this a table of n2^k bits is needed. In contrast, by using linear codes much memory is saved and the encoding program is more efficient.

In order to perform error correction one takes an (n − k) × n matrix H, named the parity check matrix, having the property HG = 0. Then for every codeword y = Gx it is obvious that Hy = 0. Now if noise was present during transmission, one receives a state y′ ≡ y + e, where e is the error that occurred. Hence Hy′ = Hy + He = He. Usually Hy′ is called the error syndrome. From the error syndrome one can identify the initial y if t ≥ d(y′, y) = d(y + e, y) = d(e, 0), by checking in which sphere Sph(y, d) the received y′ belongs. To do this the distance of the code must be defined, d ≡ d(C) ≡ min_{x,y∈C, x≠y} d(x, y), so that the spheres around distinct codewords are disjoint. Then if d ≥ 2t + 1, up to t bit errors can be corrected. All this is under the assumption that the probability that the channel flips a bit is less than 1/2.
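The whole procedure can be sketched with the standard [7, 4, 3] Hamming code, which has d = 3 = 2t + 1 and hence corrects t = 1 flipped bit (the matrices below are one conventional systematic choice of generator and parity check; variable names are illustrative):

```python
import numpy as np

# Generator and parity-check matrices of the [7,4,3] Hamming code over Z_2,
# satisfying H G = 0 (mod 2).
G = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 1, 1],
              [1, 0, 1, 1],
              [1, 1, 0, 1]])
Hc = np.array([[0, 1, 1, 1, 1, 0, 0],
               [1, 0, 1, 1, 0, 1, 0],
               [1, 1, 0, 1, 0, 0, 1]])
assert not (Hc @ G % 2).any()      # H G = 0

x = np.array([1, 0, 1, 1])         # 4-bit message
y = G @ x % 2                      # 7-bit codeword, H y = 0
e = np.eye(7, dtype=int)[4]        # the channel flips bit 4
y_recv = (y + e) % 2

syndrome = Hc @ y_recv % 2         # equals He: depends only on the error
# find which single-bit error produces this syndrome and undo it
for i in range(7):
    if (Hc[:, i] == syndrome).all():
        y_recv[i] ^= 1
        break
assert (y_recv == y).all()         # the codeword is recovered
print("decoded message:", y_recv[:4])
```

Because the code is systematic, the first four bits of the corrected codeword are the original message.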

It is easy to check that linear codes [n, k, d] must satisfy the Singleton bound

n − k ≥ d − 1. (3.4)

One can further prove that for large n there exists an [n, k] error-correcting code protecting against t bit errors, for some k such that

k/n ≥ 1 − H_bin(2t/n). (3.5)

This is known as the Gilbert-Varshamov bound.

Some further definitions are needed. Suppose an [n, k] code C is given; then its dual, denoted C⊥, has generator matrix H^T and parity check matrix G^T. Thus the words in C⊥ are orthogonal to the words of C. A code is said to be weakly self-dual if C ⊆ C⊥, and strictly self-dual if C = C⊥.

Quantum error correction

In what concerns quantum information theory, the errors occurring are not of the same nature as in the classical case. One has to deal, apart from bit flip errors, with phase flip errors. The first codes found to be able to correct both of them are named Calderbank-Shor-Steane (CSS) codes after their inventors [28, 29]. Assume C_1 and C_2 are [n, k_1] and [n, k_2] classical linear codes such that C_2 ⊂ C_1 and both C_1 and C_2⊥ can correct t errors. Then an [n, k_1 − k_2] quantum code CSS(C_1, C_2) is defined, capable of correcting errors on t qubits, named the CSS code of C_1 over C_2, via the following construction. Suppose x ∈ C_1; then define |x + C_2⟩ ≡ (1/√|C_2|) Σ_{y∈C_2} |x + y⟩, where the addition is modulo 2. If now x′ is an element of C_1 such that x − x′ ∈ C_2, then it is easy to verify that