
752 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 39, NO. 3, MAY 1993

Approximation Theory of Output Statistics

Te Sun Han, Fellow, IEEE, and Sergio Verdú, Fellow, IEEE

Abstract—Given a channel and an input process, the minimum randomness of those input processes whose output statistics approximate the original output statistics with arbitrary accuracy is studied. The notion of resolvability of a channel, defined as the number of random bits required per channel use in order to generate an input that achieves arbitrarily accurate approximation of the output statistics for any given input process, is introduced. A general formula for resolvability, which holds regardless of the channel memory structure, is obtained. It is shown that, for most channels, resolvability is equal to the Shannon capacity. By-products of the analysis are a general formula for the minimum achievable (fixed-length) source coding rate of any finite-alphabet source, and a strong converse of the identification coding theorem, which holds for any channel that satisfies the strong converse of the channel coding theorem.

Index Terms—Shannon theory, channel output statistics, resolvability, random number generation complexity, channel capacity, noiseless source coding theorem, identification via channels.

I. INTRODUCTION

TO MOTIVATE the problem studied in this paper, let us consider the computer simulation of stochastic systems. Usually, the objective of the simulation is to compute a set of statistics of the response of the system to a given "real-world" input random process. To accomplish this, a sample path of the input random process is generated and empirical estimates of the desired output statistics are computed from the output sample path. A random number generator is used to generate the input sample path, and an important question is how many random bits are required per input sample. The answer would depend only on the given "real-world" input statistics if the objective were to reproduce those statistics exactly (in which case an infinite number of bits per sample would be required if the input distribution is continuous, for example). However, the real objective is to approximate the output statistics. Therefore, the required number of random bits depends not only on the input statistics but on the degree of approximation required for the output statistics, and on the system itself. In this paper, we are interested in the approximation of output statistics (via an alternative input process) with arbitrary accuracy, in the sense that the distance between the finite-dimensional statistics of the true output process and the approximated output process is required to vanish asymptotically. This leads to the introduction of a new concept in the Shannon theory: the resolvability of a system (channel), defined as the number of random bits per input sample required to achieve arbitrarily accurate approximation of the output statistics regardless of the actual input process. Intuitively, we can anticipate that the resolvability of a system will depend on how "noisy" it is. A coarse approximation of the input statistics whose generation requires comparatively few bits will be good enough when the system is very noisy, because, then, the output cannot reflect any fine detail contained in the input distribution.

Although the problem of approximation of output statistics involves no codes of any sort or the transmission/reproduction of information, its analysis and results turn out to be Shannon theoretic in nature. In fact, our main conclusion is that (for most systems) resolvability is equal to Shannon capacity.

In order to make the notion of resolvability precise, we need to specify the "distance" measure between true and approximated output statistics and the "complexity" measure of random number generation. Our main, but not exclusive, focus is on the $l_1$-distance (or variational distance) and on the worst-case measure of randomness, respectively. This complexity measure of a random variable is equal to the number of random bits required to generate every possible realization of the random variable; we refer to it as the resolution of the random variable and we show how to obtain it from the probability distribution. The alternative, average randomness measure is known to equal the entropy plus at most two bits [11], and it leads to the associated notion of mean-resolvability.

Section II introduces the main definitions. The class of channels we consider is very general. To keep the development as simple as possible we restrict attention to channels with finite input/output alphabets. However, most of the proofs do not rely on that assumption, and it is specifically pointed out when this is not the case. In addition to allowing channels with arbitrary memory structure, we deal with completely general input processes; in particular, neither ergodicity nor stationarity assumptions are imposed.

Further motivation for the notions of resolvability and mean-resolvability is given in Section III. Section IV gives a general formula for the resolvability of a channel. The achievability part of the resolvability theorem (which gives an upper bound to resolvability) holds for any channel, regardless of its memory structure or the finiteness of the input/output alphabets. The finiteness of the input set is the only substantive restriction under which the converse part (which lower bounds resolvability) is shown in Section IV via Lemma 6.

The approximation of output statistics has intrinsic connections with the following three major problems in the Shannon theory: (noiseless) source coding, channel coding, and identification via channels [1].

Manuscript received February 7, 1992; revised September 18, 1992. This work was supported in part by the U.S. Office of Naval Research under Grant N00014-90-J-1734 and in part by the NEC Corp. under its grant program.
T. S. Han is with the Graduate School of Information Systems, University of Electro-Communications, Tokyo 182, Japan.
S. Verdú is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544.
IEEE Log Number 9206960.



As a by-product of our resolvability results, we find in Section III a very general formula for the minimum achievable fixed-length source coding rate that holds for any finite-alphabet source, thereby dispensing with the classical assumptions of ergodicity and stationarity. In Section V, we show that as long as the channel satisfies the strong converse to the channel coding theorem, the resolvability formula found in Section IV is equal to the Shannon capacity. As a simple consequence of the achievability part of the resolvability theorem, we show in Section VI a general strong converse to the identification coding theorem, which was known to hold only for discrete memoryless channels [7]. This result implies that the identification capacity is guaranteed to equal the Shannon capacity for any finite-input channel that satisfies the strong converse to the Shannon channel coding theorem.

The more appropriate kind (average or worst-case) of complexity measure will depend on the specific application. For example, in single sample-path simulations, the worst-case measure may be preferable. At any rate, the limited study in Section VII indicates that in every case we consider, the mean-resolvability is also equal to the Shannon capacity of the system.

Similarly, the results presented in Section VIII evidence that the main conclusions on resolvability (established in previous sections) also hold when the variational-distance approximation criterion is replaced by the normalized divergence. Section VIII concludes with the proof of a folk theorem which fits naturally within the approximation theory of output statistics: the output distribution due to any good channel code must approximate the output distribution due to the capacity-achieving input.

Although the problem treated in this paper is new, it is interesting to note two previous information-theoretic contributions related to the notion of quantifying the minimum complexity of a randomness source required to approximate some given distribution. In one of the approaches to measuring the common randomness between two dependent random variables proposed in [21], the randomness source is the input to two independent memoryless random transformations, the outputs of which are required to have a joint distribution which approximates (in normalized divergence) the nth product of the given joint distribution. The class of channels whose transition probabilities can be approximated (in $\bar d$-distance) by sliding-block transformations of the input and an independent noise source is studied in [13], and the minimum entropy rate of the independent noise source required for accurate approximation is shown to be the maximum conditional output entropy over all stationary inputs.

II. PRELIMINARIES

This section introduces the basic notation and fundamental concepts as well as several properties to be used in the sequel.

Definition 1: A channel $W$ with input and output alphabets $A$ and $B$, respectively, is a sequence of conditional distributions

$$W = \{W^n(y^n \mid x^n) = P_{Y^n \mid X^n}(y^n \mid x^n);\ (x^n, y^n) \in A^n \times B^n\}_{n=1}^{\infty}.$$

In order to describe the statistics of input/output processes, we will use the sequence of finite-dimensional distributions¹ $\{X^n = (X_1^{(n)}, \dots, X_n^{(n)})\}_{n=1}^{\infty}$, which is abbreviated as $X$. The following notation will be used for the output distribution when the input is distributed according to $Q^n$:

$$Q^n W^n(y^n) = \sum_{x^n \in A^n} W^n(y^n \mid x^n)\, Q^n(x^n).$$

¹ No consistency restrictions between the channel conditional probabilities, or between the finite-dimensional distributions of the input/output processes, are imposed. Thus, "processes" refer to sequences of finite-dimensional distributions, rather than distributions on spaces of infinite-dimensional sequences.

Definition 2 [14]: Given a joint distribution $P_{X^n Y^n}(x^n, y^n) = P_{X^n}(x^n) W^n(y^n \mid x^n)$, the information density is the function defined on $A^n \times B^n$:

$$i_{X^n W^n}(a^n, b^n) = \log \frac{W^n(b^n \mid a^n)}{P_{Y^n}(b^n)}.$$

The distribution of the random variable $(1/n)\, i_{X^n W^n}(X^n, Y^n)$, where $X^n$ and $Y^n$ have joint distribution $P_{X^n Y^n}$, will be referred to as the information spectrum. The expected value of the information spectrum is the normalized mutual information $(1/n) I(X^n; Y^n)$.

Definition 3: The limsup in probability of a sequence of random variables $\{A_n\}$ is defined as the smallest extended real number $\beta$ such that for all $\epsilon > 0$

$$\lim_{n \to \infty} P[A_n > \beta + \epsilon] = 0.$$

Analogously, the liminf in probability is the largest extended real number $\alpha$ such that for all $\epsilon > 0$, $\lim_{n \to \infty} P[A_n \leq \alpha - \epsilon] = 0$. Note that a sequence of random variables converges in probability to a constant if and only if its limsup in probability is equal to its liminf in probability. The limsup in probability [resp., liminf in probability] of the sequence of random variables $\{(1/n)\, i_{X^n W^n}(X^n, Y^n)\}_{n=1}^{\infty}$ will be referred to as the sup-information rate [resp., inf-information rate] of the pair $(X, Y)$ and will be denoted by $\bar I(X; Y)$ [resp., $\underline I(X; Y)$]. The mutual information rate of $(X, Y)$, if it exists, is the limit

$$I(X; Y) = \lim_{n \to \infty} \frac{1}{n} I(X^n; Y^n).$$

Although convergence in probability does not necessarily imply convergence of the means (e.g., [15, p. 135]), in most cases of information-theoretic interest that implication does indeed hold in the context of information rates.

Lemma 1: For any channel with finite input alphabet, if $\bar I(X; Y) = \underline I(X; Y)$ (i.e., the information spectrum converges in probability to a constant), then

$$\bar I(X; Y) = I(X; Y) = \underline I(X; Y),$$

and the input-output pair $(X, Y)$ is called information stable.

Proof: See the Appendix for a proof that hinges on the finiteness of the input alphabet. □
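To make Definitions 2 and 3 concrete, the following minimal Python sketch (an illustration only; the binary symmetric channel, its crossover probability, and the i.i.d. equiprobable input are assumptions chosen for simplicity and do not come from the paper) computes the normalized information density of a memoryless channel and samples its information spectrum empirically. For this information-stable example the spectrum concentrates around the mutual information, as Lemma 1 predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example: binary symmetric channel with crossover delta, i.i.d. equiprobable input.
delta, n, samples = 0.1, 1000, 5000
W = np.array([[1 - delta, delta],
              [delta, 1 - delta]])        # W[x, y] = W(y | x)
p_x = np.array([0.5, 0.5])
p_y = p_x @ W                             # output marginal (memoryless channel, i.i.d. input)

def information_density(x, y):
    """(1/n) log2 of W^n(y^n|x^n)/P_{Y^n}(y^n) for a memoryless channel."""
    return np.mean(np.log2(W[x, y] / p_y[y]))

# Empirical information spectrum: distribution of (1/n) i_{X^n W^n}(X^n, Y^n).
spectrum = []
for _ in range(samples):
    x = rng.choice(2, size=n, p=p_x)
    y = np.where(rng.random(n) < delta, 1 - x, x)   # pass x through the BSC
    spectrum.append(information_density(x, y))
spectrum = np.array(spectrum)

# The spectrum clusters around I(X;Y) = 1 - h(delta) bits/symbol.
h = -delta * np.log2(delta) - (1 - delta) * np.log2(1 - delta)
print("mean of spectrum:", spectrum.mean(), "  1 - h(delta):", 1 - h)
```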


Definition 4 [7]: For any positive integer $M$,² a probability distribution $P$ is said to be $M$-type if

$$P(\omega) \in \left\{0, \frac{1}{M}, \frac{2}{M}, \dots, 1\right\}, \qquad \text{for all } \omega \in \Omega.$$

The number of different $M$-type distributions on $\Omega$ is upper bounded by $|\Omega|^M$.

² The alternative terminology type with denominator $M$ can be found in [2, ch. 12].

Definition 5: The resolution $R(P)$ of a probability distribution $P$ is the minimum $\log M$ such that $P$ is $M$-type. (If $P$ is not $M$-type for any integer $M$, then $R(P) = +\infty$.)

Resolution is a new measure of randomness which is related to conventional measures via the following immediate inequality.

Lemma 2: Let $H(P)$ denote the entropy of $P$ and let $H_0(P)$ denote the Rényi entropy of order 0, i.e., the logarithm of the number of points with positive $P$-mass. Then,

$$H(P) \leq H_0(P) \leq R(P),$$

with equality if and only if $P$ is equiprobable.

The information spectrum is upper bounded almost surely by the input (or output) resolution:

Lemma 3:

$$P[\,i_{X^n W^n}(X^n, Y^n) \leq R(X^n)\,] = 1.$$

Proof: For every $(x^n, y^n) \in A^n \times B^n$ such that $P_{X^n}(x^n) > 0$,³ we have

$$i_{X^n W^n}(x^n, y^n) \leq \log \frac{1}{P_{X^n}(x^n)} \qquad (2.1)$$

and

$$P_{X^n}(x^n) = m(x^n) \exp(-R(X^n)), \qquad (2.2)$$

where $m(x^n)$ is an integer greater than or equal to 1. Thus, the result follows uniting (2.1) and (2.2). □

³ Following common usage in information theory, when the distributions in Definitions 4–6 denote those of random variables, they will be replaced by the random variables themselves, e.g., $R(X^n)$, $H(X^n)$, $d(Y^n, \tilde Y^n)$.

Definition 6 (e.g., [3]): The variational distance or $l_1$-distance between two distributions $P$ and $Q$ defined on the same measurable space $(\Omega, \mathcal F)$ is

$$d(P, Q) = \sum_{\omega \in \Omega} |P(\omega) - Q(\omega)| = 2 \sup_{E \in \mathcal F} |P(E) - Q(E)|. \qquad (2.3)$$

Definition 7: Let $\epsilon \geq 0$. $R$ is an $\epsilon$-achievable resolution rate for channel $W$ if for every input process $X$ and for all $\gamma > 0$, there exists $\tilde X$ whose resolution satisfies

$$\frac{1}{n} R(\tilde X^n) \leq R + \gamma \qquad (2.4)$$

and

$$d(Y^n, \tilde Y^n) \leq \epsilon, \qquad (2.5)$$

for all sufficiently large $n$, where $Y$ and $\tilde Y$ are the output statistics due to input processes $X$ and $\tilde X$, respectively, i.e.,

$$P_{Y^n} = P_{X^n} W^n, \qquad P_{\tilde Y^n} = P_{\tilde X^n} W^n.$$

If $R$ is an $\epsilon$-achievable resolution rate for every $\epsilon > 0$, then we say that $R$ is an achievable resolution rate. By definition, the set of ($\epsilon$-) achievable resolution rates is either empty or a closed interval. The minimum $\epsilon$-achievable resolution rate (resp., achievable resolution rate) is called the $\epsilon$-resolvability (resp., resolvability) of the channel, and it is denoted by $S_\epsilon$ (resp., $S$). Note that $S_\epsilon$ is monotonically nonincreasing in $\epsilon$ and

$$\sup_{\epsilon > 0} S_\epsilon = S. \qquad (2.6)$$

The definitions of achievable resolution rates can be modified so that the defining property applies to a particular input $X$ instead of every input process. In such case, we refer to the corresponding quantities as ($\epsilon$-) achievable resolution rate for $X$ and ($\epsilon$-) resolvability for $X$, for which we use the notation $S_\epsilon(X)$ and $S(X)$. It follows from Definition 7 that

$$S = \sup_X S(X).$$

The main focus of this paper is on the resolvability of systems as defined in Definition 7. In addition, we shall investigate another kind of resolvability result by considering a different randomness measure. Specifically, if in Definition 7, (2.4) is replaced by

$$\frac{1}{n} H(\tilde X^n) \leq R + \gamma, \qquad (2.7)$$

then achievable resolution rates become achievable entropy rates and resolvability becomes mean-resolvability. It follows from Lemma 2 that for all $\epsilon > 0$ and $X$

$$\bar S_\epsilon(X) \leq S_\epsilon(X), \qquad (2.8)$$

where $\bar S_\epsilon$ and $\bar S$ denote ($\epsilon$-) mean-resolvability in parallel with the above definitions of $S_\epsilon$ and $S$. It is obvious that $\bar S = \sup_X \bar S(X)$.

The motivation for the definitions of resolvability and mean-resolvability is further developed in the following section.
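As a concrete illustration of Definitions 4–6 and Lemma 2, here is a minimal Python sketch (the example distributions are arbitrary assumptions, not taken from the paper) that computes the resolution $R(P)$, the order-0 Rényi entropy, the entropy, and the variational distance.

```python
from fractions import Fraction
from math import gcd, log2

def resolution(P):
    """R(P) = log M for the smallest M such that every mass of P is a multiple of 1/M."""
    fracs = [Fraction(p) for p in P]
    if sum(fracs) != 1:
        return float("inf")
    M = 1
    for f in fracs:
        M = M * f.denominator // gcd(M, f.denominator)   # least common denominator
    return log2(M)

def entropy(P):
    return -sum(p * log2(p) for p in P if p > 0)

def renyi0(P):
    """Order-0 Renyi entropy: log of the number of points with positive mass."""
    return log2(sum(1 for p in P if p > 0))

def variational_distance(P, Q):
    return float(sum(abs(p - q) for p, q in zip(P, Q)))

# Arbitrary three-point example distributions.
P = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
Q = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]
print(entropy(P), "<=", renyi0(P), "<=", resolution(P))   # Lemma 2: H(P) <= H_0(P) <= R(P)
print("d(P, Q) =", variational_distance(P, Q))
```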


III. RESOLUTION, RANDOM NUMBER GENERATION, AND SOURCE CODING

The purpose of this section is to motivate the definitions of resolvability and mean-resolvability introduced in Section II through their relationship with random number generation and noiseless source coding. Along the way, we will show that our resolvability theorems lead to new general results in source coding.

A. Resolution and Random Number Generation

A prime way to quantify the "randomness" of a random variable is via the complexity of its generation with a computer that has access to a basic random experiment which generates equally likely random values, such as fair coin flips, dice, etc. By complexity, we mean the number of random bits that the most efficient algorithm requires in order to generate the random variable. Depending on the algorithm, the required number of random bits may be random itself. For example, consider the generation of the random variable with probability masses $P[X = -1] = 1/4$, $P[X = 0] = 1/2$, $P[X = 1] = 1/4$, with an algorithm such that if the outcome of a fair coin flip is Heads, then the output is 0, and if the outcome is Tails, another fair coin flip is requested in order to decide $+1$ or $-1$. On the average this algorithm requires 1.5 coin flips, and in the worst case 2 coin flips are necessary. Therefore, the complexity measure can take two fundamental forms: worst-case or average (over the range of outcomes of the random variable).

First, let us consider the worst-case complexity. A conceptual model for the generation of arbitrary random variables is a deterministic transformation of a random variable uniformly distributed on [0, 1]. Although such a random variable cannot be generated by a discrete machine, this model suggests an algorithm for the generation of finitely-valued random variables in a finite-precision computer: a deterministic transformation of the outcome of a random number generator which outputs $M$ equally likely values, in lieu of the uniformly distributed random variable. The lowest value of $\log M$ required to generate the random variable (among all possible deterministic transformations) is its worst-case complexity. Other algorithms may require fewer random bits on the average, but not for every possible outcome. It is now easy to recognize that the worst-case complexity of a random variable is equal to its resolution. This is because processing the output of the $M$-valued random number generator with a deterministic transformation (which is conceptually nothing more than a table lookup) results in a discrete random variable whose probability masses are multiples of $1/M$, i.e., an $M$-type.

At first sight, it may seem that the use of resolution (as opposed to entropy) in the definition of resolvability is overly stringent. However, this is not the case because that definition is concerned with asymptotic approximation. Analogously, in practice, $M$ may be constrained to be a power of 2; however, this possible modification has no effect on the definition of achievable resolution rates (Definition 7) because it is only concerned with the asymptotic behavior of the ratio of resolution to number of dimensions of the approximating distribution.

The average complexity of random variable generation has been studied in the work of Knuth and Yao [11], which shows that the minimum expected number of fair bits required to generate a random variable lies between its entropy and its entropy plus two bits (cf. [2, Theorem 5.12.3]). That lower bound holds even if the basic equally likely random number generator is allowed to be nonbinary. This result is the reason for the choice of entropy as the average complexity measure in the definition of mean-resolvability. Note that the two-bit uncertainty of the Knuth-Yao theorem is inconsequential for the purposes of our (asymptotic) definition.

B. Resolution and Source Coding

Having justified the new concepts of resolvability and mean-resolvability on the basis of their significance in the complexity of random variable generation, let us now explore their relationship with well-established concepts in the Shannon theory. To this end, in the remainder of this section we will focus on the special case of an identity channel ($A = B$; $W^n(y^n \mid x^n) = 1$ if $x^n = y^n$), in which our approximation theory becomes one of approximation of source statistics.

Suppose we would like to generate random sequences according to the finite-dimensional distributions of some given process $X$. As we have argued, the worst-case and average number of bits per dimension required are $(1/n)R(X^n)$ and $(1/n)H(X^n)$, respectively. If, however, we are content with reproducing the source statistics within an arbitrarily small tolerance, fewer bits may be needed, asymptotically in the worst case. For example, consider the case of independent flips of a biased coin with tails probability equal to $1/\pi$. It is evident that $R(X^n) = \infty$ for every $n$. However, the asymptotic equipartition property (AEP) states that for any $\epsilon > 0$ and large $n$, the $\exp(n h(1/\pi) + n\epsilon)$ typical sequences exhaust most of the probability. If we let $M = \exp(n h(1/\pi) + 2n\epsilon)$, then we can quantize the probability of each of those sequences to a multiple of $1/M$, thereby achieving a quantization error in each mass of at most $1/M$. Consequently, the sum of the absolute errors on the typical sequences is exponentially small, and the masses of the atypical sequences can be approximated by zero because of the AEP, thereby yielding an arbitrarily small variational distance between the true and approximating statistics. The resolution rate of the approximating statistics is $h(1/\pi) + 2\epsilon$. Indeed, in this case $S(X) = \bar S(X) = h(1/\pi)$, and this reasoning can be applied to any stationary ergodic source to show that $S(X)$ is equal to the entropy rate of $X$ (always in the context of an identity channel). The key to the above procedure to approximate the statistics of the source with finite resolution is the use of repetition. Had we insisted on a uniform approximation to the original statistics we would not have succeeded in bringing the variational distance to negligible levels, because of the small but exponentially significant variation in the probability masses of the typical sequences. By allowing an approximation with a uniform distribution on a collection of $M$ elements with repetition, i.e., an $M$-type, with large enough $M$, it is possible to closely track those variations in the probability masses. A nice bonus is that for this approximation procedure to work it is not necessary that the masses of the typical sequences be similar, as dictated by the AEP. This is why the connection between resolvability and source coding is deeper than that provided by the AEP, and transcends stationary ergodic sources.
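The biased-coin argument above is easy to check numerically. The following sketch (the block length and accuracy parameter are illustrative assumptions) quantizes the typical-sequence probabilities to multiples of $1/M$, assigns zero mass to the atypical sequences, and evaluates the resulting variational distance by grouping strings of equal Hamming weight.

```python
import math

# Illustrative parameters (assumptions, not from the paper).
p, n, eps = 1 / math.pi, 1000, 0.05
h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
M = 2 ** math.ceil(n * (h + 2 * eps))        # quantization grid: resolution rate ~ h + 2*eps

def norm_logprob(k):                          # -(1/n) log2 P(x^n) for a string with k tails
    return -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n

# Strings of equal Hamming weight share the same probability, so work per weight class.
d, leftover = 0.0, M
for k in range(n + 1):
    count = math.comb(n, k)
    mass = math.exp(k * math.log(p) + (n - k) * math.log(1 - p))   # per-string probability
    if abs(norm_logprob(k) - h) <= eps:       # typical: round each mass down to a multiple of 1/M
        q = math.floor(M * mass)
        leftover -= q * count
        d += count * abs(mass - q / M)
    else:                                     # atypical: approximated by zero mass
        d += count * mass
d += leftover / M                             # remaining quantized mass placed on one typical string
print("resolution rate:", round(math.log2(M) / n, 4), "bits/symbol")
print("variational distance to the true source statistics:", round(d, 6))
```

With these parameters the distance is already small, and it decays further as $n$ grows while the resolution rate stays near $h(1/\pi) + 2\epsilon$.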


To show this, let us first record the standard definitions of the fundamental limits in fixed-length and variable-length source coding.

Definition 8: $R$ is an $\epsilon$-achievable source coding rate for $X$ if for all $\gamma > 0$ and all sufficiently large $n$, there exists a collection of $M$ $n$-tuples $\{x_1^n, \dots, x_M^n\}$ such that

$$\frac{1}{n} \log M \leq R + \gamma$$

and

$$P[X^n \notin \{x_1^n, \dots, x_M^n\}] \leq \epsilon.$$

$R$ is an achievable (fixed-length) source coding rate for $X$ if it is $\epsilon$-achievable for all $0 < \epsilon < 1$. $T(X)$ denotes the minimum achievable source coding rate for $X$.

Definition 9: Fix an integer $r \geq 2$. $R$ is an achievable variable-length source coding rate for $X$ if for all $\gamma > 0$ and all sufficiently large $n$, there exists an $r$-ary prefix code for $X^n$ such that the average codeword length $L_n$ satisfies

$$\frac{1}{n} L_n \log r \leq R + \gamma.$$

The minimum achievable variable-length source coding rate for $X$ is denoted by $\bar T(X)$.

As shown below, in the special case of the identity channel, resolvability and mean-resolvability reduce to the minimum achievable fixed-length and variable-length source coding rates, respectively, for any source. Although quite different from the familiar setting of combined source and channel coding (e.g., no decoding is present at the channel output), the approximation theory of output statistics could be subtitled "source coding via channels" because of the following two results.

Theorem 1: For any $X$ and the identity channel,

$$S(X) = T(X).$$

Proof:
1) $T(X) \leq S(X)$. We show that if $R$ is an $\epsilon$-achievable resolution rate for $X$, then it is an $\epsilon/2$-achievable source coding rate for $X$. According to Definition 7, for every $\gamma > 0$ and all sufficiently large $n$, there exists $\tilde X^n$ with

$$\frac{1}{n} R(\tilde X^n) \leq R + \gamma$$

and

$$d(X^n, \tilde X^n) \leq \epsilon.$$

We can view $\tilde X^n$ as putting mass $1/M$ on each member of a collection of $M = \exp(R(\tilde X^n))$ elements of $A^n$ denoted by $D = \{x_1^n, \dots, x_M^n\}$. (Note that the $M$ elements of this collection need not all be different.) The collection $D$ is a source code with probability of error smaller than $\epsilon/2$ because

$$\epsilon \geq d(X^n, \tilde X^n) \geq 2 P_{X^n}(D^c) - 2 P_{\tilde X^n}(D^c) = 2 P_{X^n}(D^c).$$

2) $S(X) \leq T(X)$. We show that if $R$ is an $\epsilon$-achievable source coding rate for $X$, then it is a $3\epsilon$-achievable resolution rate for $X$. For arbitrary $\gamma > 0$ and all sufficiently large $n$, select $D = \{x_1^n, \dots, x_M^n\}$ such that

$$\frac{1}{n} \log M \leq R + \gamma, \qquad P[X^n \notin \{x_1^n, \dots, x_M^n\}] \leq \epsilon.$$

Choose $M'$ such that

$$\exp(nR + 2n\gamma) \leq M' \leq \exp(nR + 3n\gamma)$$

and an arbitrary element $x_0^n \notin D$. We are going to construct an approximation $\tilde X^n$ to $X^n$ which satisfies the following conditions:

a) $\tilde X^n$ is an $M'$-type,
b) $P_{\tilde X^n}(x_0^n) \leq \epsilon$,
c) $|P_{\tilde X^n}(x_i^n) - P_{X^n}(x_i^n)| \leq 1/M'$, for $i = 1, \dots, M$,
d) $P_{\tilde X^n}(D) + P_{\tilde X^n}(x_0^n) = 1$.

It will then follow immediately that $R$ is a $3\epsilon$-achievable resolution rate, as

$$\frac{1}{n} R(\tilde X^n) \leq R + 3\gamma$$

and

$$d(X^n, \tilde X^n) \leq \sum_{i=1}^{M} |P_{\tilde X^n}(x_i^n) - P_{X^n}(x_i^n)| + P_{\tilde X^n}(x_0^n) + \sum_{x^n \in D^c} P_{X^n}(x^n) \leq 2\epsilon + \frac{M}{M'} \leq 2\epsilon + \exp(-n\gamma) \leq 3\epsilon,$$

for all sufficiently large $n$. The construction of $\tilde X^n$ is

$$P_{\tilde X^n}(x^n) = \begin{cases} k_i/M', & \text{if } x^n = x_i^n,\ i = 0, \dots, M, \\ 0, & \text{if } x^n \notin \{x_0^n, x_1^n, \dots, x_M^n\}, \end{cases}$$

where the integers $k_i$ are selected as follows. If

$$\sum_{i=1}^{M} \lceil M' P_{X^n}(x_i^n) \rceil \leq M',$$

then

$$k_i = \lceil M' P_{X^n}(x_i^n) \rceil, \quad i = 1, \dots, M, \qquad k_0 = M' - \sum_{i=1}^{M} k_i,$$

and properties a)–d) are readily seen to be satisfied. On the other hand, consider the case where

$$\sum_{i=1}^{M} \lceil M' P_{X^n}(x_i^n) \rceil = M' + L$$

with $1 \leq L \leq M$. Since it may be assumed, without loss of generality, that $P_{X^n}(x_i^n) > 0$ for all $i = 1, \dots, M$, we may set

$$k_0 = 0, \qquad k_i = \lceil M' P_{X^n}(x_i^n) \rceil - 1 \geq 0, \quad i = 1, \dots, L, \qquad k_i = \lceil M' P_{X^n}(x_i^n) \rceil, \quad i = L+1, \dots, M,$$

which again guarantees that a)–d) are satisfied. □
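A minimal numerical sketch of the $M'$-type construction in part 2) of the proof (the toy source distribution, the greedy choice of $D$, and the resolution budget $M'$ below are illustrative assumptions, not taken from the paper):

```python
import math

# Toy source distribution over five "n-tuples" (an assumption for illustration).
P = {"aa": 0.40, "ab": 0.25, "ba": 0.20, "bb": 0.10, "ae": 0.05}
eps = 0.11

# D: a fixed-length "source code" capturing all but eps of the probability (chosen greedily here).
D, covered = [], 0.0
for x, mass in sorted(P.items(), key=lambda kv: -kv[1]):
    if covered >= 1 - eps:
        break
    D.append(x)
    covered += mass
x0 = next(x for x in P if x not in D)          # an arbitrary element outside D

M_prime = 2 ** 7                                # resolution budget M' (assumption)
k = {x: math.floor(M_prime * P[x]) for x in D}  # quantize masses on D
k[x0] = M_prime - sum(k.values())               # property d): the type sums to one

approx = {x: k.get(x, 0) / M_prime for x in P}  # the M'-type approximation X~
d = sum(abs(P[x] - approx[x]) for x in P)
print("d(X, X~) =", round(d, 4), " which is <= 3*eps =", 3 * eps)
```

(The sketch rounds down rather than up, which is the simpler of the two cases handled in the proof; the leftover mass plays the role of $k_0$.)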


Theorem 2: For any $X$ and the identity channel,

$$\bar S(X) = \bar T(X) = \limsup_{n \to \infty} \frac{1}{n} H(X^n).$$

Proof:
1) $\bar S(X) \leq \bar T(X)$. Suppose that $R$ is an achievable variable-length source coding rate. Then, Definition 9 states that there exists for all $\gamma > 0$ and all sufficiently large $n$ a prefix code whose average length $L_n$ satisfies

$$\frac{1}{n} L_n \log r \leq R + \gamma. \qquad (3.1)$$

Moreover, the fundamental source coding lower bound for an $r$-ary code (e.g., [2, Theorem 5.3.1]) is

$$H(X^n) \leq L_n \log r. \qquad (3.2)$$

Now, let $\tilde X = X$. Then, $d(X^n, \tilde X^n) = 0$, and (3.1)-(3.2) imply that

$$\frac{1}{n} H(\tilde X^n) \leq R + \gamma, \qquad (3.3)$$

concluding that $R$ is an achievable mean-resolution rate for $X$.

2) $\bar T(X) \leq \bar S(X)$. Let $R$ be an achievable mean-resolution rate for $X$. For arbitrary $\gamma > 0$, $0 < \epsilon < 1/2$, and all sufficiently large $n$, choose $\tilde X^n$ such that (3.3) is satisfied and $d(X^n, \tilde X^n) \leq \epsilon$. On the other hand, there exists an $r$-ary prefix code for $X^n$ with average length bounded by (e.g., [2, Theorem 5.4.1])

$$L_n \log r \leq H(X^n) + \log r. \qquad (3.4)$$

We want to show that if the above $\epsilon$ is chosen sufficiently small then the code satisfies

$$\frac{1}{n} L_n \log r \leq R + 2\gamma$$

for all sufficiently large $n$, thereby proving that $R$ is an achievable variable-length source coding rate for $X$. To that end, all that is required is the continuity of entropy in variational distance:

Lemma 4 [3, p. 33]: If $P$ and $Q$ are distributions defined on $\Omega$ such that $d(P, Q) \leq \theta \leq 1/2$, then

$$|H(P) - H(Q)| \leq \theta \log (|\Omega|/\theta). \qquad (3.5)$$

Using Lemma 4 and (3.3), we obtain

$$\frac{1}{n} H(X^n) + \frac{1}{n} \log r \leq \frac{1}{n} H(\tilde X^n) + \epsilon \log |A| + \frac{1}{n} \log \frac{1}{\epsilon} + \frac{1}{n} \log r \leq R + 2\gamma,$$

for sufficiently large $n$, if $\epsilon \log(|A|/\epsilon) < \gamma$.

3) $\bar T(X) = \limsup_{n \to \infty} (1/n) H(X^n)$. This follows immediately from the bounds in (3.2) and (3.4). See also [10]. □

Theorems 1 and 2 show a pleasing parallelism between the resolvability of a process (with the identity channel) and its minimum achievable source coding rate. Theorem 1 and the Shannon-McMillan theorem lead to the solution of $S(X)$ as the entropy rate of $X$ in the special case of stationary ergodic $X$. Interestingly, the results of this paper allow us to find the resolvability of any process with the identity channel, and thus a completely general formula for the minimum achievable source-coding rate for any source.

Theorem 3: For any $X$ and the identity channel,

$$S(X) = T(X) = \bar H(X),$$

where $\bar H(X)$ is the sup-entropy rate defined as $\bar I(X; Y)$ for the identity channel (cf. Definition 3), i.e., the smallest real number $\beta$ such that for all $\epsilon > 0$

$$\lim_{n \to \infty} P\left[\frac{1}{n} \log \frac{1}{P_{X^n}(X^n)} > \beta + \epsilon\right] = 0.$$

Proof:
1) $\bar H(X) \leq S(X)$. We will argue by contradiction: choose an achievable resolution rate $R$ for $X$ such that for some $\delta > 0$

$$R + \delta < \bar H(X).$$

By the definition of $\bar H(X)$, there exists $\alpha > 0$ such that

$$P_{X^n}(D_0) \geq \alpha \qquad (3.6)$$

infinitely often, with $D_0$ defined as the set of least likely source words:

$$D_0 = \left\{x^n \in A^n : \frac{1}{n} \log \frac{1}{P_{X^n}(x^n)} \geq R + \delta\right\}.$$

Select $0 < \epsilon < \alpha^2$, and $\tilde X^n$ for all sufficiently large $n$ to satisfy

$$d(X^n, \tilde X^n) \leq \epsilon$$

and

$$\frac{1}{n} R(\tilde X^n) \leq R + \frac{\delta}{2}.$$

Define

$$D_1 = \left\{x^n \in A^n : P_{X^n}(x^n) > 0 \text{ and } \left|1 - \frac{P_{\tilde X^n}(x^n)}{P_{X^n}(x^n)}\right| \leq \epsilon^{1/2}\right\}$$

and consider

$$P_{X^n}(D_1 \cap D_0) \geq P_{X^n}(D_0) - P_{X^n}(D_1^c) \geq \alpha - \epsilon^{1/2} > 0, \qquad (3.7)$$

which holds infinitely often because of (3.6) and

$$\epsilon^{1/2} P_{X^n}(D_1^c) \leq \sum_{x^n \in D_1^c} P_{X^n}(x^n) \left|1 - \frac{P_{\tilde X^n}(x^n)}{P_{X^n}(x^n)}\right| \leq \epsilon.$$


For those $n$ such that (3.7) holds, we can find $x_0^n \in D_1 \cap D_0$ whose $P_{\tilde X^n}$-mass satisfies the following lower and upper bounds:

$$P_{\tilde X^n}(x_0^n) \geq (1 - \epsilon^{1/2}) P_{X^n}(x_0^n) > 0$$

and

$$\frac{1}{n} \log \frac{1}{P_{\tilde X^n}(x_0^n)} \geq \frac{1}{n} \log \frac{1}{P_{X^n}(x_0^n)} - \frac{1}{n} \log(1 + \epsilon^{1/2}) > R + \frac{\delta}{2},$$

if $n$ is sufficiently large. Therefore, we have found (an infinite number of) $n$ such that

$$P_{\tilde X^n}\left[\frac{1}{n} \log \frac{1}{P_{\tilde X^n}(\tilde X^n)} > \frac{1}{n} R(\tilde X^n)\right] \geq P_{\tilde X^n}(x_0^n) > 0,$$

contradicting Lemma 3.

2) $S(X) \leq \bar H(X)$ is a special case (identity channel) of the general direct resolvability result (Theorem 4 in Section IV). □

Remark 1: We may consider a modified version of Definition 9 as follows. Let us say that an $r$-ary variable-length code $\{\phi(x^n)\}_{x^n \in A^n}$ for $X^n$ is an $\epsilon$-prefix code for $X^n$ ($0 < \epsilon < 1$) if there exists a subset $D$ of $A^n$ such that $P_{X^n}(D) \geq 1 - \epsilon$ and $\{\phi(x^n)\}_{x^n \in D}$ is a prefix code. It is easy to check that Theorem 2 continues to hold if "all $\gamma > 0$" and "$r$-ary prefix code" are replaced by "all $\gamma > 0$ and $0 < \epsilon < 1$" and "$r$-ary $\epsilon$-prefix code," respectively, in Definition 9.

A general formula for the minimum achievable rate for noiseless source coding without stationarity and ergodicity assumptions has been a longstanding goal. It had been achieved [10] in the setting of variable-length coding (see Theorem 2). In fixed-length coding, progress towards that goal had been achieved mainly in the context of stationary sources (via the ergodic decomposition theorem, e.g., [6]). A general result that holds for nonstationary/nonergodic sources is stated in⁴ [9] without introducing the notions of $T(X)$ and $\bar H(X)$. The results established in this section from the standpoint of distribution approximation attain general formulas for both fixed-length and variable-length source coding without recourse to stationarity or ergodicity assumptions. It should be noted that an independent proof of $T(X) = \bar H(X)$ can be obtained by generalizing the proof of the source coding theorem in [3, Theorem 1.1].

⁴ [9] refers to a Nankai University thesis by T. S. Yang for a proof.

IV. RESOLVABILITY THEOREMS

A general formula for the resolvability of any channel in terms of its statistical description is obtained in this section. This result will be shown by means of an achievability (or direct) theorem which provides an upper bound to resolvability along with a converse theorem which gives a lower bound to resolvability.

A. Direct Resolvability Theorem

Theorem 4: Every channel $W$ and input process $X$ satisfy

$$S_\epsilon(X) \leq \bar I(X; Y)$$

for any $\epsilon > 0$, where $Y$ is the output of $W$ due to $X$.

Proof: Fix an arbitrary $\gamma > 0$. According to Definition 7, we have to show the existence of a process $\tilde X$ such that

$$\lim_{n \to \infty} d(Y^n, \tilde Y^n) = 0$$

and $\tilde X^n$ is an $M$-type distribution with

$$M = \exp(n \bar I(X; Y) + n\gamma),$$

where $Y^n$, $\tilde Y^n$ are the output distributions due to $X^n$ and $\tilde X^n$, respectively.

We will construct the approximating input statistics by the Shannon random selection approach. For any collection of (not necessarily distinct) $M$ elements of $A^n$, the distribution constructed by placing $1/M$ mass on each of the members of the collection is an $M$-type distribution. If each member of the collection is generated randomly and independently with distribution $X^n$, we will show that the variational distance between $Y^n$ and the approximated output, averaged over the selection of the $M$-collection, vanishes; hence there must exist a sequence of realizations for which the variational distance also vanishes.

For any $\{c_j \in A^n,\ j = 1, \dots, M\}$, denote the output distribution

$$\tilde Y^n[c_1, \dots, c_M](y^n) = \frac{1}{M} \sum_{j=1}^{M} W^n(y^n \mid c_j). \qquad (4.1)$$

The objective is to show that

$$\lim_{n \to \infty} E\, d(Y^n, \tilde Y^n[X_1^n, \dots, X_M^n]) = 0,$$

where the expectation is with respect to i.i.d. $(X_1^n, \dots, X_M^n)$ with common distribution $X^n$.
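Before establishing this limit analytically, the random-selection construction itself is easy to simulate. The following sketch (a short-block binary symmetric channel with i.i.d. equiprobable input is assumed so that the exact output distributions can be enumerated; none of the parameters come from the paper) draws the $M$ codewords, forms the $M$-type output mixture of (4.1), and evaluates the variational distance exactly.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setting: BSC(delta), i.i.d. equiprobable input, short blocks.
delta, n, gamma = 0.2, 8, 0.25
W = np.array([[1 - delta, delta], [delta, 1 - delta]])   # W[x, y] = W(y|x)
p_x = np.array([0.5, 0.5])
p_y = p_x @ W
I = sum(p_x[x] * W[x, y] * math.log2(W[x, y] / p_y[y]) for x in (0, 1) for y in (0, 1))

M = math.ceil(2 ** (n * (I + gamma)))          # size of the approximating M-type
codebook = rng.choice(2, size=(M, n), p=p_x)   # M i.i.d. codewords drawn from X^n

outputs = np.array(list(itertools.product((0, 1), repeat=n)))
def W_n(y, x):                                  # W^n(y^n | x^n) for the memoryless channel
    return np.prod(W[x, y])

true_Y = np.array([np.prod(p_y[y]) for y in outputs])                            # P_{Y^n}
approx_Y = np.array([np.mean([W_n(y, c) for c in codebook]) for y in outputs])   # eq. (4.1)

print("M =", M, "  d(Y^n, Y~^n) =", np.sum(np.abs(true_Y - approx_Y)))
```

With such short blocks the distance is, of course, not yet small; Theorem 4 asserts that it can be driven to zero as $n$ grows, since the rate of the $M$-type exceeds the sup-information rate.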


Instead of using the standard Csiszár-Kullback-Pinsker bound in terms of divergence [3], in order to upper bound $d(Y^n, \tilde Y^n[X_1^n, \dots, X_M^n])$ we will use the following new bound in terms of the distribution of the log-likelihood ratio.

Lemma 5: For every $\mu > 0$,

$$d(P, Q) \leq \frac{2\mu}{\log e} + 2 P\left[\log \frac{P(X)}{Q(X)} > \mu\right],$$

where $X$ is distributed according to $P$.

Proof of Lemma 5: We can write the variational distance as

$$d(P, Q) = \sum_{x \in \Omega} 1\left\{0 \leq \log \frac{P(x)}{Q(x)}\right\} [P(x) - Q(x)] + \sum_{x \in \Omega} 1\left\{\log \frac{P(x)}{Q(x)} \leq 0\right\} [Q(x) - P(x)] = 2 d_1 + 2 d_2,$$

where

$$d_1 = \sum_{x \in \Omega} 1\left\{\log \frac{P(x)}{Q(x)} > \mu\right\} [P(x) - Q(x)] \leq P\left[\log \frac{P(X)}{Q(X)} > \mu\right]$$

and

$$d_2 = \sum_{x \in \Omega} 1\left\{0 \leq \log \frac{P(x)}{Q(x)} \leq \mu\right\} [P(x) - Q(x)] \leq \frac{\mu}{\log e}. \qquad \square$$

Proof of Theorem 4 (cont.): According to Lemma 5, it suffices to show that the following expression goes to 0 as $n \to \infty$, for every $\mu > 0$:

$$\sum_{c_1 \in A^n} \cdots \sum_{c_M \in A^n} P_{X^n}(c_1) \cdots P_{X^n}(c_M) \sum_{y^n \in B^n} \tilde Y^n[c_1, \dots, c_M](y^n)\, 1\left\{\log \frac{\tilde Y^n[c_1, \dots, c_M](y^n)}{P_{Y^n}(y^n)} > \mu\right\}.$$

Writing $\tilde Y^n[c_1, \dots, c_M](y^n)/P_{Y^n}(y^n) = (1/M) \sum_{j=1}^{M} \exp(i_{X^n W^n}(c_j, y^n))$, and singling out, by symmetry, the codeword through which $y^n$ is generated, the expression above is upper bounded by

$$P\left[\frac{1}{M} \exp\left(i_{X^n W^n}(X^n, Y^n)\right) > \tau\right] + P\left[\frac{1}{M} \sum_{j=2}^{M} \exp\left(i_{X^n W^n}(X_j^n, Y^n)\right) > 1 + \tau\right], \qquad (4.2)$$

where $\tau = (\exp \mu - 1)/2 > 0$, $X^n$ and $Y^n$ are connected through $W^n$, and $\{Y^n, X_2^n, \dots, X_M^n\}$ are independent.

The first probability on the right-hand side of (4.2) is

$$P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > \bar I(X; Y) + \gamma + \frac{1}{n} \log \tau\right],$$

which goes to 0 by definition of the sup-information rate. The second probability in (4.2) is upper bounded by

$$P\left[\frac{1}{M} \sum_{j=1}^{M} \exp\left(i_{X^n W^n}(X_j^n, Y^n)\right) > 1 + \tau\right]. \qquad (4.3)$$

Had we a maximum over $j = 1, \dots, M$ instead of the sum in (4.3), showing that the corresponding probability vanishes would be a standard step in the random-coding proof of the direct channel coding theorem. In the present case, we need to work harder. Despite the fact that the random variables in (4.3) are i.i.d. for $j = 1, \dots, M$, it is not possible to apply the weak law of large numbers to (4.3) directly because the distribution of each of the random variables in the sum depends on the number of terms in the sum (through $n$). In order to show that the probability in (4.3) vanishes it will be convenient to condition on $Y^n$ and, accordingly, to define the following random variables for every $y^n \in B^n$ and $j = 1, \dots, M$:

$$V_{n,j}(y^n) = \exp\left(i_{X^n W^n}(X_j^n, y^n)\right), \qquad Z_{n,j}(y^n) = V_{n,j}(y^n)\, 1\{V_{n,j}(y^n) \leq M\}, \qquad (4.4)$$

$$U_M(y^n) = \frac{1}{M} \sum_{j=1}^{M} V_{n,j}(y^n), \qquad T_M(y^n) = \frac{1}{M} \sum_{j=1}^{M} Z_{n,j}(y^n). \qquad (4.5)$$

Note that for every $y^n \in B^n$, both $\{V_{n,j}(y^n)\}_{j=1}^{M}$ and $\{Z_{n,j}(y^n)\}_{j=1}^{M}$ are independent collections of random variables because $\{X_j^n\}_{j=1}^{M}$ are independent. According to (4.4) and (4.5), the probability in (4.3) is equal to the expected value with respect to $P_{Y^n}$ of

$$P[U_M(y^n) > 1 + \tau] \leq P[T_M(y^n) \neq U_M(y^n)] + P[T_M(y^n) > 1 + \tau]. \qquad (4.6)$$

The first term on the right-hand side of (4.6) satisfies

$$P[T_M(y^n) \neq U_M(y^n)] \leq \sum_{j=1}^{M} P[Z_{n,j}(y^n) \neq V_{n,j}(y^n)] = M\, P[V_{n,1}(y^n) > M],$$


whose expectation with respect to $P_{Y^n}$ yields

$$M \sum_{x^n \in A^n} \sum_{y^n \in B^n} P_{X^n}(x^n) P_{Y^n}(y^n)\, 1\{\exp i_{X^n W^n}(x^n, y^n) > M\} = M \sum_{x^n \in A^n} \sum_{y^n \in B^n} P_{X^n Y^n}(x^n, y^n) \exp(-i_{X^n W^n}(x^n, y^n))\, 1\{\exp i_{X^n W^n}(x^n, y^n) > M\} \leq \sum_{x^n \in A^n} \sum_{y^n \in B^n} P_{X^n Y^n}(x^n, y^n)\, 1\{\exp i_{X^n W^n}(x^n, y^n) > M\} = P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > \bar I(X; Y) + \gamma\right],$$

which, again, goes to 0 by definition of $\bar I(X; Y)$. Regarding the second term in (4.6), notice first that

$$E[T_M(y^n)] = E[Z_{n,1}(y^n)] = \sum_{x^n \in A^n} P_{X^n \mid Y^n}(x^n \mid y^n)\, 1\{V_{n,1}(y^n) \leq M\} \leq 1. \qquad (4.7)$$

Therefore, using (4.7) and the Chebyshev inequality, we get

$$P[T_M(y^n) > 1 + \tau] \leq P\left[T_M(y^n) - E[T_M(y^n)] > \tau\right] \leq \frac{1}{\tau^2} \operatorname{var}(T_M(y^n)) \leq \frac{1}{\tau^2 M} E[Z_{n,1}^2(y^n)], \qquad (4.8)$$

where we have used the fact that $\{Z_{n,j}(y^n)\}_{j=1}^{M}$ are i.i.d. Finally, unconditioning the expectation on the right side of (4.8), we get

$$\frac{1}{M} E[Z_{n,1}^2(Y^n)] = \frac{1}{M} \sum_{x^n \in A^n} \sum_{y^n \in B^n} P_{X^n}(x^n) P_{Y^n}(y^n) \exp(2 i_{X^n W^n}(x^n, y^n))\, 1\{\exp i_{X^n W^n}(x^n, y^n) \leq M\} = E\left[\frac{1}{M} \exp i_{X^n W^n}(X^n, Y^n)\, 1\left\{\frac{1}{M} \exp i_{X^n W^n}(X^n, Y^n) \leq 1\right\}\right],$$

where the expectation on the right-hand side is with respect to $P_{X^n Y^n}$ and can be decomposed as

$$E\left[\frac{1}{M} \exp i_{X^n W^n}(X^n, Y^n)\, 1\left\{\frac{1}{M} \exp i_{X^n W^n}(X^n, Y^n) \leq \exp\left(-\frac{n\gamma}{2}\right)\right\}\right] + E\left[\frac{1}{M} \exp i_{X^n W^n}(X^n, Y^n)\, 1\left\{\exp\left(-\frac{n\gamma}{2}\right) < \frac{1}{M} \exp i_{X^n W^n}(X^n, Y^n) \leq 1\right\}\right] \leq \exp\left(-\frac{n\gamma}{2}\right) + P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > \bar I(X; Y) + \frac{\gamma}{2}\right],$$

which goes to 0 as $n \to \infty$ by definition. □

We remark that in most cases of interest in applications, $X$ and $W$ will be such that $(X, Y)$ is information stable, in which case the upper bound in Theorem 4 is equal to the input-output mutual information rate (cf. Lemma 1).

B. Converse Resolvability Theorem

Theorem 4 together with the converse resolvability theorem proved in this subsection will enable us to find a general formula for the resolvability of a channel. However, let us start by giving a negative answer to the immediate question as to whether the upper bound in Theorem 4 is always tight.

[Fig. 1. Discrete memoryless channel in Example 1.]

Example 1: Consider the 3-input, 2-output memoryless channel of Fig. 1, and the i.i.d. input process $X$ that uses 0 and 1 with probability 1/2 each. It is clear that $\bar I(X; Y) = I(X; Y) = 1$ bit/symbol. However, the deterministic input process that concentrates all the probability in $(e, \dots, e)$ achieves exactly the same output statistics. Thus $S(X) = \bar S(X) = 0$. On the other hand, it turns out that we can find a capacity-achieving input process for which the bound in Theorem 4 is tight. (We will see in the sequel that this is always true.) Let $X'$ be the uniform distribution on all sequences that contain no symbol $e$ and the same number of 0's and 1's (i.e., their type is (1/2, 0, 1/2)). The entropy rate, the resolution rate, and the mutual information rate of this process are all equal to 1 bit/symbol. Moreover, any input process which approximates $X'$ arbitrarily accurately cannot have a lower entropy rate (nor lower resolution rate, a fortiori). To see this, first consider the case when the input is restricted not to use $e$. Then the input is equal to the output and close variational distance implies that the entropies are also close (cf. Lemma 4). If $e$ is allowed in the input sequences, then the capabilities for approximating $X'$ do not improve because for any input sequence containing at least one $e$, the probability that the output sequence has type (1/2, 1/2) is less than 1/2. Therefore, the distance between the output distributions is lower bounded by one half the probability of the input sequences containing at least one $e$. Thus $S(X') = \bar S(X') = 1$ bit/symbol.

The degeneracy illustrated by Example 1 is avoided in important classes of channels such as discrete memoryless channels with full rank (cf. Remark 4). In those settings, sharp results including the tightness of Theorem 4 can be proved using the method of types [8]. In general, however, the converse resolvability theorem does not apply to individual inputs.
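The degenerate behavior described in Example 1 can be verified directly. In the following minimal sketch the transition matrix reconstructs Fig. 1 from the description in Example 1 (inputs 0 and 1 are noiseless, input $e$ produces either output with probability 1/2; this is an inference from the text rather than a reproduction of the figure), and the block length is an arbitrary small value so that output distributions can be enumerated exactly.

```python
import itertools
import numpy as np

# Channel of Example 1: inputs {0, e, 1}, outputs {0, 1}.  W[x, y] = W(y|x).
W = np.array([[1.0, 0.0],    # x = 0 (noiseless)
              [0.5, 0.5],    # x = e (pure noise)
              [0.0, 1.0]])   # x = 1 (noiseless)

n = 6
outputs = list(itertools.product((0, 1), repeat=n))

def output_dist(input_dist):
    """Exact P_{Y^n} for an i.i.d. input distribution on {0, e, 1} over the memoryless channel."""
    p_y = np.array(input_dist) @ W
    return np.array([np.prod(p_y[list(y)]) for y in outputs])

Y_half = output_dist([0.5, 0.0, 0.5])   # X: equiprobable on {0, 1}, never uses e
Y_e = output_dist([0.0, 1.0, 0.0])      # deterministic input (e, ..., e), zero resolution
print("d(Y^n, Y~^n) =", np.sum(np.abs(Y_half - Y_e)))   # both outputs are uniform, so d = 0
```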


Theorem 5: For any channel with finite input alphabet and all sufficiently small $\epsilon > 0$,

$$S_\epsilon \geq \sup_X \bar I(X; Y).$$

Proof: The following simple result turns the proof of the converse resolvability theorem into a constructive proof.

Lemma 6: Given a channel $W$ with finite input alphabet, fix $R > 0$, $\epsilon > 0$. If for every $\gamma > 0$ there exists a collection $\{Q_i^n\}_{i=1}^{N}$ such that

$$\frac{1}{n} \log \log N \geq R - \gamma \qquad (4.9)$$

and

$$\min_{i \neq j} d(Q_i^n W^n, Q_j^n W^n) > 2\epsilon \qquad (4.10)$$

infinitely often in $n$, then

$$S_\epsilon \geq R. \qquad (4.11)$$

Proof of Lemma 6: We first make the point that the achievability in Definition 7 is equivalent to its uniform version where the "sufficiently large $n$" for which the statement holds is independent of $X$. To see this, suppose that $R$ is $\epsilon$-achievable in the sense of Definition 7 and denote by $n_0(\epsilon, R, \gamma, W, X)$ the minimum $n_0$ such that for all $n \geq n_0$ there exists $\tilde X^n$ satisfying (2.4) and (2.5). We claim⁵

$$\sup_X n_0(\epsilon, R, \gamma, W, X) < \infty. \qquad (4.12)$$

⁵ The finiteness of the input alphabet is crucial in this argument.

Assume otherwise. Then there exists a sequence of input processes $\{X_k\}_{k=1}^{\infty}$ such that

$$\bar n_k = n_0(\epsilon, R, \gamma, W, X_k)$$

is an increasing divergent sequence. Construct a new input process $X$ by letting

$$X^n = X_k^n, \qquad \text{if } \bar n_k \leq n < \bar n_{k+1}.$$

Note that for all $k$, $n = \bar n_{k+1} - 1$ is such that there is no $\tilde X^n$ with the desired properties. On the other hand, since $\bar n_{k+1} - 1$ is a divergent sequence, $R$ cannot be an $\epsilon$-achievable resolution rate for the constructed input process $X$ and we arrive at a contradiction, thereby establishing (4.12).

Let us now contradict the claim of Lemma 6 and suppose that for some $R' < R$,

$$S_\epsilon \leq R'.$$

Then, for each $Q_i^n$, we can find $\tilde Q_i^n$ such that

$$\frac{1}{n} R(\tilde Q_i^n) \leq R' + \gamma \qquad (4.13)$$

and

$$d(Q_i^n W^n, \tilde Q_i^n W^n) \leq \epsilon,$$

if $n \geq \sup_X n_0(\epsilon, R', \gamma, W, X)$. Along those $n$, there is an infinite number of integers, denoted by $J$, for which (4.10) holds. Let us focus attention on those blocklengths only. We must have $\tilde Q_i^n \neq \tilde Q_j^n$ if $i \neq j$, for otherwise the triangle inequality implies

$$d(Q_i^n W^n, Q_j^n W^n) \leq d(Q_i^n W^n, \tilde Q_i^n W^n) + d(Q_j^n W^n, \tilde Q_j^n W^n) \leq 2\epsilon,$$

contradicting (4.10). The number of different $M'$-type distributions on $A^n$ is upper bounded by $|A^n|^{M'}$. Therefore, the number of different distributions satisfying (4.13) is upper bounded by

$$N \leq \exp(nR' + n\gamma)\, |A^n|^{\exp(nR' + n\gamma)} \leq |A^n|^{\exp(nR' + 2n\gamma)},$$

for sufficiently large $n \in J$, which results in

$$\frac{1}{n} \log \log N \leq R' + 3\gamma,$$

for sufficiently large $n \in J$, contradicting (4.9) because $\gamma > 0$ is arbitrarily small. □

Proof of Theorem 5 (cont.): The construction of the $N$ distributions required by Lemma 6 boils down to the construction of a channel derived from the original one and a code for that channel. This is because of Lemma 7, which is akin to the direct identification coding theorem [1] and whose proof readily follows from the argument used in the proof of [7, Theorem 3].

Lemma 7: Fix $0 < \lambda < 1/2$, and a conditional distribution $W_0^n : T_X^n \to T_Y^n$ with arbitrary alphabets $T_X^n \subset A^n$ and $T_Y^n \subset B^n$. If there exists an $(n, M, \lambda)$ code in the maximal error probability sense for $W_0^n$, then there exist $\beta > 1$ and a collection $\{(Q_i^n, D_i)\}_{i=1}^{N}$, where $Q_i^n$ is a distribution on $T_X^n$ and $D_i \subset T_Y^n$, such that for all $i = 1, \dots, N$,

$$R(Q_i^n) \leq \log M, \qquad (4.14)$$
$$Q_i^n W_0^n(D_i) \geq 1 - \lambda, \qquad (4.15)$$
$$Q_j^n W_0^n(D_i) \leq 2\lambda, \qquad \text{if } j \neq i, \qquad (4.16)$$
$$N \geq \frac{1}{M}\left(\beta^{M} - 1\right). \qquad (4.17)$$

We will show that the collection $\{Q_i^n\}_{i=1}^{N}$ satisfies the conditions of Lemma 6, with an appropriate choice of $M$. For now, let us construct an appropriate conditional distribution $W_0^n$ which will be suitable for finding the channel code required by Lemma 7.

Lemma 8: For every $X$ and $R_1 < \bar I(X; Y)$ there exist $0 < \alpha < 1$, $T_X^n \subset A^n$, $T_Y^n \subset B^n$, $T_{XY}^n \subset T_X^n \times T_Y^n$, and a conditional distribution $W_0^n : T_X^n \to T_Y^n$ such that $P_{X_0^n Y_0^n}(T_{XY}^n) = 1$ and, if $(x^n, y^n) \in T_{XY}^n$, then

$$\alpha\, W_0^n(y^n \mid x^n) \leq W^n(y^n \mid x^n) \leq W_0^n(y^n \mid x^n) \qquad (4.18)$$

and

$$\frac{1}{n} i_{X_0^n W_0^n}(x^n, y^n) > R_1 \qquad (4.19)$$

infinitely often, where

$$P_{X_0^n}(x^n) = \begin{cases} P_{X^n}(x^n)/P_{X^n}(T_X^n), & x^n \in T_X^n, \\ 0, & x^n \notin T_X^n, \end{cases} \qquad (4.20)$$

and $P_{Y_0^n} = P_{X_0^n} W_0^n$.


Proof: Choose $R_1 < R_2 < \bar I(X; Y)$ and define

$$D_{XY}^n = \left\{(x^n, y^n) \in A^n \times B^n : \frac{1}{n} i_{X^n W^n}(x^n, y^n) > R_2\right\}.$$

By definition of $\bar I(X; Y)$, there exists $\alpha > 0$ such that

$$P_{X^n Y^n}[D_{XY}^n] > 2\alpha, \qquad (4.21)$$

if $n \in I$, where $I$ is an infinite set. Define

$$T_Y^n(x^n) = \{y^n \in B^n : (x^n, y^n) \in D_{XY}^n\}, \qquad \sigma(x^n) = W^n(T_Y^n(x^n) \mid x^n),$$
$$T_X^n = \{x^n \in A^n : \sigma(x^n) > \alpha\}, \qquad T_Y^n = \bigcup_{x^n \in T_X^n} T_Y^n(x^n), \qquad T_{XY}^n = D_{XY}^n \cap (T_X^n \times T_Y^n),$$

and $W_0^n : T_X^n \to T_Y^n$ by

$$W_0^n(y^n \mid x^n) = \begin{cases} W^n(y^n \mid x^n)/\sigma(x^n), & \text{if } y^n \in T_Y^n(x^n), \\ 0, & \text{if } y^n \notin T_Y^n(x^n). \end{cases}$$

Now, (4.18) follows immediately because if $x^n \in T_X^n$, then $\alpha < \sigma(x^n) \leq 1$. In general, if $x^n \in A^n$, then $0 \leq \sigma(x^n) \leq 1$ and

$$P_{X^n Y^n}[D_{XY}^n] = \sum_{x^n \in A^n} P_{X^n}(x^n)\, \sigma(x^n) \leq P_{X^n}(T_X^n) + \alpha,$$

which together with (4.21) implies that

$$P_{X^n}(T_X^n) > \alpha,$$

if $n \in I$. In turn, this implies

$$P_{Y_0^n}(y^n) = \sum_{x^n \in T_X^n} W_0^n(y^n \mid x^n)\, \frac{P_{X^n}(x^n)}{P_{X^n}(T_X^n)} \leq \alpha^{-2} P_{Y^n}(y^n). \qquad (4.22)$$

Now to prove (4.19), note that if $(x^n, y^n) \in T_{XY}^n$ and $n \in I$,

$$\frac{1}{n} i_{X_0^n W_0^n}(x^n, y^n) = \frac{1}{n} \log \frac{W_0^n(y^n \mid x^n)}{P_{Y_0^n}(y^n)} \geq \frac{1}{n} \log \frac{W^n(y^n \mid x^n)}{P_{Y_0^n}(y^n)} \geq \frac{1}{n} i_{X^n W^n}(x^n, y^n) + \frac{2}{n} \log \alpha > R_2 + \frac{2}{n} \log \alpha > R_1$$

infinitely often, where we have used (4.22) to derive the second inequality. Finally, by definition of $W_0^n$,

$$P_{X_0^n Y_0^n}[T_{XY}^n] = 1. \qquad \square$$

Proof of Theorem 5 (cont.): Now it remains to show existence of the channel code required by Lemma 7. This is accomplished via a fundamental result in the Shannon theory applied to $X_0^n$ and $W_0^n$.

Lemma 9 [4]: For every $n$, $M$, and $\theta > 0$, there exists an $(n, M, \lambda)$ code for $W_0^n$ whose maximal error probability $\lambda$ satisfies

$$\lambda \leq \exp(-n\theta) + P\left[\frac{1}{n} i_{X_0^n W_0^n}(X_0^n, Y_0^n) \leq \frac{1}{n} \log M + \theta\right]. \qquad (4.23)$$

Now, let us choose $M$ so that

$$R_1 - 2\theta \leq \frac{1}{n} \log M < R_1 - \theta, \qquad (4.24)$$

for arbitrary $\theta > 0$. Then, owing to (4.19) and the second inequality in (4.24), the second term on the right-hand side of (4.23) vanishes. Lemmas 8 and 9 along with the left inequality in (4.24) provide the $(n, M, \exp(-n\theta))$ code required by Lemma 7 for an infinite number of blocklengths $n$. Then, (4.17) and (4.24) imply that for sufficiently large $n$

$$\frac{1}{n} \log \log N \geq \frac{1}{n} \log M - \theta \geq R_1 - 3\theta,$$

which satisfies (4.9) for arbitrary $\gamma > 0$ because $R_1 < \sup_X \bar I(X; Y)$ and $\theta > 0$ can be chosen arbitrarily close to those boundaries. Finally, to show (4.10) we apply Lemma 7 to get

$$d(Q_i^n W^n, Q_j^n W^n) \geq 2 Q_i^n W^n(D_i) - 2 Q_j^n W^n(D_i) \geq 2\alpha\, Q_i^n W_0^n(D_i) - 2 Q_j^n W_0^n(D_i) \geq 2\alpha(1 - \lambda) - 4\lambda, \qquad (4.25)$$

where the second inequality follows from (4.18) and the third inequality results from (4.15) and (4.16). With an adequate choice of $\lambda$ (guaranteed by Lemma 7) the right side of (4.25) is strictly positive and (4.10) holds, concluding the proof of Theorem 5. □

Theorems 4 and 5 and $S = \sup_X S(X)$ readily result in the general formula for channel resolvability.

Theorem 6: The resolvability of a channel $W$ with finite input alphabet is given by

$$S = \sup_X \bar I(X; Y). \qquad (4.26)$$

Remark 2: Theorems 4 and 5 actually show the stronger result:

$$S_\epsilon = \sup_X \bar I(X; Y) \qquad (4.27)$$

for all sufficiently small $\epsilon > 0$.
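For a discrete memoryless channel, which satisfies the strong converse (see Section V below), the resolvability of Theorem 6 coincides with the Shannon capacity $\max_{P_X} I(X; Y)$. A minimal sketch (assuming numpy; the ternary channel matrix is an arbitrary example, not from the paper) of the Blahut-Arimoto iteration that evaluates this maximum:

```python
import numpy as np

def blahut_arimoto(W, iters=500):
    """max over input distributions of I(X;Y) in bits, for channel matrix W[x, y] = W(y|x)."""
    nx = W.shape[0]
    p = np.full(nx, 1.0 / nx)                 # start from the uniform input
    for _ in range(iters):
        q = p @ W                             # current output distribution
        # per-letter divergence D(W(.|x) || q), with 0*log0 handled as 0
        d = np.sum(W * np.log2(W / q, where=W > 0, out=np.zeros_like(W)), axis=1)
        p = p * np.exp2(d)                    # Blahut-Arimoto update
        p /= p.sum()
    q = p @ W
    mi = np.sum(p[:, None] * W * np.log2(W / q, where=W > 0, out=np.zeros_like(W)))
    return mi, p

# Arbitrary example channel matrix (an assumption).
W = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
C, p_star = blahut_arimoto(W)
print("capacity = resolvability for this DMC:", round(C, 4), "bits/use, at input", p_star.round(3))
```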


V. RESOLVABILITY AND CAPACITY

Having derived a general expression for resolvability in Theorem 6, this section shows that for the great majority of channels of interest, resolvability is equal to the Shannon capacity.⁶ Let us first record the following fact.

⁶ We adhere to the conventional definition of channel capacity [3, p. 101].

Theorem 7: For any channel with finite input alphabet,

$$C \leq S. \qquad (5.1)$$

Proof: The result follows from Theorem 6 and the following chain of inequalities:

$$C \leq \liminf_{n \to \infty} \sup_{X^n} \frac{1}{n} I(X^n; Y^n) \leq \limsup_{n \to \infty} \sup_{X^n} \frac{1}{n} I(X^n; Y^n) \leq \sup_X \bar I(X; Y).$$

The first inequality is the general (weak) converse to the channel coding theorem, which can be proved using the Fano inequality (cf. [16, Theorem 1]). So only the third inequality needs to be proved, i.e., for all $\gamma > 0$ and all sufficiently large $n$,

$$\sup_{X^n} \frac{1}{n} I(X^n; Y^n) \leq \sup_X \bar I(X; Y) + \gamma,$$

but this follows from Lemma A1 (whose proof hinges on the finiteness of the input alphabet) as developed in the proof of Lemma 1 (cf. Appendix). □

In order to investigate under what conditions equality holds in (5.1) we will need the following definition.

Definition 10: A channel $W$ with capacity $C$ is said⁷ to satisfy the strong converse if for every $\gamma > 0$ and every sequence of $(n, M, \lambda_n)$ codes that satisfy

$$\frac{1}{n} \log M \geq C + \gamma,$$

it holds that

$$\lambda_n \to 1, \qquad \text{as } n \to \infty.$$

⁷ This is the strong converse in the sense of Wolfowitz [20].

Theorem 8: For any channel with finite input alphabet which satisfies the strong converse,

$$S = C.$$

Proof: In view of Theorem 7 and its proof, it is enough to show $S \leq C$. The following lemma implies the desired result

$$C \geq \sup_X \bar I(X; Y) = S.$$

Lemma 10: A channel $W$ that satisfies the strong converse has the property that for all $\delta > 0$ and $X$,

$$\lim_{n \to \infty} P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > C + \delta\right] = 0.$$

Proof of Lemma 10: Arguing by contradiction, assume that there exist $\delta > 0$, $\alpha > 0$, and $X$ such that

$$P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > C + \delta\right] > \alpha,$$

for all $n$ in an infinite set $J$ of integers. Under such an assumption, we will construct an $(n, M, 1 - \alpha/3)$ code with

$$M = \lceil \exp(nC + n\delta/2) \rceil \qquad (5.2)$$

for every sufficiently large $n \in J$; since the error probability of such a code sequence stays bounded away from 1 while its rate exceeds $C$, this contradicts the strong converse. The construction will follow the standard Shannon random selection with a suitably chosen input distribution.

Codebook: Each codeword is selected independently and randomly with distribution

$$P_{\hat X^n}(x^n) = \begin{cases} P_{X^n}(x^n)/P_{X^n}(G), & \text{if } x^n \in G, \\ 0, & \text{otherwise}, \end{cases}$$

where

$$G = \left\{x^n \in A^n : W^n(D(x^n) \mid x^n) \geq \frac{\alpha}{2}\right\}$$

and

$$D(x^n) = \left\{y^n \in B^n : \frac{1}{n} i_{X^n W^n}(x^n, y^n) > C + \delta\right\}.$$

Decoder: The following decoding set $D_i$ corresponds to codeword $c_i$:

$$D_i = D(c_i) - \bigcup_{\substack{j=1 \\ j \neq i}}^{M} D(c_j).$$

Error Probability:

$$W^n(D_i^c \mid c_i) \leq W^n(D^c(c_i) \mid c_i) + \sum_{\substack{j=1 \\ j \neq i}}^{M} W^n(D(c_j) \mid c_i) \leq 1 - \frac{\alpha}{2} + \sum_{\substack{j=1 \\ j \neq i}}^{M} W^n(D(c_j) \mid c_i). \qquad (5.3)$$

Let us now estimate the average of the last term in (5.3) with respect to the choice of $c_i$:

$$\sum_{c_i \in G} W^n(D(c_j) \mid c_i)\, P_{\hat X^n}(c_i) = \sum_{c_i \in G} W^n(D(c_j) \mid c_i)\, \frac{P_{X^n}(c_i)}{P_{X^n}(G)} \leq \frac{1}{P_{X^n}(G)} \sum_{x^n \in A^n} W^n(D(c_j) \mid x^n)\, P_{X^n}(x^n) = \frac{P_{Y^n}(D(c_j))}{P_{X^n}(G)},$$

where

$$P_{Y^n}(D(c_j)) = \sum_{y^n \in B^n} P_{Y^n}(y^n)\, 1\{P_{Y^n}(y^n) < \exp(-nC - n\delta)\, W^n(y^n \mid c_j)\} < \exp(-nC - n\delta).$$


Thus, the average of the right side of (5.3) is upper bounded by

$$1 - \frac{\alpha}{2} + (M - 1) \exp(-nC - n\delta)/P_{X^n}(G) \leq 1 - \frac{\alpha}{2} + \exp\left(-\frac{n\delta}{2}\right)\Big/P_{X^n}(G), \qquad (5.4)$$

using (5.2). Finally, all that remains is to lower bound $P_{X^n}(G)$. We notice that $P_{X^n}(G) = P[Z \geq \alpha/2]$, with $Z = W^n(D(X^n) \mid X^n)$. The random variable $Z$ lies in the interval [0, 1] and its expectation is bounded by

$$P\left[Z \geq \frac{\alpha}{2}\right] + \frac{\alpha}{2} \geq E[Z] = P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > C + \delta\right] > \alpha.$$

Therefore, the right side of (5.4) is upper bounded by

$$1 - \frac{\alpha}{2} + \frac{2}{\alpha} \exp\left(-\frac{n\delta}{2}\right) \leq 1 - \frac{\alpha}{3},$$

for all sufficiently large $n \in J$, thereby completing the construction and analysis of the sought-after code. □

The condition of Theorem 8 is not only sufficient but also necessary for $S = C$. This fact along with a general formula for the Shannon capacity of a channel (obtained without any assumptions such as information stability):

$$C = \sup_X \underline I(X; Y) \qquad (5.5)$$

(cf. the dual expression (4.26) for the channel resolvability) is proved in [17], by way of a new approach to the converse of the channel coding theorem.

Important classes of channels (such as those with finite memory) are known to satisfy the strong converse [5], [18], [20]. The archetypical example of a channel that does not satisfy the strong converse is:

[Fig. 2. Channel whose capacity is less than its resolvability.]

[Fig. 3. Information spectrum (probability mass function of the normalized information density) of the channel in Example 2 with δ₁ = 0.1, δ₂ = 0.15, a = 0.5, n = 1000.]

Example 2: Consider the channel in Fig. 2 where the switch selects one of the binary symmetric channels (BSC) with probability $(a, 1-a)$, $0 < a < 1$, and remains fixed for the whole duration of the transmission. Thus, its conditional probabilities are

$$W^n(y_1, \dots, y_n \mid x_1, \dots, x_n) = a \prod_{i=1}^{n} W_1(y_i \mid x_i) + (1 - a) \prod_{i=1}^{n} W_2(y_i \mid x_i),$$

where

$$W_i(y \mid x) = (1 - \delta_i)\, 1\{y = x\} + \delta_i\, 1\{y \neq x\}, \qquad 0 < \delta_i < 1/2, \quad i = 1, 2.$$

A typical information spectrum for this channel and large $n$ is depicted in Fig. 3. The $\epsilon$-capacity of this channel depends on $\epsilon$ [19] and its capacity is equal to $\min\{C_1, C_2\}$, where $C_i = \log 2 - h(\delta_i)$ is the capacity of BSC$_i$. In order to compute the resolvability of this channel, note first that if the distribution of the random variable $A_n$ is a mixture of two distributions, then the limsup in probability of $\{A_n\}$ is equal to the maximum of the corresponding limsups in probability that would result if $A_n$ were distributed under each of the individual distributions comprising the mixture. Now, to investigate the asymptotic behavior of the information spectrum, consider the bounds

$$\frac{1}{n} \log \min\{a, 1-a\} + \max\{u, v\} \leq \frac{1}{n} \log\left(a \exp(nu) + (1-a) \exp(nv)\right) \leq \max\{u, v\}. \qquad (5.6)$$

Therefore, the information spectrum evaluated with the optimal i.i.d. input distribution $X$ (which assigns equal probability to all the input sequences) converges asymptotically to the distribution of

$$\max\left\{\frac{1}{n} \sum_{j=1}^{n} i_{X_j W_1}(X_j, Y_j),\ \frac{1}{n} \sum_{j=1}^{n} i_{X_j W_2}(X_j, Y_j)\right\},$$

where we have identified $u$ and $v$ with the quantities within the maximum in (5.6). If $(X_j, Y_j)$ are connected through $W_1$ (which occurs with probability $a$), then the expected value of the first term in (5.6) exceeds that of the second one by

$$E[i_{X_j W_1}(X_j, Y_j)] = I(X_j; Y_j) = E[i_{X_j W_2}(X_j, Y_j)] + D(W_1 \| W_2 \mid X_j), \qquad (5.7)$$


where the expectations in (5.7) are with respect to the joint Proof: If R is a (Xi, &)-achievable ID rate, then for
distribution of (Xj, Yj) connected through Wi. Reversing every y > 0 and for all sufficiently large n, there exist
the roles of channels 1 and 2, we obtain an analogous (n, N, X1, &)-ID codes {(Qr, Di), i = 1,. . . , N} whose
expression to (5.7). Therefore, the weak law of large numbers rate satisfies
results in
; log log N > R - y.

7(X; Y) = max{Ci, Ca}. 0

We have seen that for the majority of channels, resolvability From such a sequence of codebooks { Qy , i = 1, . . . , N}
is equal to capacity, and therefore the body of results in where N grows monotonically (doubly exponentially) with
information theory devoted to the maximization of mutual information is directly applicable to the calculation of resolvability for these channels. Example 2 has illustrated the computation of resolvability using the formula in Theorem 6 in a case where the capacity is strictly smaller than resolvability. For channels that do not satisfy the strong converse it is of interest to develop tools for the maximization of the sup-information rate (resolvability) and of the inf-information rate (capacity). It turns out [17] that the basic results on mutual information which are the key to its maximization, such as the data-processing lemma and the optimality of independent inputs for memoryless systems, are inherited by $\bar{I}(X;Y)$ and $\underline{I}(X;Y)$.

VI. RESOLVABILITY AND IDENTIFICATION VIA CHANNELS

A major recent achievement in the Shannon theory was the identification (ID) coding theorem of Ahlswede and Dueck [1]. The ID capacity of a channel is the maximal iterated logarithm of the number of messages per channel use which can be reliably transmitted when the receiver is only interested in deciding whether a specific message is equal to the transmitted message. The direct part of the ID coding theorem states that the ID capacity of any channel is lower bounded by its capacity [1]. A version of the converse theorem (soft converse), which requires the error probabilities to vanish exponentially fast and applies to discrete memoryless channels, was proved in [1]. The strong converse to the ID coding theorem for discrete memoryless channels was proved in [7]. Both proofs (of the soft converse and the strong converse) are nonelementary and crucially rely on the assumption that the channel is discrete and memoryless. The purpose of this section is to provide a version of the strong converse to the ID coding theorem which not only holds in wide generality, but follows immediately from the direct part of the resolvability theorem. The link between the theories of approximation of output statistics and identification via channels is not accidental. We have already seen that the proof of the converse resolvability theorem (Theorem 5) uses Lemma 7, which is, in essence, the central tool in the proof of the direct ID coding theorem.

The root of the interplay between both bodies of results is the following simple theorem.⁸

Theorem 9: Let the channel have finite input alphabet. Its $(\lambda_1, \lambda_2)$-ID capacity is upper bounded by its $\epsilon$-resolvability $S_\epsilon$, with $0 < \epsilon < 1 - \lambda_1 - \lambda_2$.

⁸We refer the reader to [1], [7] for the pertinent definitions in identification via channels.

... $n$, we can construct the sequence $\{Q_i = (Q_i^1, Q_i^2, \ldots)\}_{i=1}^{\infty}$ required in Lemma 6, with an arbitrary choice of $Q_i^n$ if $i > N$. Then $\{Q_i\}_{i=1}^{\infty}$ satisfies (4.9). Furthermore, for $i \ne j$ and $i \le N$, $j \le N$, then for all sufficiently large $n$,

$$d(Q_i W^n, Q_j W^n) \ge 2 Q_i W^n(D_i) - 2 Q_j W^n(D_i) \ge 2(1 - \lambda_1) - 2\lambda_2 > 2\epsilon,$$

satisfying (4.10). Thus, the conclusion of Lemma 6 is that the $(\lambda_1, \lambda_2)$-ID rate does not exceed $S_\epsilon$, and Theorem 9 is proved. □

Theorem 9 and Theorem 4 imply that the $(\lambda_1, \lambda_2)$-ID capacity is upper bounded by

$$\sup_{X} \bar{I}(X; Y),$$

which is equal to the Shannon capacity under the mild sufficient condition (strong converse) found in Section V. This gives a very general version of the strong converse to the identification coding theorem, which applies to any finite-input channel, well beyond the discrete memoryless channels for which it was already known to hold [7, Theorem 2].

It should be noted that Theorem 9 and [7, Theorem 1] imply that if $0 < \lambda < \lambda_1$, $\lambda < \lambda_2$, $\epsilon > 0$, and $\epsilon + \lambda_1 + \lambda_2 \le 1$, then

$$C_\lambda \le D_{\lambda_1, \lambda_2} \le S_\epsilon, \qquad (6.1)$$

where $C_\lambda$ is the $\lambda$-capacity of the channel in the maximal error probability sense and $D_{\lambda_1, \lambda_2}$ is the $(\lambda_1, \lambda_2)$-ID capacity. Note that unlike the bound on $\epsilon$-resolvability in Theorem 5, (6.1) can be used with arbitrary $0 < \epsilon < 1$, but may not be tight if the channel does not satisfy the strong converse. If the strong converse is satisfied, however, (6.1) holds with equality for all sufficiently small $\epsilon > 0$, because of (4.27) and Theorem 8 as well as the fact that $C = C_\lambda$ for all $0 < \lambda < 1$ due to the assumed strong converse. Consequently, we have the following corollary.

Corollary: For any finite-input channel satisfying the strong converse:

$$C = D_{\lambda_1, \lambda_2} = S_\epsilon, \qquad (6.2)$$

if $\lambda_1 + \lambda_2 < 1$.

The first equality in (6.2) had been proved in [7, Theorem 2] for the specific case of discrete memoryless channels using a different approach.
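The displayed bound in the proof of Theorem 9, $d(Q_i W^n, Q_j W^n) \ge 2Q_i W^n(D_i) - 2Q_j W^n(D_i)$, is an instance of the elementary fact that for any two distributions $P$, $Q$ and any set $D$, $d(P, Q) \ge 2(P(D) - Q(D))$; the ID decoding set $D_i$ is hit with probability at least $1 - \lambda_1$ by one output distribution and at most $\lambda_2$ by the other. The following minimal Python sketch checks that bound numerically; the distributions and the decoding set are hypothetical, chosen only for illustration.

```python
import numpy as np

def variational_distance(p, q):
    """l1 (variational) distance between two pmfs on a common finite set."""
    return float(np.abs(np.asarray(p) - np.asarray(q)).sum())

# Hypothetical output distributions on a 4-letter output alphabet, playing the
# roles of Q_i W^n and Q_j W^n, and a decoding set D_i = {0, 1} that the first
# distribution hits with probability 0.90 (>= 1 - lambda_1) and the second
# with probability 0.15 (<= lambda_2).
p = np.array([0.55, 0.35, 0.07, 0.03])   # Q_i W^n:  p(D_i) = 0.90
q = np.array([0.05, 0.10, 0.05, 0.80])   # Q_j W^n:  q(D_i) = 0.15
D = [0, 1]

lower = 2 * (p[D].sum() - q[D].sum())    # 2(P(D_i) - Q(D_i)) = 1.5
print(variational_distance(p, q), ">=", lower)   # about 1.54 >= 1.5
```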


VII. MEAN-RESOLVABILITY THEOREMS

This section briefly investigates the effect of replacing the worst-case randomness measure (resolution) by the average randomness measure (entropy) on the results obtained in Section IV. The treatment here will not be as thorough as that in Section IV, and in particular, we will leave the proof of a general formula for mean-resolvability for future work. Instead, we will present some evidence that, in channels of interest, mean-resolvability is also equal to capacity.

An immediate consequence of (2.8) is that the direct resolvability theorem (Theorem 4) holds verbatim for mean-resolvability, i.e., for all $\epsilon > 0$ and $X$

$$\bar{S}_\epsilon(X) \le \bar{S}(X) \le \bar{I}(X; Y), \qquad (7.1)$$

and if the channel satisfies the strong converse, then

$$\bar{S} \le C. \qquad (7.2)$$

Therefore, in this section our attention can be focused on converse mean-resolvability theorems.

First, we illustrate a proof technique which is completely different from that of the converse resolvability theorem (Theorem 5) in order to find the mean-resolvability of binary symmetric channels (BSC).

Theorem 10: The mean-resolvability of a BSC is equal to its capacity.

Proof: Since BSC's satisfy the strong converse, (7.2) holds and we need to show

$$\bar{S} \ge C. \qquad (7.3)$$

Suppose otherwise, i.e., for some $\mu > 0$,

$$0 < \bar{S} < C - \mu. \qquad (7.4)$$

Let $\lambda = \mu/\log 4$, and $\gamma = \mu/8$. For all sufficiently large $n$, there exists an $(n, M, \lambda)$ code (in the maximal error probability sense) such that all its codewords are distinct ($\lambda < 1/2$, because $\mu < \log 2$) and

$$\log 2 \ge \frac{1}{n}\log M \ge C - \gamma. \qquad (7.5)$$

Let $X^n$ be uniformly distributed on the codewords of the $(n, M, \lambda)$ code. Thus,

$$H(X^n) = \log M. \qquad (7.6)$$

According to (7.4), for any $0 < \theta < 1/2$, there exists $\tilde{X}^n$ such that the outputs to $X^n$ and $\tilde{X}^n$ satisfy

$$d(Y^n, \tilde{Y}^n) < \theta \qquad (7.7)$$

and

$$\frac{1}{n}H(\tilde{X}^n) \le C - \mu + \gamma. \qquad (7.8)$$

Since by definition of BSC

$$H(Y^n \mid X^n) = H(\tilde{Y}^n \mid \tilde{X}^n), \qquad (7.9)$$

we obtain

$$\frac{1}{n}H(Y^n) - \frac{1}{n}H(\tilde{Y}^n) = \frac{1}{n}H(X^n) - \frac{1}{n}H(\tilde{X}^n) - \frac{1}{n}H(X^n \mid Y^n) + \frac{1}{n}H(\tilde{X}^n \mid \tilde{Y}^n)$$
$$\ge \frac{1}{n}H(X^n) - \frac{1}{n}H(\tilde{X}^n) - \frac{\lambda}{n}\log M - \frac{1}{n}h(\lambda)$$
$$\ge C - \gamma - [C - \mu + \gamma] - \lambda\log 2 - \frac{1}{n}h(\lambda)$$
$$= \mu/4 - \frac{1}{n}h(\lambda)$$
$$\ge \gamma, \qquad (7.10)$$

where the first inequality is a result of the Fano inequality, the second inequality follows from (7.5), (7.6), and (7.8), and the last inequality holds for all sufficiently large $n$. Now applying Lemma 4 to the present case (output alphabet of size $2^n$), (7.7) results in

$$|H(Y^n) - H(\tilde{Y}^n)| \le n\theta\log 2 + \theta\log(1/\theta),$$

which contradicts (7.10) because $\theta$ can be chosen arbitrarily small. □

Remark 3: The only feature of the BSC used in the proof of Theorem 10 is that $H(Y \mid X = x)$ is independent of $x$, which holds for any "additive-noise" discrete memoryless channel.

A converse mean-resolvability theorem which, unlike Theorem 5, does not hinge on the assumption of finite input alphabet can be proved by curtailing the freedom of choice of the approximating input distribution. In Example 1, we illustrated the pathological behavior that may arise when the approximating input distribution puts mass in sequences which have zero probability under the original distribution. One way to avoid this behavior is to restrict the approximating input distribution to be absolutely continuous with respect to the original input distribution.

Theorem 11 (Mean-Resolvability Semi-Converse): For any channel $W$ with capacity $C$ there exists an input process $X$ such that if $\tilde{X}$ satisfies

$$\lim_{n\to\infty} d(Y^n, \tilde{Y}^n) = 0 \qquad (7.12)$$

and $P_{\tilde{X}^n} \ll P_{X^n}$, then, for every $\mu > 0$,

$$\frac{1}{n}H(\tilde{X}^n) \ge C - \mu \qquad (7.13)$$

infinitely often.

Proof: Let us suppose that the result is not true and therefore there exists $\mu_0 > 0$ such that for every input process $X$ we can find $\tilde{X}$ such that $P_{\tilde{X}^n} \ll P_{X^n}$, (7.12), and

$$\frac{1}{n}H(\tilde{X}^n) \le C - \mu_0 \qquad (7.14)$$


are satisfied. Fix $0 < \gamma < \mu_0$ and choose

$$\tau < \frac{\mu_0 - \gamma}{C - \mu_0}, \qquad (7.15)$$

$$\lambda < \frac{\tau}{2\tau + 1}. \qquad (7.16)$$

For all sufficiently large $n$, select an $(n, M, \lambda)$ code $\{(c_i, D_i)\}_{i=1}^{M}$ (in the maximal error probability sense) with rate

$$\frac{1}{n}\log M \ge C - \gamma.$$

Let $X^n$ be equal to $c_i$ with probability $1/M$, $i = 1, \ldots, M$. The restriction $P_{\tilde{X}^n} \ll P_{X^n}$ means that the approximating distribution can only put mass on the codewords $\{c_1, \ldots, c_M\}$. However, the mass on those points is not restricted in any way (e.g., $P_{\tilde{X}^n}$ need not have finite resolution).

Define the set of the most likely codewords under $P_{\tilde{X}^n}$

$$T_\tau = \{x^n \in A^n : P_{\tilde{X}^n}(x^n) \ge \exp(-n(C - \mu_0)(1 + \tau))\}, \qquad (7.17)$$

whose cardinality is obviously bounded by

$$|T_\tau| \le \exp(n(C - \mu_0)(1 + \tau)). \qquad (7.18)$$

From (7.14), we have

$$n(C - \mu_0) \ge E[\log 1/P_{\tilde{X}^n}(\tilde{X}^n)] \ge n(C - \mu_0)(1 + \tau)\, P_{\tilde{X}^n}(T_\tau^c),$$

or the lower bound

$$P_{\tilde{X}^n}(T_\tau) \ge \tau/(1 + \tau). \qquad (7.19)$$

We will lower bound the variational distance between $Y^n$ and $\tilde{Y}^n$ by estimating the respective probabilities of the set

$$B = \bigcup_{i \in I} D_i, \qquad (7.20)$$

where $I = \{i \in \{1, \ldots, M\} : c_i \in T_\tau\}$. Since the sets in (7.20) are disjoint, we have

$$P_{\tilde{Y}^n}(B) \ge \sum_{i \in I} W^n(D_i \mid c_i)\, P_{\tilde{X}^n}(c_i) \ge (1 - \lambda)\frac{\tau}{1 + \tau}. \qquad (7.21)$$

On the other hand,

$$P_{Y^n}(B) = \frac{1}{M}\sum_{i \in I} W^n(B \mid c_i) + \frac{1}{M}\sum_{i \notin I} W^n(B \mid c_i) \le \frac{|I|}{M} + \lambda \le \exp(-\beta n) + \lambda, \qquad (7.22)$$

for some $\beta > 0$ because of (7.15) and (7.18). Combining (7.21) and (7.22), we get

$$d(Y^n, \tilde{Y}^n) \ge 2 P_{\tilde{Y}^n}(B) - 2 P_{Y^n}(B) \ge \frac{2\tau}{1 + \tau}(1 - \lambda) - 2\lambda - 2\exp(-\beta n),$$

which is bounded away from zero because of (7.16). □

In connection with Theorem 11, it should be pointed out that the conclusion of the achievability result in Theorem 4 holds even if $\tilde{X}^n$ is restricted to be absolutely continuous with respect to $X^n$ as long as $X^n$ is a discrete distribution.⁹ (Recall that in the proof of Theorem 4, $\tilde{X}^n$ is generated by random selection from $X^n$.)

⁹Theorem 1 holds for infinite alphabet channels, as well.

Remark 4: Recall that a general formula for the individual resolvability of input processes is attainable only for channels that avoid the pathological behavior of Example 1. In [8], it is shown that

$$S(X) = \bar{I}(X; Y) \qquad (7.23)$$

along with

$$\bar{S} = S = C, \qquad (7.24)$$

if the channel $W$ is discrete memoryless with full rank, i.e., the transition vectors $\{W(\cdot \mid a)\}_{a \in A}$ are linearly independent. This class of channels includes as a special case the BSC (Theorem 10). Even in this special case, however, the complete characterization of $\bar{S}(X)$ remains unsolved.

VIII. DIVERGENCE-GAUGED APPROXIMATION

So far, we have studied the approximation of output statistics under the criterion of vanishing variational distance. Here, we will consider the effect of replacing the variational distance $d(Y^n, \tilde{Y}^n)$ by the normalized divergence:

$$\frac{1}{n} D(\tilde{Y}^n \| Y^n).$$

We point out that this criterion is neither weaker nor stronger than the previous one. Although we do not attempt to give a comprehensive body of results based on this criterion, we will show several achievability and converse results by proof techniques that differ substantially from the ones that have appeared in the previous sections.

We give first an achievability result within the context of information stable input/outputs.

Theorem 12: Suppose that $(X, Y)$ is information stable and $I(X; Y) < \infty$, where $Y$ is the output of $W$ due to $X$. Then,

$$\frac{1}{n} I(X_1^n, \ldots, X_M^n; \tilde{Y}^n) = \frac{1}{n} I(X^n; Y^n) - \frac{1}{n}\log M + o(1), \qquad (8.1)$$

where $\{X_1^n, \ldots, X_M^n\}$ is i.i.d. with common distribution $X^n$, $\tilde{Y}^n$ is connected with $X_1^n, \ldots, X_M^n$ via the channel in Fig. 4, and the term $o(1)$ is nonnegative and vanishes as $n \to \infty$.
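The claim above that the normalized-divergence criterion is neither weaker nor stronger than the variational-distance criterion can be checked on small product distributions. The sketch below is a toy Python construction (the distributions, helper names, and parameters are hypothetical, not taken from the paper): case (a) gives an approximation whose normalized divergence vanishes while its variational distance stays bounded away from zero, and case (b) places mass outside the support of the true output, the pathology of Example 1, so the variational distance vanishes while the divergence is infinite.

```python
import numpy as np

def variational_distance(p, q):
    return float(np.abs(p - q).sum())

def divergence(p, q):
    """D(p || q) in nats; infinite if p is not absolutely continuous w.r.t. q."""
    support = p > 0
    if np.any(q[support] == 0):
        return np.inf
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

def product_pmf(marginals):
    """Joint pmf of independent coordinates, flattened to one long vector."""
    pmf = np.ones(1)
    for m in marginals:
        pmf = np.outer(pmf, m).ravel()
    return pmf

n = 10
Y = product_pmf([np.array([0.5, 0.5])] * n)      # "true" output: uniform on {0,1}^n

# (a) Perturb a single coordinate: the variational distance is 0.8 for every n,
#     while the normalized divergence (1/n) D(. || .) tends to zero with n.
Ya = product_pmf([np.array([0.9, 0.1])] + [np.array([0.5, 0.5])] * (n - 1))
print(variational_distance(Ya, Y), divergence(Ya, Y) / n)     # ~0.8, ~0.04

# (b) Remove one sequence from the support of the true output and let the
#     approximation keep probability 1/n on it: variational distance 2/n -> 0,
#     but the (normalized) divergence is infinite.
Yb = Y.copy(); Yb[0] = 0.0; Yb /= Yb.sum()       # true output misses sequence 0
Yt = np.concatenate(([1.0 / n], (1 - 1.0 / n) * Yb[1:]))      # approximation
print(variational_distance(Yt, Yb), divergence(Yt, Yb) / n)   # 0.2, inf
```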


Fig. 4. Input-output transformation in random coding. Switch is equally likely to be in each of its $M$ positions.

Proof: Note that the joint distribution of $(\tilde{X}^n, \tilde{Y}^n)$ is that of $(X^n, Y^n)$. By Kolmogorov's identity and the conditional independence of $\tilde{Y}^n$ and $X_1^n, \ldots, X_M^n$ given $\tilde{X}^n$:

$$I(X^n; Y^n) = I(\tilde{X}^n; \tilde{Y}^n) = I(\tilde{X}^n, X_1^n, \ldots, X_M^n; \tilde{Y}^n) = I(X_1^n, \ldots, X_M^n; \tilde{Y}^n) + I(\tilde{Y}^n; \tilde{X}^n \mid X_1^n, \ldots, X_M^n), \qquad (8.2)$$

where the second term on the right side is less than or equal to $\log M$. This shows that the term $o(1)$ in (8.1) is nonnegative. It remains to upper bound the left side of (8.1):

$$I(X_1^n, \ldots, X_M^n; \tilde{Y}^n) = \sum_{y^n \in B^n}\sum_{c_1 \in A^n}\cdots\sum_{c_M \in A^n} P_{X^n}(c_1)\cdots P_{X^n}(c_M)\, \frac{1}{M}\sum_{i=1}^{M} W^n(y^n \mid c_i)\, \log\frac{\frac{1}{M}\sum_{j=1}^{M} W^n(y^n \mid c_j)}{P_{Y^n}(y^n)}$$
$$= \sum_{c_1 \in A^n}\cdots\sum_{c_M \in A^n} P_{X^n}(c_1)\cdots P_{X^n}(c_M)\sum_{y^n \in B^n} W^n(y^n \mid c_1)\, \log\left(\frac{1}{M}\exp i_{X^n W^n}(c_1, y^n) + \frac{1}{M}\sum_{j=2}^{M}\exp i_{X^n W^n}(c_j, y^n)\right)$$
$$\le \sum_{c_1 \in A^n} P_{X^n}(c_1)\sum_{y^n \in B^n} W^n(y^n \mid c_1)\, \log\left(\frac{1}{M}\exp i_{X^n W^n}(c_1, y^n) + \frac{1}{M}\sum_{j=2}^{M} E\left[\exp i_{X^n W^n}(X_j^n, y^n)\right]\right)$$
$$\le E\left[\log\left(1 + \frac{1}{M}\exp i_{X^n W^n}(X^n, Y^n)\right)\right], \qquad (8.3)$$

where the first inequality follows from the concavity of the logarithm and the second is a result of

$$E[\exp i_{X^n W^n}(X_j^n, y^n)] = 1,$$

for all $y^n \in B^n$ and $j = 1, \ldots, M$. Consider first the case where $M < \exp(I(X^n; Y^n) - n\delta)$ for some $\delta > 0$. Using (8.3) and denoting $\Delta_n = i_{X^n W^n}(X^n, Y^n) - I(X^n; Y^n)$, we obtain

$$\frac{1}{n}I(X_1^n, \ldots, X_M^n; \tilde{Y}^n) - \frac{1}{n}I(X^n; Y^n) + \frac{1}{n}\log M \le \frac{1}{n}E[\log(M\exp(-I(X^n; Y^n)) + \exp\Delta_n)]$$
$$\le \frac{1}{n}E[\log(\exp(-n\delta) + \exp\Delta_n)]$$
$$\le \frac{1}{n}\log 2 + E[\max\{-\delta, \Delta_n/n\}], \qquad (8.4)$$

where the second inequality is a result of $M \le \exp(I(X^n; Y^n) - n\delta)$. The expectation on the right side of (8.4) is upper bounded by

$$E\left[\frac{\Delta_n}{n}1\left\{\frac{\Delta_n}{n} > -\delta\right\}\right] - \delta\, P\left[\frac{\Delta_n}{n} \le -\delta\right] \le -E\left[\frac{\Delta_n}{n}1\left\{\frac{\Delta_n}{n} \le -\delta\right\}\right]$$
$$\le \frac{1}{n}I(X^n; Y^n)\, P\left[\frac{\Delta_n}{n} \le -\delta\right] - \frac{1}{n}E\left[i_{X^n W^n}(X^n, Y^n)\, 1\{i_{X^n W^n}(X^n, Y^n) < 0\}\right], \qquad (8.5)$$

where we have used $E[\Delta_n] = 0$, and the first term vanishes asymptotically because $(X, Y)$ is information stable with finite mutual information rate, whereas the expectation of the negative part of an information density cannot be less than $-e^{-1}\log e$ (e.g., [14, (2.3.2)]). Thus, the second term in (8.5) also vanishes asymptotically. In the remaining case,

$$\lim_{n\to\infty}\left[\frac{1}{n}I(X^n; Y^n) - \frac{1}{n}\log M\right] = 0$$

and we can choose any arbitrarily small $\delta > 0$ while satisfying $M \ge \exp(I(X^n; Y^n) - n\delta)$. Now, normalizing by $n$ we can further upper bound (8.3) with

$$\frac{1}{n}E[\log(1 + \exp(\Delta_n + n\delta))],$$

where the inequalities follow from $\log(1 + \exp t) \le \log 2 + t\,1\{t > 0\}$ and (8.5), respectively. Now, the theorem follows since $\delta$ can be chosen arbitrarily small. □

Theorem 12 evaluates the mutual information between the channel output and a collection of random codewords such as those used in the random coding proofs of the direct part of the channel coding theorem. However, the rationale for its inclusion here is the following corollary. The special case of this corollary for i.i.d. inputs and discrete memoryless channels is [21, Theorem 6.3], proved using a different approach.

Corollary: For every $X$ such that $(X, Y)$ is information stable, and for all $\gamma > 0$, $\epsilon > 0$, there exists $\tilde{X}$ whose resolution satisfies

$$\frac{1}{n}R(\tilde{X}^n) \le I(X; Y) + \gamma$$


and

$$\frac{1}{n} D(\tilde{Y}^n \| Y^n) \le \epsilon$$

for all sufficiently large $n$.

Proof: The link between Theorem 12 and this section is the following identity

$$D(Y^n[X_1^n, \ldots, X_M^n] \,\|\, Y^n \mid X_1^n, \ldots, X_M^n) = I(X_1^n, \ldots, X_M^n; \tilde{Y}^n),$$

where $Y^n[X_1^n, \ldots, X_M^n]$ is defined in (4.1). As in the proof of Theorem 4, (8.1) implies that there exists $(c_1^n, \ldots, c_M^n)$ such that the output distribution due to a uniform distribution on $(c_1^n, \ldots, c_M^n)$ approximates the true output distribution in the sense that their unconditional divergence per symbol can be made arbitrarily small by choosing $(1/n)\log M$ to be appropriately close to $(1/n) I(X^n; Y^n)$. □

A sharper achievability result (parallel to Theorem 4) whereby the assumption of $(X, Y)$ being information stable is dropped and $I(X; Y)$ is replaced by $\bar{I}(X; Y)$ can be shown by (1) letting $M = \exp(n\bar{I}(X; Y) + n\gamma)$, (2) using $\log(1 + \exp t) \le \log 2 + t\,1\{t > 0\}$ to bound the right side of (8.3), and (3) invoking Lemma A1 (under the assumption that the input alphabet is finite).

The extension of the general converse resolvability results (Theorems 5 and 9) to the divergence-gauged approximation criterion is an open problem. On the other hand, the analogous exercise with the converse mean-resolvability results of Section VII is comparatively easy. A much more general class of channels than the BSC's of Theorem 10 is the scope of the next result.

Theorem 13: Let a finite-input channel $W$ with capacity $C$ be such that for each sufficiently large $n$, there exists $\bar{X}^n$ for which

$$I(\bar{X}^n; \bar{Y}^n) = \max_{X^n} I(X^n; Y^n) \qquad (8.6)$$

and

$$P[\bar{X}^n = x^n] > 0, \quad \text{for all } x^n \in A^n. \qquad (8.7)$$

If $\tilde{X}$ is such that

$$\lim_{n\to\infty} \frac{1}{n} D(\tilde{Y}^n \| \bar{Y}^n) = 0, \qquad (8.8)$$

then

$$\liminf_{n\to\infty} \frac{1}{n} H(\tilde{X}^n) \ge C.$$

Proof: The following result will be used.

Lemma 11 [3, p. 147]: If $I(\bar{X}^n; \bar{Y}^n) = \max_{X^n} I(X^n; Y^n)$, then,

$$I(\bar{X}^n; \bar{Y}^n) \ge D(W^n(\cdot \mid x^n) \| \bar{Y}^n),$$

with equality for all $x^n \in A^n$ such that $P_{\bar{X}^n}(x^n) > 0$.

According to Lemma 11 and the assumption in Theorem 13,

$$I(\bar{X}^n; \bar{Y}^n) = D(W^n(\cdot \mid x^n) \| \bar{Y}^n), \qquad (8.9)$$

for all $x^n \in A^n$. Thus, for every distribution $\tilde{X}^n$,

$$0 \le I(\bar{X}^n; \bar{Y}^n) - I(\tilde{X}^n; \tilde{Y}^n)$$
$$= D(W^n \| \bar{Y}^n \mid \bar{X}^n) - D(W^n \| \tilde{Y}^n \mid \tilde{X}^n)$$
$$= D(W^n \| \bar{Y}^n \mid \tilde{X}^n) - D(W^n \| \tilde{Y}^n \mid \tilde{X}^n)$$
$$= D(\tilde{Y}^n \| \bar{Y}^n),$$

where the second equation follows from (8.9). Thus,

$$\frac{1}{n} H(\tilde{X}^n) \ge \frac{1}{n} I(\tilde{X}^n; \tilde{Y}^n) \ge \frac{1}{n} I(\bar{X}^n; \bar{Y}^n) - \frac{1}{n} D(\tilde{Y}^n \| \bar{Y}^n),$$

from where the result follows because of (8.8) and the channel capacity converse

$$C \le \liminf_{n\to\infty} \frac{1}{n} I(\bar{X}^n; \bar{Y}^n). \qquad \Box$$

Remark 5: It is obvious that the channel in Example 1 does not satisfy the sufficient condition in Theorem 13, whereas the full-rank discrete memoryless channels (cf. Remark 4) always satisfy that condition.

A counterpart of Theorem 11 (mean-resolvability semi-converse) with divergence in lieu of variational distance is easy to prove using the same idea as in the proof of Theorem 13.

Theorem 14: For any finite-input channel with capacity $C$, there exists an input process $X$ such that if $\tilde{X}$ satisfies

$$\lim_{n\to\infty} \frac{1}{n} D(\tilde{Y}^n \| Y^n) = 0$$

and $P_{\tilde{X}^n} \ll P_{X^n}$, then

$$\liminf_{n\to\infty} \frac{1}{n} H(\tilde{X}^n) \ge C.$$

Proof: Let $\bar{X}^n$ maximize $I(X^n; Y^n)$. It follows from Lemma 11 and the assumed condition $P_{\tilde{X}^n} \ll P_{\bar{X}^n}$ that

$$D(W^n \| \bar{Y}^n \mid \tilde{X}^n) = D(W^n \| \bar{Y}^n \mid \bar{X}^n) = I(\bar{X}^n; \bar{Y}^n).$$

Thus,

$$H(\tilde{X}^n) \ge I(\tilde{X}^n; \tilde{Y}^n) = D(W^n \| \bar{Y}^n \mid \tilde{X}^n) - D(\tilde{Y}^n \| \bar{Y}^n) \ge I(\bar{X}^n; \bar{Y}^n) - D(\tilde{Y}^n \| \bar{Y}^n),$$

from where the result follows immediately. □

Let us now state our concluding result which falls outside the theory of resolvability but is still within the boundaries of the approximation theory of output statistics. It is a folk-theorem in information theory whose proof is intimately related to the arguments used in this section.

Fix a codebook $\{c_1^n, \ldots, c_M^n\}$. If all the codewords are equally likely, the distributions of the input and output of the channel are

$$P_{X^n}(x^n) = \frac{1}{M}, \quad \text{if } x^n = c_j^n \text{ for some } j \in \{1, \ldots, M\},$$


and

$$P_{Y^n}(y^n) = \frac{1}{M}\sum_{j=1}^{M} W^n(y^n \mid c_j^n), \qquad (8.10)$$

respectively. The issue we want to address, in the spirit of this paper, is the relationship between $Y^n$ and the output distribution corresponding to the input that maximizes the mutual information. It is widely believed that if the code is good (with rate close to capacity and low error probability), then $Y^n$ must approximate the output distribution due to the input maximizing the mutual information. (See, e.g., [2, Section 8.10] for a discussion on the plausibility of this statement.)

To focus ideas take a DMC with capacity

$$C = I(\bar{X}; \bar{Y}) = \max_{X} I(X; Y).$$

Is it true that the output due to a good code looks i.i.d. with distribution $P_{\bar{Y}}$? Sometimes this is erroneously taken for granted based on some sort of "random coding" reasoning. However, recall that the objective is to analyze the behavior of the output due to any individual good code, rather than to average the output statistics over a hypothetical random choice of codebooks.

Our formalization of the folk-theorem is very general, and only rules out channels whose capacity is not obtained through the maximization of mutual information. Naturally, the result ceases to be meaningful for those channels.

Theorem 15: For any channel $W$ with finite input alphabet and capacity $C$ that satisfies the strong converse, the following holds. Fix any $\gamma > 0$ and any sequence of $(n, M, \lambda_n)$ codes such that

$$\lim_{n\to\infty} \lambda_n = 0$$

and

$$\frac{1}{n}\log M \ge C - \frac{\gamma}{2}.$$

Then,¹⁰

$$\frac{1}{n} D(Y^n \| \bar{Y}^n) \le \gamma \qquad (8.11)$$

for all sufficiently large $n$, where $Y^n$ is the output due to the $(n, M, \lambda_n)$ code (cf. (8.10)) and $\bar{Y}^n$ is the output due to $\bar{X}^n$ that satisfies

$$I(\bar{X}^n; \bar{Y}^n) = \max_{X^n} I(X^n; Y^n).$$

¹⁰It can be shown that the output distribution due to a maximal mutual information input is unique.

Proof: For every $X^n$, we write

$$\frac{1}{n} D(Y^n \| \bar{Y}^n) \le \frac{1}{n} I(\bar{X}^n; \bar{Y}^n) - \frac{1}{n} I(X^n; Y^n), \qquad (8.12)$$

where the inequality follows from Lemma 11. If we now particularize $X^n$ to the uniform distribution on the $(n, M, \lambda_n)$ codebook of the above statement, then $(1/n) I(X^n; Y^n)$ must approach capacity because of the Fano inequality:

$$\frac{1}{n} I(X^n; Y^n) \ge (1 - \lambda_n)\frac{1}{n}\log M - \frac{1}{n}\log 2 \ge (1 - \lambda_n)\left(C - \frac{\gamma}{2}\right) - \frac{1}{n}\log 2. \qquad (8.13)$$

Now, (8.12) and (8.13) result in

$$\frac{1}{n} D(Y^n \| \bar{Y}^n) \le \frac{1}{n} I(\bar{X}^n; \bar{Y}^n) - (1 - \lambda_n)\left(C - \frac{\gamma}{2}\right) + \frac{1}{n}\log 2 \le \gamma,$$

for sufficiently large $n$, because $\lambda_n \to 0$ and the strong-converse assumption guarantees that the inequalities in (5.1) actually reduce to identities owing to Theorem 8. □

As a simple exercise, we may particularize Theorem 15 to a BSC, in which case the output $\bar{Y}^n$ due to $\bar{X}^n$ achieving capacity is given by

$$P_{\bar{Y}^n}(y^n) = 2^{-n},$$

for all $y^n \in \{0, 1\}^n$. Then, (8.11) is equivalent to

$$\log 2 - \gamma \le \frac{1}{n} H(Y^n),$$

for an arbitrarily small $\gamma > 0$ and all sufficiently large $n$. This implies that the output $Y^n$ due to the input distribution $X^n$ of a good codebook must be almost uniformly distributed on $\{0, 1\}^n$ (cf. [2, Example 2, Section 8.10]).

Can a result in the spirit of Theorem 15 be proved for the input statistics rather than the output statistics? The answer is negative, despite the widespread belief that the statistics of any good code must approximate those that maximize mutual information. To see this, simply consider the normalized entropy of $X^n$ versus that of $\bar{X}^n$:

$$\frac{1}{n} H(\bar{X}^n) - \frac{1}{n} H(X^n) = \frac{1}{n} H(\bar{X}^n \mid \bar{Y}^n) + \frac{1}{n} I(\bar{X}^n; \bar{Y}^n) - \frac{1}{n} H(X^n),$$

where the last two terms in the right-hand side are each asymptotically close to capacity. However, the term $(1/n) H(\bar{X}^n \mid \bar{Y}^n)$ does not vanish in general. For example, in the case of a BSC with crossover probability $p$, $(1/n) H(\bar{X}^n \mid \bar{Y}^n) = h(p)$.

Despite this negative result concerning the approximation of the input statistics, it is possible in many cases to bootstrap some conclusions on the behavior of input distributions with fixed dimension from Theorem 15. For example, in the case of the BSC ($p \ne 1/2$), the approximation of the first-order input statistics follows from that of the output because of the invertibility of the transition probability matrix. Thus, in a good code for the BSC, every input symbol must be equal


to 0 for roughly half of the codewords. As another example, consider the Gaussian noise channel with constrained input power. The output spectral density is the sum of the input spectrum and the noise spectrum. Thus, a good code must have an input spectrum that approximates asymptotically the water-filling solution.

The conventional intuition that regards the statistics of good codes as those that maximize mutual information constitutes the basis for an important component of the practical value of the Shannon theory. The foregoing discussion points out that that intuition can often be put on a sound footing via the approximation of output statistics, despite the danger inherent in far-reaching statements on the statistics of good codes.

IX. CONCLUSION

Aside from the setting of system simulation alluded to in the introduction, we have not dwelled on other plausible applications of the approximation theory of output statistics. Rather, our focus has been on highlighting the new information theoretic concepts and their strong relationships with source coding, channel coding and identification via channels. Other applications could be found in topics such as transmission without decoding and remote artificial synthesis of images, speech and other signals (e.g., [12]).

A novel aspect of our development has been the unveiling of sup/inf-information rate and sup-entropy rate as the right way to generalize the conventional average quantities (mutual information rate and entropy rate) when dealing with nonergodic channels and sources. We have seen that those concepts actually open the way towards new types of general formulas in source coding (Section III), channel coding [17] and approximation of output statistics (Section IV). In particular, the formula (5.5) for channel capacity [17] exhibits a nice duality with the formula for resolvability (4.26).

In parallel with well-established results on channel capacity, it is relatively straightforward to generalize the results in this paper so as to incorporate input constraints, i.e., cases where the input distributions can be chosen only within a class that satisfies a specified constraint on the expectation of a certain cost functional.

Presently, exact results on the resolvability of individual input processes can be attained only within restricted contexts, such as that of full-rank discrete memoryless channels [8]. In those cases, the resolvability of individual inputs is given by the sup-information rate; this provides one of those rare instances where an operational characterization of the mutual information rate (for information stable input/output pairs) is known. Whereas our proof of the achievability part of the resolvability theorem holds in complete generality, the main weakness of our present proof of the converse part is its strong reliance on the finiteness of the input alphabet. So far, we have not mentioned how to relax such a sufficient condition. However, it is indeed possible to remove such a restriction for a class of channels. In a forthcoming paper, the counterpart of Theorem 5 will be shown for infinite-input channels under a mild smoothness condition, which is satisfied, for example, by additive Gaussian noise channels with power constraints.

APPENDIX

In this appendix, we address a technical issue dealing with the information stability of input-output pairs.

Lemma A1: Let $G > \log|A|$; then,

$$E\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n)\, 1\left\{\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > G\right\}\right] \to 0.$$

Proof: The main idea is that an input-output pair can attain a large information density only if the input has low probability. Since for all $(x^n, y^n) \in A^n \times B^n$

$$i_{X^n W^n}(x^n, y^n) \le \log\frac{1}{P_{X^n}(x^n)}, \qquad \text{(A.1)}$$

we can upper bound

$$E\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n)\, 1\left\{\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > G\right\}\right] \le E\left[\frac{1}{n} 1\{X^n \in D_G\}\, \log\frac{1}{P_{X^n}(X^n)}\right], \qquad \text{(A.2)}$$

where

$$D_G = \{x^n \in A^n : P_{X^n}(x^n) \le \exp(-nG)\}. \qquad \text{(A.3)}$$

The right side of (A.2) can be decomposed as

$$E\left[\frac{1}{n} 1\{X^n \in D_G\}\, \log\frac{P_{X^n}(D_G)}{P_{X^n}(X^n)}\right] - \frac{1}{n} P_{X^n}(D_G)\log P_{X^n}(D_G) \le P_{X^n}(D_G)\,\frac{1}{n}\log|D_G| - \frac{1}{n} P_{X^n}(D_G)\log P_{X^n}(D_G), \qquad \text{(A.4)}$$

because entropy is maximized by the uniform distribution. Now, bounding $|D_G| \le |A|^n$ and

$$P_{X^n}(D_G) \le |D_G|\exp(-nG) \le \exp(-n\delta),$$

where $\delta = G - \log|A| > 0$, the result follows. One consequence of Lemma A1 is Lemma 1.

Proof of Lemma 1: First, we lower bound mutual information for any $\gamma > 0$ as

$$\frac{1}{n} I(X^n; Y^n) \ge E\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n)\, 1\left\{\frac{1}{n} i_{X^n W^n}(X^n, Y^n) < 0\right\}\right] + \left(\underline{I}(X; Y) - \gamma\right) P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > \underline{I}(X; Y) - \gamma\right]. \qquad \text{(A.5)}$$

It is well known [14] that the first term in the right side of (A.5) vanishes and the probability in (A.5) goes to 1 by definition of $\underline{I}(X; Y)$. Thus, $I(X^n; Y^n)/n \ge \underline{I}(X; Y) - 2\gamma$ for all sufficiently large $n$.


Conversely, we can upper bound mutual information as

$$\frac{1}{n} I(X^n; Y^n) \le E\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n)\, 1\left\{\frac{1}{n} i_{X^n W^n}(X^n, Y^n) \ge G\right\}\right] + G\, P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > \bar{I}(X; Y) + \gamma\right] + \left(\bar{I}(X; Y) + \gamma\right) P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) \le \bar{I}(X; Y) + \gamma\right]. \qquad \text{(A.6)}$$

If $G$ is chosen to satisfy the condition in Lemma A1 then (A.6) results in

$$\frac{1}{n} I(X^n; Y^n) \le \bar{I}(X; Y) + 2\gamma,$$

for all sufficiently large $n$. □

ACKNOWLEDGMENT

Discussions with V. Anantharam, A. Barron, M. Burnashev, A. Orlitsky, S. Shamai, A. Wyner, and J. Ziv are acknowledged. References [9], [13], [21] were brought to the authors' attention by I. Csiszár, D. Neuhoff, and A. Wyner, respectively. Fig. 3 was generated by R. Cheng.

REFERENCES

[1] R. Ahlswede and G. Dueck, "Identification via channels," IEEE Trans. Inform. Theory, vol. 35, pp. 15-29, Jan. 1989.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[3] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1981.
[4] A. Feinstein, "A new basic theorem of information theory," IRE Trans. PGIT, vol. 4, pp. 2-22, 1954.
[5] ---, "On the coding theorem and its converse for finite-memory channels," Inform. Contr., vol. 2, pp. 25-44, 1959.
[6] R. M. Gray and L. D. Davisson, "The ergodic decomposition of stationary discrete random processes," IEEE Trans. Inform. Theory, vol. IT-20, pp. 625-636, Sept. 1974.
[7] T. S. Han and S. Verdú, "New results in the theory and applications of identification via channels," IEEE Trans. Inform. Theory, vol. 38, pp. 14-25, Jan. 1992.
[8] ---, "Spectrum invariancy under output approximation for full-rank discrete memoryless channels," Probl. Peredach. Inform. (in Russian), no. 2, 1993.
[9] G. D. Hu, "On Shannon theorem and its converse for sequences of communication schemes in the case of abstract random variables," Trans. Third Prague Conf. Inform. Theory, Statistical Decision Functions, Random Processes, Czechoslovak Academy of Sciences, Prague, 1964, pp. 285-333.
[10] J. C. Kieffer, "Finite-state adaptive block-to-variable length noiseless coding of a nonstationary information source," IEEE Trans. Inform. Theory, vol. 35, pp. 1259-1263, 1989.
[11] D. E. Knuth and A. C. Yao, "The complexity of random number generation," in Proceedings of Symposium on New Directions and Recent Results in Algorithms and Complexity. New York: Academic Press, 1976.
[12] R. W. Lucky, Silicon Dreams: Information, Man and Machine. New York: St. Martin's Press, 1989.
[13] D. L. Neuhoff and P. C. Shields, "Channel entropy and primitive approximation," Ann. Probab., vol. 10, pp. 188-198, Feb. 1982.
[14] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. San Francisco: Holden-Day, 1964.
[15] J. M. Stoyanov, Counterexamples in Probability. New York: Wiley, 1987.
[16] S. Verdú, "Multiple-access channels with memory with and without frame-synchronism," IEEE Trans. Inform. Theory, vol. 35, pp. 605-619, May 1989.
[17] S. Verdú and T. S. Han, "A new converse leading to a general channel capacity formula," to be presented at the IEEE Inform. Theory Workshop, Susono-shi, Japan, June 1993.
[18] J. Wolfowitz, "A note on the strong converse of the coding theorem for the general discrete finite-memory channel," Inform. Contr., vol. 3, pp. 89-93, 1960.
[19] ---, "On channels without capacity," Inform. Contr., vol. 6, pp. 49-54, 1963.
[20] ---, "Notes on a general strong converse," Inform. Contr., vol. 12, pp. 1-4, 1968.
[21] A. D. Wyner, "The common information of two dependent random variables," IEEE Trans. Inform. Theory, vol. IT-21, pp. 163-179, Mar. 1975.

