Published by , 2017-05-18 00:40:03

Poisson Approximation and the Chen-Stein Method

Statistical Science
1990, Vol. 5, No. 4, 403-434


Richard Arratia, Larry Goldstein and Louis Gordon

Abstract. The Chen-Stein method of Poisson approximation is a powerful tool for computing an error bound when approximating probabilities using the Poisson distribution. In many cases, this bound may be given in terms of first and second moments alone. We present a background of the method and state some fundamental Poisson approximation theorems. The body of this paper is an illustration, through varied examples, of the wide applicability and utility of the Chen-Stein method. These examples include birthday coincidences, head runs in coin tosses, random graphs, maxima of normal variates and random permutations and mappings. We conclude with an application to molecular biology. The variety of examples presented here does not exhaust the range of possible applications of the Chen-Stein method.

Key words and phrases: Poisson approximation, invariance principle, Stein's method.

Richard Arratia, Larry Goldstein and Louis Gordon are members of the Mathematics Department at the University of Southern California.

1. INTRODUCTION

The central limit theorem has enjoyed a long and much deserved celebrated history. Overshadowed but perhaps of no less importance are theorems involving rare events and Poisson distributions. In generalizing the central limit theorem, one examines the consequences of relaxing the assumption that the summands are independent and identically distributed. In the same spirit, one may follow this path in the simplest possible Poisson limit theorem.

THEOREM 0. Let $X_{1,n}, \ldots, X_{n,n}$ be independent indicator random variables with

$$P(X_{i,n} = 1) = p_{i,n}.$$

Let $W_n = \sum_{i=1}^{n} X_{i,n}$ and $\lambda_n = EW_n = \sum_{i=1}^{n} p_{i,n}$. If $n \to \infty$, $\lambda_n \to \lambda > 0$ and $\max_{i \le n} p_{i,n} \to 0$, then $W_n$ converges in distribution to $Z$, a Poisson random variable with mean $\lambda$.

In what follows, we will write $Z \sim \mathcal{P}(\lambda)$ to mean that $Z$ has a Poisson distribution with mean $\lambda$, that is, $P(Z = k) = e^{-\lambda}\lambda^k/k!$ for $k = 0, 1, \ldots$.

Focusing on occurrences of events, that is, on indicator random variables, the generalization of the above limit theorem to the case of other distributions is ruled out. However, generalization in the direction toward dependence is quite fruitful, as many important and interesting questions may be phrased in terms of sums of possibly dependent indicator random variables. In fact, our goal in this paper is to illustrate the broad range of problems that may be successfully attacked by a powerful Poisson approximation method due to Stein (1972) and Chen (1975). In Section 2, we present a review of this technique.

In Section 3, we present three Poisson approximation theorems based on the Chen-Stein method. In Section 4, these theorems are applied to a wide collection of examples that reduce to questions about sums of possibly dependent indicator random variables. The intuition that such a sum has a Poisson limit, and that the finite sum may therefore be approximated by a Poisson random variable, is essentially the same here as it is for the simplest theorem above. There are a large number of events, each of which has small probability of occurring. If the dependence between events is somehow confined, then the sum $W$ should behave as in the case of no dependence. In addition, not only is $W$ close to Poisson, but the entire process of indicators is close to a Poisson process. In practice, however, using a Poisson approximation to compute probabilities involving the indicators is not sufficient. One also needs to know what error is made in using the approximation. That the Chen-Stein method supplies an upper bound on this error is its main utility.

When the dependence structure is local, finding the Chen-Stein bounds involves the same effort as computing first and second moments of the total

This content downloaded on Sat, 9 Feb 2013 01:24:07 AM
All use subject to JSTOR Terms and Conditions

number of occurrences. In some of the examples below, we show that the rate achieved by the Chen-Stein method is sharp for the distance between the dependent indicator process and the approximating Poisson process.

The six subsections of Section 4 may be read independently of each other. Each is an example of using the Chen-Stein method to establish a Poisson approximation. In Section 4.1, we determine bounds on probabilities for the general birthday coincidence problem. In Section 4.2, we study the distribution of the length of the longest run of heads in a sequence of independent coin tosses. In Section 4.3, we consider the distribution of the number of cycles in a random graph. Next, in Section 4.4, we discuss the problem of approximating the distribution of the maxima of sequences of normal variates. Section 4.5 brings the Chen-Stein method to bear on the problem of permutations with restricted positions; the last example, Section 4.6, considers cycles in random permutations and mappings.

Our interest in Poisson approximation arose from problems in molecular biology and the statistical analysis of DNA. An example of the Poisson approximation method applied to this area is the subject of Section 5.

2. THE CHEN-STEIN METHOD

In 1972, Charles Stein published "A bound on the error in the normal approximation to a distribution of a sum of dependent random variables." The goal of this work was to show convergence in distribution to the normal and produce an associated Berry-Esseen type theorem for sums of dependent random variables. The technique used was novel.

Stein's technique was free of Fourier methods and relied instead on the elementary differential equation

$$f'(x) - x f(x) = h(x) - \mathcal{N}h. \tag{1}$$

In equation (1) above, $h$ is a function that is used to test convergence in distribution and $\mathcal{N}h = E[h(Z)]$, where $Z$ is standard normal. The connection between this equation and the normal distribution is the following characterization. For $W$ an arbitrary random variable and

$$(Lf)(x) = f'(x) - x f(x),$$

$E(Lf)(W) = 0$ for all differentiable functions $f$ such that $E|Zf(Z)| < \infty$, if and only if $W$ itself has a standard normal distribution. It now seems plausible that if $E(Lf)(W)$ is small for many functions $f$, then the distribution of $W$ is close to that of $Z$. If $W$ happens to be a normalized sum of an appropriate collection of random variables, then an argument involving a Taylor expansion about the sum $W$ with a given term left out shows that the above expectation is indeed small if $f$ is sufficiently smooth. The argument may be completed by demonstrating that smoothness properties assumed on $h$ translate into the required smoothness properties on $f$ through the differential equation (1). Stein's method has been applied with much success in the area of normal approximation (see, for example, Erickson, 1974; Chen, 1978; Chen and Ho, 1978; Bolthausen, 1984; Barbour and Hall, 1984b; Barbour and Eagleson, 1985; Stein, 1986; Baldi and Rinott, 1989; Baldi, Rinott and Stein, 1989; and Barbour, 1990).

There are other techniques that prove the central limit theorem without involving Fourier methods (for example, Breiman's 1968 treatment of the proof of Lindeberg, or Rosenblatt's 1974 treatment of a proof of Petrovsky and Kolmogorov). Stein's technique, however, is unique in that one may determine the bound on the error made in the approximation, a property of paramount importance in the examples to follow in Section 4.

Equation (1) above appears in other connections involving the normal distribution. Defining $h_0(x) = 1$, and $h_{n+1} = Lh_n$ for $n = 0, 1, \ldots$, one generates the Hermite polynomials, that complete orthogonal system of polynomials on $\mathbb{R}$ with measure $e^{-x^2/2}\,dx$. One may use a multidimensional version of equation (1) to recover and generalize Stein's (1956) remarkable result on the inadmissibility of the normal mean in three or more dimensions (Hudson, 1978), or to study other questions arising in the estimation of the mean of a multivariate normal (Stein, 1981). Lastly, we mention that $Lf'$ is the generator of the Ornstein-Uhlenbeck process, which has a normal stationary distribution.

In 1975, Chen applied Stein's ideas in the Poisson setting. Corresponding to the differential equation in the normal case above, one has an analogous difference equation in the Poisson case. With $Z$ now $\sim \mathcal{P}(\lambda)$, if we define

$$(Lf)(x) = \lambda f(x + 1) - x f(x), \tag{2}$$

then $E(Lf)(W) = 0$ for all $f$ such that $E|Zf(Z)| < \infty$, if and only if $W \sim \mathcal{P}(\lambda)$. For $W$ a sum of many Bernoulli random variables, each with small expectation, an argument involving leaving a given term out of the sum demonstrates that $E[(Lf)(W)]$ is small and so $W$ is approximately Poisson. Again, one requires that properties of the "test function" $h$ translate into the desired properties of $f$ through the difference equation

$$\lambda f(x + 1) - x f(x) = h(x) - \mathcal{P}_\lambda h; \tag{3}$$

here, $\mathcal{P}_\lambda h = E[h(Z)]$, where $Z \sim \mathcal{P}(\lambda)$. It is in this way that bounds may be obtained on the

distance between the distribution of such a sum and the Poisson. Chen's work has resulted in advances in the theory of Poisson approximation and has helped to develop and improve upon a body of interesting applications and examples. (For theoretical developments, see Barbour and Eagleson, 1983, 1984; Barbour and Hall, 1984a; Barbour, 1987; Arratia, Goldstein and Gordon, 1989; Barbour, Holst and Janson, 1988b. For applications and examples, see Barbour, 1982; Bollobás, 1985; Holst, 1986; Janson, 1986; Stein, 1986; Barbour, Holst and Janson, 1988; Heckman, 1988; Barbour and Holst, 1989; and Holst and Janson, 1990.)

3. POISSON APPROXIMATION THEOREMS

In this section, we will state three Poisson approximation theorems, each giving bounds in terms of the total variation distance between two distributions. Here is the definition of total variation distance. Write $\mathcal{L}(Y)$ for the law or distribution of $Y$. For a real valued function $h$ defined on the support of $Y_0$ and $Y_1$, let

$$\|h\| = \sup_k |h(k)|.$$

Define the total variation distance between $Y_0$ and $Y_1$, a real number between 0 and 2, by

$$\|\mathcal{L}(Y_0) - \mathcal{L}(Y_1)\| = \sup_{\|h\| = 1} |E[h(Y_0)] - E[h(Y_1)]|.$$

Equivalently, one may write

$$\|\mathcal{L}(Y_0) - \mathcal{L}(Y_1)\| = 2 \sup_A |P(Y_0 \in A) - P(Y_1 \in A)| = 2 \min P(Y_0 \ne Y_1).$$

In the last equality, the minimum is taken over all realizations of $Y_0$ and $Y_1$ on the same probability space.

The total variation distance has the following statistical interpretation. Consider the following two hypotheses on the distribution of the random variable $Y$:

$$H_0\colon \mathcal{L}(Y) = \mathcal{L}(Y_0) \quad \text{versus} \quad H_1\colon \mathcal{L}(Y) = \mathcal{L}(Y_1).$$

If we adopt the test with critical region $C$, rejecting the null hypothesis when $Y \in C$ and accepting otherwise, then for any $C$ that satisfies the natural condition $P(Y_1 \in C) \ge P(Y_0 \in C)$, the sum of the type I and type II error probabilities $\alpha_C + \beta_C$ is

$$P(Y_0 \in C) + P(Y_1 \notin C) = 1 - |P(Y_1 \in C) - P(Y_0 \in C)|.$$

Hence,

$$\inf_C (\alpha_C + \beta_C) = 1 - \tfrac{1}{2}\|\mathcal{L}(Y_1) - \mathcal{L}(Y_0)\|.$$

All examples and theorems that follow will be set in the following framework. There is a finite or countable index set $I$. For each $\alpha \in I$, let $X_\alpha$ be a Bernoulli random variable with $p_\alpha = P(X_\alpha = 1) > 0$. Let

$$W = \sum_{\alpha \in I} X_\alpha \quad \text{and} \quad \lambda = EW.$$

We assume $\lambda \in (0, \infty)$. $Z$ will denote a Poisson random variable with the same mean as $W$. For each $\alpha \in I$, suppose we have chosen $B_\alpha \subset I$ with $\alpha \in B_\alpha$. We think of the set $B_\alpha$ as a neighborhood of $\alpha$ consisting of the set of indices $\beta$ such that $X_\alpha$ and $X_\beta$ are dependent.

Define

$$b_1 = \sum_{\alpha \in I} \sum_{\beta \in B_\alpha} p_\alpha p_\beta, \tag{4}$$

$$b_2 = \sum_{\alpha \in I} \sum_{\alpha \ne \beta \in B_\alpha} p_{\alpha\beta}, \quad \text{where } p_{\alpha\beta} = E[X_\alpha X_\beta], \tag{5}$$

and

$$b_3 = \sum_{\alpha \in I} E\bigl|E\{X_\alpha - p_\alpha \mid \sigma(X_\beta\colon \beta \notin B_\alpha)\}\bigr|. \tag{6}$$

Loosely, $b_1$ measures the neighborhood size, $b_2$ measures the expected number of neighbors of a given occurrence and $b_3$ measures the dependence between an event and the number of occurrences outside its neighborhood.

Computing $b_1$ and $b_2$ usually involves the same work as computing the first and second moments of $W$. In applications where $X_\alpha$ is independent of the collection $\{X_\beta\colon \beta \notin B_\alpha\}$, the term $b_3 = 0$. When $b_3 = 0$, $b_2 - b_1 = E(W^2) - E(Z^2)$. Thus when $b_3 = 0$ and $b_1$ is small, the upper bounds on total variation distance given in the theorems below are comparable to the discrepancy between the second moment of $W$ and that of the Poisson.

Together with error bounds, our results are that when $b_1$, $b_2$, and $b_3$ are all small, then

1. Theorem 1. The total number $W$ of events is approximately Poisson.

2. Theorem 2. The locations of the dependent events approximately form a Poisson process.

3. Theorem 3. The dependent events are almost indistinguishable from a collection of independent events having the same marginal probabilities.
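As a small numerical illustration of the total variation distance and of the quantity $b_1$ defined in this section, the following Python sketch compares a Binomial$(n, p)$ count with a Poisson variable of the same mean. It assumes independent, identically distributed indicators with neighborhoods $B_\alpha = \{\alpha\}$, so that $b_1 = np^2$ and $b_2 = b_3 = 0$; the function names and the values $n = 100$, $p = 0.02$ are illustrative choices, not taken from the paper.

```python
import math

def total_variation(pmf0, pmf1):
    """Total variation distance in the convention used here:
    sum_k |P(Y0 = k) - P(Y1 = k)|, a number between 0 and 2."""
    support = set(pmf0) | set(pmf1)
    return sum(abs(pmf0.get(k, 0.0) - pmf1.get(k, 0.0)) for k in support)

def binomial_pmf(n, p):
    """Law of a sum of n independent Bernoulli(p) indicators."""
    return {k: math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

def poisson_pmf(lam, kmax):
    """Poisson(lam) probabilities for k = 0..kmax (the tail is left out)."""
    return {k: math.exp(-lam) * lam**k / math.factorial(k) for k in range(kmax + 1)}

def tv_binomial_poisson(n, p):
    """Distance between Binomial(n, p) and Poisson(n p).
    The binomial puts no mass above n, so the Poisson tail beyond n
    contributes its full mass to the distance."""
    binom = binomial_pmf(n, p)
    pois = poisson_pmf(n * p, n)
    tail = max(0.0, 1.0 - sum(pois.values()))
    return total_variation(binom, pois) + tail

if __name__ == "__main__":
    n, p = 100, 0.02            # illustrative values only
    b1 = n * p**2               # b_1 when B_alpha = {alpha}
    print("total variation distance:", tv_binomial_poisson(n, p))
    print("b1:", b1)
```

In this independent case the computed distance is small when $b_1$ is small, which is the qualitative behaviour the three theorems quantify for dependent indicators.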

The following theorems are proved in Arratia, Goldstein and Gordon (1989).

THEOREM 1. Let $W = \sum_{\alpha \in I} X_\alpha$ be the number of occurrences of dependent events, and let $Z$ be a Poisson random variable with $EZ = EW = \lambda < \infty$. Then

$$\|\mathcal{L}(W) - \mathcal{L}(Z)\| \le 2\left[(b_1 + b_2)\frac{1 - e^{-\lambda}}{\lambda} + b_3\bigl(1 \wedge 1.4\lambda^{-1/2}\bigr)\right] \le 2(b_1 + b_2 + b_3),$$

and

$$|P(W = 0) - e^{-\lambda}| \le (b_1 + b_2 + b_3)\frac{1 - e^{-\lambda}}{\lambda} < (1 \wedge \lambda^{-1})(b_1 + b_2 + b_3).$$

The next theorem is a process version of the above theorem.

THEOREM 2. For $\alpha \in I$, let $Y_\alpha$ be a random variable whose distribution is Poisson with mean $p_\alpha$, with the $Y_\alpha$ mutually independent. The total variation distance between the dependent Bernoulli process $X \equiv (X_\alpha)_{\alpha \in I}$, and the Poisson process $Y$ on $I$ with intensity $p(\cdot)$, $Y \equiv (Y_\alpha)_{\alpha \in I}$, satisfies

$$\|\mathcal{L}(X) - \mathcal{L}(Y)\| \le 2(2b_1 + 2b_2 + b_3).$$

Theorem 3 compares the dependent Bernoulli process $X$ with an independent Bernoulli process $X'$. Since $\sum_\alpha p_\alpha^2 \le b_1$, Theorem 3 implies that if the Chen-Stein method succeeds with $b_1$, $b_2$ and $b_3$ all small, then in the sense of total variation distance the dependent $X$ process is close to being independent.

THEOREM 3. For $\alpha \in I$, let $X'_\alpha$ have the same distribution as $X_\alpha$, with the $X'_\alpha$ mutually independent. The total variation distance between the dependent Bernoulli process $X \equiv (X_\alpha)_{\alpha \in I}$, and the independent Bernoulli process $X' \equiv (X'_\alpha)_{\alpha \in I}$ having the same marginals, satisfies

$$\|\mathcal{L}(X) - \mathcal{L}(X')\| \le 2(2b_1 + 2b_2 + b_3) + 2\sum_\alpha p_\alpha^2.$$

Direct elementary computation shows that if $X$ is Bernoulli and $Y$ is Poisson, with $EX = EY = p \in [0, 1]$, then the total variation distance is

$$\|\mathcal{L}(X) - \mathcal{L}(Y)\| = |1 - p - e^{-p}| + |p - pe^{-p}| + |0 - P(Y > 1)| = 2p(1 - e^{-p}) \le 2p^2.$$

Thus $X$ and $Y$ can be coupled, i.e., constructed on a single probability space, so that $P(X \ne Y) = \tfrac{1}{2}\|\mathcal{L}(X) - \mathcal{L}(Y)\| \le p^2$. For the Poisson process $Y$ of Theorem 2, and the independent events process $X'$ of Theorem 3 above, coupling each coordinate shows that

$$\|\mathcal{L}(Y) - \mathcal{L}(X')\| \le 2P(Y \ne X') \le 2\sum_\alpha p_\alpha^2.$$

Thus, Theorem 3 above is an elementary corollary of Theorem 2, using the triangle inequality

$$\|\mathcal{L}(X) - \mathcal{L}(X')\| \le \|\mathcal{L}(X) - \mathcal{L}(Y)\| + \|\mathcal{L}(Y) - \mathcal{L}(X')\|.$$

Since $\sum_\alpha p_\alpha^2$ is small in typical applications, Theorem 2 is "almost" equivalent to Theorem 3. More precisely, the weakening of Theorem 2, in which the bound is increased by $4\sum_\alpha p_\alpha^2$, is an elementary corollary of Theorem 3, using the triangle inequality.

3.1 Compound Poisson Process Limits

The Chen-Stein method is useful for situations in which occurrences happen in clumps and the distribution of number of clumps is approximately Poisson. In many situations, the distribution of the number of occurrences is approximately a compound Poisson distribution and the dependent process itself is close to a mosaic process in which locations are put down according to a spatial Poisson process and then at each location a type is assigned in some independent and identically distributed way. (See Aldous, 1989, or Hall, 1988.)

This situation can be handled by the Chen-Stein method. All that needs to be done is to enlarge the index set so that it keeps track of the types as well as the locations of the clumps. In these situations, Theorems 2 and 3 are a tool for showing that a dependent process is close to a mosaic process.

Here is a general overview; we will show how these considerations apply to the example of long head runs in Section 4.2. Start with a successful setup for the Chen-Stein method: an index set $I$, events $X_\alpha$ for $\alpha \in I$ and neighborhoods $B(\alpha)$ for $\alpha \in I$ such that $b_1$, $b_2$ and $b_3$ can be shown to be small. Suppose that each event $\{X_\alpha = 1\}$ can also be associated with a "type" chosen from some countable set $T$. Our new, enlarged index set will be $I^* \equiv I \times T$, and for $(\alpha, i) \in I^*$,

$$X_{\alpha,i} \equiv X_\alpha\, 1(\text{the occurrence at } \alpha \text{ is of type } i),$$

so that for each $\alpha \in I$, there is a partition:

$$X_\alpha = \sum_{i \in T} X_{\alpha,i}. \tag{7}$$
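The partition (7) can be made concrete with a toy computation. The Python sketch below is only an illustration under stated assumptions: the index set, the probabilities `p`, the type probabilities `q`, and the nearest-neighbour neighborhoods `B` are all made up, and types are assumed to be assigned independently of the occurrences, so that $EX_{\alpha,i} = p_\alpha q_i$. It checks that the type-level expectations sum back to $p_\alpha$, and that the sum in (4) takes the same value on $I \times T$ when each enlarged neighborhood is taken to be $B(\alpha) \times T$.

```python
# Hypothetical toy setup: 5 locations, 3 types, nearest-neighbour neighborhoods.
p = [0.02, 0.05, 0.01, 0.03, 0.04]   # p_alpha = E X_alpha (made-up values)
q = [0.5, 0.3, 0.2]                  # assumed type probabilities, summing to 1
I = range(len(p))
T = range(len(q))

def B(alpha):
    """Old neighborhood B(alpha): indices within distance 1 of alpha."""
    return [beta for beta in I if abs(alpha - beta) <= 1]

# E X_{alpha,i} under the assumption that the type is independent of X_alpha.
p_star = {(a, i): p[a] * q[i] for a in I for i in T}

def b1():
    """b_1 of equation (4) on the original index set I."""
    return sum(p[a] * p[b] for a in I for b in B(a))

def b1_star():
    """The same sum on I x T with neighborhoods B(alpha) x T."""
    return sum(p_star[(a, i)] * p_star[(b, j)]
               for a in I for i in T for b in B(a) for j in T)

if __name__ == "__main__":
    print("b1      =", b1())
    print("b1_star =", b1_star())
```

The two printed values agree up to rounding because summing $EX_{\alpha,i}$ over $i \in T$ recovers $EX_\alpha$, which is exactly what the partition (7) provides.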

The new neighborhoods $B(\alpha, i)$ will be based on the old neighborhoods:

$$B(\alpha, i) \equiv B(\alpha) \times T = \{(\beta, j) \in I^*\colon \beta \in B(\alpha),\ j \in T\}. \tag{8}$$

The new value $b_1^*$ is equal to the old value $b_1$:

$$b_1^* = \sum_{\alpha \in I} \sum_{i,j \in T} (EX_{\alpha,i})(EX_{\alpha,j}) + \sum_{\alpha \ne \beta \in B(\alpha)} \sum_{i,j \in T} (EX_{\alpha,i})(EX_{\beta,j})$$

$$= \sum_{\alpha \in I} \Bigl(\sum_{i \in T} EX_{\alpha,i}\Bigr)^2 + \sum_{\alpha \ne \beta \in B(\alpha)} \Bigl(\sum_{i \in T} EX_{\alpha,i}\Bigr)\Bigl(\sum_{j \in T} EX_{\beta,j}\Bigr)$$

$$= \sum_{\alpha \in I} (EX_\alpha)^2 + \sum_{\alpha \ne \beta \in B(\alpha)} (EX_\alpha)(EX_\beta) = b_1.$$

Similarly, thanks to the partition structure (7) and the neighborhood structure (8), the value of $b_2$ is unchanged:

$$b_2^* = b_2.$$

In general $b_3^* \ge b_3$, but in many examples the neighborhoods $B_\alpha$ capture all of the dependence and it is easily verified that $b_3^* = b_3 = 0$.

Because of the partition structure (7), and because the Poisson process $Y$ may be similarly constructed from the Poisson process $Y^*$ by setting $Y_\alpha = \sum_i Y_{\alpha,i}$ for each $\alpha$, the total variation distance for the processes involved in Theorem 2 cannot decrease:

$$\|\mathcal{L}(X) - \mathcal{L}(Y)\| \le \|\mathcal{L}(X^*) - \mathcal{L}(Y^*)\|.$$

Here, $X^* \equiv (X_{\alpha,i})_{\alpha \in I, i \in T}$ is the dependent events process, with values in $\{0, 1\}^{I \times T}$, and $Y^* \equiv (Y_{\alpha,i})_{\alpha \in I, i \in T}$ is the Poisson process, with values in $\{0, 1, 2, \ldots\}^{I \times T}$, having independent components and the same intensity as $X^*$.

4. APPLICATIONS

We demonstrate the utility of Poisson approximation by applying the above results to six examples, all of which reduce to questions about the number of occurrences of possibly dependent events.

4.1 The Birthday Problem

We first learned about Chen (1975a) from a lecture on the birthday problem and its variants by Persi Diaconis, who also suggested references on the birthday problem: Diaconis and Mosteller (1989), Janson (1986), Holst (1986) and Stein (1987), which gives proofs of more general results using similar techniques.

In the usual formulation of the birthday problem, we assume that birthdays of $n$ individuals are independent over the $d$ days in a year and compute the probability that at least two share the same birthday, that is, that there is at least one two-way coincidence. In the special case where birthdays are uniform, there is a simple exact formula. Letting $W$ denote the number of birthday coincidences, that is, the number of pairs of people that share a birthday, we have the probability of no coincidence given by

$$P(W = 0) = \prod_{i=1}^{n-1} \Bigl(1 - \frac{i}{d}\Bigr).$$

If one were now interested in computing the probability of, say, exactly $m$ two-way coincidences, or the probability of at least three people sharing the same birthday, or the probability that there are two people born within a week of each other, or probabilities under a nonuniform birthday distribution, then the counting arguments to arrive at exact formulas become much less tractable. However, extremely good approximate answers are quite easy to obtain using the Poisson approximation and one may use Theorem 1 to give an upper bound on the error.

Let us begin by considering the general birthday problem of a $k$-way coincidence when birthdays are uniform. Let $\{1, 2, \ldots, n\}$ denote a group of $n$ people, and let the index set $I = \{\alpha \subset \{1, 2, \ldots, n\}\colon |\alpha| = k\}$. For example, in the classical case $k = 2$ and $I$ is the set of all pairs of people among whom a two-way coincidence could occur. Let $X_\alpha$ be the indicator of the event that the people indexed by $\alpha$ share the same birthday. The total number of coincidences is now given as the sum of dependent indicator random variables, $W = \sum_\alpha X_\alpha$.

Because $W$ is the sum of many Bernoulli random variables, each with small success probability $p_\alpha = d^{1-k}$, it seems reasonable to approximate $W$ as a Poisson random variable $Z$ with mean $\lambda = EW$. Easily then $\lambda = \binom{n}{k} d^{1-k}$ and the probability of no birthday coincidence is approximately

$$P(Z = 0) = e^{-\lambda} = \exp\Bigl\{-\binom{n}{k} d^{1-k}\Bigr\}.$$

For the classical case of a birthday coincidence in a year of $d = 365$ days, it is widely known that $n = 23$ is the least number of people required to make such a coincidence more likely than not; amusingly, $\lambda = \binom{23}{2}/365$ is equal to $\ln(2)$ to 4 digits.

The probability of coincidence is approximated conservatively by the Poisson distribution in this case;

$$P(W = 0) \approx 0.492 < 0.499998 \approx \exp(-\lambda) = P(Z = 0).$$

The approximation is always conservative when birthdays are uniform;

$$P(W = 0) = \prod_{i=1}^{n-1} \Bigl(1 - \frac{i}{d}\Bigr) < \exp\Bigl\{-\sum_{i=1}^{n-1} \frac{i}{d}\Bigr\} = e^{-\lambda} = P(Z = 0).$$

In addition, the probability of coincidence is minimized when birthdays are uniform (see, for example, Olkin and Marshall, 1979), making the Poisson approximation, assuming uniformity, conservative no matter what the true underlying distribution may be.

Poisson approximation using $\lambda$ computed from the true distribution is not necessarily conservative when birthdays are nonuniform. One class of examples may be constructed by considering a distribution where one day has probability $\epsilon$ and all other days divide the remaining probability uniformly, each with mass $(1 - \epsilon)/(d - 1)$. In particular, for $d = 7$ days, $n = 5$ individuals and $\epsilon = 2/3$, we have

$$\lambda = \binom{5}{2}\Bigl\{\Bigl(\frac{2}{3}\Bigr)^2 + 6\Bigl(\frac{1}{18}\Bigr)^2\Bigr\} = 4.63$$

and

$$P(W = 0) = 0.0118 > 0.0098 = \exp\{-EW\} = P(Z = 0).$$

We may bound the error in making the Poisson approximation with the help of Theorem 1 in Section 3. Recall that $B_\alpha$ is a "neighborhood of dependence" for the random variable $X_\alpha$. Note that, if $\alpha \cap \beta = \emptyset$, then $X_\alpha$ and $X_\beta$ are independent. This suggests that we should take the set

$$B_\alpha = \{\beta \in I\colon \alpha \cap \beta \ne \emptyset\}$$

as our set of dependence. With this choice

$$E\bigl|E\{X_\alpha - p_\alpha \mid \sigma(X_\beta\colon \beta \notin B_\alpha)\}\bigr| = 0$$

by independence; hence $b_3 = 0$.

Since all $p_\alpha$ are identical, we calculate

$$b_1 = |I|\,|B_\alpha|\,p_\alpha^2 = \binom{n}{k}\Bigl\{\binom{n}{k} - \binom{n-k}{k}\Bigr\} d^{2-2k}.$$

Specializing now to the case $k = 2$, we may use that $X_\alpha$ and $X_\beta$ are pairwise independent, and that therefore $p_{\alpha\beta} = p_\alpha p_\beta$. Hence,

$$b_2 = |I|(|B_\alpha| - 1)p_\alpha^2 = b_1(|B_\alpha| - 1)/|B_\alpha|.$$

Putting the above together, one finds the following bound for the error in approximating $P(W = 0)$ by $e^{-\lambda}$ in the case $k = 2$:

$$|P(W = 0) - e^{-\lambda}| \le (b_1 + b_2)\frac{1 - e^{-\lambda}}{\lambda} = (1 - e^{-\lambda})\frac{4n - 7}{d}.$$

Although it is more difficult to exactly calculate the probability of a triple birthday coincidence, one may apply Poisson approximation with about the same ease as for the classical case. Suppose that we wish to compute the probability that in a group of 50, three or more share a birthday. We have then that $\lambda = \binom{n}{3}/d^2$ and the approximation $P(W = 0) \approx e^{-\lambda}$; hence, in a group of 50, the probability that there is at least one triple coincidence is about $1 - e^{-\lambda} = 1 - 0.863 = 0.137$.

To determine a bound on the error, one may calculate

$$b_1 = |I|\,|B_\alpha|\,p_\alpha^2 = \binom{n}{3}\Bigl\{\binom{n}{3} - \binom{n-3}{3}\Bigr\} d^{-4}$$

and, for a given $\alpha$, breaking up $B_\alpha - \{\alpha\}$ into those $\beta$ such that $|\beta \cap \alpha| = 1$ and those for which $|\beta \cap \alpha| = 2$, we see

$$b_2 = |I|\Bigl\{3\binom{n-3}{2} d^{-4} + 3(n - 3)d^{-3}\Bigr\}.$$

This shows the approximation above has an error of no more than

$$(b_1 + b_2)(1 - e^{-\lambda})/\lambda = 0.0597,$$

so that

$$0.803 \le P(W = 0) \le 0.923.$$

Without too much difficulty, one can write down the exact formula for the probability of no triple coincidence. In order for there to be no triple coincidence, the $d$ days of the year must be partitioned into $h$ days where there are no birthdays, $i$ days on which a single individual has a birthday, and $j$ days where exactly two individuals were born. A factor of $n!/2^j$ is needed to count the number of arrangements of $n$ persons into such a configuration of $i + j$ days. Hence,

$$P(W = 0) = d^{-n} \sum_{i + 2j = n} \binom{d}{h,\, i,\, j} \frac{n!}{2^j}, \quad h = d - i - j.$$

For $n = 50$ and $d = 365$ we have that $P(W = 0) = 0.8736$, for an actual error of $0.8736 - 0.8632 = 0.0104 < 0.0597$, the Chen-Stein bound on the error.

For general $k$, in the case where birthdays are uniform, it is possible to consider a slight improvement

on the choice of $B_\alpha$. For $\alpha, \beta \in I$, knowing only the birthday of one member of $\alpha$, say $\min(\alpha)$, does not change the probability that $X_\alpha X_\beta = 1$. Hence one could take

$$B_\alpha = \{\beta \in I\colon (\alpha - \{\min(\alpha)\}) \cap \beta \ne \emptyset\},$$

which is strictly smaller than the old choice of $B_\alpha$; one still has $b_3 = 0$. Working through the calculations, one finds only a slight improvement in the error bound. For example, for the case of the triple coincidence with $n = 50$ and $d = 365$, the bound improves from 0.0597 to 0.0582. The change is slight because the main contribution to the upper bound comes from the part of $b_2$ where $|\alpha \cap \beta| = 2$; this contribution remains unchanged using the smaller $B_\alpha$.

In the general case of computing $k$-way coincidences when birthdays are uniform, the Chen-Stein method gives the best possible rate of convergence of the total variation to zero. Take $n, d \to \infty$ in such a way that $\lambda$ stays bounded away from zero and $\infty$, which we denote by $\lambda \asymp 1$. This condition implies that $n^k \asymp d^{k-1}$ and hence that $b_1 = \lambda^2 |B_\alpha|/|I| \asymp n^{-1}$. The order of the Chen-Stein bound here is the same as the order of $b_2$,

$$b_2 = \sum_{j=1}^{k-1} \binom{n}{k}\binom{k}{j}\binom{n-k}{k-j} d^{1+j-2k}.$$

The dominant contribution to $b_2$ comes from pairs $(\alpha, \beta)$ with $|\alpha \cap \beta| = j = k - 1$, and $b_2$ is of the order $n^{1+k} d^{-k} \asymp n/d \asymp n^{-1/(k-1)}$. Thus the Chen-Stein method yields that the total variation distance decays at a rate no slower than $O(n^{-1/(k-1)})$.

A lower bound on the total variation distance in the case of a $k$-way coincidence can be given by considering the event $E$ that $k + 1$ individuals share a birthday, that is, that there exist $\alpha, \beta$ of size $k$ with $|\alpha \cap \beta| = k - 1$ such that $X_\alpha X_\beta = 1$. The actual probability $P(E)$ can be bounded from below by the first two terms of the inclusion-exclusion formula; the first term is dominant and of the order $\binom{n}{k+1} d^{-k} \asymp n/d \asymp n^{-1/(k-1)}$. Letting $E'$ be the same event for the independent process, we have

$$P(E') \le \sum_{|\alpha| = k,\ |\beta| = k,\ |\alpha \cap \beta| = k-1} EX'_\alpha\, EX'_\beta = O\Bigl(\binom{n}{k+1}\binom{k+1}{2} d^{2-2k}\Bigr) = o\bigl(n^{-1/(k-1)}\bigr).$$

Thus, the order of the total variation distance is at least as large as $|P(E) - P(E')| \asymp P(E) \asymp n^{-1/(k-1)}$. Hence, the Chen-Stein method yields the correct order of the rate of decay of the total variation distance to zero.

4.2 The Length of the Longest Head Run

Consider many independent throws of a coin of success probability $p$, $0 < p < 1$. No matter what $p$, there will be some stretches where the coin comes up heads every time. To begin the analysis of the distribution of $R_n$, the length of the longest of these head runs, first note that for a test length $t$, appropriately chosen, one sees a head run of length $t$ begin at a given position $\alpha$ only with small probability. As the number of positions where such a run could occur is large, a Poisson approximation should be valid.

However, one must first adjust for the fact that runs of heads occur in "clumps"; that is, if there is a run of heads of length $t$ beginning at position $\alpha$, then with probability $p$ there will also be a run of heads of length $t$ beginning at position $\alpha + 1$, with probability $p^2$ a run of heads of length $t$ beginning at position $\alpha + 2$ and so forth. By counting only the first such run, the runs now counted are no longer clumped and, indeed, their number is Poisson in the limit. This is an example, with average clump size $1 + p + p^2 + \cdots$, of the "Poisson clumping heuristic" as described by Aldous (1989). By using the fact that having no runs of length $t$ is equivalent to having the longest head run shorter than $t$, one may approximate the distribution function of the length of the longest run of heads.

Let then $C_1, C_2, \ldots$ be independent Bernoulli random variables with success probability $p$, and let $R_n$ be the length of the longest run of heads beginning in the first $n$ tosses. Set the index set to be $I = \{1, 2, \ldots, n\}$; the elements of the index set will denote locations where long head runs may begin. A head run of length $t$ or more begins at position $\alpha$ if and only if the indicator random variable

$$Y_\alpha = \prod_{i=\alpha}^{\alpha+t-1} C_i$$

takes the value one. To declump, that is, in order to count only the first head run in a clump, we take $X_1 = Y_1$ and

$$X_\alpha = (1 - C_{\alpha-1}) Y_\alpha, \quad \alpha = 2, 3, \ldots, n.$$

For $\alpha = 2, 3, \ldots, n$, $X_\alpha$ will be one if and only if a run of $t$ or more heads begins at position $\alpha$, preceded by a tail. If we had ignored clumping and simply taken $Y_\alpha = X_\alpha$, we would have $b_2$ not tending to zero, and, in fact, a Poisson approximation would not be valid.

Write now the total number of clumps of runs of length $t$ or more as the sum of dependent indicator random variables

$$W = \sum_{\alpha \in I} X_\alpha.$$

The Poisson approximation heuristic says we should be able to approximate the distribution of $W$ by a Poisson random variable with mean

$$\lambda = \lambda_n(t) = EW = p^t\{(n - 1)(1 - p) + 1\}.$$

In particular then, since we have as events

$$\{R_n < t\} = \{W = 0\},$$

the distribution function of $R_n$ may be approximated as

$$P(R_n < t) = P(W = 0) \approx e^{-\lambda}.$$

The test length is dictated by requiring $\lambda$ to be bounded away from 0 and $\infty$; this is equivalent to the condition that $t - \log_{1/p}(n(1 - p))$ is bounded. In fact, for integer $t$, with $c$ defined by

$$t = \log_{1/p}((n - 1)(1 - p) + 1) + c,$$

the above approximation predicts that

$$P(R_n < t) \approx e^{-\lambda_n(t)} = \exp(-p^c),$$

that is, that $R_n - \log_{1/p}((n - 1)(1 - p) + 1)$ has an asymptotic extreme value distribution. This is almost so; the limiting distribution is complicated by the fact that $R_n$ can assume only integer values. However, this fact does not complicate the approximation itself.

For example, with $n = 2047$ and a fair coin with $p = 1/2$, we look for runs of length $\log_{1/p}((n - 1)(1 - p) + 1) = \log_2(2046 \cdot 1/2 + 1) = 10$. Would a run of length, say, $t = 14$ be unusual? By using the Poisson approximation, we see that $P(R_{2047} \ge 14) = 1 - P(R_{2047} < 10 + 4)$ may be approximated by $1 - \exp(-(1/2)^4) = 0.06059$.

To assess the accuracy of the above Poisson approximation, we apply Theorem 1. Define $B_\alpha = \{\beta \in I\colon |\alpha - \beta| \le t\}$ for all $\alpha$. Since $X_\alpha$ is independent of $\{X_\beta\colon \beta \notin B_\alpha\}$, we have $b_3 = 0$. Furthermore, if $1 \le |\alpha - \beta| \le t$, we cannot have that both $X_\alpha$ and $X_\beta$ are 1, since we insisted that a run begin with a tail; therefore $p_{\alpha\beta} = 0$ for $\beta \in B_\alpha$, $\beta \ne \alpha$, hence $b_2 = 0$. In order to calculate $b_1 = \sum_\alpha \sum_{\beta \in B_\alpha} p_\alpha p_\beta$ we break up the sum over $\beta \in B_\alpha$ into two parts, depending on whether or not $p_1$ appears. This yields the bound

$$b_1 < \lambda^2(2t + 1)/n + 2\lambda p^t. \tag{9}$$

Theorem 1 now reveals that the Poisson approximation is quite accurate for the example considered above; the probability computed is correct to within $b_1 < 6.297 \times 10^{-5}$, so that

$$0.060527 \le P(R_{2047} \ge 14) \le 0.0606453.$$

4.2.1 Compound Poisson process limits and long head runs

What follows is a concrete illustration of the discussion in Section 3.1. Specifically, we show how the problem of long head runs may be treated to obtain a compound Poisson limit for the random variable

$$U \equiv \sum_{\alpha \in I} C_\alpha C_{\alpha+1} \cdots C_{\alpha+t-1},$$

which counts the number of locations among the first $n$ at which a head run of length at least $t$ begins. As we have noted, these locations tend to occur in clumps; $W$ counts the number of clumps and is approximately Poisson in distribution. The size of each clump, minus one, is the length by which the associated head run exceeds $t$ and is distributed as a geometric random variable with parameter $p$. The clump sizes are mutually independent of each other and approximately independent of the total number $W$ of clumps, so the distribution of $U$ is approximately Poisson compounded by geometric. Furthermore, the clump sizes are approximately independent of the locations of the clumps, so that we have approximately a mosaic process. The Chen-Stein method gives us total variation bounds to make all of this precise.

As described in Section 3.1, we will enlarge the index set from $I$ to $I^* = I \times T$ in order to keep track of the types of clumps; the values of $b_1^*$, $b_2^*$ and $b_3^*$ are given by (4), (5) and (6) using $I^*$ and the new neighborhoods $B^*_\alpha$, defined below. Here we take $T = \{0, 1, \ldots, t\}$ as the set of possible types of clumps. Any run of exactly $t + i$ heads for $0 \le i < t$ corresponds to a clump of size $i$; a run of $2t$ or more heads corresponds to a clump of type $i = t$. The interpretation is that each of the $X_\alpha$ runs of heads of length at least $t$ starting at $\alpha$ can independently be assigned a type $i$, corresponding to a run of exactly $t + \min(i, t)$ heads. (For the purpose of proving convergence of $U$ to a compound Poisson limit, the upper bound $t$ could be replaced by anything tending to infinity as $n$ grows.) For $\alpha \in I$, $i \in T$ let

$$X_{\alpha,i} \equiv (1 - 1\{\alpha > 1\}C_{\alpha-1})\, C_\alpha C_{\alpha+1} \cdots C_{\alpha+t+i-1}\, (1 - 1\{i < t\}C_{\alpha+t+i}),$$

so that for all $\alpha \in I$, $X_\alpha = \sum_{i \in T} X_{\alpha,i}$. Using the notation introduced in Section 3.1, in order to have $b_3^* = 0$, we expand the neighborhoods by a factor of two: Let $B^*_{\alpha,i} \equiv \{(\beta, j) \in I^*\colon |\alpha - \beta| \le 2t\}$, which yields $b_1^* < 2b_1$, where an upper bound on $b_1$ is given in (9). We have $b_2^* \le b_1^*$. Applied to the setup with index set $I \times T$, Theorem 2 yields the result

$$\tfrac{1}{2}\|\mathcal{L}(X^*) - \mathcal{L}(Y^*)\| \le 2b_1^* + 2b_2^* + b_3^* < 8b_1. \tag{10}$$

In the example with $n = 2047$, $t = 14$, $p = 1/2$ that we treated above, this upper bound is $8 \times 6.297 \times 10^{-5}$. The Poisson process $Y^* \equiv (Y_{\alpha,i})_{\alpha \in I, i \in T}$ may be viewed as a refinement of the Poisson process $Y \equiv (Y_\alpha)_{\alpha \in I}$, with $Y_\alpha = \sum_{i \in T} Y_{\alpha,i}$ for each $\alpha \in I$. The distribution of the type $i$ of each clump is exactly geometric($p$), truncated at height $t$.

To show that $U$ is approximately compound Poisson, consider any set $A \subset \{0, 1, 2, \ldots\}$ and let

totic probability of at least one such pair. Thus $Eg(X) \sim n^{-1}p^5/4$, where $g$ is the functional with $g(X) = 1\bigl(\sum_{\alpha \ne \beta \in I(3),\ \beta \in B(\alpha)} X_\alpha X_\beta > 0\bigr)$. For the Poisson and independent events processes, we have

$Eg(Y),\ Eg(X') \le \sum_{\alpha \ne \beta \in I(3),\ \beta \in B(\alpha)} EX_\alpha EX_\beta = O(n^{-2})$.

Thus

$\liminf_n n \tfrac12 \|\mathcal{L}(X) - \mathcal{L}(X')\| \ge \lim_n n\,(Eg(X) - Eg(X')) = p^5/4 > 0$.

Notice also that $b_2 \ge E$(# pairs of neighboring triangles), so that $b_2$ decays no faster than $O(n^{-1})$.

For $j \ge 3$, let $W_j \equiv \sum_{\alpha \in I(j)} X_\alpha$, $Z_j \equiv \sum_{\alpha \in I(j)} Y_\alpha$, so that $W_j$ is the number of cycles of length $j$ and $Z_j$ is a Poisson random variable having the same mean as $W_j$. For $j > n$, $W_j$ is identically zero. Thus $W \equiv (W_j)_{j \ge 3}$ is the process counting the number of cycles of each length, and $Z \equiv (Z_j)_{j \ge 3}$ is a Poisson process with independent components. Since there is a functional $h(\cdot)$ such that $W = h(X)$, $Z = h(Y)$, we have as a corollary to Proposition 4 that

(12) $\|\mathcal{L}(W) - \mathcal{L}(Z)\| \le \|\mathcal{L}(X) - \mathcal{L}(Y)\| = O(1/n)$

uniformly in $p \le \delta < 1$. The convergence of the finite dimensional distributions of $W$ to their independent Poisson process limit is given in Bollobás (1985, page 79).

We now show how the Poisson process $Z$ supplies an answer to the question: what is the probability that there are more cycles of even length than of odd length? Consider the functionals (one for each value of $n$) defined by $f(c_3, c_4, \ldots, c_n) = 1(c_4 + c_6 + \cdots > c_3 + c_5 + \cdots)$, so that our question is: what is the value of $Ef(W)$? Our answer is: approximately $Ef(Z)$, with error at most $\tfrac12\|\mathcal{L}(W) - \mathcal{L}(Z)\|$, since the functional $f$ takes values in $[0, 1]$.

One may simplify the answer further at the expense of some additional error of approximation as follows. The Poisson parameter for $Z_j$, namely $\lambda_j = n^{-j}(n)_j\, p^j/(2j)$ (which is zero for $j > n$), can be replaced by its limit value, $p^j/(2j)$, for $j = 3, 4, \ldots$. The total variation error introduced by this approximation is of the same order as the increase in expectation, namely

(13) $\sum_{j \ge 3} \bigl( p^j/(2j) - \lambda_j \bigr)$,

which for fixed $p$ is $O(1/n)$, the same as the error controlled by the Chen-Stein method. Now $Z_{odd} \equiv Z_3 + Z_5 + \cdots$ and $Z_{even} \equiv Z_4 + Z_6 + \cdots$ are independent Poisson random variables with means

(14)
$a_{odd}(p) \equiv EZ_{odd} = \frac12\Bigl(\frac{p^3}{3} + \frac{p^5}{5} + \cdots\Bigr) = \frac14 \log\frac{1+p}{1-p} - \frac{p}{2}$,
$a_{even}(p) \equiv EZ_{even} = \frac12\Bigl(\frac{p^4}{4} + \frac{p^6}{6} + \cdots\Bigr) = \frac14\Bigl(\log\frac{1}{1-p^2} - p^2\Bigr)$.

We have just proved that the probability that there are more cycles of even length than of odd length converges to $P(Z_{even} > Z_{odd})$, where $Z_{even}$ and $Z_{odd}$ are independent Poisson, with parameters $a_{even}(p)$ and $a_{odd}(p)$, respectively. Furthermore, the distance between the actual probability for the graph on $n$ vertices and its limiting value is no greater than the sum of (12) and (13).

The exact expression for $b_2$ is complicated; below we give an upper bound. In the second line of the bound (15), $j \ge 3$ is the number of vertices in $\alpha$, $k \ge 1$ is the number of shared segments common to $\alpha$ and $\beta$, and $l \ge 0$ is the number of vertices in $\beta$ which are not on the common segments. Formally, a shared segment of $\alpha$ and $\beta$ is an unoriented, maximal sequence of edges that occur consecutively in both $\alpha$ and $\beta$. Each of the $k$ common segments corresponds to a factor of $p/n$ in $E(X_\alpha X_\beta)$ that is not matched by a choice of one of $n$ vertices, which suggests that for $p$ not too large, the main contribution to $b_2$ comes from the case $k = 1$, and $b_2 = O(1/n)$; we prove this below for $p < 1/2$. Unfortunately, for $p$ sufficiently close to 1, both $b_2$ and the second moment of $W$ blow up exponentially fast as $n \to \infty$. In these cases we must resort to a truncation argument to prove Proposition 4.

(15)
$b_2 = \sum_{\alpha \in I} \sum_{\beta \in B(\alpha) \setminus \{\alpha\}} E(X_\alpha X_\beta)$
$\quad \le \sum_{3 \le j \le n} \frac{(n)_j}{2j} \Bigl(\frac{p}{n}\Bigr)^j \sum_{1 \le k \le j/2} 2\binom{j}{2k} \sum_{l \ge 0} \binom{n-2k}{l} \Bigl(\frac{p}{n}\Bigr)^{k+l} (k + l - 1)!\, 2^{k-1}$.

Here are further details for explaining the upper bound (15). Consider for example,

$\alpha = (1\ 2\ 3\ 4\ 5\ 6\ 7\ 8)$

and

$\beta = (1\ 2\ 3\ 8\ 9\ 5\ 4\ 6\ 7)$,


which has $j = 8$ vertices in the cycle $\alpha$. There are $k = 3$ common segments without regard to order of traversal; these are (1 2 3), (4 5), (6 7). The three common segments share the six common endpoints {1, 3, 4, 5, 6, 7}. There are $l = 2$ vertices of $\beta$ not on the common segments; they are {8, 9}. The factor $2\binom{j}{2k}$ is the number of ways to choose the $k$ common segments from $\alpha$, since $k$ segments must have $2k$ distinct vertices as endpoints, and a set of $2k$ endpoints determines two sets of $k$ segments. In our example above, the other choice of segments using the same six endpoints is (3 4), (5 6), (7 8 1). The factor $(k + l - 1)!\, 2^{k-1}$ is the number of ways that the $k$ common segments from $\alpha$ and the $l$ additional vertices can be arranged into a cycle $\beta$, choosing orientations for each segment after the first. Not all of these arrangements correspond to a choice of $\beta$ sharing the given $k$ segments: in our example, if the segment (4 5) were given the opposite orientation, $\beta$ would be changed into $\beta' = (1\ 2\ 3\ 8\ 9\ 4\ 5\ 6\ 7)$, with $\alpha$ and $\beta'$ classified as sharing not $k = 3$ but rather $k = 2$ segments, namely (1 2 3) and (4 5 6 7). Another reason that our bound on $b_2$ is an overestimate is the factor $\binom{n-2k}{l}$ for choosing $l$ additional points for $\beta$. If $i$ is the actual number of points used in the $k$ common segments, with $2k \le i \le j$, then the number of ways to choose additional points for $\beta$ is $\binom{n-i}{l} \le \binom{n-2k}{l}$.

To see that $b_2$ and therefore $EW^2 \to \infty$ for $p$ sufficiently close to 1, consider $g(n, k, m, p)$, the contribution to $b_2$ from pairs of cycles $\alpha, \beta$, each of length $2k + m$, and sharing $k$ common segments, each consisting of a single edge. Observe that $g(n, k, m, p) = p^{3k+2m} g(n, k, m, 1)$. Let

$f(a, b) = \lim_{n \to \infty} n^{-1} \log g(n, \lfloor an \rfloor, \lfloor bn \rfloor, 1)$,

where $a > 0$, $b > 0$, $2a + b < 1$. From a calculation below,

$f(a, b) = a \log 2 - (3a + 2b) + L(1 - 2a) - L(a) + 2L(a + b) - 2L(b) - 2L(1 - 2a - b)$

where $L(x) = x \log(x)$. Numerical search gives us, for $a = .02$, $b = .1$, that $0 < f(a, b) = .00398\ldots$. Thus for $p$ sufficiently close to 1, $g(n, \lfloor .02n \rfloor, \lfloor .1n \rfloor, p)$ and hence $b_2$ and the second moment of $W$ blow up exponentially as $n \to \infty$. To derive the formula above for $f(a, b)$, we start with an asymptotic formula for $g(n, k, m, p)$, with $n, k \to \infty$:

$g(n, k, m, p) \sim \frac{p^{3k+2m}}{n^{3k+2m}} \cdot \frac{(n)_{2k}}{k!\, 2^k} \cdot \Bigl( \binom{n-2k}{m} (k + m - 1)!\, 2^{k-1} \Bigr)^2 \exp(-\mu)$.

In the above formula for $g$, the first factor is $E(X_\alpha X_\beta)$. The second factor counts the number of ways to form $k$ nonoverlapping edges to serve as common segments for $\alpha$ and $\beta$. The third factor counts the number of ways to pick $m$ additional vertices for $\alpha$, to order the $k$ edges and $m$ vertices, and to orient the $k$ edges, and to do the same for $\beta$. The final factor, an exponential that is bounded away from zero, is the Poisson approximation for what fraction of the arrangements just counted actually have $k$ common segments of length 2. For an upper bound on $\mu = \mu(n, k, m)$, we have $\mu \le 2$, which, in case both $\alpha$ and $\beta$ had chosen the same $m$ additional vertices, is the expected number of pairs of objects, either edges or vertices, adjacent in both $\alpha$ and $\beta$.

The pairs $\alpha, \beta$ counted in $g(n, k, m, p)$ form part, but not all, of the term in (15) indexed by $k$, $j = 2k + m$, $l = m$. As a check on the above computation of exponential growth, the values of $g(n, .02n, .1n, 1)\exp(\mu)$ for $n = 1000, 2000, \ldots, 7000, 8000$ are approximately .0001132, .001077, .0210, .5496, 16.90, 575.4, 21020.3 and 808488.

However, for small $p$, we have $b_2 = O(n^{-1})$, which we show at the end of this section. We note that

$\sum_{\alpha \in I} (EX_\alpha)^2 = \sum_{j \ge 3} \frac{(n)_j}{2j} \Bigl(\frac{p}{n}\Bigr)^{2j} = O(n^{-3})$.

Using (11) to bound the off-diagonal terms $EX_\alpha EX_\beta$ of $b_1$ as multiples of the corresponding terms $E(X_\alpha X_\beta)$ of $b_2$ shows that

(16) $b_1 \le \sum_{\alpha \in I} (EX_\alpha)^2 + \frac{p}{n}\, b_2$.

Thus, when $p$ is small enough that $b_2 = O(n^{-1})$, we have $b_1 = O(n^{-2})$, and Proposition 1 follows directly from the Chen-Stein method given in Theorems 1-3. For $p$ close to 1, however, $b_2 \to \infty$, and we must resort to the truncation argument given below.

Fix $\epsilon \le 1$ and consider only cycles $\alpha$ of length $|\alpha|$ up to $\epsilon n$. Formally, consider the truncated indicators of cycles: for $\alpha \in I$,

$X^\epsilon_\alpha \equiv 1(|\alpha| \le \epsilon n)\, X_\alpha$,

which form the process $X^\epsilon \equiv (X^\epsilon_\alpha)_{\alpha \in I}$. We have

$\tfrac12 \|\mathcal{L}(X^\epsilon) - \mathcal{L}(X)\| \le P(X^\epsilon \ne X) \le \sum_{j > \epsilon n} p^j \le \frac{p^{\epsilon n}}{1 - p}$,

so that the approximation error in replacing $X$ by $X^\epsilon$ is exponentially small as $n \to \infty$. The same holds for truncation of the Poisson process $Y$ and the independent events process $X'$. The bound (16) applies also to the truncated process, so that Proposition 1


for the truncated process will be proved if we can show that for the truncated process, $b_2 = O(n^{-1})$. The original version of Proposition 1 then follows by using the triangle inequality to compare the original with the truncated processes.

Using the Chen-Stein method with the same neighborhoods as before, we have the following upper bounds, corresponding to (15) for the truncated process.

(17)
$b_2(\epsilon) \equiv \sum_{\alpha \in I} \sum_{\beta \in B(\alpha) \setminus \{\alpha\}} E(X^\epsilon_\alpha X^\epsilon_\beta)$
$\quad \le \sum_{j=3}^{\epsilon n} \frac{(n)_j}{2j} \Bigl(\frac{p}{n}\Bigr)^j \sum_{k=1}^{j/2} 2\binom{j}{2k} \sum_{l \ge 0} \binom{n-2k}{l} \Bigl(\frac{p}{n}\Bigr)^{k+l} (k + l - 1)!\, 2^{k-1}$
$\quad \le \sum_{j=3}^{\epsilon n} \frac{p^j}{2j} \sum_{k=1}^{j/2} \binom{j}{2k} \Bigl(\frac{2p}{n}\Bigr)^k \sum_{l \ge 0} p^l (k + l - 1)_{k-1}$
$\quad = \sum_{j=3}^{\epsilon n} \frac{p^j}{2j} \sum_{k=1}^{j/2} \binom{j}{2k} \Bigl(\frac{2p}{n}\Bigr)^k (k - 1)!\, (1 - p)^{-k}$
$\quad \le \sum_{j=3}^{\epsilon n} \frac{p^j}{2j} \sum_{k=1}^{j/2} \binom{j}{2k} \Bigl(\frac{2pk}{n(1-p)}\Bigr)^k$
$\quad \le \sum_{j=3}^{\epsilon n} \frac{p^j}{2j} \sum_{k=1}^{j/2} \binom{j}{2k} \Bigl(\frac{pj}{n(1-p)}\Bigr)^k$
$\quad \le \sum_{j=3}^{\epsilon n} \frac{p^j}{2j} \cdot \frac{p j^3}{n(1-p)} \Bigl(1 + \sqrt{\frac{pj}{n(1-p)}}\Bigr)^j$.

To get the second equality, use the identity $\sum_{l \ge 0} x^l (k + l - 1)_{k-1} = (k - 1)!\,(1 - x)^{-k}$. To get the next line, just replace $(k - 1)!$ by $k^k$. To get the next line, we use $2k \le j$. For the final line, use the inequality $\sum_{k \ge 1} \binom{j}{2k} x^{2k} \le x^2 j^2 (1 + x)^j$ for $x > 0$.

For $p \le \delta < 1$, the final line of (17), which has $j/n \le \epsilon$, shows that

$b_2(\epsilon) \le \frac{1}{n} \cdot \frac{\delta}{2(1-\delta)} \sum_{j \ge 3} j^2 \Bigl(\delta + \delta\sqrt{\delta\epsilon/(1-\delta)}\Bigr)^j$.

Given $\delta < 1$, we can find $\epsilon > 0$ so small that $\delta + \delta\sqrt{\delta\epsilon/(1-\delta)} < 1$. For such a choice of $\epsilon$, we thus have $b_2(\epsilon) < C(\delta)/n$, uniformly in $0 < p < \delta < 1$. This completes the proof of Proposition 1. In particular, we observe that $\epsilon = 1$ works if $\delta < 1/2$, so that for $p < 1/2$, $b_1 + b_2 = O(n^{-1})$ and the Chen-Stein method works directly with no truncation.

4.4 Maxima of Independent and Dependent Normal Variates

The power of the Chen-Stein method is well illustrated by the classical problem of determining the distribution of the maximum of normal variates. An extensive treatment of this topic also using Stein's method appears in Holst and Janson (1990). See also Barbour, Holst, and Janson (1988b) for a treatment of the m-dependent case very similar to ours. Consider a sequence of independent standard normal variates $\{Z_1, Z_2, \ldots\}$. Let $M_n = M_{1n} = \max_{\alpha \le n} Z_\alpha$.

Hall (1980) analyzes the distribution of $M_n$. He concludes that the usual approximation as scaled extreme value is too slowly convergent to be satisfactory for practically occurring sample sizes, and he suggests alternative approximations for it and for $M_{k,n}$, the $k$th largest of the first $n$ observations. His derivation involves a careful asymptotic analysis. A number of similar results may be obtained using the Chen-Stein method in the independent case.

Choose a test value $t$. We wish to approximate $P\{M_n \le t\}$. Let $X_\alpha = 1\{Z_\alpha > t\}$, so that $EX_\alpha = p(t) = 1 - \Phi(t)$. If the test value is to be sufficiently large to be of interest, we may expect $p(t)$ to be rather small, so that the Poisson approximation should apply. With $I = \{1, 2, \ldots, n\}$, we have $W = \sum_{\alpha \in I} X_\alpha$, $EW = \lambda_n(t) = np(t)$; we are led to believe that $P\{M_n \le t\} \approx e^{-\lambda_n(t)}$.

Bounds on the quality of the approximation are given by Theorem 1. Using independence, we choose the neighborhood of dependence $B(\alpha) = \{\alpha\}$ and find

$b_1 = np^2(t) = \lambda_n^2(t)/n$
$b_2 = 0$
$b_3 = 0$.

We may conclude immediately that since $\{M_n \le t\} = \{W = 0\}$,

(18) $e^{-\lambda_n(t)} - \lambda_n^2(t)/n \le P\{M_n \le t\} \le e^{-\lambda_n(t)} + \lambda_n^2(t)/n$.

Hall's approximations essentially involve writing

(19) $P\{M_n \le t\} = (\Phi(t))^n = (1 - (1 - \Phi(t)))^n \approx \exp(-\lambda_n(t))$,

approximating the upper tail of the normal by the asymptotic expansion (26.2.12) of Abramowitz and Stegun (1964) and providing usably simple explicit bounds for the error of approximation. Note that the last term of (19) can be interpreted as the Poisson probability whose error of approximation is bounded by (18).

In Table 1, we give various lower and upper bounds for $P\{M_n \le t\}$. Compare the bounds given in Hall (1980) with the bounds (18), and with modified bounds given by replacing the normal distribution with the bounds (20) for Mill's ratio.
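The Chen-Stein bounds (18) are simple enough to evaluate directly. The following sketch computes them for one pair $(n, t)$ and reports each error relative to the exact upper tail $1 - \Phi^n(t)$, the convention used in Table 1. The helper names are illustrative, and the normal tail is taken from Python's `math.erfc`.

```python
import math


def upper_tail(t: float) -> float:
    """p(t) = 1 - Phi(t) for a standard normal variate."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def chen_stein_bounds(n: int, t: float) -> tuple[float, float]:
    """Lower and upper bounds (18) on P{M_n <= t}."""
    lam = n * upper_tail(t)          # lambda_n(t) = n p(t)
    b1 = lam**2 / n                  # b_1 = n p(t)^2; b_2 = b_3 = 0
    return math.exp(-lam) - b1, math.exp(-lam) + b1


def percent_error(bound: float, n: int, t: float) -> float:
    """100 * (bound - Phi^n(t)) / (1 - Phi^n(t))."""
    exact = (1.0 - upper_tail(t)) ** n
    return 100.0 * (bound - exact) / (1.0 - exact)


n, t = 10, 1.6
lower, upper = chen_stein_bounds(n, t)
exact = (1.0 - upper_tail(t)) ** n
print(f"exact {exact:.4f}, bounds [{lower:.4f}, {upper:.4f}]")
print(f"percent errors: {percent_error(lower, n, t):.2f}, "
      f"{percent_error(upper, n, t):.2f}")
```

Because the exact value $\Phi^n(t)$ is available in closed form in the independent case, the sketch can compare the bounds against it; in dependent settings only the bounds themselves would be computable.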


Following Hall (1980), write:

$Q_{1n} = Q_{1n}(t) = \exp\Bigl\{ -\frac{n\phi(t)}{t} \Bigl( 1 - \frac{1}{t^2} + \frac{3}{t^4} + \frac{n\phi(t)}{2(n-1)t} \Bigr) \Bigr\}$

$Q_{2n}(t) = \exp\Bigl\{ -\frac{n\phi(t)}{t} \Bigl( 1 - \frac{1}{t^2} \Bigr) \Bigr\}$

$Q_{3n}(t) = \exp\Bigl\{ -\frac{n\phi(t)}{t} \Bigl( 1 - \frac{1}{t^2} + \frac{3}{t^4} - \frac{15}{t^6} \Bigr) \Bigr\}$

all valid when $2\pi t^2 e^{t^2} > n^2$. Hall shows the function $Q_{1n}$ is a lower bound on the distribution of $M_n$, and that $Q_{2n}$ and $Q_{3n}$ are upper bounds.

In the following analysis we make repeated use of the inequalities:

(20) $\frac{2}{t + \sqrt{t^2 + 4}} \le \frac{1 - \Phi(t)}{\phi(t)} \le \frac{4}{3t + \sqrt{t^2 + 8}}$

for $t \ge 0$. The lower bound is due to Birnbaum (1942); the upper bound is proved in Sampford (1953). Both bounds can be obtained as corollaries to Karlin (1982), in which total positivity is used to prove the monotonicity of the variance of certain families of truncated distributions, including the left-truncated normal family.

Write $\bar\lambda_n(t) = 4n\phi(t)/(3t + \sqrt{8 + t^2})$ and $\underline\lambda_n(t) = 2n\phi(t)/(t + \sqrt{4 + t^2})$. Observe that $\underline\lambda_n(t) \le \lambda_n(t) \le \bar\lambda_n(t)$, following directly from (20). We present in Table 1 a comparison of the lower bounds $Q_{1n}(t)$, the Chen-Stein bound $L_{0n}(t)$ from (18) and $L_{1n}(t)$ obtained by substituting the upper bound $\bar\lambda_n(t)$ in (18). The upper bounds $U_{0n}$ and $U_{1n}$ are similarly obtained, save that both $\underline\lambda_n(t)$ and $\bar\lambda_n(t)$ need to be used in obtaining $U_{1n}$ from $U_{0n}$. All errors are reported as percentages of the actual upper tail of the exact distribution of the maximum.

Note that the Chen-Stein bounds $L_0$ and $U_0$ using the normal distribution function are by far the tightest of all the bounds displayed over the range of values tabulated. If one wishes to approximate the upper tail of the normal distribution using Mill's ratio, then the bounds $L_1$ and $U_1$ are next preferred, save when the test value $t$ exceeds 4, in which case $Q_3$, which uses an asymptotic expansion to sixth order, is preferred to $U_1$. As Hall notes, taking more terms in an asymptotic expansion is not always desirable. Compare the errors for $Q_2$ and $Q_3$ when $n = 10$.

An appealing feature of the Poisson approximation is its versatility. Since $\{M_{k,n} \le t\} = \{W < k\}$, from the

TABLE 1

Percent relative errors for bounds on the distribution of the maximum of independent normal variates

Each error column gives 100 x (bound - Φ^n(t)) / (1 - Φ^n(t)) for the bound named in the heading.

    n     t    Φ^n(t)     Q1n      L1n      L0n     U0n     U1n     Q2n     Q3n

   10    1.6   .5692   -24.26    -5.40    -4.89    9.05   11.14   20.02   73.76
         2.0   .7944   -10.30    -1.78    -1.50    3.53    5.05   10.84   15.90
         2.4   .9210    -4.33     -.62     -.46    1.24    2.29    6.08    4.78
         2.8   .9747    -1.94     -.22     -.13     .38    1.10    3.55    1.62
         4.0   .9997     -.28     -.02     -.00     .00     .25     .96     .11

   50    2.2   .4966    -4.76    -1.60    -1.44    2.40    3.35    6.14    6.50
         2.6   .7917    -2.58     -.42     -.31     .73    1.52    4.22    2.50
         3.0   .9347    -1.30     -.14     -.07     .20     .78    2.70     .96
         3.4   .9833     -.67     -.06     -.02     .05     .46    1.74     .38
         4.5   .9998     -.14     -.01     -.00     .00     .17     .62     .05

  100    2.4   .4391    -2.82    -1.04     -.93    1.46    2.17    4.18    3.27
         2.8   .7743    -1.71     -.26     -.18     .40    1.04    3.17    1.45
         3.2   .9336     -.91     -.09     -.04     .10     .58    2.12     .59
         3.6   .9842     -.49     -.04     -.01     .02     .37    1.41     .25
         4.5   .9997     -.14     -.01     -.00     .00     .17     .62     .05

  500    3.0   .5090     -.94     -.19     -.14     .23     .65    1.97     .69
         3.4   .8449     -.62     -.06     -.02     .05     .43    1.62     .36
         3.8   .9645     -.36     -.03     -.00     .01     .30    1.14     .16
         4.2   .9933     -.21     -.02     -.00     .00     .21     .80     .08

 1000    3.2   .5029     -.65     -.11     -.07     .12     .46    1.54     .43
         3.6   .8529     -.46     -.04     -.01     .02     .34    1.31     .23
         4.0   .9688     -.27     -.02     -.00     .00     .25     .95     .11
         4.4   .9946     -.16     -.01     -.00     .00     .18     .67     .05
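The $L_{1n}$ and $U_{1n}$ columns of Table 1 rest on the Mill's ratio inequalities (20). The sketch below, with illustrative function names, checks (20) numerically and rebuilds the two bounds for one $(n, t)$ pair. It assumes, following the description in the text, that $L_{1n}$ uses the upper Poisson parameter $\bar\lambda_n(t)$ throughout, while $U_{1n}$ uses the lower parameter $\underline\lambda_n(t)$ in the exponential and $\bar\lambda_n(t)$ in the error term.

```python
import math


def phi(t: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def upper_tail(t: float) -> float:
    """1 - Phi(t) for a standard normal variate."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def mills_bounds(t: float) -> tuple[float, float]:
    """Lower (Birnbaum) and upper (Sampford) bounds (20) on (1 - Phi(t)) / phi(t)."""
    return 2.0 / (t + math.sqrt(t * t + 4.0)), 4.0 / (3.0 * t + math.sqrt(t * t + 8.0))


def l1_u1(n: int, t: float) -> tuple[float, float]:
    """Bounds (18) with lambda_n(t) replaced by its Mill's ratio bounds."""
    lo_ratio, hi_ratio = mills_bounds(t)
    lam_lo = n * phi(t) * lo_ratio   # lower bound on lambda_n(t)
    lam_hi = n * phi(t) * hi_ratio   # upper bound on lambda_n(t)
    # ASSUMPTION: this pairing of lam_lo and lam_hi follows the text's description.
    return math.exp(-lam_hi) - lam_hi**2 / n, math.exp(-lam_lo) + lam_hi**2 / n


def percent_error(bound: float, n: int, t: float) -> float:
    """100 * (bound - Phi^n(t)) / (1 - Phi^n(t))."""
    exact = (1.0 - upper_tail(t)) ** n
    return 100.0 * (bound - exact) / (1.0 - exact)


n, t = 10, 1.6
l1, u1 = l1_u1(n, t)
print(f"L1 error {percent_error(l1, n, t):.2f}, U1 error {percent_error(u1, n, t):.2f}")
```

Since these bounds need only the density $\phi(t)$ and elementary functions, they can be used when the normal distribution function itself is not at hand, at the cost of the somewhat larger errors shown in the table.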


identical calculations, the Chen-Stein method yields with no more work

$\Bigl| P\{M_{k,n} \le t\} - \sum_{j=0}^{k-1} \frac{\lambda_n^j(t)}{j!}\, e^{-\lambda_n(t)} \Bigr| \le \frac{\lambda_n^2(t)}{n}$.

Similar bounds are not explicitly available in Hall (1980) and would require substantially more work.

Our interest in the Chen-Stein Poisson approximation arose from our study of maxima of weakly dependent random sequences. Our initial tool was the Bonferroni inequalities, whose effective use we learned from the seminal paper of Watson (1954). Watson's method implicitly requires that one compute all moments of the sum of indicators $\sum X_\alpha$. Hence the use of Watson's method is equivalent to proving convergence in distribution to the distribution hoped to be determined by limits of moments of the counting process. Watson illustrates the utility of his method by evaluating the limiting distribution of the maximum of a stationary k-dependent sequence of jointly normal variates.

Here is the corresponding computation using the Chen-Stein method for the case of a 1-dependent moving average of normal variates. Let $Y_\alpha = (Z_\alpha + \theta Z_{\alpha+1})/\sqrt{1 + \theta^2}$ be a stationary sequence of normal variates with mean 0, unit variance, and common lag-1 autocorrelation $\rho = \theta/(1 + \theta^2)$. Let $M_n^* = \max_{\alpha \le n} Y_\alpha$. Again choose test value $t$. Form $X_\alpha = 1\{Y_\alpha > t\}$ so that

$P\{M_n^* \le t\} = P\{\sum_{\alpha=1}^n X_\alpha = 0\}$.

Choose neighborhoods of dependence $B_\alpha = \{\alpha - 1, \alpha, \alpha + 1\} \cap \{1, \ldots, n\}$. Let $p(t) = 1 - \Phi(t)$ and $\lambda_n(t) = np(t)$ be as before. We then have for positive $t$

$b_1 \le 3\lambda_n^2(t)/n$

$b_2 \le 2C(\rho)\,\lambda_n(t) \Bigl(\frac{2}{u + \sqrt{u^2 + 4}}\Bigr)^{2\rho/(1+\rho)} \Bigl(\frac{\lambda_n(t)}{n}\Bigr)^{(1-\rho)/(1+\rho)}$

$b_3 = 0$,

where $u = t\sqrt{2/(1 + \rho)}$, and

(21) $C(\rho) = \sqrt{2\pi}^{\,(1-\rho)/(1+\rho)} \sqrt{\frac{2(1+\rho)}{\pi(1-\rho)}}$.

The bound on $b_1$ is immediate from the definition of $B_\alpha$. The bound on $b_2$ is a consequence of the elementary inequality of Lemma 1, stated and proved at the end of the section. The term $b_3$ is zero because dependence is local. Hence by Theorem 1

(22) $\bigl| P\{\max_{\alpha \le n} Y_\alpha \le t\} - e^{-\lambda_n(t)} \bigr| \le (1 - e^{-\lambda_n(t)}) \Bigl( \frac{3\lambda_n(t)}{n} + 2C(\rho) \Bigl(\frac{2}{u + \sqrt{u^2 + 4}}\Bigr)^{2\rho/(1+\rho)} \Bigl(\frac{\lambda_n(t)}{n}\Bigr)^{(1-\rho)/(1+\rho)} \Bigr)$.

The bounds are useful when $\lambda_n(t)/n$ is close to 0, and so are Poisson approximations to the distributions of other extreme order statistics.

Although bounds for rates of convergence are implicit in Watson's (1954) use of the Bonferroni inequalities, they are certainly more conveniently available in the Chen-Stein formulation.

An exhaustive treatment of rates of convergence for stationary Gaussian time series is given in Rootzén (1983). There, rates of convergence are established and bounds with explicit constants are given in substantial generality. Connections are made with Poisson approximation using coupling methods due to Serfling (1975). The chief tool is a technical lemma relating the distribution of dependent and independent Gaussian variates.

The bounds obtained with the Chen-Stein method are frequently quite good. This is true in our simple example above, in which, when $t$ grows like $\sqrt{2\ln(n)}$, the rates of convergence of the bounds given above are exactly those of the bounds obtained by Rootzén (1983), shown there to be of best possible order. In this case, the coefficient of the leading term in (22) is about 1.92, compared to Rootzén's 4.47. The computation sketched above carries over with obvious modification for finite moving averages.

Finally, we end the section with the promised lemma:

LEMMA 1. Let $Y_1$, $Y_2$ be jointly standard normal with covariance $\rho$. For $t > 0$, write $u = t\sqrt{2/(1 + \rho)}$. Then

$P\{\min\{Y_1, Y_2\} > t\}$

(23) $\quad \le \sqrt{\frac{2(1+\rho)}{\pi(1-\rho)}}\, \bigl(\phi(u) - u(1 - \Phi(u))\bigr)$

(24) $\quad \le C(\rho) \Bigl(\frac{2}{u + \sqrt{u^2 + 4}}\Bigr)^{2\rho/(1+\rho)} (1 - \Phi(t))^{2/(1+\rho)}$,

where $C(\rho)$ is defined in (21).

PROOF. Note that $Y_1 + Y_2$ and $Y_1 - Y_2$ are uncorrelated, and that the event $\{\min\{Y_1, Y_2\} > t\}$ =


$\{|Y_1 - Y_2|/2 < (Y_1 + Y_2)/2 - t\}$. Hence

$P\{\min\{Y_1, Y_2\} > t\}$
$\quad = 2P\{0 < Y_1 - Y_2 < Y_1 + Y_2 - 2t \mid Y_1 + Y_2 > 2t\} \times P\{Y_1 + Y_2 > 2t\}$
$\quad \le 2E\Bigl\{ \frac{Y_1 + Y_2 - 2t}{\sqrt{4\pi(1-\rho)}} \Bigm| Y_1 + Y_2 > 2t \Bigr\} \times (1 - \Phi(u))$
$\quad = \sqrt{\frac{2(1+\rho)}{\pi(1-\rho)}} \Bigl( \frac{\phi(u)}{1 - \Phi(u)} - u \Bigr) \times (1 - \Phi(u))$,

proving the first inequality. To prove the second, use (20) repeatedly:

$\phi(u) - u(1 - \Phi(u)) \le \phi(u) \Bigl[ 1 - \frac{2u}{\sqrt{u^2+4} + u} \Bigr]$
$\quad = \phi(u)\, \frac{\sqrt{u^2+4} - u}{\sqrt{u^2+4} + u}$
$\quad = \sqrt{2\pi}^{\,(1-\rho)/(1+\rho)}\, [\phi(t)]^{2/(1+\rho)}\, \frac{4}{(\sqrt{u^2+4} + u)^2}$
$\quad \le \sqrt{2\pi}^{\,(1-\rho)/(1+\rho)} \Bigl( \frac{\sqrt{t^2+4} + t}{2} \Bigr)^{2/(1+\rho)} \frac{4}{(\sqrt{u^2+4} + u)^2}\, [1 - \Phi(t)]^{2/(1+\rho)}$
$\quad \le \sqrt{2\pi}^{\,(1-\rho)/(1+\rho)} \Bigl( \frac{2}{\sqrt{u^2+4} + u} \Bigr)^{2\rho/(1+\rho)} [1 - \Phi(t)]^{2/(1+\rho)}$.

4.5 Permutations with Restricted Positions

Consider a probability model in which all $n!$ permutations $\pi$ on $\{1, 2, \ldots, n\}$ are equally likely. For $i = 1, 2, \ldots, n$, let $F_i \subset \{1, 2, \ldots, n\}$ be given, to be thought of as the set of restricted positions for element $i$. The random variable

$W \equiv W(\pi) \equiv \sum_{i=1}^n 1(\pi_i \in F_i)$

counts the number of restricted positions taken by a random permutation. Its expectation is

$\lambda \equiv EW = n^{-1} \sum_{i=1}^n |F_i|$.

Our goal is to understand the relation between the number of permutations with no elements in restricted positions, and the Poisson approximation $n!\,e^{-\lambda}$. More generally, we are concerned with how well the distribution of $W$ matches the Poisson distribution with parameter $\lambda$, and how close in distribution are the family of dependent events $(\{\pi_i \in F_i\})_{i=1}^n$ and a family of independent events of the same individual probabilities.

The problem of permutations with restricted positions is also treated in Chen (1975b) and Barbour and Holst (1989), which contains many references. These papers also start with Stein's method as embodied by Equations (2) and (3). The next step, as presented clearly in Barbour and Holst (1989), is to look, for each of the events being counted, for a good coupling between the total number $W$ of events, and a random variable equal in distribution to the number of events, minus one, conditioned on the occurrence of the one selected event. That treatment of Stein's method allows the user more freedom of choice than what we are presenting in this paper as the "Chen-Stein method." For the benchmark example of permutations with restricted positions, the Chen-Stein method as presented here is both harder to use and gets a worse bound on the Poisson approximation for $W$. In detail, apart from constant factors, the bounds in the other two papers, and our term $b_1$, are equivalent, but we also have a term $b_3$, which is greater than $b_1$ by a factor which is of the order of $\log n$.

Overall then, for the class of problems as described in Example 1.3, our bound shows that the number of restricted positions taken by a uniformly selected random permutation is approximately Poisson, with a bound on the error decreasing at rate $\log n/n$. However, for no additional work, the Chen-Stein method yields information about the entire process of occurrences, via Theorems 2 and 3.

EXAMPLE 1.1. Derangements. Let $F_i = \{i\}$ for $i = 1$ to $n$. Then $W$ is the number of fixed points of a random permutation, $\{W = 0\}$ is the set of derangements of $n$ objects and $\lambda = 1$. This example is exceptional and misleading, in that the error in the Poisson approximation is superexponentially small as $n \to \infty$.
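The superexponential decay in Example 1.1 is easy to observe numerically. The sketch below is an illustration rather than part of the paper's argument: it uses the classical inclusion-exclusion formula $P(W = 0) = \sum_{k=0}^n (-1)^k/k!$ for the probability that a uniform random permutation is a derangement, and compares the distance to $e^{-1}$ with $1/(n+1)!$, the alternating-series bound on that distance. The function name is illustrative.

```python
import math
from fractions import Fraction


def derangement_probability(n: int) -> Fraction:
    """P(W = 0) for a uniform random permutation of n objects,
    computed exactly as sum_{k=0}^{n} (-1)^k / k!."""
    return sum(Fraction((-1) ** k, math.factorial(k)) for k in range(n + 1))


for n in (4, 8, 12):
    error = abs(float(derangement_probability(n)) - math.exp(-1))
    print(f"n = {n:2d}: |P(W=0) - 1/e| = {error:.3e}, "
          f"1/(n+1)! = {1 / math.factorial(n + 1):.3e}")
```

The error shrinks like $1/(n+1)!$, far faster than the $c\,n^{-1}$ rate quoted for the ménage problem in Example 1.2.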


EXAMPLE 1.2. The ménage problem. Let $F_i = \{i, i + 1\}$ for $i = 1$ to $n - 1$ and $F_n = \{n, 1\}$. Here $\lambda = 2$, and careful use of inclusion-exclusion shows that the Poisson approximation satisfies $|P(W = 0) - e^{-2}| \sim cn^{-1}$, in contrast to the superexponential decay of Example 1.1.

EXAMPLE 1.3. Derangements, ménage problems, etc. Let $F_i = \{i, i + 1, \ldots, i + d - 1\}$ for $i = 1$ to $n$, with the additions taken modulo $n$. Here $\lambda = d$, and the two examples above are the special cases $d = 1, 2$ of this more general example. In even greater generality, following Riordan (1978), we may consider $\{W = 0\}$ to be the set of permutations discordant with $d$ given permutations $\sigma_1, \sigma_2, \ldots, \sigma_d$, by taking $F_i = \{\sigma_1(i), \ldots, \sigma_d(i)\}$ for $i = 1$ to $n$. We will analyze this example using the Chen-Stein method below to get $b_1 \le d(2d - 1)/n$, $b_2 = 0$ and $b_3 = O(\log n/n)$.

EXAMPLE 2. Let $F_i = \{i, n\}$ for $i = 1$ to $n - 1$ and $F_n = \{n, 1\}$. Here $\lambda = 2$, and the Poisson approximation is not at all valid, since $P(W = 0) = 0$.

The "natural" way to use the Chen-Stein method would be to take $I = \{1, \ldots, n\}$ and $X_i \equiv 1(\pi_i \in F_i)$ for $i \in I$. The neighborhood of dependence in this setup would then be $B_i \equiv \{j \in I: F_i \cap F_j \ne \emptyset\}$.

Instead, we take an approach which gives symmetric treatment to the domain and range of the random permutation. Thus, we let

$I \equiv \{\alpha = (i, j): j \in F_i\}$

and

for $\alpha = (i, j) \in I$, $X_\alpha \equiv 1(\pi_i = j)$,

so that $I$ may be thought of as the set of restricted edges in the bipartite graph $K_{n,n}$ of possible edges between $n$ men and $n$ women, with $EX_\alpha = 1/n$ for all $\alpha \in I$, and $|I| = \lambda n$. Our choice of the neighborhood of dependence of $\alpha$ is the set of edges sharing an endpoint with $\alpha$:

for $\alpha = (i, j)$, $B_\alpha \equiv \{(i', j') \in I: i = i' \text{ or } j = j'\}$.

The first two components of the Chen-Stein bound then are

$b_1 = \sum_{\alpha \in I} \sum_{\beta \in B_\alpha} EX_\alpha EX_\beta = n^{-2} \sum_{\alpha \in I} |B_\alpha|$, $\quad b_2 = 0$.

In Example 1.3 we have $|B_\alpha| \le 2d - 1$, so $b_1 \le d(2d - 1)/n$. If we had used the "natural" setup described in the first paragraph of this section, then $b_1$ would be increased by a factor of $d$ in this example. Instead of having $b_2 = 0$, in the "natural" setup we would use the easily established bound $E(X_\alpha X_\beta) \le \frac{n}{n-1}(EX_\alpha EX_\beta)$, so that $b_2 \le \frac{n}{n-1}\, b_1$. For $b_3$, the technique we use below leads to the same upper bound on $b_3$ with either setup.

Random permutations embody long-range, global dependence, so any choice of neighborhoods will have $b_3 > 0$, other than the choice $B_\alpha = I$, which gives the uselessly large value $b_1 = \lambda^2$. It is possible but difficult to give a useful upper bound on $b_3$; we carry this out in the next three lemmas. The first lemma states the bound on $b_3$ and relies on the next two lemmas. The thing most worth observing in the lemmas below is the technique for getting a handle on $b_3$, displayed by the equality in (26): since each term of $b_3$ is the expectation of the absolute value of the conditional expectation of something with mean zero, each term can be expressed as twice the expectation of the positive part (or the negative part) of that conditional expectation.

LEMMA 2.

$b_3 \le \min_{1 \le k < n} \Bigl( \frac{2\lambda k}{n - k} + 2n\lambda 2^{-k} e^{\lambda e} \Bigr) \sim \frac{2\lambda}{n} \bigl( 2\log_2(n) + \lambda e/\ln 2 \bigr)$ if $\lambda = o(n)$.

PROOF. Fix $\alpha \in I$, let $V \equiv \sum_{\beta \notin B_\alpha} X_\beta$, and for $J \subset I - B_\alpha$ define the event

(25) $E_J \equiv \{X_\beta = 1\ \forall \beta \in J,\ X_\beta = 0\ \forall \beta \in I - B_\alpha - J\}$,

so that on the event $E_J$ we have $V = |J|$. There are $n\lambda$ contributions to $b_3$ of the form

$s_\alpha \equiv E\,\bigl| E(X_\alpha - EX_\alpha \mid \sigma(X_\beta: \beta \notin B_\alpha)) \bigr|$
$\quad = \sum_J \bigl| E(X_\alpha - EX_\alpha \mid E_J) \bigr|\, P(E_J)$
(26) $\quad = \sum_J 2\bigl( E(X_\alpha - EX_\alpha \mid E_J) \bigr)^+ P(E_J)$
$\quad \le 2 \sum_J \Bigl( \frac{1}{n - |J|} - \frac{1}{n} \Bigr) P(E_J)$ (using Lemma 3)
$\quad = 2 \sum_{0 \le j < n} \Bigl( \frac{1}{n - j} - \frac{1}{n} \Bigr) P(V = j)$
$\quad \le 2 \Bigl( \frac{1}{n - k} - \frac{1}{n} \Bigr) + 2P(V \ge k)$.

To bound the last term, we use the upper bound based on Lemma 4: since $V \le W$, $P(V \ge k) \le P(W \ge k) \le 2^{-k} E 2^W \le 2^{-k} e^{\lambda e}$. Multiplying by $|I| = n\lambda$, for every


positive integer $k$, we have the upper bound

$b_3 \le 2\lambda k/(n - k) + 2n\lambda 2^{-k} e^{\lambda e}$.

If $\lambda = o(n)$, then taking $k = \lceil 2\log_2(n) + \lambda e/\ln 2 \rceil$ makes the second term negligible and demonstrates the asymptotic claim. □

LEMMA 3. For the event $E_J$ defined in (25),

(27) $E(X_\alpha \mid E_J) \le 1/(n - |J|)$.

PROOF. To describe the conditioning event $E_J$, let $k = |J|$, and relabel the men and women so that $\alpha = (1, 1)$ and $J = \{(n - k + 1, n - k + 1), \ldots, (n, n)\}$. The conditioning event $E_J$ can then be viewed as a collection of matchings between a set of $n - k$ men and $n - k$ women, and the conditions of the form $X_\beta = 0$ forbid certain matchings, which do not involve man 1 or woman 1 since $\beta \notin B_\alpha$. This event $E_J$ can be partitioned into $n - k$ subsets according to the mate chosen by man 1, $U_j \equiv \{\pi \in E_J: \pi_1 = j\}$, so that the conditional probability above is the ratio $|U_1|/|E_J|$. Thus, it suffices to show that, for $j = 1$ to $n - k$, we have $|U_1| \le |U_j|$. We prove that $|U_1| \le |U_j|$ for $j = 2$ to $n - k$ by presenting a one-to-one map $f$ which maps $U_1$ into $U_j$, namely composition with the appropriate transposition:

$f(\pi) = (1\ j) \circ \pi$.

Informally, $f$ is the map that has women 1 and $j$ swap their mates. We observe that when $\pi \in U_1$, then $f(\pi) \in U_j$, because the two new matchings created involve man 1 or woman 1, and hence $E_J$ places no restriction on the use of these edges. We have inequality in Lemma 3 because $f$ may not be onto, for if $\sigma \in U_j$ with $\sigma_i = 1$, then $\pi = f^{-1}(\sigma)$ has $\pi_1 = 1$ and $\pi_i = j$, but if $\beta = (i, j) \in I$ is a restricted edge, then $X_\beta(\pi) = 1$ so that $\pi \notin E_J$, hence $\pi \notin U_1$. □

LEMMA 4.

$E 2^W \le e^{\lambda e}$.

PROOF. Observe that, for any $J \subset I$ with $|J| = k$, $E \prod_{\alpha \in J} X_\alpha$ is either zero, in case any of the edges in $J$ intersect, or else is $(n - k)!/n! = 1/(n)_k$. Recall that $|I| = n\lambda$. Thus

$E 2^W = E \prod_{\alpha \in I} (1 + X_\alpha) = \sum_{J \subset I} E \prod_{\alpha \in J} X_\alpha$.

For the $k$th term of the last sum we have

$\binom{n\lambda}{k} \frac{1}{(n)_k} \le \frac{(n\lambda)^k}{k!\,(n/e)^k} = \frac{(\lambda e)^k}{k!}$.

4.6 Cycles in Permutations and Random Mappings

As in Section 4.5, we again consider a probability model in which all $n!$ permutations $\pi$ on $\{1, 2, \ldots, n\}$ are equally likely. Our goal is to understand to what extent cycles of a random permutation occur approximately independently and whether a Poisson approximation holds for the numbers of cycles of various lengths. For any fixed $j \ge 1$, it is easy to show by inclusion-exclusion that the number $W_j$ of cycles of length $j$ converges to $Z_j$, a Poisson random variable, with $EZ_j = 1/j$.

The first interesting phenomenon illustrated by this example is that the Chen-Stein method and the idea of computing the total variation distance to a process with independent coordinates let us compute a "critical boundary" for Poisson approximation. Consider $Z \equiv (Z_1, Z_2, \ldots)$ the Poisson process with $EZ_j = 1/j$ and independent coordinates. It can be shown by inclusion-exclusion that finite dimensional distributions of the cycle counting process converge to those of $Z$, and, since $\sum_j jW_j = n$, it is easy to see that the full process counting cycles, $W \equiv (W_1, \ldots, W_n)$, is not close, in total variation, to the first $n$ coordinates of the Poisson process $Z$. We will consider jointly all cycles of length 1 (i.e., fixed points), 2, 3, ..., $f(n)$, where, for example, $f(n)$ grows like $\sqrt{n}$ or $n/\log n$. It turns out that a Poisson approximation for $\{W_1, \ldots, W_{f(n)}\}$ is good as long as $f(n) = o(n)$.

The second phenomenon illustrated is that a Poisson approximation for the process $\{W_1, \ldots, W_{f(n)}\}$ may be valid even when the Chen-Stein method fails. This occurs here in all cases where $f(n)/\sqrt{n} \to \infty$ and $f(n)/n \to 0$. In these cases, the process of indicators is not approximately independent, the total variation distances in Theorems 2 and 3 tend to 2 and, hence, $(b_1 + b_2 + b_3)$ cannot tend to zero and Theorem 1 cannot yield a successful approximation for $W$. This reflects what is usually a virtue of the Chen-Stein method; when the method does certify a successful approximation for the number $W$ of occurrences via $(b_1 + b_2 + b_3) \to 0$, the method would also imply, through Theorems 2 and 3, that the process of indicators is approximately independent.

4.6.1 Independence among the short cycles

The natural way to establish a Poisson approximation for $(W_1, \ldots, W_{f(n)})$ is to use an index set $I$ consisting of all cycles of length at most $f(n)$. For $j \ge 1$, let $I_j$ be the set of cyclic permutations $\alpha$ of exactly $j$ elements of $\{1, 2, \ldots, n\}$, so that $|I_j| = \binom{n}{j}(j - 1)!$. For $\alpha \in I_j$, let $X_\alpha$ be the indicator of the event


approximateliyndependentT.he cycleeventsprocess (i,j)-a practiceleadingto thebiologistst'erm"dot
X ofthepreviousectionnotonlydetectsthenumber matrixanalysis."
of cyclesof each lengthup to f(n), it also carries A naturalquestionarisesfromcomparisonoftwo
informationabout whichelementsof 1,2, ** , n are ormoresuchstringsw, henthescientiswt antstoknow
relatedbybeinginthesameshortcycle. whethera comparisondetectsan unusualcongruence
sharedamongthestringsI.n ourproblemc,ongruence
It is possibleto getaroundthisproblemb,yassign- is measuredby the numberof lettersthat match
ingto eachcycleofa randompermutatioinr exactly betweentwosubwordsofthesesequences.Although
oneofitselementsas its"marker.I"n moredetail,let importanbtiologicalquestionsinvolvethemoregen-
I=1,*, n x 1,.. , f(n)}, andfor(i,j) E I, eralnotionsofinsertionanddeletionw, erestricotur
studyto thesimplerquestionofmatchingand non-
Xi = 1{i "marks"a cycle in ir of lengthj }. matchinpgositionsS. uchstatisticaplroblemasrenat-
Withthissetup,Wj,thenumberofcyclesoflengthj, urallycastin theusualhypothesis-testicnogntextin
equals 1-si-n Xij. Two elementsa = (i,j), d = (k,1) whichwe needto computethetail probability(the
are takento be neighborisf i = k, so b2 = 0. The biologistsp'-value)fora seemingluynusualevent.
computatioonfb,showsthatweshouldbe carefuiln The standardtoolusedto solvesuchproblemshas
pickinga notionof "markinga" cycle:if a cycleis beena probabilistiucseoftheBonferronini equalities
ims aarmkeedsbsyyditescsrmeaaslilnegfstuelnecmtieonnotft,ih. eIfnifnosrteeaacdhwj,eEtaXkiej as pioneeredin Watson(1954).See, forexamplet,he
an independentauxiliaryrandompermutationto momenctalculationisnKarlinandOst(1987)andthe
serveas a rankingandmarka cyclebyitselementof discussioninKarlin,GhandourO, st,Tavar6andKorn
smallestrank,thenEXij = 1/(nj) and (1983). Use of the Bonferroniinequalitiesrequires
computatioonfmomentosfarbitrarilyargeordert;he
= (xt n(ni ) 1) 2 task is always tediousand frequentlytechnically
(log fn(n))2 demandingT. he technicaldifficultietshatcan now

even forf(n) = n. be avoidedare exemplifieidn Arratia,Gordonand
Waterman(1986).The Chen-Steinmethodallowsfor
Finally, with either notion of marking, it should be the case that $b_3 \to 0$ if and only if $f(n)/n \to 0$. We believe this on intuitive grounds but we have not tried to give a detailed proof, since the direct proof of (29) is simple and yields a stronger result than the Chen-Stein method would yield with this setup.

5. A BIOLOGICAL EXAMPLE

Our continuing desire is to solve problems relevant to molecular biology. This desire was the original motivation for studying the Chen-Stein Poisson approximation. In this section, we present its application to a problem motivated by current data analytic techniques in molecular biology.

A strand of DNA can be represented as a long string of letters from the finite alphabet {a, c, g, t}. Currently, a large amount of effort is being expended in the determination and subsequent compilation of genetic information from various organisms. This information consists of listings of these long strings. These data are collected in international databases, GenBank in the United States and EMBL in Europe. Currently, release 62.0 of GenBank contains 37,183,950 letters of DNA, made up roughly of sequences of 1,000 letters each. Given two strings of n and m letters, information about their comparison is conceptually summarized as a matrix of n x m positions in which a match of letters in positions i and j is traditionally represented by a dot in position

an easier treatment of the same problem that leads to stronger results with no extra work.

In Arratia, Gordon and Waterman (1990), we study a more general version of the following problem. Let $A_1, \ldots, A_n, \ldots$ and $B_1, \ldots, B_n, \ldots$ be independently chosen according to the same common distribution $\{\mu_a\}$ from a common alphabet $\{1, 2, \ldots, d\}$. Choose a test value $t$ and compute

$$M_n(t) = \max_{1 \le i, j \le n-t+1} \sum_{k=0}^{t-1} 1\{A_{i+k} = B_{j+k}\}, \tag{30}$$

the largest number of matches witnessed by any comparison of length $t$ substrings. What is the distribution of $M_n(t)$?

An asymptotic analysis is possible using the Chen-Stein method. Effectively, we rigorously use Aldous' (1989) Poisson clumping heuristic to obtain rates of convergence for the Erdős-Rényi strong law, with a two-dimensional index set. While a proof is beyond the scope of this paper, we can easily guess the ultimate result.

Choose $s \le t$. Write $\alpha = (i, j)$, let $Y_\alpha = 1\{S_\alpha \ge s\}$ and $S_\alpha = \sum_{k=0}^{t-1} 1\{A_{i+k} = B_{j+k}\}$. The special case $s = t$ corresponds to perfect matching, which is similar to the case of perfect head runs dealt with in Section 4.2. We have $P\{M_n(t) < s\} = P\{\sum_\alpha Y_\alpha = 0\}$. Denote by $p = \sum_a \mu_a^2$ the probability of seeing a match between two arbitrarily selected letters. Each $S_\alpha$ is distributed as binomial($t$, $p$), and there are
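
As a rough numerical illustration of the quantities defined in this section, the following Python sketch evaluates the statistic $M_n(t)$ of (30) directly from its definition and computes the match probability $p$ for a given letter distribution. The function names `max_matches` and `match_probability` are hypothetical, indices are 0-based rather than 1-based as in (30), and the brute-force loop is only meant to mirror the definition, not to reproduce any method of the paper.

```python
def max_matches(a, b, t):
    """Brute-force M_n(t): the largest number of positionwise matches
    between any length-t window of a and any length-t window of b.
    Hypothetical helper; runs in O(n^2 * t) time."""
    n = len(a)
    if len(b) != n:
        raise ValueError("both sequences must have the same length n")
    if not 1 <= t <= n:
        raise ValueError("t must satisfy 1 <= t <= n")
    best = 0
    # i and j range over the n - t + 1 possible window starts (0-based).
    for i in range(n - t + 1):
        for j in range(n - t + 1):
            # S_alpha for alpha = (i, j): matches along the diagonal window.
            s_alpha = sum(1 for k in range(t) if a[i + k] == b[j + k])
            best = max(best, s_alpha)
    return best


def match_probability(mu):
    """p = sum of mu_a squared: the chance that two letters drawn
    independently from the distribution mu coincide."""
    return sum(w * w for w in mu)
```

For instance, `max_matches("acgtac", "ttgtaa", 3)` returns 3 because the window gta occurs in both strings, and `match_probability([0.25] * 4)` returns 0.25 for the uniform distribution on four letters.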


BREIMAN, L. (1968). Probability. Addison-Wesley, Reading, Mass.
CHEN, L. H. Y. (1975a). Poisson approximation for dependent trials. Ann. Probab. 3 534-545.
CHEN, L. H. Y. (1975b). An approximation theorem for sums of certain randomly selected indicators. Z. Wahrsch. Verw. Gebiete 33 69-74.
CHEN, L. H. Y. (1978). Two central limit problems for dependent random variables. Z. Wahrsch. Verw. Gebiete 43 223-243.
CHEN, L. H. Y. and HO, S. T. (1978). An L_p bound for the remainder in a combinatorial central limit theorem. Ann. Probab. 6 231-249.
DIACONIS, P. and MOSTELLER, F. (1989). Methods for studying coincidences. J. Amer. Statist. Assoc. 84 853-861.
ERICKSON, R. V. (1974). L1 bounds for asymptotic normality of m-dependent sums using Stein's technique. Ann. Probab. 2 522-529.
FICKETT, J. W. and BURKS, C. (1988). Development of a database for nucleotide sequences. In Mathematical Methods for DNA Sequences (M. S. Waterman, ed.) 1-44. CRC Press, Boca Raton, Fla.
HALL, P. (1980). Estimating probabilities for normal extremes. Adv. in Appl. Probab. 12 491-500.
HALL, P. (1988). Introduction to the Theory of Coverage Processes. Wiley, New York.
HECKMAN, N. (1988). Bump hunting in regression analysis. Preprint.
HOLST, L. (1986). On birthday, collectors', occupancy and other classical urn problems. Internat. Statist. Rev. 54 15-27.
HOLST, L. and JANSON, S. (1990). Poisson approximation using the Stein-Chen method and coupling: Number of exceedances of Gaussian random variables. Ann. Probab. 18 713-723.
HUDSON, H. M. (1978). A natural identity for exponential families with applications in multiparameter estimation. Ann. Statist. 6 473-484.
JANSON, S. (1986). Birthday problems, randomly colored graphs, and Poisson limits of dissociated variables. Tech. report 1986 16. Dept. Math., Uppsala Univ.
KARLIN, S. (1982). Some results on optimal partitioning of variance and monotonicity with truncation level. In Statistics and Probability: Essays in Honor of C. R. Rao (P. Kallianpur, R. Krishnaiah and J. K. Ghosh, eds.) 375-382. North-Holland, Amsterdam.
KARLIN, S., GHANDOUR, G., OST, F., TAVARE, S. and KORN, L. J. (1983). New approaches for computer analysis of nucleic acid sequences. Proc. Nat. Acad. Sci. U.S.A. 80 5660-5664.
KARLIN, S. and OST, F. (1987). Counts of long aligned word matches among random letter sequences. Adv. in Appl. Probab. 19 293-351.
LEADBETTER, M. R., LINDGREN, G. and ROOTZEN, H. (1983). Extremes and Related Properties of Random Sequences and Processes. Springer, New York.
MOLER, C., ULLMAN, M., LITTLE, J. and BANGERT, S. (1987). Pro-MATLAB User's Manual. The Math Works, Sherborn, Mass.
OHYAMA, K., FUKUZAWA, H. and KOHCHI, T. ET AL. (1986). Chloroplast gene organization deduced from complete sequence of liverwort Marchantia polymorpha chloroplast DNA. Nature 322 572-574.
OLKIN, I. and MARSHALL, A. (1979). Inequalities: Theory of Majorization and Its Applications. Academic, New York.
RIORDAN, J. (1978). An Introduction to Combinatorial Analysis. Princeton Univ. Press.
ROOTZEN, H. (1983). The rate of convergence of extremes of stationary normal sequences. Adv. in Appl. Probab. 15 54-80.
ROSENBLATT, M. (1974). Random Processes. Springer, New York.
SAMPFORD, M. R. (1953). Some inequalities on Mill's ratio and related functions. Ann. Math. Statist. 24 130-132.
SERFLING, R. J. (1975). A general Poisson approximation theorem. Ann. Probab. 3 726-731.
STEIN, C. M. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proc. Third Berkeley Symp. Math. Statist. Probab. 1 197-206. Univ. California Press, Berkeley, Calif.
STEIN, C. M. (1972). A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Proc. Sixth Berkeley Symp. Math. Statist. Probab. 2 583-602. Univ. California Press, Berkeley, Calif.
STEIN, C. M. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9 1135-1151.
STEIN, C. M. (1986). Approximate Computations of Expectations. IMS, Hayward, Calif.
STEIN, C. M. (1987). The number of monochromatic edges in a graph with randomly colored vertices. Unpublished manuscript.
TAKACS, L. (1988). On the limit distribution of the number of cycles in a random graph. J. Appl. Prob. 26 359-376.
TAVARE, S. and GIDDINGS, B. W. (1989). Some statistical aspects of the primary structure of nucleotide sequences. In Mathematical Methods for DNA Sequences (M. S. Waterman, ed.) 117-132. CRC Press, Boca Raton, Fla.
WATSON, G. S. (1954). Extreme values in samples from m-dependent stationary stochastic sequences. Ann. Math. Statist. 25 798-800.
WILF, H. S. (1983). Three problems in combinatorial asymptotics. J. Combin. Theory Ser. A 35 199-207.

Comment

J. Michael Steele

J. Michael Steele is Professor of Statistics, Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104-6302, and Editor of The Annals of Applied Probability.

This beautiful exposition leaves little room for quibbles. Still, if forced to raise some issue, I suspect my best shot is to point out that, despite its power, the Chen-Stein method is not omnipotent. In fact, there are simple problems where one might suspect that a Poisson law lurks below the surface, yet the hooks provided by the Chen-Stein method leave us without a catch.

Consider a simple random walk $S_n = X_1 + X_2 + \cdots + X_n$ in $\mathbb{R}^2$ where the $X_i$ are iid. To make life as
