Published by INTERTU℠ EDUCATION, 2022-08-19 08:19:44

Math AI HL

8B Distribution of X̄

9 Given that X ~ N (12,16) and Y ~ N (8,25) , find:
a P(3X + 1 < 5Y)
b P (X1 + X2 > Y1 − Y2)

10 The random variable X has mean 100 and standard deviation 25. For a sample of 30 observations with sample mean X̄, find:

a P(X̄ < 98)
b P(X̄ − 95 > 10)

11 An airline has found that the mass of their passengers follows a normal distribution with mean 82.2 kg
and variance 10.7 kg². The mass of their hand luggage follows a normal distribution with mean 9.1 kg and
variance 5.6 kg².

a State the distribution of the total mass of a passenger and their hand luggage and find any necessary
parameters.

b Find the probability that the total mass of a passenger and their luggage exceeds 100 kg.

12 Evidence suggests that the time Alan takes to run 100 m is normally distributed with mean 13.1 seconds and
standard deviation 0.4 seconds. The time Bryan takes to run 100 m is normally distributed with mean 12.8 seconds
and standard deviation 0.6 seconds.

a Find the mean and standard deviation of the difference (Alan – Bryan) between Alan’s and Bryan’s times.

b Find the probability that Alan finishes a 100 m race before Bryan.

c Find the probability that Bryan beats Alan by more than one second.

13 A machine produces metal rods so that their length follows a normal distribution with mean 65 cm and variance
0.03 cm². The rods are checked in batches of six, and a batch is rejected if the average length is less than 64.8 cm
or more than 65.3 cm.

a Find the mean and the variance of the average of a random sample of six rods.

b Hence find the probability that a batch is rejected.

14 The distribution of lengths of pipes produced by a machine is normal with mean 40 cm and standard deviation 3 cm.

a Find the probability that a randomly chosen pipe has a length of 42 cm or more.

b Find the probability that the average length of a randomly chosen set of 10 pipes of this type is 42 cm or more.

15 The random variable X has mean 12 and standard deviation 3.5. A sample of 40 independent observations of X is
taken.

Use the central limit theorem to calculate the probability that the mean of the sample is between 13 and 14.

16 The mean mass of a pineapple is 145 g with variance 96 g². A crate is filled with 70 pineapples.

Find the probability that the total mass of the pineapples in the crate is less than 10 kg.

17 Winnie eats an average of 1900 kcal each day with a standard deviation of 400 kcal.

Find the probability that in a 31-day month she eats more than 2000 kcal per day on average.

18 The masses, X kg, of male birds of a certain species are normally distributed with mean 4.6 kg and standard
deviation 0.25 kg. The masses, Y kg, of female birds of this species are normally distributed with mean 2.5 kg and
standard deviation 0.2 kg.

a Find the mean and variance of 2Y − X.

b Find the probability that the mass of a randomly chosen male bird is more than twice the mass of a randomly
chosen female bird.

c Find the probability that the total mass of three male birds and four female birds (chosen independently)
exceeds 25 kg.

19 A shop sells apples and pears. The masses, in grams, of the apples may be assumed to have a N(180, 12²)
distribution and the masses of the pears, in grams, may be assumed to have a N(100, 10²) distribution.

a Find the probability that the mass of a randomly chosen apple is more than double the mass of a randomly
chosen pear.

b A shopper buys 2 apples and a pear. Find the probability that the total mass is greater than 500 grams.

294 8 Probability

20 The length of a grass snake is normally distributed with mean 1.2 m. The probability of a randomly selected
sample of 5 grass snakes having a mean length greater than 1.4 m is 5%.
Find the standard deviation of the length of a grass snake.

21 Given that X ~ B(10, 0.6), find the probability that the mean of 35 independent observations of X is greater
than 6.5.
22 The average mass of a sheet of A4 paper is 5 g, and the standard deviation of the masses is 0.08 g.

a Find the mean and standard deviation of the mass of a ream of 500 sheets of A4 paper.
b Find the probability that the mass of a ream of 500 sheets is within 5 g of the expected mass.
c Explain how you have used the central limit theorem in your answer.
23 Company A and company B specialize in roadside repairs of vehicles that have broken down. The mean time for
A to attend a breakdown is 45 minutes with a standard deviation of 22 minutes; for B the mean time is 51 minutes
with a standard deviation of 25 minutes.
A sample of 40 breakdowns for A and 50 breakdowns for B are taken.
a Find the probability that, from these samples, the mean time for B to attend is within 4 minutes of the mean
time for A to attend.
b Explain why you needed to use the central limit theorem in part a.
24 Boys’ scores in a test follow the distribution N(50, 25). Girls’ scores follow N(60, 16).
a Find the probability that a randomly chosen boy and a randomly chosen girl differ in score by less than five.
b Find the probability that a randomly chosen boy scores less than three quarters of the mark of a randomly
chosen girl.
25 The daily rainfall in Statsville follows a normal distribution with mean μ mm and standard deviation σ mm.
The rainfall each day is independent of the rainfall on other days.
On a randomly chosen day, there is a probability of 0.1 that the rainfall is greater than 8 mm.
In a randomly chosen 7-day week, there is a probability of 0.05 that the mean daily rainfall is less than 7 mm.
Find the value of μ and of σ.
26 Tambara goes to school by train. The time she waits each morning is normally distributed with a mean of
12 minutes and a standard deviation of 4 minutes.
a On a specific morning, find the probability that Tambara waits more than 20 minutes.
b During a particular week (Monday to Friday), find the probability that

i her total morning waiting time does not exceed 70 minutes
ii she waits less than 10 minutes on exactly two mornings of the week
iii her average morning waiting time is more than 10 minutes.
c Given that the total morning waiting time for the first four days is 50 minutes, find the probability that the
average for the week is over 12 minutes.
d Given that Tambara’s average morning waiting time in a week is over 14 minutes, find the probability that it is
less than 15 minutes.
27 The times Johannes takes to answer a multiple choice question are normally distributed with mean 1.5 minutes
and standard deviation 0.6 minutes. He has one hour to complete a test consisting of 35 questions.
a Assuming that the questions are independent, find the probability that Johannes does not complete the test in time.
b Explain why you did not need to use the central limit theorem in your answer to part a.
28 A random variable has mean 15 and standard deviation 4. A large number of independent observations of the
random variable is taken.
Find the minimum sample size so that the probability that the sample mean is more than 16 is less than 0.05.

8C Poisson distribution

You have already met two ‘standard’ probability distributions that arise as the result of
commonly occurring circumstances: one discrete (the binomial distribution) and one
continuous (the normal distribution).

Another of these ‘standard’ distributions is the Poisson distribution, which models
the number of events in a fixed period given the average rate at which they occur.
For example, if you know the average number of visits per hour to a website, then you
could use the Poisson distribution to predict the probability of getting a certain number
of visits in the next hour.

As with the binomial distribution, you need certain conditions to be satisfied for the
Poisson distribution to be appropriate.

KEY POINT 8.7

The Poisson distribution occurs when the following conditions are satisfied:
● events are independent of each other
● events occur at a constant average rate
● events occur singly (one at a time).

Tip
If the average rate of success, or events occurring at a constant rate, is mentioned, you
should use the Poisson distribution.
If you can identify a fixed number of trials, then the binomial distribution is usually
required.

CONCEPTS – MODELLING

It is often the case in the real world that the conditions for a Poisson distribution won’t
be perfectly met. In the case of the example of visits to a website, it is unlikely that the
average rate of visits is constant throughout the day. Nevertheless, it may still be that
modelling this situation with a Poisson distribution provides useful information.

If a random variable X has the Poisson distribution with constant average rate λ , we
write X ~ Po(λ).

WORKED EXAMPLE 8.8

Jorge receives emails independently of each other at a constant rate of 6 per hour.

The random variable X is the time between emails arriving.

Is this situation suitable for X to be modelled by a Poisson distribution?

The conditions are met, but for X to follow a Poisson distribution it would need to measure
the number of events, not the time between events.

No, since X is not the number of emails arriving in a given time period.

You are the Researcher
In Worked Example 8.8, X is actually modelled by a distribution closely related to
the Poisson distribution called the exponential distribution.

As with the binomial and normal distributions, you can calculate probabilities from the
Poisson distribution with your GDC.

WORKED EXAMPLE 8.9

A water pipe has on average three leaks in each 5 km section, distributed independently of
each other. An inspector from the water company examines a 5 km section of the pipe.

Find the probability that he
a finds exactly 2 leaks
b finds at least 2 leaks.

Let X be the number of leaks in a 5 km section of pipe.

X is Poisson with λ = 3:
X ~ Po(3)

a Use the GDC in Poisson P.D. mode as you want the probability of X taking a single value:
P(X = 2) = 0.224

b Use the GDC in C.D. mode as you want the probability of X taking a range of values:
P(X ≥ 2) = 1 − P(X ≤ 1)
= 0.801

Tip
Some calculators can find this probability directly without the preliminary calculation.
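
The two GDC results in Worked Example 8.9 can also be checked by hand or in code. The following is a minimal Python sketch, assuming the standard Poisson formula P(X = k) = e^(−λ) λ^k / k!; the helper names poisson_pmf and poisson_cdf are illustrative and do not come from the text.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Po(lam); returns 0 when k is negative."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


lam = 3  # average number of leaks in a 5 km section (Worked Example 8.9)

p_exactly_2 = poisson_pmf(2, lam)        # "P.D. mode": a single value
p_at_least_2 = 1 - poisson_cdf(1, lam)   # "C.D. mode": 1 - P(X <= 1)

print(round(p_exactly_2, 3), round(p_at_least_2, 3))
```

The second line of working mirrors the preliminary calculation in part b: the cumulative function only gives probabilities of the form P(X ≤ k), so "at least 2" is rewritten as the complement of "at most 1".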

■ Mean and variance of the Poisson distribution

As for the binomial distribution, there are formulae for the expected value and variance
of a Poisson distribution. Since λ is the constant average rate at which events occur, it
should be no surprise that λ is the expected value.

KEY POINT 8.8

If X ~ Po(λ), then

● E(X) = λ

● Var(X) = λ

CONCEPTS – MODELLING

Notice that the mean is equal to the variance for the Poisson distribution. This is
something we look out for when determining whether data is likely to fit a Poisson
model, although that in itself is not sufficient to decide: there are other
distributions with this feature.
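
As a numerical illustration of Key Point 8.8, the sketch below sums the probabilities directly to estimate E(X) and Var(X) for X ~ Po(3). It assumes the standard Poisson formula and truncates the infinite sum after 100 terms, which is an approximation chosen for this example; the function names are hypothetical.

```python
import math


def poisson_probabilities(lam, terms):
    """Return [P(X = 0), ..., P(X = terms - 1)] for X ~ Po(lam).

    Each probability is built from the previous one using
    P(X = k) = P(X = k - 1) * lam / k, which avoids large factorials.
    """
    probs = [math.exp(-lam)]
    for k in range(1, terms):
        probs.append(probs[-1] * lam / k)
    return probs


def mean_and_variance(lam, terms=100):
    """Approximate E(X) and Var(X) by summing over the first `terms` values."""
    probs = poisson_probabilities(lam, terms)
    mean = sum(k * p for k, p in enumerate(probs))
    second_moment = sum(k * k * p for k, p in enumerate(probs))
    return mean, second_moment - mean**2


mean, variance = mean_and_variance(3)
print(round(mean, 6), round(variance, 6))
```

Both values should come out equal to λ up to rounding error, which is the feature the Concepts box suggests looking for in data.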

WORKED EXAMPLE 8.10

A radioactive substance emits beta particles at a constant average rate of 20 per hour. If X is
the number of beta particles emitted per hour, find

a the mean of X
b the standard deviation of X.

X is Poisson with λ = 20:
X ~ Po(20)

a Use E(X) = λ:
E(X) = 20

b Use Var(X) = λ:
Var(X) = 20
Standard deviation is the square root of the variance.
So, standard deviation = √20 = 4.47

The Poisson distribution is scalable. For example, if the number of birds seen on the
branch of a tree in 10 minutes follows a Poisson distribution with mean λ, then the
number of birds seen on the branch in 20 minutes follows a Poisson distribution with
mean 2λ, and the number of birds seen on the branch in 5 minutes follows a Poisson
distribution with mean λ/2.

WORKED EXAMPLE 8.11

Assuming that the number of buses arriving at a bus stop in a one-hour period follows a
Poisson distribution, with mean 15, find the probability that there are fewer than 8 buses in
a 20 minute period.

Let X be the number of buses arriving in 20 mins.

Buses arrive at a rate of 15 per hour so λ = 5 per 20 minutes:
X ~ Po(5)

Find the required probability from the GDC. Remember that the GDC will give probabilities
of the form P(X ≤ k):
P(X < 8) = P(X ≤ 7)
= 0.867
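
The scaling step in Worked Example 8.11 can be sketched in Python as follows. This is an illustration only: scaled_mean and poisson_cdf are hypothetical helper names, and the cumulative probability is computed from the standard Poisson formula rather than a GDC.

```python
import math


def scaled_mean(mean, base_period, new_period):
    """Rescale a Poisson mean from one time period to another (same units)."""
    return mean * new_period / base_period


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Po(lam), summing terms built up one at a time."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        total += term
    return total


lam_20_min = scaled_mean(15, base_period=60, new_period=20)  # 15 per hour -> 5 per 20 minutes
p_fewer_than_8 = poisson_cdf(7, lam_20_min)                  # P(X < 8) = P(X <= 7)

print(lam_20_min, round(p_fewer_than_8, 3))
```

Note that "fewer than 8" has to be converted to "at most 7" before using the cumulative function, exactly as in the worked solution.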

■ Sum of two independent Poisson distributions

The scalability of the Poisson distribution is a consequence of a more general result
about the Poisson distribution. If two independent variables both follow a Poisson
distribution, then so does their sum.

KEY POINT 8.9

If X ~ Po(λ_X) and Y ~ Po(λ_Y) are two independent Poisson distributions and Z = X + Y, then
Z ~ Po(λ_X + λ_Y)

WORKED EXAMPLE 8.12

Zhuo runs a website that provides video tutorials for IB maths and IB physics topics.
The website receives an average of 7.8 hits an hour for maths and 6.5 hits an hour for physics.

a Assuming that the hits for maths and physics form independent Poisson
distributions, find the probability that his website gets more than 15 hits an hour.

b Explain why the assumption that the hits for maths and physics form independent
Poisson distributions is unlikely to be true.

a Let Z be the number of hits per hour.

Use Z ~ Po(λ_X + λ_Y):
Z ~ Po(7.8 + 6.5) = Po(14.3)

Find the required probability from the GDC:
P(Z > 15) = 1 − P(Z ≤ 15)
= 1 − 0.639
= 0.361

b The rate of hits to the website is unlikely to be constant
– there will probably be more at some times of the day
than others.
The two distributions are probably not independent of
each other, as times when more maths hits occur are
likely to be similar to times when more physics hits
occur (possibly even by the same users).
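
The result in Key Point 8.9 can be checked numerically for the rates in Worked Example 8.12. The sketch below, which assumes independence exactly as part a does, computes P(X + Y = k) by adding up every way of splitting k hits between maths and physics and compares it with the Po(14.3) probability. The function names are illustrative.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def sum_pmf(k, lam_x, lam_y):
    """P(X + Y = k) for independent X ~ Po(lam_x), Y ~ Po(lam_y).

    Adds P(X = i) * P(Y = k - i) over every split i = 0, 1, ..., k.
    """
    return sum(poisson_pmf(i, lam_x) * poisson_pmf(k - i, lam_y)
               for i in range(k + 1))


lam_x, lam_y = 7.8, 6.5  # maths and physics hits per hour

# Largest difference between the split-by-split sum and Po(14.3), for k = 0..39.
max_gap = max(abs(sum_pmf(k, lam_x, lam_y) - poisson_pmf(k, lam_x + lam_y))
              for k in range(40))

# P(Z > 15) = 1 - P(Z <= 15) with Z ~ Po(14.3).
p_more_than_15 = 1 - sum(poisson_pmf(k, lam_x + lam_y) for k in range(16))

print(max_gap, round(p_more_than_15, 3))
```

The gap should be at the level of floating-point rounding, and the final probability should agree with the worked answer of about 0.361.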

Be the Examiner 8.3

The number of errors in a Maths text book is believed to follow a Poisson distribution with a
mean of 1.5 errors per 10 pages.

Find the probability that there are more than 5 errors in 50 pages.

Which is the correct solution? Identify the errors made in the incorrect solutions.

Solution 1
X ~ Po(1.5)
5 errors in 50 pages is equivalent to 1 error in each 10 pages.
P(X > 1) = 1 − P(X ≤ 1)
= 0.442

Solution 2
X ~ Po(7.5)
P(X > 5) = 1 − P(X ≤ 5)
= 0.759

Solution 3
X ~ Po(7.5)
P(X > 5) = 1 − P(X ≤ 4)
= 0.868

Exercise 8C

For questions 1 to 4, use the method demonstrated in Worked Example 8.8 to decide whether the random variable X
could be modelled by a Poisson distribution. If so, state how well the conditions are likely to be met.

1 a A radioactive element emits beta particles at a mean rate of 3 per minute and X is the number of minutes in the
next hour in which at least 2 particles are emitted.

b Fish occur at a mean rate of 4 per m³ in a certain 1000 m³ volume of the sea and X is the number of 1 m³
volumes of water, out of 5 surveyed, that contain exactly 4 fish.

2 a A radioactive element emits beta particles at a mean rate of 3 per minute and X is the number emitted each
minute.

b Fish occur at a mean rate of 4 per m³ in a certain 1000 m³ volume of the sea and X is the number of fish per m³.

3 a A radioactive element emits beta particles at a mean rate of 3 per minute and X is the length of time until the next
emission.

b Fish occur at a mean rate of 4 per m³ in a certain 1000 m³ volume of the sea and X is the volume of water that
needs to be searched before the next fish is found.

4 a A radioactive element emits beta particles at a mean rate of 3 per minute and X is the number of minutes that
pass until the first emission is observed.

b Fish occur at a mean rate of 4 per m³ in a certain 1000 m³ volume of the sea and X is the number of 1 m³
volumes of water that need to be searched until the first fish is found.

For questions 5 to 12, use the method demonstrated in Worked Example 8.9 to find the required probabilities given that
X ~ Po(8.4) .
5 a P (X = 7) 6 a P(X  4) 7 a P (X < 11) 8 a P(X  6)
b P (X = 10) b P(X  9) b P (X < 8) b P(X  3)

9 a P(X > 12) 10 a P(7 ≤ X ≤ 10) 11 a P(7 < X < 9) 12 a P(10 ≤ X < 13)
b P(X > 7) b P(5 ≤ X ≤ 8) b P(6 < X < 12) b P(10 < X ≤ 15)

For questions 13 and 14, use the method demonstrated in Worked Example 8.10 to find the required values.

13 a Find E(X) if X ~ Po(3.1)
b Find Var(X) if X ~ Po(3.1)

14 a Find the standard deviation of X if X ~ Po(5.3)
b Find the mean of X if X ~ Po(5.3)

For questions 15 to 17, use the method demonstrated in Worked Example 8.11 to find the distribution of Y given that X is
the number of cars passing a checkpoint in a 20 second period and X ~ Po(12).

15 a Y is the number of cars passing in 1 minute

b Y is the number of cars passing in 40 seconds

16 a Y is the number of cars passing in 4 seconds

b Y is the number of cars passing in 5 seconds

17 a Y is the number of cars passing in 1.5 minutes

b Y is the number of cars passing in 16 seconds

For questions 18 and 19, use the method demonstrated in Worked Example 8.12 to find the distribution of Z, where
Z = X + Y, and X and Y are independent.

18 a X ~ Po(2) and Y ~ Po(4.5) b X ~ Po(5.3) and Y ~ Po(3.6)

19 a X follows a Poisson distribution with mean 6.2 and Y follows a Poisson distribution with mean 7.8

b X follows a Poisson distribution with mean 4 and Y follows a Poisson distribution with mean 3

20 From a particular observatory, shooting stars are observed in the night sky at an average rate of one every
five minutes.

Assuming that this rate is constant and that shooting stars occur (and are observed) independently of each other,
find the probability that more than 20 are seen over a period of one hour.

21 When examining blood from a healthy individual under a microscope, a haematologist knows that she should see
on average four white blood cells in each high power field.
Find the probability that blood from a healthy individual will show
a seven white blood cells in a single high power field
b a total of 28 white blood cells in six high power fields, selected independently.

22 A wire manufacturer is looking for flaws. Experience suggests that there are on average 1.8 flaws per metre in
the wire.
a Determine the probability that there is exactly one flaw in one metre of the wire.
b Determine the probability that there is at least one flaw in 2 metres of the wire.

23 Mario receives an average of 5.4 emails and 2.1 texts each hour. These are the only types of messages he receives.
a Assuming that the emails and texts each form independent Poisson distributions, find the probability that he
receives more than 4 messages in an hour.
b Explain why the assumption that the emails and texts form independent Poisson distributions is unlikely to
be true.

24 The number of road traffic accidents at a particular intersection follows a Poisson distribution. If the probability
of there being at least one accident in a given week is 0.6, find
a the mean of the distribution
b the probability of there being more than two accidents in a given week.

25 The number of outbreaks of mumps in a certain district is modelled by a Poisson distribution.
a If the probability of there being more than two outbreaks in a given month is 0.3, find the probability of there
being fewer than two outbreaks in a given month.
b Explain whether the Poisson distribution is likely to be a good model for this situation.

26 The number of telephone calls per minute to a call centre follows a Poisson distribution with a mean of 6. Let X be
the number of calls received in one minute and let Y be the number of calls received in 10 minutes.
a Calculate:

i P (X = 6)
ii P (Y = 60)

b Find the probability that the call centre receives exactly 6 calls in at least 5 minutes of a 10-minute period.

27 The number of eagles observed in a forest in one day follows a Poisson distribution with mean 1.4.

a Find the probability that more than three eagles will be observed on a given day.
b Given that at least one eagle is observed on a particular day, find the probability that exactly two eagles are
seen that day.

28 The number of mistakes a teacher makes while marking homework has a Poisson distribution with a mean of
1.6 errors per piece of homework.
a Find the probability that there are at least two marking errors in a randomly chosen piece of homework.
b Find the most likely number of marking errors occurring in a piece of homework. Justify your answer.
c Find the probability that, in a class of 12 pupils, fewer than half of them have errors in their marking.
29 A car company has two limousines that it hires out by the day. The number of requests per day has a Poisson
distribution with a mean of 1.3 requests per day.
a Find the probability that neither limousine is hired.
b Find the probability that some requests have to be denied.
c If each limousine is to be equally used, on how many days in a period of 365 days would you expect a
particular limousine to be in use?

30 The daily sales of cupcakes from a bakery can be modelled by a Poisson distribution with mean 21.5. A fresh
batch is baked every three days, and any cupcakes older than three days cannot be sold.
a Find the probability of selling more than 15 cupcakes per day for three consecutive days.
b Find the number of cupcakes the bakery would have to produce to be at least 95% certain of not selling out.



WORKED EXAMPLE 8.14

Represent the information in the transition matrix in a transition diagram.

        S    R
T = S ⎛ 0.8  0.4 ⎞
    R ⎝ 0.2  0.6 ⎠

The graph has a loop from S → S with probability 0.8, and a loop from R → R with
probability 0.6.

The directed edge from S → R has probability 0.2, and the directed edge from R → S has
probability 0.4.

[Transition diagram: states S and R; loops S → S (0.8) and R → R (0.6); directed edges
S → R (0.2) and R → S (0.4).]

■ Powers of transition matrices

From a transition matrix, T, you can read off the probability of being in any particular
state at t = 1 given that you know the state initially (t = 0). To find these probabilities at
t = n, find the matrix Tⁿ.

Tip
A transition matrix, T, is said to be regular if for some integer, n, all entries of Tⁿ are
positive (none are 0). The corresponding Markov chain is said to be a regular Markov chain.
In this course, all Markov chains will be regular.

WORKED EXAMPLE 8.15

For the system in Worked Example 8.14, find the probability that it is sunny three days later
given that it is rainy today.

Use the GDC to find T³:

T³ = ⎛ 0.8  0.4 ⎞³ = ⎛ 0.688  0.624 ⎞
     ⎝ 0.2  0.6 ⎠    ⎝ 0.312  0.376 ⎠

You want the probability in the R column and the S row (going from Rainy to Sunny).

The probability of it being sunny three days later given that it is rainy today is 0.624.
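
The GDC step in Worked Example 8.15 amounts to repeated matrix multiplication. The following Python sketch shows one way to reproduce it without any libraries; mat_mul and mat_pow are hypothetical helper names, and the row and column order (S first, then R) follows the matrix in Worked Example 8.14.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_pow(m, n):
    """Raise a square matrix to a non-negative integer power n."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Columns are the current state, rows are the next state (order: S, R).
T = [[0.8, 0.4],
     [0.2, 0.6]]

T3 = mat_pow(T, 3)
p_rainy_to_sunny = T3[0][1]  # S row, R column: from Rainy to Sunny in three days

print([[round(x, 3) for x in row] for row in T3])
print(round(p_rainy_to_sunny, 3))
```

Each column of T³ still sums to 1, since it lists the probabilities of every possible state three days after starting in that column's state.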

CONCEPTS – REPRESENTATION

Try finding the probability from Worked Example 8.15 by representing the system as a
tree diagram instead of a Markov chain. Although you could already do this question by
using tree diagrams, the Markov chain method is far more efficient.
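
To see why the tree-diagram approach is less efficient, the sketch below lists every branch of a three-day tree that starts Rainy and ends Sunny, multiplies the probabilities along each branch, and adds the results. It is an illustration only; the dictionary P and the function prob_via_tree are names invented for this example.

```python
from itertools import product

# P[(today, tomorrow)] taken from the transition matrix in Worked Example 8.14.
P = {
    ("S", "S"): 0.8, ("S", "R"): 0.2,
    ("R", "S"): 0.4, ("R", "R"): 0.6,
}


def prob_via_tree(start, end, days):
    """Add up the probabilities of every tree branch from `start` to `end`."""
    total = 0.0
    branches = 0
    for middle in product("SR", repeat=days - 1):
        path = (start, *middle, end)
        branch_probability = 1.0
        for today, tomorrow in zip(path, path[1:]):
            branch_probability *= P[(today, tomorrow)]
        total += branch_probability
        branches += 1
    return total, branches


p_sunny, branch_count = prob_via_tree("R", "S", 3)
print(round(p_sunny, 3), branch_count)
```

With two states the number of branches doubles for every extra day, whereas the matrix method only needs one more multiplication.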

8D Markov chains 303

■ Initial state probability vectors

Another way of finding the required probability in Worked Example 8.15 would have
been to multiply T³ by a vector giving the probability of being in the initial state
(a rainy day). Since it is known to be a rainy day initially, this would be the vector

S ⎛ 0 ⎞
R ⎝ 1 ⎠

⎛ 0.8  0.4 ⎞³ ⎛ 0 ⎞   ⎛ 0.624 ⎞ S
⎝ 0.2  0.6 ⎠  ⎝ 1 ⎠ = ⎝ 0.376 ⎠ R

The resulting vector gives the probabilities for being sunny or rainy after three days,
given that it was rainy initially.

KEY POINT 8.11

For a transition matrix T and initial state vector s₀, the state after n time periods, sₙ, is given
by Tⁿs₀ = sₙ
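
Key Point 8.11 can be sketched in Python by applying T to the state vector n times, which gives the same result as forming Tⁿ first. The weather matrix comes from Worked Example 8.14; the second initial vector (70% chance of sun today) is a made-up value used only to show a non-certain starting state, and the function names are hypothetical.

```python
def step(T, s):
    """One time period: multiply the transition matrix T by the state vector s."""
    return [sum(T[i][j] * s[j] for j in range(len(s))) for i in range(len(T))]


def state_after(T, s0, n):
    """Return the state vector after n time periods, starting from s0."""
    s = list(s0)
    for _ in range(n):
        s = step(T, s)
    return s


# Order of states: S (sunny), R (rainy).
T = [[0.8, 0.4],
     [0.2, 0.6]]

rainy_today = [0.0, 1.0]                 # certainly rainy at t = 0
s3 = state_after(T, rainy_today, 3)

uncertain_today = [0.7, 0.3]             # hypothetical: 70% sunny, 30% rainy at t = 0
s3_mixed = state_after(T, uncertain_today, 3)

print([round(x, 4) for x in s3])
print([round(x, 4) for x in s3_mixed])
```

Starting from a certain state simply picks out one column of T³; a mixed starting vector gives a weighted average of the columns.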

WORKED EXAMPLE 8.16

Two taxi firms, Alpha Cabs and Beta Cars, are well established in the market and have market
shares of 50% and 40%, respectively. A new company, Gamma Carriages, has recently
entered the market and has a market share of 10%.

Given the monthly transition matrix,

      A     B     G
A ⎛ 0.85  0.05  0.1 ⎞
B ⎜ 0.05  0.8   0.1 ⎟
G ⎝ 0.1   0.15  0.8 ⎠

find the projected market share of each firm after five months.

Use sₙ = Tⁿs₀ with

     ⎛ 0.5 ⎞ A
s₀ = ⎜ 0.4 ⎟ B
     ⎝ 0.1 ⎠ G

     ⎛ 0.85  0.05  0.1 ⎞⁵ ⎛ 0.5 ⎞
s₅ = ⎜ 0.05  0.8   0.1 ⎟  ⎜ 0.4 ⎟
     ⎝ 0.1   0.15  0.8 ⎠  ⎝ 0.1 ⎠

Evaluate using the GDC:

     ⎛ 0.375 ⎞
s₅ = ⎜ 0.287 ⎟
     ⎝ 0.338 ⎠

So, after 5 months, the market shares for Alpha, Beta and Gamma are
37.5%, 28.7% and 33.8%, respectively.
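
The calculation in Worked Example 8.16 can be followed month by month with a short Python sketch. It first checks that every column of the transition matrix sums to 1 (a quick way to catch a mistyped entry) and then applies the matrix five times. The helper names are illustrative, not part of the text.

```python
# Monthly transition matrix; columns are the current firm, rows the next firm (A, B, G).
T = [
    [0.85, 0.05, 0.10],
    [0.05, 0.80, 0.10],
    [0.10, 0.15, 0.80],
]
s0 = [0.5, 0.4, 0.1]  # initial market shares of Alpha, Beta and Gamma


def columns_sum_to_one(matrix, tol=1e-9):
    """Check that each column of a transition matrix is a set of probabilities summing to 1."""
    size = len(matrix)
    return all(abs(sum(matrix[i][j] for i in range(size)) - 1) < tol
               for j in range(size))


def next_state(matrix, state):
    """Market shares one month later."""
    return [sum(row[j] * state[j] for j in range(len(state))) for row in matrix]


assert columns_sum_to_one(T)

history = [s0]
for month in range(5):
    history.append(next_state(T, history[-1]))

for month, shares in enumerate(history):
    print(month, [round(x, 3) for x in shares])
```

The final row of output corresponds to s₅, and the intermediate rows show Gamma's share growing each month at the expense of the two established firms.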





3 a P (A → A) = 0.71, P (A → B) = 0.29, P (A → C) = 0
P (B → B) = 0.88, P (B → C) = 0.05, P (B → D) = 0.07
P (C → A) = 0.24, P (C → B) = 0, P (C → C) = 0.57
P (D → A) = 0.45, P (D → C) = 0.07 , P (D → D) = 0.48

b P (A → B) = 0.1, P (A → C) = 0.25, P (A → D) = 0.3
P (B → A) = 0.15, P (B → C) = 0.2, P (B → D) = 0.25
P (C → A) = 0.1, P (C → B) = 0.05, P (C → D) = 0.35
P (D → A) = 0.2 , P (D → B) = 0.2 , P (D → C) = 0.05

For questions 4 to 6, use the method demonstrated in Worked Example 8.14 to write down the transition matrix for each
given transition diagram.

[Transition diagrams for questions 4 a, b (states A and B), 5 a, b (states A, B and C) and 6 a, b (states A, B, C and D).]

For questions 7 to 9, use the method demonstrated in Worked Example 8.15 to find the required probability from the
given transition matrix.

7 a       A    B
    T = ⎛ 0.2  0.4 ⎞ A
        ⎝ 0.8  0.6 ⎠ B

Probability of being in state B four days from now, given that in state A now.

  b       A     B
    T = ⎛ 0.65  0.5 ⎞ A
        ⎝ 0.35  0.5 ⎠ B

Probability of being in state A three days from now, given that in state A now.

8 a       A     B     C
        ⎛ 0.25  0.45  0.35 ⎞ A
    T = ⎜ 0.5   0.15  0.35 ⎟ B
        ⎝ 0.25  0.4   0.3  ⎠ C

Probability of being in state C five days from now, given that in state C now.

  b       A    B    C
        ⎛ 0.9  0.1  0   ⎞ A
    T = ⎜ 0.1  0.7  0.2 ⎟ B
        ⎝ 0    0.2  0.8 ⎠ C

Probability of being in state C six days from now, given that in state B now.

9 a       A     B     C     D
        ⎛ 0.54  0.25  0.1   0.3  ⎞ A
    T = ⎜ 0.16  0     0.73  0.08 ⎟ B
        ⎜ 0.2   0.61  0.17  0    ⎟ C
        ⎝ 0.1   0.14  0     0.62 ⎠ D

Probability of being in state D two days from now, given that in state B now.

  b       A  B    C    D
        ⎛ 1  0    0.5  0.2 ⎞ A
    T = ⎜ 0  0.4  0.1  0   ⎟ B
        ⎜ 0  0.3  0.4  0.6 ⎟ C
        ⎝ 0  0.3  0    0.2 ⎠ D

Probability of being in state B four days from now, given that in state D now.

For questions 10 to 12, use the method demonstrated in Worked Example 8.16 to find the probability vector sₙ for each
given transition matrix T and initial state probability vector s₀.

10 a T = ⎛ 0.82  0.26 ⎞, s₀ = ⎛ 0.7 ⎞
         ⎝ 0.18  0.74 ⎠       ⎝ 0.3 ⎠

     Find s₃.

   b T = ⎛ 0.43  0.61 ⎞, s₀ = ⎛ 0.1 ⎞
         ⎝ 0.57  0.39 ⎠       ⎝ 0.9 ⎠

     Find s₂.

         ⎛ 0.8   0.05  0.06 ⎞       ⎛ 0.4 ⎞
11 a T = ⎜ 0.12  0.92  0.1  ⎟, s₀ = ⎜ 0   ⎟
         ⎝ 0.08  0.03  0.84 ⎠       ⎝ 0.6 ⎠

     Find s₄.

         ⎛ 0.9  0    0.1 ⎞       ⎛ 0.2 ⎞
   b T = ⎜ 0    0.8  0.2 ⎟, s₀ = ⎜ 0.3 ⎟
         ⎝ 0.1  0.2  0.7 ⎠       ⎝ 0.5 ⎠

     Find s₃.

         ⎛ 0    0.7  0.1  0.2 ⎞       ⎛ 0.1 ⎞
12 a T = ⎜ 0.2  0.2  0.6  0   ⎟, s₀ = ⎜ 0.2 ⎟
         ⎜ 0.3  0.1  0    0.7 ⎟       ⎜ 0.3 ⎟
         ⎝ 0.5  0    0.3  0.1 ⎠       ⎝ 0.4 ⎠

     Find s₅.

         ⎛ 0.05  0.2   0.2  0.05 ⎞       ⎛ 0.7 ⎞
   b T = ⎜ 0.1   0.6   0.2  0.7  ⎟, s₀ = ⎜ 0   ⎟
         ⎜ 0.45  0.15  0.4  0.05 ⎟       ⎜ 0.1 ⎟
         ⎝ 0.4   0.05  0.2  0.2  ⎠       ⎝ 0.2 ⎠

     Find s₄.

For questions 13 to 15, use the method demonstrated in Worked Example 8.17 to find the steady-state probability vector
for each of the transition matrices from questions 10 to 12. As noted in the tip below Worked Example 8.17, you do not
need an initial state vector to calculate this for these matrices.

13 a T = ⎛ 0.82  0.26 ⎞
         ⎝ 0.18  0.74 ⎠

   b T = ⎛ 0.43  0.61 ⎞
         ⎝ 0.57  0.39 ⎠

         ⎛ 0.8   0.05  0.06 ⎞
14 a T = ⎜ 0.12  0.92  0.1  ⎟
         ⎝ 0.08  0.03  0.84 ⎠

         ⎛ 0.9  0    0.1 ⎞
   b T = ⎜ 0    0.8  0.2 ⎟
         ⎝ 0.1  0.2  0.7 ⎠

         ⎛ 0    0.7  0.1  0.2 ⎞
15 a T = ⎜ 0.2  0.2  0.6  0   ⎟
         ⎜ 0.3  0.1  0    0.7 ⎟
         ⎝ 0.5  0    0.3  0.1 ⎠

         ⎛ 0.05  0.2   0.2  0.05 ⎞
   b T = ⎜ 0.1   0.6   0.2  0.7  ⎟
         ⎜ 0.45  0.15  0.4  0.05 ⎟
         ⎝ 0.4   0.05  0.2  0.2  ⎠

For questions 16 to 18, use the method demonstrated in Worked Example 8.18 to find the exact steady-state probability
vector for each of the transition matrices from questions 13 to 15 by setting up and solving a system of equations.

16 a T = ⎛ 0.82  0.26 ⎞
         ⎝ 0.18  0.74 ⎠

   b T = ⎛ 0.43  0.61 ⎞
         ⎝ 0.57  0.39 ⎠

         ⎛ 0.8   0.05  0.06 ⎞
17 a T = ⎜ 0.12  0.92  0.1  ⎟
         ⎝ 0.08  0.03  0.84 ⎠

         ⎛ 0.9  0    0.1 ⎞
   b T = ⎜ 0    0.8  0.2 ⎟
         ⎝ 0.1  0.2  0.7 ⎠

         ⎛ 0    0.7  0.1  0.2 ⎞
18 a T = ⎜ 0.2  0.2  0.6  0   ⎟
         ⎜ 0.3  0.1  0    0.7 ⎟
         ⎝ 0.5  0    0.3  0.1 ⎠

         ⎛ 0.05  0.2   0.2  0.05 ⎞
   b T = ⎜ 0.1   0.6   0.2  0.7  ⎟
         ⎜ 0.45  0.15  0.4  0.05 ⎟
         ⎝ 0.4   0.05  0.2  0.2  ⎠

19 The transition matrix T gives the probabilities of people’s next purchase of car being an automatic (A) or a manual (M).

        A     M
T = ⎛ 0.95  0.35 ⎞ A
    ⎝ 0.05  0.65 ⎠ M

a Write down the probability of someone who currently owns a manual buying an automatic as their next car.

b Find the long-term probability matrix.

c Hence state the long-term percentage of manual cars.

20 If Cristiano scored his previous penalty, the probability of him scoring his next penalty is 0.95. If he missed his
previous penalty, his probability of scoring next time is 0.8.

a Write down the transition matrix for this situation.

b Given that he missed his last penalty, find the probability that he scores with his third penalty after the one he
missed.

c Find his steady-state probability vector for scoring and missing.

21 There are three convenience stores in a small town. The transition diagram gives the probabilities of people
swapping between stores for their weekly shop.

[Transition diagram with states A, B and C, showing the probabilities of customers staying with or swapping between the three stores.]

a Write down the transition matrix for this information.

In a particular week, 400 people use Store A, 240 people use Store B and 360 people use Store C.

b Find the number of customers using each store four weeks later.

c Find the steady-state number of customers using each store.

22 The population of otters in a particular area of wetland was surveyed over the course of many years. Each year,
the area is designated as either empty (E), lightly populated (L) or heavily populated (H).

The probabilities of changing between these states each year are given in the transition matrix T:

      E    L    H
    ⎛ 0.1  0.1  0.6 ⎞ E
T = ⎜ 0.4  0.6  0.4 ⎟ L
    ⎝ 0.5  0.3  0   ⎠ H

a Draw a transition diagram for this matrix.

b If the area is initially empty, find the probability of it being heavily populated three years later.

c Find the long-term probability of the area being lightly populated.

23 Financial markets are categorized as bullish (values generally rising), bearish (values generally falling) or stagnant
(values generally neither rising nor falling).

If a market has had a bullish week, the probability of the next week also being bullish is 0.85 and the probability
of the next week being bearish is 0.05.

If a market is bearish, the probability of the next week being bullish is 0.1 and the probability of the next week
being bearish is 0.75.

If a market is stagnant, then the probability of the next week being bullish is 0.25 and the probability of the next
week being bearish is 0.25.

a Write down the transition matrix for this system.

b If the current week is bearish, find the probability that three weeks from now will be a bullish week.

c Find the long-term probabilities for this system.

24 A gambler takes $20 into a casino to play roulette. He places $10 on red each time. If the ball lands on red, he wins
$10; if the ball lands on black, he loses $10. He continues to play until he either loses all his money or reaches $30.

a Assuming that the probabilities of getting red and black are both 0.5, set up a transition matrix between the
states $0, $10, $20 and $30.

b Find the probability that he ends up leaving with $30.

25 At the end of the summer season, the entire colony of a particular species of bird migrates to one of two locations,
A or B.

If a bird migrates to A this year, the probability that it migrates to A next year is 0.7.
If a bird migrates to B this year, the probability that it migrates to A next year is 0.4.

a Write down the transition matrix for this system.

Initially, 45% of birds migrate to location A and 55% to location B.

b Find the proportion of birds migrating to A two years later.
c Set up a system of equations for the long-term proportions migrating to A and B.

d Solve these equations to find the exact steady state proportions.
26 Pairs of genes can be dominant (D), hybrid (H) or recessive (R).

If one of the parents is known to be hybrid, then the probabilities of any offspring being dominant, hybrid or
recessive depends on the second parent as given by the matrix

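Columns give the type of the second parent (D, H, R from left to right) and rows give the type of the offspring (D, H, R from top to bottom):

$T = \begin{pmatrix} 0.5 & 0.25 & 0 \\ 0.5 & 0.5 & 0.5 \\ 0 & 0.25 & 0.5 \end{pmatrix}$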

A rabbit of unknown characteristic is mated with a hybrid rabbit, and then the offspring are mated with a hybrid,
and so on through the generations.

a Find the probability that the second generation offspring are recessive given that the parent was dominant.

b Set up a system of equations for the long-term probabilities of offspring being dominant, hybrid or recessive.

c Solve these equations to find the long-term probabilities.

27 In a model of population movement, people either live in an urban (U) or a rural (R) area. The proportion of those
moving from urban to rural areas each year is 0.1 and the proportion of those moving from rural to urban areas
is 0.15.

a Write down the transition matrix, T, for population movement each year.

b Find the eigenvalues and corresponding eigenvectors of T.

c Hence write down matrices P and D such that $T = PDP^{-1}$.

Initially, 65% of people live in urban areas and 35% live in rural areas.

d Find an expression for the proportion of people in urban areas after n years.

e Hence find the proportion of people living in urban areas in the long term.

28 In any given year, a field of crops is either diseased or healthy. If the crop is healthy in a particular year, then the
probability of it being diseased the next year is 0.2. If the crop is diseased in a particular year, then the probability
of it being healthy the next year is 0.6.

a Write down the transition matrix, T, for the state of the field.

b Find the eigenvalues and corresponding eigenvectors of T.

c Hence write down matrices P and D such that $T = PDP^{-1}$.

In year zero the field is healthy.

d Find an expression for the probability of it being diseased in year n.

e Hence find the long term probability of the field being diseased.


Checklist

■ You should know how linear transformations of a single random variable affect the expected value and variance:

$E(aX + b) = aE(X) + b$
$\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$

■ You should be able to find the expected value and variance of a linear combination of two or more random
variables:

$E(aX + bY + c) = aE(X) + bE(Y) + c$
$\mathrm{Var}(aX + bY + c) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$

The second result is only true if X and Y are independent.
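
The two results for a linear combination can be checked numerically. The following is a minimal sketch, assuming two small made-up discrete distributions (the dictionaries `X` and `Y` and the constants `a`, `b`, `c` are illustrative only). It builds the exact distribution of aX + bY + c for independent X and Y and compares its mean and variance with the formulae.

```python
from itertools import product


def expected_value(dist):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Variance of a discrete distribution given as {value: probability}."""
    m = expected_value(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())


def linear_combination(dist_x, dist_y, a, b, c):
    """Exact distribution of aX + bY + c, assuming X and Y are independent."""
    combined = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        w = a * x + b * y + c
        # Independence lets us multiply the two probabilities.
        combined[w] = combined.get(w, 0) + px * py
    return combined


# Hypothetical distributions, chosen only for illustration.
X = {0: 0.5, 1: 0.25, 2: 0.25}
Y = {1: 0.5, 3: 0.5}
a, b, c = 2, -3, 4

W = linear_combination(X, Y, a, b, c)

print(expected_value(W), a * expected_value(X) + b * expected_value(Y) + c)
print(variance(W), a**2 * variance(X) + b**2 * variance(Y))
```

Each printed pair should agree. Note that the constant c shifts the mean but does not appear in the variance.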

■ You should be able to find the expected value and variance of the mean of a random variable.

If X1, X2, …, Xn are independent observations of the random variable X, then:

$E(\bar{X}) = E(X)$

$\mathrm{Var}(\bar{X}) = \dfrac{\mathrm{Var}(X)}{n}$

■ You should know about the distribution of linear combinations of normally distributed random variables.

If the random variables X and Y are independent with $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ and if $W = aX + bY + c$,
then:

$W \sim N(a\mu_X + b\mu_Y + c,\; a^2\sigma_X^2 + b^2\sigma_Y^2)$
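
As an illustration of this result, here is a short sketch using `NormalDist` from Python's standard `statistics` module. The helper name `combine_normals` and all the numbers are assumptions made up for the example. Note that `NormalDist` is given a standard deviation, not a variance, so the combined variance has to be square-rooted.

```python
from math import sqrt
from statistics import NormalDist


def combine_normals(a, mu_x, var_x, b, mu_y, var_y, c=0.0):
    """Distribution of W = aX + bY + c for independent normal X and Y."""
    mean_w = a * mu_x + b * mu_y + c
    var_w = a**2 * var_x + b**2 * var_y   # coefficients are squared, c drops out
    return NormalDist(mean_w, sqrt(var_w))  # NormalDist expects the standard deviation


# Hypothetical example: X ~ N(10, 9), Y ~ N(4, 16) and W = 2X - Y + 1.
W = combine_normals(2, 10, 9, -1, 4, 16, c=1)

p = W.cdf(17)  # P(W < 17)
print(W.mean, W.variance, p)
```

Even though Y is subtracted, its variance is added, because the coefficient −1 is squared.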

■ You should know about the distribution of the mean of a normally distributed random variable.

If X1, X2, …, Xn are independent observations of the random variable $X \sim N(\mu, \sigma^2)$, then:

$\bar{X} \sim N\left(\mu, \dfrac{\sigma^2}{n}\right)$

■ You should know about the distribution of the sum or mean of many observations of a random variable from any
distribution (the central limit theorem).

If X1, X2, …, Xn with n > 30 are independent observations from any distribution you will meet, then
approximately:

$\sum_{i=1}^{n} X_i \sim N(n\mu, n\sigma^2)$

$\bar{X} \sim N\left(\mu, \dfrac{\sigma^2}{n}\right)$
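
A calculation based on the central limit theorem might be sketched as follows. The population mean, standard deviation and sample size are invented for illustration, and the function names are not from any particular library. Only `NormalDist` is a standard-library class.

```python
from math import sqrt
from statistics import NormalDist


def sample_mean_distribution(mu, sigma, n):
    """Approximate distribution of the sample mean: N(mu, sigma^2 / n)."""
    return NormalDist(mu, sigma / sqrt(n))


def sample_total_distribution(mu, sigma, n):
    """Approximate distribution of the sample total: N(n * mu, n * sigma^2)."""
    return NormalDist(n * mu, sigma * sqrt(n))


# Hypothetical population with mean 50 and standard deviation 12; sample size 36.
xbar = sample_mean_distribution(50, 12, 36)
total = sample_total_distribution(50, 12, 36)

p_mean_below_47 = xbar.cdf(47)        # P(sample mean < 47)
p_total_below_1692 = total.cdf(1692)  # same event expressed as a total: 36 * 47 = 1692
print(p_mean_below_47, p_total_below_1692)
```

The two probabilities should match, since a statement about the mean can always be rewritten as a statement about the total.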

■ You should know about the circumstances under which the Poisson distribution is an appropriate model.

The Poisson distribution occurs when the following conditions are satisfied:

events are independent of each other
events occur at a constant average rate
events occur singly (one at a time).

■ You should know how to find the mean and variance of the Poisson distribution.

If X ~ Po(λ), then:

E(X) = λ

Var(X) = λ


■ You should know about the distribution of the sum of two independent Poisson distributions.

If $X \sim \mathrm{Po}(\lambda_X)$ and $Y \sim \mathrm{Po}(\lambda_Y)$ are two independent Poisson variables and Z = X + Y, then:
$Z \sim \mathrm{Po}(\lambda_X + \lambda_Y)$
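
The Poisson results above can be checked with a few lines of Python. This is only a sketch: the rates 2.0 and 3.5 are made up, and the function names are illustrative. The probability that X + Y = k is found by adding up every way of splitting k between X and Y, and this is compared with the Po(λX + λY) probability.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return exp(-lam) * lam**k / factorial(k)


def pmf_of_sum(k, lam_x, lam_y):
    """P(X + Y = k) for independent X ~ Po(lam_x) and Y ~ Po(lam_y)."""
    return sum(poisson_pmf(i, lam_x) * poisson_pmf(k - i, lam_y) for i in range(k + 1))


lam_x, lam_y = 2.0, 3.5  # hypothetical rates
direct = poisson_pmf(4, lam_x + lam_y)
by_splitting = pmf_of_sum(4, lam_x, lam_y)
print(direct, by_splitting)

# Mean and variance of Po(5.5), summing far enough that the tail is negligible.
lam = lam_x + lam_y
mean = sum(k * poisson_pmf(k, lam) for k in range(60))
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in range(60))
print(mean, var)
```

Both the mean and the variance should come out very close to 5.5, in line with E(X) = Var(X) = λ.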

■ You should know about the role of transition matrices in Markov chains.

A transition matrix is a matrix in which:
the probability of going from state A to state B is given by the entry in column A and row B
the entries in each column sum to 1.

■ You should know about initial state probability vectors.

For a transition matrix T and initial state vector $s_0$, the state after n time periods, $s_n$, is given by $T^n s_0 = s_n$.

■ You should know how to calculate steady-state and long-term probabilities of Markov chains.

For a transition matrix T, the steady-state vector s satisfies $Ts = s$.
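
These Markov chain ideas can be sketched in plain Python. The two-state matrix below is hypothetical, and the function names are chosen for this example only. The steady state is found here by repeatedly applying T until the vector stops changing. This is one possible approach and assumes the chain settles down; in an exam you would normally solve Ts = s directly or use your calculator.

```python
def mat_vec(T, s):
    """Multiply matrix T (a list of rows) by the column vector s."""
    return [sum(T[i][j] * s[j] for j in range(len(s))) for i in range(len(T))]


def state_after(T, s0, n):
    """State vector after n time periods, i.e. T^n s0."""
    s = list(s0)
    for _ in range(n):
        s = mat_vec(T, s)
    return s


def steady_state(T, s0, tol=1e-12, max_steps=10_000):
    """Apply T repeatedly until the state stops changing (assumes convergence)."""
    s = list(s0)
    for _ in range(max_steps):
        nxt = mat_vec(T, s)
        if max(abs(a - b) for a, b in zip(nxt, s)) < tol:
            return nxt
        s = nxt
    raise RuntimeError("state vector did not settle down")


# Hypothetical chain: column = current state, row = next state, columns sum to 1.
T = [[0.8, 0.5],
     [0.2, 0.5]]
s0 = [1.0, 0.0]  # start in the first state with certainty

s2 = state_after(T, s0, 2)
s_long = steady_state(T, s0)
print(s2, s_long)
```

For this matrix, solving Ts = s by hand together with the condition that the probabilities sum to 1 gives s = (5/7, 2/7), which the iteration should reproduce.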

Mixed Practice

1 The probability distribution of a random variable X is given in the following table.

x          1     2     3     4     5
P(X = x)   0.1   0.2   0.3   0.2   0.2

a Find E (X) .

The random variable Y is given by Y = 2 − 3X .

b Find E (Y) .

c Given that Var (X) = 1.56, find Var (Y) .

2 Random variable V has the probability distribution given in the table.

v          2     3     5     7
P(V = v)   0.4   p     2p    3p

a Find the value of p.
b Find the mean of V.

The random variable W is given by W = 9 − 2V.

c Find the mean of W.
d Given that the standard deviation of V is 2.14, find the standard deviation of W.

3 The random variable X has the probability distribution given by
$P(X = x) = \dfrac{3x - 1}{26}$ for x = 1, 2, 3, 4

a Show this probability distribution in a table.

b Find the exact value of E (X) .

c Given that Var (X) = 0.917 to three significant figures, find Var (20 − 5X) correct to three
significant figures.


4 A case of wine contains 5 bottles. The mean mass of a bottle of wine is 1.2 kg with a variance
of 0.1 kg².

An empty case has a mass of 0.4 kg with a variance of 0.02 kg².

a Find
i the mean mass of a full case of wine
ii the variance of a full case.

b State an assumption you needed to make in part a ii.

5 The heights of trees in a forest have a mean of 20 m and a variance of 55 m². A sample of 35 trees
is measured.

a Find the mean and variance of the average height of the trees in the sample.
b Use the central limit theorem to find the probability that the average height of the trees in the
sample is less than 18 m.

6 The number of cars arriving at a car park in a five-minute interval follows a Poisson distribution with
mean 8, and the number of motorbikes follows a Poisson distribution with mean 1.4.

Find the probability that exactly 10 vehicles arrive at the car park in a particular five-minute interval.

7 The number of tweets per day from a school Twitter account follows a distribution with mean 9 and
standard deviation 3.
a Find the mean and standard deviation of the total number of tweets put out in a five-day week.
b State any assumptions made in part a.

8 X is the random variable ‘number of pizzas ordered per hour in a restaurant’. It is thought that

X ~ Po(7.3).

a Write down two conditions required for the Poisson distribution to model data.

b Find P(4 < X ≤ 10).

9 Based on long experience, a gardener knows that birds tend to arrive at his garden at an average rate
of 12 per hour.
a State two assumptions required to model the birds’ arrival using a Poisson distribution. Are these
reasonable assumptions?
b If these assumptions do hold, find the probability of the gardener observing more than 15 birds in
an hour.

10 Consumers have three options for their broadband provider: Pacey Play, Rapid Rate or Super Speedy.

The transition matrix, T, for changing between these companies each year is given by

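Columns give the current provider and rows give next year's provider, each in the order P, R, S:

$T = \begin{pmatrix} 0.71 & 0.08 & 0.04 \\ 0.22 & 0.82 & 0.05 \\ 0.07 & 0.10 & 0.91 \end{pmatrix}$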

a State the probability of a customer who currently uses Super Speedy changing to Pacey Play
next year.

b Given that the current market share is Pacey Play 42%, Rapid Rate 38%, Super Speedy 20%, find
the market share of each company in three years time.

c Find the steady-state market share of each company.


11 The n independent random variables X1, X2, …, Xn all have the distribution $N(\mu, \sigma^2)$.

a Find the mean and the variance of
i X1 + X2
ii 3X1
iii X1 + X2 – X3
iv $\bar{X} = \dfrac{X_1 + X_2 + \dots + X_n}{n}$

Mathematics HL November 2012 Paper 3 Statistics and probability Q2 part a

12 The mass of men in an office block is known to be normally distributed with mean 78 kg and
standard deviation 8 kg. An elevator in the office block has a maximum recommended load of 600 kg.

With 7 men in the elevator, calculate the probability that their combined mass exceeds the maximum
recommended load.

13 The volume of soda, X ml, in a can follows a normal distribution with mean 331 ml and standard
deviation 2 ml. These cans are sold in packs of 6.

The total amount of soda in a pack is Y ml.

a Find P (Y < 1980).
b Find E(Y− 6X) and Var(Y− 6X) .

c Find the probability that the volume of soda in a randomly chosen pack is more than 5 ml greater
than 6 times the volume in a randomly chosen can.

14 The error a machine makes in cutting 10 m lengths of rope has a mean of 0 cm and a standard
deviation of 0.5 cm.

A sample of 35 pieces of rope is taken.

Find the probability that the mean error of the sample is less than 0.1 cm.

15 A receptionist at a hotel answers on average 45 phone calls a day.
a State a possible model for the number of calls received per day and any assumptions you
are making.
b Use your model to find the probability that, on a particular day, she will answer more than
50 phone calls.
c Find the probability that she will answer more than 45 phone calls every day during a
five-day week.

16 In a particular town, rainstorms occur at an average rate of two per week and can be modelled using
a Poisson distribution.
a Find the probability of at least eight rainstorms occurring during a particular four-week period.
b Given that the probability of at least one rainstorm occurring in a period of n complete weeks is
greater than 0.99, find the least possible value of n.


17 A geyser erupts randomly. The eruptions at any given time are independent of one another and can be
modelled using a Poisson distribution with mean 20 per day.
a Determine the probability that there will be exactly one eruption between 9 am and 10 am.
b Determine the probability that there are more than 22 eruptions during one day.
c Determine the probability that there are no eruptions in the 30 minutes Dale spends watching
the geyser.
d Find the probability that the first eruption of a day occurs between 3 am and 4 am.
e Determine the probability that there will be at least one eruption in at least six out of the eight
hours the geyser is open for public viewing.
f Given that there is at least one eruption in an hour, find the probability that there is exactly
one eruption.

18 Patients arrive at random at an emergency room in a hospital at the rate of 14 per hour throughout the day.
a Find the probability that exactly four patients will arrive at the emergency room between 12:00
and 12:15.
b Given that fewer than 15 patients arrive in one hour, find the probability that more than 12 arrive.
c Dr Chris works a 10 hour shift. Find the probability that, in at least 5 of those 10 hours, more than
15 patients arrive at the emergency room.

19 Compared to its value at the start of trading, the value of a share at the end of the day can have risen
(R), fallen (F) or stayed the same (S). Over a prolonged period, a certain share was observed to
perform in the following way. If its price:
■ rises on a given day, the probability of it rising again the next day is 0.6, while the probability of it
falling is 0.1
■ falls on a given day, the probability of it rising the next day is 0.4, while the probability of it
falling again is 0.4
■ stays the same on a given day, the probability of it rising the next day is 0.5, while the probability
of it falling is 0.2.
a Write down the transition matrix for the share price movement.
b If the price fell today, find the probability that it will rise in 3 days time.
c i Set up a system of equations satisfied by the steady-state probabilities of the share price
movement.
ii Hence find these exact steady-state probabilities.

20 Engine oil is sold in cans of two capacities, large and small. The amount, in millilitres, in each can, is
normally distributed according to Large ~ N (5000, 40) and Small ~ N (1000, 25).
a A large can is selected at random. Find the probability that the can contains at
least 4995 millilitres of oil.
b A large can and a small can are selected at random. Find the probability that the large can
contains at least 30 millilitres more than five times the amount contained in the small can.
c A large can and five small cans are selected at random. Find the probability that the large can
contains at least 30 millilitres less than the total amount contained in the small cans.

Mathematics HL May 2015 Paper 3 Statistics and probability Q1


21 The number of cats visiting Helena’s garden each week follows a Poisson distribution with mean
λ = 0.6.

Find the probability that
i in a particular week no cats will visit Helena’s garden
ii in a particular week at least three cats will visit Helena’s garden
iii over a four-week period no more than five cats in total will visit Helena’s garden
iv over a twelve-week period there will be exactly four weeks in which at least one cat will visit
Helena’s garden.

Mathematics HL November 2013 Paper 2 Q11 part a

22 The random variable X has probability distribution Po(8).

a i Find P(X = 6).
ii Find P(X = 6 | 5 ≤ X ≤ 8).
b $\bar{X}$ denotes the sample mean of n > 1 independent observations from X.

i Write down E($\bar{X}$) and Var($\bar{X}$).

ii Hence, give a reason why $\bar{X}$ is not a Poisson distribution.

c A random sample of 40 observations is taken from the distribution for X.

i Find P(7.1 < $\bar{X}$ < 8.5).
ii Given that P(|$\bar{X}$ − 8| ≤ k) = 0.95, find the value of k.

Mathematics HL May 2014 Paper 3 Statistics and probability Q1

23 A marine biologist is investigating the lengths of a particular type of fish. It is known that the lengths
have standard deviation 4.6 cm. She wishes to take a sample to estimate the mean length. She requires
that the standard deviation of the sample mean is smaller than 0.5.

What sample size should she take? Justify any assumptions that you make.

24 Jars of jam have mean mass of 498 g and standard deviation σ g. The probability that 50 jars of jam
weigh more than 25 kg is 4.23%.

Find the value of σ.

25 The marks students scored in a maths test follow a normal distribution with mean 63 and variance
64. The marks of the same group of students in an English test follow a normal distribution with
mean 61 and variance 71.
a Find the probability that a randomly chosen student scored a higher mark in English than
in maths.
b Find the probability that the average English mark of a class of 12 students is higher than their
average maths mark.

26 The number of worms in each square metre of woodland is modelled by a Poisson distribution with
mean 1.2.
a Find the probability that in a 2 m² area of woodland there are exactly two worms.
b Find the probability that in each of two 1 m² areas of woodland there is exactly one worm.
A scientist searches many different 1 m² areas of the woodland. She only records the number of
worms in areas where she finds some.
c Find the mean of her observations.


27 A shop has a delivery of 50 pairs of sunglasses every week during the summer season. Weekly
demand for sunglasses during this period can be modelled by a Poisson distribution with mean 42.5.
a Assuming that there are no sunglasses in stock when a fresh delivery arrives, find the probability
that the store then sells out of sunglasses that week.
b Find the most likely number of sunglasses sold in a given week.
c Find the minimum number of sunglasses the store should order to be 99% sure of meeting
demand.

28 Each year, a particular plant either flowers or grows. If the plant flowers in a given year, the
probability that it flowers the next year is 0.1. If the plant grows in a particular year, the probability
that it flowers the next year is 0.6.
a Write down the transition matrix, T, for the plant’s activity.
b Find the eigenvalues and corresponding eigenvectors of T.

c Hence write down matrices P and D such that $T = PDP^{-1}$.

In year zero a plant flowers.
d Find an expression for the probability of it flowering in year n.
e Hence find the long-term probability of the plant flowering.

9 Statistics

ESSENTIAL UNDERSTANDINGS

■ Statistics is concerned with the collection, analysis and interpretation of quantitative data and
uses the theory of probability to estimate parameters, discover empirical laws, test hypotheses and
predict the occurrence of events.

■ Statistical representations and measures allow us to represent data in many different forms to aid
interpretation.

■ Both statistics and probability provide important representations which enable us to make
predictions, valid comparisons and informed decisions. These fields have power and limitations
and should be applied with care and critically questioned in detail to differentiate between the
theoretical and the empirical/observed.

In this chapter you will learn...

■ how to conduct a statistical investigation
■ about validity and reliability
■ how to find unbiased estimates of the population mean and variance
■ how and when to combine data in a χ² table
■ how to choose an appropriate number of degrees of freedom in a χ² goodness of fit test when estimating parameters
■ how to find non-linear regression models
■ how to calculate the sum of square residuals for a regression model and use it to measure the fit for a model
■ how to find the coefficient of determination and use it to measure the fit for a model
■ how to find and interpret confidence intervals for the population mean
■ how to conduct a hypothesis test for the population mean for the normal distribution when the population variance is known and when it is unknown
■ how to conduct a hypothesis test for the difference between two population means when the population variance is known and when it is unknown
■ how to conduct a hypothesis test for the difference in population mean for paired samples
■ how to conduct a hypothesis test for the population proportion using the binomial distribution
■ how to conduct a hypothesis test for the population mean rate using the Poisson distribution
■ how to conduct a hypothesis test for Pearson’s population product-moment correlation coefficient
■ how to find the critical region in hypothesis tests for normal tests, binomial tests and Poisson tests
■ how to find the probability of Type I and Type II errors in hypothesis tests.

Figure 9.1 What level of certainty can we expect in the results?


CONCEPTS

The following concepts will be addressed in this chapter:
■ Different statistical techniques require justification and identification of their limitations and validity.
■ Correlation and regression are powerful tools for identifying patterns and equivalence of systems.
■ Modelling and finding structure in seemingly random events facilitates prediction.
■ Statistical literacy involves identifying reliability and validity of samples and whole populations in a closed system.
■ A systematic approach to hypothesis testing allows statistical inferences to be tested for validity.

PRIOR KNOWLEDGE

Before starting this chapter, you should already be able to complete the following:

1 Find the mean and standard deviation for this data set.

2, 3, 4, 5, 5, 6, 7, 8, 9, 9

2 Conduct a χ² goodness of fit test on the data below, where the critical value is 12.8
and H0: Data come from B(5, 0.7).

Data value           0     1     2     3     4     5
Observed frequency   7     28    103   188   132   42

3 For the contingency table below, test at the 10% significance level whether the two variables are independent.

18   13   8
13   16   21
8    18   22

4 For the data below,

x   1   2   3   4   4   5   6   6    7   9
y   1   3   5   5   6   7   8   11   9   10

a find Pearson’s product-moment correlation coefficient and interpret this value

b find the equation of the regression line.

5 For the data in the table below, conduct a t-test at the 10% significance level of the hypotheses H0: μ1 = μ2, H1: μ1 < μ2.

           Sample 1   Sample 2
Mean       5.4        5.8
Variance   4.3        3.9
Size       48         35

6 If $X \sim N(12, 4^2)$, find P(X < 9).

7 If X ~ B(10, 0.6) , find P(X  5).

8 If X ~ Po(5.2) , find P(X > 6).

This chapter extends a number of ideas met in the statistics chapters of the
Mathematics: applications and interpretation SL book. As well as the possibility of
fitting a linear regression model to data, we now look at several other regression models
that might be more suitable, from quadratic and cubic to exponential and sinusoidal.

We also revisit χ²-tests, t-tests and tests for the correlation coefficient, as well as
introducing hypothesis tests using the normal, binomial and Poisson distributions.

Of course, hypothesis tests do not offer certainty – it is always possible that the
conclusion arrived at was incorrect. Being able to find the probability that the conclusion
was in error is an important feature of setting up and evaluating any hypothesis test.

Starter Activity

Look at the pictures in Figure 9.1. Discuss the possible errors in any conclusion reached. Which potential
error is the more serious to make in each case?

Now look at this problem:

A coin is flipped 20 times.
a What is the minimum number of heads you would need to see to conclude that the coin was biased in favour of heads?
b If you did see that number of heads, how confident would you be in your conclusion that the coin was biased in
favour of heads?


9A Statistical design and
unbiased estimators

The statistical investigation cycle

In the SL book, you met some very advanced ways to analyse data, such as regression
and statistical tests. However, these analytical tools are only one part of the statistical
process.

[Flowchart: Start → Research question → Define population → Design investigation → Collect data → Use descriptive statistics to explore the data → Use inferential statistics to answer the research question → Are results reliable and valid? Yes: report findings. No: return to an earlier stage.]

Research questions

The starting point for any investigation is to be clear about what you want to know.
This might start off as a broad question such as ‘does healthy body equal healthy mind?’
You often then have to narrow this down to something which is clearer and easier
to work with – for example ‘Do people who are good at athletics do better in their
academic studies?’

Define populations

A key idea in all of statistics is that you are inferring something about a population
from a sample. Even if you had data for everybody in your school, that is just one school,
in a particular year. You must be clear what the population of interest is – perhaps all
18-year-old IB students in your school – so that the sampling method you need to use
becomes clear.

LEARNER PROFILE – Principled

Does ‘fair’ mean the same thing as ‘equal’? Would it be fair for everybody to get equal
results in an exam? How can mathematics be used to define ‘fairness’?


Design investigation

This comprises several parts.

1 Selecting relevant variables from many variables

Most interesting research questions relate to ideas which are hard to measure. You have
to choose proxy variables which are related to, but not quite the same as, the idea you
are interested in. In the example above, we might choose time in a 100 m race as a
proxy for athletic ability and score in their maths examination as a proxy for academic
ability. Both of these have the advantage of being relatively easy to measure, but they
are not identical to the original idea – people might be very good at the javelin, or very
bad at examinations.

TOK Links

Are things which can be measured numerically more useful than intangible quantities?
Can everything important be assigned a numerical value?

Equally, there might be a lot more data available than we want. We could create very
complex models by taking into account every exam a student has ever taken, but that
suffers from increasing technical difficulty and it is harder to communicate results
clearly.

2 Designing data collection method

A survey is any method for collecting data for analysis. This might include
questionnaires, interviews, direct observations or measurements and collecting
secondary data from other sources. In the example above, we might collect academic
data from the school’s examination database and observe individuals’ 100 m times at a
sports day. However, this might have practical issues – we might not have access to the
school’s database and not everybody might take part in the 100 m in the school sports
day. It might be easier to use a questionnaire to ask each pupil to self-report their exam
results and 100 m times, although this still has issues with respondents not answering
honestly and many people not responding at all.

A questionnaire is a list of questions, which sounds simple; however, it can be hard to
design good questions.

■ They should be unbiased – not revealing the personal opinions of the person setting
the questionnaire, for example ‘What is your opinion about burgers for lunch?’
is better than ‘Do you agree that burgers are unhealthy, disgusting and cruel to
animals?’

■ Questionnaires can be unstructured (‘Describe your pain’) or structured (‘Rank
your pain on a scale from 1 to 5, circling your answer’). Both have advantages – it
is generally easier to analyse responses to structured questions, but unstructured
questions can reveal more insight into a situation.


■ Questionnaires may be personal, requiring people to be self-aware. This often
causes issues. For example, asking people to rank their own happiness is notoriously
difficult as one person’s 7 out of 10 happiness might be comparable to another person’s
9 out of 10.

■ Questionnaires should also be precise, so that respondents are clear about what is
required. For example, the question ‘How many people are in your family?’ might
be interpreted by some people as including aunts, cousins and grandparents. Some
people might have step-parents or half-brothers and they may not be sure about
whether these should be included. Another common imprecise question might
be ‘Would you like sausages or chips to be available in the canteen?’ Some people
might answer ‘chips’ and some people might just say ‘yes’.

In Chapter 6 of the Mathematics: applications and interpretation SL book, you saw a
variety of different sampling methods.

3 Choosing relevant and appropriate data

If you want to know about the salaries of IB graduates there is no point looking at the
salaries of all the people in a country. Your sample should be relevant to the research
question. It should also be appropriate for the analysis you are choosing to do. For
example, a huge model linking hundreds of variables will require a very large data set.
You should choose a sampling method to try to get a representative sample from the
population of interest.

4 Choosing an appropriate statistical process

For exploratory work, you might want to use descriptive statistics – such as calculating
averages or examining scatter graphs. This might be the final process, or it might
inform further research questions.

To answer more sophisticated questions we tend to use inferential statistics such as
t-tests and chi-squared tests. The types of questions you can ask are guided by the tests
you have available – for example, you cannot really test causality (whether one thing causes
another), just correlation between variables.

Test for reliability and validity

The conclusions of a test are reliable if similar conclusions would be reached on each
occasion the test is conducted in similar circumstances. There are two procedures you
should be aware of to check on reliability.

■ Test-retest is when you conduct the test and repeat it on new data from the same
group sometime later. If the test is reliable you would expect a strong correlation
between the results on the two occasions. There will be intervening factors and
natural variation so the correlation does not have to be perfect.

■ Parallel forms is when the same concept is measured in slightly different but
comparable ways – for example, measuring 100 m and 200 m times of athletes. If the
test is reliable, then there should be a strong correlation between the results when
either measurement is used. However, it can be quite difficult to find two different
but comparable things to measure.
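
Both procedures come down to measuring the correlation between two sets of results. As a rough sketch, the following computes Pearson’s product-moment correlation coefficient for two hypothetical sets of scores from the same six people. The data and the function name `pearson_r` are made up for illustration, and how strong the correlation needs to be is a judgement that depends on the context.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same six people on two occasions (test-retest).
first_occasion = [12, 15, 9, 20, 17, 11]
second_occasion = [13, 14, 10, 19, 18, 10]

r = pearson_r(first_occasion, second_occasion)
print(round(r, 3))
```

A value of r close to 1 would support the claim that the test is reliable. The same calculation applies to parallel forms, with the two lists holding the two comparable measurements.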

You are the Researcher

You might want to see how Cronbach’s alpha measures parallel forms.


The process is valid if it is measuring what you really want to measure. There are
multiple threats to validity. The variables you are working with might not be entirely
representative of the concept. For example, you might want to measure how healthy
people are in a country and use life expectancy as a proxy, but this might really be
measuring how good the doctors are at keeping people alive, with people having long
but unhealthy lives. There might also be issues with the statistical processes being
used. For example, you might use a t-test on data which are not drawn from a normal
distribution. There are two procedures which can be used to check validity.

■ Content validity checks are when experts assess whether the test is relevant
to the content required. For example, whether a test on trigonometry is really
about trigonometry or whether it also brings in tests of reading, using calculators,
understanding contexts and other such confounding variables. Within statistics, this
would also include checking whether the assumptions of statistical tests are satisfied.

■ Criterion validity checks whether the test agrees with an external standard which
is considered an authentic measure of the quality being investigated. For example, if
a test for entering a school has criterion validity, then students who do well on the
test should do well in the school.

If the tests for reliability and validity fail, then you might have to go back to adapt your
research question, redesign your data collection (for example, use a larger sample) or use
a different test.

You are the Researcher

There is a suite of tests called non-parametric tests which have fewer assumptions
than the tests you have met so far. For example, instead of the t-test you could use
the Mann-Whitney test.

Statistical investigations are very popular choices for mathematical explorations.
Having a formal approach to testing reliability and validity would be a good way
to demonstrate the sophistication and rigour aspects of Criterion E – Use of
Mathematics.

WORKED EXAMPLE 9.1

A school surveys its staff by asking them to rank how tired they are, on a scale of 1 to 10.

Suggest

a whether this question is valid
b whether the results are reliable.

Validity will depend on how well the criteria are explained and what exactly is being
investigated. Measuring work-related fatigue could be confounded by illness,
self-awareness and wanting to exaggerate.

a This is unlikely to be valid as the number is subjective, with the same number meaning
different things to different people.

Reliability is about whether the results will be repeatable.

b This is unlikely to be reliable as on a different day the answers may change – the last
day of term may have a different answer to the first day of term.


Unbiased estimators

Descriptive statistics, such as the mean, range or variance, can be calculated for a sample.
However, we are very rarely only interested in the sample. We want to infer something about
the population. One way of doing this is to use an unbiased estimator. This is a statistic
that, if we were to calculate it for many samples, would average to the true population value.
For the mean, it turns out that the sample mean $\bar{x}$ is an unbiased estimate of the population mean, μ.

KEY POINT 9.1

$\bar{x}$ is an unbiased estimate of μ.

Proof 2.1

Key Point 9.1 might seem intuitive, but the proof shows you how unbiased
estimates work.

The sample mean is a random variable formed by adding up n independent observations of X and dividing by n:

$\bar{X} = \frac{1}{n}(X_1 + X_2 + \dots + X_n)$

Taking an average over many samples is mathematically done by finding an expectation:

$E(\bar{X}) = E\left(\frac{1}{n}(X_1 + X_2 + \dots + X_n)\right)$

This can be simplified by using linear transformations of random variables:

$= \frac{1}{n}\left(E(X_1) + E(X_2) + \dots + E(X_n)\right)$

Each expected value is the true population mean, μ:

$= \frac{1}{n}(\mu + \mu + \dots + \mu)$

There are n terms in the sum, each equal to μ, so this simplifies:

$= \frac{1}{n}(n\mu) = \mu$



The variance of a sample, $s_n^2$, will tend to slightly underestimate the true variance, $\sigma^2$, because a sample does not usually explore all the extremes of a population.

The unbiased estimate of the population variance is $s_{n-1}^2$, and it can be found using the following formula.

KEY POINT 9.2

An unbiased estimate of the population variance is

$s_{n-1}^2 = \frac{n}{n-1} s_n^2$

This will usually be found directly from your calculator. Different calculators might use slightly different notation so you must make sure that you know how your calculator describes the unbiased estimate.

Tip

A common error is to think that $s_{n-1}$ is an unbiased estimate of the population standard deviation. It turns out that this is not the case.

9A Statistical design and unbiased estimators 325

WORKED EXAMPLE 9.2

For the sample

11, 16, 14, 14, 21, 23

evaluate unbiased estimates of the population mean and variance.

Put the data into the calculator.

The unbiased estimate for μ is $\bar{x} = 16.5$.

The unbiased estimate for $\sigma^2$ is $s_{n-1}^2 \approx 21.1$.

You are the Researcher

You might like to prove the formula in Key Point 9.2. To do so you will need to research a general formula for variance of a random variable.
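The calculator step in Worked Example 9.2 can be checked by hand or in code. The following is a minimal sketch using Python's standard `statistics` module (the choice of Python and the variable names are assumptions for illustration, not part of the course). It computes both the variance of the sample and the unbiased estimate, and confirms that they are related as in Key Point 9.2.

```python
from statistics import mean, pvariance, variance

sample = [11, 16, 14, 14, 21, 23]
n = len(sample)

x_bar = mean(sample)            # unbiased estimate of the population mean
s_n_sq = pvariance(sample)      # variance of the sample itself (divides by n)
s_n1_sq = variance(sample)      # unbiased estimate (divides by n - 1)

# Key Point 9.2: the two variances differ by the factor n / (n - 1).
assert abs(s_n1_sq - n / (n - 1) * s_n_sq) < 1e-9

print(f"mean = {x_bar}, s_n^2 = {s_n_sq:.3f}, s_(n-1)^2 = {s_n1_sq:.3f}")
```

Here `pvariance` plays the role of $s_n^2$ and `variance` the role of $s_{n-1}^2$, which mirrors the two options most calculators offer.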

Exercise 9A

For questions 1 to 3, use the ideas in Worked Example 9.1 to suggest i whether this question is valid and ii whether the
results are reliable. Justify your answer.

1 a ‘Have you ever committed a crime?’ 2 a ‘Do you like pizza?’

b ‘What is the best football team?’ b ‘Do you have any brothers?’

3 a ‘Do you like maths or physics?’

b ‘Have you enjoyed the session today?’

For questions 4 to 6, use the technique of Worked Example 9.2 to evaluate unbiased estimates of the population mean

and variance for the given sample.

4 a 14, 19, 20, 22, 25, 25 5 a      

b 7, 14, 21, 28, 35, 42 b      

6 a      
b      
7 The standard deviation, $s_n$, of a sample of size 10 is 12.4. Find $s_{n-1}$.
8 The best estimate of the population standard deviation, $s_{n-1}$, of a sample is 1.118 (to 4 s.f.) times bigger than the sample standard deviation, $s_n$. Find the value of n.
9 Claudia collects a sample of the number of eggs laid by 6 snakes of a particular breed. Here are her results.

4, 3, 3, 8, 5, 4

a Find an unbiased estimate for the population mean of the number of eggs of snakes from this breed.

b Find an unbiased estimate for the population variance of the number of eggs of snakes from this breed.

c Claudia only sampled snakes from her local breeder. Explain why the processes used in parts a and b might not
be valid.

d How could Claudia test the reliability of her result in part a?


10 A social scientist wants to look at people’s experience of crime. He conducts a survey using a questionnaire.
One of the questions asked is:

‘How many criminal activities affected you in the last year?’

He leaves his questionnaire in a police station, asking people to post their completed questionnaires in a box.
One week later, he has received the following responses to the question:

1, 2, 4, 2, 1, 3498

a Explain why it is reasonable to remove the data item 3498 from his sample.

b For the remaining data, calculate unbiased estimates of the population mean and population variance.

c Comment on the content validity of the question.

d Comment on the validity of the sampling method.

e Is the social scientist’s estimate of the population mean more likely to be an overestimate or an underestimate?
Justify your answer.

f How could the reliability of the survey be improved?

11 a Define reliability.

b A psychology test tries to assess people’s personality types on four different categories. A test uses four
questions to assess each category spread throughout a questionnaire. The psychologist wants to check whether
the results of each question testing the same category correlate. Name the reliability test she is using.

c Suggest an alternative way to assess the reliability of the test.

12 a Define validity.

b A career coach wants to assess his impact on people’s business success. He does this by asking the following
question at the end of a course he has run:

‘Do you think that this course will increase your salary?’

Give two reasons why this question does not have content validity.

c Explain why making the questionnaire anonymous might increase the validity of the question.

d The career coach followed up on these responses four years later and checked to see whether the salary
increases were greater for the people who answered yes to his question. What aspect of validity is he testing by
doing this?

9B Further 2 tests

 Categorizing numerical data in a 2 table

In the Mathematics: applications and interpretation SL book, you met the idea that the
chi-squared distribution only works if all expected values are greater than 5. We can
combine groups to make this happen.

WORKED EXAMPLE 9.3

In a survey at a sports club, members were asked to name their favourite sport out of baseball,
basketball and ice-hockey.

                        Age
                   13–14   15–16   17–19
Sport  Baseball      3      12      21
       Basketball    0      12      16
       Ice-hockey    8      16      11

9B Further 2 tests 327

a Find the expected values in a test for independence.

b Combine the columns to form an appropriate data set to test whether age and sport are
independent.

c Hence test at 5% significance to see whether these data provide evidence for a link
between age and sport.

You can use your calculator to find a matrix of expected values.

a The expected values are

                        Age
                   13–14   15–16   17–19
Sport  Baseball     4       14.5    17.5
       Basketball   3.11    11.3    13.6
       Ice-hockey   3.89    14.1    17.0

The 13–14 age group contains expected frequencies less than 5, so it needs to be combined with another column. The only column it makes sense to combine it with is the 15–16 age group.

b Combining 13–14 and 15–16:

                      Age
                   13–16   17–19
Sport  Baseball     15      21
       Basketball   12      16
       Ice-hockey   24      11

The calculator can be used to conduct the chi-squared test. You should quote the degrees of freedom, chi-squared value and the p-value.

c H0: Sport and Age categories are independent.
H1: Sport and Age categories are dependent.

χ² = 6.31, 2 degrees of freedom,
p = 0.0425 < 0.05

Therefore, there is significant evidence that age and sport preference are dependent.
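The calculator output for part c of Worked Example 9.3 can be reproduced from first principles. The sketch below is written in plain Python (an assumption for illustration; in the course the GDC does this). It builds the expected frequencies for the combined table, sums (O − E)²/E and converts the statistic to a p-value. The p-value line uses the closed form exp(−χ²/2), which is only valid for 2 degrees of freedom.

```python
import math

# Observed frequencies from Worked Example 9.3 after combining 13-14 with 15-16.
# Rows: baseball, basketball, ice-hockey; columns: ages 13-16, 17-19.
observed = [[15, 21],
            [12, 16],
            [24, 11]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected frequency under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(col_totals) - 1)

# With exactly 2 degrees of freedom the chi-squared upper tail has the
# closed form exp(-chi2 / 2); other values of df need a different formula.
assert df == 2
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.4f}")
```

Every expected frequency in the combined table is now above 5, which is why the test can go ahead.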

You might have wondered why the default letter used to represent unknowns
is x. One theory starts from the fact that, historically, maths was not written in
equations using letters, but rather words.
x − 3 = 2 would have been written as ‘three less than the unknown equals two’.
However, at the time when such equations were being studied, Arabia was at the
forefront of mathematics and ‘unknown’ would have been written as the Arabic word
‘shalan’. Arabian influence in Europe was particularly strong in Spain and there was
no Spanish equivalent for the sound ‘sh’ so they borrowed the Greek letter χ to start
the word. Gradually this word was abbreviated to just χ then x.

χ² tests with estimated parameters

In the Mathematics: applications and interpretation SL book, you met the idea that you
can test to see if data might have been drawn from a particular distribution – for example,
Po(1.2). However, you might not know in advance the parameter of the distribution.
If you estimate it using the data, then that adds an extra constraint on the expected
frequencies. This leads to a slight change in the formula for the chi-squared distribution.

KEY POINT 9.3

Degrees of freedom = N − 1 − k

where N is the number of groups and k is the number of parameters estimated.


WORKED EXAMPLE 9.4

a Estimate the mean of the population from which the following data are drawn.

X           0    1    2    3    4
Frequency   32   49   41   20   12

b Hence test at the 5% significance level to see if it has been drawn from a Poisson
distribution, stating your null and alternative hypotheses.

You need to be able to input frequency distributions into a calculator.

a From the GDC, $\bar{x} = 1.55$

In this case, the mean is not included in the hypotheses because there was no prior belief about its value. It must be estimated from the data.

b H0: The data is drawn from a Poisson distribution.
H1: The data is not drawn from a Poisson distribution.

You can find the expected frequencies using probabilities from the Po(1.55) distribution multiplied by 154, the total frequency in the data. Notice that even though the highest observed value was 4, the expected values have to add up to the same total frequency. This means that the final group has to be treated as greater than or equal to 4.

Using the Po(1.55) distribution, the expected frequencies are

X               0      1      2      3      4
Exp frequency   32.6   50.6   39.3   20.3   11.1

There are 5 categories, the total frequency is fixed and one parameter (the mean) has been estimated from the data.

There are 5 − 1 − 1 = 3 degrees of freedom.

The calculator can then be used to find the p-value.

χ² = 0.211
p = 0.976 > 0.05

Therefore, there is no significant evidence to suggest that the data was not drawn from a Poisson distribution.

Tip

Be careful: the conclusion should not simply claim that the data does come from a Poisson distribution. The wording is quite tricky, but it is important to be precise.
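The steps of Worked Example 9.4 can be sketched in plain Python (an illustrative assumption; the course expects a GDC). The code estimates the mean, builds the expected frequencies with the last group treated as X ≥ 4, and applies Key Point 9.3. The helper `chi2_upper_tail_3df` is a hypothetical name for a closed-form tail probability that holds only for 3 degrees of freedom.

```python
import math

observed = [32, 49, 41, 20, 12]          # frequencies for X = 0, 1, 2, 3, 4
total = sum(observed)
lam = sum(x * f for x, f in enumerate(observed)) / total   # estimated mean

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Expected frequencies; the last group is treated as X >= 4 so that the
# expected values add up to the same total as the observed ones.
probs = [poisson_pmf(k, lam) for k in range(4)]
probs.append(1 - sum(probs))
expected = [total * p for p in probs]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 1               # Key Point 9.3 with k = 1

def chi2_upper_tail_3df(x):
    # Closed form of P(chi-squared with 3 d.f. > x); valid for df = 3 only.
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

assert df == 3
p_value = chi2_upper_tail_3df(chi2)
print(f"mean = {lam:.2f}, chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

Note that the expected frequencies here use the unrounded mean, which is what a calculator would do.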

This idea can also be applied to continuous data.

Tip

The notation [10, 20[ means 10 ≤ x < 20.

WORKED EXAMPLE 9.5

For the following data, find unbiased estimates of the population mean and variance. Hence determine at the 5% significance level whether it could have been drawn from a normal distribution.

x           [10, 20[   [20, 30[   [30, 40[   [40, 50[   [50, 60[   [60, 70[
Frequency   12         40         48         52         44         16
9B Further 2 tests 329

In this case, the mean and variance are not included in the hypotheses because there was no prior belief about their values. They have to be estimated from the data.

H0: The data is drawn from a normal distribution.
H1: The data is not drawn from a normal distribution.

Find the mean and variance from the table using mid-interval values (the midpoints of each group). Write down those values that you use in the calculation that are not given in the question.

The mid-interval values are 15, 25, 35, 45, 55 and 65.
From the GDC,

$\bar{x} = 40.8$
$s_{n-1}^2 = 184$

To find the expected frequencies, find the probabilities using the cumulative normal distribution function on the calculator, then multiply by the total frequency (212). Even though the observed data were between 10 and 70, the expected frequencies for a normal distribution need to cover all real values.

The expected frequencies are

x            Frequency
]−∞, 20[     13.1
[20, 30[     31.7
[30, 40[     55.8
[40, 50[     58.3
[50, 60[     36.2
[60, ∞[      16.7

There are 6 groups, the constraints are the total being the same, and the mean and variance being the same as those estimated from the data.

The degrees of freedom are 6 − 1 − 2 = 3.

The GDC can then do a goodness of fit test.

From the GDC, χ² = 5.73,
p-value = 0.126 > 0.05.

Therefore, there is no significant evidence that this sample was not drawn from a normal distribution.
TOK Links

Notice that we cannot say that the data are drawn from a normal distribution. We would see
this type of data only about 12.6% of the time when the real distribution is normal which is
not hugely supportive of it being normally distributed. However, there is insufficient evidence
to be statistically confident that it was not drawn from a normal distribution. Are we really
testing for a good fit, or for a bad fit? Does the name we give to tests influence how we
interpret them? Is it easier to prove or disprove a statement using statistics?

You are the Researcher

In Worked Example 9.5, we used unbiased estimators for the mean and variance.
Strictly, when doing chi-squared tests we should use a different type of estimator
called a maximum likelihood estimator. Sometimes these give the same answer
as unbiased estimators but sometimes they give different answers. Explore what
is meant by a maximum likelihood estimator and find out when it differs from
unbiased estimators.


Exercise 9B

For questions 1 to 3, use the method of Worked Example 9.3 to test variables A and B for independence at the 5%
significance level. You will need to combine rows or columns first.

1a 13–14 A 17–19 b [0,1] [10,20[ A [30,40]
9 15–16 20 ]1,2[ 12 [20,30[ 5
6 4 15 B [2,3] 16 4
B7 3 12 10 18 9 3
12 10
8 20 15

2a A bA
[20,30[
B [10,20[ [30,40] 1.3 1.4 1.5
6 5 4
[0,1] 16 15 20 2 20 30 40
]1,2[ 4 12 19
[2,3] B3 10 10 10

44 4 5

3 a              A
            1    2    3    4
   B   2    0    5    5   10
       3    0   10   10    0
       4    0   10   10    0
       5   10    5    5    0

  b              A
            1    2    3    4
   B   2   25    6    4    1
       3   10   12   15   16
       4   20   30   40   50
       5   25   35   45   55

For questions 4 to 6, use the method of Worked Example 9.4 to test at the 5% significance level whether a Poisson
distribution is an appropriate model for the given data.

4 a X           0    1    2    3    4
    Frequency   20   20   20   20   20

  b X           0    1    2    3    4
    Frequency   10   10   10   10   10

5 a X           0    1    2    3    4
    Frequency   10   20   30   20   10

  b X           0     1     2     3     4
    Frequency   100   200   300   200   100

6 a X           0    1    2    3    4    5
    Frequency   16   28   35   40   16   9

  b X           0    1    2    3    4    5
    Frequency   8    14   20   25   16   9

For questions 7 to 9, use the method of Worked Example 9.5 to test at the 5% significance level whether a normal
distribution is an appropriate model for the given data.

7 a x           [10, 20[   [20, 30[   [30, 40[   [40, 50[   [50, 60[   [60, 70[
    Frequency   15         20         25         25         20         15

  b x           [10, 20[   [20, 30[   [30, 40[   [40, 50[   [50, 60[   [60, 70[
    Frequency   150        200        250        250        200        150

8 a x           [20, 40[   [40, 60[   [60, 80[   [80, 100[   [100, 120[
    Frequency   10         10         10         10          10

  b x           [20, 40[   [40, 60[   [60, 80[   [80, 100[   [100, 120[
    Frequency   10         20         30         20          10

9 a x           [−100, 0[   [0, 50[   [50, 100[   [100, 200[
    Frequency   10          50        30          20

  b x           [−100, 0[   [0, 50[   [50, 100[   [100, 200[
    Frequency   20          50        40          20

Links to: Biology

In the 1850s, based on ideas of heredity, Austrian monk and scientist Gregor Mendel predicted that if tall plants were allowed to self-fertilize, they would produce tall plants and short plants in the ratio 3:1. He actually observed 787 tall plants and 277 short plants.

10 Test Mendel’s prediction at the 5% significance level.

11 The times taken by eight-year-old children to solve a puzzle can be modelled by a normal distribution with mean
12 minutes and standard deviation 2.5 minutes. The times taken to solve the same puzzle by a random sample of
40 ten-year-old children are as follows.

Time (minutes)   t ≤ 9   9 < t ≤ 11   11 < t ≤ 13   13 < t ≤ 15   t > 15
Frequency        6       8            15            7             4

A psychologist wants to test, using a 10% significance level, whether the times of the ten-year-old children come
from the same distribution.

a Write down suitable hypotheses for this test.

b Find the expected frequencies.

c Explain why the first two groups and the last two groups need to be combined.

d State the number of degrees of freedom after combining the groups.

e Carry out the test and state the conclusion.

12 Katya wants to find out whether diet choices are dependent on age. She collects data from students at her school
and records them in the contingency table.

Vegetarian Vegan Eats meat

11–13 8 20 14

14–15 15 10 20

16–17 8 8 6

17–18 7 6 3

a State suitable hypotheses for a χ2 test for independence.

b Explain why the last two rows of the table need to be combined.

c Conduct a χ2 test for independence, using a 5% significance level. State your conclusion in context.

d i Katya’s friend says that he is vegetarian on most days but will eat fish at family celebrations if offered,
so he did not know which response was required. How could Katya improve her questionnaire to take into
account his feedback?

ii Explain why adding too many categories could be problematic.


13 Hermann is investigating whether the number of cars going past his house can be modelled by a Poisson
distribution with mean 3.5 per minute. He observes the cars over a period of 60 minutes and records the number of
cars in each minute.

Number of cars in a minute 0 1 2 3 4 5 6

Frequency 3 8 10 13 12 9 5

a State suitable hypotheses for a χ2 test.

b State which two groups need to be combined.

c State the number of degrees of freedom.

d Find the p-value and state the conclusion of the test.

14 A teacher suggests that exam grades at her college can be modelled by the distribution.

P(G = g) = g(11 − g)/140 for g = 3, 4, 5, 6, 7

A random sample of 26 students had the following grades.

Grade       3    4   5   6   7
Frequency   10   6   4   3   3

a Assuming that the teacher’s suggestion is correct, calculate the expected frequencies.
b Test, using a 10% significance level, whether the teacher’s model is appropriate for these data.

15 A six-sided dice is rolled 27 times, with the following results.

Outcome     1   2   3   4   5   6
Frequency   6   2   5   7   2   5

Is there evidence, at the 10% significance level, that the dice is not fair?
16 The table shows information about the mode of transport that students use to get to school in four different cities.

Amsterdam Athens Houston Johannesburg

Car 12 25 48 24

Bus 18 33 12 18

Bicycle 46 12 7 53

Walk 13 30 5

Use a χ2 test to find out whether there is evidence, at the 5% significance level, that there is a relationship

between the mode of transport and the city. State the number of degrees of freedom and the p-value.

17 Consider the following data.

x [0, 10[ [10, 20[ [20, 30[ [30, 40[ [40, 50[
Frequency 200 500 820 500 200

a Use a chi-squared test to determine whether the following distributions are plausible models for the data at the
5% significance level.

i N (μ, σ 2)

ii N (25, σ 2)

iii N (25, 110)
b When would you use a model of the form N (25, σ 2) rather than N (μ, σ 2)?

9B Further 2 tests 333

18 Rajesh is practising tennis serves. He takes three serves at a time and records the number of successful ones.

He believes that this number can be modelled by the binomial distribution B(3, 0.6).

Number of successful serves out of 3 0 1 2 3

Frequency 7 28 95 70

a State the hypotheses for a χ2 goodness of fit test.

b Find the expected frequencies and write down the number of degrees of freedom.

c By finding the p-value, show that there is evidence, at the 5% significance level, that B(3, 0.6) is not a good model.

Rajesh still thinks that the number of successful serves can be modelled by a binomial distribution, but with a
different probability of success.

d By finding the mean of the data in the table, estimate the probability of success.

e Hence test, using a 5% significance level, whether the number of successful serves can be modelled by a
binomial distribution.

19 A publisher wants to test whether the number of typos per page can be modelled by a Poisson distribution.
She collects the following data from a random sample of 100 pages.

Number of typos 0 1 2 3 4 5

Number of pages 12 23 29 24 12 0

a Find the mean number of typos per page.

b State suitable hypotheses for a χ2 test.

c Find the expected frequencies and the number of degrees of freedom.

d Test, using a 5% significance level, whether the number of typos per page can be modelled by a Poisson
distribution.

20 A train company claims that times for a particular journey are distributed normally, with mean 23 minutes.

Sumaya takes this train to school and wants to test the company’s claim. She decides to conduct a χ2 test and

records the lengths of 50 randomly selected journeys.

Time (minutes) 20–21.5 21.5–22.5 22.5–23.5 23.5–24.5 24.5–26
Frequency 3 8 14 17 8

a Estimate the population standard deviation of train times.
b Find the expected frequencies. Do any groups need to be combined?
c Write down the number of degrees of freedom.
d Use a p-value to complete the test at the 5% significance level, stating your conclusion clearly.
21 An athlete believes that her long jump distances follow a normal distribution. In order to test her belief, she
recorded the distances from a random sample of 100 jumps, and obtained the following results.

Distance (m) 4.5 to 5 5 to 5.5 5.5 to 6 6 to 7 7 to 7.2
Frequency 9 18 32 33 8

a Assuming that her belief is correct, copy the table and complete the expected frequencies, correct to three
decimal places.

Distance (m) < 5 5 to 5.5 5.5 to 6 6 to 7 >7

Frequency 8.148 17.914

b State the number of degrees of freedom for a χ2 goodness of fit test.

c State suitable hypotheses.

d Conduct the test at the 10% significance level.


22 Consider the following data.

X           0    1    2    3    4
Frequency   10   20   30   40   50

a Find an unbiased estimate of the mean of the data.
b Find an unbiased estimate for the variance of the data.
c Amelia wants to check whether the data could plausibly come from a Poisson distribution. Show how she could do this
i by comparing her answers to a and b
ii by using a goodness of fit test with the last category being X ≥ 4
iii by using a goodness of fit test with the last category being X > 4.

d Compare the validity of Amelia’s three methods.

9C Non-linear regression

In Chapter 6 of Mathematics: applications and interpretation SL, you met the idea of
linear regression, and used your GDC to find the equation of the regression line for a
given data set.

You now need to be able to do the same for a number of non-linear possibilities as well:
quadratic, cubic, exponential, power and sine regressions.

WORKED EXAMPLE 9.6

For the data below, find the equation of a regression model of the form y = ax2 + bx + c.

x 2 3 3 4 5 6 7 7 9 10
y 13 16 17 19 22 24 23 24 22 21

The GDC will give you the values of the coefficients a, b and c in the quadratic y = ax² + bx + c.

The quadratic model is

y = −0.399x² + 5.78x + 2.85


WORKED EXAMPLE 9.7

For the data below, find the equation of a regression model of the form y = asin(bx + c) + d.

x   1     2     2     3     4     5     7     8
y   1.4   8.1   8.7   5.0   0.9   7.6   0.7   6.9

The GDC will give you the values of the coefficients a, b, c and d in the sine model y = asin(bx + c) + d.

The sine model is

y = 4.12sin(2.01x − 2.94) + 4.75

 Sum of square residuals as a measure of fit for a model

Having formed a model, it is important to know how well it fits the data. To do this we
need a measure of the difference between the y-values a model predicts and the actual
y-values of the data. These differences are called residuals.

[Figure: data points and a model curve on x–y axes, with one positive residual and one negative residual labelled]

Since residuals can be positive or negative, to get an overall measure we first square the residuals before summing them. This gives a quantity called the sum of square residuals, SSres.


KEY POINT 9.4

$SS_{res} = \sum (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ are the values the model predicts.

The smaller the value of SSres the better the model fits the data.

Note that since each component of SSres is non-negative (as each residual is squared), the more data values that are used to form a model the larger the value of SSres is likely to be for that model.

Therefore, you should not use SSres to compare the goodness of fit of two models that have been derived from a different number of data points.
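As an illustration of Key Point 9.4, the sketch below computes the sum of square residuals for the quadratic model of Worked Example 9.6 and for a rival straight line on the same ten data points. The straight line y = 0.9x + 15 is a hypothetical model invented purely for comparison, and Python is an assumption of this sketch rather than part of the course.

```python
# Data and fitted quadratic from Worked Example 9.6.
xs = [2, 3, 3, 4, 5, 6, 7, 7, 9, 10]
ys = [13, 16, 17, 19, 22, 24, 23, 24, 22, 21]

def quadratic_model(x):
    return -0.399 * x ** 2 + 5.78 * x + 2.85

def linear_model(x):
    # Hypothetical rival model, chosen only to have something to compare with.
    return 0.9 * x + 15.0

def ss_res(model, xs, ys):
    """Sum of squared residuals: sum of (y_i - y_hat_i)^2."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))

ss_quadratic = ss_res(quadratic_model, xs, ys)
ss_linear = ss_res(linear_model, xs, ys)
print(f"quadratic: {ss_quadratic:.2f}, straight line: {ss_linear:.2f}")
```

Because both models are judged on the same data points, comparing the two values directly is legitimate here; the smaller value belongs to the quadratic.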

 The coefficient of determination (R2)

The sum of square residuals leads to another measure of the goodness of fit of a model: the coefficient of determination, R². This measure does allow for comparison between models derived from a different number of data points.

For a linear model, R² is just the square of the Pearson’s product-moment correlation coefficient, r.

Tip

A value of R² of around 0.7 or more is usually considered to be an indication of a good fit.

KEY POINT 9.5

0 ≤ R² ≤ 1, where a value of 1 indicates that the model perfectly predicts the data values.

In many circumstances, R2 gives the proportion of variability in the dependent variable

accounted for by the chosen model.
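The text does not state a formula for R², so the sketch below assumes the usual definition R² = 1 − SSres/SStot, where SStot is the sum of squared deviations of the y-values from their mean. Under that assumption it fits a least-squares straight line to the data of Worked Example 9.6 and checks that R² agrees with the square of Pearson's r, as stated above for linear models. Python and all variable names are illustrative choices.

```python
import math

xs = [2, 3, 3, 4, 5, 6, 7, 7, 9, 10]
ys = [13, 16, 17, 19, 22, 24, 23, 24, 22, 21]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)          # total sum of squares
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least-squares regression line y = m x + c.
m = s_xy / s_xx
c = y_bar - m * x_bar

ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_res / s_yy                     # assumed definition of R^2

r = s_xy / math.sqrt(s_xx * s_yy)                 # Pearson's r
print(f"R^2 = {r_squared:.3f}, r^2 = {r ** 2:.3f}")
```

Dividing by SStot is what removes the dependence on the number of data points, which is why R² can be compared across data sets of different sizes while SSres cannot.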

You are the Researcher

Find out about the different types of non-linear regression and when R2 has this

interpretation and when not.

WORKED EXAMPLE 9.8

A cubic model is suggested for this data.

x 0 0.5 1.0 1.5 2.0 2.5 3.0
y 3.4 4.9 1.9 5.8 5.5 16.2 16.5

a Find the equation of the regression curve.

b State the value of the coefficient of determination for this model.

The GDC will give you the values of the coefficients a, b, c and d in the cubic y = ax³ + bx² + cx + d.

a The cubic model is

y = −0.400x³ + 4.38x² − 5.05x + 4.25

The coefficient of determination is r² from this output screen.

b R² = 0.865


Exercise 9C

For questions 1 to 5, use the technique of Worked Examples 9.6 and 9.7 to find a regression model of the given form for
each set of data.

1 a y = ax² + bx + c

x   0     1     2     3     4      5
y   4.1   0.9   1.9   4.4   10.8   22.3

b y = ax² + bx + c

x   0.5    1.0    1.5    2.0    2.5    3.0   3.5
y   10.8   15.7   17.4   15.8   14.1   9.5   5.1

2 a y = ax³ + bx² + cx + d

x   0.2   0.4   0.6   0.8   0.8   1.0   1.2    1.2    1.4    1.6
y   3.8   4.1   6.1   6.7   7.8   8.4   12.9   12.1   12.6   15.0

b y = ax³ + bx² + cx + d

x   1   2   2   3   4    5    6    7    8
y   6   8   7   5   12   23   45   60   116

3 a y = ae^(bx)

x   5   10   15   20    25
y   4   20   40   155   330

b y = ae^(bx)

x   1.4   2.5   3.2   4.6   5.8   6.0
y   14    5.6   3.1   0.4   0.2   0.1

4 a y = ax^b

x   10   20    30    40    50    60    70    80    90
y   10   7.5   4.6   4.3   4.2   2.7   2.5   2.4   2.4

b y = ax^b

x   1   2    3     4     5
y   1   45   190   780   1000

5 a y = asin(bx + c) + d

x   10    11    12    13    13    14
y   0.9   1.5   7.3   1.0   1.3   3.5

b y = asin(bx + c) + d

x   1     1     2     3     4     4     5     6
y   2.4   1.7   3.7   4.8   4.7   4.6   2.5   0.6


For questions 6 to 8, use the technique of Worked Example 9.8 to find the coefficient of determination for the given model for each set of data.
6 a Linear model

x 50 60 70 80 80 90 90 100 110 120

y 86 49 48 36 29 31 22 4  

b Linear model

x 10 15 20 25 30
y 2 12 10 13 20

7 a Quadratic model 223 4 5
x 01 5.6 7.1
y 4.3 1.1   

b Quadratic model

x 100 103 105 106 109 110 112 115
y 12 16 31 29 35 21 15 10

8 a Cubic model

x 0.4 0.9 1.3 1.8 2.3 2.5 2.8 3.1
y 5.3 5.5 2.3 0.4 4.6 4.3 4.9 6.8

b Cubic model

x 5 12 18 26 31

y 1.3 9.8 7.6 6.7 

9 Zoe suspects that a linear model may be appropriate for data she has collected and calculates the Pearson’s
product-moment correlation coefficient to be – 0.879.

Find the value of the coefficient of determination for Zoe’s data set.

10 The distance from the Sun, x, in astronomical units (AU), and the orbital period, T, in Earth years, of each of the
eight planets in the Solar System are given below.

Planet    Distance from sun (AU)   Orbital period (years)
Mercury   0.387                    0.241
Venus     0.723                    0.615
Earth     1.00                     1.00
Mars      1.52                     1.88
Jupiter   5.20                     11.9
Saturn    9.54                     29.5
Uranus    19.2                     84.0
Neptune   30.1                     165

a Find a regression model of the form T = ax³ + bx² + cx + d, giving the values of a, b, c, d to three significant figures.

b The dwarf planet Pluto is 39.5 AU from the sun.

i Use the model to predict its orbital period.

ii Comment on the reliability of your prediction.


11 The depth of water, h metres, in a harbour t hours after midnight is recorded as follows.

t    h
2    7.7
6    6.5
10   4.1
14   7.5
18   6.7
22   4.5

a Find a regression model of the form h = asin(bt + c) + d, giving the values of a, b, c, d to two significant figures.

b Use your model to find the
i maximum predicted depth of water in the harbour
ii minimum predicted depth of water in the harbour.

12 A company is considering two functions to model how demand for a particular product D varies with the price
charged, p.
The company has gathered the following sales figures at varying prices.

p ($) D
2 40
4 37
6 23
8 20
10 17
12 12
14 11
16 11
18 10

20 6

Model A: D = ap + b
Model B: D = ap^b

a Determine the value of R2 for model A and model B.
b On the basis of these values of R2, suggest which model is a better fit for the data.
c Write down the equation for your chosen model, giving the values of a and b to three significant figures.

d Use your model to predict the demand at a price of $15.
e Comment on the suitability of using your model to predict demand at prices higher than $20.

13 A biologist is attempting to develop a model for population growth of bacteria. She records the following data.

Time (minutes)           1     2     3     4
Population (thousands)   3.2   5.8   7.4   10.2

She proposes two models:

Model A: P = 2.5e^(0.3t)
Model B: P = 3.5e^(0.2t)

a Find the sum of square residuals for each model.

b On the basis of the values found in part a, suggest which model better fits the data.


14 A maths teacher is attempting to form a model that predicts students’ scores, s, in a maths exam from the length
of time they spend revising beforehand, t.

He wants to form a separate model for girls and boys based on the following data.

GIRLS BOYS

Time spent Score in Time spent Score in

revising (hours) exam (%) revising (hours) exam (%)

2 54 1 34

3 60 2 36

3 51 2 52

5 69 4 61

6 83 5 94

6 77 7 78

8 84 8 84

9 91 10 80

10 86

12 79

a Form a quadratic regression model for the girls and a separate quadratic model for the boys.
b Why would using the sum of square residuals for each model not be a good way of determining which model

best fits the data set from which it is derived?
c Use the coefficient of determination for each model to suggest which model is a better fit.
15 The temperature, T °C, of a kettle t minutes after boiling is as follows.

t 2.5 5 7.5 10 12.5 15
T 84 67 59 49 41 38

a Fit a quadratic regression model to the data.
b Find the value of R2.
c Explain whether your value of R2 indicates that a quadratic model is a good fit for the data.

16 Alessandra and Zoe collect data on the number of visits to their business’s website every hour in a 24-hour period.
Alessandra fits a quadratic regression model to the data and gets a value of R² = 0.81.
Zoe fits a cubic regression model and gets a value of R² = 0.90.

Zoe claims that her model is a better fit to the data.
Explain why Zoe’s claim is not necessarily true.

9D Confidence intervals for the mean

 The concept of a confidence interval

In Section 9A, you learnt how to use sample data to find unbiased estimates of the
population mean and variance. These are called point estimates – each one is a single
number. They are very unlikely to be the real population mean and variance because
of natural variation when choosing samples – a different sample will give different
estimates.

Instead of finding a single value to estimate the population mean, it may be better to have an interval of values which is very likely to include the true mean. This is called a confidence interval. In this section, you will learn how to use the GDC to find confidence intervals in various situations.

For example, suppose that you measure a sample of six leaves from a certain tree and get
the following results (in cm):

6.2, 5.1, 7.3, 5.3, 8.1, 6.5

Based on this sample, the 95% confidence interval for the mean is found to be

5.21 < μ < 7.63. If you take a different sample, you will get a different confidence

interval. ‘95% confidence’ means that if lots of samples are taken and a confidence
interval calculated for each one, then 95% of those confidence intervals will contain
the true population mean.

You can choose a different confidence level. For the sample of lengths above, the 90%
confidence interval is 5.47 < μ < 7.37. You can see that both intervals are centred at
6.42, which is the sample mean, but that the second interval is narrower. The second
interval gives a more precise estimate for the population mean, but you can be less
confident that the true mean is contained within it.
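The two intervals quoted for the leaf data can be checked without a GDC. The sketch below assumes the standard t-interval form (sample mean plus or minus a critical value times s/√n, with the n − 1 divisor for s). The function name t_interval and the constants T_95 and T_90 are illustrative, and the critical values are taken from standard t tables for 5 degrees of freedom rather than from the text.

```python
import math
import statistics

def t_interval(data, t_crit):
    """Symmetric interval: sample mean +/- t_crit * s / sqrt(n)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    margin = t_crit * s / math.sqrt(n)
    return mean - margin, mean + margin

leaves = [6.2, 5.1, 7.3, 5.3, 8.1, 6.5]

# Assumed table values for the t-distribution with n - 1 = 5 degrees of freedom
T_95 = 2.5706  # two-sided 95% level
T_90 = 2.0150  # two-sided 90% level

ci_95 = t_interval(leaves, T_95)
ci_90 = t_interval(leaves, T_90)
print([round(v, 2) for v in ci_95])
print([round(v, 2) for v in ci_90])
```

Both intervals share the same centre because only the critical value changes between confidence levels; a smaller critical value gives a narrower interval.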

The diagram below shows a large number of 95% and 90% confidence intervals
calculated from samples of size 6 taken from a normal distribution with mean 10.
The intervals highlighted are the only ones that don’t contain the population mean.

[Figure: two panels, titled ‘95% confidence intervals’ and ‘90% confidence intervals’; both axes in each panel run from 0 to 20.]
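The long-run meaning of a confidence level can be explored with a small simulation. This is a rough sketch only: it draws samples of size 6 from a normal distribution with mean 10, as in the diagram, but the population standard deviation (sigma = 3), the seed, the number of trials and the function name coverage are all assumptions made here for illustration. The critical values are standard table values for 5 degrees of freedom.

```python
import math
import random
import statistics

def coverage(trials, n=6, mu=10.0, sigma=3.0, t_crit=2.5706, seed=1):
    """Fraction of simulated t-intervals that contain the true mean mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = statistics.mean(sample)
        margin = t_crit * statistics.stdev(sample) / math.sqrt(n)
        if mean - margin < mu < mean + margin:
            hits += 1
    return hits / trials

rate_95 = coverage(5000)                 # 95% intervals
rate_90 = coverage(5000, t_crit=2.0150)  # 90% intervals, same samples
print(rate_95, rate_90)
```

With enough trials, the two proportions should settle close to 0.95 and 0.90 respectively, which is what the confidence level describes.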

In this course, we only consider symmetric confidence intervals for the population
mean, which are centred on the sample mean. The width of the interval depends
on the size of the sample, the variance of the data, the chosen confidence level and
the distribution of the underlying population. The latter may or may not be known;
your GDC calculates confidence intervals assuming that the underlying population
distribution is normal.
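As a reference for how these factors enter, the standard form of a symmetric interval for a normal population with unknown variance is sketched below. The notation is assumed here rather than taken from the text: x̄ is the sample mean, s_{n-1} the unbiased estimate of the standard deviation, n the sample size, and t* the critical value of the t-distribution with n − 1 degrees of freedom for the chosen confidence level.

```latex
\bar{x} - t^{*}\,\frac{s_{n-1}}{\sqrt{n}} \;<\; \mu \;<\; \bar{x} + t^{*}\,\frac{s_{n-1}}{\sqrt{n}},
\qquad
\text{width} = 2\,t^{*}\,\frac{s_{n-1}}{\sqrt{n}}
```

A larger sample reduces the width through √n, more variable data increases it through s_{n-1}, and a higher confidence level increases it through t*.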

You are the Researcher
You already know how to use a sample to estimate the population variance – one
good estimate is s²ₙ₋₁. It turns out that finding a confidence interval for the variance
requires the use of the χ² distribution.


■ Confidence interval for the mean of a normal population with unknown variance

In this case, you need to construct a t-interval.

Tip: The interval 16200 < μ < 25300 found in Worked Example 9.9 can also be written using set notation as μ ∈ (16200, 25300).

KEY POINT 9.6

Use a t-interval when the population variance is unknown.
You need to assume that the underlying population follows a normal distribution.

In Section 15B of the Mathematics: applications and interpretation SL book, you learnt
about t-tests, which you will revisit in the next section. They are based on the same
distribution, called Student’s t-distribution, as the calculation of t-intervals.

WORKED EXAMPLE 9.9

A random sample of eight lightbulbs from a particular manufacturer was tested to determine
their lifetime. The results, in thousands of hours, were

12.3, 21.7, 18.2, 31.5, 22.8, 16.0, 28.8, 14.5

It can be assumed that the population of lifetimes is normally distributed. Find a 90%
confidence interval for the mean lifetime of lightbulbs produced by this manufacturer.

State which interval you are going to find. Since you have not been told the population variance, use the t-interval.
You need to enter the data into a list and select the confidence level.

t-interval on GDC:
16200 hours < μ < 25300 hours
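The GDC output can be checked by hand, assuming the standard t-interval form. The summary values below are computed from the eight data values and are not given in the text: the sample mean is 20.725, s_{n-1} ≈ 6.814 (both in thousands of hours), and the critical value t* ≈ 1.895 is the table value for 7 degrees of freedom at the 90% level.

```latex
\bar{x} \pm t^{*}\,\frac{s_{n-1}}{\sqrt{n}}
= 20.725 \pm 1.895 \times \frac{6.814}{\sqrt{8}}
\approx 20.725 \pm 4.565
\quad\Rightarrow\quad
16.2 < \mu < 25.3 \ \text{(thousand hours, 3 s.f.)}
```

This agrees with the interval 16200 hours < μ < 25300 hours.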

Sometimes the data has been summarized for you, and you can enter those summary
statistics into the GDC to find a confidence interval.

