Schaum's Outline Probability and Statistics (3rd Edition)

by Murray R. Spiegel, John Schiller and R. Alu Srinivasan

292 CHAPTER 8 Curve Fitting, Regression, and Correlation

8.32. Use the formula of Problem 8.31 to obtain the linear correlation coefficient for the data of Problem 8.11.

From Table 8-4,

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} = \frac{(12)(54{,}107) - (800)(811)}{\sqrt{[(12)(53{,}418) - (800)^2][(12)(54{,}849) - (811)^2]}} = 0.7027$$

as in Problems 8.26(b), 8.28, and 8.30.
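As a quick numerical check, the product-moment formula of Problem 8.31 can be sketched in a few lines of Python (the helper name is ours, not the book's):

```python
from math import sqrt

def pearson_from_sums(n, sx, sy, sxy, sxx, syy):
    """Pearson r from the raw sums n, Σx, Σy, Σxy, Σx², Σy²."""
    num = n * sxy - sx * sy
    den = sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

# Sums from Table 8-4 (the father/son height data of Problem 8.11)
r = pearson_from_sums(12, 800, 811, 54107, 53418, 54849)
print(round(r, 4))  # ≈ 0.7027
```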

Generalized correlation coefficient

8.33. (a) Find the linear correlation coefficient between the variables x and y of Problem 8.16. (b) Find a non-
linear correlation coefficient between these variables, assuming the parabolic relationship obtained in
Problem 8.16. (c) Explain the difference between the correlation coefficients obtained in (a) and (b).
(d) What percentage of the total variation remains unexplained by the assumption of parabolic relationship
between x and y?

(a) Using the calculations in Table 8-9 and the added fact that $\sum y^2 = 290.52$, we find

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} = \frac{(8)(230.42) - (42.2)(46.4)}{\sqrt{[(8)(291.20) - (42.2)^2][(8)(290.52) - (46.4)^2]}} = -0.3743$$

(b) From Table 8-9, $\bar{y} = (\sum y)/n = 46.4/8 = 5.80$. Then

Total variation $= \sum (y - \bar{y})^2 = 21.40$

From Table 8-10,

Explained variation $= \sum (y_{\text{est}} - \bar{y})^2 = 21.02$

Therefore,

$$r^2 = \frac{\text{explained variation}}{\text{total variation}} = \frac{21.02}{21.40} = 0.9822 \qquad\text{and}\qquad r = 0.9911$$

(c) The fact that part (a) shows a linear correlation coefficient of only −0.3743 indicates practically no linear relationship between x and y. However, there is a very good nonlinear relationship supplied by the parabola of Problem 8.16, as is indicated by the fact that the correlation coefficient in (b) is very nearly 1.

(d) $$\frac{\text{Unexplained variation}}{\text{Total variation}} = 1 - r^2 = 1 - 0.9822 = 0.0178$$

Therefore, 1.78% of the total variation remains unexplained. This could be due to random fluctuations
or to an additional variable that has not been considered.
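The variation-ratio computation in (b) and (d) can be sketched directly from the two aggregate values quoted from Tables 8-9 and 8-10 (the function name is illustrative):

```python
def generalized_corr(explained_variation, total_variation):
    """Generalized r: r² = explained variation / total variation."""
    r2 = explained_variation / total_variation
    return r2, r2 ** 0.5

r2, r = generalized_corr(21.02, 21.40)   # aggregates from Problem 8.33(b)
unexplained_pct = 100 * (1 - r2)         # percentage left unexplained, part (d)
print(round(r2, 4), round(r, 4))         # ≈ 0.9822 0.9911
```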

8.34. Find (a) $s_y$ and (b) $s_{y.x}$ for the data of Problem 8.16.

(a) From Problem 8.33(b), $\sum (y - \bar{y})^2 = 21.40$. Then the standard deviation of y is

$$s_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n}} = \sqrt{\frac{21.40}{8}} = 1.636 \quad\text{or}\quad 1.64$$

(b) First method
Using (a) and Problem 8.33(b), the standard error of estimate of y on x is

$$s_{y.x} = s_y\sqrt{1 - r^2} = 1.636\sqrt{1 - (0.9911)^2} = 0.218 \quad\text{or}\quad 0.22$$


Second method
Using Problem 8.33,

$$s_{y.x} = \sqrt{\frac{\sum (y - y_{\text{est}})^2}{n}} = \sqrt{\frac{\text{unexplained variation}}{n}} = \sqrt{\frac{21.40 - 21.02}{8}} = 0.218 \quad\text{or}\quad 0.22$$

Third method
Using Problem 8.16 and the additional calculation $\sum y^2 = 290.52$, we have

$$s_{y.x} = \sqrt{\frac{\sum y^2 - a\sum y - b\sum xy - c\sum x^2 y}{n}} = 0.218 \quad\text{or}\quad 0.22$$

8.35. Explain how you would determine a multiple correlation coefficient for the variables in Problem 8.19.

Since z is determined from x and y, we are interested in the multiple correlation coefficient of z on x and y.
To obtain this, we see from Problem 8.19 that

Unexplained variation $= \sum (z - z_{\text{est}})^2 = (64 - 64.414)^2 + \cdots + (68 - 65.920)^2 = 258.88$

Total variation $= \sum (z - \bar{z})^2 = \sum z^2 - n\bar{z}^2 = 48{,}139 - 12(62.75)^2 = 888.25$

Explained variation $= 888.25 - 258.88 = 629.37$

Then

$$\text{Multiple correlation coefficient of } z \text{ on } x \text{ and } y = \sqrt{\frac{\text{explained variation}}{\text{total variation}}} = \sqrt{\frac{629.37}{888.25}} = 0.8418$$

It should be mentioned that if we were to consider the regression of x on y and z, the multiple correlation
coefficient of x on y and z would in general be different from the above value.

Rank correlation
8.36. Derive Spearman’s rank correlation formula (36), page 271.

Here we are considering n x values (e.g., weights) and n corresponding y values (e.g., heights). Let $x_j$ be the rank given to the jth x value, and $y_j$ the rank given to the jth y value. The ranks are the integers 1 through n. The mean of the $x_j$ is then

$$\bar{x} = \frac{1 + 2 + \cdots + n}{n} = \frac{n(n+1)/2}{n} = \frac{n+1}{2}$$

while the variance is

$$s_x^2 = \overline{x^2} - \bar{x}^2 = \frac{1^2 + 2^2 + \cdots + n^2}{n} - \left(\frac{n+1}{2}\right)^2 = \frac{n(n+1)(2n+1)/6}{n} - \left(\frac{n+1}{2}\right)^2 = \frac{n^2 - 1}{12}$$

using the results 1 and 2 of Appendix A. Similarly, the mean $\bar{y}$ and variance $s_y^2$ are equal to $(n+1)/2$ and $(n^2-1)/12$, respectively.

Now if $d_j = x_j - y_j$ are the deviations between the ranks, the variance of the deviations, $s_d^2$, is given in terms of $s_x^2$, $s_y^2$, and the correlation coefficient between ranks by

$$s_d^2 = s_x^2 + s_y^2 - 2r_{\text{rank}}\,s_x s_y$$


Then

$$(1)\qquad r_{\text{rank}} = \frac{s_x^2 + s_y^2 - s_d^2}{2 s_x s_y}$$

Since $\bar{d} = 0$, $s_d^2 = (\sum d^2)/n$ and (1) becomes

$$(2)\qquad r_{\text{rank}} = \frac{(n^2-1)/12 + (n^2-1)/12 - \left(\sum d^2\right)/n}{(n^2-1)/6} = 1 - \frac{6\sum d^2}{n(n^2-1)}$$

8.37. Table 8-17 shows how 10 students were ranked according to their achievements in both the laboratory and
lecture portions of a biology course. Find the coefficient of rank correlation.

Table 8-17

Laboratory 8 3 9 2 7 10 4 6 1 5

Lecture 9 5 10 1 8 7 3 4 2 6

The difference of ranks d in laboratory and lecture for each student is given in Table 8-18. Also given in the table are $d^2$ and $\sum d^2$.

Table 8-18

Difference of Ranks (d):  −1  −2  −1  1  −1  3  1  2  −1  −1
d²:                        1   4   1  1   1  9  1  4   1   1

so that $\sum d^2 = 24$. Then

$$r_{\text{rank}} = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(24)}{10(10^2 - 1)} = 0.8545$$

indicating that there is a marked relationship between achievements in laboratory and lecture.
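Formula (36), derived in Problem 8.36, can be checked on these ranks with a short Python sketch (the helper name is ours):

```python
def spearman_rank(rank_x, rank_y):
    """Spearman's formula: r_rank = 1 - 6·Σd² / (n(n² - 1))."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Ranks from Table 8-17
lab     = [8, 3, 9, 2, 7, 10, 4, 6, 1, 5]
lecture = [9, 5, 10, 1, 8, 7, 3, 4, 2, 6]
print(round(spearman_rank(lab, lecture), 4))  # 0.8545
```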

8.38. Calculate the coefficient of rank correlation for the data of Problem 8.11, and compare your result with the
correlation coefficient obtained by other methods.
Arranged in ascending order of magnitude, the fathers’ heights are
(1) 62, 63, 64, 65, 66, 67, 67, 68, 68, 69, 70, 71
Since the 6th and 7th places in this array represent the same height (67 inches), we assign a mean rank 6.5 to
both these places. Similarly, the 8th and 9th places are assigned the rank 8.5. Therefore, the fathers’ heights are
assigned the ranks
(2) 1, 2, 3, 4, 5, 6.5, 6.5, 8.5, 8.5, 10, 11, 12
Similarly, the sons' heights arranged in ascending order of magnitude are
(3) 65, 65, 66, 66, 67, 68, 68, 68, 68, 69, 70, 71
and since the 6th, 7th, 8th, and 9th places represent the same height (68 inches), we assign the mean rank 7.5 = (6 + 7 + 8 + 9)/4 to these places. Therefore, the sons' heights are assigned the ranks
(4) 1.5, 1.5, 3.5, 3.5, 5, 7.5, 7.5, 7.5, 7.5, 10, 11, 12
Using the correspondences (1) and (2), (3) and (4), Table 8-3 becomes Table 8-19.

Table 8-19

Rank of Father 4 2 6.5 3 8.5 1 11 5 8.5 6.5 10 12

Rank of Son 7.5 3.5 7.5 1.5 10 3.5 7.5 1.5 12 5 7.5 11


The differences in ranks d, and the computations of $d^2$ and $\sum d^2$, are shown in Table 8-20.

Table 8-20

d:   −3.5   −1.5  −1.0   1.5   −1.5  −2.5   3.5    3.5   −3.5   1.5   2.5   1.0
d²:  12.25   2.25  1.00  2.25   2.25  6.25  12.25  12.25  12.25  2.25  6.25  1.00

so that $\sum d^2 = 72.50$. Then

$$r_{\text{rank}} = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(72.50)}{12(12^2 - 1)} = 0.7465$$

which agrees well with the value r = 0.7027 obtained in Problem 8.26(b).
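The tie-handling step (assigning mean ranks) can be automated. The sketch below recovers the paired heights from correspondences (1)–(4) above and reproduces $\sum d^2 = 72.5$; the helper is illustrative, not from the text:

```python
def mean_ranks(values):
    """Assign ranks 1..n, giving tied values the mean of their positions."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend j over the run of equal values
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# Paired heights of Problem 8.11, recovered from the ranks above
fathers = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]
sons    = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]

rf, rs = mean_ranks(fathers), mean_ranks(sons)
n = len(rf)
d2 = sum((a - b) ** 2 for a, b in zip(rf, rs))
r_rank = 1 - 6 * d2 / (n * (n * n - 1))
print(d2, round(r_rank, 4))  # 72.5 0.7465
```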

Probability interpretation of regression and correlation
8.39. Derive (39) from (37).

Assume that the regression equation is

$$y = E(Y \mid X = x) = \alpha + \beta x$$

For the least-squares regression line we must consider

$$E\{[Y - (\alpha + \beta X)]^2\} = E\{[(Y - \mu_Y) - \beta(X - \mu_X) + (\mu_Y - \beta\mu_X - \alpha)]^2\}$$
$$= E[(Y - \mu_Y)^2] + \beta^2 E[(X - \mu_X)^2] - 2\beta E[(X - \mu_X)(Y - \mu_Y)] + (\mu_Y - \beta\mu_X - \alpha)^2$$
$$= \sigma_Y^2 + \beta^2\sigma_X^2 - 2\beta\sigma_{XY} + (\mu_Y - \beta\mu_X - \alpha)^2$$

where we have used $E(X - \mu_X) = 0$, $E(Y - \mu_Y) = 0$.

Denoting the last expression by $F(\alpha, \beta)$, we have

$$\frac{\partial F}{\partial \alpha} = -2(\mu_Y - \beta\mu_X - \alpha), \qquad \frac{\partial F}{\partial \beta} = 2\beta\sigma_X^2 - 2\sigma_{XY} - 2\mu_X(\mu_Y - \beta\mu_X - \alpha)$$

Setting these equal to zero, which is a necessary condition for $F(\alpha, \beta)$ to be a minimum, we find

$$\mu_Y = \alpha + \beta\mu_X, \qquad \beta\sigma_X^2 = \sigma_{XY}$$

Therefore, if $y = \alpha + \beta x$, then $y - \mu_Y = \beta(x - \mu_X)$, or

$$y - \mu_Y = \frac{\sigma_{XY}}{\sigma_X^2}(x - \mu_X) \qquad\text{or}\qquad \frac{y - \mu_Y}{\sigma_Y} = \rho\left(\frac{x - \mu_X}{\sigma_X}\right)$$

The similarity of the above proof for populations, using expectations, to the corresponding proof for
samples, using summations, should be noted. In general, results for samples have analogous results for
populations and conversely.

8.40. The joint density function of the random variables X and Y is

$$f(x, y) = \begin{cases} \dfrac{2}{3}(x + 2y) & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Find the least-squares regression curve of (a) Y on X, (b) X on Y.

(a) The marginal density function of X is

$$f_1(x) = \int_0^1 \frac{2}{3}(x + 2y)\,dy = \frac{2}{3}(x + 1)$$


for 0 ≤ x ≤ 1, and $f_1(x) = 0$ otherwise. Hence, for 0 ≤ x ≤ 1, the conditional density of Y given X is

$$f_2(y \mid x) = \frac{f(x, y)}{f_1(x)} = \begin{cases} \dfrac{x + 2y}{x + 1} & 0 \le y \le 1 \\ 0 & y < 0 \text{ or } y > 1 \end{cases}$$

and the least-squares regression curve of Y on X is given by

$$y = E(Y \mid X = x) = \int_{-\infty}^{\infty} y\,f_2(y \mid x)\,dy = \int_0^1 y\left(\frac{x + 2y}{x + 1}\right)dy = \frac{3x + 4}{6x + 6}$$

Neither $f_2(y \mid x)$ nor the least-squares regression curve is defined when x < 0 or x > 1.

(b) For 0 ≤ y ≤ 1, the marginal density function of Y is

$$f_2(y) = \int_0^1 \frac{2}{3}(x + 2y)\,dx = \frac{1}{3}(1 + 4y)$$

Hence, for 0 ≤ y ≤ 1, the conditional density of X given Y is

$$f_1(x \mid y) = \frac{f(x, y)}{f_2(y)} = \begin{cases} \dfrac{2x + 4y}{1 + 4y} & 0 \le x \le 1 \\ 0 & x < 0 \text{ or } x > 1 \end{cases}$$

and the least-squares regression curve of X on Y is given by

$$x = E(X \mid Y = y) = \int_{-\infty}^{\infty} x\,f_1(x \mid y)\,dx = \int_0^1 x\left(\frac{2x + 4y}{1 + 4y}\right)dx = \frac{2 + 6y}{3 + 12y}$$

Neither $f_1(x \mid y)$ nor the least-squares regression curve is defined when y < 0 or y > 1.

Note that the two regression curves $y = (3x + 4)/(6x + 6)$ and $x = (2 + 6y)/(3 + 12y)$ are different.

8.41. Find (a) $\bar{X}$, (b) $\bar{Y}$, (c) $\sigma_X^2$, (d) $\sigma_Y^2$, (e) $\sigma_{XY}$, (f) $\rho$ for the distribution in Problem 8.40.

(a) $$\bar{X} = \int_{x=0}^{1}\int_{y=0}^{1} x\left[\frac{2}{3}(x + 2y)\right]dx\,dy = \frac{5}{9}$$

(b) $$\bar{Y} = \int_{x=0}^{1}\int_{y=0}^{1} y\left[\frac{2}{3}(x + 2y)\right]dx\,dy = \frac{11}{18}$$

(c) $$\overline{X^2} = \int_{x=0}^{1}\int_{y=0}^{1} x^2\left[\frac{2}{3}(x + 2y)\right]dx\,dy = \frac{7}{18}$$

Then $$\sigma_X^2 = \overline{X^2} - \bar{X}^2 = \frac{7}{18} - \left(\frac{5}{9}\right)^2 = \frac{13}{162}$$

(d) $$\overline{Y^2} = \int_{x=0}^{1}\int_{y=0}^{1} y^2\left[\frac{2}{3}(x + 2y)\right]dx\,dy = \frac{4}{9}$$

Then $$\sigma_Y^2 = \overline{Y^2} - \bar{Y}^2 = \frac{4}{9} - \left(\frac{11}{18}\right)^2 = \frac{23}{324}$$

(e) $$\overline{XY} = \int_{x=0}^{1}\int_{y=0}^{1} xy\left[\frac{2}{3}(x + 2y)\right]dx\,dy = \frac{1}{3}$$


Then $$\sigma_{XY} = \overline{XY} - \bar{X}\,\bar{Y} = \frac{1}{3} - \left(\frac{5}{9}\right)\left(\frac{11}{18}\right) = -\frac{1}{162}$$

(f) $$\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{-1/162}{\sqrt{13/162}\,\sqrt{23/324}} = -0.0818$$
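Because $f(x, y)$ is a polynomial on the unit square, each moment satisfies $E[X^a Y^b] = \frac{2}{3}\left[\frac{1}{(a+2)(b+1)} + \frac{2}{(a+1)(b+2)}\right]$, which lets all six answers be verified exactly. A sketch using exact fractions (the helper name is ours):

```python
from fractions import Fraction as F

def moment(a, b):
    """E[X^a Y^b] for f(x,y) = (2/3)(x + 2y) on the unit square, exactly."""
    return F(2, 3) * (F(1, (a + 2) * (b + 1)) + F(2, (a + 1) * (b + 2)))

mx, my = moment(1, 0), moment(0, 1)   # 5/9 and 11/18
var_x = moment(2, 0) - mx ** 2        # 13/162
var_y = moment(0, 2) - my ** 2        # 23/324
cov   = moment(1, 1) - mx * my        # -1/162
rho = float(cov) / (float(var_x) ** 0.5 * float(var_y) ** 0.5)
print(mx, my, var_x, var_y, cov, round(rho, 4))
```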

Note that the linear correlation coefficient is small. This is to be expected from observation of the least-
squares regression lines obtained in Problem 8.42.

8.42. Write the least-squares regression lines of (a) Y on X, (b) X on Y for Problem 8.40.

(a) The regression line of Y on X is $\dfrac{y - \bar{Y}}{\sigma_Y} = \rho\left(\dfrac{x - \bar{X}}{\sigma_X}\right)$ or

$$y - \bar{Y} = \frac{\sigma_{XY}}{\sigma_X^2}(x - \bar{X}) \qquad\text{or}\qquad y - \frac{11}{18} = \frac{-1/162}{13/162}\left(x - \frac{5}{9}\right)$$

(b) The regression line of X on Y is $\dfrac{x - \bar{X}}{\sigma_X} = \rho\left(\dfrac{y - \bar{Y}}{\sigma_Y}\right)$ or

$$x - \bar{X} = \frac{\sigma_{XY}}{\sigma_Y^2}(y - \bar{Y}) \qquad\text{or}\qquad x - \frac{5}{9} = \frac{-1/162}{23/324}\left(y - \frac{11}{18}\right)$$

Sampling theory of regression

8.43. In Problem 8.11 we found the regression equation of y on x to be y ϭ 35.82 ϩ 0.476x. Test the hypothe-
sis at a 0.05 significance level that the regression coefficient of the population regression equation is as low
as 0.180.

$$t = \frac{b - \beta}{s_{y.x}/s_x}\sqrt{n - 2} = \frac{0.476 - 0.180}{1.28/2.66}\sqrt{12 - 2} = 1.95$$

since $s_{y.x} = 1.28$ (computed in Problem 8.22) and $s_x = \sqrt{\overline{x^2} - \bar{x}^2} = 2.66$ from Problem 8.11.
On the basis of a one-tailed test of Student's distribution at a 0.05 level, we would reject the hypothesis that the regression coefficient is as low as 0.180 if $t > t_{0.95} = 1.81$ for 12 − 2 = 10 degrees of freedom. Therefore, we can reject the hypothesis.

8.44. Find 95% confidence limits for the regression coefficient of Problem 8.43.

$$b = \beta + \frac{t}{\sqrt{n - 2}}\left(\frac{s_{y.x}}{s_x}\right)$$

Then 95% confidence limits for β (obtained by putting $t = \pm t_{0.975} = \pm 2.23$ for 12 − 2 = 10 degrees of freedom) are given by

$$b \pm \frac{2.23}{\sqrt{12 - 2}}\left(\frac{s_{y.x}}{s_x}\right) = 0.476 \pm \frac{2.23}{\sqrt{10}}\left(\frac{1.28}{2.66}\right) = 0.476 \pm 0.340$$

i.e., we are 95% confident that β lies between 0.136 and 0.816.

8.45. In Problem 8.11, find 95% confidence limits for the heights of sons whose fathers’ heights are (a) 65.0,
(b) 70.0 inches.

Since $t_{0.975} = 2.23$ for 12 − 2 = 10 degrees of freedom, the 95% confidence limits for $y_p$ are

$$y_0 \pm \frac{2.23}{\sqrt{n - 2}}\,s_{y.x}\sqrt{n + 1 + \frac{n(x_0 - \bar{x})^2}{s_x^2}}$$

where $y_0 = 35.82 + 0.476x_0$ (Problem 8.11), $s_{y.x} = 1.28$, $s_x = 2.66$ (Problem 8.43), and n = 12.


(a) If $x_0 = 65.0$, $y_0 = 66.76$ inches. Also, $(x_0 - \bar{x})^2 = (65.0 - 800/12)^2 = 2.78$. Then the 95% confidence limits are

$$66.76 \pm \frac{2.23(1.28)}{\sqrt{10}}\sqrt{12 + 1 + \frac{12(2.78)}{(2.66)^2}} = 66.76 \pm 3.80 \text{ inches}$$

i.e., we can be about 95% confident that the sons' heights are between 63.0 and 70.6 inches.

(b) If $x_0 = 70.0$, $y_0 = 69.14$ inches. Also, $(x_0 - \bar{x})^2 = (70.0 - 800/12)^2 = 11.11$. Then the 95% confidence limits are computed to be 69.14 ± 5.09 inches, i.e., we can be about 95% confident that the sons' heights are between 64.1 and 74.2 inches.

Note that for large values of n, the 95% confidence limits are given approximately by $y_0 \pm 1.96\,s_{y.x}$ or $y_0 \pm 2s_{y.x}$, provided that $x_0 - \bar{x}$ is not too large. This agrees with the approximate results mentioned on page 269. The methods of this problem hold regardless of the size of n or $x_0 - \bar{x}$, i.e., the sampling methods are exact for a normal population.
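The prediction-limit formula used in this problem can be sketched as follows (function and argument names are ours):

```python
from math import sqrt

def prediction_limits(x0, a, b, s_yx, s_x, x_bar, n, t_crit):
    """95% limits for an individual predicted y at x0, per Problem 8.45."""
    y0 = a + b * x0
    half = t_crit * (s_yx / sqrt(n - 2)) * sqrt(n + 1 + n * (x0 - x_bar) ** 2 / s_x ** 2)
    return y0 - half, y0 + half

# Values quoted from Problems 8.11 and 8.43
lo, hi = prediction_limits(65.0, 35.82, 0.476, 1.28, 2.66, 800 / 12, 12, 2.23)
print(round(lo, 1), round(hi, 1))  # 63.0 70.6
```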

8.46. In Problem 8.11, find 95% confidence limits for the mean heights of sons whose fathers’ heights are
(a) 65.0, (b) 70.0 inches.

Since $t_{0.975} = 2.23$ for 10 degrees of freedom, the 95% confidence limits for $\bar{y}_p$ are

$$y_0 \pm \frac{2.23}{\sqrt{n - 2}}\,s_{y.x}\sqrt{1 + \frac{(x_0 - \bar{x})^2}{s_x^2}}$$

where $y_0 = 35.82 + 0.476x_0$ (Problem 8.11) and $s_{y.x} = 1.28$ (Problem 8.43).

(a) If $x_0 = 65.0$, we find [compare Problem 8.45(a)] the 95% confidence limits 66.76 ± 1.07 inches, i.e., we can be about 95% confident that the mean height of all sons whose fathers' heights are 65.0 inches will lie between 65.7 and 67.8 inches.

(b) If $x_0 = 70.0$, we find [compare Problem 8.45(b)] the 95% confidence limits 69.14 ± 1.45 inches, i.e., we can be about 95% confident that the mean height of all sons whose fathers' heights are 70.0 inches will lie between 67.7 and 70.6 inches.

Sampling theory of correlation
8.47. A correlation coefficient based on a sample of size 18 was computed to be 0.32. Can we conclude at a significance level of (a) 0.05, (b) 0.01 that the corresponding population correlation coefficient is significantly greater than zero?

We wish to decide between the hypotheses $H_0$: ρ = 0 and $H_1$: ρ > 0.

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.32\sqrt{18 - 2}}{\sqrt{1 - (0.32)^2}} = 1.35$$

(a) On the basis of a one-tailed test of Student's distribution at a 0.05 level, we would reject $H_0$ if $t > t_{0.95} = 1.75$ for 18 − 2 = 16 degrees of freedom. Therefore, we cannot reject $H_0$ at a 0.05 level.

(b) Since we cannot reject H0 at a 0.05 level, we certainly cannot reject it at a 0.01 level.

8.48. What is the minimum sample size necessary in order that we may conclude that a correlation coefficient
of 0.32 is significantly greater than zero at a 0.05 level?

At a 0.05 level using a one-tailed test of Student's distribution, the minimum value of n must be such that

$$\frac{0.32\sqrt{n - 2}}{\sqrt{1 - (0.32)^2}} = t_{0.95} \quad\text{for } n - 2 \text{ degrees of freedom}$$

For n = 26, ν = 24 degrees of freedom: $t_{0.95} = 1.71$ and $t = 0.32\sqrt{24}/\sqrt{1 - (0.32)^2} = 1.65$.
For n = 27, ν = 25 degrees of freedom: $t_{0.95} = 1.71$ and $t = 0.32\sqrt{25}/\sqrt{1 - (0.32)^2} = 1.69$.
For n = 28, ν = 26 degrees of freedom: $t_{0.95} = 1.71$ and $t = 0.32\sqrt{26}/\sqrt{1 - (0.32)^2} = 1.72$.

Then the minimum sample size is n = 28.
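The trial-and-error search can be automated; the sketch below is ours, and the one-tail $t_{0.95}$ critical values in the lookup table are taken from a standard t-table (an assumption, since the book's own table is not reproduced here):

```python
from math import sqrt

# One-tail t_0.95 critical values from a standard t-table, df = 22..28
T95 = {22: 1.717, 23: 1.714, 24: 1.711, 25: 1.708, 26: 1.706, 27: 1.703, 28: 1.701}

def min_n_for_significance(r, t_table=T95):
    """Smallest n with t = r·√(n-2)/√(1-r²) exceeding t_0.95 at n-2 df."""
    for df in sorted(t_table):
        t = r * sqrt(df) / sqrt(1 - r * r)
        if t > t_table[df]:
            return df + 2
    return None  # not reached within the table's range

print(min_n_for_significance(0.32))  # 28
```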


8.49. A correlation coefficient based on a sample of size 24 was computed to be r = 0.75. Can we reject the hypothesis that the population correlation coefficient is as small as (a) ρ = 0.60, (b) ρ = 0.50, at a 0.05 significance level?

(a) $$Z = 1.1513\log\left(\frac{1 + 0.75}{1 - 0.75}\right) = 0.9730, \qquad \mu_Z = 1.1513\log\left(\frac{1 + 0.60}{1 - 0.60}\right) = 0.6932,$$

$$\sigma_Z = \frac{1}{\sqrt{n - 3}} = \frac{1}{\sqrt{21}} = 0.2182$$

The standardized variable is then

$$z = \frac{Z - \mu_Z}{\sigma_Z} = \frac{0.9730 - 0.6932}{0.2182} = 1.28$$

At a 0.05 level of significance using a one-tailed test of the normal distribution, we would reject the
hypothesis only if z were greater than 1.64. Therefore, we cannot reject the hypothesis that the population
correlation coefficient is as small as 0.60.

(b) If ρ = 0.50, $\mu_Z = 1.1513\log 3 = 0.5493$ and $z = (0.9730 - 0.5493)/0.2182 = 1.94$. Therefore, we can reject the hypothesis that the population correlation coefficient is as small as ρ = 0.50 at a 0.05 level of significance.

8.50. The correlation coefficient between physics and mathematics final grades for a group of 21 students was
computed to be 0.80. Find 95% confidence limits for this coefficient.

Since r = 0.80 and n = 21, 95% confidence limits for $\mu_Z$ are given by

$$Z \pm 1.96\sigma_Z = 1.1513\log\left(\frac{1 + r}{1 - r}\right) \pm 1.96\left(\frac{1}{\sqrt{n - 3}}\right) = 1.0986 \pm 0.4620$$

Then $\mu_Z$ has the 95% confidence interval 0.6366 to 1.5606.

If $\mu_Z = 1.1513\log\left(\dfrac{1 + \rho}{1 - \rho}\right) = 0.6366$, then ρ = 0.5626.

If $\mu_Z = 1.1513\log\left(\dfrac{1 + \rho}{1 - \rho}\right) = 1.5606$, then ρ = 0.9155.

Therefore, the 95% confidence limits for ρ are 0.56 and 0.92.
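Since $1.1513\log_{10}\frac{1+r}{1-r}$ is just $\operatorname{arctanh} r$, the limits can be checked with Python's hyperbolic functions. Note that $1.0986 - 0.4620 = 0.6366$, so the lower limit for ρ works out to $\tanh(0.6366) \approx 0.56$ (the helper name below is ours):

```python
from math import sqrt, tanh, atanh

def rho_confidence(r, n, z_crit=1.96):
    """95% limits for ρ via Fisher's Z = atanh(r) = 1.1513·log10((1+r)/(1-r))."""
    z = atanh(r)
    half = z_crit / sqrt(n - 3)
    return tanh(z - half), tanh(z + half)

lo, hi = rho_confidence(0.80, 21)
print(round(lo, 2), round(hi, 2))  # 0.56 0.92
```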

8.51. Two correlation coefficients obtained from samples of size $n_1 = 28$ and $n_2 = 35$ were computed to be $r_1 = 0.50$ and $r_2 = 0.30$, respectively. Is there a significant difference between the two coefficients at a 0.05 level?

$$Z_1 = 1.1513\log\left(\frac{1 + r_1}{1 - r_1}\right) = 0.5493, \qquad Z_2 = 1.1513\log\left(\frac{1 + r_2}{1 - r_2}\right) = 0.3095$$

and

$$\sigma_{Z_1 - Z_2} = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}} = 0.2669$$

We wish to decide between the hypotheses $H_0$: $\mu_{Z_1} = \mu_{Z_2}$ and $H_1$: $\mu_{Z_1} \ne \mu_{Z_2}$.

Under the hypothesis $H_0$,

$$z = \frac{Z_1 - Z_2 - (\mu_{Z_1} - \mu_{Z_2})}{\sigma_{Z_1 - Z_2}} = \frac{0.5493 - 0.3095 - 0}{0.2669} = 0.8985$$

Using a two-tailed test of the normal distribution, we would reject $H_0$ only if z > 1.96 or z < −1.96. Therefore, we cannot reject $H_0$, and we conclude that the results are not significantly different at a 0.05 level.


Miscellaneous problems
8.52. Prove formula (25), page 269.

For the least-squares line we have, from Problems 8.20 and 8.21,

$$s_{y.x}^2 = \frac{\sum (y - \bar{y})^2}{n} - b\,\frac{\sum (x - \bar{x})(y - \bar{y})}{n}$$

But by definition,

$$\frac{\sum (y - \bar{y})^2}{n} = s_y^2, \qquad \frac{\sum (x - \bar{x})(y - \bar{y})}{n} = s_{xy}$$

and, by (6) on page 267,

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{s_{xy}}{s_x^2}$$

Hence

$$s_{y.x}^2 = s_y^2 - \frac{s_{xy}^2}{s_x^2} = s_y^2\left[1 - \left(\frac{s_{xy}}{s_x s_y}\right)^2\right] = s_y^2(1 - r^2)$$

An analogous formula holds for the population (see Problem 8.54).

8.53. Prove that $E[(Y - \bar{Y})^2] = E[(Y - Y_{\text{est}})^2] + E[(Y_{\text{est}} - \bar{Y})^2]$ for the case of (a) a least-squares line, (b) a least-squares parabola.

We have

$$Y - \bar{Y} = (Y - Y_{\text{est}}) + (Y_{\text{est}} - \bar{Y})$$

Then

$$(Y - \bar{Y})^2 = (Y - Y_{\text{est}})^2 + (Y_{\text{est}} - \bar{Y})^2 + 2(Y - Y_{\text{est}})(Y_{\text{est}} - \bar{Y})$$

and so

$$E[(Y - \bar{Y})^2] = E[(Y - Y_{\text{est}})^2] + E[(Y_{\text{est}} - \bar{Y})^2] + 2E[(Y - Y_{\text{est}})(Y_{\text{est}} - \bar{Y})]$$

The required result follows if we can show that the last term is zero.

(a) For linear regression, $Y_{\text{est}} = \alpha + \beta X$. Then

$$E[(Y - Y_{\text{est}})(Y_{\text{est}} - \bar{Y})] = E[(Y - \alpha - \beta X)(\alpha + \beta X - \bar{Y})]$$
$$= (\alpha - \bar{Y})E(Y - \alpha - \beta X) + \beta E(XY - \alpha X - \beta X^2) = 0$$

because of the normal equations

$$E(Y - \alpha - \beta X) = 0, \qquad E(XY - \alpha X - \beta X^2) = 0$$

(Compare Problem 8.3.)
(b) For parabolic regression, $Y_{\text{est}} = \alpha + \beta X + \gamma X^2$. Then

$$E[(Y - Y_{\text{est}})(Y_{\text{est}} - \bar{Y})] = E[(Y - \alpha - \beta X - \gamma X^2)(\alpha + \beta X + \gamma X^2 - \bar{Y})]$$
$$= (\alpha - \bar{Y})E(Y - \alpha - \beta X - \gamma X^2) + \beta E[X(Y - \alpha - \beta X - \gamma X^2)] + \gamma E[X^2(Y - \alpha - \beta X - \gamma X^2)] = 0$$

because of the normal equations

$$E(Y - \alpha - \beta X - \gamma X^2) = 0, \quad E[X(Y - \alpha - \beta X - \gamma X^2)] = 0, \quad E[X^2(Y - \alpha - \beta X - \gamma X^2)] = 0$$

Compare equations (19), page 269.
The result can be extended to higher-order least-squares curves.


8.54. Prove that $\sigma_{Y.X}^2 = \sigma_Y^2(1 - \rho^2)$ for least-squares regression.

By definition of the generalized correlation coefficient r, together with Problem 8.53, we have for either the
linear or parabolic case

$$\rho^2 = \frac{E[(Y_{\text{est}} - \bar{Y})^2]}{E[(Y - \bar{Y})^2]} = 1 - \frac{E[(Y - Y_{\text{est}})^2]}{E[(Y - \bar{Y})^2]} = 1 - \frac{\sigma_{Y.X}^2}{\sigma_Y^2}$$

and the result follows at once.
The relation also holds for higher-order least-squares curves.

8.55. Show that for the case of linear regression the correlation coefficient as defined by (45) reduces to that
defined by (40).

The square of the correlation coefficient, i.e., the coefficient of determination, as given by (45) is in the case of
linear regression given by

$$(1)\qquad \rho^2 = \frac{E[(Y_{\text{est}} - \bar{Y})^2]}{E[(Y - \bar{Y})^2]} = \frac{E[(\alpha + \beta X - \bar{Y})^2]}{\sigma_Y^2}$$

But since $\bar{Y} = \alpha + \beta\bar{X}$,

$$(2)\qquad E[(\alpha + \beta X - \bar{Y})^2] = E[\beta^2(X - \bar{X})^2] = \beta^2 E[(X - \bar{X})^2] = \frac{\sigma_{XY}^2}{\sigma_X^4}\,\sigma_X^2 = \frac{\sigma_{XY}^2}{\sigma_X^2}$$

Then (1) becomes

$$(3)\qquad \rho^2 = \frac{\sigma_{XY}^2}{\sigma_X^2\sigma_Y^2} \qquad\text{or}\qquad \rho = \frac{\sigma_{XY}}{\sigma_X\sigma_Y}$$

as we were required to show. (The correct sign for ρ is included in $\sigma_{XY}$.)

8.56. Refer to Table 8-21. (a) Find a least-squares regression parabola fitting the data. (b) Compute the regres-
sion values (commonly called trend values) for the given years and compare with the actual values.
(c) Estimate the population in 1945. (d) Estimate the population in 1960 and compare with the actual
value, 179.3. (e) Estimate the population in 1840 and compare with the actual value, 17.1.

Table 8-21

Year 1850 1860 1870 1880 1890 1900 1910 1920 1930 1940 1950

U.S. Population 23.2 31.4 39.8 50.2 62.9 76.0 92.0 105.7 122.8 131.7 151.1
(millions)

Source: Bureau of the Census.

(a) Let the variables x and y denote, respectively, the year and the population during that year. The equation of
a least-squares parabola fitting the data is

$$(1)\qquad y = a + bx + cx^2$$

where a, b, and c are found from the normal equations

$$(2)\qquad \begin{aligned} \sum y &= an + b\sum x + c\sum x^2 \\ \sum xy &= a\sum x + b\sum x^2 + c\sum x^3 \\ \sum x^2 y &= a\sum x^2 + b\sum x^3 + c\sum x^4 \end{aligned}$$
It is convenient to locate the origin so that the middle year, 1900, corresponds to x ϭ 0, and to choose
a unit that makes the years 1910, 1920, 1930, 1940, 1950 and 1890, 1880, 1870, 1860, 1850 correspond


to 1, 2, 3, 4, 5 and −1, −2, −3, −4, −5, respectively. With this choice $\sum x$ and $\sum x^3$ are zero and equations (2) are simplified.

The work involved in computation can be arranged as in Table 8-22. The normal equations (2) become

$$(3)\qquad 11a + 110c = 886.8, \qquad 110b = 1429.8, \qquad 110a + 1958c = 9209.0$$

From the second equation in (3), b = 13.00; from the first and third equations, a = 76.64, c = 0.3974. Then the required equation is

$$(4)\qquad y = 76.64 + 13.00x + 0.3974x^2$$

where the origin, x = 0, is July 1, 1900, and the unit of x is 10 years.

(b) The trend values, obtained by substituting x ϭ Ϫ5, Ϫ4, Ϫ3, Ϫ2, Ϫ1, 0, 1, 2, 3, 4, 5 in (4), are shown in
Table 8-23 together with the actual values. It is seen that the agreement is good.

Table 8-22

Year    x      y      x²    x³     x⁴      xy       x²y
1850   −5    23.2    25   −125    625   −116.0    580.0
1860   −4    31.4    16    −64    256   −125.6    502.4
1870   −3    39.8     9    −27     81   −119.4    358.2
1880   −2    50.2     4     −8     16   −100.4    200.8
1890   −1    62.9     1     −1      1    −62.9     62.9
1900    0    76.0     0      0      0      0        0
1910    1    92.0     1      1      1     92.0     92.0
1920    2   105.7     4      8     16    211.4    422.8
1930    3   122.8     9     27     81    368.4   1105.2
1940    4   131.7    16     64    256    526.8   2107.2
1950    5   151.1    25    125    625    755.5   3777.5

Σx = 0, Σy = 886.8, Σx² = 110, Σx³ = 0, Σx⁴ = 1958, Σxy = 1429.8, Σx²y = 9209.0

Table 8-23

x              −5    −4    −3    −2    −1     0     1     2     3     4     5
Year          1850  1860  1870  1880  1890  1900  1910  1920  1930  1940  1950
Trend Value   21.6  31.0  41.2  52.2  64.0  76.6  90.0 104.2 119.2 135.0 151.6
Actual Value  23.2  31.4  39.8  50.2  62.9  76.0  92.0 105.7 122.8 131.7 151.1

(c) 1945 corresponds to x = 4.5, for which y = 76.64 + 13.00(4.5) + 0.3974(4.5)² = 143.2.

(d) 1960 corresponds to x = 6, for which y = 76.64 + 13.00(6) + 0.3974(6)² = 168.9. This does not agree too well with the actual value, 179.3.

(e) 1840 corresponds to x = −6, for which y = 76.64 + 13.00(−6) + 0.3974(−6)² = 12.9. This does not agree with the actual value, 17.1.
This example illustrates the fact that a relationship which is found to be satisfactory for a range of
values need not be satisfactory for an extended range of values.
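With the coded x (so that $\sum x = \sum x^3 = 0$), the normal equations decouple and the fit can be sketched in a few lines (helper name ours):

```python
def fit_parabola_coded(x, y):
    """Least-squares parabola y = a + bx + cx² when Σx = Σx³ = 0 (coded x)."""
    n = len(x)
    sy   = sum(y)
    sx2  = sum(t * t for t in x)
    sx4  = sum(t ** 4 for t in x)
    sxy  = sum(t * u for t, u in zip(x, y))
    sx2y = sum(t * t * u for t, u in zip(x, y))
    b = sxy / sx2                       # from  Σxy = b·Σx²
    # Solve  n·a + Σx²·c = Σy  and  Σx²·a + Σx⁴·c = Σx²y  for a and c
    c = (n * sx2y - sx2 * sy) / (n * sx4 - sx2 ** 2)
    a = (sy - sx2 * c) / n
    return a, b, c

x = list(range(-5, 6))  # 1850..1950 coded in 10-year units about 1900
y = [23.2, 31.4, 39.8, 50.2, 62.9, 76.0, 92.0, 105.7, 122.8, 131.7, 151.1]
a, b, c = fit_parabola_coded(x, y)
print(round(a, 2), round(b, 2), round(c, 4))  # 76.64 13.0 0.3974
```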


8.57. The average prices of stocks and bonds listed on the New York Stock Exchange during the years 1950
through 1959 are given in Table 8-24. (a) Find the correlation coefficient, (b) interpret the results.

Table 8-24

Year 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959

Average Price of 35.22 39.87 41.85 43.23 40.06 53.29 54.14 49.12 40.71 55.15
Stocks (dollars)

Average Price of 102.43 100.93 97.43 97.81 98.32 100.07 97.08 91.59 94.85 94.65
Bonds (dollars)

Source: New York Stock Exchange.

(a) Denoting by x and y the average prices of stocks and bonds, the calculation of the correlation coefficient
can be organized as in Table 8-25. Note that the year is used only to specify the corresponding values of
x and y.

Table 8-25

   x        y      x′ = x − x̄   y′ = y − ȳ     x′²      x′y′      y′²
 35.22   102.43     −10.04        4.91       100.80    −49.30    24.11
 39.87   100.93      −5.39        3.41        29.05    −18.38    11.63
 41.85    97.43      −3.41       −0.09        11.63      0.31     0.01
 43.23    97.81      −2.03        0.29         4.12     −0.59     0.08
 40.06    98.32      −5.20        0.80        27.04     −4.16     0.64
 53.29   100.07       8.03        2.55        64.48     20.48     6.50
 54.14    97.08       8.88       −0.44        78.85     −3.91     0.19
 49.12    91.59       3.86       −5.93        14.90    −22.89    35.16
 40.71    94.85      −4.55       −2.67        20.70     12.15     7.13
 55.15    94.65       9.89       −2.87        97.81    −28.38     8.24

Σx = 452.64, x̄ = 45.26; Σy = 975.16, ȳ = 97.52; Σx′² = 449.38, Σx′y′ = −94.67, Σy′² = 93.69

Then by the product-moment formula,

$$r = \frac{\sum x'y'}{\sqrt{\left(\sum x'^2\right)\left(\sum y'^2\right)}} = \frac{-94.67}{\sqrt{(449.38)(93.69)}} = -0.4614$$

(b) We conclude that there is some negative correlation between stock and bond prices (i.e., a tendency
for stock prices to go down when bond prices go up, and vice versa), although this relationship is not
marked.

Another method
Table 8-26 shows the ranks of the average prices of stocks and bonds for the years 1950 through 1959 in order of increasing prices. Also shown in the table are the differences in rank d and $\sum d^2$.


Table 8-26

Year                           1950 1951 1952 1953 1954 1955 1956 1957 1958 1959
Stock Prices in Order of Rank    1    2    5    6    3    8    9    7    4   10
Bond Prices in Order of Rank    10    9    5    6    7    8    4    1    3    2
Differences in Rank (d)         −9   −7    0    0   −4    0    5    6    1    8
d²                              81   49    0    0   16    0   25   36    1   64

so that $\sum d^2 = 272$. Then

$$r_{\text{rank}} = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(272)}{10(10^2 - 1)} = -0.6485$$

This compares favorably with the result of the first method.

8.58. Table 8-27 shows the frequency distributions of the final grades of 100 students in mathematics and
physics. With reference to this table determine (a) the number of students who received grades 70 through
79 in mathematics and 80 through 89 in physics, (b) the percentage of students with mathematics grades
below 70, (c) the number of students who received a grade of 70 or more in physics and less than 80 in
mathematics, (d) the percentage of students who passed at least one of the subjects assuming 60 to be the
minimum passing grade.

Table 8-27

                                MATHEMATICS GRADES
PHYSICS GRADES   40–49   50–59   60–69   70–79   80–89   90–99   TOTALS
    90–99                                  2       4       4       10
    80–89                          1       4       6       5       16
    70–79                          5      10       8       1       24
    60–69          1       4       9       5       2               21
    50–59          3       6       6       2                       17
    40–49          3       5       4                               12
   TOTALS          7      15      25      23      20      10      100

(a) Proceed down the column headed 70–79 (mathematics grade) to the row marked 80–89 (physics grade).
The entry 4 gives the required number of students.

(b) Total number of students with mathematics grades below 70
= (number with grades 40–49) + (number with grades 50–59) + (number with grades 60–69)
= 7 + 15 + 25 = 47

Percentage of students with mathematics grades below 70 = 47/100 = 47%.

(c) The required number of students is the total of the entries in Table 8-28, which represents part of Table 8-27.
Required number of students = 1 + 5 + 2 + 4 + 10 = 22.


Table 8-28

                  MATHEMATICS GRADES
PHYSICS GRADES      60–69    70–79
    90–99                      2
    80–89             1        4
    70–79             5       10

Table 8-29

                  MATHEMATICS GRADES
PHYSICS GRADES      40–49    50–59
    50–59             3        6
    40–49             3        5

(d) Referring to Table 8-29, which is taken from Table 8-27, it is seen that the number of students with grades below 60 in both mathematics and physics is 3 + 3 + 6 + 5 = 17. Then the number of students with grades 60 or over in either physics or mathematics or both is 100 − 17 = 83, and the required percentage is 83/100 = 83%.
Table 8-27 is sometimes called a bivariate frequency table or bivariate frequency distribution. Each square

in the table is called a cell and corresponds to a pair of classes or class intervals. The number indicated in the
cell is called the cell frequency. For example, in part (a) the number 4 is the frequency of the cell corresponding
to the pair of class intervals 70–79 in mathematics and 80–89 in physics.

The totals indicated in the last row and last column are called marginal totals or marginal frequencies. They
correspond, respectively, to the class frequencies of the separate frequency distributions of mathematics and
physics grades.

8.59. Show how to modify the formula of Problem 8.31 for the case of data grouped as in Table 8-27.
For grouped data, we can consider the various values of the variables x and y as coinciding with the
class marks, while fx and fy are the corresponding class frequencies or marginal frequencies indicated in
the last row and column of the bivariate frequency table. If we let f represent the various cell
frequencies corresponding to the pairs of class marks (x, y), then we can replace the formula of
Problem 8.31 by

$$(1)\qquad r = \frac{n\sum fxy - \left(\sum f_x x\right)\left(\sum f_y y\right)}{\sqrt{\left[n\sum f_x x^2 - \left(\sum f_x x\right)^2\right]\left[n\sum f_y y^2 - \left(\sum f_y y\right)^2\right]}}$$

If we let $x = x_0 + c_x u_x$ and $y = y_0 + c_y u_y$, where $c_x$ and $c_y$ are the class-interval widths (assumed constant) and $x_0$ and $y_0$ are arbitrary class marks corresponding to the variables, the above formula becomes

$$(2)\qquad r = \frac{n\sum f u_x u_y - \left(\sum f_x u_x\right)\left(\sum f_y u_y\right)}{\sqrt{\left[n\sum f_x u_x^2 - \left(\sum f_x u_x\right)^2\right]\left[n\sum f_y u_y^2 - \left(\sum f_y u_y\right)^2\right]}}$$
B

This is the coding method used in Chapter 5 as a short method for computing means, standard deviations,
and higher moments.

8.60. Find the coefficient of linear correlation of the mathematics and physics grades of Problem 8.58.
We use formula (2) of Problem 8.59. The work can be arranged as in Table 8-30, which is called a correlation
table.


Table 8-30

The number in the corner of each cell represents the product $fu_x u_y$, where f is the cell frequency. The sum of these corner numbers in each row is indicated in the corresponding row of the last column. The sum of these corner numbers in each column is indicated in the corresponding column of the last row. The final totals of the last row and last column are equal and represent $\sum fu_x u_y$.
From Table 8-30 we have

$$r = \frac{n\sum f u_x u_y - \left(\sum f_x u_x\right)\left(\sum f_y u_y\right)}{\sqrt{\left[n\sum f_x u_x^2 - \left(\sum f_x u_x\right)^2\right]\left[n\sum f_y u_y^2 - \left(\sum f_y u_y\right)^2\right]}} = \frac{(100)(125) - (64)(-55)}{\sqrt{[(100)(236) - (64)^2][(100)(253) - (-55)^2]}} = \frac{16{,}020}{\sqrt{(19{,}504)(22{,}275)}} = 0.7686$$

8.61. Use the correlation table of Problem 8.60 to compute (a) $s_x$, (b) $s_y$, (c) $s_{xy}$, and verify the formula $r = s_{xy}/s_x s_y$.

(a) $$s_x = c_x\sqrt{\frac{\sum f_x u_x^2}{n} - \left(\frac{\sum f_x u_x}{n}\right)^2} = 10\sqrt{\frac{236}{100} - \left(\frac{64}{100}\right)^2} = 13.966$$

(b) $$s_y = c_y\sqrt{\frac{\sum f_y u_y^2}{n} - \left(\frac{\sum f_y u_y}{n}\right)^2} = 10\sqrt{\frac{253}{100} - \left(\frac{-55}{100}\right)^2} = 14.925$$

(c) $$s_{xy} = c_x c_y\left[\frac{\sum f u_x u_y}{n} - \left(\frac{\sum f_x u_x}{n}\right)\left(\frac{\sum f_y u_y}{n}\right)\right] = (10)(10)\left[\frac{125}{100} - \left(\frac{64}{100}\right)\left(\frac{-55}{100}\right)\right] = 160.20$$


Therefore, the standard deviations of mathematics grades and physics grades are 14.0 and 14.9, respectively,
while their covariance is 160.2. We have

$$r = \frac{s_{xy}}{s_x s_y} = \frac{160.20}{(13.966)(14.925)} = 0.7686$$

agreeing with r as found in Problem 8.60.
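All three grouped-data statistics follow from the coded sums read off the correlation table; a sketch using formula (2) of Problem 8.59 (the helper name and argument names are ours):

```python
from math import sqrt

def grouped_stats(n, cx, cy, s_fux, s_fuy, s_fux2, s_fuy2, s_fuxuy):
    """sx, sy, sxy, and r from the coded sums of a correlation table."""
    sx = cx * sqrt(s_fux2 / n - (s_fux / n) ** 2)
    sy = cy * sqrt(s_fuy2 / n - (s_fuy / n) ** 2)
    sxy = cx * cy * (s_fuxuy / n - (s_fux / n) * (s_fuy / n))
    return sx, sy, sxy, sxy / (sx * sy)

# Coded sums from Table 8-30: n=100, cx=cy=10, Σfx·ux=64, Σfy·uy=-55,
# Σfx·ux²=236, Σfy·uy²=253, Σf·ux·uy=125
sx, sy, sxy, r = grouped_stats(100, 10, 10, 64, -55, 236, 253, 125)
print(round(sx, 3), round(sy, 3), round(sxy, 2), round(r, 4))
# 13.966 14.925 160.2 0.7686
```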

8.62. Write the equations of the regression lines of (a) y on x, (b) x on y for the data of Problem 8.60.
From Table 8-30 we have

$$\bar{x} = x_0 + c_x\,\frac{\sum f_x u_x}{n} = 64.5 + \frac{(10)(64)}{100} = 70.9$$

$$\bar{y} = y_0 + c_y\,\frac{\sum f_y u_y}{n} = 74.5 + \frac{(10)(-55)}{100} = 69.0$$

From the results of Problem 8.61, $s_x = 13.966$, $s_y = 14.925$, and r = 0.7686.
We now use (16), page 268, to obtain the equations of the regression lines.

(a) $$y - \bar{y} = \frac{r s_y}{s_x}(x - \bar{x}): \qquad y - 69.0 = \frac{(0.7686)(14.925)}{13.966}(x - 70.9) \qquad\text{or}\qquad y - 69.0 = 0.821(x - 70.9)$$

(b) $$x - \bar{x} = \frac{r s_x}{s_y}(y - \bar{y}): \qquad x - 70.9 = \frac{(0.7686)(13.966)}{14.925}(y - 69.0) \qquad\text{or}\qquad x - 70.9 = 0.719(y - 69.0)$$

8.63. Compute the standard errors of estimate (a) $s_{y.x}$, (b) $s_{x.y}$ for the data of Problem 8.60. Use the results of Problem 8.61.

(a) $$s_{y.x} = s_y\sqrt{1 - r^2} = 14.925\sqrt{1 - (0.7686)^2} = 9.548$$

(b) $$s_{x.y} = s_x\sqrt{1 - r^2} = 13.966\sqrt{1 - (0.7686)^2} = 8.934$$

SUPPLEMENTARY PROBLEMS
The least-squares line
8.64. Fit a least-squares line to the data in Table 8-31 using (a) x as the independent variable, (b) x as the dependent

variable. Graph the data and the least-squares lines using the same set of coordinate axes.

Table 8-31
x  3  5  6  8  9  11
y  2  3  4  6  5  8
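As a check on part (a), the normal equations for the least-squares line can be solved directly; a Python sketch (Table 8-31's y row read as 2, 3, 4, 6, 5, 8):

```python
xs = [3, 5, 6, 8, 9, 11]
ys = [2, 3, 4, 6, 5, 8]
n = len(xs)

# Normal equations for y = a + b*x:
#   a*n      + b*sum(x)   = sum(y)
#   a*sum(x) + b*sum(x^2) = sum(x*y)
Sx, Sy = sum(xs), sum(ys)
Sxx = sum(x * x for x in xs)
Sxy = sum(x * y for x, y in zip(xs, ys))

b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)   # 5/7, about 0.714
a = (Sy - b * Sx) / n                           # -1/3, about -0.333
```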

8.65. For the data of Problem 8.64, find (a) the values of y when x = 5 and x = 12, (b) the value of x when y = 7.

8.66. Table 8-32 shows the final grades in algebra and physics obtained by 10 students selected at random from a
large group of students. (a) Graph the data. (b) Find the least-squares line fitting the data, using x as the


independent variable. (c) Find the least-squares line fitting the data, using y as independent variable. (d) If a
student receives a grade of 75 in algebra, what is her expected grade in physics? (e) If a student receives a grade
of 95 in physics, what is her expected grade in algebra?

Table 8-32

Algebra (x) 75 80 93 65 87 71 98 68 84 77
Physics (y) 82 78 86 72 91 80 95 72 89 74

8.67. Refer to Table 8-33. (a) Construct a scatter diagram. (b) Find the least-squares regression line of y on x. (c) Find
the least-squares regression line of x on y. (d) Graph the two regression lines of (b) and (c) on the scatter
diagram of (a).

Table 8-33

Grade on First Quiz (x) 6 5 8 8 7 6 10 4 9 7

Grade on Second Quiz (y) 8 7 7 10 5 8 10 6 8 6

Least-squares regression curves
8.68. Fit a least-squares parabola, y = a + bx + cx², to the data in Table 8-34.

Table 8-34

x  0    1    2    3    4    5    6

y  2.4  2.1  3.2  5.6  9.3  14.6  21.9
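A parabola fit reduces to three normal equations; here is a stdlib-only Python sketch for Table 8-34's data, using a small Gaussian elimination (the answer at the back of the chapter is y = 2.51 − 1.20x + 0.733x²):

```python
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [2.4, 2.1, 3.2, 5.6, 9.3, 14.6, 21.9]
n = len(xs)

# Sums needed for the normal equations of y = a + b*x + c*x^2
Sx   = sum(xs)
Sx2  = sum(x ** 2 for x in xs)
Sx3  = sum(x ** 3 for x in xs)
Sx4  = sum(x ** 4 for x in xs)
Sy   = sum(ys)
Sxy  = sum(x * y for x, y in zip(xs, ys))
Sx2y = sum(x * x * y for x, y in zip(xs, ys))

# Augmented matrix of the three normal equations
M = [[n,   Sx,  Sx2, Sy],
     [Sx,  Sx2, Sx3, Sxy],
     [Sx2, Sx3, Sx4, Sx2y]]

# Gaussian elimination without pivoting (adequate for this small system)
for i in range(3):
    for j in range(i + 1, 3):
        f = M[j][i] / M[i][i]
        M[j] = [mj - f * mi for mj, mi in zip(M[j], M[i])]

coef = [0.0, 0.0, 0.0]
for i in (2, 1, 0):
    coef[i] = (M[i][3] - sum(M[i][k] * coef[k] for k in range(i + 1, 3))) / M[i][i]

a, b, c = coef   # y = a + b*x + c*x^2, approximately 2.51, -1.20, 0.733
```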

8.69. Table 8-35 gives the stopping distance d (feet) of an automobile traveling at speed v (miles per hour) at the
instant danger is sighted. (a) Graph d against v. (b) Fit a least-squares parabola of the form d = a + bv + cv²
to the data. (c) Estimate d when v = 45 miles per hour and 80 miles per hour.

Table 8-35
Speed, v (miles per hour) 20 30 40 50 60 70
Stopping Distance, d (feet) 54 90 138 206 292 396

8.70. The number y of bacteria per unit volume present in a culture after x hours is given in Table 8-36. (a) Graph the
data on semilogarithmic graph paper, with the logarithmic scale used for y and the arithmetic scale for x. (b) Fit
a least-squares curve having the form y = ab^x to the data, and explain why this particular equation should yield
good results. (c) Compare the values of y obtained from this equation with the actual values. (d) Estimate the
value of y when x = 7.

Table 8-36

Number of Hours (x) 0 1 2 3 4 5 6

Number of Bacteria per Unit Volume (y) 32 47 65 92 132 190 275
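Since the growth is roughly geometric, y = ab^x becomes a straight line after taking logarithms: log₁₀ y = log₁₀ a + x log₁₀ b. A Python sketch of the log-linear fit (the book's answer is y = 32.14(1.427)^x):

```python
import math

xs = [0, 1, 2, 3, 4, 5, 6]
ys = [32, 47, 65, 92, 132, 190, 275]
n = len(xs)

# Fit log10(y) = intercept + slope*x by ordinary least squares
L = [math.log10(y) for y in ys]
Sx, SL = sum(xs), sum(L)
Sxx = sum(x * x for x in xs)
SxL = sum(x * l for x, l in zip(xs, L))

slope = (n * SxL - Sx * SL) / (n * Sxx - Sx ** 2)   # log10(b), ~0.1544
intercept = (SL - slope * Sx) / n                   # log10(a)

a = 10 ** intercept   # ~32.14
b = 10 ** slope       # ~1.427
```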

Multiple regression

8.71. Table 8-37 shows the corresponding values of three variables x, y, and z. (a) Find the linear least-squares
regression equation of z on x and y. (b) Estimate z when x = 10 and y = 6.


Table 8-37
x 3 5 6 8 12 14
y 16 10 7 4 3 2
z 90 72 54 42 30 12
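Fitting z = a + bx + cy again means solving three normal equations; a stdlib-only Python sketch for Table 8-37 (the answer given later is z = 61.40 − 3.65x + 2.54y):

```python
xs = [3, 5, 6, 8, 12, 14]
ys = [16, 10, 7, 4, 3, 2]
zs = [90, 72, 54, 42, 30, 12]
n = len(xs)

Sx, Sy, Sz = sum(xs), sum(ys), sum(zs)
Sxx = sum(x * x for x in xs)
Syy = sum(y * y for y in ys)
Sxy = sum(x * y for x, y in zip(xs, ys))
Sxz = sum(x * z for x, z in zip(xs, zs))
Syz = sum(y * z for y, z in zip(ys, zs))

# Augmented matrix of the normal equations for z = a + b*x + c*y
M = [[n,  Sx,  Sy,  Sz],
     [Sx, Sxx, Sxy, Sxz],
     [Sy, Sxy, Syy, Syz]]

# Gaussian elimination without pivoting (adequate for this small system)
for i in range(3):
    for j in range(i + 1, 3):
        f = M[j][i] / M[i][i]
        M[j] = [mj - f * mi for mj, mi in zip(M[j], M[i])]

coef = [0.0, 0.0, 0.0]
for i in (2, 1, 0):
    coef[i] = (M[i][3] - sum(M[i][k] * coef[k] for k in range(i + 1, 3))) / M[i][i]

a, b, c = coef               # approximately 61.40, -3.65, 2.54
z_est = a + b * 10 + c * 6   # estimate for part (b), about 40
```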

Standard error of estimate and linear correlation coefficient
8.72. Find (a) sy.x, (b) sx.y for the data in Problem 8.67.

8.73. Compute (a) the total variation in y, (b) the unexplained variation in y, (c) the explained variation in y for the
data of Problem 8.67.

8.74. Use the results of Problem 8.73 to find the correlation coefficient between the two sets of quiz grades of
Problem 8.67.

8.75. Find the covariance for the data of Problem 8.67 (a) directly, (b) by using the formula sxy = r sx sy and the result
of Problem 8.74.

8.76. Table 8-38 shows the ages x and systolic blood pressures y of 12 women. (a) Find the correlation coefficient
between x and y. (b) Determine the least-squares regression line of y on x. (c) Estimate the blood pressure of a
woman whose age is 45 years.

Table 8-38

Age (x) 56 42 72 36 63 47 55 49 38 42 68 60

Blood Pressure (y) 147 125 160 118 149 128 150 145 115 140 152 155
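The product-moment formulas give both r and the regression line in a few lines; a Python sketch for Table 8-38 (answers later in the chapter: r = 0.8961, y = 80.78 + 1.138x, estimate 132 at age 45):

```python
import math

ages = [56, 42, 72, 36, 63, 47, 55, 49, 38, 42, 68, 60]
bp   = [147, 125, 160, 118, 149, 128, 150, 145, 115, 140, 152, 155]
n = len(ages)

Sx, Sy = sum(ages), sum(bp)
Sxx = sum(x * x for x in ages)
Syy = sum(y * y for y in bp)
Sxy = sum(x * y for x, y in zip(ages, bp))

r = (n * Sxy - Sx * Sy) / math.sqrt((n * Sxx - Sx ** 2) * (n * Syy - Sy ** 2))
b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)   # slope, ~1.138
a = (Sy - b * Sx) / n                           # intercept, ~80.78

estimate_45 = a + b * 45                        # ~132
```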

8.77. Find the correlation coefficients for the data of (a) Problem 8.64, (b) Problem 8.66.

8.78. The correlation coefficient between two variables x and y is r = 0.60. If sx = 1.50, sy = 2.00, x̄ = 10, and
ȳ = 20, find the equations of the regression lines of (a) y on x, (b) x on y.

8.79. Compute (a) sy.x, (b) sx.y for the data of Problem 8.78.

8.80. If sy.x = 3 and sy = 5, find r.

8.81. If the correlation coefficient between x and y is 0.50, what percentage of the total variation remains unexplained
by the regression equation?

8.82. (a) Compute the correlation coefficient between the corresponding values of x and y given in Table 8-39.
(b) Multiply each x value in the table by 2 and add 6. Multiply each y value in the table by 3 and subtract 15.
Find the correlation coefficient between the two new sets of values, explaining why you do or do not obtain the
same result as in part (a).

Table 8-39

x 2 4 5 6 8 11

y 18 12 10 8 7 5
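Part (b) illustrates that r is unchanged by a change of scale and origin (with positive multipliers); a Python sketch for Table 8-39:

```python
import math

def corr(xs, ys):
    """Pearson correlation from the product-moment formula."""
    n = len(xs)
    Sx, Sy = sum(xs), sum(ys)
    Sxx = sum(x * x for x in xs)
    Syy = sum(y * y for y in ys)
    Sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * Sxy - Sx * Sy) / math.sqrt((n * Sxx - Sx**2) * (n * Syy - Sy**2))

xs = [2, 4, 5, 6, 8, 11]
ys = [18, 12, 10, 8, 7, 5]

r1 = corr(xs, ys)                                            # ~-0.9203
r2 = corr([2 * x + 6 for x in xs], [3 * y - 15 for y in ys])
# r2 equals r1: r is invariant under these linear transformations
# because both multipliers (2 and 3) are positive
```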


Generalized correlation coefficient
8.83. Find the standard error of estimate of z on x and y for the data of Problem 8.71.

8.84. Compute the coefficient of multiple correlation for the data of Problem 8.71.

Rank correlation

8.85. Two judges in a contest, who were asked to rank 8 candidates, A, B, C, D, E, F, G, and H, in order of their
preference, submitted the choice shown in Table 8-40. Find the coefficient of rank correlation and decide how
well the judges agreed in their choices.

Table 8-40

Candidate     A  B  C  D  E  F  G  H

First Judge   5  2  8  1  4  6  3  7

Second Judge 4 5 7 3 2 8 1 6
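Spearman's formula r = 1 − 6Σd²/[n(n² − 1)] applies directly to the two sets of ranks; a Python sketch for Table 8-40 (the answer given later is r = 2/3):

```python
first  = [5, 2, 8, 1, 4, 6, 3, 7]   # ranks assigned by the first judge (A..H)
second = [4, 5, 7, 3, 2, 8, 1, 6]   # ranks assigned by the second judge
n = len(first)

# Spearman's rank correlation: r = 1 - 6*sum(d^2) / (n*(n^2 - 1))
d2 = sum((u - v) ** 2 for u, v in zip(first, second))   # 28
r_rank = 1 - 6 * d2 / (n * (n * n - 1))                 # 2/3
```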

8.86. Find the coefficient of rank correlation for the data of (a) Problem 8.67, (b) Problem 8.76.

8.87. Find the coefficient of rank correlation for the data of Problem 8.82.

Sampling theory of regression
8.88. On the basis of a sample of size 27, a regression equation of y on x was found to be y = 25.0 + 2.00x. If

sy.x = 1.50, sx = 3.00, and x̄ = 7.50, find (a) 95%, (b) 99%, confidence limits for the regression coefficient.

8.89. In Problem 8.88 test the hypothesis that the population regression coefficient is (a) as low as 1.70, (b) as high as
2.20, at a 0.01 level of significance.

8.90. In Problem 8.88 find (a) 95%, (b) 99%, confidence limits for y when x = 6.00.

8.91. In Problem 8.88 find (a) 95%, (b) 99%, confidence limits for the mean of all values of y corresponding to
x = 6.00.

8.92. Referring to Problem 8.76, find 95% confidence limits for (a) the regression coefficient of y on x, (b) the blood
pressures of all women who are 45 years old, (c) the mean of the blood pressures of all women who are 45
years old.

Sampling theory of correlation
8.93. A correlation coefficient based on a sample of size 27 was computed to be 0.40. Can we conclude at a

significance level of (a) 0.05, (b) 0.01, that the corresponding population correlation coefficient is significantly
greater than zero?

8.94. A correlation coefficient based on a sample of size 35 was computed to be 0.50. Can we reject the hypothesis
that the population correlation coefficient is (a) as small as r = 0.30, (b) as large as r = 0.70, using a 0.05
significance level?

8.95. Find (a) 95%, (b) 99%, confidence limits for a correlation coefficient that is computed to be 0.60 from a sample
of size 28.


8.96. Work Problem 8.95 if the sample size is 52.

8.97. Find 95% confidence limits for the correlation coefficient computed in Problem 8.76.

8.98. Two correlation coefficients obtained from samples of sizes 23 and 28 were computed to be 0.80 and 0.95,
respectively. Can we conclude at a level of (a) 0.05, (b) 0.01, that there is a significant difference between the
two coefficients?

Miscellaneous results
8.99. The sample least-squares regression lines for a set of data involving x and y are given by 2x − 5y = 3,

5x − 8y = 2. Find the linear correlation coefficient.
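One way to see the answer: the slope of y on x is r·sy/sx and the slope of x on y is r·sx/sy, so the product of the two slopes is r². A sketch:

```python
import math

# 2x - 5y = 3  =>  y = (2/5)x - 3/5,  slope of y on x is 2/5
# 5x - 8y = 2  =>  x = (8/5)y + 2/5,  slope of x on y is 8/5
b_yx = 2 / 5
b_xy = 8 / 5

# Since b_yx = r*sy/sx and b_xy = r*sx/sy, their product is r^2
r = math.sqrt(b_yx * b_xy)   # positive root, since both slopes are positive
```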

8.100. Find the correlation coefficient between the heights and weights of 300 adult males in the United States as
given in Table 8-41.

Table 8-41

                              HEIGHTS x (inches)

                    59–62   63–66   67–70   71–74   75–78

          90–109      2       1

         110–129      7       8       4       2

WEIGHTS  130–149      5      15      22       7       1

   y     150–169      2      12      63      19       5

(pounds) 170–189              7      28      32      12

         190–209              2      10      20       7

         210–229                      1       4       2

8.101. (a) Find the least-squares regression line of y on x for the data of Problem 8.100. (b) Estimate the weights of
two men whose heights are 64 and 72 inches, respectively.

8.102. Find (a) sy.x, (b) sx.y for the data of Problem 8.100.
8.103. Find 95% confidence limits for the correlation coefficient computed in Problem 8.100.

8.104. Find the correlation coefficient between U.S. consumer price indexes and wholesale price indexes for all
commodities as shown in Table 8-42. The base period is 1947–1949 = 100.

Table 8-42

Year                    1949   1950   1951   1952   1953   1954   1955   1956   1957   1958

Consumer Price Index   101.8  102.8  111.0  113.5  114.4  114.8  114.5  116.2  120.2  123.5

Wholesale Price Index   99.2  103.1  114.8  111.6  110.1  110.3  110.7  114.3  117.6  119.2

Source: Bureau of Labor Statistics.


8.105. Refer to Table 8-43. (a) Graph the data. (b) Find a least-squares line fitting the data and construct its graph.
(c) Compute the trend values and compare with the actual values. (d) Predict the price index for medical care
during 1958 and compare with the true value (144.4). (e) In what year can we expect the index of medical
costs to be double that of 1947 through 1949, assuming present trends continue?

Table 8-43

Year                                    1950   1951   1952   1953   1954   1955   1956   1957

Consumer Price Index for Medical Care  106.0  111.1  117.2  121.3  125.2  128.0  132.6  138.0
(1947–1949 = 100)

Source: Bureau of Labor Statistics.

8.106. Refer to Table 8-44. (a) Graph the data. (b) Find a least-squares parabola fitting the data. (c) Compute the trend
values and compare with the actual values. (d) Explain why the equation obtained in (b) is not useful for
extrapolation purposes.

Table 8-44

Year                            1915  1920  1925  1930  1935  1940  1945  1950  1955

Birth Rate per 1000 Population  25.0  23.7  21.3  18.9  16.9  17.9  19.5  23.6  24.6

Source: Department of Health and Human Services.

ANSWERS TO SUPPLEMENTARY PROBLEMS

8.64. (a) y = −1/3 + (5/7)x or y = −0.333 + 0.714x (b) x = 1 + (9/7)y or x = 1.00 + 1.29y

8.65. (a) 3.24, 8.24 (b) 10.00    8.66. (b) y = 29.13 + 0.661x (c) x = −14.39 + 1.15y (d) 79 (e) 95

8.67. (b) y = 4.000 + 0.500x (c) x = 2.408 + 0.612y

8.68. y = 5.51 + 3.20(x − 3) + 0.733(x − 3)² or y = 2.51 − 1.20x + 0.733x²

8.69. (b) d = 41.77 − 1.096v + 0.08786v² (c) 170 ft, 516 ft

8.70. (b) y = 32.14(1.427)^x or y = 32.14(10)^(0.1544x) or y = 32.14e^(0.3556x) (d) 387

8.71. (a) z = 61.40 − 3.65x + 2.54y (b) 40    8.72. (a) 1.304 (b) 1.443

8.73. (a) 24.50 (b) 17.00 (c) 7.50    8.74. 0.5533    8.75. 1.5

8.76. (a) 0.8961 (b) y = 80.78 + 1.138x (c) 132

8.77. (a) 0.958 (b) 0.872    8.78. (a) y = 0.8x + 12 (b) x = 0.45y + 1

8.79. (a) 1.60 (b) 1.20    8.80. ±0.80    8.81. 75%    8.82. (a) −0.9203

8.83. 3.12    8.84. 0.9927    8.85. r_rank = 2/3    8.86. (a) 0.5182 (b) 0.9318

8.87. −1.0000    8.88. (a) 2.00 ± 0.21 (b) 2.00 ± 0.28

8.89. (a) Using a one-tailed test, we can reject the hypothesis.
      (b) Using a one-tailed test, we cannot reject the hypothesis.

8.90. (a) 37.0 ± 3.6 (b) 37.0 ± 4.9    8.91. (a) 37.0 ± 1.5 (b) 37.0 ± 2.1

8.92. (a) 1.138 ± 0.398 (b) 132.0 ± 19.2 (c) 132.0 ± 5.4

8.93. (a) Yes. (b) No.    8.94. (a) No. (b) Yes.

8.95. (a) 0.2923 and 0.7951 (b) 0.1763 and 0.8361

8.96. (a) 0.3912 and 0.7500 (b) 0.3146 and 0.7861

8.97. 0.7096 and 0.9653    8.98. (a) Yes. (b) No.    8.99. 0.8

8.100. 0.5440    8.101. (a) y = 4.44x − 142.22 (b) 141.9 and 177.5 pounds

8.102. (a) 16.92 lb (b) 2.07 in    8.103. 0.4961 and 0.7235    8.104. 0.9263

8.105. (b) y = 122.42 + 2.19x if the x-unit is 1/2 year and the origin is at Jan. 1, 1954; or y = 107.1 + 4.38x if the
       x-unit is 1 year and the origin is at July 1, 1950
       (d) 142.1 (e) 1971

8.106. (b) y = 18.16 − 0.1083x + 0.4653x², where y is the birth rate per 1000 population and the x-unit is 5 years
       with origin at July 1, 1935

CHAPTER 9

Analysis of Variance

The Purpose of Analysis of Variance

In Chapter 7 we used sampling theory to test the significance of differences between two sampling means. We
assumed that the two populations from which the samples were drawn had the same variance. In many situations
there is a need to test the significance of differences among three or more sampling means, or equivalently to test
the null hypothesis that the population means underlying the samples are all equal.

EXAMPLE 9.1 Suppose that in an agricultural experiment, four different chemical treatments of soil produced mean
wheat yields of 28, 22, 18, and 24 bushels per acre, respectively. Is there a significant difference in these means, or is
the observed spread simply due to chance?

Problems such as these can be solved by using an important technique known as the analysis of variance, de-
veloped by Fisher. It makes use of the F distribution already considered in previous chapters.

One-Way Classification or One-Factor Experiments

In a one-factor experiment, measurements or observations are obtained for a independent groups of samples,
where the number of measurements in each group is b. We speak of a treatments, each of which has b repetitions
or replications. In Example 9.1, a = 4.

The results of a one-factor experiment can be presented in a table having a rows and b columns (Table 9-1).
Here xjk denotes the measurement in the jth row and kth column, where j = 1, 2, . . . , a and k = 1, 2, . . . , b. For
example, x35 refers to the fifth measurement for the third treatment.

Table 9-1

Treatment 1    x11  x12  …  x1b    x̄1.
Treatment 2    x21  x22  …  x2b    x̄2.
  ⋮
Treatment a    xa1  xa2  …  xab    x̄a.

We shall denote by x̄j. the mean of the measurements in the jth row. We have

    x̄j. = (1/b) Σ_{k=1}^{b} xjk        j = 1, 2, . . . , a        (1)

The dot in x̄j. is used to show that the index k has been summed out. The values x̄j. are called group means, treatment
means, or row means. The grand mean, or overall mean, is the mean of all the measurements in all the
groups and is denoted by x̄, i.e.,

    x̄ = (1/ab) Σ_{j,k} xjk = (1/ab) Σ_{j=1}^{a} Σ_{k=1}^{b} xjk        (2)


Total Variation. Variation Within Treatments. Variation Between Treatments

We define the total variation, denoted by v, as the sum of the squares of the deviations of each measurement from
the grand mean x̄, i.e.,

    Total variation = v = Σ_{j,k} (xjk − x̄)²        (3)

By writing the identity

    xjk − x̄ = (xjk − x̄j.) + (x̄j. − x̄)        (4)
and then squaring and summing over j and k, we can show (see Problem 9.1) that

    Σ_{j,k} (xjk − x̄)² = Σ_{j,k} (xjk − x̄j.)² + Σ_{j,k} (x̄j. − x̄)²        (5)

or

    Σ_{j,k} (xjk − x̄)² = Σ_{j,k} (xjk − x̄j.)² + b Σ_j (x̄j. − x̄)²        (6)

We call the first summation on the right of (5) or (6) the variation within treatments (since it involves the squares
of the deviations of xjk from the treatment means x̄j.) and denote it by vw. Therefore,

    vw = Σ_{j,k} (xjk − x̄j.)²        (7)

The second summation on the right of (5) or (6) is called the variation between treatments (since it involves the
squares of the deviations of the various treatment means x̄j. from the grand mean x̄) and is denoted by vb. Therefore,

    vb = Σ_{j,k} (x̄j. − x̄)² = b Σ_j (x̄j. − x̄)²        (8)

Equations (5) or (6) can thus be written

    v = vw + vb        (9)

Shortcut Methods for Obtaining Variations

To minimize the labor in computing the above variations, the following forms are convenient:

    v = Σ_{j,k} xjk² − t²/(ab)        (10)

    vb = (1/b) Σ_j tj.² − t²/(ab)        (11)

    vw = v − vb        (12)

where t is the total of all values xjk and tj. is the total of all values in the jth treatment, i.e.,

    t = Σ_{j,k} xjk        tj. = Σ_k xjk        (13)

In practice it is convenient to subtract some fixed value from all the data in the table; this has no effect on the
final results.
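As a numerical illustration of the shortcut formulas (10)–(12) and of the identity v = vw + vb, here is a Python sketch on a small hypothetical table (the numbers are ours, chosen only for illustration):

```python
# Hypothetical one-factor experiment: a = 3 treatments, b = 4 replications
data = [[48, 49, 50, 49],
        [47, 49, 48, 48],
        [49, 51, 50, 50]]
a, b = len(data), len(data[0])

t  = sum(x for row in data for x in row)   # grand total t
tj = [sum(row) for row in data]            # treatment totals t_j.

# Shortcut formulas (10)-(12)
v   = sum(x * x for row in data for x in row) - t ** 2 / (a * b)
v_b = sum(tt * tt for tt in tj) / b - t ** 2 / (a * b)
v_w = v - v_b

# Check against the defining sums (3) and (7)
x_bar  = t / (a * b)
v_def  = sum((x - x_bar) ** 2 for row in data for x in row)
vw_def = sum((x - sum(row) / b) ** 2 for row in data for x in row)
```

For this table the two routes agree: v = 14, vb = 8, vw = 6, and v = vw + vb.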

Linear Mathematical Model for Analysis of Variance

We can consider each row of Table 9-1 as representing a random sample of size b from the population for that
particular treatment. Therefore, for treatment j we have the independent, identically distributed random
variables Xj1, Xj2, . . . , Xjb, which, respectively, take on the values xj1, xj2, . . . , xjb. Each of the Xjk (k = 1, 2, . . . , b)


can be expressed as the sum of its expected value and a "chance" or "error" term:

    Xjk = μj + Δjk        (14)

The Δjk can be taken as independent (relative to j as well as to k), normally distributed random variables with
mean zero and variance σ². This is equivalent to assuming that the Xjk (j = 1, 2, . . . , a; k = 1, 2, . . . , b) are
mutually independent, normal variables with means μj and common variance σ².

Let us define the constant μ by

    μ = (1/a) Σ_j μj

We can think of μ as the mean for a sort of grand population comprising all the treatment populations. Then (14)
can be rewritten as (see Problem 9.18)

    Xjk = μ + αj + Δjk        where Σ_j αj = 0        (15)

The constant αj can be viewed as the special effect of the jth treatment.
    The null hypothesis that all treatment means are equal is given by (H0: αj = 0; j = 1, 2, . . . , a) or equivalently
by (H0: μj = μ; j = 1, 2, . . . , a). If H0 is true, the treatment populations, which by assumption are normal,
have a common mean as well as a common variance. Then there is just one treatment population, and all
treatments are statistically identical.

Expected Values of the Variations

The between-treatments variation Vb, the within-treatments variation Vw, and the total variation V are random
variables that, respectively, assume the values vb, vw, and v as defined in (8), (7), and (3). We can show
(Problem 9.19) that

    E(Vb) = (a − 1)σ² + b Σ_j αj²        (16)

    E(Vw) = a(b − 1)σ²        (17)

    E(V) = (ab − 1)σ² + b Σ_j αj²        (18)

From (17) it follows that

    E[Vw / (a(b − 1))] = σ²        (19)

so that

    Ŝw² = Vw / [a(b − 1)]        (20)

is always a best (unbiased) estimate of σ² regardless of whether H0 is true or not. On the other hand, from (16)
and (18) we see that only if H0 is true will we have

    E[Vb / (a − 1)] = σ²        E[V / (ab − 1)] = σ²        (21)

so that only in such case will

    Ŝb² = Vb / (a − 1)        Ŝ² = V / (ab − 1)        (22)

be unbiased estimates of σ².