Beginning Statistics with Data Analysis

The degrees of freedom will be the sample size, n, minus the number of
coefficients fitted. So if we have 20 points, n = 20, and if we fit 5 predictor
variables plus the constant term, we fit 6 coefficients and lose 6 degrees of
freedom, leaving us with 20 − 6 = 14 degrees of freedom. If we drop 2 variables
from our equation, then we are fitting 4 instead of 6 coefficients. Thus the
degrees of freedom increase from 14 to 16.
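
A one-line check of this arithmetic, as a minimal Python sketch (the function
name is ours, invented for illustration):

    # Degrees of freedom left after fitting a constant term plus p predictors.
    def residual_df(n, p):
        return n - (p + 1)  # p predictor coefficients plus the constant

    print(residual_df(20, 5))  # 14: all five predictors fitted
    print(residual_df(20, 3))  # 16: two of the five predictors dropped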

Remember, when we fit additional variables, the residual sum of squares, Ŝ,
must automatically be reduced, but the estimated error variance, s² = Ŝ/d.f.,
need not be reduced, because the denominator of s², the degrees of freedom,
will also be decreasing.

This is quite a bundle of ideas to absorb all at one time. But we can illustrate
many of them in an example with actual data.

EXAMPLE 1 Student achievement. In the mid-1960s the U.S. Office of
Education undertook a major research project involving some 570,000 school
pupils and 60,000 teachers. It was intended to assess the availability of equal
educational opportunities in public educational institutions in the United States.
The report on the project became known as the Coleman report, after James S.
Coleman, who headed the study. Table 14-1 contains data on a sample of 20
grade schools from the Northeast and Middle Atlantic states, drawn from the
population of the Coleman report. Our aim with these data is to find reasonable
explanations for why the average sixth grade verbal achievement scores, y, vary
from school to school. Five predictors are available for our use here. Three of
these pertain primarily to assessment of general background characteristics for
each school:

u: sixth grade percentage of white collar (father)—(WHTC)

v: socioeconomic status—(SES)

x: sixth grade mothers’ average education (1 unit = 2 years)—(MOM)

TABLE 14-1
Random sample of 20 schools
from Northeast and Middle Atlantic states

† y, sixth grade verbal mean test score for all sixth graders (SCOR).
t, staff salaries per pupil (SLRY).
u, sixth grade percentage white-collar fathers (WHTC).
v, socioeconomic status composite: sixth grade means for family size, family intactness, father’s
education, mother’s education, percentage white-collar fathers, and home items (SES).
w, mean teacher’s verbal test score (TCHR).
x, sixth grade mean mother’s educational level (1 unit = 2 school years) (MOM).

Source: Adapted from J. S. Coleman, E. Q. Campbell, C. J. Hobson, J. McPartland, A. M. Mood, F. D.
Weinfeld, and R. L. York (1966). Equality of Educational Opportunity. U.S. Government Printing Office,
Washington, D.C.

The other two relate to variables that are potentially subject to manipulation in
order to produce equality of achievement:

t: staff salaries per pupil—(SLRY)

w: mean teacher’s verbal test score—(TCHR)

This particular sample of 20 schools was chosen in part to illustrate the
difficulties that arise when a variable like SES is used as a predictor along with
other variables that are closely related to it, such as WHTC and MOM. These 20
schools are not typical of schools in the country as a whole, and we must be very
cautious when we come to interpret the results of our analyses.

Our multiple regression analysis of these data involves several steps. We
proceed by giving a series of exhibits consisting of pieces of actual computer
output, and in between these exhibits we give brief explanations and
interpretations of the reported analyses from the computer. One point we must
recognize at the outset is that even our favorite statistical computer programs
often include, as part of their output, information that we neither understand nor
even wish to use in our analyses. Because the exhibits contain actual output, they
will contain statistics we have not mentioned. Many of these we will simply
ignore, because they do not play a central role in our discussion here. We do not
delete them because they add a tang of reality to the computer printouts.

We begin by examining the correlations among the six variables–the five
predictors and the dependent variable. Instead of using the variable labels t, u, v,
w, and x, we adopt the practice of using three-or four-letter labels (known as
mnemonics). These labels will save us the effort of continually flipping back to
see if, for example, v stands for socioeconomic status or for staff salaries per
pupil. Similarly, we describe the regression coefficients using the same labels;
for example, b1, the coefficient of t, is labeled bSLRY.

The entries in Exhibit 14-A need some explanation.

1. To find the correlation coefficient between any two variables, we locate a
row corresponding to one and a column corresponding to the other; if we hit
a blank entry we simply reverse. For example, the correlation between SLRY
and TCHR is found by entering the fourth row (for TCHR) and the first
column (for SLRY), yielding r = 0.5027.

2. Some entries have strange labeling that tells us that the decimal place needs
to be moved, and exactly where it should go. For example, the correlation
between WHTC and TCHR is given as r = .5106E-01. “E-01” means “move
the decimal point 1 space to the left.” Thus r = 0.05106. This form of
labeling is known as scientific notation. Similarly, “E+01” means “move the
decimal point 1 space to the right.”

EXHIBIT 14-A

Northeast achievement data—correlation coefficients of variables used

3. Note that r = 1 for the correlation of each variable with itself. This is as it
should be, because SCOR, for example, is a perfect predictor of itself!

We can learn a considerable amount by examining the entries in Exhibit 14-A.
First, if we look at the last row, we see that the predictor variable with the
highest correlation with the dependent variable SCOR is SES, with rSCOR,SES =
0.9272. This means that when SES is used all by itself in a regression equation
to predict SCOR, it will account for

r² = (0.9272)² = 0.8597, or about 86 percent,

of the variability in SCOR. No other variable by itself accounts for more than 60
percent. Thus a regression equation with SES alone will give reasonably good
predictions. The other two variables that are highly correlated with SCOR are
WHTC and MOM.
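
Two quick checks in Python: the scientific notation used in the printouts parses
directly, and squaring the SES correlation reproduces the 86 percent figure:

    # The printout's scientific notation parses directly in Python.
    print(float("0.5106E-01"))  # 0.05106: E-01 moves the decimal point 1 place left

    # Squaring the highest correlation with SCOR gives the variability explained.
    r = 0.9272                  # correlation between SES and SCOR, Exhibit 14-A
    print(round(r ** 2, 4))     # 0.8597: SES alone accounts for about 86 percent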

Second, the correlations among the three background variables, WHTC,
SES, and MOM, are all quite high:

Each of these variables measures similar information. Perhaps we need to use
only one of the three.

Third, the correlation between the other two variables, SLRY and TCHR, is
also relatively large: rSLRY,TCHR = 0.5027. Again, perhaps both variables are not
needed as predictors.

Next we compute the multiple regression using all five predictors. From
Exhibit 14-B we have the five-variable fitted equation (rounding to two decimal
places), noting that B0 means b0, and using the labels:

(Note that we interpreted “.4360156E-01” as “0.04360156.”) The display
includes the estimated standard errors of the least-squares coefficients; for
example,

EXHIBIT 14-B
Northeast achievement data—regression coefficients
for predicting achievement score using all five predictors

The display also contains the t ratios. Only two t ratios (those for SES and
TCHR) are larger than 2, and the variables with the smallest ratios are candidates
to be dropped from the equation. Finally, we see that the estimated error variance
is s² = 4.30 and R² = 0.9063. Thus, by using all 5 variables instead of only SES,
we can explain only 0.9063 − 0.8597 = 0.0466, or 4.66 percent more, of the
variability of SCOR.
Before going on to omit some of the predictors from our equation, we look at
the residuals from the full multiple regression equation in Exhibit 14-C. The first
column in this exhibit, labeled CASE, contains the school numbers from Table
14-1, and the second column, labeled SCOR, contains the observed values of the
dependent variable as given in the last column of Table 14-1. To get the residuals
in the third column, the computer program calculates the PREDICTED SCOR
for each case based on the least-squares regression equation and then subtracts
this from the observed value: residual = SCOR − PREDICTED SCOR.

EXHIBIT 14-C
Northeast achievement data—observed scores for the
20 schools, residuals, and studentized residuals of predicted
scores using all five predictors; other quantities not treated here

In Section 12-7 we explained for simple regression how to estimate the
standard error for any prediction, usually different for each prediction. By
methods not given here, these formulas can be extended to multiple regression,
and the computer calculates these standard errors. The column labeled STUD.
RES contains the residuals, each divided by its estimated standard error (each
standard error may differ from the rest), or studentized residuals. This division
puts all the residuals on roughly the same footing, and the numbers should
behave approximately like standard normal random variables. One of the things
we look for in this column is large (either positive or negative) residuals—an
indication of outliers. The two cases that stand out a little are case 3 (with a
value –2.05) and case 18 (2.94). Given that we have 20 cases, it is not too
shocking to find two with values larger than 2, but we shall keep our eyes on
these observations in later analyses nonetheless. In figures not shown here, we
plotted the studentized residuals against the predicted values, and against each of
the predictors separately, to check for possible curvilinearity. We found no
evidence to suggest that an equation adding squared terms for the predictors to
take account of curvilinearity would be of much use.
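
For readers reproducing this kind of output with modern software, a minimal
sketch of how studentized residuals can be obtained; statsmodels is assumed
here, and the arrays are random stand-ins for Table 14-1, not the Coleman data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 5))              # stand-ins for SLRY, WHTC, SES, TCHR, MOM
    y = X @ np.ones(5) + rng.normal(size=20)  # stand-in for SCOR

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    stud = fit.get_influence().resid_studentized_internal
    # Flag cases whose studentized residuals are large in absolute value.
    for case, value in enumerate(stud, start=1):
        if abs(value) > 2:
            print(case, round(value, 2))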

In the discussion here we do not use the information in the final three
columns of Exhibit 14-C, nor the quantities at the bottom of the display. They
are included here simply because they were part of the output produced by the
particular regression program that we used. If we use a different program, the
information it provides on residuals may differ somewhat from that in Exhibit
14-C.

The two variables that are potentially subject to manipulation by a
policymaker to produce equality of achievement are SLRY (staff salaries per
pupil) and TCHR (mean teacher’s verbal test score). Because the variable SLRY
has the smaller t ratio in absolute size in Exhibit 14-B, we try omitting it from
the equation, while retaining TCHR and the three background variables, WHTC,
SES, and MOM. Exhibit 14-D gives the regression coefficients omitting SLRY.
The new fitted equation, shown there, has R² = 0.8922, only a slight change
from the full equation (about 1 percent). The new estimate of the error variance
is s² = 4.62, just a small increase from 4.30.

From the t ratios we see that both WHTC and MOM have modest values,
and they remain as candidates to be dropped from the equation. We next omit
them one at a time, and then together. The results are given in Exhibit 14-E.
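
The drop-the-variable-with-the-smallest-t strategy used here is easy to
mechanize. A sketch of one elimination step, an illustration of the idea rather
than the program the authors used, again with invented data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    names = ["SLRY", "WHTC", "SES", "TCHR", "MOM"]
    X = rng.normal(size=(20, 5))                       # invented predictor values
    y = X[:, 2] + 0.5 * X[:, 3] + rng.normal(size=20)  # invented SCOR

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    tvals = fit.tvalues[1:]                    # skip the constant's t ratio
    weakest = int(np.argmin(np.abs(tvals)))    # predictor with the smallest |t|
    print("drop candidate:", names[weakest], round(tvals[weakest], 2))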

EXHIBIT 14-D
Northeast achievement data—regression
coefficients for predicting achievement scores
using the four predictors WHTC, SES, TCHR, and MOM

EXHIBIT 14-E
Northeast achievement data—regression
coefficients for predicting achievement score
using (a) three predictors, SES, TCHR, and MOM, (b) three
predictors, WHTC, SES, and TCHR, (c) two predictors, SES and TCHR

Omitting variables WHTC and MOM, either separately or together, leads to
essentially the same value of R²; the three fitted equations appear in Exhibit
14-E. These R² values differ by less than 0.2 of a percent, and they are about
0.5 of a percent smaller than 0.8922, the R² value with 4 variables from Exhibit
14-D. Little seems to be lost by removing both WHTC and MOM from the
equation. Our new estimate of σ², the error variance, using only SES and TCHR
as predictors, is now smaller than for any equation examined so far: s² = 4.26.

Leaving out both WHTC and MOM as predictors makes sense from a
different viewpoint. Socioeconomic status, variable SES, is a composite variable
that combines several different social and economic background variables. Thus,
variables WHTC and MOM measure background information that may already
be largely taken into account in the variable SES, which we are keeping in the
equation.

At this point we examine the residuals based on SES and TCHR in Exhibit
14-F(a). We note that cases 3 and 18 again have the largest absolute studentized
residuals, but the value for case 18 is smaller than the corresponding value in
Exhibit 14-C. (This reduction in the size of the largest studentized residual
is a good sign. Nonetheless, if we could, we would go back to the original
observations for these cases to check on them further.)

EXHIBIT 14-F(a)
Northeast achievement data—observed scores for the
20 schools, residuals, and studentized residuals of predicted scores
using two predictors, SES and TCHR; other quantities not treated here

EXHIBIT 14-F(b)
Northeast achievement data—plot of studentized
residuals versus predicted score using two predictors, SES and TCHR

At this point we also plot the studentized residuals against various variables.
Exhibit 14-F shows two plots. Both show slight linear trends running downward
from left to right. The linear trend in Exhibit 14-F(b) suggests that the fitted
model may be inadequate, although it does not necessarily suggest why. The
linear trend in Exhibit 14-F(c) suggests that the residuals contain some linear
component due to the omitted variable plotted on the x axis—in this plot the
variable SLRY. Thus, in Exhibit 14-G, we add SLRY back into our equation; the
new regression equation is given there,

EXHIBIT 14-F(c)
Northeast achievement data—plot of

studentized residuals of SCOR on SES and TCHR versus SLRY

EXHIBIT 14-G
Northeast achievement data—
regression coefficients for predicting achievement
score using three predictors, SLRY, SES, and TCHR

with R²SLRY,SES,TCHR = 0.9007. The t ratio for variable SLRY is –1.47, and we
are left with the decision whether or not this value is large enough to keep the
variable in the equation. At this point we have looked at the data so many times,
in different ways, that it makes little sense to look up the t value in a table.

If we go back to our other criterion, making s² as small as possible, we may
well decide to keep SLRY, because the value of s² in the three-variable equation
is smaller than the value 4.26 that we have for the equation using only SES and
TCHR in Exhibit 14-E. A final check of the residuals for the regression of SCOR
on SLRY, SES, and TCHR (not reproduced here) shows the same two large
studentized residuals for cases 3 and 18, but no special patterns when the
studentized residuals are plotted against anything else. This information mildly
supports the use of the three-variable equation.

Conclusions. We began this example with five possible predictors: three
background variables and two variables (SLRY and TCHR) that are subject to
possible manipulation. Our final prediction equation, based on SLRY, SES, and
TCHR, accounts for about 90 percent of the variability in the dependent variable.
The sign of the coefficient for the variable SLRY is unexpected. It might be
interpreted to say that if we control for background information (SES) and for
the mean teacher’s verbal test score (TCHR), an increase in staff salaries per
pupil will produce a decrease in achievement. We have to be careful about such a
causal interpretation, however, because we have not done an experiment that
actually manipulated salaries for specific schools. Thus we should not conclude
that we can increase achievement by reducing salaries; an actual cut in salaries
could have side effects, such as teacher strikes, that the equation cannot
anticipate.

The other two variables do have the expected signs. When we introduce the
other variables into the equation, differences in socioeconomic status (SES) are
directly (and positively) related to achievement. Similarly for differences in
mean teacher’s verbal test score (TCHR). The latter has a potential policy
interpretation, because it suggests that if we control for other relevant variables,
increasing the teacher’s ability to communicate may increase the pupil’s
achievement.

Before concluding this example, we recall that we are dealing with a small
sample of schools from the Northeast and Middle Atlantic states. Although we
have found a high R² for this sample, a much smaller correlation was found
when similar analyses were done for the country as a whole. Thus, these data do
not seem to be typical of the nation. We must always be cautious in generalizing
the results of our analyses beyond the population from which we draw our
samples.

In multiple correlation, although the size and sign of the coefficients may be
suggestive, we cannot depend on them to be causally related to the output
variable y. Thus deliberately changing the value of x in the real world may not
have either the effect desired by the changers or that suggested by the equation.
Nevertheless, the equation can still be useful for what it is supposed to do,
namely, predict y from the values of the x’s.

PROBLEMS FOR SECTION 14-2

In a study of geographic variation in the size of cricket frogs in the United States, data from several
localities were gathered. One part of the study involved predicting mean body length from five
environmental variables. Table 14-2 contains an excerpt of the data for 13 localities in the northeastern
section of the country. The following problems are concerned with examining multiple regression output
from an analysis of these data and drawing appropriate conclusions. We wish to predict body length of the
frogs (LENG), using the predictors longitude (LONG), latitude (LAT), altitude (ALT), annual rainfall
(RAIN), and annual temperature (TEMP).

1. Examine the correlations listed in Exhibit 14-H and identify (a) the two variables with the highest
correlations with the dependent variable, LENG, and (b) those predictor variables that are highly
intercorrelated (say r > 0.6 or r < –0.6).

TABLE 14-2
Data for predicting mean body length of male cricket frogs

in 13 selected localities in the northeast United States

† LAT, latitude
LONG, longitude
ALT, altitude
RAIN, annual rainfall
TEMP, annual temperature
LENG, body length of frog

Source: A. J. Izenman (1972). Reduced-rank regression for the multivariate linear model. Doctoral
dissertation, Department of Statistics, University of California, Berkeley.

EXHIBIT 14-H
Cricket frog data—correlation coefficients of the variables used

EXHIBIT 14-I
Cricket frog data—regression coefficients
for predicting length (LENG) from all five predictors

2. Based on your examination of the correlations in Problem 1, would you expect both LONG and LAT to
be useful as predictors in a multiple regression equation? Why or why not?
Exhibit 14-I contains information on the regression using all five predictors.

3. Write out the multiple regression equation, rounding the coefficients to three decimal places.
4. What are the values of s² and R²?
5. Find the degrees of freedom for the estimated error variance, and explain how the value was
determined.
6. How much more of the variability of LENG is explained by using all five variables instead of the best
single predictor identified in Problem 1(a)?
7. Which predictor variables appear to be reasonable candidates to be dropped from the equation in
Problem 3?
In Exhibit 14-J we have a series of multiple regression equations, beginning with the one in Exhibit 14-I,
and dropping variables one at a time (based on the t values), until only two remain.
8. What is an equation for the last regression in Exhibit 14-J?
9. How do the coefficients in the equation in Problem 8 compare in magnitude and sign with those for the
same variables in your equation from Problem 3?
10. Make a table of the values of R², d.f., and s² for the four regression equations in Exhibits 14-I and
14-J.
11. Based on the information in the table from Problem 10, and on the other information in the exhibits,
choose a regression equation for prediction purposes.
EXHIBIT 14-J

Cricket frog data—regression coefficients for
predicting length (LENG) of frogs using (a) four

predictors, LONG, LAT, ALT, and TEMP, (b) three
predictors, LAT, ALT, and TEMP, (c) two predictors, ALT and TEMP

12. Review your answers to Problems 1 and 2, and then note if you are surprised about your choice of
equation in Problem 11. Explain your answer.

13. Examine the t values for the regression of LENG on ALT and TEMP. Do you think that either of these
variables should be dropped from the equation? Explain your answer.
EXHIBIT 14-K
Cricket frog data–plot of studentized residuals of LENG
using two predictors (ALT and TEMP) versus predicted LENG

14. Exhibit 14-K contains a plot of the studentized residuals for the regression of LENG on ALT and
TEMP. Is there any discernible pattern in this plot that should be cause for concern?

15. Exhibit 14-L contains the residuals whose values are plotted in Exhibit 14-K. Is there any evidence in
this exhibit to suggest that one or more points may be outliers or wild observations?

EXHIBIT 14-L
Cricket frog data—observed LENG,
residuals, and studentized residuals using two
predictors, ALT and TEMP; other quantities not treated here

16. Write a brief conclusion for the analysis in this data set that includes (a) the final prediction equation
you would choose, (b) the values of R² and s² for your equation, and (c) an interpretation of the
values of the estimated regression coefficients.

14-3 EXAMPLES OF ESTIMATED MULTIPLE REGRESSION

EQUATIONS, THEIR INTERPRETATION, AND THEIR
USE

Multiple regression analysis is used in many different fields, especially in
situations in which the investigators are unable to collect true experimental data.
In this section we review four additional examples of the use of multiple
regression in four quite different areas, and we discuss problems with
interpretation of the estimated coefficients and the use of the fitted equations for
prediction purposes.

EXAMPLE 2 Assessing the demand for fuel. Using data for 1971 from the 48
contiguous U.S. states, Ferrar and Nelson (1975) used a multiple regression
analysis to predict the demand for fuel. Their variables and fitted equation were
as follows:

y = logarithm of total fuel consumption per capita
u = logarithm of mean personal income per capita

v = logarithm of price of heating fuel
w = logarithm of price of electricity
x = logarithm of average total heating degree-days (a measure of how cold it was)

The estimates of standard errors of the coefficients are reported in parentheses,
and the proportion of variability of y explained by the regression equation is R² =
0.85.

This multiple linear regression equation is the linear version of what
economists call a production function. They often model production Y with a
product of several variables raised to powers, such as

Y = K U^a V^b W^c.

Then, by taking logarithms of both sides, they write

log Y = log K + a log U + b log V + c log W.

Now, letting

y = log Y, k = log K, u = log U, v = log V, w = log W,

we can write

y = k + au + bv + cw.

We have now a linear regression problem, where k, a, b, and c are the regression
coefficients. Economists would interpret the fitted coefficients as indicating that
the demand for heating fuel goes up with mean personal income, with the price
of electricity, and with the duration and intensity of winter coldness, but goes
down as the price of heating fuel itself goes up. The signs of the coefficients here
are sensible, although it is difficult to interpret their sizes.
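
A sketch of how such a log-linear equation can be fitted by least squares; the
data below are synthetic, since the 1971 state data are not reproduced in this
excerpt:

    import numpy as np

    rng = np.random.default_rng(3)
    # Synthetic stand-ins for income, fuel price, electricity price, degree-days.
    U, V, W, D = (rng.uniform(1, 10, size=48) for _ in range(4))
    Y = 2.0 * U**0.5 * V**-0.3 * W**0.2 * D**0.4 * rng.lognormal(0, 0.1, 48)

    # Taking logarithms turns the multiplicative model into a linear one.
    A = np.column_stack([np.ones(48), np.log(U), np.log(V), np.log(W), np.log(D)])
    coef, *_ = np.linalg.lstsq(A, np.log(Y), rcond=None)
    print(np.round(coef, 2))  # near log 2 = 0.69, 0.5, -0.3, 0.2, 0.4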

This equation is based on data gathered for 48 states for a single year. We
would feel more confident of our interpretation of the equation for the demand
for heating fuel if we had observed, in addition, several states over a period of
many years and thus could see the actual impact of changes in the predictors on
demand.

EXAMPLE 3 Growth of Japanese larch trees. Bliss (1970) used data from 26
Japanese larch trees to explore the relationship between the growth in height (y)
and the percentage contents of nitrogen (u), phosphorus (v), and potassium (w) in
their needles. His fitted linear equation has R² = 0.86. Growth seems to be
positively related to the percentage contents
of all three minerals, although the t ratio of the phosphorus content (v), t =
304.2/165.1 = 1.84, is significant at the 0.10 level, but not the 0.05 level, when
we compare it with values in the t table with 26 – 4 = 22 d.f. These three
components account for 86 percent of the variability in the heights of the trees.
Because the three predictor variables are all measured on the same scale, we
might wish to order the three minerals by the sizes of their regression
coefficients and take this order to be an indicator of their relative importance.
The difficulty with doing this is that the larger the estimated regression
coefficient, the larger its estimated standard error.

In regression analyses such as this, we must avoid the temptation to interpret
the size of a fitted coefficient as an indication of the importance of the
corresponding predictor variable. Even though the predictors are all measured as
percentages, their contributions need not follow the sizes of the coefficients.
Furthermore, the units are not truly comparable: a percentage point of
phosphorus is not a percentage point of nitrogen.

EXAMPLE 4 Midterm congressional elections. Tufte (1975) has tried to predict
the nationwide vote in congressional contests in the United States for midterm
elections (that is, elections held in years without a presidential contest). In every
midterm election but one since the Civil War (until 1974) the political party of
the incumbent president has lost seats in the House of Representatives. With

y = nationwide midterm congressional vote for party of the current
president (percent)

as the dependent variable, Tufte considered the regression of y on
u = the average of the vote for the party of the president in the preceding
eight elections
v = Gallup Poll rating of the president at the time of the election
w = the yearly change in real disposable income per capita (in 1958 dollars)

for the eight midterm elections from 1938 to 1970 (omitting 1942, because of
special effects relating to wartime conditions). The fitted equation has R² =
0.93, and all three coefficients have the expected sign.
One difficulty with this use of regression analysis is that with only eight
observations we have fitted four coefficients. Thus we have only 4 d.f. for the
estimate of σ².

The data series here is simply too short for us to be very confident about
predictions using the fitted equation.

EXAMPLE 5 Property tax assessments. In Ramsey County, Minnesota, the
Department of Property Tax Assessment has begun to use multiple regression
analysis to get estimates of the market values of single-family residences for tax
purposes. They use an equation that incorporates two types of variables: (a)
property characteristics, such as size of lot, number of square feet of living area,
numbers of bedrooms, baths, etc., (b) neighborhood factors (two identical
properties located in different parts of the county may be worth different
amounts because of the relative desirabilities of the neighborhoods where they
are located).

The tax assessors do not simply rely on the estimated values produced from
the multiple regression equation for the tax bills they send out. They compare the
predictions with estimated values that come from two other methods of
assessment:

1. The cost approach, which is designed to calculate what it would cost to
replace a home in today’s building market,

2. Attribute matching (five houses are selected for which the assessor’s office
has recent sales information, houses that are most similar to the house being
valued; the median of the five sales prices, after some adjustments, is used as
the estimated value).

When the three methods yield substantially different valuations for a
property, appraisers are sent out to make a field inspection. In addition,
homeowners have the right to challenge an estimated valuation and to ask for a
field inspection.

When assessors apply this method to all of the homes in Ramsey County,
inevitably some outliers occur—homes whose true values are not well described
by the equation used. Predictions for these homes using the regression equation
may be far different from the estimates using the other two assessment methods.
Outliers can be found in both directions: overassessment and underassessment.
When formulas such as these are used, taxpayers have ordinarily been given the
right of appeal in the event that peculiar features distort the value. The same is
true for other methods of assessment.

In the tax example, we see multiple regression in action directed toward its
strength—estimation.

14-4 SUMMARY OF CHAPTER 14

Formula numbers correspond to those used earlier in this chapter.
1. With a dependent variable, y, and five predictors, t, u, v, w, and x, the
multiple linear regression model is

y = β0 + β1t + β2u + β3v + β4w + β5x + e,     (1)

where e is a random error with mean 0 and variance σ².

2. The least-squares estimates, b0, b1, …, b5, for the regression coefficients in
(1) yield the minimum sum of squared residuals:

Ŝ = Σ(y − ŷ)²,  where ŷ = b0 + b1t + b2u + b3v + b4w + b5x.

3. The proportion of variability of the dependent variable, y, explained by the
least-squares fit is

R² = 1 − Ŝ/Σ(y − ȳ)²,

which is the square of the correlation r between the observed y and the least-
squares estimate ŷ.

4. The estimate of the error variance, σ², for the multiple regression model is

s² = Ŝ/d.f.,

where the degrees of freedom equal the sample size, n, minus the number of
coefficients fitted.

5. We use the values of R², s², and the t ratios for the least-squares coefficients
to help us decide when to omit variables from a multiple regression equation.
We also use our understanding of any physical, social, and biological factors.
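
To make items 1 through 4 concrete, a compact numerical sketch using numpy
on a small invented data set:

    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 20, 5
    X = rng.normal(size=(n, p))                   # predictors t, u, v, w, x
    y = 1 + X @ np.array([0.5, 0, 2, 1, 0]) + rng.normal(size=n)

    A = np.column_stack([np.ones(n), X])          # constant plus 5 predictors
    b, *_ = np.linalg.lstsq(A, y, rcond=None)     # least-squares coefficients
    yhat = A @ b
    S_hat = np.sum((y - yhat) ** 2)               # residual sum of squares, S-hat
    R2 = 1 - S_hat / np.sum((y - y.mean()) ** 2)  # proportion of variability explained
    s2 = S_hat / (n - (p + 1))                    # error variance on n - 6 = 14 d.f.
    print(np.round(b, 2), round(R2, 3), round(s2, 3))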

TABLE 14-3
1972 fuel consumption data
for the 48 contiguous states



† TAX, 1972 motor fuel tax in cents per gallon.
DLIC, 1971 proportion of population with driver’s license.
INC, 1972 per capita personal income in dollars.
ROAD, 1971 length of federal-aid primary highways in miles.
FUEL, 1972 fuel consumption in gallons per person.

Source: Data collected by C. Bingham from the American Almanac for 1974 and the 1974 World Almanac
and Book of Facts, and reported in S. Weisberg (1980). Applied Linear Regression, Wiley: New York, p. 33.

SUMMARY PROBLEMS FOR CHAPTER 14

The following 10 problems are based on the data in Table 14-3.

For Problems 1 and 2, the correlation table for the five variables is as
follows:

1. (a) Which variable has the highest correlation with the variable FUEL? (b)
Which pairs of variables have correlations greater than 0.5 or less than –0.5?

2. Which variables do you expect to be of little use in predicting FUEL?
Explain the reasons for your answer.
For Problems 3 through 7, the following information is produced by a

computer program carrying out the regression of FUEL on TAX, DLIC, INC,
and ROAD:

3. Write out the multiple regression equation, rounding the coefficients to three
decimal places.

4. What are the values of s² and R²?

5. How much more variability of FUEL is explained by using all four predictors
instead of the best single predictor?

6. Which predictor variables appear to be reasonable candidates to be dropped
from the equation in Problem 3? Why?

7. Interpret the coefficients in the equation from Problem 3, and discuss the
implications of the results of the analysis for the reduction of fuel
consumption for the 48 contiguous states.
The following problems require a high-speed computer.

8. Use a least-squares multiple regression computer program to get results for
the data of Table 14-3 for the following regressions:
a) FUEL on TAX, DLIC, INC
b) FUEL on TAX, DLIC
c) FUEL on TAX, INC
d) FUEL on DLIC, INC

9. Plot the residuals from Problem 8(b) versus ROAD and versus INC.
Examine the plots for patterns.

10. Compute s² for each of the equations in Problem 8, and, using your answer
to Problem 4, decide which of the five equations has the smallest value of
s².

REFERENCES

C. I. Bliss (1970). Statistics in Biology, Vol. II, pp. 308–310. McGraw-Hill, New York.
S. Chatterjee and B. Price (1977). Regression Analysis by Example, Chapters 3 and 9. Wiley, New York.
T. A. Ferrar and J. P. Nelson (1975). Energy conservation policies of the Federal Energy Office: economic demand analysis. Science 187:644–646.
F. Mosteller and J. W. Tukey (1977). Data Analysis and Regression, Chapter 13. Addison-Wesley, Reading, Mass.
E. R. Tufte (1975). Determinants of the outcomes of midterm congressional elections. American Political Science Review 69:812–826.
S. Weisberg (1980). Applied Linear Regression, Chapters 2, 3, and 8. Wiley, New York.

15
Analysis of Variance






Learning Objectives

1. Measuring variability and allocating it among sources
2. Comparing the locations of three or more groups
3. Breaking the total sum of squares for two-way tables of measurements into
three components: rows, columns, and residuals
4. Interpreting computer output for the analysis of variance (ANOVA)

15-1 INTRODUCTORY EXAMPLES

In the analysis of variance, we measure variability and allocate it among its
sources. We have already seen such analyses in tests of the differences between
two observed means. For that problem, the sources of variation were (a) the
difference between the population means and (b) the variation within the two
samples. We used the variability within the samples to assess the size of the
observed difference in means, that is, the variability between samples. We use
these same ideas of variability within and variability between in this chapter, and
we extend them to more groups. Then we consider problems in which the groups
are organized in a two-way layout, and we allocate the variability between
groups into pieces that go with the rows and columns. This analysis-of-variance
approach gives us formal tools to accompany the exploratory methods for two-
way tables in Chapter 4.

As we did in Chapter 14, we stress here the interpretation of analysis-of-
variance methods applied to actual examples. Although the calculations can be
done by hand, they are somewhat tedious. Fortunately there are computer
programs that can spare us much of this drudgery, and we develop the material in
this chapter as if we had such a program available. Our primary goal is to learn

to understand what a printout from an analysis-of-variance computer program
means.

ONE-WAY TABLE

In problems with more than two groups, organizing our approach to sources of
variation is more complicated than in the two-group case.

EXAMPLE 1 Geographic distribution of divorced men. In Table 15-1 we give
the percentages of divorced men by states, grouped by region. The regional
averages run from a low of 2.20 for the Northeast to a high of 3.91 for the West.
The question arises: Are the proportions of divorced men randomly distributed
by region, or do the regions differ more than random sorting would suggest? It
looks, at least superficially, as if the Southwest and West have substantially
higher rates than the rest of the country. But perhaps a random sorting into
collections of 11, 11, 12, 6, and 10 states would give us a similar amount of
variation in means as that in Table 15-1.

In Table 15-2 we show a random rearrangement of the 50 numbers into five
groups having the same numbers of states in the groups as before, but the 50
states are now sorted randomly rather than geographically. For example, in Table
15-1, Nevada, with 6.9, is in group 5; in Table 15-2 it is in group 3. The largest
and smallest means are 3.04 and 2.35. In the real regional grouping of Table
15-1, the largest and smallest means are 3.91 and 2.20. Thus, on the basis of one
trial, the variability among the real regions looks larger than we would expect
from random selection. We could repeat the random regrouping many times to
see how the data behave. With high-speed computation this would be easy to do,
but it would deflect us from developing the standard analysis of variance.
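
For the curious, a sketch of that random-regrouping experiment; the 50
percentages are invented stand-ins, since Table 15-1 is not reproduced in this
excerpt:

    import numpy as np

    rng = np.random.default_rng(5)
    rates = rng.normal(3.0, 0.8, size=50)  # invented percentages for 50 states
    sizes = [11, 11, 12, 6, 10]            # regional group sizes from Table 15-1

    spreads = []
    for _ in range(1000):                  # many random regroupings
        shuffled = rng.permutation(rates)
        groups = np.split(shuffled, np.cumsum(sizes)[:-1])
        means = [g.mean() for g in groups]
        spreads.append(max(means) - min(means))
    # How often does random sorting spread the group means by 1.71 or more,
    # the spread (3.91 - 2.20) seen in the real regional grouping?
    print(np.mean(np.array(spreads) >= 1.71))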

TABLE 15-1
Percentages of divorced men by regions, state by state

TABLE 15-2
Randomly chosen groups of states

TABLE 15-3
Stem-and-leaf diagrams of percentages of divorced men by regions, from Table 15-1

While we look at Table 15-1, we might examine the distributions of divorce
rates in the several real regions using a stem-and-leaf diagram. Table 15-3 shows
the results. In the West, the 6.9 for Nevada stands out as an outlier. Because
many people go there to get divorced, we might wish to set it aside in any further
analysis.

The layouts in Tables 15-1, 15-2, and 15-3 are called one-way tables. Each
column is composed of measurements that are not directly related to those in
other columns. We want to know about variation between the column means and
within the columns.

TABLE 15-4
Average 2-month temperatures for five large northeastern cities (row and column averages and effects)

In some situations we have a two-way layout. Then each entry in the table is
connected with both a row and a column.

EXAMPLE 2 City temperatures. In Table 15-4 we give average temperatures for
2-month periods for five large cities. Table 15-4 shows two sources of variation:
cities and calendar periods. A third source is residuals, which we discuss later.
Looking at the row means, we see that the average temperatures vary from city
to city; in particular, the more southern the city, the higher the mean temperature.
We also get variation from calendar periods according to a pattern familiar for
temperate regions in the Northern Hemisphere—colder November through
April, warmer May through October. We can use these averages to develop
residuals from an additive model, much as we did in Chapter 4.

Table 15-4 gives (a) city effects, which are the deviations of the cities’ means
from the grand mean, 51.4, and (b) calendar effects, which are the deviations of
the calendar period means from the grand mean. We can estimate the value in the
cell by using the fitted value obtained by adding

grand mean + city effect + calendar effect = fitted,

and for Boston in January-February we get fitted = 27.5.
To get the residual we compute

residual = observed – fitted

and get

residual = 28 – 27.5 = 0.5

for that cell, as we did in Chapter 4. The variation of these residuals from cell to
cell is an additional source of variation.
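
A sketch of this additive fit in code; the temperature matrix is a small
invented example rather than the values of Table 15-4:

    import numpy as np

    # Rows are cities, columns are 2-month calendar periods (invented values).
    T = np.array([[28.0, 40.0, 65.0, 72.0, 60.0, 38.0],
                  [33.0, 45.0, 70.0, 78.0, 65.0, 43.0]])

    grand = T.mean()
    city_eff = T.mean(axis=1) - grand  # deviations of row means from grand mean
    cal_eff = T.mean(axis=0) - grand   # deviations of column means from grand mean

    fitted = grand + city_eff[:, None] + cal_eff[None, :]
    residual = T - fitted              # observed minus fitted, cell by cell
    print(np.round(residual, 1))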

We could have examples with additional sources of variation, but we shall
not go further than the two-way table in this book.

In Chapters 9 and 10 we compared the means of two samples of data. In
scientific inquiries we often ask whether several populations may reasonably
have the same mean. For example, we may have air pollution data from several
different time periods, and we want to know whether the pollution has been
much the same across the periods. Usually, any difficulty in making this
comparison arises because we have to take account of the variability within the
groups of data.

The analytical tools developed in Chapter 10 will allow us to examine the
difference between any pair of observed means, but those tools do not put us into
position to estimate the variability among the set of means as a whole or to test
for inequality among the whole set of population means. We turn next to the task
of developing such methods.

As we progress with the analysis of variance, the reader may feel that we are
analyzing means instead of variances. But it is the variation in means that draws
our attention. We want to know how large that variation is and whether it is
important or can be neglected.

PROBLEMS FOR SECTION 15-1

1. One-way example. From an almanac or other source, find and display a set of
data appropriate to the one-way layout with more than two groups.

2. Two-way example. From an almanac or other source, find a set of data
appropriate to a two-way layout.

3. Recompute the mean for the West, with Nevada removed, in Table 15-1 and
the mean with Nevada removed from the “Central” group in Table 15-2, and
discuss how the ranges in means for the real data and the randomly arranged
data now compare.

4. Continuation. For the recomputed data from Table 15-1, compute the sample
variance for Northeast and West. Comment on whether or not the variance
seems to be roughly constant for these groups.

5. Compute the residuals for the cells in Table 15-4 in the first row. Comment
on the accuracy of prediction.

6. Measuring variability. In Table 15-4, is there more variation between cities
or between calendar periods? How can you tell? How would you measure
such variations?

7. Residual trouble. The sum of the row effects and the sum of the column
effects should each add to zero in Table 15-4. Why don’t they?

Olympic platform diving. Table 15-5 contains results from the finals of the
women’s platform diving event at the 1976 Montreal Olympics. The country of
each judge and each diver is also listed. Each judge gives a score over 10 dives
to each diver, incorporating the degree of difficulty. The entries in Table 15-5 are
the total scores minus 1400, and these are the data for Problems 8 through 14.
Problems 10 and 11 are for those with access to an analysis-of-variance
computer program.

8. Use the mean scores for judges to calculate the judge effects.
9. Use the mean scores for divers to calculate the diver effects.

10. Compute the residuals for the cells in Table 15-5, using fitted value = grand
mean + judge effect + diver effect.

11. Do any residuals stand out as being much larger than the rest?

12. For each judge, identify the highest and lowest scores assigned.

13. For each diver, identify the judges who gave the highest and lowest scores.

14. Is there any evidence of bias in the judging related to nationality?

15-2 SEVERAL SAMPLES

To compare the means of several populations, numbered for our convenience as
1, 2, …, k, we draw a sample from each. These samples will have sizes n1, n2, …,
nk, and the total number of observations, n, for the k groups taken together is

n = n1 + n2 + ⋯ + nk.

For each sample, we have a sample mean, x̄i, and a sample variance,
si² = Σ(x − x̄i)²/(ni − 1), the sum running over the ni observations in sample i.

TABLE 15-5
Scores from finals of women’s platform diving, Montreal Olympics, 1976 (each entry is the total score minus 1400)

Source: Adapted from data reported by K. S. Brown (1979). Judging Judging. Ontario Secondary School
Mathematics Bulletin 15 (3):16-18.

In the two-sample problem of Chapter 10 we compared the two sample
means. In the several-sample situation we compare each mean with the grand
mean, pooling the observations over all of the samples. We represent this grand
mean by

x̄ = (n1x̄1 + n2x̄2 + ⋯ + nkx̄k)/n,     (1)

the mean of all n observations taken together.
It will be helpful to have a three-sample case with convenient numbers to be
used for illustrative calculations throughout this section.

EXAMPLE 3 A three-sample case. Samples 1, 2, and 3 have observations as
follows:

In this example, k = 3, with sample sizes n1 = 3, n2 = 6, and n3 = 3, and each
sample has its own mean and variance. If we pool the three samples, we get
n = 3 + 6 + 3 = 12 observations, and the grand mean is x̄ = 4.
We will refer various questions to this three-sample case.

THE ONE-WAY MODEL

In the one-way model for the analysis of variance, our data consist of samples
from k populations, with unknown population means: μ1, μ2,…, μk. As we did in
the regression model of Chapter 12, we can represent each observation as the
sum of two parts, a true population value and a random error. Because the true
population value is the population mean, we have the following:

observation = population mean + error,  or  xij = μi + eij,

for the jth observation in sample i. The errors are not directly observable
unless we have the unusual situation of knowing the mean, μi.

To complete the one-way analysis-of-variance model, we make the following
assumptions that resemble those of the regression situation:

1. The errors are independent random values from normally distributed
populations.

2. Each population of errors has mean zero and the same variance σ²
(unknown).

The role of the normality assumption is to assure that the percentage points of
the tests of significance, described in Section 15-3, are correct. That assumption
does not bear on the ideas of average values or of various sums of squares
adding up correctly.

Because the errors have mean zero, our observations from sample i, on
average, take the value μi. Thus we estimate μi by the ith sample mean, x̄i. We
want to develop a way to assess differences among the μi’s. We do this by
comparing the sample means using information about the variability within the
samples.

SUMS OF SQUARES

The sums of squares from various sources have useful relations, and so we
emphasize them. To compare the sample means, we compute the squared
deviation of each from the grand mean:

(x̄i − x̄)².

Next, we calculate a summary measure of the difference among the means, the
between-means sum of squares:

BSS = n1(x̄1 − x̄)² + n2(x̄2 − x̄)² + ⋯ + nk(x̄k − x̄)².

The sum of squares is a way of measuring how far the set of observed sample
means is from the grand mean of all the measurements. Strictly, it is the square
of the distance. Our plan is to compare BSS with a pooled estimate of error
variance, σ², over all k samples. The variability in sample i is measured by si².
To convert each si² back to a sum of squares, we multiply by the degrees of
freedom (d.f.) in sample i, ni − 1. Adding over the samples, we get the
within-groups sum of squares:

WSS = (n1 − 1)s1² + (n2 − 1)s2² + ⋯ + (nk − 1)sk²,

or, equivalently, the sum of squared deviations of every observation from its
own sample mean.
The total sum of squares is defined as the sum of squares of deviations of all
measurements from the grand mean x̄:

DEFINITION OF TOTAL SUM OF SQUARES

TSS = sum of (observation − x̄)² over all n observations.     (6)

The total sum of squares can also be computed by adding the between-means
sum of squares and the within-groups sum of squares:

TSS = BSS + WSS.     (7)

Thus we have a numerical check that the definitional form of TSS, equation (6),
gives the same result as the component form, equation (7).
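
A quick numerical check of the identity in equation (7), using three small
invented samples:

    import numpy as np

    samples = [np.array([1.0, 3.0, 5.0]),
               np.array([2.0, 4.0, 4.0, 6.0]),
               np.array([5.0, 7.0])]        # invented data, k = 3 groups

    allx = np.concatenate(samples)
    grand = allx.mean()
    BSS = sum(len(s) * (s.mean() - grand) ** 2 for s in samples)
    WSS = sum(((s - s.mean()) ** 2).sum() for s in samples)
    TSS = ((allx - grand) ** 2).sum()
    print(round(BSS + WSS, 10) == round(TSS, 10))  # True: TSS = BSS + WSS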
EXAMPLE 4 Verifying the sums of squares calculations. For the three-sample
problem of Example 3, the between-means sum of squares is BSS = 36. Within
each of the samples, we compute the sum of squared deviations about the sample
mean; adding these over the three samples gives the within-samples sum of
squares, WSS = 40. Because x̄ = 4, the definitional form of TSS in equation (6)
gives TSS = 76. Finally, we use equation (7) to get a check on our answer:
BSS + WSS = 36 + 40 = 76 = TSS.

DEGREES OF FREEDOM

In Chapter 10 we learned that there are ni − 1 d.f. associated with our estimate of
variance in the ith sample, si². Thus, a pooled estimate of σ² using observations
from all k samples should be based on the sum of the degrees of freedom for all
of the samples:

(n1 − 1) + (n2 − 1) + ⋯ + (nk − 1) = n − k.

To get our estimate of σ², we divide the WSS by the pooled degrees of freedom:

WMS = WSS/(n − k).

We refer to this estimate as the within mean square or WMS.

For the comparison among the means, we have k values of x̄i, but they are
constrained to average to the overall mean, x̄, as we saw in equation (1). Thus we
lose 1 d.f. out of the k, yielding k − 1. We associate these k − 1 d.f. with our
summary statistic comparing the x̄i, BSS. The between mean square, BMS,
comes from dividing BSS by its degrees of freedom:

BMS = BSS/(k − 1).

The degrees of freedom associated with the TSS are n − 1, because we start
with n observations and use up 1 d.f. to estimate μ by x̄. Note that the degrees of
freedom add up, just as do the sums of squares:

(k − 1) + (n − k) = n − 1.

In Table 15-6 we summarize the formulas for degrees of freedom, the sums
of squares, and the mean squares. Because TSS summarizes the overall
variability in the data and the within mean square is our estimate of the error
variance, σ², we call the summary in Table 15-6 an analysis-of-variance table.
Table 15-6 contains a final column labeled F ratio, which we will discuss shortly.

For our three-sample case of Examples 3 and 4, Table 15-7 summarizes the
analysis of variance. We see that BSS = 36, WSS = 40, and TSS = 76,

TABLE 15-6
Analysis-of-variance table (formulas)

TABLE 15-7
Analysis-of-variance table for our three-sample case

as we noted earlier. The value listed in the column for F ratio is F =
BMS/WMS = (36/2)/(40/9) = 18/4.44 = 4.05.

We learn to interpret this value in the next section.

USING COMPUTER OUTPUT

The three-sample example used to illustrate the material in this section was
designed for easy calculation by hand. For more complicated examples, the
sums-of-squares calculations can still be done by hand or with a pocket
calculator, but we usually need to take advantage of special computational
formulas to keep the effort under control. Some of the references at the end of
the chapter explain these computational formulas for the analysis of variance.

Most high-speed computers have programs readily available for carrying out
the analysis of variance. Again, you should check these out first by giving them
simple problems in which the work has already been carried out by hand or by
hand calculator. Computer programs often provide valuable extra printout
including the data, all shown in nice form. You likely will need instruction about
what items printed out have special meaning for the problem at hand. For large
jobs, such programs can be most helpful. Once you have invested the overhead
to learn how to enter and submit problems, even small jobs go very quickly.
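
As one modern instance of such a program, a minimal one-way analysis of
variance in Python; the three samples are invented:

    import numpy as np
    from scipy import stats

    groups = [np.array([1.0, 3.0, 5.0]),
              np.array([2.0, 4.0, 4.0, 6.0, 3.0, 5.0]),
              np.array([5.0, 7.0, 6.0])]  # invented k = 3 samples

    F, p = stats.f_oneway(*groups)        # ratio of between to within mean square
    k = len(groups)
    n = sum(len(g) for g in groups)
    print(f"F({k - 1}, {n - k}) = {F:.2f}, P = {p:.3f}")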

For much of the remainder of this chapter we act as if we have mastered the
use of an analysis-of-variance computer program, and we focus on the output it

produces.

EXAMPLE 5 Analysis of variance for the geographic distribution of divorced
men. In Example 1 we examined the data in Table 15-1 on the percentages of
divorced men by states. The states were organized into 5 regions. Construct an
analysis-of-variance table for these data and discuss the results.
SOLUTION. We treat the data as k = 5 samples, one for each region. Exhibit 15-
A contains the actual computer output from a one-way analysis of variance of
these data.

The output in Exhibit 15-A contains two sets of information. In the bottom
part we have the summary information for each of the 5 samples: the sample size
ni, the mean x̄i, and the standard deviation si. In the top part of the exhibit we
have the analysis-of-variance table.

EXHIBIT 15-A
Computer output for a one-way analysis-of-variance table of the data on percentages of divorced men

(Table 15-1)

The first column of this output differs somewhat from that of Table 15-6. To translate these terms, in
place of DUE TO read SOURCE. The row labeled FACTOR corresponds to
“between means,” and the row ERROR corresponds to “within samples” (used
to compute an estimate of the error variance). Thus we see from Exhibit 15-A
that (rounding to one decimal place) BSS = 21.1, with 4 d.f., and WSS = 20.0,
with 45 d.f.

The pooled estimate of variance for this example is WMS = WSS/(n − k) =
20.0/45 = 0.444. Taking the square root, we have √0.444 = 0.666, which is
listed at the bottom of the output as POOLED ST. DEV.

In the next section we compare the magnitude of the between mean square
with that of the within mean square by taking their ratio, called F in honor of
R. A. Fisher:

F = BMS/WMS = (21.1/4)/(20.0/45) ≈ 11.9.

We use this value, found in the output in the column labeled F RATIO, to assess
possible differences among the population means, the μi’s.

To see how things might have turned out if we had used data where there are
no differences among the μi’s, let us recall the values in Table 15-2. This table
was based on a random rearrangement of the 50 observations into 5 groups of
size 11, 11, 12, 6, and 10. Exhibit 15-B contains the computer output for an
analysis of variance of these data.

EXHIBIT 15-B
Computer output for a one-way analysis-of-variance table of the data in Table 15-2

In Exhibit 15-B we see that BSS = 2.704, with 4 d.f., and WSS = 38.4, with
45 d.f. For the rearranged data the sample means all lie in an interval of length
0.7, a value that is less than the pooled standard deviation. Now the ratio of the
between mean square to the within mean square is

F = (2.704/4)/(38.4/45) = 0.676/0.853 ≈ 0.79.
Large values of F typically suggest differences among the samples and, by
implication, differences among the populations from which they are drawn.
Values of F near 1 suggest that the variation between observed means is
primarily due to sampling variation.

PROBLEMS FOR SECTION 15-2

The following data for three samples form the basis for Problems 1 through 9.
Save your results.

Sample 1: 2, 2, 8; Sample 2: 1, 3; Sample 3: 4, 4, 6, 6

1. Find the sample means and the grand mean.
2. Use si² to estimate separately the variances of the three populations from
which the samples were drawn.
3. Continuation. Estimation. Assuming that the three populations have equal
variances, estimate that common σ² by suitably weighting the si² in Problem 2.
4. Find the between means sum of squares, BSS.
5. Find the total sum of squares, TSS.
6. Find the within sample sum of squares, WSS.
7. Continuation of Problems 4, 5, and 6. Check that the relation TSS = BSS +
WSS applies to this example.
8. Lay out the analysis-of-variance table, and compute and interpret the F ratio.
9. If you have access to an analysis-of-variance computer program, input the
data from this example and compare the output with the results you
computed by hand (or calculator).
10. What happens to the BSS and WSS when the number of samples k = 1?

15-3 THE F RATIO

The F ratio gives us a way to make a statistical test for a difference among
population means. The ideas follow. If the means for k different populations, μ1,
μ2, …, μk, are equal, then we expect the differences

x̄i − x̄

simply to reflect the variability from one sample to another. In fact, if

μ1 = μ2 = ⋯ = μk,

then it can be shown that the average value of the BMS, averaging over all
possible sets of samples, is σ². That is:

average value of BMS = σ².

On the other hand, if the μi’s differ, it can be shown that the average value of
BMS is greater than σ².

It can also be shown that the average of the WMS over all samples is always
σ². Consequently, the ratio of the mean squares

F = BMS/WMS

should give us a notion of whether or not the population means differ. When the
means are all equal, the average over all samples of the numerator and that for
the denominator are both σ². Thus a ratio of about 1 suggests equality of means.
A larger ratio suggests that the population means are scattered.
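
A small simulation illustrating these facts: when the population means are
equal, the long-run averages of both BMS and WMS come out near σ²:

    import numpy as np

    rng = np.random.default_rng(6)
    sizes, sigma = [3, 6, 3], 2.0
    bms, wms = [], []
    for _ in range(5000):
        groups = [rng.normal(0.0, sigma, n) for n in sizes]  # equal means, all 0
        allx = np.concatenate(groups)
        BSS = sum(len(g) * (g.mean() - allx.mean()) ** 2 for g in groups)
        WSS = sum(((g - g.mean()) ** 2).sum() for g in groups)
        bms.append(BSS / (len(sizes) - 1))          # between mean square
        wms.append(WSS / (len(allx) - len(sizes)))  # within mean square
    print(round(np.mean(bms), 2), round(np.mean(wms), 2))  # both near sigma**2 = 4.0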

The statistic F is usually described by relating it to the degrees of freedom in
its numerator and in its denominator. For the k-sample situation, the numerator
has k − 1 d.f. (one less than the number of samples), and the denominator has
n − k d.f., because that is the number of independent measurements available for
estimating the common σ².

Because both the numerator and the denominator of F involve sums of
squares, F can never become negative. Its possible values run from 0 to ∞.
Further, just as t had a different distribution for each degree of freedom, F has a
different distribution for each pair of degrees of freedom. Figures 15-1(a,b,c)
show some illustrative distributions of F for various pairs of degrees of freedom.
The F distributions are skewed. The probability is concentrated toward the left-
hand tail instead of being symmetrically distributed.

Figure 15-1(a) Curves of the probability density functions of F1,1 and F1,100.

Figure 15-1(b) Curve of the probability density function of F2,9.

Figure 15-1(c) Curves of the probability density functions of F5,9 and F10,100.

Large values of F cast doubt on the hypothesis that the groups have equal
population means. Table 15-8 gives a small table of percentage points for the F
distributions for the 0.05 probability level. A larger table of percentage points for
the 0.05 probability level is Table A-4(a) in Appendix III in the back of the
book; Table A-4(b) gives percentage points for the 0.01 probability level. Tables
of other levels are available in other books. See, for example, Owen (1962),
Pearson and Hartley (1966), and Snedecor and Cochran (1980).
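
Such percentage points no longer require printed tables; for instance, scipy
reproduces the 5 percent point of F2,9 used in Example 6:

    from scipy import stats

    # Upper 5 percent point of the F distribution with 2 and 9 degrees of freedom.
    print(round(stats.f.ppf(0.95, dfn=2, dfd=9), 2))  # 4.26, matching Table 15-8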

EXAMPLE 6 F test for three-sample case. In the three-sample case of Examples
3 and 4, which is summarized in Table 15-7, F = 4.05. There are 2 degrees of
freedom for the numerator of F and 9 degrees of freedom for the denominator.
Thus, we look in Table 15-8 to find the 5 percent level for F2,9. We look in
column 2 and row 9 to find the value 4.26. Because 4.05 is less than 4.26, the P
value exceeds 0.05, and the observed differences among the sample means are
not statistically significant at the 5 percent level.
