value exceeds 0.05, and so at the 5 percent significance level we do not reject the
hypothesis of equal population means. Admittedly, the observed ratio comes
close to this critical value.
TABLE 15-8†
Values of a for which the probability is 0.05 that a value of Fu,v is greater than a, where u is the
number of degrees of freedom for the numerator and v is the number of degrees of freedom for the
denominator in the F ratio (BMS/WMS)
† Larger tables are Table A-4(a) and Table A-4(b) in Appendix III in the back of the book.
EXAMPLE 7 F test for geographic distribution of divorced men. In Example 5
we computed the F ratio for the data of Example 1 on the percentages of
divorced men by states. Carry out the F test for comparing the 5 regions.
SOLUTION. The F ratio from Exhibit 15-A is F = 11.91. When we look in Table
15-8 or Table A-4 in Appendix III in the back of the book to find the 5 percent
level for F4,45, we find the values
Because the observed F ratio of 11.91 is well in excess of both of these values,
the P value is much less than 0.05, and we conclude that the data provide
sufficient evidence to support the existence of differences among regions in
percentages of divorced men.
We recall that the assumptions needed to apply this F test are:
1. Random samples independently drawn from normally distributed
populations.
2. Each population has the same variance σ2.
When these assumptions are mildly violated, the results still may be
approximately correct.
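Assuming the two conditions above hold at least approximately, the one-way F test can be sketched in a few lines. The data, the use of the scipy library, and the group sizes below are all illustrative, not the book's:

```python
import numpy as np
from scipy import stats

# Hypothetical data for k = 3 groups of 5 (not the book's divorce data).
groups = [
    [22.0, 35.0, 38.0, 41.0, 44.0],
    [30.0, 33.0, 36.0, 39.0, 52.0],
    [25.0, 28.0, 31.0, 34.0, 47.0],
]
k = len(groups)
n = sum(len(g) for g in groups)

# F ratio (BMS/WMS) and its P value.
f_ratio, p_value = stats.f_oneway(*groups)

# The 5 percent point of F with k - 1 and n - k degrees of freedom
# plays the role of Table 15-8.
crit = stats.f.ppf(0.95, k - 1, n - k)

# Reject equality of the population means only when F exceeds this point.
print(f_ratio > crit)
```

The same comparison could equally be made through the P value: reject at the 5 percent level when `p_value` falls below 0.05.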
EXAMPLE 8 Mathematics programs. In a large school district, four
different mathematics programs were used with sixth grade classes. The numbers
of classes for each program were as follows:
The mean scores for the classes on a standardized mathematics test for sixth
graders were determined at the end of the year, and these are given in Table 15-
9. The classes were all about the same size. Carry out the analysis of variance
and interpret the results.
TABLE 15-9
Data for comparing the mathematics programs
EXHIBIT 15-C
Computer output for a one-way analysis-of-variance table of the mathematics test data in Table 15-9
SOLUTION. Exhibit 15-C contains the computer output from a one-way analysis
of variance of the data in Table 15-9.*
The F ratio from the ANOVA table in Exhibit 15-C is 1.86. We compare this
with the 5 percent point for the F distribution with 3 and 19 degrees of freedom.
In Table A-4 in Appendix III in the back of the book we find F3,19 to be between
3.10 and 3.29. Because the observed ratio, 1.86, is much smaller than 3.10, we
find that the P value is larger than 0.05 and conclude that the program results
show no statistically significant difference at the 5 percent level.
Although Program I has a mean larger than the rest, it is based on a small
sample. The estimated standard deviations (also from Exhibit 15-C) for the four
programs are as follows:
They do not seem to vary a great deal. Even though the program means differ by
as much as 14 points, the inherent variability within the programs prevents us
from drawing strong conclusions about differences. It would take much more
data to establish these differences firmly than we have available in the present
study.
SMALL VALUES OF F
Theory not given here shows that the “average mean square between” must be
larger than or equal to the “average mean square within.” We must remember
that this is a long-run average under the assumptions. The F distributions
displayed in Figs. 15-1(a,b,c) imply that an F ratio can have values less than 1.
Even when we have true differences among the μi, the observed F ratio might be
smaller than 1. Would we ever find it too small? That is, when might it give a
value that would fall into the lower 5 percent of the distribution instead of the
upper? Of course, this could happen in the 5 percent of the time that chance
plays dirty tricks. But it might also be a signal that a mistake has been made.
EXAMPLE 9 Mistakes in degrees of freedom. An example of such a mistake
might occur if we divided BSS by 100 instead of k – 1, or if we forgot to divide
WSS by n – k. Suppose that we made both mistakes in our three-sample case.
Then we would get
How can we look this up in Table 15-8? We cannot. What we need is an extra
fact.
FACT ABOUT 1/F
Suppose the numerator and denominator of F have u and v degrees of freedom, respectively. Then
1/F has the distribution Fv,u, and so the lower tail of Fu,v corresponds to the upper tail of
Fv,u.
Thus, to use our tables to study the lower tail, we need to invert F and
interchange the degrees of freedom.
Because we do not give a table of lower limits, we need to flip the ratio over
and make it an upper-tail problem. Then
1/(mistaken F ratio) = 111.1.
The new degrees of freedom are (9, 2), in that order, and the 5 percent upper
level for F9,2, from Table A-4 in Appendix III in the back of the book is 19.38.
The observed 111.1 exceeds 19.38, and so the original F = 0.009 is significantly
small.
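The flip-and-invert procedure of Example 9 is easy to check numerically. The degrees of freedom (2, 9) and the observed F = 0.009 come from the example; the scipy calls below are our own way of reproducing the table lookup:

```python
from scipy import stats

u, v = 2, 9          # degrees of freedom of the suspect F ratio
f_obs = 0.009        # observed F from Example 9

# Fact about 1/F: the lower tail of F(u, v) is the upper tail of F(v, u).
upper_5pct_flipped = stats.f.ppf(0.95, v, u)   # 5% upper point of F(9, 2)
print(round(upper_5pct_flipped, 2))            # near the tabled 19.38

# 1/F exceeds that point, so the original F is significantly small.
print(1.0 / f_obs > upper_5pct_flipped)

# Equivalent direct check: the lower 5 percent point of F(2, 9).
lower_5pct = stats.f.ppf(0.05, u, v)
print(f_obs < lower_5pct)
```

The last two lines give the same verdict; inverting F and interchanging the degrees of freedom is exactly what makes the upper-tail table usable for lower-tail questions.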
Systematic Effects. A systematic effect leading to a value of F near zero can
occur when hidden matching in the samples prevents the independence we
assume between the samples. For example, suppose that two
samples were made up of several sets of identical twins, one randomly chosen
for the first sample and the other assigned to the second. Then the samples
would be more alike than random samples from a population. Of course, this
matching would be helpful for an experiment, but we should use the matching in
the analysis as well. But then we are involved with a two-way table, rather than a
one-way table. This type of matching is often called “blocking,” and we discuss
the link between blocking and analysis of variance in Chapter 17, Section 17-6.
EXAMPLE 10 Omitting a variable. In Table 15-4, for Example 2, suppose we
forget that the columns match the cities for calendar months. Then we might run
an analysis of variance as if we had 5 cities with 6 independent measurements
each. We would expect the cities to show a small F ratio because of the strong
matching. You will be asked to study this in Problem 4 of the summary problems
for this chapter.
EXAMPLE 11 Kansas City preventive patrol experiment. With three different
patrolling procedures, and several areas for each procedure, a study analyzed
numbers of reported crimes for several types of crime. For bicycle larceny, the
mean square between was smaller than the mean square within, 1.27 versus 6.22.
This result presumably was due to geographic balancing in the design of the
study. Similar results occurred for several other types of crime.
PROBLEMS FOR SECTION 15-3
1. Weight loss among teenagers. After physical examinations at the beginning
of a term, 17 girls ages 16 to 18 were identified as being considerably
overweight. All but two agreed to participate in an experiment designed to
study the effects of two new diets. The physician in charge randomized them
into 3 groups: 5 girls each to diets A and B and 5 girls to a control group
(diet C). The control group received standard counseling on weight loss but
no specific diet. After 3 months their weight losses to the nearest pound were
as shown in Table 15-10. Carry out the analysis of variance. Explain the
results and interpret them.
TABLE 15-10
Weight losses for teenagers
2. Sugar. Among 21 samples of size 5 each from different stores, the Within
Sum of Squares for the net weight of a “pound” of sugar was 25 ounces2.
The Between Sum of Squares was 5.50 ounces2. Estimate σ2, the common
variance, and σ, the common standard deviation.
3. The Between Mean Square is 4, and the Within Mean Square is 6, with 3 and
150 d.f. If you are testing for differences among the μi’s, is it worth looking
in the F table? Why or why not?
4. The Between Mean Square is 6, and the Within Mean Square is 4, with 3 and
150 d.f. Compute the F ratio, and interpret it using the F table, Table A-4(a)
in Appendix III in the back of the book.
5. Small F. With an F3,9 an investigator observed F = 0.06. He wonders if this
is unusually small. Use Table A-4(a) in the back of the book to find out, and
explain your calculation.
One-way analysis of variance for k = 2 samples. Problems 6 through 12
demonstrate the link between the F ratio for the two-sample case and the two-
sample t test from Chapter 10. An x1 sample is drawn
from one population and an x2 sample from another, with the following values:
6. Find the sample means and the grand mean.
7. Compute BSS and TSS.
8. Lay out the analysis-of-variance table, and compute and interpret the F ratio.
9. Compute the pooled estimate of σ2, based on WSS.
10. Carry out a t test for the equality of the means of the x1 and x2 populations,
using the pooled estimate of σ2.
11. Verify for this example that the square of the t value from Problem 10 equals
the F ratio from Problem 8; that is, F = t2.
12. Check that the 5 percent point for F1,7 in Table 15-8 is identical with the 5
percent point obtained by squaring the two-sided 5 percent point of t7.
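The F = t2 identity of Problems 11 and 12 can be verified numerically. The two small samples below are hypothetical stand-ins for the data of Problems 6 through 12 (chosen so that the pooled t has 7 d.f., matching F1,7 and t7):

```python
import numpy as np
from scipy import stats

# Hypothetical samples; n1 + n2 - 2 = 7 degrees of freedom.
x1 = np.array([10.0, 12.0, 9.0, 11.0])
x2 = np.array([14.0, 13.0, 16.0, 15.0, 12.0])

# Two-sample pooled t test and one-way ANOVA on the same data.
t_stat, t_p = stats.ttest_ind(x1, x2, equal_var=True)
f_stat, f_p = stats.f_oneway(x1, x2)

# With k = 2 groups, F equals t squared and the P values agree.
print(np.isclose(t_stat ** 2, f_stat), np.isclose(t_p, f_p))

# Problem 12's identity: the 5 percent point of F(1,7) equals the
# square of the two-sided 5 percent point of t with 7 d.f.
print(np.isclose(stats.f.ppf(0.95, 1, 7), stats.t.ppf(0.975, 7) ** 2))
```

Both printed values are True, which is the algebraic content of Problems 11 and 12 in miniature.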
Project based on Problems 13 through 16: Explore the analysis of variance in
a situation in which you secretly know the μi’s. If you sum three random digits,
you get numbers that are approximately normally distributed for many purposes.
13. Use the random number table, Table A-7 in Appendix III in the back of the
book, to construct 30 numbers by summing successive sets of three digits
(no overlap). Break these into three samples of 10 each using the first 10,
the second 10, and the third 10.
14. Carry out the analysis of variance on them, estimating σ2 (whose true value
is 24.75, or about 25) and computing the F ratio and interpreting it.
15. You have now carried out an analysis of variance in a situation in which you
know there is no difference among the μi’s. To assess σ2, compute
and compare it to values in the F table with n – k and ∞ d.f. (you know σ2
exactly, and so you have ∞ d.f.).
16. Continuation. Add –5 to the ten numbers in your first set and +5 to the ten
numbers in your third set. Leave the second set unchanged. Now carry out
the same steps as in Problem 14. How has the estimate of σ2 changed?
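For readers with a computer rather than the random number table, the project of Problems 13 through 16 can be simulated. The generator and seed here stand in for Table A-7; everything else follows the problem statements:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # fixed seed; the book uses Table A-7 instead

# Each observation is the sum of three random digits 0-9, so the true
# variance is 3 * 8.25 = 24.75 and the three group means are equal.
samples = [rng.integers(0, 10, size=(10, 3)).sum(axis=1) for _ in range(3)]

# One-way ANOVA: with no true differences, F should usually be modest.
f_ratio, p_value = stats.f_oneway(*samples)

# Pooled within-groups estimate of sigma^2 (true value 24.75).
wss = sum(((s - s.mean()) ** 2).sum() for s in samples)
sigma2_hat = wss / (30 - 3)
print(round(sigma2_hat, 2), round(f_ratio, 2))
```

Adding -5 and +5 to the first and third sets, as in Problem 16, and rerunning the same code shows how a real difference among the μi inflates the F ratio while leaving the within-groups estimate of σ2 essentially unchanged.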
15-4 TWO-WAY ANALYSIS OF VARIANCE WITH ONE
OBSERVATION PER CELL
The temperature data for northeastern cities in Table 15-4 introduces an extra
“way” in the table. Thus, we want to break the total sum of squares into three
parts:
1. between calendar periods (columns)
2. between cities (rows)
3. residual (interaction plus error)
To give formulas, we need a little more notation than we have given up to
now. Let
r = number of rows
c = number of columns
xij = entry in row i, column j (rows are stated first)
xi+ = total for row i
x+j = total for column j
x++ = grand total.
The grand mean is simply
The total sum of squares is defined as before as the sum of squares of
deviations of all measurements from the grand mean x. In our current setting,
with the new notation, this is the following:
Total sum of squares:
The calculation of the between-rows sum of squares is just like the BSS
calculation from Section 15-2. We only need to keep in mind that the number of
observations in each row is the same, that is, c:
or we can write more briefly
Between-rows sum of squares:
We use a similar calculation to get
Between-columns sum of squares:
Because the sums of squares for the two-way table add up to the total, just as
they did for the one-way case, we have
Residual sum of squares:
Special computational formulas are available to simplify the work in
computing these sums of squares, but as before we rely on the computer to do
our calculations. The analysis-of-variance table based on these sums of squares
can be laid out as in Table 15-11. Note that the lines for BRSS and BCSS
replace the single line for BSS that we had in the one-way case.
The degrees of freedom for each sum of squares are also listed in Table 15-
11. The only new value is the residual sum of squares. The (r – 1)(c – 1) d.f. for
the RSS can be derived most directly by recalling that the degrees of freedom
must also add up:
TABLE 15-11
Two-way analysis-of-variance table with one observation per cell
By dividing each sum of squares by its degrees of freedom, we get the mean
squares. If the cells were all drawn from the same population, then the between-
rows mean square, the between-columns mean square, and the residual mean
square would be identical, on the average, though not in specific examples. The
ratio of the between-rows mean square to the residual mean square, when the
cells all come from a population with the same mean, has an F distribution with
degrees of freedom r – 1 and (r – 1)(c – 1). Similarly, the distribution of the ratio
of the between-columns mean square to the residual mean square is F, with c – 1
and (r – 1)(c – 1) d.f. A large value of the F for rows suggests that the row means
differ. A similar interpretation applies to the F for columns.
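These calculations can be carried out directly from the definitions. The 4-by-3 table below is hypothetical (made-up numbers, not the temperatures of Table 15-4), and the scipy calls are our own:

```python
import numpy as np
from scipy import stats

# A hypothetical r = 4 by c = 3 table with one observation per cell.
x = np.array([[31.0, 45.0, 60.0],
              [29.0, 43.0, 58.0],
              [35.0, 50.0, 66.0],
              [30.0, 44.0, 59.0]])
r, c = x.shape
grand = x.mean()

# Sums of squares from the definitions in this section.
tss = ((x - grand) ** 2).sum()
brss = c * ((x.mean(axis=1) - grand) ** 2).sum()   # between rows
bcss = r * ((x.mean(axis=0) - grand) ** 2).sum()   # between columns
rss = tss - brss - bcss                            # residual

# Mean squares and F ratios against the residual mean square.
rms = rss / ((r - 1) * (c - 1))
f_rows = (brss / (r - 1)) / rms
f_cols = (bcss / (c - 1)) / rms
p_rows = stats.f.sf(f_rows, r - 1, (r - 1) * (c - 1))
p_cols = stats.f.sf(f_cols, c - 1, (r - 1) * (c - 1))
print(round(f_rows, 1), round(f_cols, 1))
```

The degrees of freedom add up as the text requires: (r - 1) + (c - 1) + (r - 1)(c - 1) = rc - 1, the d.f. for the total sum of squares.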
In the two-way analysis, we think of a measurement as being made up of four
parts:
μ = grand mean
αi = effect of row i (Σαi = 0)
βj = effect of column j (Σβj = 0)
eij = error with mean zero and variance σ2.
To have accuracy with F ratios, we need the eij to be approximately normally
distributed. To sum up, the relation between the observation xij and the four
components given earlier is the following:
Analysis-of-variance model:
EXAMPLE 12 City temperatures. Carry out the analysis of variance for the
temperature data of Example 2, Table 15-4.
SOLUTION. In Exhibit 15-D we give the computer results of the analysis-of-
variance calculation. The huge effect of calendar periods is expected and
obvious. Northeasterners do not need analysis of variance to know that winter
comes and is substantially different from summer. But by taking out the effects
of period, we get a better measure of the basic variability in the residuals to help
us appraise the variation between cities. From Table A-4(a) in Appendix III in
the back of the book, the 5 percent level of F with 4 and 20 d.f. is 2.87, and our
observed F, which is equal to 54.03/1.77 = 30.5, is very much higher than this.
We therefore have firm evidence of differences in temperatures among these
cities averaged over the year. Because all the cities are situated near water and
are not very much removed north to south, we might not have been able to detect
the differences reliably without controlling for the effects of period.
The estimate of the residual variance is 1.77, so that the residual standard
deviation is 1.33 degrees Fahrenheit. This is the standard deviation of the
residuals after we have estimated the cell values on the basis of the grand mean
and the row and column effects. What is it composed of?
EXHIBIT 15-D
Computer output of the analysis-of-variance calculations for the temperature data (Table 15-4)
Several things:
1. Rounding error: These cells were averages rounded to the nearest degree.
2. Variability of the cells: The cell values are no doubt based on several years’
experience, but they are subject to sampling fluctuations.
3. Interaction: No law says that even when measured perfectly the row and
column effects will predict the cells additively. The failure to produce that
additivity is called interaction.
These and perhaps other effects combine to produce the residual mean square
that we use to gauge the row and column effects.
PROBLEMS FOR SECTION 15-4
1. An experiment is done to study the effects of seat height (24, 26, 30 inches)
and tire pressure (40, 45, 50, 55 pounds per square inch) on the time required
for a college student to ride a bicycle over a two-mile course. She completes
the course a total of 12 times, once on each of 12 successive days for each of
the 12 combinations of seat height and tire pressure. Set up an analysis-of-
variance table to display the results of such an experiment.
2. Air pollution extremes. The following table gives extreme value indices of air
conditions (that is, the worst values) for each of 3 years for 3 sizes of
communities:
Carry out the analysis of variance and discuss the results.
3. Boiling water. A laboratory experiment in home economics compares
household appliances. One experiment measures the effects of the brand of
pot and brand of stove on the time required to boil two quarts of water. The
experiment involves 3 pots of identical size from different manufacturers
and 2 makes of stoves. The observations are time in seconds to boiling.
Analyze the data and discuss the results.
4. Scholastic test performance. In a national high school study in 1955,
investigators gathered data on high schools, their communities, and the
career and college plans of 35,000 seniors. Background information on the
seniors’ families made it possible to classify them into 5 fifths according to
an index of socioeconomic status. The investigators classified the school
itself according to the percentage of seniors’ families falling into the top
two-fifths of the socioeconomic index.
The two-way data in Table 15-12 show the percentage of seniors performing
above the median on their scholastic aptitude tests. Make and interpret a two-
way analysis of variance for this table.
TABLE 15-12
Percentage of seniors scoring in the top half of the scholastic aptitude test broken down by
socioeconomic status of the family and the high school climate
5. Continuation. Compute residuals for the cells, and see if you observe any
pattern to them.
6. Continuation. What interpretation can you make of the residuals?
15-5 SUMMARY OF CHAPTER 15
1. The purpose of analysis of variance is to measure variability and assign it to
its sources.
2. One-way analysis of variance breaks the total sum of squares, TSS, into the
between-means sum of squares, BSS, and the within-groups sum of squares,
WSS, with degrees of freedom n – 1, k – 1, and n – k, respectively. In one-
way analysis of variance, the grand mean is
The total sum of squares, TSS, is
The between-means sum of squares, BSS, is
and the within-groups sum of squares is
3. For calculations, we use
4. An estimate of σ2 is
5. To test for differences among group means, we compute
and refer it to an F table with k – 1 and n – k d.f. We reject equality of means
when F is too large.
6. To test whether F is too small for an Fu,v variable, we take the reciprocal of F
and see if it is too large for an Fv,u variable (note interchange of subscripts).
7. Programmable hand calculators and high-speed computers may conveniently
use the definitional formulas for the sums of squares of deviations required
in the analysis of variance.
8. The two-way analysis of variance deals with measurements classified two
ways, and so it can allocate variability to row effects, column effects, and
residual effects:
TSS = BRSS + BCSS + RSS.
The residual effects are due at least to sampling errors, rounding errors, and
nonadditivity (interaction). If we had more than one observation per cell, we
could assess sampling error directly in the two-way analysis.
SUMMARY PROBLEMS FOR CHAPTER 15
1. An experiment was set up to compare 5 groups of size 4. Some of the results
are displayed in the following ANOVA table:
Complete the table by filling in the missing entries, and compare the F ratio
to the appropriate 5 percent tail value from the F table.
Data for Problems 2 and 3: In an experiment to compare the weight gains of
baby chicks reared on four different forms of tropical feed, 5 chicks were
assigned to each form of feed. The resulting weight gains for the 20 chicks were
the following:
Feed Type
(Source: Query 70, Biometrics 5:250, 1949.)
2. Compute the sample mean and variance for each of the four feed types.
3. Analyze the results of this experiment using an analysis-of-variance table.
Compute and interpret the F ratio.
4. Omitting a matching variable. Using the results of Exhibit 15-D in Section
15-4, carry out a one-way analysis of variance for the city temperature data
of Table 15-4, dropping the distinction between the calendar periods, and
thus behaving as if each city has six independent measurements. Does F turn
out to be small?
5. Using the data in Table 15-5 and your calculations in Problems 8 through 10
following Section 15-1, construct an analysis-of-variance table for the results
of the 1976 Montreal Olympics women’s platform diving competition.
Interpret the results.
6. A study of automobile performance involved 3 different models of cars and 5
blends of gasoline. Each car was driven over a 50-mile course 5 times, once
using each gasoline. The results in miles per gallon were as follows:
Blend of Gasoline
The data are summarized in the following two-way analysis-of-variance
table:
Compute the F ratios for cars and blends, and discuss the results of this study.
7. An experiment is designed to produce information on the yield of three
different types of popcorn kernels. Two equal-size samples of each type were
tested, one using corn oil and a traditional popper, the second using a new
hot-air popper that requires no oil. Lay out an analysis-of-variance table for
this experiment, filling in all entries for which information is available.
8. Continuation. What are the degrees of freedom for the F ratio for popcorn
types in Problem 7?
9. Continuation. The experiment in Problem 7 is extended to include a third
type of popper that uses a small amount of peanut oil. All three types of
kernels are tested with this new popper as well, and the results are combined
with those from the earlier experiment. What are the changes in the degrees
of freedom for the analysis-of-variance table?
REFERENCES
G. E. P. Box, W. G. Hunter, and J. S. Hunter (1978). Statistics for Experimenters, Chapters 6 and 7. Wiley,
New York.
D. B. Owen (1962). Handbook of Statistical Tables, pp. 64-87. Addison-Wesley, Reading, Mass.
E. S. Pearson and H. O. Hartley (editors) (1966). Biometrika Tables for Statisticians, vol. I, third edition,
pp. 169-175. Cambridge University Press, Cambridge, England.
T. A. Ryan, Jr., B. L. Joiner, and B. F. Ryan (1976). MINITAB Student Handbook, Chapter 10. Duxbury,
North Scituate, Mass.
G. W. Snedecor and W. G. Cochran (1980). Statistical Methods, seventh edition, Chapters 12 and 16, and
pp. 480-487. Iowa State University Press, Ames.
* C1, C2, C3, and C4 in the computer printout for Exhibit 15-C correspond to Programs I, II, III, and IV,
respectively.
16 Nonparametric Methods
Learning Objectives
1. Applying methods that defend themselves against wild observations
2. Replacing measurements or comparisons by their signs (+ or –)
3. Using ranks in place of measurements to get alternatives to the t test for two
groups, and analysis of variance for more than two groups
16-1 WHAT ARE NONPARAMETRIC METHODS?
The best-known methods of statistical inference, methods that yield probability
levels, such as regression analysis and analysis of variance, assume that we
know the shape of the probability distribution that the measurements take. For
example, if we assume that the measurements are drawn from a normal
distribution, then we can proceed to construct confidence limits for its
parameters, such as the mean μ or the variance σ2, or check on the plausibility of
specific values. Frequently, however, we cannot confidently make an assumption
about the shape of the distribution of measurements. The evidence may not
support a given shape or, still worse, may sharply deny it.
In such circumstances we may prefer to use methods whose strengths do not
depend much on the precise shape of the distribution. For example, we may want
to compare properties of distributions even when we know little about their
shapes. So we turn to methods based on signs of differences, ranks of
measurements, and counts of objects or events falling into categories. The
behavior of such methods may not rest heavily on the shape of the distribution,
and for this reason they are called nonparametric methods.
The term nonparametric is somewhat misleading, because nonparametric
statistics do deal with parameters such as the median of a distribution or the
probability of success p in a binomial. Indeed, the word nonparametric as
commonly used does not lend itself to a precise definition.
The main point is that many of the methods that we now describe defend
themselves against wild observations and stand up well against various shapes of
distributions and failures of assumptions. Statisticians use such words as robust
and resistant for methods that have these properties. To illustrate resistance,
consider the medians and means of two samples of 5 measurements:
The huge change from 45 to 945 in one of the measurements changes the
median not at all. The mean, though, changes from 30 to 210. Similarly, if we
ranked these measurements from 1 to 5, both 45 and 945 would be given the
same rank in their respective samples. If we picked some cutoff number and
assigned plus signs to larger numbers and minus signs to smaller numbers, then
for any cutoff less than 45, both 45 and 945 would get the same sign. This
illustrates the idea of resistance of medians, ranks, and signs.
The word robust means that if the populations do have shapes appropriate for
parametric methods, then we lose only a little information by using the robust
method. If the data come from normal distributions, the sample mean has a
variance about 64 percent as large as that of the sample median. So when we use
the sample median, we say that we lose 36 percent of the information if we are
sampling from a normal distribution. Some statistics for estimating location lose
less information than the median, and they are more robust. An example is the
average of the observations remaining after deleting the largest 20 percent and
the smallest 20 percent of the data. In our first sample of 5 measurements given
earlier, this would leave 22, 35, and 38, whose average is 95/3 = 31.7; the second
sample would give the same result. Thus resistance to wild measurements means that a statistic
does not change much when a small proportion of the measurements change, and
robustness means that the statistic preserves much of the information present
whether or not ideal assumptions are true.
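These properties can be verified directly. The fifth value in each sample is not shown above, but the quoted mean of 30 forces it to be 10; assuming that reconstruction, the resistance of the median and of the 20 percent trimmed mean looks like this:

```python
import numpy as np
from scipy import stats

# The two samples of five measurements; the value 10 is reconstructed
# from the quoted mean of 30, the rest appear in the text.
a = np.array([10.0, 22.0, 35.0, 38.0, 45.0])
b = np.array([10.0, 22.0, 35.0, 38.0, 945.0])  # 45 replaced by the wild 945

# The median resists the wild value; the mean does not.
print(np.median(a), np.median(b))    # both 35
print(a.mean(), b.mean())            # 30 versus 210

# Trimming 20 percent from each end drops the smallest and largest
# values, so both samples give the same trimmed mean, 95/3 = 31.7.
print(round(stats.trim_mean(a, 0.2), 1), round(stats.trim_mean(b, 0.2), 1))
```

The trimmed mean is thus resistant like the median yet, for near-normal data, loses less information than the median does.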
Sometimes data come only in dichotomies (“plus or minus signs,” “better or
worse,” “dead or alive,” “success or failure”) or in grades (“high, middle, or
low,” “excellent, good, fair, or poor”). Then nonparametric methods are essential
for making significance tests or constructing confidence limits. We used some
nonparametric methods in Chapter 9 on contingency tables and counts: the sign
test based on the binomial distribution and the chi-squared test for independence.
Although ease of application and ease of calculation seem to be attractive
reasons for using nonparametric methods, it is more important that good, rather
than easy, methods be used.
The methods we present in this chapter correspond to further applications of
the binomial distribution of Chapter 7, to a two-sample nonparametric test
comparable to the t test in Chapter 10, and to an analysis of variance by ranks
comparable to the method of Chapter 15.
16-2 THE SIGN TEST
When we want to know whether one treatment more often gives better results
than another, the sign test offers a method for deciding. We have already used
the method repeatedly in connection with counts in Chapter 9. The data may
arise in various ways, as the following examples suggest.
EXAMPLE 1 Matched pairs. On the basis of family background, age, and
severity of delinquency, 100 delinquent boys were paired. One randomly chosen
member of each pair attended summer camp; the other participated in a police
athletic league. After a period, 25 of the athletic league members were rated as
showing less delinquency than their paired “mates,” 15 of the summer camp
boys were performing better than their “mates,” and 10 pairs were tied.
EXAMPLE 2 Time periods. Are body excretions of certain chemicals higher
during the day or during the night (12 hours each)? In 3 of 8 individuals they
were higher at night.
EXAMPLE 3 Gains. Black plastic mulch for tomatoes costs more than dirt. An
extra yield of 5 percent or more would pay for the mulch. Among 18 matched
plots, the extra yield exceeded 5 percent in 12 plots and was less in 6 plots.
The Name. The sign test gets its name from our replacing the attributes or
measurements or ratings or differences or comparisons by plus (+) or minus (-)
signs. When the comparisons are independent, we use the binomial distribution
with
as a basis for judgment. If the two treatments are about the same in performance,
the number of plus signs will be about half the total number of signs; if one
treatment is better, the number tends to differ from half. When it differs enough,
we prefer to conclude that one of the treatments is more often successful.
We illustrate the method by applying it to Example 1. The big idea is, of
course, that the athletic league boys did better than the summer camp boys, and
so if we have to choose one of these treatments now, if costs and politics are
similar, we choose the athletic league. Nevertheless, we may also ask if the
25-15 outcome is compelling.
What about the ties? Although they tell us a lot about how often the
performances are close together, they do not tell us which treatment more often
wins, and for making such a test, we set them aside. We proceed as follows:
We arbitrarily assign a plus sign to pairs when one group wins or a difference
is positive and a minus sign when the other group wins or the difference is
negative, and we regard the sample size, n, for comparison purposes as the
number of nonzero differences (here 40).
We use the cumulative binomial table, Table A-5 in Appendix III in the back
of the book, for p = ½ to compute the probability of a split as extreme as or more
extreme than the one observed. Or we could use the normal approximation, as
we did earlier for the binomial. To use Table A-5, we find the column
corresponding to the sample size n, here 40. It shows the following:
The numbers in our classes are 25 and 15, with 15 the less frequent, and so we
read the number 077. Each number in the table is to be understood as preceded
by a decimal point. Therefore the probability of 15 or fewer observations in the
class when p = ½ is 0.077. To get the probability that one or the other class has a
count less than or equal to 15, we must multiply by 2, and we get 2 × 0.077 =
0.154. This two-sided calculation seems sensible, because we were not given
either treatment as a standard. What we have found is that although a 25-15 split
sounds impressive, splits at least this wide occur more than 15 percent of the
time when p = ½.
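The table lookup in this solution can be reproduced exactly from the binomial distribution; here is a sketch using scipy for the cumulative probability:

```python
from scipy import stats

# Sign test for Example 1: 25 pluses and 15 minuses among the n = 40
# untied pairs; the 10 ties are set aside.
n, k = 40, 15   # k = count in the less frequent class

# Probability of 15 or fewer in one class when p = 1/2 ...
one_tail = stats.binom.cdf(k, n, 0.5)
# ... doubled for the two-sided test, as in the text.
two_sided = 2 * one_tail
print(round(one_tail, 3), round(two_sided, 3))   # about 0.077 and 0.154
```

The normal approximation with continuity correction, (15.5 - 20)/√10 ≈ -1.42, gives nearly the same one-tail probability, which is the approach hinted at in Problem 6.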
PROBLEMS FOR SECTION 16-2
1. In Example 1, if n = 40, use Table A-5 in Appendix III in the back of the
book to find the probability of a split of 30-10 or wider.
2. Apply the method to Example 2, and find the probability of a split at least as
wide as that observed.
3. Apply the method to Example 3, and find the probability of a split at least as
wide as that observed.
4. When n = 50, what split gives a two-sided probability level of just less than 5
percent?
5. Because we know that in real life p is not likely to be exactly ½, what are we
testing when we look at the binomial probabilities? That is, what are we
trying to decide?
6. If you did not have Table A-5 in Appendix III in the back of the book, how
could you get an approximate answer in Problem 1?
7. Methods of memorizing. Two methods of memorizing difficult material are
tried to see which gives better retention. Pairs of students are matched for
both IQ and academic performance. They then receive instruction on the
same material, but one member of a pair uses method A to learn it, and the
other member uses method B. The students are then tested for recall, and the
following scores are obtained:
Use the t test of Chapter 10 to analyze these data.
8. Continuation. Reanalyze the data from Problem 7 using the sign test, and
compare the observed significance level with that from the t test.
9. Continuation. Suppose we were to add two additional pairs of scores to the
data in Problem 7:
How do these new observations affect the sign test of Problem 8?
16-3 THE MANN-WHITNEY-WILCOXON TWO-SAMPLE
TEST
When we have two independent samples, we may want to know if the
populations have much the same location or if they are separated.
When we are willing to suppose that our measurements are approximately
normally distributed without wild observations, the two-sample t test suits us
well for this purpose. But when we have no such comfortable views about the
samples and their populations, we may prefer an approach in which a few wild
observations will cause only limited damage. The Mann-Whitney-Wilcoxon
two-sample rank test offers such an approach.
EXAMPLE 4 Samples of sizes 2 and 4. Sample A contains two measurements, 6
and 24; sample B has four measurements, 14, 33, 74, and 105. Compare the
samples for evidence that sample B comes from a population slipped to the right
of that of sample A.
SOLUTION. The ranking approach considers all six measurements as a
population. Ranks are assigned from least to greatest, here rank 1 to the
measurement 6, rank 2 to 14, and so on up to rank 6 for 105. Then we form all
possible situations that divide the six into two samples of sizes 2 and 4. Finally,
we compute the distribution of their summed ranks for the samples of size 2.
This program has been carried out in Table 16-1. In all, we have 15 possible
samples of size 2, and the ranks associated with the samples of size 2 have been
summed. Equivalently, we could have summed those for the sample of size 4,
but it is more trouble. The extremeness of any sample is judged by how far it is
in the tails of the frequency distribution.
The rank sum of our sample A is 4. The chance of a rank sum being this
small or smaller if the sample is a random choice from these six measurements is
2/15. If we also include the possibility of rank sums being as large as 10 or
larger, we have 4/15 as the probability of a rank sum at least this far from the
middle of the distribution in either direction. Even the most extreme rank sum 3
(or 11) would have given us a two-sided probability of 2/15. Thus our samples
are not large enough to indicate a very rare event. But they have served to
illustrate the idea of the test.
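The enumeration of Table 16-1 is easy to reproduce by machine; this sketch (our illustration) lists all 15 rank sums and recovers the 2/15 and 4/15 probabilities of the solution:

```python
# Enumerate all C(6,2) = 15 ways two ranks can form sample A, as in Table 16-1.
from itertools import combinations
from fractions import Fraction

sums = [sum(pair) for pair in combinations(range(1, 7), 2)]  # 15 rank sums

t = 1 + 3                            # sample A holds ranks 1 (value 6) and 3 (value 24)
mirror = min(sums) + max(sums) - t   # 10: the equally extreme upper-tail sum
p_lower = Fraction(sum(s <= t for s in sums), len(sums))
p_two_sided = Fraction(sum(s <= t or s >= mirror for s in sums), len(sums))
print(p_lower, p_two_sided)          # 2/15 and 4/15
```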
TABLE 16-1
Possible samples of two from six measurements together with the sums of ranks
The general procedure for sample A with n measurements and sample B with
m measurements, n ≤ m, is as follows:
1. Rank the n + m measurements from least to greatest: rank 1, 2, …, n + m.
2. Total the ranks for sample A to obtain the total of its ranks, t.
3. Look in Table A-6 in Appendix III in the back of the book to find the
probability of a total at least as extreme as t, P(T ≤ t) or P(T ≥ t) if t is in the
upper tail. For a two-sided problem, double the probability.
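Steps 1 and 2 of the procedure can be sketched as a small function; this version assumes no tied measurements (ties are treated later in this section):

```python
# Pool the measurements, rank them from least to greatest, and total the
# ranks falling in sample A. Assumes every measurement is distinct.
def rank_sum(sample_a, sample_b):
    pooled = sorted(sample_a + sample_b)
    rank = {x: i + 1 for i, x in enumerate(pooled)}
    return sum(rank[x] for x in sample_a)

t = rank_sum([6, 24], [14, 33, 74, 105])  # Example 4's samples
print(t)                                  # rank sum of sample A: 4
```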
EXAMPLE 5 Sample sizes 4 and 8. Two samples of sizes n = 4 and m = 8 have
a rank sum t = 15 for the n measurements. Find from Table A-6 in Appendix III
in the back of the book the probability of a value of T this small or smaller when
sampling is done at random.
SOLUTION. We enter Table A-6 in Appendix III in the back of the book with n
= 4 and m = 8 on the line reading 15 036 37. The probability of a rank sum less
than or equal to 15 is 0.036, which solves Example 5. Furthermore, the
probability of a rank sum greater than or equal to 37 is 0.036; that of a rank sum
at least this far from the middle of the distribution is 2 x 0.036 = 0.072.
The general idea is that if the selected population has slipped to the right, its
observations will tend to have larger ranks on the average; if it has slipped to the
left, its observations will tend to have smaller ranks, as suggested in Fig. 16-1.
When the populations are positioned in much the same place, the ranks of the
two samples will have much the same average, as suggested in Fig. 16-2.
Figure 16-1 Population A has slipped to the left of population B.
Figure 16-2 The populations heavily overlap.
EXAMPLE 6 Rosier futures. Tabular approach. How much better will the future
be? A social psychologist surveyed samples of people in many countries, asking
what concerned them, how satisfied they were, and how satisfied they expected
to be 5 years later. He rated their degrees of satisfaction on a scale from 0 to 10.
People from all countries thought the future would be better than the present.
Table 16-2 compares the improvements in satisfaction ratings expected in three
industrialized countries with those expected in a number of less-industrialized
countries. The two samples are of sizes n = 3 and m = 8. The sum, t, of the ranks
in the smaller sample is t = 4 + 1 + 2 =7. Table A-6 in Appendix III in the back
of the book tells us that for n = 3 and m = 8, when the populations are identical,
the chance of a sum of 7 or less in the sample of size 3 is 0.012. This is a very
small probability, and we are inclined to think that the industrialized countries
are not expecting as much improvement as the others and that this difference is
not accounted for by sampling variation.
NORMAL APPROXIMATION
We can use a normal approximation for values of n and m at least as large as 3,
using the approach shown in the following box. Here μT and σT are the mean and
standard deviation of the distribution of the rank sum T for the smaller sample
when the two samples are drawn from identical populations and thus have
identical locations. We do not give their derivations.
NORMAL APPROXIMATION FOR TWO-SAMPLE RANK TEST
If T is the rank sum for the sample of n among the n + m pooled measurements, then the mean
of T is
μT = n(n + m + 1)/2,     (1)
and the variance of T is
σT² = nm(n + m + 1)/12.     (2)
We compute
z = (T − μT)/σT,
with a continuity correction of ½ moving T toward the mean, and refer z to the standard
normal table to get a probability.
EXAMPLE 7 Normal approximation for rosier futures. For the data given in
Table 16-2, n = 3, m = 8, so that
μT = 3(3 + 8 + 1)/2 = 18 and σT = √(3 × 8 × 12/12) = √24 ≈ 4.90.
TABLE 16-2
Expected improvements in industrialized and less-industrialized countries together with their ranks
To get the approximate probability of T ≤ 7, we first compute
z = (7 + ½ − 18)/4.90 = −2.14,
where the ½ is the continuity correction.
The normal table, Table A-1 in Appendix III in the back of the book, gives P(Z ≤
−2.14) = 0.0162. This 0.0162 approximates the 0.012 we got from Table A-6 in
Appendix III in the back of the book. On the one hand, it is close, within 0.006,
but on the other hand, it is 35 percent larger. This is one reason we like exact
tables for the smaller n’s and m’s.
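The comparison in Example 7 can be verified numerically; the sketch below uses the mean and variance of T together with the half-unit continuity correction, which is what reproduces z = −2.14:

```python
# Normal approximation for n = 3, m = 8, t = 7 (Example 7).
from math import sqrt, erf

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, m, t = 3, 8, 7
mu = n * (n + m + 1) / 2                  # 18
sigma = sqrt(n * m * (n + m + 1) / 12)    # sqrt(24), about 4.90

z = (t + 0.5 - mu) / sigma                # continuity correction; about -2.14
p = phi(z)                                # about 0.016, vs. the exact 0.012
print(round(z, 2), round(p, 3))
```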
Ties. For the sign test, we dealt with ties in a simple way—we ignored them,
setting the corresponding observations aside. For the Mann-Whitney-Wilcoxon
test, we do not ignore them:
1. When two or more measurements are tied, we assign to them the average
rank available to the tied measurements.
2. We use the normal approximation for T, the rank sum for the sample of n, but
we replace formula (2) for the variance by formula (5) below.
For the correction term, we let W = w³ − w, where w is the number of tied
observations at a given level. Table 16-3 gives W for various values of w. Then,
if ΣW is the sum of W over all tied sets,
σT² = nm(N³ − N − ΣW)/[12N(N − 1)], where N = n + m.     (5)
TABLE 16-3
Helping table for ties, W = w3 – w†
†w = number of tied measurements in a set of ties.
EXAMPLE 8 Sample sizes n = 3, m = 8, with ties. Two samples, of sizes n = 3
and m = 8, have one set of 5 tied measurements consisting of all 3 from the n
sample and 2 from the m sample, and these 5 are the lowest measurements. Find
the variance.
SOLUTION. The 5 lowest ranks are 1, 2, 3, 4, and 5, and their average is 3. This
average is used for the 3 tied ranks from the n sample in computing t. To correct
for the ties, we need
ΣW = 5³ − 5 = 120.
From formula (5), with N = 3 + 8 = 11,
σT² = (3)(8)(11³ − 11 − 120)/[12(11)(10)] = 24(1200)/1320 ≈ 21.8.
Therefore,
σT ≈ √21.8 ≈ 4.67, somewhat smaller than the √24 ≈ 4.90 of the untied case.
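Example 8's arithmetic can be checked with the standard tie-corrected variance, σT² = nm(N³ − N − ΣW)/[12N(N − 1)]; treat this as our reading of formula (5), since the original display is not reproduced in this text:

```python
# Tie-corrected variance of T for n = 3, m = 8, one tied set of five values.
n, m = 3, 8
N = n + m                                  # 11 pooled measurements
sum_w = 5**3 - 5                           # W = w^3 - w for the set of 5 ties: 120

var_untied = n * m * (N + 1) / 12                          # 24.0
var_tied = n * m * (N**3 - N - sum_w) / (12 * N * (N - 1)) # about 21.8
print(var_untied, round(var_tied, 2))
```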
PROBLEMS FOR SECTION 16-3
For Problems 1 and 2, use the normal approximation to find the probability
indicated, and compare the result with that given in Table A-6 in Appendix III in
the back of the book.
1. P(T ≥ 20), when m = 7, n = 3.
2. P(T ≥ 29), when m = 8, n = 4.
3. Masses of planets (World Almanac, 1970). The following table gives the
masses of the planets in terms of the mass of Earth, which is taken as 1 unit.
The B planets are farther from the sun than is Earth, and the A planets are
nearer.
What probability does the Mann-Whitney-Wilcoxon test give for a rank sum
as small as or smaller than that for sample A? Does this support or weaken
the hypothesis of identical locations for the populations leading to groups A
and B? Does the test accept, at the 10 percent level, the hypothesis of
identical distributions for groups A and B?
4. Are famous men more likely to die before or after reaching their birthdays?
One might want very much to live until one’s next birthday; perhaps some
people can live a little longer if they have a goal. The following data were
obtained by a researcher
(Phillips, 1978), who investigated the relationship between the months in
which 1251 famous men died and their birthmonths (i = number of months
after the birthmonth).
Use these data and the Mann-Whitney-Wilcoxon test to decide whether or
not famous men usually die after rather than before their birthmonths.
SOLUTION. Note that the given data offer several possible interpretations. It
seems natural to let i > 0 indicate “after birthmonth” and i < 0 indicate
“before birthmonth.” But how shall we deal with the birthmonth itself, where
i = 0? We might interpret “before birthmonth” and “after birthmonth”
respectively, by (1) i ≤ 0, i > 0, or by (2) i < 0, i ≥ 0, or by (3) i < 0, i > 0, in
which we omit the i = 0 record.
For each of these three interpretations, the data suggest that famous men
die after their birthmonths. In order to get precise information about the
strength of the evidence, we shall apply the Mann-Whitney-Wilcoxon test,
using interpretation (1).
Let the A sample include the numbers of deaths in the months at or before
the birthmonth (i ≤ 0), and let the B sample include the numbers of deaths in
months after the birthmonth (i > 0). Rank the numbers, and indicate the ranks
belonging to sample B:
Let n be the number of B’s, and test whether their ranks are too high for a
random sample. Here n = 5, m = 7, t = 7 + 8 + 9 + 10 + 12 = 46, so that
μT = 5(5 + 7 + 1)/2 = 32.5, σT = √(5 × 7 × 13/12) ≈ 6.16, and
z = (46 − ½ − 32.5)/6.16 = 2.11.
From Table A-1 in Appendix III in the back of the book, we find P(Z ≥ 2.11)
= 0.017. This casts doubt on the notion of a random sample and gives strong
evidence that famous men die after their birthmonths.
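As a check on the solution's z, the same normal approximation with a half-unit continuity correction gives:

```python
# z for the birthmonth data: n = 5, m = 7, t = 46.
from math import sqrt

n, m, t = 5, 7, 46
mu = n * (n + m + 1) / 2                 # 32.5
sigma = sqrt(n * m * (n + m + 1) / 12)   # about 6.16

z = (t - 0.5 - mu) / sigma               # upper tail, so subtract the half unit
print(round(z, 2))                       # about 2.11, matching the solution
```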
5. Continuation. Use interpretation (2) in Problem 4 to make your decision.
6. Continuation. Use interpretation (3) in Problem 4 to make your decision.
7. Carry out the Mann-Whitney-Wilcoxon approach to analyze the rates of
word usage in Table 10-6, and compare the results with those obtained by
the approximate t test there.
8. If m = 5 and n = 4 and the pooled 9 measurements have the “ranks” 2, 2, 2,
5.5, 5.5, 5.5, 5.5, 8.5, and 8.5, find the variance of T.
9. A physics teacher had two small classes taking the same course. On the final
examination, the following marks actually occurred:
Use the normal approximation to the Mann-Whitney-Wilcoxon test to see if
the results from the two classes could be regarded as samples from the same
distribution.
10. Enzyme activity. The activity of an enzyme (beta glucuronidase) was
measured for the sweat glands of a group of diseased patients and a group of
control patients. The results were as follows:
Use the normal approximation for the Mann-Whitney-Wilcoxon test to see if
these measurements might reasonably have come from identical
distributions.
16-4 ANALYSIS OF VARIANCE BY RANKS: THE KRUSKAL-
WALLIS TEST
In addition to the two-sample Mann-Whitney-Wilcoxon test, which is the
nonparametric version of the two-sample t test, we would like a nonparametric
version of the one-way analysis of variance for more than two groups. What we
need is something corresponding to the between-means sum of squares. We rank
all the observations in the pool of all the samples just as we did for the Mann-
Whitney-Wilcoxon. Let
ni = number of observations in sample i
Ri = sum of ranks in sample i
N = Σni.
If we take the average rank in each sample and measure its departure from the
mean of all the ranks, (N + 1)/2, we get for sample i
Ri/ni − (N + 1)/2.
These quantities are analogous to x̄i − x̄ in the analysis of variance. For the
between-means sum of squares, we squared such quantities and multiplied by ni
to get ni(x̄i − x̄)², and then we summed them. Here we get
D = Σ ni[Ri/ni − (N + 1)/2]², summed over i = 1, …, C,
where C is the number of groups. We use H = 12D/[N(N + 1)] as a measure of
departure from equality when we have no tied ranks.
When we deal with random samples and the ni are not too small, the
distribution of H is approximately chi-square with C − 1 d.f. For purposes of
calculation, we can use the following form:
H = [12/(N(N + 1))] Σ (Ri²/ni) − 3(N + 1).
Large values of H cast doubt on the hypothesis that the C distributions have
much the same positions. Very small values mean that the samples agree too
well. This latter situation often arises from mistakes in arithmetic.
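The computational form of H can be written as a short function; the toy ranks below are our own illustration (not from the text), chosen so the three groups are perfectly separated:

```python
# Kruskal-Wallis H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), no ties assumed.
def kruskal_wallis_h(rank_groups):
    """rank_groups: one list of pooled ranks per sample."""
    N = sum(len(g) for g in rank_groups)
    s = sum(sum(g)**2 / len(g) for g in rank_groups)
    return 12 * s / (N * (N + 1)) - 3 * (N + 1)

# Three groups of three whose ranks do not overlap at all: a large H.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h, 2))   # 7.2, referred to chi-square with C - 1 = 2 d.f.
```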
EXAMPLE 9 Graduate admission rates. Table 16-4 gives percentages of
admissions to the Graduate School of Arts and Sciences at Harvard University in
1966-1967 for the 11 departments receiving the most applications, grouped by
division. Use the Kruskal-Wallis test to decide if there is substantial variation
among the three divisions of the graduate school in admission practices.
TABLE 16-4
Graduate admission rates to Graduate School of Arts and Sciences, Harvard University, 1966-1967
SOLUTION. The sample sizes and ranks are
Because C = 3, we are dealing with chi-square with 2 degrees of freedom.
Interpolating in Table A-3 in Appendix III in the back of the book, we find for
χ² = 2.6 that P = 0.3, and so we have little evidence for different locations
among the three divisions. Observe by eye that one department has a rather
higher rate than the rest.
Ties. The data for multiple samples frequently are afflicted with ties. We
assign to each observation the mean of the ranks of the observations tied for that
value, as we did for the Mann-Whitney-Wilcoxon test in Section 16-3. The mean
of the ranks of those tied values is either an integer or a half-integer. In addition,
to get H we need a simple correction. Let W = w³ − w, where w is the number of
tied observations at a given level, as for the Mann-Whitney-Wilcoxon test with
ties. Then we correct H by dividing it by the divisor for ties:
1 − ΣW/(N³ − N),
where ΣW is the sum of W over all the tied sets. Thus we get the following:
H* = H/[1 − ΣW/(N³ − N)].
EXAMPLE 10 Ties. In Table 16-5 we analyze some artificial data constructed to
display the problem of ties. The table contains five 0’s, which are allotted ranks
1, 2, 3, 4, 5; their average rank is 3. It has three 1’s, allotted ranks 6, 7, 8, or
average rank 7. Three 2’s at 9, 10, 11 have average rank 10. Now, for w³ − w, we
have, from Table 16-3, for five ties 120 and for three ties 24, and so the sum ΣW
= 120 + 24 + 24 = 168. We have three groups, and so C = 3. The probability of a
larger value of H* is about 0.66, and so we have no evidence of substantial
difference among the locations of the populations. Note that the huge
observation 88 in treatment 1 ultimately contributed little more than the
observation 15. The ranks are resistant. Note also that the correction for ties
changed matters very little (about 3 percent).
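The tie correction divides H by 1 − ΣW/(N³ − N). The sketch below applies it to illustrative numbers (the H and N here are ours, not Table 16-5's, which is not reproduced in this text):

```python
# Correct H for ties: H* = H / (1 - sum(W)/(N^3 - N)), W = w^3 - w per tied set.
def tie_corrected_h(h, tie_sizes, N):
    sum_w = sum(w**3 - w for w in tie_sizes)
    return h / (1 - sum_w / (N**3 - N))

# Hypothetical case: H = 5.0 from N = 11 pooled ranks with tied sets of sizes
# 5, 3, and 3, so sum(W) = 120 + 24 + 24 = 168 and the divisor is 1 - 168/1320.
h_star = tie_corrected_h(5.0, [5, 3, 3], 11)
print(round(h_star, 3))
```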
TABLE 16-5
Example of the H statistic for ranked analysis of variance
Source: F. Mosteller and R. R. Bush (1954). Selected quantitative techniques. In Handbook of Social
Psychology, edited by G. Lindzey. Addison-Wesley, Reading, Mass., p. 320. © 1954. Reprinted with
permission.
In a large problem with many ties, it can be a nagging backache to get the
rankings all correct. We have found that a stem-and-leaf diagram with the leaves
ordered and widely spaced helps a great deal. We can mark on it both the correct
ranks to assign and the cumulative rank. Table 16-6 shows the layout applied to
the data of Table 15-1 for the state-by-state data on divorced males broken down
by region of the country. After the layout is made, the ranks are transferred back
to be associated with the appropriate states and regions. Table 16-7 shows the
final layout. A high-speed computer may have a program for ordering the
measurements.
TABLE 16-6
Stem-and-leaf diagram to aid the ranking and keep track of ties applied to data of Table 15-1 on
divorced males. The middle row for each stem gives the ordered leaves of the stem and leaf; the top
row gives the rank, counting from the smallest number left to right; the bottom row for each stem
indicates the ties and the average rank to be assigned these ties.
TABLE 16-7
States with their rankings of percentages of divorced men sorted by region of the country
PROBLEMS FOR SECTION 16-4
1. Given six independent samples and no ties, with H = 9.2, find
approximately P(H ≥ 9.2) if the samples are randomly drawn from
identically distributed populations.
2. Three chemical sprays for killing flies are tested, and the percentages of
kills are recorded as follows:
Compute H. What do you conclude about the differences among the
brands?
3. Suppose that we have Scholastic Aptitude Test scores for three groups of
students:
Assuming that these groups represent three independent samples, test these
samples to see if they could reasonably have come from distributions with
the same location.
4. To test four different diets, 24 young turkeys were randomly divided into 4
groups of 6 each, and each group was fed a different diet. At the end of the
experiment, the gain in weight for each turkey was recorded. The results
were as follows:
Compute H and the probability of a larger value of H for these data if all
four diets produced the same distribution of gains.
5. Davies and Sears (1955) gave data on fuel-oil consumption for 5 buses
during a 6-month period. Disregarding possible differences between
months, and thereby treating the measurements as independent, compute
the value of the Kruskal-Wallis statistic H and its statistical significance.
What conclusion do you draw about differences between buses in fuel
consumption? Fuel consumption, gallons per 1000 miles:
6. Reading scores. The average reading test scores reported for grade 7 by
schools in six Manhattan districts for 1969 are tabulated below. (The
national norm for grade 7 is 7.7 for this test. The low scores are said to be
partly due to a long teachers’ strike.)
Use the Kruskal-Wallis test to assess differences among districts 1 through
6. What conclusion do you draw?
7. In an analysis of the behavior of stocks of various companies, an economist
estimated the percentage of total variance of stock prices contributed by
the behavior of the market. Use the Kruskal-Wallis statistic to test whether
there is industry-to-industry variation.
8. Why is the mean rank for a group taken as (N + 1)/2?
* 9. Prove that Σni[Ri/ni − (N + 1)/2] = 0, where ni is the number of
measurements in sample i, N = Σni, and Ri is the sum of the ranks of the ni
measurements in sample i.
10. For Table 16-7 on rates of divorced men by regions of the country, the
Kruskal-Wallis approach is especially attractive for two reasons. First, we
can include the outlier state without having it ruin the analysis. Second, the
idea of random rearrangements on which the Kruskal-Wallis test is based is
exactly the kind of null hypothesis appropriate to the discussion. Carry out
the analysis and interpret the result. The sums, and region sizes and ties are
summarized as follows:
16-5 SUMMARY OF CHAPTER 16
Formula numbers cited in these items correspond to those used earlier in this
chapter.
1. Nonparametric methods based on signs, counts, and ranks offer alternative
analyses to those based on assumptions about the shapes of distributions.
2. Nonparametric methods are resistant to outliers and robust in that they lose
only a modest amount of information if used when ideal assumptions hold.
3. The sign test can be used to test differences to see if pluses and minuses split
about 50-50 or more extremely. Binomial tables or the normal approximation
can provide the probabilities.
4. For robust two-sample tests of location, we use the Mann-Whitney-Wilcoxon
approach:
a) Take two samples of sizes n and m (n ≤ m).
b) Pool the measurements and rank them from least to greatest.
c) Compute
t = sum of the ranks
for the sample of size n.
d) When n and m are small, we use Table A-6 in Appendix III in the back of
the book.
5. The mean and variance for the sum, T, when the samples come from
identical distributions are
μT = n(n + m + 1)/2 and σT² = nm(n + m + 1)/12.
6. For large samples, we use a normal approximation for the distribution of T.
We compute
z = (T − μT)/σT
and refer z to the standard normal table. See Section 16-3 of the text for ties.
7. To handle the one-way analysis of variance for C groups robustly, we use the
Kruskal-Wallis statistic:
H = [12/(N(N + 1))] Σ (Ri²/ni) − 3(N + 1),
where Ri is the sum of the ranks for the ith group, whose sample is of size ni,
and N = Σni.
8. When the ni are not too small, the distribution of H is approximately chi-
square, with C – 1 d.f.
9. When ties occur in the ranks for the Kruskal-Wallis test, we compute ΣW,
where W = w³ − w, and w is the number of tied observations at a given level.
Then we use the corrected statistic
H* = H/[1 − ΣW/(N³ − N)].
SUMMARY PROBLEMS FOR CHAPTER 16
1. Why do we sometimes prefer nonparametric to parametric methods?
2. What can we use in place of the original measurements when we employ
nonparametric methods?
3. The two-sample t test and its extension to the one-way analysis of variance
have what counterparts in this chapter?
4. In using the sign test, we are assessing the value of a parameter even though
the test is called nonparametric. What is the parameter we study?
5. Although in the sign test we have usually taken as the null hypothesis p = ½,
other values are sometimes useful. In a medical investigation, suppose that in
the past the probability of improvement has been 0.3. For a new treatment, 7
out of 10 improve. Is this result a significant improvement at the 5 percent
level?
6. In the Kruskal-Wallis test, suppose there are two groups of observations, and
n1 ≤ n2, with no ties. Then for each of n1, n2, N, R1, R2, give the
corresponding notations from the Mann-Whitney-Wilcoxon two-sample test.
(You may use the fact that the sum of all the ranks is N(N + 1)/2.)
7. When the Kruskal-Wallis test is applied to two groups without ties, the value
of H is exactly the square of
z = (T − μT)/σT
for the Mann-Whitney-Wilcoxon two-sample test. Check this numerically for
the special case where the samples are of sizes n = 2 and m = 4, and the
corresponding ranks are for the n sample 1, 3, and for the m sample 2, 4, 5, 6.
REFERENCES
A. P. Davies and A. W. Sears (1955). Some makeshift methods of analysis applied to complex experimental
results. Applied Statistics 4:48.
F. Mosteller and R. E. K. Rourke (1973). Sturdy Statistics. Addison-Wesley, Reading, Mass.
D. P. Phillips (1978). Deathday and birthday: an unexpected connection. In Statistics: A Guide to the
Unknown, second edition, edited by J. M. Tanur, F. Mosteller, W. H. Kruskal, R. F. Link, R. S. Pieters,
G. R. Rising, and E. L. Lehmann (special editor). Holden-Day, Inc., San Francisco, pp. 71-85.
17
Ideas of Experimentation
Learning Objectives
1. Distinguishing an experiment from an observational study
2. Devices for strengthening experiments
3. Troubles that weaken experiments
4. The four principal one- and two-group designs
5. Linking blocking in an experiment to two-way analysis of variance
17-1 ILLUSTRATION OF EXPERIMENTS
In many simple situations we can predict accurately the effect of an action or a
change. For example, hit a glass jar with a hammer and it will shatter, or stir a
little salt into water and it will dissolve. We are confident of these and many
other outcomes because of our previous experiences, or because the events are
simple enough that a well-established theory tells us what to expect.
In more complicated situations we may require careful definitions of the
events and elaborate experiments to discover the direction and magnitude of the
effects. To illustrate such questions: Do people who smile at others get more
smiles in return than those who don’t? If we increase the money supply, will we
create increased jobs and inflation? Does washing wounds with alcohol instead
of water reduce the frequency of infections? These are not questions that we plan
to answer here, but rather questions whose answers might require investigations
of the form we call experiments. We may each have our opinions about the
outcomes of such studies, but it is something else again to define the problems
carefully and gather convincing data about them. The questions that data
gathering helps us answer are these: What do we believe? What is the evidence
for the belief?
In this chapter we give the simplest forms of controlled experiments, and we
describe some of the devices and precautions that have been developed through
the years to add to their strength or to defend them from threats to their validity.
Let us begin with two dietary examples.
EXAMPLE 1 Daniel and the diets. Although the Daniel of the Bible is famous
for returning from the lions’ den, he also designed a very early dietary
experiment. Daniel and his young noble friends were hostages in
Nebuchadnezzar’s palace and were being treated well with the Babylonian
king’s diet of wine and rich food. Daniel complained that he wanted to be fed the
kind of food he ate at home, primarily vegetables. The man in charge of the
hostages feared for his own life if, after he gave Daniel and his friends the kind
of food they asked for, they wound up in poorer condition than the others. In one
translation of the Bible, Daniel said, “Test your servants [Daniel and his three
friends] for ten days; let us be given vegetables to eat and water to drink. Then
let our appearance and the appearance of the youths [the other hostages] who eat
the king’s rich food be observed by you, and according to what you see, deal
with your servants [Daniel and his three friends].” It turned out that at the end of
10 days they looked better to the man in charge than those who had the rich
foods, and so Daniel and his friends were allowed to eat their own diet.
Discussion. A number of issues will have occurred to the reader:
1. Maybe Daniel and his friends were in better shape than the other young
men to start with. More generally, we ask: Were the conditions comparable
initially?
We would call Daniel’s group the experimental group, because from the point
of view of the caretaker, the diet Daniel wanted was new and untested. The other
group was the control group; in this instance that group had the same treatment
as before, sometimes called “the standard treatment,” meaning “what is
ordinarily done.” We usually want the control and experimental groups to be as
alike as possible initially.
2. Maybe Daniel’s group exercised more. So we ask: Was the treatment
really the vegetable and water diet, or were other things at work? This matters if
we base future actions on the effect of the diet.
Following an experiment, often someone claims that the treatment said to
cause the effect is not the real cause, because some other variable has changed as
well as the treatment. Then new experiments are required. Redoing experiments
in psychology, biology, chemistry, and the other quantitative sciences is
commonplace, because new variables are always being discovered. The idea that
experiments are done once and that the matter is then settled is an
oversimplification.
3. Is 10 days a reasonable length of time for such a diet to show its effect?
More generally, have we chosen the experimental conditions properly?
4. Was the experiment of sufficient size? Daniel’s friends, called Shadrach,
Meshach, and Abednego by the Babylonians, made only four in the Judean
sample. We are not told how large the other group was.
5. Do we have a good measure of health? More generally, do we have outcome
measures for the variables we want measured? This is a grave stumbling block in
many investigations.
EXAMPLE 2 James Lind on scurvy. When sailors were long at sea before 1800
A.D., the disease called scurvy slew them by the thousands. In the British Navy,
more sailors died from scurvy than from all other causes, including battle. When
Anson sailed around the world, 1740-1744, he lost about four-fifths of his 961
sailors to death from scurvy.
The experiment. In 1747, James Lind, a physician in the British Navy on
board H.M.S. Salisbury, chose 12 men with scurvy and assigned the following
six treatments to pairs of sailors:
1 quart of cider per day
25 drops of elixir vitriol three times per day
2 teaspoons of vinegar three times per day
1/2 pint of seawater per day
2 oranges and 1 lemon per day
mixture of selected herbs
Results. The outcome was that in 6 days the two sailors treated with oranges
and lemons were back on duty. The cider seemed to help the patients somewhat,
but the other treatments did nothing. The conclusion was that oranges and
lemons provide dramatic relief, because before treatment the sailors could hardly
move.
How can the result for two sailors be so compelling? When we have a lot of
experience with a situation and know how it will continue if left alone, we are
impressed when it changes following a treatment. In Lind’s case, physicians were
convinced from long experience and from the reports of others that unless some
help arrived, the scurvy patient would continue to deteriorate.
Design. Lind’s experiment had six treatments, one applied to each of two
sailors. The large effect of the citrus fruit would be called a “slam-bang effect.”
The body responds remarkably well to a resupply of vitamin C. This strong
experimental result, plus extensive historical reading, led Lind to recommend
that the British Navy carry citrus juices on warships. It took the navy 50 years to
implement Lind’s recommendation, and in the meantime thousands of men were
lost each year. Once the regulation of issuing citrus juice was established, scurvy
disappeared from the Royal Navy.
A previous recommendation. Lest it be supposed that 50 years is a long time
for public policy to change when firm information is available, we should note
that Sir Richard Hawkins had recommended the use of citrus fruits in 1593,
about 150 years before Lind.
A previous controlled study. In 1601, Captain James Lancaster of the East
India Company performed an experiment, giving 3 teaspoons of lemon juice
daily to each of the 202 men on his ship, but none to the 108, 88, and 82 men on
the other three ships of his fleet bound for India. Essentially, nobody on his ship
got scurvy, and practically all the sailors on the other three ships did, and 105 of
278 of them died from it on the trip.
Comments. Lind had six experimental groups, and we are not told of any
treatment being regarded by him as not being expected to have any effect
(placebo). Thus, any five of these groups can act as controls for the sixth. In
addition, all his patients were on the same ship.
Lancaster’s experiment, although controlled, did not have the same ship for
all the men. Thus, if someone claims there was something special about the
flagship that prevented the disease, we have trouble making a convincing case
for the orange and lemon juice. The experiment would be a bit stronger (a) if
some sailors on each ship had the treatment and some did not or (b) if we had
more ships, some employing lemon juice and some not.
In either case (Lind or Lancaster), we wish that the choice of individuals for
the treatment group could have been assured to be independent of the
characteristics of the sailors. We would not want the brawny ones getting the
juice and the sickly ones nothing, or vice versa.
PROBLEMS FOR SECTION 17-1
1. Regarding the Biblical example: (a) Who were the members of the control
group, and who were the members of the experimental group? (b) What was
the experimental treatment? What was the control treatment? (c) What bad