or percentile levels across tests because the scores were obtained from different groups. Being in the 84th percentile on the SAT does not correspond to being in the 84th percentile on an IQ test.

Measures of Relationship

Measures of relationship indicate the degree to which two sets of scores are related. Correlational research, discussed in detail in Chapter 8, focuses on measures of relationship—it involves collecting data to determine whether and to what degree a relation exists between two or more quantifiable variables. This degree of relation is expressed as a correlation coefficient and is computed using two sets of scores from a single group of participants. If two variables are highly related, a correlation coefficient near +1.00 or −1.00 will be obtained; if two variables are not related, a coefficient near 0.00 will be obtained.

As stressed earlier, when interpreting measures of relationship, researchers must be careful not to make inappropriate causal assumptions. Unfortunately, this error occurs frequently, not only in research but also in the popular press. For example, National Public Radio recently reported that a well-meaning organization found, after running a correlational study, that the number of trees in a neighborhood was inversely correlated with the number of crimes in that neighborhood. That is, neighborhoods with more trees had lower crime rates. The organization concluded, therefore, that planting trees will lower the crime rate. Although number of trees is a malleable variable (we can always plant more trees), it is not likely the solution to crime rates; the assumption that planting trees will reduce the crime rate is seriously flawed. Lack of trees in crime-ridden neighborhoods is not really the problem; it is a symptom—neighborhoods with more trees are usually in nicer sections of town. It is equally easy to assume causal relations in educational research—erroneously—and thus researchers must carefully guard against it.

A number of different statistical methods can be used to test for relationships; which one is appropriate depends on the scale of measurement represented by the data. The two most frequently used correlational analyses are the product moment correlation coefficient, the Pearson r, and the rank difference correlation coefficient, usually referred to as the Spearman rho.

The Pearson r

The Pearson r correlation coefficient is the most appropriate measure when the variables to be correlated are expressed as either interval or ratio data. Similar to the mean and the standard deviation, the Pearson r takes into account every score in both distributions; it is also the most stable measure of correlation. In educational research, most measures represent interval scales, so the Pearson r is the coefficient most frequently used for testing for relations. An assumption associated with the application of the Pearson r is that the relation between the variables is a linear one. If the relation is not linear, the Pearson r will not yield a valid indication of the relation.

Although the computational formula is somewhat complex at first glance, the Pearson r is basically a series of mathematical calculations that consider the relative association of each person, test, or occurrence (e.g., X is the person's score for the first test, and Y is that person's score for the second test).

$$ r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{N}}{\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{N}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{N}\right]}} $$

If you look closely at the formula for Pearson r, you will see that you already are familiar with all the pieces of the formula except one, ΣXY, which is a symbol for the sum of the product of each person's X and Y scores (e.g., a student's first test score multiplied by the student's second score). Obviously, if we have a large dataset, multiplying and summing every score by hand is quite tedious. Fortunately, the computer arrives at the same value in much less time.
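The computation is easy to express in a few lines of code. The following Python sketch is ours, not part of the text's SPSS workflow, and the two score lists are invented for illustration; it simply applies the formula term by term.

```python
import math

# Hypothetical paired scores (X = each person's first test score,
# Y = the same person's second test score).
x = [85, 90, 78, 92, 88, 75, 95, 82]
y = [80, 94, 70, 88, 90, 72, 96, 78]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)             # sum of squared X scores
sum_y2 = sum(v * v for v in y)             # sum of squared Y scores
sum_xy = sum(a * b for a, b in zip(x, y))  # sum of the X*Y cross-products

numerator = sum_xy - (sum_x * sum_y) / n
denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
print(f"Pearson r = {numerator / denominator:.3f}")
```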
For a concrete example, we return to our analyses of the outcomes in Mrs. Alvarez's third-grade class at Pinecrest Elementary. First, we want to find the association or correlation between the Fall Reading and Fall Math scores for the students. We expect from our experience that Reading scores and Math scores would likely be correlated; that is, students who score well on reading will also do well on math and vice versa.

Computing the correlation using SPSS is relatively simple. We select the following options, as illustrated in Appendix B, Figure B.12.5:

Analyze
Correlate
Bi-Variate
Having selected the appropriate SPSS options, we then select the variables we want to analyze, as shown in Figure B.12.6: Fall Reading (ReadF) and Fall Math (MathF). SPSS then produces the output shown in Table 12.8. This output displays a 2 × 2 matrix with four cells. The first cell in the matrix shows the correlation coefficient of ReadF by itself: ReadF × ReadF. This is a perfect correlation (1), of course, as the table shows. If we read across the matrix to the second cell, horizontally, we find the correlation between ReadF and MathF = 0.618 with a significance level of 0.001. The third cell provides the same correlation between MathF and ReadF, r = 0.618, only MathF is listed first. The final cell is MathF by MathF, r = 1, which is a perfect correlation against itself, of course. The two cells that show the perfect correlation of 1 (i.e., the first and fourth) are called the diagonal of the matrix. The other two cells in the matrix are called the off-diagonal cells. Because there are only two variables in this analysis, both off-diagonal cells show the same correlation coefficient. Although we need only one coefficient value (0.618), SPSS prints the complete information for all the cells. Having all this information in the matrix becomes more important when we are correlating more than two variables and the matrix grows to 3 × 3, 4 × 4, or larger. In large datasets we often compute correlations for all ratio variables in preparation for conducting multiple regression.

TABLE 12.8 • SPSS output of correlation: Fall Reading (ReadF) by Fall Math (MathF)

Correlations
                               ReadF       MathF
ReadF   Pearson Correlation    1           .618(**)
        Sig. (2-tailed)                    .001
        N                      25          25
MathF   Pearson Correlation    .618(**)    1
        Sig. (2-tailed)        .001
        N                      25          25

**Correlation is significant at the 0.01 level (2-tailed).
Source: Graphing Statistics and Data: Creating Better Charts by Wallgren, Anders. Copyright 1996. Reproduction with permission of Sage Publications Inc., Books in the format Textbooks via Copyright Clearance Center.
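Readers who work outside SPSS can produce the same matrix with other tools. As a hedged illustration, here is a minimal pandas sketch; the data values are hypothetical, and the column names simply mirror the ReadF and MathF variables discussed above.

```python
import pandas as pd

# Hypothetical scores standing in for the Pinecrest file.
scores = pd.DataFrame({
    "ReadF": [52, 61, 48, 70, 66, 55, 59, 73, 64, 50],
    "MathF": [50, 65, 45, 68, 70, 52, 60, 75, 62, 49],
})

# corr() returns the full symmetric matrix: 1s on the diagonal and the
# ReadF-by-MathF coefficient repeated in both off-diagonal cells.
print(scores.corr(method="pearson"))
```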
How do we interpret this finding of a correlation coefficient of r = 0.618 with a significance value of 0.001? Is this good? Does it represent a true relation? Is this correlation coefficient significantly different from 0.00? If you recall the earlier discussion, you know that a correlation coefficient of 0.618 probably indicates a strong relation between the variables. Moreover, even with our small sample size of 25, r = 0.618 is significant at 0.001. In other words, the likelihood of finding a correlation this large, simply by chance, is less than one in 1,000. We can feel confident that there is a relation between the fall reading and math scores; Mrs. Alvarez's students have started the school year with fairly equivalent skills in reading and math. It will be important for us at the end of the year to test the correlation again; that is, to see whether the students' reading and math scores have improved at the same rates.

The ease of using SPSS to determine the correlation and level of significance is quite obvious, but remember that if we were making this calculation by hand, rather than using SPSS or other statistical software, we should arrive at the same Pearson r of 0.618. SPSS follows the same procedures you would to compute the Pearson r (i.e., summing the products of the X and Y scores, subtracting the sum of X times the sum of Y divided by N to form the numerator, and then dividing by the complex calculation in the denominator). Once the Pearson r is determined, SPSS calculates the significance of this r based on the sample size and degrees of freedom and prints exactly the level of significance for those parameters. Our finding of a significance level of 0.001 in Table 12.8 assures us the correlation we found is not very likely due to chance.

Whereas SPSS determines the level of significance automatically, when conducting a correlation analysis by hand we have to consult a table of correlation coefficients, taking into account the number of students or participants. The effect of the number of students on the level of significance is accounted for through the degrees of freedom, as included in Table A.2 in Appendix A. For the Pearson r, degrees of freedom are always computed by the formula N − 2, with N representing the number of participants for whom we have paired data (i.e., X and Y). Thus, for our example with Pinecrest, degrees of freedom (df) = N − 2 = 25 − 2 = 23. With a level of significance set at 0.05, we read down the column labeled 0.05 in Table A.2 to the lines that correspond with df values of 20 and 25 (i.e., in the first column). The tabled value in Table A.2 is a critical correlation coefficient that we use as a benchmark for comparison with our sample correlation coefficient.
Although our df value of 23 falls between 20 and 25, we find that our Pearson r of 0.618 is significant whether we use df = 25 (critical value = 0.3809) or df = 20 (critical value = 0.4227), because 0.618 exceeds both benchmarks. Note also that if our Pearson r were negative in our example (r = −.618), we would still consult Table A.2 in the same way because the direction of the relation (i.e., positive or negative) does not affect the level of significance.
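Statistical software makes the table lookup unnecessary because it computes an exact probability. Here is a minimal SciPy sketch; the scores are randomly generated stand-ins for 25 paired observations, not Mrs. Alvarez's actual data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical paired scores for 25 students (correlated by construction).
read_f = rng.normal(60, 10, 25)
math_f = 0.6 * read_f + rng.normal(0, 8, 25)

r, p = stats.pearsonr(read_f, math_f)
df = len(read_f) - 2                      # degrees of freedom = N - 2 = 23
print(f"r = {r:.3f}, df = {df}, two-tailed p = {p:.4f}")

# The decision rule from the text: significant at alpha = .05 when p < .05.
alpha = 0.05
print("significant" if p < alpha else "not significant")
```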
The Spearman Rho. Because correlational data are not always measured on an interval scale, we can use the Spearman rho coefficient to correlate ranked or ordinal data. Other measures for ordinal data include Gamma, Kendall's tau, and Somers' d, but Spearman's rho is among the most popular. When using Spearman's rho, both variables to be correlated must be ranked. For example, if intelligence were to be correlated with class rank or economic status, students' intelligence scores would have to be translated into ranks (e.g., low, medium, high). Spearman's rho has a weakness, however, when more than one individual receives the same score—there is a tie in the ranking. In these cases the corresponding ranks are averaged. For example, two participants with the same highest score would each be assigned Rank 1.5, the average of Rank 1 and Rank 2. The next highest score would be assigned Rank 3. Similarly, the 24th and 25th highest scores, if identical, would each be assigned the rank 24.5. As with most other correlation coefficients, the Spearman rho produces a coefficient somewhere between −1.00 and +1.00. If, for example, a group of participants achieved identical ranks on both variables, the coefficient would be +1.00.
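Rank-averaging for ties is tedious by hand but automatic in most software. The small SciPy sketch below, with invented ratings from two hypothetical judges, shows both the averaged ranks and the resulting coefficient.

```python
from scipy import stats

# Hypothetical ordinal data: the same ten students rated by two judges.
# Note the tied scores; tied observations receive the average of the
# ranks they occupy, exactly as described above.
judge1 = [90, 90, 85, 80, 78, 78, 78, 70, 65, 60]
judge2 = [88, 92, 80, 80, 75, 70, 72, 68, 66, 55]

rho, p = stats.spearmanr(judge1, judge2)
print(f"Spearman rho = {rho:.3f}, two-tailed p = {p:.4f}")

# The averaged-rank rule itself: the two tied top scores in judge1
# each receive rank 1.5 (scores are negated so rank 1 = highest score).
print(stats.rankdata([-s for s in judge1]))
```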
GRAPHING DATA

A distinct benefit of statistical software packages is that they provide a variety of ways to present the data in graphic form. For example, Excel and similar spreadsheet programs offer a very convenient way to graph data with commonly used pie charts and line graphs. Because the shape of the distribution may not be self-evident, especially if a large number of scores are involved, it is always helpful to provide a graphic representation of the data, and in some cases (e.g., a curvilinear relation among variables) the shape of the distribution may influence the researcher's choice of descriptive statistics.

Graphic displays of data range from simple bar graphs to more complex frequency polygons that display the shape of the data, as in a normal distribution or bell-shaped curve. Figure 12.3 shows a simple bar graph created in Excel for the distribution of Pinecrest students, organized by gender and economic level.
The graph was created first by making a pivot table (refer to the explanation, Figure B.12.3 in Appendix B) and then selecting "Insert" from the Excel menu bar and choosing "Chart." Gender is on the Y axis (i.e., vertical), and economic level is on the X axis (i.e., horizontal). As the graph shows, six boys are at the low economic level, four boys are at the medium level, and so forth. Note, however, that although three-dimensional graphs provide an effective illustration of data for an audience, the exact values are not always easy to identify from them. If precision is the goal, then a pivot table is likely the preferable choice for displaying the data.

[FIGURE 12.3 • Bar chart: Gender by economic level. Gender: Boys = 1, Girls = 2; Economic Levels: 1 = Low, 2 = Medium, 3 = High.]

Another typically used method of graphing data is to construct a frequency polygon. A polygon plots the frequency of each score value on the graph and then connects the dots with a line. Figure 12.4 shows a frequency polygon based on 85 achievement test scores of Pinecrest students (the data are shown in Table 12.6), created in Excel by selecting "Insert" and then using "Chart" (see Appendix Figure B.12.4). Various options are then used to create the polygon. Twelve Pinecrest students scored 85 on the achievement test, 10 scored 86, nine scored 84, and so forth. Most of the scores, as we would expect, group around the middle, with progressively fewer students achieving the higher or lower scores. In other words, the scores appear to form a relatively normal distribution.

Whereas the frequency polygon graphs the number of occurrences of each score from 78 to 91, the accompanying pie chart in Figure 12.4, also created in Excel, groups the data into five categories. The pie chart provides the percentage of scores in each of these categories. For example, one slice of the pie or category represents the 12% of the scores falling in the 80 and less range. The next slice displays the 27% of scores found in the 81–83 range, and so forth.

[FIGURE 12.4 • Frequency polygon and pie chart based on 85 Pinecrest students' achievement test scores. Pie chart: 80 & less = 12%, 81–83 = 27%, 84–86 = 36%, 87–89 = 19%, 90 & higher = 6%.]
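The same two displays can be produced outside Excel. Here is a hedged matplotlib sketch; the frequency counts are approximate values patterned on the description above (they sum to 85), and the pie percentages are taken from Figure 12.4.

```python
import matplotlib.pyplot as plt

# Approximate frequency counts for scores 78-91, patterned on the text
# (nine students scored 84, twelve scored 85, ten scored 86, etc.).
scores = list(range(78, 92))
freqs = [1, 2, 3, 6, 8, 8, 9, 12, 10, 8, 8, 5, 3, 2]

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

# Frequency polygon: plot each frequency and connect the dots with a line.
left.plot(scores, freqs, marker="o")
left.set(title="Frequency Polygon", xlabel="Score", ylabel="Frequency")

# Pie chart: the same data grouped into the five bands used in the text.
bands = ["80 & less", "81-83", "84-86", "87-89", "90 & higher"]
sizes = [12, 27, 36, 19, 6]  # percentages from Figure 12.4
right.pie(sizes, labels=bands, autopct="%1.0f%%")
right.set_title("Pie Chart")

plt.tight_layout()
plt.show()
```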
There are many other approaches and methods for displaying not only frequency data but also measures of central tendency and variance; graphs showing group means and standard deviations are among the most common. Excel, SPSS, and other programs can create scatterplots, box plots, stem-and-leaf charts, and a number of graphical and tabular formats.¹ Examining a picture of the data can give some clues about which inferential statistics, discussed in the next chapter, are appropriate.

Postscript

Almost always in a research study, descriptive statistics such as the mean and standard deviation are computed separately for each group in the study. A correlation coefficient is usually computed only in a correlational study (unless it is used to compute the reliability of an instrument used in a causal–comparative or experimental study). Standard scores are rarely used in research studies. However, to test a hypothesis, we almost always need more than descriptive statistics; we need the application of one or more inferential statistics to test hypotheses and determine the significance of results. We discuss inferential statistics in the next chapter.

¹ Graphing Statistics and Data, by A. Wallgren, B. Wallgren, R. Persson, U. Jorner, and J. Haaland, 1996, Thousand Oaks, CA: Sage.
SUMMARY

The Language of Statistics

1. The formulas for statistical procedures are just basic mathematical procedures; statistical notation uses Greek letters as shorthand.

PREPARING DATA FOR ANALYSIS

2. The first step toward analysis involves converting behavioral responses into some numeric system or categorical organization.

3. All instruments should be scored accurately and consistently, using the same procedures and criteria. Scoring self-developed instruments is more complex than scoring standardized instruments, especially if open-ended items are involved.

Tabulation and Coding Procedures

4. Tabulation involves organizing the data systematically, such as by individual subject. If planned analyses involve subgroup comparisons, scores should be tabulated for each subgroup.

5. Following tabulation, the next step is to summarize the data using descriptive statistics.

TYPES OF DESCRIPTIVE STATISTICS

6. The values calculated for a sample drawn from a population are referred to as statistics. The values calculated for an entire population are referred to as parameters.

Frequencies

7. Frequency refers to the number of times something occurs; with descriptive statistics, frequency usually refers to the number of times each value of a variable occurs.

8. For nominal or ordinal variables, a frequency count by each value is very descriptive. Frequency is more complicated for interval or ratio variables; other measures of central tendency are preferred to describe them.

Measures of Central Tendency

9. Measures of central tendency are indices that represent a typical score among a group of scores. They provide a convenient way of describing a set of data with a single number.

10. The mean is the arithmetic average of the scores and is the most frequently used measure of central tendency. It is appropriate for describing interval or ratio data.

11. The median is the midpoint in a distribution; 50% of the scores are above the median, and 50% are below the median. The median is most useful when looking at ordinal variables or datasets in which the scores vary widely over the distribution.

12. The mode is the score that is attained by more subjects than any other score (i.e., occurs most frequently). A set of scores may have two (or more) modes. When nominal data are collected, the mode is the only appropriate measure of central tendency.

Deciding Among Mean, Median, and Mode

13. In general, the mean is the preferred measure of central tendency. When a group of test scores contains one or more extreme scores, the median is the best index of typical performance.

Measures of Variability

14. Two sets of data that are very different can have identical means or medians, thus creating a need for measures of variability, indices that indicate how spread out a group of scores are.

15. The range is simply the difference between the highest and lowest score in a distribution and is determined by subtraction. It is not a very stable measure of variability, but its chief advantage is that it gives a quick, rough estimate of variability.

16. The quartile deviation is one half of the difference between the upper quartile (the 75th percentile) and lower quartile (the 25th percentile) in a distribution. The quartile deviation is a more stable measure of variability than the range and is appropriate whenever the median is appropriate.
17. Variance is defined as the amount of spread among scores. If the variance is small, the scores are close together; if it is large, the scores are more spread out.

18. Standard deviation is the square root of the variance of a set of scores. It is the most stable measure of variability and takes into account every score.

The Normal Curve

19. A distribution with fewer people (or scores) at the extremes and more people in the middle is considered "normal." When plotted as a frequency graph, a normal distribution forms a bell shape, known as the normal curve.

20. If a variable is normally distributed, then 50% of the scores are above the mean, and 50% of the scores are below the mean. The mean, the median, and the mode are the same. Most scores are near the mean, and the farther from the mean a score is, the fewer the number of subjects who attained that score. For every normal distribution, 34.13% of the scores fall between the mean and one standard deviation above the mean, and 34.13% of the scores fall between the mean and one standard deviation below the mean. More than 99% of the scores will fall somewhere between three standard deviations above and three standard deviations below the mean.

21. Because research studies deal with a finite number of subjects, and often a not very large number, data from a sample can only approximate a normal curve.

Skewed Distributions

22. When a distribution is not normal, it is said to be skewed, and there are more extreme scores at one end than the other. If the extreme scores are at the lower end of the distribution, the distribution is said to be negatively skewed; if the extreme scores are at the upper, or higher, end of the distribution, the distribution is said to be positively skewed.

23. For a negatively skewed distribution, the mean (X̄) is always lower, or smaller, than the median (md); for a positively skewed distribution, the mean is always higher, or greater, than the median.

Measures of Relative Position

24. Measures of relative position indicate where a score falls in the distribution, relative to all other scores in the distribution. They make it possible to compare one person's performance on two or more different tests.

25. A percentile rank indicates the percentage of scores that fall at or below a given score. Percentiles are appropriate for data measured on an ordinal scale, although they are frequently computed for interval data. The median of a set of scores corresponds to the 50th percentile.

26. A standard score reflects how many standard deviations a student's score is above or below the mean. A z score directly expresses how far a score is from the mean in terms of standard deviation units. A score that is equivalent to the mean corresponds to a z score of 0. A score that is exactly one standard deviation above the mean corresponds to a z score of +1.00, and a z score of −1.00 is one standard deviation below the mean. If a set of scores is transformed into a set of z scores, the new distribution has a mean of 0 and a standard deviation of 1.

27. A T score (also called a Z score) is a z score transformed to eliminate pluses or minuses.

Measures of Relationship

28. Measures of relationship indicate the degree to which two sets of scores are related. Degree of relationship is expressed as a correlation coefficient, which is computed from two sets of scores from a single group of participants. If two variables are highly related, a correlation coefficient near +1.00 or −1.00 will be obtained; if two variables are not related, a coefficient near 0.00 will be obtained.

29. The Pearson r is the most appropriate measure of correlation when the sets of data to be correlated are expressed as either interval or ratio scales. The Pearson r is not valid if the relation between the variables is not linear.

30. The Spearman rho is the appropriate measure of correlation when the variables are expressed as ranks.
GRAPHING DATA

31. It is always helpful to provide a graphic representation of the data, and in some cases (e.g., a curvilinear relation among variables) the shape of the distribution may influence the researcher's choice of descriptive statistics.

32. The most common method of graphing research data is to construct a frequency polygon. Data can also be displayed in bar graphs, scatter plots, pie charts, and stem-and-leaf charts.

Calculation for Interval Data

Symbols

Symbols commonly used in statistical formulas are as follows:

X = any score
Σ = sum of; add them up
ΣX = the sum of all the scores
X̄ = the mean, or arithmetic average, of the scores
N = total number of subjects
n = number of subjects in a particular group

The Mean

The formula for the mean is

$$ \bar{X} = \frac{\sum X}{N} $$

The Standard Deviation

The formula for the standard deviation is

$$ SD = \sqrt{\frac{SS}{N - 1}}, \quad \text{where } SS = \sum X^2 - \frac{(\sum X)^2}{N} $$

Standard Scores

The formula for a z score is

$$ z = \frac{X - \bar{X}}{SD} $$

The formula for a Z score is Z = 10z + 50.

The Pearson r

The formula for the Pearson r is

$$ r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{N}}{\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{N}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{N}\right]}} $$
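For readers who prefer to check the formulas by computation, the following Python functions implement them directly; the final lines exercise the functions on a small made-up score list.

```python
import math

def mean(xs):
    """X-bar = (sum of X) / N."""
    return sum(xs) / len(xs)

def standard_deviation(xs):
    """SD = sqrt(SS / (N - 1)), with SS = sum X^2 - (sum X)^2 / N."""
    n = len(xs)
    ss = sum(x * x for x in xs) - sum(xs) ** 2 / n
    return math.sqrt(ss / (n - 1))

def z_score(x, xs):
    """z = (X - X-bar) / SD."""
    return (x - mean(xs)) / standard_deviation(xs)

def big_z(x, xs):
    """Z (T score) = 10z + 50."""
    return 10 * z_score(x, xs) + 50

def pearson_r(xs, ys):
    """The Pearson r computational formula shown above."""
    n = len(xs)
    num = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    den = math.sqrt((sum(x * x for x in xs) - sum(xs) ** 2 / n)
                    * (sum(y * y for y in ys) - sum(ys) ** 2 / n))
    return num / den

# A quick check on a small made-up score list.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(data), round(standard_deviation(data), 2), round(big_z(9, data), 1))
```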
CHAPTER THIRTEEN

Inferential Statistics

Gremlins, 1984

". . . inferential statistics help researchers to know whether they can generalize to a population of individuals based on information obtained from a limited number of research participants." (p. 341)
LEARNING OUTCOMES
After reading Chapter 13, you should be able to do the following:
1. Explain the concepts underlying inferential statistics.
2. Select among tests of significance and apply them to your study.
The chapter outcomes form the basis for the following task, which will require
you to write the results section of a research report.
TASK 7
For the same quantitative study you have been developing in Tasks 2–6, write the
results section of a research report. Specifically,
1. Generate data for each participant in your study.
2. Summarize and describe data using descriptive statistics, computed either by
hand or with the use of an appropriate statistical software package.
3. Analyze data using inferential statistics by hand or with the computer.
4. Interpret the results in terms of your original research hypothesis.
5. Present the results of your data analyses in a summary table.
CONCEPTS UNDERLYING INFERENTIAL STATISTICS
Inferential statistics are data analysis techniques for determining how likely it is
that results obtained from a sample or samples are the same results that would have
been obtained from the entire population. Put another way, inferential statistics are
used to make inferences about parameters, based on the statistics from a sample.
In the simplest language, whereas descriptive statistics show how often or how
frequent an event or score occurred, inferential statistics help researchers to know
whether they can generalize to a population of individuals based on information
obtained from a limited number of research participants.
As an example, imagine that Pinecrest Elementary School implemented an experimental third-grade reading curriculum and found that the students who used it scored significantly higher in reading comprehension than those who used the traditional curriculum (say, $\bar{X}_1 = 35$ and $\bar{X}_2 = 43$). Can this difference be generalized to the larger population or other samples within it? Would the program be equally successful at the district or state levels? Perhaps, although it's possible that the difference between the original two samples occurred just by chance (possibly due to characteristics of the particular individuals or classrooms sampled). And now we get to the heart of inferential statistics, the concept of "how likely is it?"
Inferential statistics allow researchers to determine the likelihood that the difference between the old mean ($\bar{X}_1$) and the new mean ($\bar{X}_2$) is a real, significant one, rather than one attributable to sampling error. Note that inferential statistics use data from samples to assess likelihood (i.e., inferential statistics produce probability statements about the populations), not guarantees. The degree to which the results of a sample can be generalized to a population is always expressed in terms of probabilities; analyses do not "prove" that hypotheses are true or false.

Understanding and using inferential statistics requires basic knowledge of a number of concepts that underlie the analytical techniques. These concepts are discussed in the following sections.

Standard Error

Inferences about populations are based on information from samples. The chance that any sample is exactly identical to its population is virtually nil, however. Even when random samples are used, we cannot expect that the sample characteristics will be exactly the same as those of the population. For example, we can randomly select five students from Mrs. Alvarez's class at Pinecrest Elementary and compute the mean of their fall reading scores; we can then randomly select five more students from the same population and compute the mean of their fall reading scores. It's very likely that the two sample means will be different from one another, and it's also likely that neither mean will be identical to the population mean (i.e., the mean for all students in Mrs. Alvarez's class). This expected variation among the means is called sampling error. Recall that sampling error is not the researcher's fault. Sampling error just happens and is as inevitable as taxes and educational research courses! Thus, if two sample means differ, the important question is whether the difference is just the result of sampling error or is a meaningful difference that would also be found in the larger population.

A useful characteristic of sampling errors is that they are usually normally distributed—sampling errors vary in size (i.e., in some comparisons, sampling error is small, whereas in others it is large), and these errors tend to form a normal, bell-shaped curve. In other words, if we randomly select 500 different (but same-sized) samples from a population and compute a mean for each sample, the means won't all be the same, but they will form a normal distribution around the population mean—a few will be a lot higher, a few will be a lot lower, and about 68% will be within one standard deviation of the mean for the whole population. It follows then that the mean of all these sample means will yield a good estimate of the population mean.¹

An example will help to clarify this concept. Suppose that we do not know the population mean IQ for the Stanford-Binet, Form L–M. To determine the population mean, we decide to randomly select 100 samples of the same size (e.g., each sample has scores from 25 participants; the size of a sample is represented by lowercase n) from the possible Stanford-Binet scores (i.e., the population of scores). Computing the mean for each of the samples yields the following 100 means:

64  82  87  94  98  100  104  108  114  121
67  83  88  95  98  101  104  109  115  122
68  83  88  96  98  101  105  109  116  123
70  84  89  96  98  101  105  110  116  124
71  84  90  96  98  102  105  110  117  125
72  84  90  97  99  102  106  111  117  127
74  84  91  97  99  102  106  111  118  130
75  85  92  97  99  103  107  112  119  131
75  86  93  97  100  103  107  112  119  136
78  86  94  97  100  103  108  113  120  142

Computing the mean of these sample means involves adding them and dividing by 100 (i.e., the number of means): 10,038/100 = 100.38. This estimate of the population mean is very good; in fact, despite what we said previously, we know that the population mean for this test is 100. Furthermore, 71 of the means from our 100 samples fall between 84 and 116. The standard deviation for the Stanford-Binet is 16, so these scores are ±1 SD from the mean. Ninety-six of the means fall between 68 and 132, or ±2 SD from the mean. Our distribution therefore approximates a normal curve quite well—the percentage of cases falling within each successive standard deviation is very close to the percentage depicted in Chapter 12, Figure 12.1 as characteristic of the normal curve (i.e., 71% as compared to 68%, and 96% as compared to 95%).

¹ To find the mean of the sample means, simply sum the sample means and divide by the number of means, as long as each sample is the same size.
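The arithmetic in this example can be verified directly. The short Python sketch below uses the 100 sample means listed above and reproduces the totals and counts reported in the text.

```python
# Verifying the arithmetic on the 100 sample means listed above.
sample_means = [
    64, 82, 87, 94, 98, 100, 104, 108, 114, 121,
    67, 83, 88, 95, 98, 101, 104, 109, 115, 122,
    68, 83, 88, 96, 98, 101, 105, 109, 116, 123,
    70, 84, 89, 96, 98, 101, 105, 110, 116, 124,
    71, 84, 90, 96, 98, 102, 105, 110, 117, 125,
    72, 84, 90, 97, 99, 102, 106, 111, 117, 127,
    74, 84, 91, 97, 99, 102, 106, 111, 118, 130,
    75, 85, 92, 97, 99, 103, 107, 112, 119, 131,
    75, 86, 93, 97, 100, 103, 107, 112, 119, 136,
    78, 86, 94, 97, 100, 103, 108, 113, 120, 142,
]

print(sum(sample_means))                             # 10038
print(sum(sample_means) / len(sample_means))         # 100.38

# Counts within +/-1 SD (84 to 116) and +/-2 SD (68 to 132) of 100.
print(sum(84 <= m <= 116 for m in sample_means))     # 71
print(sum(68 <= m <= 132 for m in sample_means))     # 96
```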
The concept illustrated by this example is a comforting one because we are rarely able to collect data from 100 samples and compute 100 means in our research. This example tells us, in essence, that most of the sample means we obtain will be close to the population mean and only a few will be very far away. In other words, once in a while, just by chance, we will get a sample that is quite different from the population, but not very often.

As with any normal distribution, a distribution of sample means has not only its own mean (i.e., the mean of the means) but also its own standard deviation (i.e., the difference between each sample mean and the mean of the means). The standard deviation of the sample means is usually called the standard error of the mean ($SE_{\bar{X}}$). The word error indicates that the various sample means making up the distribution contain some error in their estimate of the population mean. The standard error of the mean reflects how far, on average, any sample mean would differ from the population mean. According to the normal curve percentages (see Chapter 12, Figure 12.1), we can say that approximately 68% of the sample means will fall within one standard error on either side of the mean (remember, the standard error of the mean is a standard deviation), 95% will fall within ±2 standard errors, and 99+% will fall within ±3 standard errors. In other words, if the population mean is 60, and the standard error of the mean is 10, we can expect 68% of the sample means (i.e., means of the scores taken from each sample) to be between 50 and 70 (60 ± 10), 95% of the sample means to fall between 40 and 80 [60 ± 2(10)], and 99% of the sample means to fall between 30 and 90 [60 ± 3(10)]. Thus, in this example it is very likely that a sample mean would be 65, but a sample mean of 98 is highly unlikely because 99% of sample means fall between 30 and 90. Given a number of large, randomly selected samples, we can quite accurately estimate population parameters (i.e., the mean and standard deviation of the whole population) by computing the mean and standard deviation of the sample means. The smaller the standard error, the more accurate the sample means as estimators of the population mean.

It is not necessary to select a large number of samples from a population to estimate the standard error, however. The standard error of the mean can be estimated from the standard deviation of a single sample using this formula:

$$ SE_{\bar{X}} = \frac{SD}{\sqrt{N - 1}} $$

where

$SE_{\bar{X}}$ = the standard error of the mean
SD = the standard deviation for a sample
N = the sample size

Thus, if the SD of a sample is 12 and the sample size is 100,

$$ SE_{\bar{X}} = \frac{12}{\sqrt{100 - 1}} = \frac{12}{\sqrt{99}} = \frac{12}{9.95} = 1.21 $$

Using this estimate of the $SE_{\bar{X}}$, the sample mean $\bar{X}$, and the normal curve, we can estimate probable limits within which the population mean falls. These limits are referred to as confidence limits. For example, if a sample $\bar{X}$ is 80 and the $SE_{\bar{X}}$ is 1.00, the population mean falls between 79 and 81 ($\bar{X}$ ± 1 $SE_{\bar{X}}$) approximately 68% of the time, between 78 and 82 ($\bar{X}$ ± 2 $SE_{\bar{X}}$) approximately 95% of the time, and between 77 and 83 ($\bar{X}$ ± 3 $SE_{\bar{X}}$) approximately 99% of the time. In other words, the probability that the population mean is less than 78 or greater than 82 is only 5/100, or 5% (±2 SD), and the probability that the population mean is less than 77 or higher than 83 is only 1/100, or 1% (±3 SD). Note that as our degree of confidence increases, the limits get farther apart—we are 100% confident that the population mean is somewhere between our sample mean plus infinity and minus infinity.
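Both calculations translate directly into code. Here is a minimal Python sketch using the SD = 12, N = 100 example and the 80 ± SE confidence bands from the text.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean from one sample: SD / sqrt(N - 1)."""
    return sd / math.sqrt(n - 1)

# The worked example from the text: SD = 12, N = 100 -> SE = 1.21.
print(round(standard_error(12, 100), 2))

# Confidence limits around a sample mean of 80 with SE = 1.00,
# echoing the 68% / 95% / 99% bands described above.
sample_mean, se = 80, 1.00
for k, conf in [(1, "68%"), (2, "95%"), (3, "99%")]:
    print(f"{conf}: {sample_mean - k * se:.0f} to {sample_mean + k * se:.0f}")
```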
The major factor affecting our ability to estimate the standard error of the mean accurately is the size of the sample we use for the estimate. As sample size increases, the standard error of the mean decreases—a mean computed from data from the whole population would have no sampling error at all, and a large sample is more likely than a small sample to represent a population accurately. This discussion reinforces the idea that samples should be as large as practically possible; smaller samples include more error than larger samples.

Another factor affecting the estimate of the standard error of the mean is the size of the population standard deviation. If it is large, members of the population are very spread out on the variable of interest, and sample means will also be very spread out. Although researchers have no control over the size of the population standard deviation, they can control sample size to some extent.
Thus, researchers should make every effort to include as many participants as practical so that inferences about the population of interest will be as free of error as possible.

Our discussion thus far has been in reference to the standard error of a mean. However, estimates of standard error can also be computed for other statistics, such as measures of variability, relation, and relative position. An estimate of standard error can also be calculated for the difference between two or more means. Many statistics discussed later in this chapter are based on an estimate of standard error.

Hypothesis Testing

Hypothesis testing is a process of decision making in which researchers evaluate the results of a study against their original expectations. For example, suppose we decide to implement a new reading program at Pinecrest Elementary School. The research plan for our study includes a research hypothesis, predicting a difference in scores for children using the new program compared to those using the old program, and a null hypothesis, predicting that scores for the two groups will not differ. Following data collection, we compute means and standard deviations for each group and find that children using the new program had somewhat higher reading scores than children using the old program. Because our findings could help the district superintendent decide whether to invest thousands of dollars to implement the new reading curriculum, we need to determine whether we've identified a real difference in the programs or whether our results are simply due to sampling error. Of course, we want to be reasonably certain that the difference we found between the old and the new programs is a true or real difference caused by the new reading program (i.e., the independent variable) and did not occur by chance. In other words, we want to know if our research hypothesis was supported—if the groups are significantly different, we can reject the null hypothesis and conclude that the new program is more effective. In short, hypothesis testing is the process of determining whether to reject the null hypothesis (i.e., no meaningful differences, only those due to sampling error) in favor of the research hypothesis (i.e., the groups are meaningfully different; one treatment is more effective than another). Inferential statistics offer us useful evidence to make that decision.

The concept of rejecting the null hypothesis is a complex but important one. For example, suppose our inferential statistics suggest that the scores for the two groups are different enough that the difference is likely not due to sampling error. Because our null hypothesis was that the scores would not differ, we can reject it—the scores for the groups are different, so the null hypothesis (i.e., no difference) does not reflect the state of affairs, given this sample. Rejecting the null hypothesis can give us reasonable confidence (depending on the level of statistical significance) that the difference we have found is due to the new reading method and not other factors. Note, however, that although we can reject the null hypothesis, we can't accept the research hypothesis—we haven't proven that the new method is better; rather, we've found just one instance where the new method is better than the old. Even 1,000 tests of the new method are not enough to prove our hypothesis. Thus, we state that the research hypothesis was supported (for this sample), not that it was proven.

Similarly, suppose our inferential statistics suggest that the scores for the two groups are really not very different, that the apparent difference is likely due to sampling error. We can't conclude that the two methods for teaching reading are equally useful (or not useful) in all cases. In other words, we can't accept the null hypothesis; we haven't proven it. We've only found one instance where the difference is likely due to sampling error. Perhaps in another study, we'd find significantly different results. In this situation, we state that we have failed to reject the null hypothesis or that the research hypothesis was not supported.

Ultimately, hypothesis testing is a process of evaluating the null hypothesis, rejecting it or failing to reject it. Because we can never completely control all the factors that may be responsible for the outcome or test all the possible samples, we can never prove a research hypothesis. However, if we can reject the null hypothesis, we have supported our research hypothesis, gaining confidence that our findings reflect the true state of affairs in the population (e.g., that the new reading method leads to higher scores for our students).

Tests of Significance

In the previous example, we noted that inferential statistics can suggest that any differences between scores for comparison groups are simply due to chance or that they are likely to reflect the true state of affairs in the larger population.
This process involves tests of significance. In the language of statistics, the term significance does not mean "importance." Rather, it refers to a statistical level of probability at which we can confidently reject the null hypothesis. Remember, inferential statistics tell us the likelihood (i.e., probability) that the results from our sample are just due to chance. If the probability that our results are due to chance is 50%, how much confidence can we have in them? What if that probability is 10%? 1%? Significance refers to a selected probability level that indicates how much risk we are willing to take if the decision we make is wrong. Researchers do not decide whether scores for two sample groups are different based only on their intuition or best guess. Instead, we select and apply an appropriate test of significance.

To conduct a test of significance, we determine a preselected probability level, known as the level of significance (or alpha, symbolized as α). This probability level serves as a criterion to determine whether to reject or fail to reject the null hypothesis (remember, we never accept the null hypothesis). The standard preselected probability level used by educational researchers is usually 5 out of 100 chances that the observed difference occurred by chance (symbolized as α = .05). Some studies demanding a more stringent level of significance set α = .01 (i.e., the probability is 1 out of 100 that results are simply due to chance), whereas other research that may be more exploratory will set α = .10 (i.e., the probability is 10 out of 100). The smaller the probability level, the less likely it is that the finding would occur by chance.

For example, at Pinecrest we found a difference between the reading scores of students taught with our new reading curriculum and those taught with the old curriculum. If a difference of the size we found is likely to occur only five (or fewer) times out of every 100 possible samples from our population, we can reject the null hypothesis and conclude that the difference we found is (most likely) a meaningful one—students at Pinecrest do better on reading tests after participating in the new reading program. On the other hand, if a difference of the size we found is likely to occur more than five times out of every 100 samples, simply due to chance, we cannot reject the null hypothesis. Even if the scores for the groups appear to be different, if the probability that the difference is due to chance is greater than 5 in 100, we cannot be confident that we've found a real difference. In this case, we state that we have not found a significant difference between the programs, we state further that we have failed to reject the null hypothesis, and we tell the Superintendent to keep looking for a better reading program.

With a probability criterion of 5 times out of 100 (5/100, or .05) that these results would be obtained simply due to chance, there is a high (but not perfect) probability that the difference between the means did not occur by chance (95/100, or 100 − 5): We are 95% confident. Obviously, if we can say we would expect such a difference by chance only 1 time in 100, we are even more confident in our decision (i.e., 99% sure that we've found a real difference). How confident we are depends upon the probability level at which we perform our test of significance.

Levels of confidence can be illustrated on the normal curve, as shown in Figure 13.1. We can determine the likelihood of a difference occurring by chance at the .05 or .01 level from the normal curve. In essence, we are saying that any differences within ±2 SD will be considered chance differences at the .05 level, and any differences within ±3 SD will be considered chance differences at the .01 level. Thus, real or significant differences fall outside ±2 SD (.05) or ±3 SD (.01).

[FIGURE 13.1 • Regions of rejection for α = .05 and α = .01. For α = .05, the middle 95% of the curve (within ±2 SE) is the region of chance, with regions of rejection in the tails beyond ±2 SE; for α = .01, the region of chance covers 99+% (within ±3 SE), with regions of rejection beyond ±3 SE.]
the probability level, the less likely it is that this ple, when testing the effectiveness of the treatment
finding would occur by chance. program for adolescents, we predicted outcomes
would be better for the residential-treatment than
For example, at Pinecrest we found a difference for the day-treatment program, but what if out-
between the reading scores of students taught with comes actually were worse? For this reason, some-
our new reading curriculum and those taught with times we need to look in both directions for the
the old curriculum. If a difference of the size we outcomes of our tests. In the language of statistics,
found is likely to occur only five (or fewer) times out we need to conduct a two-tailed test.
of every 100 possible samples from our population,
we can reject the null hypothesis and conclude that Tests of significance can be either one-tailed or
the difference we found is (most likely) a meaning- two-tailed. When we talk about “tails,” we are refer-
ful one—students at Pinecrest do better on reading ring to the extreme ends of the bell-shaped curve
tests after participating in the new reading program. of a sampling distribution. Figure 13.2 provides an
On the other hand, if a difference of the size we illustration. In the curve on the right, only one tail is
found is likely to occur more than five times out of shaded, representing 5% of the area under the curve.
every 100 samples, simply due to chance, we can- In the curve on the left, both tails are shaded, but each
not reject the null hypothesis. Even if the scores for tail represents only 2.5% of the area under the curve.
the groups appear to be different, if the probability The entire shaded region is known as the region of
that the difference is due to chance is greater than rejection—if we find differences between groups that
are this far from the mean, we can feel confident that
Note that for both bell curves, a total of 5% of the scores fall into the shaded range: Alpha is set at .05.

[FIGURE 13.2 • Significance areas for one-tailed and two-tailed tests with α = .05. Two-tailed test: 2.5% of the area under the curve in each tail forms the regions of rejection. One-tailed test: a single region of rejection covering 5% of the area under the curve lies in one tail.]

A concrete example is useful to help understand the graphs and the distinction between a one-tailed and a two-tailed test. Consider the following null hypothesis:

There is no difference between the behavior during the hour before lunch of kindergarten students who receive a midmorning snack and that of kindergarten students who do not receive a midmorning snack.

What if the null hypothesis is true—the midmorning snack doesn't matter? If we take repeated samples of kindergarten children and randomly divide the children in each sample into two groups, we can expect that, for most of our samples, the two groups will have very similar behavior during the lunch hour. In other words, if we make a graph with "no difference" at the middle and "big differences" at the ends, most scores will fall into the region of chance illustrated in Figure 13.2. Sometimes, however, the two groups will appear very different (although just by chance, if the null hypothesis is true)—in some cases, the group with the snack will be better behaved (i.e., one tail on the graph), and in other cases the group without the snack will be better behaved (i.e., the other tail on the graph).

When conducting our study, we want to know if we can reject the null hypothesis; we believe it's not true. Assume, then, we have a directional research hypothesis:

Kindergarten children who receive a midmorning snack exhibit better behavior during the hour before lunch than kindergarten students who do not receive a midmorning snack.
To reject the null hypothesis and claim support for our research hypothesis, we need to find not just that there's a difference between groups but also that children who get snacks exhibit better behavior than their peers who don't get snacks, and we need to feel confident that our results aren't simply due to chance. We set α = .05; a statistically significant difference between the groups (i.e., not likely to be due to chance) will be large enough to fall into the region of rejection, or the shaded region on the right tail of the bell curve on the right of Figure 13.2. We look at only one tail because, according to our hypothesis, we're only interested in seeing whether the group receiving snacks behaves better than the group without snacks.

But what if the outcome is reversed—children who get snacks behave much worse than children who don't? We haven't supported our research hypothesis (in fact, we've found the exact opposite!), and although we've found a big difference between groups, the mean difference doesn't fall into the region of rejection on our one-tailed graph. We can't reject the null hypothesis, then—but the null hypothesis clearly doesn't reflect the true state of affairs either! It should be clear that a two-tailed test of significance would help us because it allows for both possibilities—that the group that received a snack would be better behaved, or that the group without a snack would be better behaved.

Tests of significance are almost always two-tailed. To select a one-tailed test of significance, the researcher has to be very certain that a difference will occur in only one direction, and this is not very often the case. However, when appropriate, a one-tailed test has one major advantage: The score difference required for significance is smaller than for a two-tailed test. In other words, it is "easier" to obtain a significant difference when predicting change in only one direction. To understand this concept in more detail, reconsider Figure 13.2. Because α = .05, the region of rejection represents 5% of the area under the curve. In the graph for the two-tailed test, however, that 5% is split into two regions of 2.5% each to cover both possible outcomes (e.g., the children with snacks will behave better, or the children without snacks will behave better). As should be clear from the graphs, the values that fall into the two shaded tails of the graph on the left are more extreme than the values that fall into the one shaded tail of the graph on the right. For example, when using a two-tailed test, the two groups of kindergarteners (i.e., with or without snacks) need to be quite different—more different than they need to be if using only a one-tailed test.
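The "easier" one-tailed criterion shows up in software output as a halved p value when the difference lies in the predicted direction. Here is a hedged SciPy sketch with invented behavior ratings; note that the alternative= option requires a reasonably recent SciPy version.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical behavior ratings for the two kindergarten groups.
snack = rng.normal(7.5, 2.0, 20)
no_snack = rng.normal(6.5, 2.0, 20)

# Two-tailed: a difference in either direction counts.
_, p_two = stats.ttest_ind(snack, no_snack)
# One-tailed: the snack group is predicted to score higher.
_, p_one = stats.ttest_ind(snack, no_snack, alternative="greater")

print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
# When the observed difference lies in the predicted direction, the
# one-tailed p is half the two-tailed p, so significance is "easier."
```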
Type I and Type II Errors

Based on a test of significance, as we have discussed, the researcher will either reject or not reject the null hypothesis. In other words, the researcher will make the decision that the difference between the means is, or is not, likely due to chance. Because we are dealing with probability, not certainty, we never know for sure whether we are absolutely correct. Sometimes we'll make mistakes—we'll decide that a difference is a real difference when it's really due to chance, or we'll decide that a difference is due to chance when it's not. These mistakes are known as Type I and Type II errors.

To understand these errors, reconsider our example of the two methods of reading at Pinecrest Elementary. Our decision-making process can lead to four possible outcomes (see Figure 13.3).

FIGURE 13.3 • The four possible outcomes of decision making concerning rejection of the null hypothesis

                                        The true status of the null hypothesis. It is really . . .
                                        True (should not be rejected)    False (should be rejected)
The researcher's decision. The
researcher concludes that the
null hypothesis is . . .
  True (does not reject)                Correct Decision                 Type II Error
  False (rejects)                       Type I Error                     Correct Decision

1. The null hypothesis can in reality be true for the population (i.e., no difference between the reading methods: the new method = the old method). If we decide that any difference we find is just due to chance, we fail to reject the null hypothesis, and we have made a correct decision.
■ Correct: Null hypothesis is true; researcher fails to reject it and concludes no significant difference between groups.

2. The null hypothesis in reality is false (i.e., new method ≠ old method). If we decide that we are reasonably confident that our results are not simply due to chance, we reject the null hypothesis. We have made a correct decision.

■ Correct: Null hypothesis is false; researcher rejects it and concludes that the groups are significantly different.

3. The null hypothesis is true (i.e., new method = old method), but we reject it, believing that the results are not simply due to chance and that the methods are different. In this case, we have made an incorrect decision. We have mistakenly assumed there is a difference in the reading programs when there is none.

■ Incorrect: Null hypothesis is true, but researcher rejects it and concludes that the groups are significantly different.

4. The null hypothesis is false (i.e., new ≠ old), but we fail to reject it, believing that the groups are really the same. We are incorrect because we have concluded there is no difference when indeed there is a difference.

■ Incorrect: Null hypothesis is false, but researcher fails to reject it and concludes that the groups are not significantly different.

If the researcher incorrectly rejects a null hypothesis (i.e., possibility 3), the researcher has made a Type I error. If the researcher incorrectly fails to reject a null hypothesis (i.e., possibility 4), the researcher has made a Type II error.
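The meaning of α as a Type I error rate can be checked by simulation. The sketch below, with invented population parameters, runs many studies in which the null hypothesis is true by construction and counts how often it would be wrongly rejected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha = 0.05
trials = 10_000
false_rejections = 0

# Simulate many studies in which the null hypothesis is TRUE by
# construction (both groups drawn from the same population), counting
# how often we would wrongly reject it at alpha = .05.
for _ in range(trials):
    a = rng.normal(100, 15, 25)
    b = rng.normal(100, 15, 25)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_rejections += 1

print(f"observed Type I error rate: {false_rejections / trials:.3f}")
```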
will have to spend a great deal of money on ma-
The probability level selected by the researcher terials, in-service training, and new testing. Given
determines the probability of committing a Type I the expense of the new program, we can set our
error—that is, of rejecting a null hypothesis that is level of significance (a) to .01; we want to reduce
really true (i.e., thinking you’ve found an effect, the likelihood of Type I error. In other words, we
when you haven’t). Thus, if you select a 5 .05, you want to be confident (at a 5 .01) that, if we recom-
have a 5% probability of making a Type I error, mend the new program, it works better than the old
whereas if you select a 5 .01, you have only a 1% one—that any difference we may find is not simply
probability of committing a Type I error. For exam- random error. We may be more willing to risk a
ple, at Pinecrest Elementary we found a difference Type II error (i.e., concluding the new method isn’t
better, although it really is) because it is such an expensive program to implement and we suspect we may find a better but cheaper program.

We're willing to take that risk because, in this example, a Type I error is the more serious error for Pinecrest. If we conclude that the new program really works, but it's not really any better than the old program (i.e., a Type I error), the Superintendent is going to be very upset if a big investment is made based on our decision and the students show no difference in achievement at the end of the year. On the other hand, if we conclude that the new program does not really make a difference, but it's really better (i.e., a Type II error), it's likely that nothing adverse will happen. In fact, the superintendent would not know that the new method is better because, of course, we never implemented it. We just hope the superintendent does not find any research suggesting that the new program was effective elsewhere. Given the choice, then, we would rather commit a Type II error (i.e., rejecting a successful program) than a Type I error (i.e., going to the expense of implementing an unsuccessful program).

The choice of which type of error is worth risking may not always be so clear, however. For example, if Pinecrest Elementary is not meeting its AYP (Adequate Yearly Progress) goals for reading under No Child Left Behind, the stakes are high. We need to find a reading program better than the old method, or we risk losing funding and potentially closing the school. Under these circumstances, we may be more likely to risk committing a Type I error—we have little to lose if we select a program that is no better than the program we are now using, and a lot to gain if the program is in fact better. Therefore, we can use a less stringent level of significance; we can accept the greater risk that a difference of the size we find could occur by chance in 5 out of 100 studies (α = .05) or 10 out of 100 (α = .10).

The decision about level of significance for a particular study is based on both risk and practical significance. If the consequences of committing a Type I error are not severe or life threatening, we usually accept a less stringent level of significance (e.g., α = .05 rather than α = .01). A study conducted by one of the authors can help explain these choices further. A social service agency needed to choose whether to use a day-treatment or residential-treatment program for adolescent drug abusers. A study of the two programs found the residential program was significantly better at the predetermined alpha of .10. Because the risk of committing a Type I error (i.e., claiming the residential program was better when it wasn't) was not high, α = .10 was an acceptable level of risk. Unfortunately, the residential treatment program, as you would imagine, was quite expensive. Even though the residential program was significantly different from the day-treatment program, the researchers recommended using the day-treatment program. This example clearly shows the difference between statistical and practical significance: The higher cost of the residential program was not justified by its statistical advantage over the day-treatment program. Furthermore, the researchers subsequently found that, had they set the level of significance at a more stringent level (α = .01), the difference between the programs would not have been statistically significant. The researchers concluded, as did the program administrators, that the cheaper program was the better choice.

Both as researchers and as consumers, we make choices every day based on acceptable levels of risk. For example, we may choose to take vitamins each morning based on studies of their effectiveness that show only marginally significant results. But, because the risk of being wrong is not severe (i.e., a Type I error—so what if they might not really work; as long as they don't hurt, it's worth a try), we go ahead and take the vitamins. On the other hand, if we decided to jump out of an airplane, we would want to use a parachute that has a very high probability of working correctly and would want to know how this type of parachute performed in repeated trials. And we would want a highly stringent probability level, such as α = .000001 or beyond: The risk of being wrong is fatal. When you are unsure what level of risk is acceptable, selecting α = .05 is a standard practice that provides an acceptable balance between Type I and Type II errors. Otherwise, consider the risk: Are you jumping out of an airplane, or are you trying to decide if you should take vitamin C? Fortunately, most choices in the fields of education and human services are not life or death.

Degrees of Freedom

After determining whether the significance test will be two-tailed or one-tailed and selecting a probability cutoff (i.e., alpha), we then select an appropriate
statistical test and conduct the analysis. When computing the statistics by hand, we check to see if we have significant results by consulting the appropriate table at the intersection of the probability level and the degrees of freedom (df) used to evaluate significance. When the analysis is conducted on the computer, the output contains the exact level of significance (i.e., the exact probability that the results are due to chance) and the degrees of freedom.

An example may help illustrate the concept of degrees of freedom, defined as the number of observations free to vary around a parameter. Suppose we ask you to name any five numbers. You agree and say "1, 2, 3, 4, 5." In this case, N is equal to 5; you had 5 choices, and you could select any number for each choice. In other words, each number was "free to vary"; it could have been any number you wanted. Thus, you had 5 degrees of freedom for your selections (df = N). Now suppose we tell you to name five numbers, you begin with "1, 2, 3, 4 . . . ," and we say: "Wait! The mean of the five numbers you choose must be 4." Now you have no choice for the final number—it must be 10 to achieve the required mean of 4 (i.e., 1 + 2 + 3 + 4 + 10 = 20, and 20/5 = 4). That final number is not free to vary; in the language of statistics, you lost one degree of freedom because of the restriction that the mean must be 4. In this situation, you only had 4 degrees of freedom (df = N − 1).

Each test of significance has its own formula for determining degrees of freedom. For example, for the product moment correlation coefficient, Pearson r, the formula is N − 2. The number 2 is a constant, requiring that degrees of freedom for r are always determined by subtracting 2 from N, the number of participants. Each of the inferential statistics discussed in the next section also has its own formula for degrees of freedom, but in every case, the value for df is important in determining whether the results are statistically significant.
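The role of df is easy to see with a statistics library. In the hedged sketch below (our own; scipy's t distribution simply stands in for a printed t table such as Table A.4), the critical two-tailed t value at α = .05 shrinks as the degrees of freedom grow:

```python
from scipy import stats

alpha = 0.05
for df in (4, 10, 24, 120):
    # Two-tailed critical value: put alpha/2 in each tail.
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    print(df, round(t_crit, 3))
# df=4 -> 2.776, df=10 -> 2.228, df=24 -> 2.064, df=120 -> 1.980
```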
SELECTING AMONG TESTS OF SIGNIFICANCE

Many different statistical tests of significance can be applied in research studies. Factors such as the scale of measurement represented by the data (e.g., nominal, ordinal, etc.), the method of participant selection, the number of groups being compared, and the number of independent variables determine which test of significance should be used in a given study. It is important that the researcher select the appropriate test because an incorrect test can lead to incorrect conclusions. The first decision in selecting an appropriate test of significance is whether a parametric test or a nonparametric test must be selected. Parametric tests are usually more powerful and generally preferable when practical. "More powerful" in this case means that, based on the results, the researcher is more likely to reject a null hypothesis that is false; in other words, use of a powerful test makes the researcher more likely to identify a true effect and thus less likely to commit a Type II error.

A parametric test, however, requires certain assumptions to be met for it to be valid. For example, the measured variable must be normally distributed in the population (or at least the form of the distribution must be known). Many variables studied in education are normally distributed, so this assumption is often met. A second major assumption is that the data represent an interval or ratio scale of measurement, although in some cases ordinal data, such as from a Likert-type scale, may be included. Because many measures used in education represent or are assumed to represent interval data, this assumption is usually met. In fact, this is one major advantage of using an interval scale—it permits the use of a parametric test. A third assumption is that the selection of participants is independent. In other words, the selection of one subject in no way affects the selection of any other subject. Recall from Chapter 5 that with random sampling, every member of the population has an equal and independent chance to be selected for the sample. Thus, if randomization is used in participant selection, the assumption of independence is met. Another assumption is that the variances of the comparison groups are equal (or at least that the ratio of the variances is known). Remember, the variance of a group of scores is the square of the standard deviation (see Chapter 12 for a discussion of variance and standard deviation).

With the exception of independence, a small violation of one or more of these assumptions usually does not greatly affect the results of tests for significance. Because parametric statistics seem to be relatively hardy, doing their job even with moderate assumption violation, they are usually selected for analysis of research data. However, if one or more assumptions are violated to a large
degree—for example, if the distribution is extremely skewed—parametric statistics should not be used. In such cases, a nonparametric test, which makes no assumptions about the shape of the distribution, should be used. Nonparametric tests are appropriate when the data represent an ordinal or nominal scale, when a parametric assumption has been greatly violated, or when the nature of the distribution is not known.

Nonparametric tests are not as powerful as parametric tests. In other words, it is more difficult with a nonparametric test to reject a null hypothesis at a given level of significance; usually, a larger sample size is needed to reach the same level of significance as in a parametric test. Additionally, many hypotheses cannot be tested with nonparametric tests. Nevertheless, often we have no choice but to use nonparametric statistics when we are dealing with societal variables that are not conveniently measured on an interval scale, such as religion, race, or ethnicity.

In the following sections, we examine both parametric and nonparametric statistics. Although we cannot discuss every statistical test available to the researcher, we describe several statistics commonly used in educational research.

The t Test

The t test is used to determine whether two groups of scores are significantly different at a selected probability level. For example, a t test can be used to compare the reading scores for males and females at Pinecrest Elementary.

The basic strategy of the t test is to compare the actual difference between the means of the groups ($\bar{X}_1 - \bar{X}_2$) with the difference expected by chance.

For our data from Pinecrest Elementary School, we can use a t test to determine if the difference between the reading scores of the boys and girls is statistically significant—that is, whether any difference we find is likely to have occurred by chance. The test involves forming a ratio based on the scores of the boys and the girls, as shown in the formula below:

$$
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\left(\dfrac{SS_1 + SS_2}{n_1 + n_2 - 2}\right)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
$$

In the formula, the numerator is the difference between the sample means $\bar{X}_1$ and $\bar{X}_2$, and the denominator is the chance difference that would be expected if the null hypothesis were true (i.e., no difference between the boys' and the girls' scores). In other words, the denominator is the standard error of the difference between the means—a function of both sample size and group variance. Smaller sample sizes and greater variation within groups are associated with greater random differences between groups. Even if the null hypothesis is true, we do not expect two sample means to be identical; there is always going to be some chance variation. The t test determines the likelihood that a difference of this size would be expected solely by chance.

If we were making the t test calculation by hand, we would divide the numerator by the denominator and then determine whether the resulting t value reflects a statistically significant difference between the groups by comparing the t we computed to a table of t values, such as that shown in Table A.4 in Appendix A. If the t value is equal to or greater than the table value for the required df (i.e., reflecting sample size) and alpha (i.e., reflecting significance level), then we can reject the null hypothesis: Our results suggest a significant difference between the groups. If the t value we compute is less than the table value, we fail to reject the null hypothesis: Any difference we have found is likely due to sampling error (i.e., chance). Of course, typically, we would conduct the t test with the computer, which produces output showing the t value, its level of significance, and the degrees of freedom.
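The formula is straightforward to implement. The sketch below is our own illustration (the two short score lists are invented, not the Pinecrest data): it computes t directly from the formula above and then checks the result against scipy's built-in independent-samples t test.

```python
import numpy as np
from scipy import stats

boys = np.array([38, 45, 29, 41, 52, 36, 44])
girls = np.array([50, 47, 58, 43, 55, 49])

def t_independent(x, y):
    # SS = sum of squared deviations from each group's mean
    ss_x = np.sum((x - x.mean()) ** 2)
    ss_y = np.sum((y - y.mean()) ** 2)
    df = len(x) + len(y) - 2
    pooled = (ss_x + ss_y) / df                       # pooled variance
    se = np.sqrt(pooled * (1 / len(x) + 1 / len(y)))  # standard error of the difference
    return (x.mean() - y.mean()) / se, df

t_hand, df = t_independent(boys, girls)
t_scipy, p = stats.ttest_ind(boys, girls)  # pooled (equal-variance) t test
print(t_hand, t_scipy, df, p)              # the two t values agree
```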
In determining significance, the t test is adjusted for the fact that the distribution of scores for small samples becomes increasingly different from the normal distribution as sample sizes become increasingly smaller. For example, distributions for smaller samples tend to be higher at the mean and at the two ends of the distribution. As a result, the t values required to reject a null hypothesis are higher for small samples. As the size of the samples becomes larger, the score distribution approaches normality. Note in Table A.4 that as the number of participants (df) increases, the t value (or test statistic) needed to reject the null hypothesis becomes smaller. Table A.4 also shows that as the probability, or significance, level becomes smaller (i.e., .10, .05, .01, .001), a larger t value is required to reject the null hypothesis.

The t Test for Independent Samples

The t test for independent samples is a parametric test of significance used to determine whether, at a selected probability level, a significant difference
exists between the means of two independent samples. Independent samples are randomly formed without any type of matching—the members of one sample are not related to members of the other sample in any systematic way other than that they are selected from the same population. If two groups are randomly formed, the expectation is that at the beginning of a study they are essentially the same with respect to performance on the dependent variable. Therefore, if they are also essentially the same at the end of the study (i.e., their means are close), the null hypothesis is probably true. If, on the other hand, their means are not close at the end of the study, the null hypothesis is probably false and should be rejected. The key word is essentially. We do not expect the means to be identical at the end of the study—they are bound to be somewhat different. The question of interest, of course, is whether they are significantly different.

Calculating the t Test for Independent Samples Using SPSS 18

As we have discussed, the t test for independent samples is used when we want to compare the scores of two groups. For example, at Pinecrest Elementary School we would want to know if the boys' reading scores are statistically different from the girls' scores. Even though we know from our previous example that the mean for the girls on the fall reading score (X̄ = 52.533) was higher than that for the boys (X̄ = 41.054), we do not know how likely it is that this difference would occur by chance or whether it is a statistically meaningful difference. The t test helps us decide whether the difference between the boys' and the girls' scores is statistically significant—that is, not likely to have occurred by chance.

We can use Excel, SPSS, or a variety of other software applications to conduct a t test. Although each program has advantages and disadvantages, dedicated statistical packages, such as SPSS, are set up in terms of dependent and independent variables, and as our analyses become more complex and use larger numbers of variables and cases, statistical packages offer more advantages. Furthermore, although the set-up procedures for analyses on different statistical software packages are slightly different, they are all somewhat similar; that is, we select the variables to be compared and the statistical test to run. To help you understand the variable selection process required in statistical tests, we present a step-by-step example of the t test procedure using SPSS. Explanations for other procedures we use in this chapter are available in Appendix B.

To perform the t test in SPSS, first click on Analyze and choose Compare Means from the pull-down menu. A submenu appears, as shown in Figure 13.4. From this submenu choose Independent-Samples T Test. . . . In summary, the options are as follows:

Analyze
  Compare Means
    Independent-Samples T Test

Selecting these options produces the Independent-Samples T Test window, shown in Figure 13.5.

In our example, we are comparing the fall reading test scores (ReadF) of the boys in Mrs. Alvarez's third-grade class to the girls' scores. We thus need to move the dependent (i.e., outcome) variable, fall reading score (ReadF), into the Test Variable(s) section. Next, we need to specify that we would like to compare the groups of boys and girls; we need to select the Grouping Variable, which is gender. We define the groups by selecting the Define Groups button, just underneath Grouping Variable, as shown in Figure 13.6. Because the two groups in our data set are specified as Group 1 for boys and Group 2 for girls, we simply type the number 1 (for Group 1) and the number 2 (for Group 2). Remember, you need to specify the codes for the groups to be tested, which may not always be 1 or 2 as in this example.

To run the analysis, click Continue to return to the Independent-Samples T Test window. Click on the OK button. The analysis runs, and an output window opens, showing two tables. The first table is the Group Statistics table, shown in Table 13.1. This table shows you the sample size for each group (represented in the "N" column) as well as the mean test score for each group, the standard deviation, and the standard error of the mean.

Table 13.2 shows the results of the independent-samples t test, including additional statistics to assist with the interpretation. The first set of statistics comes under the heading "Levene's Test for Equality of Variances." This test determines if the variances of the two groups in the analysis are equal. If they are not, then SPSS makes an adjustment to the remainder of the statistics to account for this
FIGURE 13.4 • SPSS menu options for independent-samples t test
FIGURE 13.5 • Independent-samples t test window
FIGURE 13.6 • Independent-samples t test window with Define Groups button and Define Groups window
TABLE 13.1 • Independent-samples output

Group Statistics

        Gender   N    Mean     Std. Deviation   Std. Error Mean
ReadF   1        13   41.054   14.0581          3.8990
        2        12   52.533   13.2480          3.8244
difference. When the observed probability value of the Levene's test (shown in the "Sig." column) is greater than .05, you should read the results in the top row of t test statistics, "equal variances assumed," because we have found no significant difference in the variances. When the observed probability value for the Levene's test is less than .05, you should read the results from the bottom row of t test statistics, "equal variances not assumed," because the difference between the group variances is significant. In Table 13.2, the observed probability value for the Levene's test is greater than .05 (i.e., Sig. = .560, no significant difference found), so you should read from the top row of t test statistics, equal variances assumed.

Having selected the appropriate row to read, we can find the observed t statistic value and its corresponding probability value. In Table 13.2 the observed t statistic we use for equal variances is −2.097, with its observed level of significance ("Sig.") = .047. The value for t is negative because SPSS subtracts the second number from the first, but the sign (i.e., negative or positive) has no effect on how we interpret the level of significance. The significance level of this t test (p = .047) indicates that a difference between the means this large (i.e., −11.4795) would happen by chance only 4.7 times out of 100 in repeated studies. Because .047 is smaller than the standard alpha level (p = .05) we had preselected for our level of significance, this example shows a statistically significant difference between the fall reading scores of the boys and those of the girls. We can thus have confidence as we inform our colleagues at Pinecrest Elementary that the boys are entering third grade with lower reading achievement than the girls; we may suggest that the school staff accommodate this difference using the current reading curriculum as the school year proceeds, or we may recommend new programs to benefit the boys in younger grades.
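The same decision rule can be scripted outside SPSS. This is a hedged sketch of our own (the score arrays are placeholders, not the Pinecrest data): scipy's Levene test picks between the pooled t test ("equal variances assumed") and the unpooled Welch version ("equal variances not assumed").

```python
import numpy as np
from scipy import stats

boys = np.array([38, 45, 29, 41, 52, 36, 44])
girls = np.array([50, 47, 58, 43, 55, 49])

# A Levene p-value above .05 mirrors reading the
# "equal variances assumed" row of the SPSS output.
_, levene_p = stats.levene(boys, girls)
t, p = stats.ttest_ind(boys, girls, equal_var=(levene_p > 0.05))
print(round(t, 3), round(p, 3))
```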
TABLE 13.2 • Independent-samples t test statistics

Independent Samples Test

                            Levene's Test for
                            Equality of Variances   t-test for Equality of Means
                                                                  Sig.         Mean         Std. Error   95% Confidence Interval
                            F      Sig.    t        df           (2-tailed)   Difference   Difference   Lower       Upper
ReadF   Equal variances
        assumed             .350   .560    −2.097   23           .047         −11.4795     5.4750       −22.8055    −.1535
        Equal variances
        not assumed                        −2.102   22.987       .047         −11.4795     5.4615       −22.7778    −.1811
Note that although we've used SPSS to present our example, we could use other programs and achieve the same result. For example, in Excel, we would select the appropriate t test in the Data Analysis menu and then define our variable ranges (e.g., the boys' fall reading scores compared to the girls' scores) to conduct the test. The value for t and the probability that the results are due to chance will be the same as those generated by SPSS.

The t Test for Nonindependent Samples

The t test for nonindependent samples is used to compare groups that are formed by some type of matching. The nonindependent procedure is also used to compare a single group's performance on a pretest and posttest or on two different treatments. For example, at Pinecrest Elementary we want to know if students' reading scores improved from the beginning of the year to the end of the year. Because we have fall and spring test scores for each student, this sample has nonindependent scores. When scores are nonindependent, they are systematically related: Because at Pinecrest the reading scores are from the same students at two different times, they are expected to correlate positively with each other—students with high scores in the fall will likely have high scores in the spring, and students with low scores in the fall will likely have low scores in the spring. When the scores are nonindependent, a special t test for nonindependent samples is needed. The error term of this t test tends to be smaller than for independent samples, and the probability that the null hypothesis will be rejected is higher.

Calculating the t Test for Nonindependent Samples Using SPSS 18

Even though we are using the same student data for Pinecrest Elementary School in our examples, the different questions we ask call for different analyses. To answer our question about whether the students' reading scores improved over the school year, we conduct a t test for nonindependent samples, comparing fall reading scores to spring reading scores. Furthermore, because we are also interested in whether the students' math scores improved, we conduct a second comparison including MathF (i.e., fall scores) and MathS (i.e., spring scores). It is easy to conduct several analyses of nonindependent samples with SPSS—we just designate the additional variables in the t test.

To conduct these analyses, select Analyze in the SPSS Data Editor window. To find the nonindependent t test, scroll down the Analyze menu and select the Compare Means option. From the submenu, choose the option called Paired-Samples T Test. This difference in designation may seem a bit confusing. One way to think of it is that we are comparing two sets of scores (i.e., fall and spring) for the same group of students. Therefore, the relation between the sets of scores is dependent on the group of people, our Pinecrest students. We are not, however, matching each student to the scores; rather, we are comparing the group means in the fall to the group means in the spring (i.e., a pair of data points for each participant).
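For comparison with the SPSS steps that follow, here is the same kind of paired comparison as a hedged Python sketch (the fall and spring arrays are invented; each position holds one student's two scores, which is what makes the samples nonindependent):

```python
import numpy as np
from scipy import stats

# One column per testing occasion; the pairing is positional.
read_fall   = np.array([41, 55, 38, 60, 47])
read_spring = np.array([50, 63, 45, 66, 52])

t, p = stats.ttest_rel(read_fall, read_spring)  # paired-samples t test
print(t, p)  # t is negative when the spring scores are higher
```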
In summary, the procedures for the SPSS analysis are as follows:

Analyze
  Compare Means
    Paired-Samples T Test . . .

Because this analysis is somewhat similar to the previous t test example, you may refer to the test menu options in Appendix B, Figures B13.1 and B13.2.

The paired-samples t test requires that you choose the variables to include in the analysis and, as with the test for independent samples, move them to the right section of the window, labeled Paired Variables. As you select the variables, the Current Selections section at the bottom of the window (Figure B13.2) shows them. Click the arrow button to move them into the Paired Variables section on the right of the screen, and click the OK button to conduct the analysis.

The first section of the output, showing descriptive statistics for each variable, is displayed in Table 13.3. First, the mean for each variable is shown: For ReadF, X̄ = 46.564; for MathF, X̄ = 45.408. Similarly, for ReadS, X̄ = 53.964, and for MathS, X̄ = 45.560. The next number shown is the number of cases, or the sample size (N). The output shows that 25 people took each of the tests—the number of students in Mrs. Alvarez's class. The third statistic shown is the standard deviation for each set of test scores. The SD is used to compute the final statistic shown in the table, the standard error of the mean scores. These values are used to compute t, shown in the remainder of the output (Table 13.4).

TABLE 13.3 • Dependent samples output

Paired Samples Statistics

                 Mean     N    Std. Deviation   Std. Error Mean
Pair 1   ReadF   46.564   25   14.6123          2.9225
         ReadS   53.964   25   17.1274          3.4255
Pair 2   MathF   45.408   25   15.3108          3.0622
         MathS   45.560   25   15.3964          3.0793

SPSS generates and displays the t value, degrees of freedom, and the significance value (i.e., Sig. in the output, but also referred to as the p value) showing the probability that these results could be obtained simply by chance. The first box in the table shows the variables that are being compared. The next four boxes show the difference between the mean scores, the standard deviation, the standard error of the difference between the mean scores, and the 95% confidence interval (i.e., the range of values in which you can be 95% confident that the real difference between mean scores falls). The last three boxes show the t value, the degrees of freedom, and the significance value. If the value in the box labeled Sig. (2-tailed) is less than or equal to α = .05, then the students' fall reading and math tests (i.e., ReadF and MathF) are significantly different from their spring reading and math tests (i.e., ReadS and MathS).

TABLE 13.4 • Dependent samples t test output

Paired Samples Test

                        Paired Differences
                        Mean      Std.        Std. Error   95% Confidence Interval      t        df   Sig.
                                  Deviation   Mean         of the Difference                          (2-tailed)
                                                           Lower        Upper
Pair 1   ReadF–ReadS    −7.4000   5.5039      1.1008       −9.6719      −5.1281         −6.722   24   .000
Pair 2   MathF–MathS    −.1520    1.5286      .3057        −.7830       .4790           −.497    24   .624
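A quick check ties these boxes together: the paired t value is simply the mean difference divided by its standard error. Using the Pair 1 numbers from Table 13.4:

```python
mean_diff = -7.4000   # Pair 1 mean difference (ReadF - ReadS) from Table 13.4
se_diff = 1.1008      # standard error of the mean difference
print(round(mean_diff / se_diff, 3))  # -6.722, matching the tabled t value
```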
The finding for reading is good news for the students and Pinecrest teachers: Students' reading scores in the spring are better than those from the fall (t = −6.722, p = .000; note that the probability isn't equal to zero, but rather it's so small that it can't be shown in only three decimal places). We are able to reject the null hypothesis of no difference in the scores. We don't, however, know why the scores are different. If, for example, we were testing a new reading program at Pinecrest, our findings would provide support for the new program. Or, the improvement
in reading scores could be due to Mrs. Alvarez's skill as a teacher. The improvement could also be due to parental involvement or other variables we may not have controlled or considered. The t test can only tell us if the difference between the means is likely due to chance, not why the difference occurs.

What happened to the math scores? Table 13.4 shows that the fall math scores (X̄ = 45.408) and the spring math scores (X̄ = 45.560) were not significantly different. Without even conducting the t test, we can expect intuitively that this difference is not significant because the scores increased only .152 points between the fall and the spring—not much happened. The test confirms our intuition: t = −.497, Sig. = .624. In other words, we could expect this finding by chance in more than 62% of other similar studies. The students did not appear to learn very much, and we want to find out why, so we have more work to do. The t test cannot tell us why the math scores have not improved, but it has notified us that we have a problem.

One final note on the t test: In this example, the t value is negative because SPSS automatically subtracts the second mean score in the list from the first. If the students had done better on the fall reading test than the spring reading test (i.e., if the mean scores had been reversed), then the difference in the means and the resulting t test statistic would have been positive. We would hope not to have a positive t value for Pinecrest because a positive value would suggest that students did worse in the spring than the fall.
Analysis of Gain or Difference Scores

As we noted previously, when comparing the reading and math scores for students between fall and spring, we did not match each student's scores. Instead we compared the mean of all students in the fall to that of all students in the spring. Some researchers think, however, that a viable way to analyze data from two groups who are pretested, treated, and then tested again is: (1) to subtract each participant's pretest score from his or her posttest score (resulting in a gain, or difference, score), (2) to compute the mean gain or difference for each group, and (3) to calculate a t value for the difference between the two mean gains. This approach has two main problems. First, every participant, or student in our example, does not have the same potential to gain. A participant who scores very low on a pretest has a large opportunity to gain, but a participant who scores very high has only a small opportunity to improve (the latter situation, where participants score at or near the high end of the possible range, is referred to as the ceiling effect). Who has improved, or gained, more—a participant who goes from 20 to 70 (a gain of 50) or a participant who goes from 85 to 100 (a gain of only 15 but perhaps a perfect score)? Second, gain or difference scores are less reliable than analyses of posttest scores alone.

The appropriate analysis for data from pretest–posttest designs depends on the performance of the two groups on the pretest. For example, if both groups are essentially the same on the pretest and neither group has been previously exposed to the treatment planned for it, then posttest scores are best compared using a t test. If, on the other hand, there is a difference between the groups on the pretest, the preferred approach is the analysis of covariance. As discussed in Chapter 9, analysis of covariance adjusts posttest scores for initial differences on some variable (in this case the pretest) related to performance on the dependent variable. To determine whether analysis of covariance is necessary, calculate a t test using the two pretest means. If the two pretest means are significantly different, use the analysis of covariance. If not, a simple t test can be computed on the posttest means.

Analysis of Variance

Simple Analysis of Variance

Simple, or one-way, analysis of variance (ANOVA) is a parametric test of significance used to determine whether scores from two or more groups are significantly different at a selected probability level. Whereas the t test is an appropriate test of the difference between the means of two groups at a time (e.g., boys and girls), ANOVA is the test for multiple-group comparisons. For our example, we introduce a new data set composed of freshman college students at Pacific Crest College. Table B13.1 in Appendix B displays the Pacific Crest College dataset, which includes data from 125 students. Although this dataset for Pacific Crest College is still not large, it is sufficient to allow us to accomplish basic ANOVA and multiple regression analyses. We use ANOVA to compare the Pacific Crest College students' college grade point average (CollGPA) based on their economic level
(ECON), which is organized into three groups: low, medium, and high. In other words, economic level is the grouping variable, with three levels, and college GPA is the dependent variable.

Three (or more) means are very unlikely to be identical; the key question is whether the differences among the means represent true, significant differences or chance differences due to sampling error. To answer this question, ANOVA is used: An F ratio is computed. Although it is possible to compute a series of t tests, one for each pair of means, doing so raises some statistical problems concerning distortion of the probability of a Type I error, and it is certainly more convenient to perform one ANOVA than several t tests. For example, to analyze four means, six separate t tests would be required:

$$
\bar{X}_1 - \bar{X}_2,\quad \bar{X}_1 - \bar{X}_3,\quad \bar{X}_1 - \bar{X}_4,\quad \bar{X}_2 - \bar{X}_3,\quad \bar{X}_2 - \bar{X}_4,\quad \bar{X}_3 - \bar{X}_4
$$

ANOVA is much more efficient and keeps the error rate under control.

The concept underlying ANOVA is that the total variation, or variance, of scores can be divided into two sources—variance between groups and variance within groups. Between-group variance considers, overall, how the individuals in a particular group differ from individuals in the other groups. In our example of the Pacific Crest College students, between-group variance refers to the ways in which students from different economic backgrounds differ from one another. Ultimately, between-group differences are what researchers are usually interested in. The within-group variance considers how students vary from others in the same group. Not every student in the high-economic group has the same GPA, for example. These differences are known as the within-group variance, or error variance.

To ensure that apparent group differences aren't just due to these differences among people in general (i.e., just error), ANOVA involves a ratio, known as F, with group differences as the numerator (i.e., variance between groups) and error as the denominator (i.e., variance within groups). If the variance between groups is much greater than the variance within groups—greater than would be expected by chance—the ratio will be large, and a significant effect will be apparent. If, on the other hand, the variance between groups and the variance within groups do not differ by more than would be expected by chance, the resulting F ratio is small; the groups are not significantly different. To summarize, the greater the difference in variances, the larger the F ratio; larger Fs are more likely to reflect significant differences among groups. However, a significant finding tells the researcher only that the groups are not all the same. To identify how the groups differ (i.e., which means are different from one another), additional statistics, described next, are needed.
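Before turning to those follow-up statistics, the overall F test itself can be sketched in a few lines. This example is our own (three invented GPA lists stand in for the low, medium, and high economic groups):

```python
from scipy import stats

low    = [2.1, 2.5, 2.3, 2.8, 2.2]
medium = [2.4, 2.6, 2.9, 2.5, 2.7]
high   = [3.2, 3.5, 3.1, 3.6, 3.3]

# One F ratio for all three groups at once, rather than
# three pairwise t tests with an inflated Type I error rate.
f, p = stats.f_oneway(low, medium, high)
print(f, p)
```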
Multiple Comparisons

In essence, multiple comparisons involve calculation of a special form of the t test. This special t adjusts for the fact that many tests are being executed. Each time we conduct a significance test, we accept a particular probability level, α. For example, we agree that if the results we find would occur by chance only 5 times in every 100 samples, we will conclude that our results are meaningful and not simply due to chance. However, if we conduct two tests of significance on the same dataset, the chance of finding a significant difference increases (i.e., we now have two tests that could show a significant difference), but the chance of committing a Type I error increases as well (i.e., we now have two chances to commit this error, one for each test). When multiple comparisons are involved, then, special statistics are needed to keep the error rate low.

In general, when conducting a study that involves more than two groups, researchers plan a set of comparisons between specific groups before collecting the data, based on the research hypotheses. Such comparisons are called a priori (i.e., "before the fact") or planned comparisons. For example, in our study of Pacific Crest College, we may predict that the GPAs of high-income students will differ from those of low-income students and plan to conduct that comparison. Often, however, it is not possible to state a priori predictions. In these cases, we can use a posteriori, or post hoc (i.e., "after the fact"), comparisons. In either case, multiple comparisons should not be a fishing expedition in which researchers look for any possible difference; they should be motivated by hypotheses.

Calculating ANOVA with Post Hoc Multiple Comparison Tests Using SPSS 18

In this example, we use SPSS to run an ANOVA to determine whether and how the college GPAs differ for students in the high, middle, and low economic groups. We selected the Scheffé test as the multiple-comparison procedure because it is somewhat conservative in its analysis, requiring a large difference between means to show significance. A number
of other multiple comparison techniques are also available to the researcher and can be selected in the SPSS analysis; the discussion of each is beyond the scope of this chapter.

Because ANOVA is an analytical method for comparing means, we begin the SPSS procedure as we have previously by selecting:

Analyze
  Compare Means
    One-Way ANOVA

For your reference, the menu options for ANOVA are shown in Appendix B, Figure B13.3.

The second step is to select the Post Hoc . . . button in the One-Way ANOVA window (shown in Figure B13.4). For this analysis, use the Scheffé multiple comparison technique by checking the appropriate box (displayed in Figure B13.5). Click the Continue button to conduct the analysis.
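If you want to reproduce this step outside SPSS, note that statsmodels does not ship a Scheffé procedure; the hedged sketch below (our own, with invented data) uses Tukey's HSD, another common post hoc test, to make every pairwise comparison while holding the overall error rate at α:

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.DataFrame({
    "gpa":  [2.1, 2.5, 2.3, 2.8, 2.4, 2.6, 2.9, 2.5, 3.2, 3.5, 3.1, 3.6],
    "econ": ["low"] * 4 + ["medium"] * 4 + ["high"] * 4,
})

# All pairwise group comparisons after the overall ANOVA.
result = pairwise_tukeyhsd(endog=df["gpa"], groups=df["econ"], alpha=0.05)
print(result)
```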
TABLE 13.5 • Overall ANOVA solution

ANOVA

CollGPA
                  Sum of Squares   df    Mean Square   F        Sig.
Between Groups    12.445           2     6.223         37.060   .000
Within Groups     20.484           122   .168
Total             32.929           124

SPSS produces a series of tables as output. The first table, shown in Table 13.5, shows the ratio of between-group variance to within-group variance, F = 37.060, and the associated probability value, p = .000. From this ANOVA, we can conclude that college GPA (CollGPA) differs for students at different economic levels (Econ). In the language of statistics, we can reject the null hypothesis of no difference between the students' GPAs; that is, students' GPAs appear to be dependent on their economic level. Notice we say "appear"; we never have definitive proof because inferential statistics simply provide an evaluation of probability. We can be quite confident of our conclusion, however, because we have a relatively high F statistic (37.060), which would occur by chance fewer than once in 1,000 samples (i.e., p = .000; remember, SPSS shows only three decimal places).

The Scheffé test for our comparisons is displayed in Table 13.6. This table shows the mean score of each group compared with that for each other group. For example, the first row shows a comparison between the low and the middle economic groups, and the second row shows a comparison between the low and the high economic groups. The difference between the mean scores is shown, along with the standard error of the difference and a probability value for the test.
TABLE 13.6 • SPSS summary table for Scheffé multiple comparison test

Multiple Comparisons

Dependent Variable: CollGPA
Scheffé

                                                               95% Confidence Interval
(I) Econ   (J) Econ   Mean Difference (I–J)   Std. Error   Sig.    Lower Bound   Upper Bound
1          2          −.01385                 .08732       .988    −.2302        .2025
           3          −.70290(*)              .08995       .000    −.9258        −.4800
2          1          .01385                  .08732       .988    −.2025        .2302
           3          −.68906(*)              .09414       .000    −.9223        −.4558
3          1          .70290(*)               .08995       .000    .4800         .9258
           2          .68906(*)               .09414       .000    .4558         .9223

*The mean difference is significant at the .05 level.
From this test, we can see that the GPAs of the students in the low-economic group do not differ from those of students in the middle-economic group (i.e., Row 1 in the table, Sig. = .988). However, the GPAs of the students in the low-economic group are significantly different from those of students in the high-economic group (i.e., Row 2 in the table, Sig. = .000). The positive and negative signs of the mean differences allow us to determine further that the students at the highest economic level had the highest mean GPA and the students at the lowest economic level had the lowest mean.

In this example, the multiple-comparison procedure allows us to identify that the overall difference shown by the ANOVA is due to the students at the higher economic level having higher GPAs than students at the middle and lower economic levels. Our findings match previously published research indicating that more economically advantaged students, as a group, are likely to have higher GPAs in college than students with fewer economic resources.

Multifactor Analysis of Variance

If a research study uses a factorial design to investigate two or more independent variables and the interactions between them, the appropriate statistical analysis is a factorial, or multifactor, analysis of variance. This analysis yields a separate F ratio for each independent variable and one for each interaction. When two independent variables are analyzed, the ANOVA is considered a two-way ANOVA; three independent variables make a three-way ANOVA, and so forth. In some analyses, two or more dependent variables are analyzed in a multivariate analysis of variance, or MANOVA. For example, suppose that we want to consider whether gender and economic level both affect students' college achievement. MANOVA would allow us to consider both independent variables (i.e., economic level, gender) and multiple dependent variables (e.g., college GPA as well as other test scores we may have from math or language classes). As you can imagine, however, we need a large data set to run increasingly complex analyses with multiple independent and dependent variables. For example, of the 125 students at Pacific Crest College, there are no women in the highest economic group who are at the lowest level of reading. Complex statistical analyses are not warranted without a larger sample size that has meaningful variation among the variables.

Although a factorial ANOVA is a more complex procedure to conduct and interpret than a one-way ANOVA, the basic process is similar. SPSS and other statistical packages provide the appropriate statistical tests; we simply specify the independent variables and the dependent variable for the analysis.
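As a sketch of what a factorial analysis looks like in code (our own, with a tiny invented dataset), statsmodels' anova_lm reports one F per main effect plus one for the interaction:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "gpa":    [2.2, 2.8, 2.4, 3.0, 2.6, 3.1, 2.9, 3.4],
    "econ":   ["low", "low", "low", "low", "high", "high", "high", "high"],
    "gender": ["f", "f", "m", "m", "f", "f", "m", "m"],
})

# Two-way (2 x 2) ANOVA: main effects for econ and gender,
# plus their interaction.
model = smf.ols("gpa ~ C(econ) * C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```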
Analysis of Covariance

Analysis of covariance (ANCOVA) is a form of ANOVA that accounts for the different ways in which the independent variables are measured, taking into account the design of the study. When a study has two or more dependent variables, multivariate analysis of covariance (MANCOVA) is an appropriate test. ANCOVA is used in two major ways: as a technique for controlling extraneous variables and as a means of increasing the power of a statistical test.

For controlling variables, use of ANCOVA is basically equivalent to matching groups on the variable or variables to be controlled. ANCOVA adjusts posttest scores for initial differences on a variable; in other words, groups are equalized with respect to the control variable and then compared. ANCOVA is thus similar to handicapping in bowling or golf. In an attempt to equalize teams, high scorers are given little or no handicap, and low scorers are given big handicaps. Any variable that is correlated with the dependent variable can be controlled for using covariance. Examples of variables commonly controlled using ANCOVA are pretest performance, IQ, readiness, and aptitude. By analyzing covariance, we are attempting to reduce the variation in posttest scores that is attributable to another variable. Ideally, we would like all posttest variance to be attributable to the treatment conditions.

ANCOVA is used both in causal–comparative studies, in which already-formed but not necessarily equal groups are involved, and in experimental studies, in which either existing groups or randomly formed groups are involved. Unfortunately, the situation for which ANCOVA is least appropriate is the situation for which it is most often used. Use of ANCOVA assumes that participants have been randomly assigned to treatment groups. Thus, it is best used in true experimental designs. If existing or intact groups are not randomly selected but are assigned to treatment groups randomly, ANCOVA may still be used, but results must be interpreted with caution. If ANCOVA is used with existing groups and nonmanipulated independent variables, as in causal–comparative studies, the results
are likely to be misleading at best. Other assumptions associated with the use of ANCOVA are not as serious if participants have been randomly assigned to treatment groups.

A second function of ANCOVA is that it increases the power of a statistical test by reducing within-group (error) variance. Power refers to the ability of a significance test to identify a true research finding (i.e., there's really a difference, and the statistical test shows a significant difference), allowing the experimenter to reject a null hypothesis that is false. In the language of statistics, increasing power reduces the likelihood that the experimenter will commit a Type II error. Because ANCOVA can reduce random sampling error by statistically equating different groups, it increases the power of the significance test. The power-increasing function of ANCOVA is directly related to the degree of randomization involved in the formation of the groups. Although increasing sample size also increases power, we are often limited to samples of a given size for financial and practical reasons (e.g., Mrs. Alvarez's class at Pinecrest Elementary includes only 25 students); ANCOVA thus is often the only way to increase power for a particular study.

SPSS and many other statistical software packages, of course, provide the procedures for conducting ANCOVA and MANCOVA. The procedures are somewhat similar, in that we designate the dependent and independent variables for analysis and the program produces tables of results. However, given the assumptions underlying the use of these complex statistics, researchers must be mindful of when and how these analyses should be employed. We cannot stress enough that analyses and their subsequent meanings need to be formulated and interpreted in relation to the research design and hypotheses you have formulated, not based exclusively on what appears on the computer screen. For example, the results of ANCOVA and MANCOVA are least likely to be valid when groups have not been randomly selected and assigned, yet in educational research we often are faced with this situation—can you imagine trying to seek permission by assuring parents that if their child is randomly assigned to the less successful method, we could always have the student repeat the same grade again next year with the more successful method? Often, reality clashes with our knowledge of the most appropriate research and statistics methods.
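For readers working outside SPSS, ANCOVA is commonly run as a linear model with the covariate included. The sketch below is a hedged illustration of our own (the pretest, posttest, and group columns are invented): the group effect on posttest scores is tested after adjusting for the pretest.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "posttest": [55, 61, 58, 70, 66, 74, 63, 72],
    "pretest":  [50, 54, 52, 60, 51, 58, 53, 59],
    "group":    ["old", "old", "old", "old", "new", "new", "new", "new"],
})

# ANCOVA: posttest compared across groups with the
# pretest entered as a covariate.
model = smf.ols("posttest ~ pretest + C(group)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```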
Multiple Regression

The more independent variables we have measured or observed, the more likely we are to explain the outcomes of the dependent variables. Multivariate statistical analyses tell us how much of the variance found in the outcome variable is attributable to the independent variables. Whereas ANOVA is the appropriate analysis when the independent variables are categorical, multiple regression is used with ratio or interval variables. Multiple regression combines variables that are known individually to predict (i.e., correlate with) the criterion into a prediction equation known as a multiple regression equation. Multiple regression is an extremely valuable procedure for analyzing the results of a variety of experimental, causal–comparative, and correlational studies because it determines not only whether variables are related but also the degree to which they are related. Understanding how variables are related is beneficial both for researchers and for groups needing to make data-based decisions.

Step-wise analysis is an often-used procedure for regression because it allows us to enter or omit predictor variables in the regression equation step by step (i.e., one variable at a time). We can see which of the predictor variables are making the most significant contribution to the criterion variable, and we can remove variables from our predictive model if they are not making a significant contribution.

Multiple regression is also the basis for path analysis, which begins with a predictive model (see Figure 13.7). Path analysis identifies the degree to which predictor variables interact with each other and contribute to the variance of the dependent variables. Basically, path analysis involves multiple regressions between and among all the variables in the model and then specifies the direct and indirect effects of the predictor variables on the criterion variable. Although somewhat more complex to calculate than a simple multiple regression, path analysis provides an excellent picture of the causal relations among all the variables in a predictive model.

As an example of multiple regression, we use the Pacific Crest College dataset (see Appendix B, Table B13.1) that we used previously with ANOVA. A distinct advantage of multiple regression is that we must create a predictive model that posits in advance which variables predict the criterion variable or variables, as illustrated in Figure 13.7. In our example, we consider the effects of high school
math score (Math), high school language score (Lang), and high school GPA (HSGPA) on Pacific Crest College students' GPA. We consider only the direct effects of each variable in the model on the single criterion variable.

FIGURE 13.7 • Multiple regression model for Pacific Crest College
[Independent variables: High School GPA (HSGPA), SAT, High School Math Score (Math), and High School Language Score (Lang), each with a direct path to the dependent variable, College GPA (CollGPA).]

At Pacific Crest College, several groups are interested in determining the variables that best predict a student's college GPA. The Admissions Office at Pacific Crest College, for example, wants to make correct decisions by admitting students who will most likely be successful. High school counselors who recommend the college also would like to know what they can do at the high school level to increase a student's chance for success in college—are any of the variables that predict success ones that they can control (i.e., are any variables malleable)? In fact, two such variables may be influenced by the high school counselors: math and language scores. If our multiple regression shows that higher math or language scores are associated with a higher college GPA, we may recommend to the high school counselors that they encourage students to improve their math and language knowledge. If reading and math skills are predictors of college success, then it behooves students to improve these skills while they are still in high school.

Our basic question for the multiple regression analysis is: What are the best predictors of college GPA? If we had a large number of variables from which to choose, we would select the variables that have the greatest likelihood of predicting success, based on previous research. The results of multiple regression would give us the answer to our question and would also tell us the relative contribution of each predictor variable to the criterion variable. For example, we may find that high school GPA, language score, and SAT scores are the best predictors of college GPA, with math score as the variable contributing the least. We could then run another multiple regression excluding math score, or we could add other variables to our multiple regression equation. At this point we have a choice of procedures for the multiple regression: We can enter variables one at a time, or we can enter them all at once, as shown in the example following. The outcomes would be similar; we just have several choices of how we build and interpret them.
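Before walking through the SPSS steps, here is the same all-at-once approach (SPSS's Enter procedure) as a hedged statsmodels sketch of our own; the miniature dataset is invented, with columns named after the variables in Figure 13.7:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "CollGPA": [2.8, 3.4, 2.5, 3.9, 3.1, 2.2, 3.6, 2.9],
    "HSGPA":   [3.0, 3.6, 2.7, 3.9, 3.2, 2.4, 3.7, 3.0],
    "SAT":     [1010, 1180, 950, 1320, 1100, 900, 1250, 1040],
    "Math":    [61, 72, 55, 88, 70, 50, 80, 63],
    "Lang":    [64, 75, 58, 90, 68, 52, 84, 66],
})

# All four predictors entered together, as in the Enter procedure.
model = smf.ols("CollGPA ~ HSGPA + SAT + Math + Lang", data=df).fit()
print(model.rsquared)  # proportion of variance in CollGPA explained
print(model.params)    # regression coefficients for each predictor
```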
Calculating Multiple Regression Using SPSS 18

Statistical software packages, such as SPSS, provide various choices for conducting multiple regression, and some are quite complex. Because our goal in this chapter is to give you a conceptual understanding of inferential statistics, we provide a simplified analysis of the model illustrated in Figure 13.7. Other options for accomplishing our analysis include a complete step-wise regression, where we enter independent variables one at a time to consider their cumulative effects, or a path analysis, where we would consider all the multiple effects between the predictor variables and the criterion variable. For the purposes of our example, we enter all the predictor variables together to consider their cumulative effect—the basic multiple regression procedure called Enter.

As with all statistical analyses, we specify our variables and follow the SPSS options as follows (refer to Appendix B, Figure B13.6 and Figure B13.7, for the Linear Regression window):

Analyze
  Regression
    Linear
Once the Linear Regression window is open, we select the criterion (labeled by SPSS as dependent) variable (CollGPA) from the list of variables in the box on the left. Next, we select the four predictor (labeled by SPSS as independent) variables (HSGPA, SAT, Math, Lang) by highlighting each variable and clicking on the arrow to move the variables into the appropriate box. Underneath the independent variables box we select a multiple regression procedure; we have selected Enter for this analysis. Also notice that the box below the independent variables is labeled Selection Variable. Although multiple regression uses ratio or interval variables, SPSS will run separate analyses using a nominal variable, if included in the Selection Variable box. This option allows us to compare multiple regression analyses by gender or economic level, for example, both of which are included in our Pacific Crest College dataset and in the variables window on the left in Figure B13.7. If we wanted to compare the multiple regression outcomes of the males and females on college GPA, we would select gender as the selection variable and then, with the Rule button, select 1 for males as our first analysis. For the next analysis we would then simply select 2 to conduct the multiple regression test for the females. Multiple regression also allows the user to transform a nominal variable into an interval variable, known as a dummy variable. For example, a nominal variable, such as gender, can be coded as 0 and 1 and entered into the multiple regression as if it were an interval variable. The interpretation of a dummy variable depends on its coding in the multiple regression equation and has meaning only for purposes of the statistical calculation.
The multiple regression summary output of our gives us the individual effect of each variable in the
example for predicting college GPA is shown in model (including the “constant” or “Y” value, which
Table 13.7. The complete model yields an R value is part of the regression equation2). The strongest
of .893, which is a calculation by SPSS from the predictors in the model are language score (Lang;
multiple regression equation. This R value is quite t 5 13.240, Sig. 5 .000) and high school GPA (HS-
helpful because when it is squared (R2 5 .798), it GPA; t 5 4.366, Sig. 5 .000).
provides the percentage of variance in the criterion Additionally, SPSS provides individual weights
variable explained by the predictor variables—the or coefficients to explain the contribution each
four predictor variables explain 79.8% of the vari- variable has on the criterion. Coefficients are cal-
ance in college GPA. A simple way to explain this culated as unstandardized B and as standardized
finding is that, first, we know that there are many beta, which accounts for the standard error. In
reasons that some students have higher GPAs or this example, the high beta weight (.792) for the
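For readers who want to reproduce this kind of analysis outside SPSS, the same Enter-style regression can be run with open-source tools. The following is a minimal sketch, assuming Python with pandas and statsmodels and a hypothetical file named pacific_crest.csv whose columns mirror the variables used here; it illustrates the idea rather than the textbook's exact procedure.

```python
# Sketch: an "Enter" multiple regression (all predictors entered at once) in Python.
# pacific_crest.csv is a hypothetical file with columns CollGPA, HSGPA, SAT, Math, Lang.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("pacific_crest.csv")

X = sm.add_constant(df[["HSGPA", "SAT", "Math", "Lang"]])  # predictors plus the constant
y = df["CollGPA"]                                          # criterion variable

model = sm.OLS(y, X).fit()
print(model.summary())  # reports R-squared, the F ratio, and B, t, and Sig. per predictor

# Standardized (beta) weights: rerun the regression on z-scored variables.
cols = ["CollGPA", "HSGPA", "SAT", "Math", "Lang"]
z = (df[cols] - df[cols].mean()) / df[cols].std()
betas = sm.OLS(z["CollGPA"], z[["HSGPA", "SAT", "Math", "Lang"]]).fit().params
print(betas)            # comparable to SPSS's standardized beta column
```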
TABLE 13.8 • Multiple regression: analysis of variance and of coefficients

ANOVA(b)

Model 1       Sum of Squares   df    Mean Square   F         Sig.
Regression        26.277         4      6.569      118.494   .000(a)
Residual           6.653       120       .055
Total             32.929       124

a Predictors: (Constant), LANG, SAT, HSGPA, MATH
b Dependent Variable: CollGPA

Coefficients(a)

Model 1      B (unstandardized)   Std. Error   Beta (standardized)   t         Sig.
(Constant)        1.068              .139                             7.698     .000
HSGPA              .196              .045             .211            4.366     .000
SAT            3.96E-005             .000             .007             .137     .892
MATH                                 .002            −.016            −.263     .793
LANG               .022              .002             .792           13.240     .000

a Dependent Variable: CollGPA

Additionally, SPSS provides individual weights, or coefficients, to explain the contribution each variable makes to the criterion. Coefficients are calculated as unstandardized B and as standardized beta; because betas are expressed in standard deviation units, they can be compared directly across predictors. In this example, the high beta weight (.792) for the language score (Lang) also shows it is the strongest predictor. SAT, in contrast, is not a very good indicator of college GPA for these students in this particular regression model—the beta weight for the SAT score is quite low (.007), as is its t value (.137); the probability that the finding is due to chance is quite high (.892). Our regression model also suggests that math score is not a good predictor of college GPA (beta = −.016, t = −.263, Sig. = .793). Note, however, that the significance level of a variable in a multiple regression model depends on the model—in particular, on the other variables included in the model—especially when there are strong relations between variables. It is important to consider all combinations of variables in order to understand the effect of each individual variable.

Additionally, it is important to remember that the results shown for Lang, HSGPA, SAT, and Math are for this sample only. Other samples using different regression models will likely show different results. The design and validity of the study are important—we want to include the variables that have the most meaning or best predict the criterion variable. Our regression models are only as good as the data we collect and the choices we make regarding the variables to include in our analyses.

Nevertheless, this example suggests that students' language scores in high school and their high school GPAs are highly effective predictors of GPA at Pacific Crest College. In contrast, their SAT and math scores add very little to the accuracy of predicting their college GPAs. Based on this analysis, we can tell the Admissions Office that the students with the best chances of success in college are most likely those with higher language scores and high school GPAs. Of course, we make this interpretation cautiously because we did not study all variables that can possibly affect college success. Obviously, some high school students with high language scores and high GPAs will not succeed at college for other reasons; for example, psychological, personal, or family effects can contribute to college success or failure. However, considering the variables we have measured and over which we know counselors and students have some control (e.g., improving language skills), we can offer advice about two very good predictors of college success at Pacific Crest. We cannot, however, make predictions for other colleges because Pacific Crest students may be different from other college students.
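One way to see what a standardized beta is: it is simply the unstandardized B rescaled by the ratio of the predictor's standard deviation to the criterion's standard deviation. A minimal sketch follows; the standard deviations in it are placeholders, not values from the Pacific Crest dataset.

```python
# Converting an unstandardized coefficient (B) to a standardized beta:
# beta = B * (SD of predictor / SD of criterion).
def beta_from_b(b: float, sd_predictor: float, sd_criterion: float) -> float:
    return b * (sd_predictor / sd_criterion)

# Placeholder standard deviations, for illustration only.
print(round(beta_from_b(0.196, 0.42, 0.51), 3))
```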
Chi Square

Chi square, symbolized as χ², is a nonparametric test of significance appropriate when the data are in the form of frequency counts or percentages and proportions that can be converted to frequencies.
It is used to compare frequencies occurring in different categories or groups. Chi square is thus appropriate for nominal data that fall into either true categories or artificial categories. A true category is one into which persons or objects naturally fall, independent of any research study (e.g., male vs. female), whereas an artificial category is one that is operationally defined by a researcher (e.g., tall vs. short). Two or more mutually exclusive categories are required. Because in educational research we are often interested in the effects of nominal variables, such as race, class, or religion, chi square offers an excellent analytical tool.

Simple frequency counts for the variables under consideration are often presented in contingency tables, such as those shown in Chapter 12, Tables 12.3 and 12.4. Whereas a contingency table by itself presents basic descriptive data, a chi-square analysis helps determine whether any observed differences between the variables are meaningful; it is computed by comparing the frequencies of each variable observed in a study to the expected frequencies. Expected frequencies are usually the frequencies that would be expected if the groups were equal (i.e., no difference between groups), although occasionally they also may be based on past data. The chi-square value increases as the difference between observed and expected frequencies increases; large chi-square values indicate statistically significant differences.

As an example of a chi-square analysis, we consider the relation between gender and reading level for the students at Pacific Crest College. We use chi square because we have two nominal variables: gender (i.e., male, female) and reading level (i.e., low, medium, high). Reading level (ReadLevel) is a composite variable that considers the language score, reading fluency, and a placement assessment of reading and language ability when the students started college. Reading could be considered an ordinal variable because the reading levels are in order from low to medium to high. However, because reading level is a composite of both qualitative and quantitative considerations, the distance between low and medium is not likely the same as between medium and high. For purposes of our example, then, ReadLevel is considered nominal and should be analyzed with a nonparametric measure, such as chi square.

As illustrated in Table 13.9, the basic contingency table shows the distribution of males and females at each of the three reading levels. We are interested in whether the pattern for the males (i.e., the distribution across reading level) is significantly different from the pattern for the females, and on first inspection it appears so—only one female is at ReadLevel 1, whereas 18 males are; 44 females are at ReadLevel 3, whereas only 7 males are. Although our data suggest differences, we do not yet know if these differences are meaningful until we consider the outcome of the chi-square analysis. In the language of statistics, a significant chi square tells us that these variables (i.e., gender and reading level) are not independent.

To determine if the variables are independent or not, we compare the frequencies we actually observed (symbolized as O) with the expected frequencies (symbolized as E). The expected frequencies are the numbers we would find if the variables were independent—in other words, if the pattern of distribution across reading level were the same for the males and the females. The expected frequencies, therefore, reflect the null hypothesis of no difference. The expected frequencies and percentage distributions are presented in the expanded crosstabulation shown in Table 13.10.

Although we typically conduct chi square on the computer, the hand calculation is manageable when we have a simple cross-tabulation table, such as Table 13.9. The formula (which is also used by statistical programs such as SPSS) is:

χ² = Σ [ (fₒ − fₑ)² / fₑ ]

In this formula fₒ is the observed frequency, and fₑ is the expected frequency.

TABLE 13.9 • Contingency table of gender and reading level

Gender * ReadLevel Crosstabulation (Count)

            ReadLevel 1   ReadLevel 2   ReadLevel 3   Total
Gender 1         18            34             7         59
Gender 2          1            21            44         66
Total            19            55            51        125
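Because the hand formula is so simple, it is easy to script. The following is a minimal sketch, assuming Python with scipy installed, applied to the Table 13.9 counts; it reproduces (within rounding) the Pearson chi-square value of 44.875 and the df of 2 reported for this example in the discussion that follows.

```python
# Chi square by hand for the Table 13.9 counts, plus a scipy cross-check.
# Rows: gender (1 = male, 2 = female); columns: ReadLevel 1, 2, 3.
from scipy.stats import chi2_contingency

observed = [[18, 34, 7],
            [1, 21, 44]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / n  # expected count under independence
        chi_sq += (fo - fe) ** 2 / fe           # accumulate (fo - fe)^2 / fe

print(round(chi_sq, 3))  # about 44.88; the SPSS output reports 44.875

# scipy gives the same statistic, df = (rows - 1)(columns - 1) = 2, and the p value.
stat, p, dof, expected = chi2_contingency(observed)
print(round(stat, 3), dof, p)
```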
As with other hand calculations, we refer to a statistical table to determine whether the value computed for χ² is significant (see Table A.6 in Appendix A for the distribution of chi square), taking into account the degrees of freedom and probability level (i.e., the level of risk we accept that this finding would occur by chance). Degrees of freedom for chi square are computed by multiplying the number of rows in the contingency table, minus one (i.e., the number of levels of one variable, minus one), by the number of columns, minus one (i.e., the number of levels of the other variable, minus one): df = (R − 1)(C − 1). In this example, then, df = (3 − 1)(2 − 1) = 2.

Calculating Chi Square Using SPSS 18

To specify the chi-square statistic for a given set of data, we go to the Descriptive Statistics submenu by clicking Analyze (because chi square is a nonparametric statistic, it is listed with descriptive statistics). Within this submenu, we choose the Crosstabs . . . option. Chi square is one of the few analysis procedures in SPSS that does not have a dedicated menu option. We display the SPSS menu options in Appendix B, Figure B.13.8, but they can be summarized as follows:

Analyze
Descriptive Statistics
Crosstabs . . .

Once in the Crosstabs window, we need to specify the variables to go in the rows and columns of the table; in this example they are gender and reading level, as shown in Figure B.13.9. To compute chi square, we click on the Statistics button at the bottom of the window and then select the Chi-square statistic button in the next screen. Finally, we click the Continue button in the upper right of this screen to complete the analysis. If we want to display the expected frequencies in each cell, we can return to the Crosstabs window and click on the Cells button (shown in Figure B.13.9).

The first table SPSS generates is the crosstabulation table, as illustrated in Table 13.9, which shows the observed values for each cell. The next table presents further information on the expected frequencies and percentage of students in each cell in the contingency table (Table 13.10). These percentages are helpful to interpret the meaning of the chi-square analysis.

TABLE 13.10 • SPSS crosstabulation table of gender and reading level with percentages

Gender * ReadLevel Crosstabulation

                              ReadLevel 1   ReadLevel 2   ReadLevel 3   Total
Gender 1  Count                    18            34             7         59
          Expected Count           9.0           26.0          24.1       59.0
          % within Gender          30.5%         57.6%         11.9%     100.0%
          % within ReadLevel       94.7%         61.8%         13.7%      47.2%
          % of Total               14.4%         27.2%          5.6%      47.2%
Gender 2  Count                     1            21            44         66
          Expected Count          10.0           29.0          26.9       66.0
          % within Gender           1.5%         31.8%         66.7%     100.0%
          % within ReadLevel        5.3%         38.2%         86.3%      52.8%
          % of Total                 .8%         16.8%         35.2%      52.8%
Total     Count                    19            55            51        125
          Expected Count          19.0           55.0          51.0      125.0
          % within Gender          15.2%         44.0%         40.8%     100.0%
          % within ReadLevel      100.0%        100.0%        100.0%     100.0%
          % of Total               15.2%         44.0%         40.8%     100.0%

The outcome of the chi-square calculation is presented in Table 13.11. The first line shows a Pearson chi-square value, χ² = 44.875, which yields a significance level of .000. Although SPSS provides a larger variety of statistical computations, the Pearson chi-square is adequate for our purposes of determining the relation between gender and reading level. With the significant chi-square statistic, we can conclude that gender and reading level are not independent—in other words, the patterns for the males are different from the patterns for the females. Males and females are not distributed in the same way at each reading level.

The Pearson chi-square value tells us, however, only that the patterns are not the same; it does not tell us how they differ. In other words, we don't know if the number of males differs significantly
from the number of females at each reading level or only at some of the reading levels. We need to go back to the crosstabulation table (i.e., Table 13.10), which shows that a higher proportion of the females is found at the highest level of reading. In this example, 44 females are at the highest level of reading; these 44 females account for 86.3% of the students (males and females) found at the highest level of reading. Clearly, the females greatly outnumber the males. Likewise, 18 males and only 1 female are at the lowest level of reading; that is, 94.7% of the students in the lowest level of reading are males.

TABLE 13.11 • Chi-square analysis of gender and reading level

Chi-Square Tests

                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             7.747(a)    2   .021
Likelihood Ratio               8.005       2   .018
Linear-by-Linear Association   6.783       1   .009
N of Valid Cases             125

a 0 cells (.0%) have expected count less than 5. The minimum expected count is 8.02.

To summarize, we have found from the chi-square analysis that the observed distribution of students across gender and reading level is not what would have been expected simply due to chance. The two variables, gender and reading level, are not independent. From this finding with the Pacific Crest College students, we may conclude that males need additional reading help; we may consider improving the high school language curriculum, providing a remedial reading program for first-year males, or providing support services accordingly.

Chi square may also be used with more than two variables. Because contingency tables can be of two, three, or more dimensions, depending on the number of variables, a multidimensional chi square can be thought of as a factorial chi square. Of course, as the contingency table gets larger by adding more variables, interpreting the chi square becomes more complex. We are also limited by sample size if we want to expand the number of variables to include in a contingency table. For example, because we have only 125 students in our Pacific Crest College sample, we can very easily have almost as many cells in the table as we have students. If we are interested in looking at the reading levels of students sorted by economic level, gender, and ethnicity, we quickly run out of students to fill all the cells in the contingency table. Gender has two values, reading level three values, economic level three levels, and ethnicity five—this table would have 90 cells (2 × 3 × 3 × 5), and we only have 125 students to fill the table. It's possible that some cells would have no students that fit that combination of variables (e.g., a White, high-economic-level woman at the lowest reading level). Obviously, we need more students or fewer variables to conduct an appropriate chi square.

Other Investigative Techniques: Data Mining, Factor Analysis, and Structural Equation Modeling

In addition to some of the standard statistical tests we have presented, a number of other valuable analytical tools are extremely helpful, depending on the purpose of the research and the data available. Data mining, as an example, uses analytical tools to identify and predict patterns in datasets or large warehouses of data that have been collected from thousands of subjects and about hundreds of variables. Data mining is used often in business and scientific research to discover relations and predict patterns among variables and outcomes. In business, for example, data mining techniques may be used to discover purchasing patterns—who buys what products, how often, and for what purposes—to identify where advertisements should be placed and the products that should be sold in particular stores. Likewise, credit card companies are interested in who makes which purchases from which stores and how often. These buying patterns are also important for security reasons to detect fraudulent purchases, as when thousands of dollars of video game purchases are charged to an 80-year-old's credit card. Obviously, quite sophisticated statistical techniques are needed to test multiple hypotheses with such large databases. Among other statistical software packages, SAS and SPSS offer data-mining procedures in the more
advanced versions. "Clementine," for example, is the data-mining procedure available on the full version of SPSS. Newer advancements in data mining include text mining and Web mining procedures to provide predictive models beyond the data the researcher or business has collected.

Factor analysis is a statistical procedure used to identify relations among variables in a correlation matrix. Basically, factor analysis determines how variables group together based on what they may have in common. Factor analysis is commonly used to reduce a large number of responses or questions to a few more meaningful groupings, known as factors. For example, we may give our students or subjects a 100-question personality inventory. To reduce these 100 responses to a manageable number, we can perform a factor analysis to identify several key factors that the responses have in common. A number of psychological inventories, such as the Minnesota Multiphasic Personality Inventory (MMPI), were created with the assistance of factor analysis. Responses to the MMPI are scored on 10 scales that represent indicators of factors such as schizophrenia, depression, and hysteria. However, factor analysis indicates only how the responses group together; the names and meaning of the factors must be determined by the researchers. The intelligence quotient (IQ) was also created through factor analysis. Because the IQ itself represents one factor that emerged from factor analysis, a number of scholars are quite critical of how well the IQ actually measures a concept as complex as intelligence.² Interpreting the meaning of factors is challenging—factor analysis may be as much an art form as a statistical analysis.
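For the curious, exploratory factor analysis is also available in open-source libraries. The sketch below is a generic, hedged illustration using Python's scikit-learn on random placeholder responses, not the MMPI or any inventory discussed here; it simply shows the shape of the procedure.

```python
# Sketch: exploratory factor analysis with scikit-learn on placeholder data.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
responses = rng.normal(size=(200, 20))  # 200 respondents x 20 questionnaire items

fa = FactorAnalysis(n_components=3)     # ask for three underlying factors
fa.fit(responses)

# Loadings show how strongly each item relates to each factor. Naming and
# interpreting the factors remains the researcher's job, as the chapter notes.
loadings = fa.components_.T             # shape: (items, factors)
print(loadings.round(2))
```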
Structural Equation Modeling (SEM) can be conducted by several software programs; the most widely used is LISREL (Linear Structural Relationships), available on SPSS. LISREL can be thought of as an ultra combination of path analysis and factor analysis: It is an extremely complicated procedure that builds a structural model to explain the interactive relations among a relatively large number of variables. The distinct advantage of LISREL is that it begins with the creation of a complex path model that considers multiple relations among independent and dependent variables as well as latent variables that are unobserved but responsible for measurement error. Factor analysis yields groupings of variables, or factors, that are tested with path analysis (i.e., multiple regression) to show the strength of the factors in the model. The disadvantage of LISREL is that it requires a large dataset and is quite complex to interpret. Nonetheless, for the advanced researcher it is a powerful tool that uses the best capabilities of path analysis and factor analysis.

² See Gould, Stephen Jay, The Mismeasure of Man (New York: W.W. Norton, 1981, 1996); and Gardner, Howard, Frames of Mind: The Theory of Multiple Intelligences (New York: Basic Books, 1983).

Types of Parametric and Nonparametric Statistical Tests

There are too many parametric and nonparametric statistical methods to describe in detail here. Table 13.12 provides an overview of some of the more commonly used tests and their associated purposes. The table is best used by first identifying the levels of measurement of the study. Then examine the purpose statements that fit the levels of measurement and select the one that provides the best match. Other information in the table will also help in carrying out the appropriate significance test. Of course, researchers should only use statistical tests if they can confidently justify their use and interpret the outcomes. Many a graduate student has needlessly suffered in a thesis defense when trying to explain an overly complex statistical procedure that is unfamiliar. Select appropriate procedures you understand and be parsimonious in your explanations.
TABLE 13.12 • Commonly used parametric and nonparametric significance tests

t test for independent samples. Test statistic: t. df: n1 + n2 − 2. Type: parametric (P). Purpose: test the difference between means of two independent groups. Variable 1 (independent): nominal. Variable 2 (dependent): interval or ratio.

t test for dependent samples. Test statistic: t. df: N − 1. Type: parametric (P). Purpose: test the difference between means of two dependent groups. Variable 1: nominal. Variable 2: interval or ratio.

Analysis of variance. Test statistic: F. df: SSB = groups − 1; SSW = participants − groups. Type: parametric (P). Purpose: test the difference among three or more independent groups. Variable 1: nominal. Variable 2: interval or ratio.

Pearson product-moment correlation. Test statistic: r. df: N − 2. Type: parametric (P). Purpose: test whether a correlation is different from zero (a relationship exists). Variable 1: interval or ratio. Variable 2: interval or ratio.

Chi-square test. Test statistic: χ². df: (rows − 1)(columns − 1). Type: nonparametric (NP). Purpose: test the difference in proportions in two or more groups. Variable 1: nominal. Variable 2: nominal.

Median test. Test statistic: χ². df: (rows − 1)(columns − 1). Type: nonparametric (NP). Purpose: test the difference of the medians of two independent groups. Variable 1: nominal. Variable 2: ordinal.

Mann-Whitney U test. Test statistic: U. df: N − 1. Type: nonparametric (NP). Purpose: test the difference of the medians of two independent groups. Variable 1: nominal. Variable 2: ordinal.

Wilcoxon signed rank test. Test statistic: Z. df: N − 2. Type: nonparametric (NP). Purpose: test the difference in the ranks of two related groups. Variable 1: nominal. Variable 2: ordinal.

Kruskal-Wallis test. Test statistic: H. df: groups − 1. Type: nonparametric (NP). Purpose: test the difference in the ranks of three or more independent groups. Variable 1: nominal. Variable 2: ordinal.

Friedman test. Test statistic: χ². df: groups − 1. Type: nonparametric (NP). Purpose: test the difference in the ranks of three or more dependent groups. Variable 1: nominal. Variable 2: ordinal.

Spearman rho. Test statistic: rho. df: N − 2. Type: nonparametric (NP). Purpose: test whether a correlation is different from zero. Variable 1: ordinal. Variable 2: ordinal.
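Most of the tests in Table 13.12 have counterparts in common statistical libraries. The sketch below, a hedged illustration assuming Python with scipy, uses made-up scores only to show where each test lives; it is not an analysis of any dataset in this chapter.

```python
# Sketch: scipy.stats counterparts for several tests listed in Table 13.12.
from scipy import stats

a = [12, 15, 14, 10, 11]  # made-up scores for group A
b = [18, 20, 17, 19, 16]  # made-up scores for group B
c = [13, 14, 12, 15, 13]  # made-up scores for group C (paired with A where needed)

print(stats.ttest_ind(a, b))             # t test for independent samples
print(stats.ttest_rel(a, c))             # t test for dependent (paired) samples
print(stats.f_oneway(a, b, c))           # simple (one-way) analysis of variance
print(stats.pearsonr(a, c))              # Pearson correlation
print(stats.mannwhitneyu(a, b))          # Mann-Whitney U test
print(stats.wilcoxon(a, c))              # Wilcoxon signed rank test
print(stats.kruskal(a, b, c))            # Kruskal-Wallis test
print(stats.friedmanchisquare(a, b, c))  # Friedman test
print(stats.spearmanr(a, b))             # Spearman rho
```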
SUMMARY

CONCEPTS UNDERLYING INFERENTIAL STATISTICS

1. Inferential statistics deal with inferences about populations based on the behavior of samples. Inferential statistics are used to determine how likely it is that results based on a sample or samples are the same results that would have been obtained for the entire population.

2. The degree to which the results of a sample can be generalized to a population is always expressed in terms of probabilities, not in terms of proof.

Standard Error

3. Expected, chance variation among means is referred to as sampling error. The question that guides inferential statistics is whether observed differences are real or only the result of sampling errors.

4. A useful characteristic of sampling errors is that they are usually normally distributed. If a sufficiently large number of equal-sized large samples are randomly selected from a population, the means of those samples will be normally distributed around the population mean. The mean of all the sample means will yield a good estimate of the population mean.

5. A distribution of sample means has its own mean and its own standard deviation. The standard deviation of the sample means (i.e., the standard deviation of sampling errors) is usually called the standard error of the mean, SE(X̄).

6. In a normal curve, approximately 68% of the sample means will fall between plus and minus one standard error of the population mean, 95% will fall between plus and minus two standard errors, and 99+% will fall between plus and minus three standard errors.

7. In most cases, we do not know the mean or standard deviation of the population, so we estimate the standard error with the formula

SE(X̄) = SD / √(N − 1)
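A quick sketch of item 7's estimate in Python, using made-up scores; the SD here is computed with N in the denominator, an assumption consistent with pairing it with the N − 1 correction in this formula.

```python
# Estimating the standard error of the mean: SE = SD / sqrt(N - 1).
import math

scores = [84, 110, 120, 98, 105]  # made-up sample of scores
n = len(scores)
mean = sum(scores) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)  # SD with N denominator
se = sd / math.sqrt(n - 1)
print(round(mean, 1), round(sd, 2), round(se, 2))
```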
8. The smaller the standard error of the mean, the less sampling error. As the size of the sample increases, the standard error of the mean decreases. A researcher should make every effort to acquire as large a sample as possible.

9. Standard error can also be calculated for other measures of central tendency, as well as for measures of variability, relationship, and relative position. Standard error can also be determined for the difference between means.

Hypothesis Testing

10. Hypothesis testing is a process of decision making in which researchers evaluate the results of a study against their original expectations. In short, hypothesis testing is the process of determining whether to reject the null hypothesis (i.e., no meaningful differences, only those due to sampling error) in favor of the research hypothesis (i.e., the groups are meaningfully different; one treatment is more effective than another).

11. Because we can never completely control all the factors that may be responsible for the outcome or test all the possible samples, we can never prove a research hypothesis. However, if we can reject the null hypothesis, we have supported our research hypothesis, gaining confidence that our findings reflect the true state of affairs in the population.

Tests of Significance

12. A test of significance is a statistical procedure in which we determine the likelihood (i.e., probability) that the results from our sample are just due to chance. Significance refers to a selected probability level that indicates how much risk we are willing to take if the decision we make is wrong.

13. When conducting a test of significance, researchers set a probability level at which they feel confident that the results are not simply due to chance. This level of significance is known as alpha, symbolized as α. The
smaller the probability level, the less likely it is that this finding would occur by chance.

14. The standard preselected probability level used by educational researchers is usually 5 out of 100 chances that the observed difference occurred by chance (symbolized as α = .05).

Two-Tailed and One-Tailed Tests

15. Tests of significance can be either one-tailed or two-tailed. "Tails" refer to the extreme ends of the bell-shaped curve of a sampling distribution.

16. A one-tailed test assumes that a difference can occur in only one direction; the research hypothesis is directional. To select a one-tailed test of significance, the researcher should be quite sure that the results can occur only in the predicted direction.

17. A two-tailed test assumes that the results can occur in either direction; the research hypothesis is nondirectional.

18. When appropriate, a one-tailed test has one major advantage: It is statistically "easier" to obtain a significant difference when using a one-tailed test.

Type I and Type II Errors

19. A Type I error occurs when the null hypothesis is true, but the researcher rejects it, believing—incorrectly—that the results from the sample are not simply due to chance. For example, the groups aren't different, but the researcher incorrectly concludes that they are.

20. A Type II error occurs when the null hypothesis is false, but the researcher fails to reject it, believing—incorrectly—that the results from the sample are simply due to chance. For example, the groups are different, but the researcher incorrectly concludes that they aren't.

21. The preselected probability level (alpha) determines the probability of committing a Type I error, that is, of rejecting a null hypothesis that is really true.

22. As the probability of committing a Type I error decreases, the probability of committing a Type II error increases; that is, the probability of not rejecting a null hypothesis when you should increases.

23. The consequences of committing a Type I error thus affect the decision about level of significance for a particular study.

Degrees of Freedom

24. Degrees of freedom are important in determining whether the results of a study are statistically significant. Each test of significance has its own formula for determining degrees of freedom based on such factors as the number of subjects and the number of groups.

SELECTING AMONG TESTS OF SIGNIFICANCE

25. Different tests of significance are appropriate for different types of data. The first decision in selecting an appropriate test of significance is whether a parametric test or nonparametric test should be selected.

26. Parametric tests are more powerful and appropriate when the variable measured is normally distributed in the population and the data represent an interval or ratio scale of measurement. Parametric tests also assume that participants are randomly selected for the study and that the variances of the population comparison groups are equal.

27. Nonparametric tests make no assumptions about the shape of the distribution and are used when the data represent an ordinal or nominal scale, when a parametric assumption has been greatly violated, or when the nature of the distribution is not known.

The t Test

28. The t test is used to determine whether two groups of scores are significantly different at a selected probability level. The basic strategy of the t test is to compare the actual difference between the means of the groups (X̄₁ − X̄₂) with the difference expected by chance if the null hypothesis (i.e., no difference) is true. This ratio is known as the t value.

29. If the t value is equal to or greater than the value statistically established for the predetermined significance level, we can reject the null hypothesis.
30. The t test is adjusted for the fact that the distribution of scores for small samples becomes increasingly different from a normal distribution as sample sizes become increasingly smaller.

31. The t test for independent samples is used to determine whether, at a selected probability level, a significant difference exists between the means of two independent samples. Independent samples are randomly formed without any type of matching.

32. The t test for nonindependent samples is used to compare groups that are formed by some type of matching or to compare a single group's performance on two occasions or on two different measures.

Analysis of Gain or Difference Scores

33. Subtracting each participant's pretest score from his or her posttest score results in a gain, or difference, score. Analyzing difference scores is problematic because every participant does not have the same potential to gain.

Simple Analysis of Variance

34. Simple, or one-way, analysis of variance (ANOVA) is used to determine whether scores from two or more groups are significantly different at a selected probability level.

35. In ANOVA, the total variance of scores is attributed to two sources—variance between groups (variance caused by the treatment or other independent variables) and variance within groups (error variance). ANOVA involves a ratio, known as F, with variance between groups as the numerator and error variance as the denominator. If the variance between groups is much greater than the variance within groups, greater than would be expected by chance, the ratio will be large, and a significant effect will be apparent.

Multiple Comparisons

36. Because ANOVA tells the researcher only that the groups are not all the same, a test involving multiple comparisons is needed to determine how the groups differ. Multiple comparisons involve calculation of a special form of the t test that adjusts the error rate.

37. The comparisons to be examined should be planned before, not after, the data are collected.

38. Of the many multiple comparison techniques available, a commonly used one is the Scheffé test, which is a very conservative test.

Multifactor Analysis of Variance

39. Multifactor analysis of variance is appropriate if a research study is based on a factorial design and investigates two or more independent variables and the interactions between them. This analysis yields a separate F ratio for each independent variable and one for each interaction.

Analysis of Covariance

40. Analysis of covariance (ANCOVA) is a form of ANOVA used for controlling extraneous variables. ANCOVA adjusts posttest scores for initial differences on some variable and compares adjusted scores.

41. ANCOVA is also used as a means of increasing the power of a statistical test. Power refers to the ability of a significance test to identify a true research finding, allowing the experimenter to reject a null hypothesis that is false.

42. ANCOVA is based on the assumption that participants have been randomly assigned to treatment groups. It is therefore best used in conjunction with true experimental designs. If existing, or intact, groups are involved but treatments are assigned to groups randomly, ANCOVA may still be used but results must be interpreted with caution.

Multiple Regression

43. Multiple regression combines variables that are known individually to predict (i.e., correlate with) the criterion into a multiple regression equation. It determines not only whether variables are related but also the degree to which they are related.

44. Path analysis involves multiple regressions between and among all the variables in the model and then specifies the direct and indirect effects of the predictor variables onto the criterion variable.
Chi Square

45. Chi square, symbolized as χ², is a nonparametric test of significance appropriate when the data are in the form of frequency counts or percentages and proportions that can be converted to frequencies. It is used to compare frequencies occurring in different categories or groups.

46. Expected frequencies are usually the frequencies that would be expected if the groups were equal and the null hypothesis was not rejected.

47. Chi square is computed by comparing the frequencies of each variable observed in a study to the expected frequencies. As the number of variables and their associated values increases, the contingency table becomes geometrically larger and more complex to interpret.

Other Investigative Techniques: Data Mining, Factor Analysis, and Structural Equation Modeling

48. Data mining is an analytical technique that identifies and predicts patterns in large datasets with multiple variables.

49. Factor analysis is a statistical procedure used to identify relations among variables in a correlation matrix. Factor analysis determines how variables group together based on what they may have in common. Many personality inventories and the IQ score were developed through factor analysis.

50. Structural Equation Modeling (SEM), principally LISREL (Linear Structural Relationships), is a highly complex statistical analysis that builds a structural model to explain the interactive relations among a relatively large number of variables.
Go to the topic “Inferential Statistics” in the MyEducationLab (www.myeducationlab.com) for your course,
where you can:
◆ Find learning outcomes.
◆ Complete Assignments and Activities that can help you more deeply understand the chapter content.
◆ Apply and practice your understanding of the core skills identified in the chapter with the Building
Research Skills exercises.
◆ Check your comprehension of the content covered in the chapter by going to the Study Plan. Here you
will be able to take a pretest, receive feedback on your answers, and then access Review, Practice, and
Enrichment activities to enhance your understanding. You can then complete a final posttest.
PERFORMANCE CRITERIA TASK 7

Task 7 should look like the results section of a research report. The data that you generate (scores you make up for each subject) should make sense. If your dependent variable is IQ, for example, do not generate scores such as 2, 11, 15; generate scores such as 84, 110, and 120. Got it? Unlike in a real study, you can make your study turn out any way you want.

Depending on the scale of measurement represented by your data, select and compute the appropriate descriptive statistics.

Depending on the scale of measurement represented by your data, your research hypothesis, and your research design, select and compute the appropriate test of significance. Determine the statistical significance of your results for a selected probability level. Present your results in a summary table, and relate how the significance or nonsignificance of your results supports or does not support your original research hypothesis. For example, you might say the following:

Computation of a t test for independent samples (α = .05) indicated that the group that received weekly reviews retained significantly more than the group that received daily reviews (see Table 1). Therefore, the original hypothesis that "ninth-grade algebra students who receive a weekly review will retain significantly more algebraic concepts than ninth-grade algebra students who receive a daily review" was supported.

An example of the table referred to (Table 1) appears at the bottom of this page.

An example that illustrates the performance called for by Task 7 appears on the following pages. (See Task 7 Example.) Note that the scores are based on the administration of the test described in the Task 5 example. Note also that the student used a formula used in meta-analysis (described briefly in Chapter 2) to calculate effect size (ES). The basic formula is

ES = (X̄e − X̄c) / SDc

where

X̄e = the mean (average) score for the experimental group
X̄c = the mean (average) score for the control group
SDc = the standard deviation (variability) of the scores for the control group

Although your actual calculations should not be part of Task 7, they should be attached to it. We have attached the step-by-step calculations for the Task 7 example. You may also perform your calculations using SPSS and attach these computations as well.

TABLE 1 • Means, standard deviations, and t for the daily-review and weekly-review groups on the delayed retention test

          Review Group
        Daily     Weekly      t
M       44.82     52.68     2.56*
SD       5.12      6.00

Note: Maximum score = 65.
*df = 38, p < .05.
TASK 7 Example
Effect of Interactive Multimedia on the Achievement of 10th-Grade Biology Students
Results
Prior to the beginning of the study, after the 60 students were randomly selected and assigned to experimental and control groups, final science grades from the previous school year were obtained from school records in order to check initial group equivalence. Examination of the means and a t test for independent samples (α = .05) indicated essentially no difference between the groups (see Table 1). A t test for independent samples was used because the groups were randomly formed and the data were interval.
Table 1

Means, Standard Deviations, and t Tests for the Experimental and Control Groups

                        Group
Score             IMM instruction(a)   Traditional instruction(a)      t
Prior grades
  M                     87.47                  87.63                −0.08*
  SD                     8.19                   8.05
Posttest (NPSS:B)
  M                     32.27                  26.70                 4.22**
  SD                     4.45                   5.69

Note. Maximum score for prior grades = 100. Maximum score for posttest = 40.
(a) n = 30.
*p > .05. **p < .05.
At the completion of the eight-month study, during the first week in May, scores on the NPSS:B were compared, also using a t test for independent samples. As Table 1 indicates, scores of the experimental and control groups were significantly different. In fact, the experimental group scored approximately one standard deviation higher than the control group (ES = .98). Therefore, the original hypothesis that "10th-grade biology students whose teachers use IMM as part of their instructional technique will exhibit significantly higher achievement than 10th-grade biology students whose teachers do not use IMM" was supported.
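As a cross-check on the example, both the posttest t value and the effect size can be recomputed from the summary statistics printed in Table 1. A small sketch, assuming Python with scipy; only the published summary statistics are used.

```python
# Verifying the Task 7 example's posttest t and effect size from Table 1.
from scipy.stats import ttest_ind_from_stats

m_e, sd_e, n_e = 32.27, 4.45, 30  # IMM (experimental) group posttest
m_c, sd_c, n_c = 26.70, 5.69, 30  # traditional (control) group posttest

t, p = ttest_ind_from_stats(m_e, sd_e, n_e, m_c, sd_c, n_c)
print(round(t, 2), round(p, 4))   # t is about 4.22, p well under .05

es = (m_e - m_c) / sd_c           # ES = (mean_e - mean_c) / SD_c
print(round(es, 2))               # about 0.98, matching the reported ES
```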
CHAPTER FOURTEEN
[Chapter-opening photo: The Godfather, 1972]
“No one recipe tells how to proceed with
data collection efforts.” (p. 381)
Qualitative Data
Collection
LEARNING OUTCOMES
After reading Chapter 14, you should be able to do the following:
1. Describe qualitative data collection sources and techniques.
2. Describe strategies to address the trustworthiness (i.e., validity) and
replicability (i.e., reliability) of qualitative research.
3. Describe the steps for getting started as a qualitative researcher ready to begin
data collection, or fieldwork.
After obtaining entry into a setting and selecting participants, the qualitative re-
searcher is ready to begin data collection, also commonly called fieldwork. Field-
work involves spending considerable time in the setting under study, immersing
oneself in this setting, and collecting as much relevant information as possible and
as unobtrusively as possible. Qualitative researchers collect descriptive—narrative
and visual—nonnumerical data to gain insights into the phenomena of interest.
Because the data that are collected should contribute to understanding the phe-
nomenon, data collection is largely determined by the nature of the problem. No
one recipe tells how to proceed with data collection efforts. Rather, the researcher
must collect the appropriate data to contribute to the understanding and resolution
of a given problem.
DATA COLLECTION SOURCES AND TECHNIQUES
Observations, interviews, questionnaires, phone calls, personal and official docu-
ments, photographs, recordings, drawings, journals, email messages and responses,
and informal conversations are all sources of qualitative data. Clearly, many sources
of data are acceptable, as long as the collection approach is ethical, feasible, and
contributes to an understanding of the phenomenon under study. The four data col-
lection techniques we discuss in this chapter are observing, interviewing (including
the use of focus groups and email), administering questionnaires, and examining
records. These techniques share one aspect: The researcher is the primary data col-
lection instrument.
Observing
When qualitative researchers obtain data by watching the participants, they are
observing. The emphasis during observation is on understanding the natural en-
vironment as lived by participants, without altering or manipulating it. For certain