232 CHAPTER 9 • CAUSAL–COMPARATIVE RESEARCH
of two groups, one composed of children with brain injuries and the other composed of children without brain injuries. An example of Case B is a comparison of two groups, one composed of individuals with strong self-concepts and the other composed of individuals with weak self-concepts. Another Case B example is a comparison of the algebra achievement of two groups, those who had learned algebra via traditional instruction and those who had learned algebra via computer-assisted instruction. In both Case A and Case B designs, the performance of the groups is compared using some valid measure selected from the types of instruments discussed in Chapter 6.

Definition and selection of the comparison groups are very important parts of the causal–comparative procedure. The variable differentiating the groups must be clearly and operationally defined because each group represents a different population and the way in which the groups are defined affects the generalizability of the results. If a researcher wanted to compare a group of students with an unstable home life to a group of students with a stable home life, the terms unstable and stable would have to be operationally defined. An unstable home life could refer to any number of things, such as life with a parent who abuses alcohol, who is violent, or who neglects the child. It could refer to a combination of these or other factors. Operational definitions help define the populations and guide sample selection.

Random selection from the defined populations is generally the preferred method of participant selection. The important consideration is to select samples that are representative of their respective populations. Note that in causal–comparative research the researcher samples from two already existing populations, not from a single population. The goal is to have groups that are as similar as possible on all relevant variables except the grouping variable. To determine the equality of groups, information on a number of background and current status variables may be collected and compared for each group. For example, information on age, years of experience, gender, and prior knowledge may be obtained and examined for the groups being compared. The more similar the two groups are on such variables, the more homogeneous they are on everything but the variable of interest. This homogeneity makes a stronger study and reduces the number of possible alternative explanations of the research findings. Not surprisingly, then, a number of control procedures correct for identified inequalities on such variables.

Control Procedures

Lack of randomization, manipulation, and control are all sources of weakness in a causal–comparative study. In other study designs, random assignment of participants to groups is probably the best way to try to ensure equality of groups, but random assignment is not possible in causal–comparative studies because the groups are naturally formed before the start of the study. Without random assignment, the groups are more likely to be different on some important variable (e.g., gender, experience, age) other than the variable under study. This other variable may be the real cause of the observed difference between the groups. For example, a researcher who simply compared a group of students who had received preschool education to a group who had not may conclude that preschool education results in higher first-grade reading achievement. However, if all preschool programs in the region in which the study was conducted were private and required high tuition, the researcher would really be investigating the effects of preschool education combined with membership in a well-to-do family. Perhaps parents in such families provide early informal reading instruction for their children. In this case, it is very difficult to disentangle the effects of preschool education from the effects of affluent families on first-grade reading. A researcher aware of the situation could control for this variable by studying only children of well-to-do parents. Thus, the two groups to be compared would be equated with respect to the extraneous variable of parents’ income level. This example is but one illustration of a number of statistical and nonstatistical methods that can be applied in an attempt to control for extraneous variables.

The following sections describe three control techniques: matching, comparing homogeneous groups or subgroups, and analysis of covariance.

Matching

Matching is a technique for equating groups on one or more variables. If researchers identify a variable likely to influence performance on the dependent variable, they may control for that variable by pair-wise matching of participants. In other words, for
each participant in one group, the researcher finds a participant in the other group with the same or very similar score on the control variable. If a participant in either group does not have a suitable match, the participant is eliminated from the study. Thus, the resulting matched groups are identical or very similar with respect to the identified extraneous variable. For example, if a researcher matched participants in each group on IQ, a participant in one group with an IQ of 140 would be matched with a participant with an IQ at or near 140 in the other group. A major problem with pair-wise matching is that invariably some participants have no match and must therefore be eliminated from the study. The problem becomes even more serious when the researcher attempts to match participants on two or more variables simultaneously.

Comparing Homogeneous Groups or Subgroups

Another way to control extraneous variables is to compare groups that are homogeneous with respect to the extraneous variable. In the study about preschool attendance and first-grade achievement, the decision to compare children only from well-to-do families is an attempt to control extraneous variables by comparing homogeneous groups. If, in another situation, IQ were an identified extraneous variable, the researcher could limit groups only to participants with IQs between 85 and 115 (i.e., average IQ). This procedure may lower the number of participants in the study and also limit the generalizability of the findings because the sample of participants includes such a limited range of IQ.

A similar but more satisfactory approach is to form subgroups within each group to represent all levels of the control variable. For example, each group may be divided into subgroups based on IQ: high (e.g., 116 and above), average (e.g., 85 to 115), and low (e.g., 84 and below). The existence of comparable subgroups in each group controls for IQ. This approach also permits the researcher to determine whether the target grouping variable affects the dependent variable differently at different levels of IQ, the control variable. That is, the researcher can examine whether the effect on the dependent variable is different for each subgroup.

If subgroup comparison is of interest, the best approach is not to do separate analyses for each subgroup but to build the control variable into the research design and analyze the results with a statistical technique called factorial analysis of variance. A factorial analysis of variance (discussed further in Chapter 13) allows the researcher to determine the effects of the grouping variable (for causal–comparative designs) or independent variable (for experimental designs) and the control variable both separately and in combination. In other words, factorial analysis of variance tests for an interaction between the independent/grouping variable and the control variable such that the independent/grouping variable operates differently at each level of the control variable. For example, a causal–comparative study of the effects of two different methods of learning fractions may include IQ as a control variable. One potential interaction between the grouping and control variable would be that a method involving manipulation of blocks is more effective than other methods for students with lower IQs, but the manipulation method is no more effective than other methods for students with higher IQs.

Analysis of Covariance

Analysis of covariance is a statistical technique used to adjust initial group differences on variables used in causal–comparative and experimental studies. In essence, analysis of covariance adjusts scores on a dependent variable for initial differences on some other variable related to performance on the dependent variable. For example, suppose we planned a study to compare two methods, X and Y, of teaching fifth graders to solve math problems. When we gave the two groups a test of math ability prior to introducing the new teaching methods, we found that the group to be taught by Method Y scored much higher than the group to be taught by Method X. This difference suggests that the Method Y group will be superior to the Method X group at the end of the study just because members of the group began with higher math ability than members of the other group. Analysis of covariance statistically adjusts the scores of the Method Y group to remove the initial advantage so that at the end of the study the results can be fairly compared, as if the two groups started equally.

Data Analysis and Interpretation

Analysis of data in causal–comparative studies involves a variety of descriptive and inferential statistics. All the statistics that may be used in
a causal–comparative study may also be used in an experimental study. Briefly, however, the most commonly used descriptive statistics are the mean, which indicates the average performance of a group on a measure of some variable, and the standard deviation, which indicates the spread of a set of scores around the mean (that is, whether the scores are relatively close together and clustered around the mean or widely spread out around the mean). The most commonly used inferential statistics are the t test, used to determine whether the scores of two groups are significantly different from one another; analysis of variance, used to test for significant differences among the scores for three or more groups; and chi square, used to compare group frequencies (that is, to see if an event occurs more frequently in one group than another).

Again, remember that interpreting the findings in a causal–comparative study requires considerable caution. Without randomization, manipulation, and control factors, it is difficult to establish cause–effect relations with any great degree of confidence. The cause–effect relation may in fact be the reverse of the one hypothesized (i.e., the alleged cause may be the effect and vice versa). Reversed causality is not a reasonable alternative in every case, however. For example, preschool training may affect reading achievement in third grade, but reading achievement in third grade cannot affect preschool training. Similarly, one’s gender may affect one’s achievement in mathematics, but one’s achievement in mathematics certainly does not affect one’s gender! When reversed causality is plausible, it should be investigated. For example, it is equally plausible that excessive absenteeism produces, or leads to, involvement in criminal activities as it is that involvement in criminal activity produces, or leads to, excessive absenteeism. The way to determine the correct order of causality (which variable caused which) is to determine which one occurred first. If, in the preceding example, a period of excessive absenteeism were frequently followed by a student getting in trouble with the law, then the researcher could reasonably conclude that excessive absenteeism leads to involvement in criminal activities. On the other hand, if a student’s first involvement in criminal activities were preceded by a period of good attendance but followed by a period of poor attendance, then the conclusion that involvement in criminal activities leads to excessive absenteeism would be more reasonable.

The possibility of a third, common explanation is also plausible in many situations. Recall the example of parental attitude affecting both self-concept and achievement, presented earlier in the chapter. As mentioned, one way to control for a potential common cause is to compare homogeneous groups. For example, if students in both the strong self-concept group and the weak self-concept group could be selected from parents who had similar attitudes, the effects of parents’ attitudes would be removed because both groups would have been exposed to the same parental attitudes. To investigate or control for alternative hypotheses, the researcher must be aware of them and must present evidence that they are not better explanations for the behavioral differences under investigation.
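
As a rough illustration of the descriptive and inferential statistics named in the Data Analysis and Interpretation section, the following Python sketch computes the mean and standard deviation for two groups and then an independent-samples t statistic. The scores, the group names, and the helper functions `describe` and `pooled_t` are hypothetical and are not taken from the chapter. The pooled-variance form shown here assumes the two groups have roughly equal variances.

```python
import math
import statistics


def describe(scores):
    """Return (mean, sample standard deviation) for a list of scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def pooled_t(group_a, group_b):
    """Return (t, degrees of freedom) for an independent-samples t test.

    Uses the pooled-variance formula, which assumes equal group variances.
    """
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / std_err
    return t, df


# Hypothetical first-grade reading scores for two already-formed groups.
preschool = [78, 85, 90, 72, 88]
no_preschool = [70, 75, 80, 68, 77]

for label, scores in (("preschool", preschool), ("no preschool", no_preschool)):
    m, sd = describe(scores)
    print(f"{label}: mean = {m:.1f}, SD = {sd:.2f}")

t_value, df = pooled_t(preschool, no_preschool)
print(f"t({df}) = {t_value:.2f}")
```

With these invented scores the t statistic is about 2.15 on 8 degrees of freedom. Even a large t value would show only that the groups differ. Because the groups were not randomly formed, it would not show that preschool attendance caused the difference.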
SUMMARY

CAUSAL–COMPARATIVE RESEARCH: DEFINITION AND PURPOSE

1. In causal–comparative research, the researcher attempts to determine the cause, or reason, for existing differences in the behavior or status of groups.

2. The basic causal–comparative approach is retrospective; that is, it starts with an effect and seeks its possible causes. A variation of the basic approach is prospective; that is, it starts with a cause and investigates its effect on some variable.

3. An important difference between causal–comparative and correlational research is that causal–comparative studies involve two (or more) groups of participants and one grouping variable, whereas correlational studies typically involve two (or more) variables and one group of participants. Neither causal–comparative nor correlational research produces true experimental data.

4. The major difference between experimental research and causal–comparative research is that in experimental research the researcher can randomly form groups and manipulate the independent variable. In causal–comparative research the groups are already formed and already differ in terms of the variable in question.

5. Grouping variables in causal–comparative studies cannot be manipulated, should not be manipulated, or simply are not manipulated but could be.

6. Causal–comparative studies identify relations that may lead to experimental studies, but only if a relation is established clearly. The alleged cause of an observed causal–comparative effect may in fact be the supposed cause, the effect, or a third variable that may have affected both the apparent cause and the effect.

THE CAUSAL–COMPARATIVE RESEARCH PROCESS

Design and Procedure

7. The basic causal–comparative design involves selecting two groups differing on some variable of interest and comparing them on some dependent variable. One group may possess a characteristic that the other does not, or one group may possess more of a characteristic than the other.

8. Samples must be representative of their respective populations and similar with respect to critical variables other than the grouping variable.

Control Procedures

9. Lack of randomization, manipulation, and control are all sources of weakness in a causal–comparative design. It is possible that the groups are different on some other major variable besides the target variable of interest, and this other variable may be the cause of the observed difference between the groups.

10. Three approaches to overcoming problems of initial group differences on an extraneous variable are matching, comparing homogeneous groups or subgroups, and analysis of covariance.

Data Analysis and Interpretation

11. The descriptive statistics most commonly used in causal–comparative studies are the mean, which indicates the average performance of a group on a measure of some variable, and the standard deviation, which indicates how spread out a set of scores is (that is, whether the scores are relatively close together and clustered around the mean or widely spread out around the mean).

12. The inferential statistics most commonly used in causal–comparative studies are the t test,
which is used to determine whether the scores of two groups are significantly different from one another; analysis of variance, used to test for significant differences among the scores for three or more groups; and chi square, used to see if an event occurs more frequently in one group than another.

13. Interpreting the findings in a causal–comparative study requires considerable caution. The alleged cause may in fact be the effect, and vice versa, or a third factor may be the cause of both variables. The way to determine the correct order of causality is to determine which variable occurred first.
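
The analysis of covariance adjustment named in item 10 can be illustrated with a small Python sketch. It is a simplified illustration, not a full analysis of covariance. It assumes a single covariate (a pretest), a common within-group regression slope, and invented scores for the chapter's Method X and Method Y groups. The function name `adjusted_means` is hypothetical. The sketch shows only how each group's posttest mean is adjusted to a common pretest level and performs no significance test.

```python
from statistics import mean


def adjusted_means(groups):
    """Adjust each group's posttest mean to the grand pretest mean.

    groups maps a group name to a list of (pretest, posttest) pairs.
    The adjustment uses the pooled within-group regression slope of
    posttest on pretest.
    """
    sxy = 0.0  # pooled within-group sum of cross-products
    sxx = 0.0  # pooled within-group sum of squares for the pretest
    for pairs in groups.values():
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        x_bar, y_bar = mean(xs), mean(ys)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in pairs)
        sxx += sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx

    grand_x = mean([x for pairs in groups.values() for x, _ in pairs])
    return {
        name: mean([y for _, y in pairs])
        - slope * (mean([x for x, _ in pairs]) - grand_x)
        for name, pairs in groups.items()
    }


# Hypothetical (pretest, posttest) math scores; Method Y starts higher.
scores = {
    "Method X": [(10, 20), (12, 24), (14, 28)],
    "Method Y": [(20, 42), (22, 46), (24, 50)],
}

for name, value in adjusted_means(scores).items():
    print(f"{name}: adjusted posttest mean = {value:.1f}")
```

In this invented data set the raw posttest means are 24 and 46, but the adjusted means are 34 and 36. Most of the apparent advantage of Method Y reflects its higher starting point, which is the kind of initial difference the adjustment is meant to remove.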
Comparing Longitudinal Academic Achievement of Full-Day and Half-Day Kindergarten Students

JENNIFER R. WOLGEMUTH, R. BRIAN COBB, MARC A. WINOKUR
Colorado State University

NANCY LEECH
University of Colorado-Denver

DICK ELLERBY
Poudre School District

Address correspondence to R. Brian Cobb, College of Applied Human Sciences, Colorado State University, 222 West Laurel Street, Fort Collins, CO 80521. (E-mail: [email protected])
Copyright © 2006 Heldref Publications.

ABSTRACT The authors compared the achievement of children who were enrolled in full-day kindergarten (FDK) to a matched sample of students who were enrolled in half-day kindergarten (HDK) on mathematics and reading achievement in Grades 2, 3, and 4, several years after they left kindergarten. Results showed that FDK students demonstrated significantly higher achievement at the end of kindergarten than did their HDK counterparts, but that advantage disappeared quickly by the end of the first grade. Interpretations and implications are given for that finding.

Key words: academic achievement of full and half-day kindergarten students, mathematics and reading success in elementary grades.

Coinciding with increases in pre-kindergarten enrollment and the number of parents working outside of the home, full-day kindergarten (FDK) has become exceedingly popular in the United States (Gullo & Maxwell, 1997). The number of students attending FDK classes in the United States rose from 30% in the early 1980s (Holmes & McConnell, 1990) to 55% in 1998 (National Center for Education Statistics, 2000), reflecting societal changes and newly emerging educational priorities. Whereas kindergarten students were required to perform basic skills, such as reciting the alphabet and counting to 20, they are now expected to demonstrate reading readiness and mathematical reasoning while maintaining the focus and self-control necessary to work for long periods of time (Nelson, 2000).

In contrast, the popularity of half-day kindergarten (HDK) has decreased for similar reasons. For example, parents prefer FDK over HDK for the time it affords (Clark & Kirk, 2000) and for providing their children with further opportunities for academic, social, and personal enrichment (Aten, Foster, & Cobb, 1996; Cooper, Foster, & Cobb, 1998a, 1998b).

The shift in kindergarten preferences has resulted in a greater demand for research on the effects of FDK in comparison with other scheduling approaches (Gullo & Maxwell, 1997). Fusaro (1997) cautioned that “Before a school district decides to commit additional resources to FDK classes, it should have empirical evidence that children who attend FDK manifest greater achievement than children who attend half-day kindergarten” (p. 270). According to the literature, there is mounting evidence that supports the academic, social, and language development benefits of FDK curricula (Cryan, Sheehan, Wiechel, & Bandy-Hedden, 1992; Hough & Bryde, 1996; Karweit, 1992; Lore, 1992; Nelson, 2000). Successful FDK programs specifically extend traditional kindergarten objectives and use added class hours to afford children more opportunities to fully integrate new learning (Karweit, 1992). Furthermore, most education stakeholders support FDK because they believe that it provides academic advantages for students, meets the needs of busy parents, and allows primary school teachers to be more effective (Ohio State Legislative Office of Education Oversight [OSLOEO], 1997).

Length of School Day

According to Wang and Johnstone (1999), the “major argument for full-day kindergarten is that additional hours in school would better prepare children for first grade and would result in a decreased need for grade retention” (p. 27). Furthermore, extending the kindergarten day provides educational advantages resulting from increased academic emphasis, time on task, and content coverage (Karweit, 1992; Nelson, 2000; Peck, McCaig, & Sapp, 1988). Advocates of FDK also contend that a longer school day allows teachers to provide a relaxed classroom atmosphere in which children can experience kindergarten activities in a less hurried manner (McConnell & Tesch, 1986). Karweit (1992) argued that consistent school schedules and longer school days help parents to better manage family and work responsibilities while providing more time for individualized attention for young children.

Critics of FDK express concern that “children may become overly tired with a full day of instruction, that children might miss out on important learning experiences at home, and that public schools should not be
in the business of providing ‘custodial’ child care for 5-year-olds” (Elicker & Mathur, 1997, p. 461). Peck and colleagues (1988) argued that some FDK programs use the extra time to encroach on the first-grade curriculum in an ill-advised attempt to accelerate children’s cognitive learning. However, in a 9-year study of kindergarten students, the Evansville-Vanderburgh School Corporation (EVSC, 1988) found that school burnout and academic stress were not issues for FDK students. Others conclude convincingly that the events that occur in classrooms (e.g., teacher philosophy, staff development), rather than the length of the school day, determine whether curricula and instruction are developmentally appropriate for young students (Clark & Kirk, 2000; Elicker & Mathur, 1997; Karweit, 1994).

Parent Choice

A critical factor driving the growth of FDK is greater parent demand for choice in kindergarten programs. Although surveys of parents with children in HDK often mention the importance of balancing education outside the home with quality time in the home, Elicker and Mathur (1997) found that a majority of these parents would select a FDK program for their child if given the opportunity. However, Cooper and colleagues (1998a) found that parents of FDK students were even more supportive of having a choice of programs than were parents of HDK students.

Although some parents expressed concern about the length of time that children were away from home, most were content with the option of FDK (Nelson, 2000). In addition to the belief that FDK better accommodates their work schedules (Nelson), “parents of full-day children expressed higher levels of satisfaction with program schedule and curriculum, citing benefits similar to those expressed by teachers: more flexibility; more time for child-initiated, in-depth, and creative activities; and less stress and frustration” (Elicker & Mathur, 1997, p. 459). Furthermore, Cooper and colleagues (1998a) found that parents of full-day students were happy with the increased opportunities for academic learning afforded by FDK programs.

Student Achievement

Most researchers who compared the academic achievement levels of FDK and HDK kindergarten students found improved educational performance within FDK programs (Cryan et al., 1992; Elicker & Mathur, 1997; Holmes & McConnell, 1990; Hough & Bryde, 1996; Koopmans, 1991; Wang & Johnstone, 1999). In a meta-analysis of FDK research, Fusaro (1997) found that students who attended FDK demonstrated significantly higher academic achievement than did students in half-day programs. Hough and Bryde (1996) matched six HDK programs with six FDK programs and found that FDK students outperformed HDK students on language arts and mathematics criterion-referenced assessments. In a study of 985 kindergarten students, Lore (1992) found that 65% of the students who attended a FDK program showed relatively stronger gains on the reading and oral comprehension sections of the Comprehensive Test of Basic Skills. In a 2-year evaluation of a new FDK program, Elicker and Mathur (1997) reported that FDK students demonstrated significantly more progress in literacy, mathematics, and general learning skills, as compared with students in HDK programs. However, some researchers have not found significant differences between the academic achievement of students from FDK and HDK programs (e.g., Gullo & Clements, 1984; Holmes & McConnell, 1990; Nunnally, 1996).

Longitudinal Student Achievement

Evidence supporting the long-term effectiveness of FDK is less available and more inconsistent than is its short-term effectiveness (Olsen & Zigler, 1989). For example, the EVSC (1988) reported that FDK students had higher grades than did HDK students throughout elementary and middle school, whereas Koopmans (1991) found that the “significance of the differences between all-day and halfday groups disappears in the long run [as] test scores go down over time in both cohorts” (p. 16). Although OSLOEO (1997) concluded that the academic and social advantages for FDK students were diminished after the second grade, Cryan and colleagues (1992) found that the positive effects from the added time offered by FDK lasted well into the second grade.

Longitudinal research of kindergarten programming conducted in the 1980s (Gullo, Bersani, Clements & Bayless, 1986; Puleo, 1988) has been criticized widely for its methodological flaws and design weaknesses. For example, Elicker and Mathur (1997) identified the noninclusion of initial academic abilities in comparative models as a failing of previous longitudinal research on the lasting academic effects of FDK.

Study Rationale

In 1995, the Poudre School District (PSD) implemented a tuition-based FDK program in addition to HDK classes already offered. Although subsequent surveys of parent satisfaction revealed that FDK provided children with further opportunities for academic enrichment (Aten et al., 1996; Cooper et al., 1998a, 1998b), researchers have not determined the veracity of these assumptions. Thus, we conducted the present study to address this gap in the empirical evidence base.

Research Questions

Because of the inconclusiveness in the research literature on the longitudinal academic achievement of FDK versus HDK kindergarten students, we did not pose a priori research hypotheses. We developed the following research questions around the major main effects and interactions of the kindergarten class variable (full day vs. half day), covariates (age and initial ability), and dependent variables (K–5 reading and mathematics achievement).
1. What difference exists between FDK and HDK Reading Curriculum. The kindergarten reading curricu-
kindergarten students in their mathematics and lum was based predominantly on the Open Court sys-
reading abilities as they progress through elementary tem, which emphasizes phonemic awareness. Students
school, while controlling for their initial abilities? learned to segment and blend words by pronouncing
and repronouncing words when beginnings and endings
2. How does this differential effect vary, depending were removed. Teachers also included daily “letters to
on student gender? the class” on which students identified the letters of the
day and circled certain words. Teachers also read stories
Method to students, helped students write capital and lowercase
Participants letters and words, and encouraged them to read on their
own and perform other reading activities. Teachers ex-
The theoretical population for this study included stu- pected the students to know capital and lowercase letters
dents who attended elementary schools in moderately and their sounds, and some words by sight when they
sized, middle-to-upper class cities in the United States. completed kindergarten.
The actual sample included 489 students who attended
FDK or HDK from 1995 to 2001 at one elementary school Mathematics Curriculum. The kindergarten mathemat-
in a Colorado city of approximately 125,000 residents. ics curriculum was predominantly workbook based and
Because this study is retrospective, we used only archi- integrated into the whole curriculum. Students worked
val data to build complete cases for each student in the with mathematics problems from books, played number
sample. Hence, no recruitment strategies were necessary. games with the calendar, counted while standing in line
for lunch and recess, and practiced mathematical skills
Students were enrolled in one of three kindergarten in centers. Once a week, the principal came into the kin-
classes: 283 students (57.9%) attended half-day classes dergarten classes and taught students new mathematics
(157 half-day morning and 126 half-day afternoon) and games with cards or chips. The games included counting-
206 students (42.1%) attended full-day classes. Students on, skip-counting, and simple addition and subtraction.
ages ranged from 5 years 0 months to 6 years 6 months Students were expected to leave kindergarten knowing
upon entering kindergarten; overall average age was how to count and perform basic numerical operations
5 years 7 months. The total study included 208 girls (i.e., adding and subtracting 1).
(44.0%) and 265 boys (56.0%). The majority of students
received no monetary assistance for lunch, which was Measures
based on parent income (89.0%, n ϭ 424); 49 students
(10.0%) received some assistance. Twenty-six students Initial Reading Ability Covariate. When each partici-
(5.3%) spoke a language at home other than English. The pant entered kindergarten, school personnel (kindergar-
majority of students (90.5%, n ϭ 428) were Caucasian; ten teacher or school principal) assessed them for their
31 students (6.3%) were Hispanic; and 14 students (2.8%) ability to recognize capital and lowercase letters and to
were African American, Native American, or Asian Amer- produce their sounds. This letter-knowledge assessment
ican. Those data reflect the community demographics requested that students name all uppercase and lower-
within the school district. Because of the potential for case letters (shown out of order) and make the sounds
individual identification based on the small numbers of of the uppercase letters. Students received individual
students within the various ethnic groups and those re- testing, and school personnel recorded the total number
ceiving lunch assistance, our analyses excluded ethnicity of letters that the student identified correctly out of a
and lunch assistance as control variables. possible 78 letters. Letter-name and sound knowledge
are both essential skills in reading development (Stage,
Intervention Sheppard, Davidson, & Browning, 2001). Simply put,
theory suggests that letter-name knowledge facilitates
We excluded from the study students who switched the ability to produce letter sounds, whereas letter-
during the academic year from FDK to HDK (or vice sounding ability is the foundation for word decoding
versa). FDK comprised an entire school day, beginning at and fluent reading (Ehri, 1998; Kirby & Parrila, 1999;
8:30 a.m. and ending at 3:00 p.m. HDK morning classes Treiman, Tincoff, Rodriguez, Mouzaki, & Francis, 1998).
took place from 8:30 a.m. to 11:15 a.m.; HDK afternoon Predictive validity is evidenced in the numerous studies
classes occurred from 12:15 p.m. to 3:00 p.m. FDK in which researchers have reported high correlations
recessed at lunch and provided a 30-min rest period in (r = .60 to r = .90) between letter-naming and letter-
the afternoon when students typically napped, watched sounding ability and subsequent reading ability and
a video, or both. HDK students also recessed but did not achievement measures (Daly, Wright, Kelly, & Martens,
have a similar rest period. Both kindergarten programs 1997; Kirby & Parrila, 1999; McBride-Chang, 1999; Stage
employed centers (small ability-based groups) as part of et al., 2001).
their reading and mathematics instruction, and all kin-
dergarten teachers met weekly to discuss and align their Initial Mathematics Ability Covariate. When the stu-
curriculum. The amount of time spent on reading instruc- dents entered kindergarten, school personnel (kinder-
tion was two or three times greater than that dedicated garten teacher or school principal) assessed their initial
to mathematics.
mathematics ability. The assessment consisted of personnel asking students to identify numbers from 0 to 10. They recorded the total number that the student named out of a possible 11. The ability to recognize numbers and perform basic numerical operations, such as counting to 10, is recognized as an important indicator of kindergarten readiness (Kurdek & Sinclair, 2001). Researchers have shown that basic number skills (counting and number recognition) in early kindergarten predict mathematics achievement in first grade (Bramlett, Rowell, & Madenberg, 2000) and in fourth grade (Kurdek & Sinclair).

K–2 Reading Fluency Dependent Variable: One-Minute Reading (OMR) Assessment. The school principal assessed K–2 reading achievement by conducting 1-min, grade-appropriate reading samples with each student at the beginning and end of the school year. The kindergarten reading passage contained 67 words, the first-grade passage had 121 words, and the second-grade passage included 153 words. Students who finished a passage in less than 1 min returned to the beginning of the passage and continued reading until the minute expired. The principal recorded the total number of words that a student read correctly in 1 min. Students who read passages from grades higher than their own were excluded from subsequent analyses.

The OMR is a well-known curriculum-based measure of oral fluency that is theoretically and empirically linked to concurrent and future reading achievement (Fuchs, Fuchs, Hosp, & Jenkins, 2001). Scores on the OMR correlate highly with concurrent criteria (r = .70 to .90; Parker, Hasbrouck, & Tindal, 1992). Evidence of oral fluency criterion validity includes high correlations with teacher student-ability judgments (Jenkins & Jewell, 1993), standardized student achievement test scores (Fuchs, Fuchs, & Maxwell, 1988; Jenkins & Jewell), reading inventories (Parker et al., 1992), and reading-comprehension tests (Hintze, Shapiro, Conte, & Basile, 1997; Kranzler, Brownell, & Miller, 1998).

Dependent Variables for Reading- and Mathematics-Achievement-Level Tests: Reading and Mathematics Levels. The Northwest Evaluation Association (NWEA) developed standardized reading-, mathematics-, and science-level tests for the Poudre School District. NWEA generated the tests from a large data bank of items that were calibrated on a common scale using Rasch measurement techniques. The tests measure student performance on a Rasch unit (RIT) scale that denotes a student’s ability, independent of grade level. The elementary school conducted reading- and mathematics-level tests once a year in the spring with all second- through sixth-grade students who could read and write. NWEA (2003) reported that the levels tests correlate highly with other achievement tests, including the Colorado State Assessment Program test (r = .84 to .91) and the Iowa Tests of Basic Skills (r = .74 to .84). Test–retest reliability results were similarly favorable, ranging from .72 to .92, depending on grade level and test (NWEA).

Results

Rationale for Analyses. We considered several alternatives when we analyzed the data from this study. Our first choice was to analyze the data by using three multiway mixed analyses of covariance (ANCOVAs) with kindergarten group and gender as the between-groups factors and the repeated measurements over time as the within-subjects factor. However, we rejected that analytic technique for two reasons. First and foremost, all three analyses evidenced serious violations of sphericity. Second, this analytic design requires that all cases have all measures on the dependent variable (the within-subjects factor). That requirement reduced our sample size by as much as 75% in some of the analyses when compared with our final choice of separate univariate, between-groups ANCOVAs.

Our second choice was to analyze the data with three 2 × 2 (Kindergarten Group [full day vs. half day] × Gender) between-groups multivariate analyses of covariance (MANCOVAs) with the multiple dependent variable measures included simultaneously in the analysis. Field (2000) recommended switching from repeated-measures ANCOVAs to MANCOVAs when sample sizes are relatively large and violations of sphericity are fairly severe, as in our situation. Unfortunately, there also are difficulties when researchers use MANCOVAs. First, the analysis and interpretation of MANCOVA are extraordinarily complex and cumbersome. More important, a number of statisticians (e.g., Tabachnick & Fidell, 1996) have counseled against using MANCOVA when strong intercorrelations exist between the dependent measures. Finally, our data violated the homogeneity of covariance matrices, which is an additional assumption of MANCOVA.

Our final choice was to conduct separate univariate ANCOVAs with appropriate Bonferroni adjustments to prevent inflation in the Type I error rate. For the OMR, we began our analyses with five 2 × 2 (Kindergarten Group × Gender) ANCOVAs, with initial reading ability as the covariate. We measured OMR at the end of kindergarten and at the beginning and end of first and second grades. The alpha level was set at .01 for each of the five analyses.

For the reading-level analyses, we conducted three 2 × 2 ANCOVAs because reading achievement tests were given in the spring of the second, third, and fourth grades. The alpha level was set at .017 for each of the analyses. For the mathematics levels analyses, we conducted three 2 × 2 ANCOVAs with the mathematics achievement tests given in the spring of the second, third, and fourth grades. The alpha level was also set at .017 for those analyses.

Assessing Assumptions. We began our univariate ANCOVA analyses by testing for univariate and multivariate normality. Univariate normality existed in all 11 analyses, at least with respect to skewness. There were two instances in which univariate kurtosis exceeded
Table 1
Correlations of Dependent Variables with Initial Reading Ability and Age

                              Initial reading ability          Age
Dependent variable               r          n              r         n
OMR end kindergarten            .47**      403            .05       453
OMR beginning Grade 1           .50**      265            .03       301
OMR end Grade 1                 .40**      198            .01       231
OMR beginning Grade 2           .39**       97            .03       105
OMR end Grade 2                 .41**      182            .07       208
Level 2                         .40**      234            .03       266
Level 4                         .30**      103           –.10       127

Note: OMR = One-Minute Reading.
**p < .01.
acceptable boundaries for normality. Although there were a limited number of instances in which multivariate normality was mildly violated, visual inspection of the histograms and Q-Q plots suggested no substantive deviations from normality, except for the OMR test given at the end of kindergarten. Hence, we eliminated the test from our final set of analyses. Given the large sample sizes and the relative robustness of ANCOVA against violations of normality, we proceeded with the remaining 10 ANCOVAs.

We next assessed the assumption of homogeneity of regression slope, which, if violated, generates much more difficulty in the interpretation of the results of the analyses. None of the five OMR analyses or the three mathematics levels analyses violated that assumption. However, the third-grade reading-level analysis violated the assumption. Hence, we removed that analysis from the study, leaving only two analyses of reading achievement at the second- and fourth-grade levels.

Finally, we assessed the correlation between the covariate and the dependent variable. We began by assuming that the participant’s age (measured in months) might be correlated significantly with the dependent variables and should be included in our analyses as a covariate. Tables 1 and 2 show the results of this analysis: none of the correlations with age were statistically significant. Hence, we did not include age in the analyses as a covariate.

Initial reading and mathematics abilities were the other covariates included in the analyses. Our a priori assumption was that those covariates had to correlate significantly with their appropriate dependent variable to be included in the analyses. As Tables 1 and 2 show, all of the final correlations were statistically significant, confirming the propriety of their use as covariates.

Findings

Tables 3, 4, and 5 show the source tables for the OMR, the reading levels, and the mathematics levels, respectively. In each table, the kindergarten grouping independent variable is included regardless of whether it achieved statistical significance. Gender, on the other hand, is included in the source tables only in those analyses in which it achieved statistical significance (second-grade mathematics achievement).

Table 3 shows that kindergarten class was statistically significant at the end of kindergarten, F(1, 400) = 35.08, p < .001, at the beginning of first grade, F(1, 261) = 11.43, p < .01, and at the end of first grade, F(1, 194) = 6.26, p < .05. The covariate, as expected, was strongly significant at all levels, and gender was not statistically significant at any level in the analyses. Significance levels and the estimates of effect size declined as the participants progressed in school within and across academic years.

Table 4 shows that the covariate was highly significant (as expected) but that there was no statistically significant effect for either kindergarten class or gender. Table 5 shows a pattern similar to that in the two preceding tables, with (a) a statistically significant covariate, (b) absence

Table 2
Correlations of Dependent Variables with Initial Mathematics Ability and Age

Variable                         Level 2     Level 3     Level 4
Initial mathematics ability
  r                               .35**       .30**       .22*
  n                               194         180         120
Age
  r                               .03        –.02        –.09
  n                               264         189         127

*p < .05. **p < .01.
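The significance flags in Table 2 can be checked approximately from r and n alone. The sketch below uses the Fisher z transformation with a normal approximation. That choice is an assumption made here for illustration: the article does not say which test the authors used, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_z_p_value(r: float, n: int) -> float:
    """Approximate two-tailed p value for H0: rho = 0 using Fisher's z."""
    # atanh(r) * sqrt(n - 3) is roughly standard normal when rho = 0.
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def flag(p: float) -> str:
    """Map a p value to the footnote symbols used in Table 2."""
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"


# (r, n) pairs copied from Table 2.
table_2 = {
    "Initial mathematics ability, Level 2": (0.35, 194),
    "Initial mathematics ability, Level 3": (0.30, 180),
    "Initial mathematics ability, Level 4": (0.22, 120),
    "Age, Level 2": (0.03, 264),
    "Age, Level 3": (-0.02, 189),
    "Age, Level 4": (-0.09, 127),
}

for label, (r, n) in table_2.items():
    p = fisher_z_p_value(r, n)
    print(f"{label}: r = {r:+.2f}, n = {n}, p = {p:.4f} {flag(p)}")
```

Under this approximation the flags agree with the table: the Level 2 and Level 3 correlations with initial mathematics ability fall below .01, the Level 4 correlation falls between .01 and .05, and none of the age correlations approaches significance.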
Table 3
Analysis of Covariance Results for OMR Fluency Tests as a Function of Kindergarten Class, Controlling for Initial Reading Ability

Variable and source               df           MS          F         p
OMR (end kindergarten)
  Kindergarten class               1     14,405.36      32.79     <.001
  Initial reading ability          1     59,031.95     134.37     <.001
  Error                          398        439.33
OMR (beginning Grade 1)
  Kindergarten class               1     10,339.87      10.76      .001
  Initial reading ability          1     96,556.42     100.43     <.001
  Error                          260        961.42
OMR (end Grade 1)
  Kindergarten class               1      5,261.69       5.73      .018
  Initial reading ability          1     39,604.41      43.15     <.001
  Error                          193        917.79
OMR (beginning Grade 2)
  Kindergarten class               1        185.25        .22      .64
  Initial reading ability          1     14,922.39      17.45     <.001
  Error                           92        855.23
OMR (end Grade 2)
  Kindergarten class               1        100.23        .14      .71
  Initial reading ability          1     25,530.89      35.52     <.001
  Error                          177        718.73

Note. OMR = One-Minute Reading.
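The per-test alpha levels reported under Rationale for Analyses (.01 for each of the five OMR analyses, .017 for each set of three level-test analyses) are what a Bonferroni split of a familywise alpha of .05 would give. The familywise value of .05 is an assumption here, since the article reports only the per-test levels. The sketch applies the .01 criterion to the kindergarten-class p values in Table 3, entering "<.001" as its upper bound of .001; the names are illustrative.

```python
def bonferroni_alpha(familywise_alpha: float, n_tests: int) -> float:
    """Per-test alpha under a Bonferroni adjustment."""
    return familywise_alpha / n_tests


FAMILYWISE_ALPHA = 0.05  # assumed; not stated in the article

omr_alpha = bonferroni_alpha(FAMILYWISE_ALPHA, 5)    # five OMR analyses
level_alpha = bonferroni_alpha(FAMILYWISE_ALPHA, 3)  # three level-test analyses

# Kindergarten-class p values copied from Table 3.
table_3_p = {
    "OMR (end kindergarten)": 0.001,   # reported as <.001
    "OMR (beginning Grade 1)": 0.001,
    "OMR (end Grade 1)": 0.018,
    "OMR (beginning Grade 2)": 0.64,
    "OMR (end Grade 2)": 0.71,
}

# A test counts as significant only if p is below the adjusted alpha.
significant = {label: p < omr_alpha for label, p in table_3_p.items()}

print(f"OMR alpha = {omr_alpha:.3f}, level-test alpha = {level_alpha:.3f}")
for label, is_sig in significant.items():
    print(f"{label}: {'significant' if is_sig else 'ns'}")
```

The end-of-Grade-1 value of .018 falls below .05 but not below the adjusted .01, which matches the "ns" entry for that comparison in Table 6.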
Table 4
Analysis of Covariance Results for Reading Achievement Tests as a Function of Kindergarten Class, Controlling for Initial Reading Ability

Variable and source               df           MS          F         p
Level 2 reading
  Kindergarten class               1         43.82        .37      .55
  Initial reading ability          1      5,496.21      45.79     <.001
  Error                          228        120.02
Level 4 reading
  Kindergarten class               1         12.85        .10      .76
  Initial reading ability          1      1,265.53       9.50      .003
  Error                           98        133.22
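Each F value in a source table is the mean square for the effect divided by the error mean square for that analysis, so the printed values can be checked by hand. A minimal sketch using the Table 4 entries follows; the dictionary layout and function name are illustrative, not taken from the article.

```python
def f_ratio(ms_effect: float, ms_error: float) -> float:
    """F statistic for an effect: its mean square over the error mean square."""
    return ms_effect / ms_error


# Mean squares and printed F values copied from Table 4.
table_4 = {
    "Level 2 reading": {
        "error_ms": 120.02,
        "effects": {
            "Kindergarten class": (43.82, 0.37),
            "Initial reading ability": (5496.21, 45.79),
        },
    },
    "Level 4 reading": {
        "error_ms": 133.22,
        "effects": {
            "Kindergarten class": (12.85, 0.10),
            "Initial reading ability": (1265.53, 9.50),
        },
    },
}

for analysis, rows in table_4.items():
    for source, (ms, printed_f) in rows["effects"].items():
        computed = f_ratio(ms, rows["error_ms"])
        print(f"{analysis}, {source}: computed F = {computed:.2f}, printed F = {printed_f:.2f}")
```

The same check applied to Tables 3 and 5 is a quick way to confirm which F and p values belong to which row when a table has been flattened in extraction.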
Table 5
Analysis of Covariance Results for Kindergarten and Mathematics Achievement Tests as a Function of Kindergarten Class, Controlling for Initial Mathematics Ability

Variable and source               df           MS          F         p
Level 2 mathematics
  Kindergarten class               1         22.53        .22      .64
  Gender                           1        707.76       6.87      .009
  Initial mathematics ability      1      3,485.92      33.82     <.001
  Error                          248        103.08
Level 3 mathematics
  Kindergarten class               1         29.74        .34      .56
  Initial mathematics ability      1      1,464.35      16.66     <.001
  Error                          175         87.89
Level 4 mathematics
  Kindergarten class               1         12.47        .11      .75
  Initial mathematics ability      1        756.59       6.37      .013
  Error                          115        118.79
Table 6
Descriptive Information for Statistically Significant Comparison for Full-Day Versus Half-Day Kindergarten, on All Dependent Variables

                                            Half-day                    Full-day
Dependent variable                        N     M(a)      SD          N     M(a)      SD      ES (d)
OMR (end Kindergarten)                  220    25.33    23.72       183    37.52    25.03      .50
OMR (beginning Grade 1)                 156    44.62    36.57       109    57.61    36.47      .36
OMR (end of Grade 1)                    120    84.56    33.50        78    95.87    33.77      ns
OMR (beginning Grade 2)                  65    62.31    29.96        32    65.54    34.65      ns
OMR (end of Grade 2)                    108    95.81    28.47        74    97.43    30.53      ns
Reading achievement (Grade 2)           137   195.95    12.05        96   196.86    11.77      ns
Reading achievement (Grade 4)            70   214.90    11.01        33   214.11    14.11      ns
Mathematics achievement (Grade 2)       151   199.71    11.08       102   199.09    10.86      ns
Mathematics achievement (Grade 3)       109   212.60     9.78        71   213.45    10.09      ns
Mathematics achievement (Grade 4)        82   218.94    11.14        38   219.64    11.09      ns

Note: OMR = One-Minute Reading.
(a) Covariate adjusted means.
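The two effect sizes reported in Table 6 can be reproduced, to two decimals, as the difference between the covariate-adjusted means divided by the pooled standard deviation. This is a sketch under that assumption: the article calls the estimates "corrected" without giving the formula, so the authors' exact computation may differ. The function name is hypothetical.

```python
import math


def cohens_d(n1: int, m1: float, sd1: float, n2: int, m2: float, sd2: float) -> float:
    """Standardized mean difference (group 2 minus group 1) using the pooled SD."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m2 - m1) / math.sqrt(pooled_var)


# Half-day (group 1) and full-day (group 2) values copied from Table 6.
d_end_kindergarten = cohens_d(220, 25.33, 23.72, 183, 37.52, 25.03)
d_beginning_grade_1 = cohens_d(156, 44.62, 36.57, 109, 57.61, 36.47)

print(f"OMR (end kindergarten):  d = {d_end_kindergarten:.2f}")
print(f"OMR (beginning Grade 1): d = {d_beginning_grade_1:.2f}")
```

A positive d here means the full-day group scored higher. Both computed values round to the entries printed in the ES (d) column for those two rows.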
of statistical significance for the kindergarten class, and (c) declining estimates of effect size as time in school increased. Gender was statistically significant at the second grade.

Table 6 shows the subsample sizes, means, standard deviations, and corrected effect sizes for each of the two kindergarten alternatives across all dependent measures. The only effect size estimate whose magnitude approaches Cohen’s (1988) standard for minimal practical significance (.25) is the first one reported in Table 6 (.44). That effect size indicates that FDK confers a small-to-moderate advantage on reading ability at the end of the kindergarten experience. At the beginning and end of first grade, that advantage is no longer practically significant, although it is still positive. Beginning in second grade, the advantage in reading and mathematics is neither practically significant nor positive for FDK students.

Follow-Up Interviews

As a follow-up to our analyses, we interviewed the four kindergarten teachers in January 2004 for their views on (a) the kindergarten curriculum, (b) their perceived differences between FDK and HDK programming, and (c) their explanations for the findings that we observed between FDK and HDK students in reading and mathematics achievement. The teachers were women who had taught for 14, 9, 8, and 6 years, respectively. They had previously taught both FDK and HDK and had been teaching kindergarten at the elementary school research site for 10, 9, 6, and 4 years, respectively. Two of the teachers were still teaching kindergarten; the other two teachers were teaching second and sixth grades, respectively. One teacher admitted that she had a “half-day bias,” whereas another teacher was a “proponent of full-day kindergarten.”

All interviews consisted of open-ended questions and lasted between 30 min and 1 hr. The interviews were tape-recorded and transcribed and returned to the teachers for review. After approval of the transcripts, we coded the interviews by using constant comparative analytic techniques (Strauss & Corbin, 1994), which involved inductively identifying themes and developing written summaries.

When questioned about the differences between FDK and HDK, all teachers stated that they would have expected FDK students, in general, to perform better academically than HDK students at the end of kindergarten. They attributed the difference to the increased time that FDK students spent reviewing and practicing material. However, consistent with our findings, all teachers were equally doubtful that the differences would last. They believed that the academic disparity between FDK and HDK students would disappear during first through third grades. For example, one teacher stated, “That kids, by third grade, catch up or things kind of level out so I don’t think there’d be much of a difference.”

Although teachers agreed that the FDK advantage probably did not extend past early elementary education, their explanations for the ephemeral differences varied and fell into three general categories: (a) effects of differentiated instruction, (b) individual student development, and (c) individual student attributes.

Differentiated Instruction. All teachers, in various ways, suggested that differentiated instruction would need to occur in every grade subsequent to kindergarten to maintain, at least partially, the higher achievement levels evidenced by FDK students. When asked to define differentiated instruction, one teacher said:

What it means to me is that I need to meet that child where they are. I mean I need to have appropriate material and appropriate instruction for that child. . . . I need to make sure that they’re getting what they need where they are. . . . But, I think you need to set the bar pretty high and expect them to reach that; on the other hand, I think you need to not set it so high that you’re going to frustrate the kids that aren’t ready.

However, the kindergarten teachers recognized the challenges of using differentiated instruction and were careful not to place blame on first- through third-grade teachers. One teacher stated, “I’m not saying that not everyone does differentiated instruction. But I think that you have to be careful you don’t do too much whole group teaching to a group of kids that’s way past where they’re at.” Although all of the teachers agreed that differentiated instruction would be necessary to maintain differences after kindergarten, not all of them believed that this technique would be singularly sufficient. Some teachers believed strongly that the “leveling out” was predominantly a result of individual student development or student attributes, or both, rather than teaching methods.

Student Development. Two teachers felt that the leveling out of academic differences between FDK and HDK students by second grade resulted from natural developmental growth occurring after kindergarten. They explained:

You have kids that cannot hear a sound. They cannot hear, especially the vowel sounds. They are not ready to hear those. They are not mature enough to hear those sounds. You could go over that eight billion times and they just aren’t ready to hear those sounds. They go into first grade and they’ve grown up over the summer and . . . it clicks with them. And they might have been low in my class, but they get to first grade and they’re middle kids. They’ve kind of reached where their potential is.

I mean, there’s big developmental gap in K, 1, 2 and by third grade the kids that look[sic] behind, if they’re going to be average or normal, they catch-up by then. . . . Like some kids in second grade, they still
struggle with handwriting and reversal and by now it’s a red flag if they’re still doing that, developmentally everything should be fitting together in their little bodies and minds and they should be having good smooth handwriting and writing in the right direction. And if that’s not happening then that’s flag. And by third grade . . . if they’re not forming like an average student then there’s something else that needs to be looked at. So it’s a development thing and it’s just when kids are ready.

Yet, both of those teachers acknowledged that HDK students do have to work to catch up to FDK students, citing (a) less time spent on material, (b) differences in FDK and HDK teachers’ instructional philosophies, and (c) lack of familiarity with all-day school as disadvantages that HDK students must overcome in first grade to equal their FDK counterparts.

Student Attributes. A final explanation that teachers offered for the leveling out of differences suggested that individual student attributes accounted for student differences in subsequent grades. Three teachers believed that, no matter what kindergarten program students attended, their inherent level of academic ability or level of parent involvement, or both, were most important in eventually determining how individual students would compare with other students. For example,

I think they get to where their ability is, regardless of. . . . You can give them a good start and I think that can make a difference, but a high kid is going to be high whether they were in full or half. And those gray kids, you can give them a boost and they can be higher than maybe they would have been in half-day, you know you can give them a better start.

Thus, these three teachers believed that student attributes, such as inherent ability or degree of parent involvement in their schooling, would ultimately play a more significant role in how students would eventually compare with one another in second and third grades, regardless of whether they attended FDK or HDK programs.

Discussion

What can be determined about the effects of FDK versus HDK as a result of our analyses? Children who attend FDK can and do learn more through that experience than do their HDK counterparts. Nonetheless, the additional learning appears to decline rapidly, so much so that by the start of first grade, the benefits of FDK have diminished to a level that has little practical value. That effect was consistent across two measures of reading and one measure of mathematics. The effect also was consistent across gender, given that there was a gender by kindergarten-group interaction in only one of the analyses.

Our findings are consistent with past meta-analytic research (Fusaro, 1997) and high-quality empirical studies (e.g., Hough & Bryde, 1996) in that FDK confers initial benefits on academic achievement but that these benefits diminish relatively rapidly (OSLOEO, 1997). We are unclear why the rapid decline occurs, but we offer this insight from several school administrators and teachers with whom we interacted in our discussions of these data:

Teachers in the first few grades are so concerned with students who enter their classes [with] nonexistent reading and math skills that they spend the majority of their time bringing these students up to minimal math and reading criteria at the expense of working equally hard with students whose reading and math achievement are above average. Hence, the high-achieving students’ gains at the end of kindergarten gradually erode over the next few years with lack of attention.

We concur with Fusaro (1997) that districts must make their choices involving FDK with a full understanding of what the benefits may be for academic achievement and nonachievement outcomes. Our findings of initial gains place the onus of maintaining those gains on schools and teachers through their own internal policies, procedures, and will to sustain those gains.

Our study, of course, is not without limitations. We studied only one school, albeit over a relatively long period of time, with well-established measures and with reasonably well-equated groups. The greatest reservation we have about the generalizability of our findings clearly focuses on the predicted decline in long-term benefits of FDK for schools that make it a priority to ensure that teachers provide differentiated instruction to all students to advance each one as far as possible during the academic year rather than to move all students to a common set of expected learning at the end of the academic year. We recognize that school policies, procedures, and culture play important roles in the variability in student achievement, regardless of the skill levels of students entering first grade. Although our results will likely generalize to a wide variety of elementary school children, they also will likely generalize to those children who attend schools whose instructional policies and practices in the early grades are similar to the school in this study.

NOTE

The authors appreciate the thoughtful participation of Suzie Gunstream and the other elementary teachers whose invaluable practitioner insights helped us make sense of the findings.

REFERENCES

Aten, K. K., Foster, A., & Cobb, B. (1996). Lopez full-day kindergarten study. Fort Collins, CO: Research and Development Center for the Advancement of Student Learning.
Bramlett, R. K., Rowell, R. K., & Madenberg, K. (2000). Predicting first grade achievement from kindergarten screening measures: A comparison of child and family predictors. Research in Schools, 7, 1–9.

Clark, P., & Kirk, E. (2000). All-day kindergarten. Childhood Education, 76, 228–231.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cooper, T., Foster, A., & Cobb, B. (1998a). Half- or full-day kindergarten: Choices for parents in Poudre School District. Fort Collins, CO: Research and Development Center for the Advancement of Student Learning.

Cooper, T., Foster, A., & Cobb, B. (1998b). Full- and half-day kindergarten: A study of six elementary schools. Fort Collins, CO: Research and Development Center for the Advancement of Student Learning.

Cryan, J. R., Sheehan, R., Wiechel, J., & Bandy-Hedden, I. G. (1992). Success outcomes of all-day kindergarten: More positive behavior and increased achievement in the years after. Early Childhood Research Quarterly, 7, 187–203.

Daly, E. J. III, Wright, J. A., Kelly, S. Q., & Martens, B. K. (1997). Measures of early academic reading skills: Reliability and validity with a first grade sample. School Psychology Quarterly, 12, 268–280.

Ehri, L. C. (1998). Grapheme-phoneme knowledge is essential for learning to read words in English. In J. L. Metsala & L. C. Ehri (Eds.), Word recognition in beginning reading (pp. 3–40). Mahwah, NJ: Erlbaum.

Elicker, J., & Mathur, S. (1997). What do they do all day? Comprehensive evaluation of a full-day kindergarten. Early Childhood Research Quarterly, 12, 459–480.

Evansville-Vanderburgh School Corporation. (1998). A longitudinal study of the consequences of full-day kindergarten: Kindergarten through grade eight. Evansville, IN: Author.

Field, A. (2000). Discovering statistics using SPSS for Windows. London: Sage.

Fuchs, L. S., Fuchs, D., Hosp, M. K., & Jenkins, J. R. (2001). Oral fluency as an indicator of reading competence: A theoretical, empirical, and historical analysis. Scientific Studies of Reading, 5, 239–256.

Fuchs, L. S., Fuchs, D., & Maxwell, L. (1988). The validity of informal measures of reading comprehension. Remedial and Special Education, 9, 20–28.

Fusaro, J. A. (1997). The effect of full-day kindergarten on student achievement: A meta-analysis. Child Study Journal, 27, 269–277.

Gullo, D. F., Bersani, C. U., Clements, D. H., & Bayless, K. M. (1986). A comparative study of “all-day”, “alternate-day”, and “half-day” kindergarten schedules: Effects on achievement and classroom social behaviors. Journal of Research in Childhood Education, 1, 87–94.

Gullo, D. F., & Clements, D. H. (1984). The effects of kindergarten schedule on achievement, classroom behavior, and attendance. The Journal of Educational Research, 78, 51–56.

Gullo, D. F., & Maxwell, C. B. (1997). The effects of different models of all-day kindergarten on children’s developmental competence. Early Child Development and Care, 139, 119–128.

Hintze, J. M., Shapiro, E. S., Conte, K. L., & Basile, I. A. (1997). Oral reading fluency and authentic reading material: Criterion validity of the technical features of CBM survey-level assessment. School Psychology Review, 26, 535–553.

Holmes, C. T., & McConnell, B. M. (1990, April). Full-day versus half-day kindergarten: An experimental study. Paper presented at the annual meeting of the American Educational Research Association, Boston, MA.

Hough, D., & Bryde, S. (1996, April). The effects of full-day kindergarten on student achievement and affect. Paper presented at the annual meeting of the American Educational Research Association, New York.

Jenkins, J. R., & Jewell, M. (1993). Examining the validity of two measures for formative teaching: Reading aloud and maze. Exceptional Children, 59, 421–432.

Karweit, N. L. (1992). The kindergarten experience. Educational Leadership, 49(6), 82–86.

Karweit, N. L. (1994). Issues in kindergarten organization and curriculum. In R. E. Slavin, N. L. Karweit, & B. A. Wasik (Eds.), Preventing early school failure: Research, policy and practice. Needham Heights, MA: Allyn & Bacon.

Kirby, J. R., & Parrila, R. K. (1999). Theory-based prediction of early reading. The Alberta Journal of Educational Research, 45, 428–447.

Koopmans, M. (1991). A study of the longitudinal effects of all-day kindergarten attendance on achievement. Newark, NJ: Board of Education, Office of Research, Evaluation, and Testing.

Kranzler, J. H., Brownell, M. T., & Miller, M. D. (1998). The construct validity of curriculum-based measurement of reading: An empirical test of a plausible rival hypothesis. Journal of School Psychology, 36, 399–415.

Kurdek, L. A., & Sinclair, R. J. (2001). Predicting reading and mathematics achievement in fourth-grade children from kindergarten readiness scores. Journal of Educational Psychology, 93, 451–455.

Lore, R. (1992). Language development component: Full-day kindergarten program 1990–1991 final evaluation report. Columbus, OH: Columbus Public Schools, Department of Program Evaluation.

McBride-Chang, C. (1999). The ABC’s of the ABCs: The development of letter-name and letter-sound knowledge. Merrill-Palmer Quarterly, 45, 285–308.

McConnell, B. B., & Tesch, S. (1986). Effectiveness of kindergarten scheduling. Educational Leadership, 44(3), 48–51.
National Center for Education Statistics. (2000). America’s kindergartens. Washington, DC: Author.

Nelson, R. F. (2000). Which is the best kindergarten? Principal, 79(5), 38–41.

Northwest Evaluation Association. (2003). Reliability estimates and validity evidence for achievement level tests and measures of academic progress. Retrieved March 30, 2003, from http://www.nwea.org/Research/NorthingStudy.htm

Nunnally, J. (1996). The impact of half-day versus full-day kindergarten programs on student outcomes: A pilot project. New Albany, IN: Elementary Education Act Title I. (ERIC Document Reproduction Service No. ED396857)

Ohio State Legislative Office of Education Oversight. (1997). An overview of full-day kindergarten. Columbus, OH: Author.

Olsen, D., & Zigler, E. (1989). An assessment of the all-day kindergarten movement. Early Childhood Research Quarterly, 4, 167–186.

Parker, R., Hasbrouck, J. E., & Tindal, G. (1992). Greater validity for oral reading fluency: Can miscues help? The Journal of Special Education, 25, 492–503.

Peck, J. T., McCaig, G., & Sapp, M. E. (1988). Kindergarten policies: What is best for children? Washington, DC: National Association for the Education of Young Children.

Puleo, V. T. (1988). A review and critique of research on full-day kindergarten. The Elementary School Journal, 88, 427–439.

Stage, S. A., Sheppard, J., Davidson, M. M., & Browning, M. M. (2001). Prediction of first-graders’ growth in oral reading fluency using kindergarten letter fluency. The Journal of School Psychology, 39, 225–237.

Strauss, A., & Corbin, J. (1994). Grounded theory methodology: An overview. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 273–285). Thousand Oaks, CA: Sage.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: Harper & Row.

Treiman, R., Tincoff, R., Rodriguez, K., Mouzaki, A., & Francis, D. J. (1998). The foundations of literacy: Learning the sounds of letters. Child Development, 69, 1524–1540.

Wang, Y. L., & Johnstone, W. G. (1999). Evaluation of a full-day kindergarten program. ERS Spectrum, 17(2), 27–32.
Note: “Comparing Longitudinal Academic Achievement of Full-Day and Half-Day Kindergarten Students,” by J. R. Wolgemuth,
R. B. Cobb, & M. A. Winokur, The Journal of Educational Research 99(5), pp. 260–269, 2006. Reprinted with permission of the Helen
Dwight Reid Educational Foundation. Published by Heldref Publications, 1319 Eighteenth St., NW, Washington, DC 20036–1802,
Copyright © 2006.
CHAPTER TEN
“When well conducted, experimental
studies produce the soundest evidence
concerning cause-effect relations.” (p. 251)
Experimental Research
LEARNING OUTCOMES
After reading Chapter 10, you should be able to do the following:
1. Briefly define and state the purpose of experimental research.
2. Briefly define the threats to validity in experimental research.
3. Define and provide examples of group experimental designs.
These outcomes form the basis for the following task, which requires you to develop
the method section of a research report for an experimental study.
TASK 6D
For a quantitative study, you have created research plan components (Tasks 2, 3A),
described a sample (Task 4A), and considered appropriate measuring instruments
(Task 5). If your study involves experimental research, now develop the method sec-
tion of a research report. Include a description of participants, data collection meth-
ods, and research design (see Performance Criteria at the end of Chapter 11, p. 305).
RESEARCH METHODS SUMMARY
Experimental Research
Definition In experimental research the researcher manipulates at least one independent
variable, controls other relevant variables, and observes the effect on one or more
dependent variables.
Design(s) An experiment typically involves a comparison of two groups (although some
experimental studies have only one group or even three or more groups). The
experimental comparison is usually one of three types: (1) comparison of two
different approaches (A versus B), (2) comparison of a new approach and the
existing approach (A versus no A), and (3) comparison of different amounts of a
single approach (a little of A versus a lot of A).
Group experimental designs include: pre-experimental designs (the one-shot case
study, the one-group pretest–posttest design, and the static-group comparison), true
experimental designs (the pretest–posttest control group design, the posttest-only
control group design, and the Solomon four-group design), quasi-experimental designs
(the nonequivalent control group design, the time-series design, the counterbalanced
designs), and factorial designs.
250 CHAPTER 10 • EXPERIMENTAL RESEARCH
Types of appropriate research questions In educational experimental research, the types
of research questions are often focused on independent variables including method of
instruction, type of reinforcement, arrangement of learning environment, type of
learning materials, and length of treatment.
Key characteristics • The manipulation of an independent variable is the primary characteristic that
differentiates experimental research from other types of research.
• An experimental study is guided by at least one hypothesis that states an
expected causal relation between two variables.
• In an experiment, the group that receives the new treatment is called the
experimental group, and the group that receives a different treatment or is
treated as usual is called the control group.
• The use of randomly formed treatment groups is a unique characteristic
of experimental research.
Steps in the process 1. Select and define a problem.
2. Select participants and measuring instruments.
3. Prepare a research plan.
4. Execute procedures.
5. Analyze the data.
6. Formulate conclusions.
Potential challenges • Experimental studies in education often suffer from two problems: a lack
of sufficient exposure to treatments and failure to make the treatments
substantially different from each other.
• An experiment is valid if results obtained are due only to the manipulated
independent variable and if they are generalizable to individuals or
contexts beyond the experimental setting. These two criteria are referred to,
respectively, as the internal validity and external validity of an experiment.
• Threats to internal validity include history, maturation, testing, instrumentation,
statistical regression, differential selection of participants, mortality, selection–
maturation interactions, and other interactive effects.
• Threats to external validity include pretest–treatment interaction, multiple-
treatment interference, selection–treatment interaction, specificity of variables,
treatment diffusion, experimenter effects, and reactive arrangements.
Example What are the differential effects of two problem-solving instructional approaches
(schema-based instruction and general strategy instruction) on the mathematical
word problem-solving performance of 22 middle school students who had learning
disabilities or were at risk for mathematics failure?
EXPERIMENTAL RESEARCH: DEFINITION AND PURPOSE

Experimental research is the only type of research that can test hypotheses to establish cause–effect relations. It represents the strongest chain of reasoning about the links between variables. In experimental research the researcher manipulates at least one independent variable, controls other relevant variables, and observes the effect on one or more dependent variables. The researcher determines “who gets what”; that is, the researcher has control over the selection and assignment of groups to treatments. The manipulation of an independent variable is the primary characteristic that differentiates experimental research from other types of research. The independent variable, also called the treatment, causal, or experimental variable, is that treatment or characteristic believed to make a difference. In educational research, independent variables that are frequently manipulated include method of instruction, type of reinforcement, arrangement of learning environment, type of learning materials, and length of treatment. This list is by no means exhaustive. The dependent variable, also called the criterion, effect, or posttest variable, is the outcome of the study, the change or difference in groups that occurs as a result of the independent variable. It gets its name because it is dependent on the independent variable. The dependent variable may be measured by a test or some other quantitative measure (e.g., attendance, number of suspensions, time on task). The only restriction on the dependent variable is that it must represent a measurable outcome.

Experimental research is the most structured of all research types. When well conducted, experimental studies produce the soundest evidence concerning cause–effect relations. The results of experimental research permit prediction, but not the kind that is characteristic of correlational research. A correlational study predicts a particular score for a particular individual. Predictions based on experimental findings are more global and often take the form, “If you use Approach X, you will probably get different results than if you use Approach Y.” Of course, it is unusual for a single experimental study to produce broad generalization of results because any single study is limited in context and participants. However, replications of a study involving different contexts and participants often produce cause–effect results that can be generalized widely.

The Experimental Process

The steps in an experimental study are basically the same as in other types of research: selecting and defining a problem, selecting participants and measuring instruments, preparing a research plan, executing procedures, analyzing the data, and formulating conclusions. An experimental study is guided by at least one hypothesis that states an expected causal relation between two variables. The experiment is conducted to test the experimental hypothesis. In addition, in an experimental study, the researcher is in on the action from the very beginning, selecting the groups, deciding how to allocate treatment to the groups, controlling extraneous variables, and measuring the effect of the treatment at the end of the study.

It is important to note that the experimental researcher controls both the selection and the assignment of the research participants. That is, the researcher randomly selects participants from a single, well-defined population and then randomly assigns these participants to the different treatment conditions. This ability to select and assign participants to treatments randomly makes experimental research unique: the random assignment of participants to treatments, also called manipulation of the treatments, is the feature that distinguishes it from causal–comparative research. Experimental research has both random selection and random assignment, whereas causal–comparative research has only random selection, not assignment, because random assignment to a treatment from a single population is not possible in causal–comparative studies. Rather, participants in causal–comparative studies are obtained from different, already-existing populations.

An experiment typically involves a comparison of two groups (although some experimental studies have only one group or even three or more groups). The experimental comparison is usually one of three types: (1) comparison of two different approaches (A versus B), (2) comparison of a new approach and the existing approach (A versus no A), and (3) comparison of different amounts of a single approach (a little of A versus a lot of A). An example of an A versus B comparison is a study that compares the effects of a computer-based approach to teaching first-grade reading to a teacher-based approach. An example of an A versus no A comparison is a study that compares a new handwriting method to the classroom teachers’ existing approach. An example of a little of A versus a lot of A comparison is a study that compares the effect of 20 minutes of daily science instruction on fifth graders’ attitudes toward science to the effect of 40 minutes of daily science instruction. Experimental designs are sometimes quite complex and may involve simultaneous manipulation of several independent variables. At this stage of the game, however, we recommend that you stick to just one!
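
The random selection and random assignment described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed procedure: the population of 200 student ID numbers, the sample size of 40, the group labels, and the fixed seed are all assumptions made for the example.

```python
import random

def select_and_assign(population, sample_size, treatments, seed=None):
    """Randomly select participants, then randomly assign them to treatments."""
    rng = random.Random(seed)  # seeded only so the illustration is repeatable
    # Random selection: draw the sample from one well-defined population.
    sample = rng.sample(population, sample_size)
    # Random assignment: shuffle the sample, then deal participants out in turn.
    rng.shuffle(sample)
    groups = {t: [] for t in treatments}
    for i, participant in enumerate(sample):
        groups[treatments[i % len(treatments)]].append(participant)
    return groups

# Hypothetical population: 200 student ID numbers.
population = list(range(1, 201))
groups = select_and_assign(population, 40, ["Approach A", "Approach B"], seed=1)
for name, members in groups.items():
    print(name, len(members), sorted(members)[:5])
```

In a causal–comparative study only the first step would be available, because the groups already exist before the study starts.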
In an experiment, the group that receives the new treatment is called (not surprisingly) the experimental group, and the group that receives a different treatment or is treated as usual is called the control group. A common misconception is that a control group always receives no treatment, but a group with no treatment would rarely provide a fair comparison. For example, if the independent variable were type of reading instruction, the experimental group may be instructed with a new method, and the control group may continue instruction with the method currently used. The control group would still receive reading instruction; members would not sit in a closet while the study was conducted; if they did, the study would be a comparison of the new method with no reading instruction at all. Any method of instruction is bound to be more effective than no instruction. An alternative to labeling the groups as control and experimental is to describe the treatments as comparison groups, treatment groups, or Groups A and B.

The groups that are to receive the different treatments should be equated on all variables that may influence performance on the dependent variable. For example, in the previous example, initial reading readiness should be very similar in each treatment group at the start of the study. The researcher must make every effort to ensure that the two groups are similar on all variables except the independent variable. The main way that groups are equated is through simple random or stratified random sampling.

After the groups have been exposed to the treatment for some period, the researcher collects data on the dependent variable from the groups and tests for a significant difference in performance. In other words, using statistical analysis, the researcher determines whether the treatment made a real difference. For example, suppose that at the end of an experimental study evaluating reading method, one group had an average score of 29 on a measure of reading comprehension and the other group had an average score of 27. Clearly the groups are different, but is a 2-point difference a meaningful difference, or is it just a chance difference produced by measurement error? Statistical analysis allows the researcher to answer this question with confidence.

Experimental studies in education often suffer from two problems: a lack of sufficient exposure to treatments and failure to make the treatments substantially different from each other. Regarding the first problem, no matter how effective a treatment is, it is not likely to be effective if students are exposed to it for only a brief period. To test a hypothesis concerning the effectiveness of a treatment adequately, an experimental group would need to be exposed to it long enough that the treatment has a chance to work (i.e., produce a measurable effect). Regarding the second problem (i.e., difference in treatments), it is important to operationalize the variables in such a way that the difference between groups is clear. For example, in a study comparing team teaching and traditional lecture teaching, team teaching must be operationalized in a manner that clearly differentiated it from the traditional method. If team teaching simply meant two teachers taking turns lecturing in the traditional way, it would not be very different from so-called traditional teaching and the researcher would be very unlikely to find a meaningful difference between the two study treatments.

Manipulation and Control

As noted several times previously, direct manipulation by the researcher of at least one independent variable is the characteristic that differentiates experimental research from other types of research. Manipulation of an independent variable is often a difficult concept to grasp. Quite simply, it means that the researcher selects the treatments and decides which group will get which treatment. For example, if the independent variable in a study were number of annual teacher reviews, the researcher may decide to form three groups, representing three levels of the independent variable: one group receiving no review, a second group receiving one review, and a third group receiving two reviews. Having selected research participants from a single, well-defined population (e.g., teachers at a large elementary school), the researcher would randomly assign participants to treatments. Independent variables that are manipulated by the experimenter are also known as active variables.

Control refers to the researcher’s efforts to remove the influence of any variable, other than the independent variable, that may affect performance on the dependent variable. In other words, in an experimental design, the groups should differ only on the independent variable. For example, suppose a researcher conducted a study to test whether student tutors are more effective than parent tutors in teaching
first graders to read. In this study, suppose the student tutors were older children from higher grade levels, and the parent tutors were members of the PTA. Suppose also that student tutors helped each member of their group for 1 hour per school day for a month, whereas the parent tutors helped each member of their group for 2 hours per week for a month. Finally, suppose the results of the study indicate that the student tutors produced higher reading scores than the parent tutors. Given this study design, concluding that student tutors are more effective than parent tutors would certainly not be fair. Participants with the student tutors received 2½ times as much help as that provided to the parents’ group (i.e., 5 hours per week versus 2 hours per week). Because this researcher did not control the time spent in tutoring, he or she has several possible conclusions: student tutors may in fact be more effective than parent tutors, longer periods of tutoring may be more effective than shorter periods regardless of type of tutor, or the combination of more time/student tutors may be more effective than the combination of less time/parent tutors. To make the comparison fair and interpretable, both students and parents should tutor for the same amount of time; in other words, time of tutoring must be controlled.

A researcher must consider many factors when attempting to identify and control extraneous variables. Some variables may be relatively obvious; for example, the researcher in the preceding study should control for reading readiness and prior reading instruction in addition to time spent tutoring. Some variables may not be as obvious; for example, both student and parent tutors should use similar reading texts and materials. Ultimately, two different kinds of variables need to be controlled: participant variables and environmental variables. A participant variable (such as reading readiness) is one on which participants in different groups in a study may differ; an environmental variable (such as learning materials) is a variable in the setting of the study that may cause unwanted differences between groups. A researcher should strive to ensure that the characteristics and experiences of the groups are as equal as possible on all important variables except the independent variable. If relevant variables can be controlled, group differences on the dependent variable can be attributed to the independent variable.

Control is not easy in an experiment, especially in educational studies, where human beings are involved. It certainly is a lot easier to control solids, liquids, and gases! Our task is not an impossible one, however, because we can concentrate on identifying and controlling only those variables that may really affect or interact with the dependent variable. For example, if two groups had significant differences in shoe size or height, such differences would probably not affect the results of most education studies. Techniques for controlling extraneous variables are presented later in this chapter.

THREATS TO EXPERIMENTAL VALIDITY

As noted, any uncontrolled extraneous variables affecting performance on the dependent variable are threats to the validity of an experiment. An experiment is valid if results obtained are due only to the manipulated independent variable and if they are generalizable to individuals or contexts beyond the experimental setting. These two criteria are referred to, respectively, as the internal validity and external validity of an experiment.

Internal validity is the degree to which observed differences on the dependent variable are a direct result of manipulation of the independent variable, not some other variable. In other words, an examination of internal validity focuses on threats or rival explanations that influence the outcomes of an experimental study but are not due to the independent variable. In the example of student and parent tutors, a plausible threat or rival explanation for the research results is the difference in the amount of tutoring time. The degree to which experimental research results are attributable to the independent variable and not to another rival explanation is the degree to which the study is internally valid.

External validity, also called ecological validity, is the degree to which study results are generalizable, or applicable, to groups and environments outside the experimental setting. In other words, an examination of external validity focuses on threats or rival explanations that disallow the results of a study to be generalized to other settings or groups. A study conducted with groups of gifted ninth graders, for example, should produce results that are applicable to other groups of gifted ninth graders. If research results were never generalizable outside the experimental setting, then no one could profit from research. An experimental study can contribute to
educational theory or practice only if its results and effects are replicable and generalize to other places and groups. If results cannot be replicated in other settings by other researchers, the study has low external, or ecological, validity.

So, all one has to do to conduct a valid experiment is to maximize both internal and external validity, right? Wrong. Unfortunately, a Catch-22 complicates the researcher’s experimental life. To maximize internal validity, the researcher must exercise very rigid controls over participants and conditions, producing a laboratory-like environment. However, the more a research situation is narrowed and controlled, the less realistic and generalizable it becomes. A study can contribute little to educational practice if techniques that are effective in a highly controlled setting are not also effective in a less controlled classroom setting. On the other hand, the more natural the experimental setting becomes, the more difficult it is to control extraneous variables. It is very difficult, for example, to conduct a well-controlled study in a classroom. Thus, the researcher must strive for balance between control and realism. If a choice is involved, the researcher should err on the side of control rather than realism¹ because a study that is not internally valid is worthless. A useful strategy to address this problem is to demonstrate an effect in a highly controlled environment (i.e., with maximum internal validity) and then redo the study in a more natural setting (i.e., to examine external validity). In the final analysis, however, the researcher must seek a compromise between a highly controlled and highly natural environment.

In the following pages we describe many threats to internal and external validity. Some extraneous variables are threats to internal validity, some are threats to external validity, and some may be threats to both. How potential threats are classified is not of great importance; what is important is that you be aware of their existence and how to control for them. As you read, you may begin to feel that there are just too many threats for a researcher to control. However, the task is not as formidable as it may at first appear because experimental designs can control many or most of the threats you are likely to encounter. Also, remember that each threat is a potential threat only; it may not be a problem in a particular study.

Threats to Internal Validity

Probably the most authoritative source on experimental design and threats to experimental validity is the work of Donald Campbell, in collaboration with Julian Stanley and Thomas Cook.² They identified eight main threats to internal validity: history, maturation, testing, instrumentation, statistical regression, differential selection of participants, mortality, and selection–maturation interaction, which are summarized in Table 10.1. However, before describing these threats to internal validity, we note the role of experimental research in overcoming these threats. You are not rendered helpless when faced with them. Quite the contrary, the use of random selection of participants, the researcher’s assignment of participants to treatments, and control of other variables are powerful approaches to overcoming the threats. As you read about the threats, note how random selection and assignment to treatments can control most threats.

History

When discussing threats to validity, history refers to any event occurring during a study that is not part of the experimental treatment but may affect the dependent variable. The longer a study lasts, the more likely it is that history will be a threat. A bomb scare, an epidemic of measles, or global current events are examples of events that may produce a history effect. For example, suppose you conducted a series of in-service workshops designed to increase the morale of teacher participants. Between the time you conducted the workshops and the time you administered a posttest measure of morale, the news media announced that, due to state-level budget problems, funding to the local school district was to be significantly reduced, and promised pay raises for teachers would likely be postponed. Such an event could easily wipe out any effect the workshops may have had, and posttest morale scores may well be considerably lower than they otherwise may have been (to say the least!).

¹ This is a clear distinction between the emphases of quantitative and qualitative research.
² Experimental and Quasi-Experimental Designs for Research, by D. T. Campbell and J. C. Stanley, 1971, Chicago: Rand McNally; Quasi-Experimentation: Design and Analysis Issues for Field Settings, T. D. Cook and D. T. Campbell, 1979, Chicago: Rand McNally.
TABLE 10.1 • Threats to internal validity
Threat: Description
History: Unexpected events occur between the pre- and posttest, affecting the dependent variable.
Maturation: Changes occur in the participants, from growing older, wiser, more experienced, etc., during the study.
Testing: Taking a pretest alters the result of the posttest.
Instrumentation: The measuring instrument is changed between pre- and posttesting, or a single measuring instrument is unreliable.
Statistical regression: Extremely high or extremely low scorers tend to regress to the mean on retesting.
Differential selection of participants: Participants in the experimental and control groups have different characteristics that affect the dependent variable differently.
Mortality: Different participants drop out of the study in different numbers, altering the composition of the treatment groups.
Selection–maturation interaction: The participants selected into treatment groups have different maturation rates. Selection interactions also occur with history and instrumentation.
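
The statistical regression row in Table 10.1 can be illustrated with a small simulation. The sketch below assumes, purely for illustration, 1,000 simulated test takers who guess on every item of a 100-item, four-choice test, so each has an expected score of 25 on any administration. The number of test takers, the seed, and the choice of the 30 lowest scorers are assumptions of the example, not data from any study.

```python
import random
from statistics import mean

def guessing_score(rng, items=100, choices=4):
    """Score of one test taker who guesses on every item."""
    return sum(1 for _ in range(items) if rng.randrange(choices) == 0)

rng = random.Random(42)   # fixed seed so the run is repeatable
n_students = 1000         # hypothetical number of test takers

pretest = [guessing_score(rng) for _ in range(n_students)]
posttest = [guessing_score(rng) for _ in range(n_students)]  # no instruction in between

# Select the 30 lowest pretest scorers, as a researcher studying poor performers might.
lowest = sorted(range(n_students), key=lambda i: pretest[i])[:30]

pre_mean = mean(pretest[i] for i in lowest)
post_mean = mean(posttest[i] for i in lowest)
print(f"Lowest 30, pretest mean:  {pre_mean:.1f}")
print(f"Lowest 30, posttest mean: {post_mean:.1f}")
```

The selected group's posttest mean moves back toward 25 even though nothing was taught between the two administrations.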
Maturation

Maturation refers to physical, intellectual, and emotional changes that naturally occur within individuals over a period of time. In a research study, these changes may affect participants’ performance on a measure of the dependent variable. Especially in studies that last a long time, participants become older and perhaps more coordinated, less coordinated, unmotivated, anxious, or just plain bored. Maturation is more likely to be a problem in a study designed to test the effectiveness of a psychomotor training program on 3-year-olds than in a study designed to compare two methods of teaching algebra. Young participants typically undergo rapid biological changes, raising the question of whether changes on the dependent variable are due to the training program or to maturation.

Testing

Testing, also called pretest sensitization, refers to the threat of improved performance on a posttest that results from a pretest. In other words, simply taking a pretest may improve participants’ scores on a posttest, regardless of whether they received any treatment or instruction in between. Testing is more likely to be a threat when the time between the tests is short; a pretest taken in September is not likely to affect performance on a posttest taken in June. The testing threat to internal validity is most likely to occur in studies that measure factual information that can be recalled. For example, taking a pretest on solving algebraic equations is less likely to improve posttest performance than taking a pretest on multiplication facts would.

Instrumentation

The instrumentation threat refers to unreliability, or lack of consistency, in measuring instruments that may result in an invalid assessment of performance. Instrumentation may threaten validity in several different ways. A problem may occur if the researcher uses two different tests, one for pretesting and one for posttesting, and the tests are not of equal difficulty. For example, if the posttest is more difficult than the pretest, improvement may be masked. Alternatively, if the posttest is less difficult than the pretest, it may indicate improvement that is not really present. If data are collected through observation, the observers may not be observing or evaluating behavior in the same way at the end of the study as at the beginning. In fact, if they are aware of the nature of the study, they may record only behavior that supports the researcher’s hypothesis. If data are collected through the use of a mechanical device, the device may be poorly calibrated, resulting in inaccurate measurement. Thus, the researcher must take care in selecting tests, observers, and mechanical devices to measure the dependent variable.
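
The effect of unequal test difficulty described above can be shown with simple arithmetic. The sketch below assumes a hypothetical class whose true skill rises by 5 points and models test difficulty as a constant point offset; both the numbers and the additive model are illustrative assumptions, not part of the text.

```python
def observed_gain(true_pre, true_post, pre_offset, post_offset):
    """Observed pretest-posttest gain when each test shifts scores by a fixed offset.

    A negative offset models a harder test; a positive offset models an easier one.
    """
    observed_pre = true_pre + pre_offset
    observed_post = true_post + post_offset
    return observed_post - observed_pre

true_pre, true_post = 60, 65   # hypothetical true skill: a real 5-point improvement

equal = observed_gain(true_pre, true_post, pre_offset=0, post_offset=0)
harder_post = observed_gain(true_pre, true_post, pre_offset=0, post_offset=-5)
easier_post = observed_gain(60, 60, pre_offset=0, post_offset=5)  # no real improvement

print("Equal difficulty:", equal)
print("Harder posttest:", harder_post)
print("Easier posttest, no real change:", easier_post)
```

Under these assumptions the harder posttest hides the real 5-point gain (observed gain of 0), and the easier posttest shows a 5-point gain where none occurred.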
256 CHAPTER 10 • EXPERIMENTAL RESEARCH
Statistical Regression English classes to participate in your study. You
have no guarantee that the two classes are equiva-
Statistical regression usually occurs in studies lent. If your luck is really bad, one class may be
where participants are selected on the basis of their the honors English class and the other class may
extremely high or extremely low scores. Statistical be the remedial English class—it would not be
regression is the tendency of participants who too surprising if the honors class did much better
score highest on a test (e.g., a pretest) to score on the posttest! Already-formed groups should be
lower on a second, similar test (e.g., a posttest) avoided if possible; when they are included in a
and of participants who score lowest on a pre- study, the researcher should select groups that
test to score higher on a posttest. The tendency are as similar as possible and should administer a
is for scores to regress, or move, toward a mean pretest to check for initial equivalence.
(i.e., average) or expected score. Thus, extremely
high scorers regress (i.e., move lower) toward the Mortality
mean, and extremely low scorers regress (i.e., move
higher) toward the mean. For example, suppose First, let us make it perfectly clear that the mortality
a researcher wanted to test the effectiveness of a threat is usually not related to participants dying!
new method of instruction on the spelling ability Mortality, or attrition, refers to a reduction in the
of poor spellers. The researcher could administer a number of research participants; this reduction oc-
100-item, 4-alternative, multiple-choice spelling curs over time as individuals drop out of a study.
pretest, with questions reading, “Which of the Mortality creates problems with validity particularly
following four words is spelled incorrectly?” The when different groups drop out for different rea-
researcher could then select for the study the 30 stu- sons and with different frequency. A researcher can
dents who scored lowest. However, perhaps none assess the mortality of groups by obtaining demo-
of the students knew any of the words and guessed graphic information about the participant groups
on every question. With 100 items, and 4 choices before the start of the study and then determining
for each item, a student would be expected to re- if the makeup of the groups has changed at the end
ceive a score of 25 just by guessing. Some students, of the study.
however, just due to rotten guessing, would receive
scores much lower than 25, and other students, A change in the characteristics of the groups
equally by chance, would receive much higher due to mortality can have a significant effect on
scores than 25. If all these students took the test the results of the study. For example, participants
a second time, without any instruction interven- who drop out of a study may be less motivated or
ing, their expected scores would still be 25. Thus, less interested in the study than those who remain.
students who scored very low the first time would This type of attrition frequently occurs when the
be expected to have a second score closer to 25, participants are volunteers or when a study com-
and students who scored very high the first time pares a new treatment to an existing treatment.
would also be expected to score closer to 25 the Participants rarely drop out of control groups or
second time. Whenever participants are selected on existing treatments because few or no additional
the basis of their extremely high or extremely low demands are made on them. However, volunteers
performance, statistical regression is a viable threat or participants using the new, experimental treat-
to internal validity. ment may drop out because too much effort is re-
quired for participation. The experimental group
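The guessing example in the statistical regression discussion can be checked with a short simulation. The sketch below is an illustration only, under stated assumptions: the pool of 300 students, the seed, and the names `guess_score` and `simulate` are hypothetical and not from the text. Every student guesses blindly on both administrations of the 100-item, 4-alternative test, so the expected score each time is 100 × 1/4 = 25, and the 30 lowest pretest scorers are selected as in the example.

```python
import random


def guess_score(rng, n_items=100, n_choices=4):
    """Score of a student who guesses blindly on every item."""
    return sum(1 for _ in range(n_items) if rng.randrange(n_choices) == 0)


def simulate(n_students=300, n_selected=30, seed=0):
    """Return (pretest mean, posttest mean) for the lowest pretest scorers.

    n_students is an assumption; the text only says that the 30
    lowest-scoring students are selected.
    """
    rng = random.Random(seed)
    pretest = [guess_score(rng) for _ in range(n_students)]
    posttest = [guess_score(rng) for _ in range(n_students)]  # no instruction in between

    # Select the students who scored lowest on the pretest.
    lowest = sorted(range(n_students), key=lambda i: pretest[i])[:n_selected]

    pre_mean = sum(pretest[i] for i in lowest) / n_selected
    post_mean = sum(posttest[i] for i in lowest) / n_selected
    return pre_mean, post_mean


pre_mean, post_mean = simulate()
print(f"Selected group, pretest mean:  {pre_mean:.1f}")
print(f"Selected group, posttest mean: {post_mean:.1f}")
```

With no treatment at all, the selected group's posttest mean should land near 25, well above its pretest mean. A researcher who had taught these students between the two tests could easily mistake that movement for a treatment effect.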
Differential Selection of Participants that remains at the end of the study then repre-
sents a more motivated group than the control
Differential selection of participants is the se- group. As another example of mortality, suppose
lection of subjects who have differences before the Suzy Shiningstar (a high-IQ–and-all-that student)
start of a study that may at least partially account got the measles and dropped out of your control
for differences found in a posttest. The threat that group. Before Suzy dropped out, she managed to
the groups are different before the study begins infect her friends in the control group. Because birds
is more likely when a researcher is comparing of a feather often flock together, Suzy’s control-
already-formed groups. Suppose, for example, you group friends may also be high-IQ–and-all-that
receive permission to invite two of Ms. Hynee’s students. The experimental group may end up
looking pretty good when compared to the control group simply because many of the top students dropped out of the control group. The researcher cannot assume that participants drop out of a study in a random fashion and should, if possible, select a design that controls for mortality. For example, one way to reduce mortality is to provide some incentive to participants to remain in the study. Another approach is to identify the kinds of participants who drop out of the study and remove similar participants from the other groups in equal numbers.

Selection–Maturation Interaction and Other Interactive Effects

The effects of differential selection may also interact with the effects of maturation, history, or testing, with the resulting interaction threatening internal validity. In other words, if already-formed groups are included in a study, one group may profit more (or less) from a treatment or have an initial advantage (or disadvantage) because of maturation, history, or testing factors. The most common of these interactive effects is selection–maturation interaction, which exists if participants selected into the treatment groups matured at different rates during the study. For example, suppose that you received permission to include two of Ms. Hynee’s English classes in your study; both classes are average and apparently equivalent on all relevant variables. Suppose, however, that for some reason Ms. Hynee had to miss one of her classes but not the other (maybe she had to have a root canal) and Ms. Alma Mater took over Ms. Hynee’s class. As luck would have it, Ms. Mater proceeded to cover much of the material now included in your posttest (i.e., a problem with history). Unbeknownst to you, your experimental group would have a definite advantage, and this advantage, not the independent variable, may cause posttest differences in the dependent variable. A researcher must select a design that controls for potential problems such as this or make every effort to determine if they are operating in the study.

Threats to External Validity

Several major threats to external validity can limit generalization of experimental results to other populations. Building on the work of Campbell and Stanley, Bracht and Glass³ refined and expanded discussion of threats to external validity and classified these threats into two categories. Threats affecting “generalizing to whom” (that is, threats affecting the groups to which research results can be generalized) make up threats to population validity. Threats affecting “generalizing to what” (that is, threats affecting the settings, conditions, variables, and contexts to which results can be generalized) make up threats to ecological validity. The following discussion incorporates the contributions of Bracht and Glass into Campbell and Stanley’s (1971) conceptualizations; the threats to external validity are summarized later in this chapter in Table 10.2.

³ “The External Validity of Experiments,” by G. H. Bracht and G. V. Glass, 1968, American Educational Research Journal, 5, pp. 437–474.

Pretest–Treatment Interaction

Pretest–treatment interaction occurs when participants respond or react differently to a treatment because they have been pretested. Pretesting may sensitize or alert subjects to the nature of the treatment, potentially making the treatment effect different from what it would have been had subjects not been pretested. Campbell and Stanley illustrated this effect by pointing out the probable differences between two groups: participants who view the antiprejudice film Gentleman’s Agreement after taking a lengthy pretest dealing with anti-Semitism and participants who view the movie without a pretest. Individuals not pretested could conceivably enjoy the movie as a good love story, unaware that it deals with a social issue. Individuals who had taken the pretest, in contrast, may be much more likely to see a connection between the pretest and the message of the film. If pretesting affects participants’ responses on the dependent measure, the research results are generalizable only to other pretested groups; the results are not even generalizable to the population from which the sample was selected.

For some studies the potential interactive effect of a pretest is a more serious consideration than for others. For example, taking a pretest on algebraic algorithms would probably have very little impact on a group’s responsiveness to a new method of teaching algebra, but studies involving self-report
measures, such as attitude scales and interest inventories, are especially susceptible to this threat. The pretest–treatment interaction is also minimal in studies involving very young children, who would probably not see or remember a connection between the pretest and the subsequent treatment. Similarly, for studies conducted over a period of months or longer, the effects of the pretest would probably have worn off or be greatly diminished by the time a posttest is given.

When a study is threatened by pretest–treatment interaction, researchers should select a design that either controls for the threat or allows the researchers to determine the magnitude of the effect. For example, the researcher can (if it’s feasible) make use of unobtrusive measures, that is, ways to collect data that do not intrude on or require interaction with research participants, such as reviewing school records, transcripts, and other written sources.

TABLE 10.2 • Threats to external validity

Threat: Description
Pretest–treatment interaction: The pretest sensitizes participants to aspects of the treatment and thus influences posttest scores.
Selection–treatment interaction: The nonrandom or volunteer selection of participants limits the generalizability of the study.
Multiple-treatment interference: When participants receive more than one treatment, the effect of prior treatment can affect or interact with later treatment, limiting generalizability.
Specificity of variables: Poorly operationalized variables make it difficult to identify the setting and procedures to which the variables can be generalized.
Treatment diffusion: Treatment groups communicate and adopt pieces of each other’s treatment, altering the initial status of the treatment’s comparison.
Experimenter effects: Conscious or unconscious actions of the researchers affect participants’ performance and responses.
Reactive arrangements: The fact of being in a study affects participants so that they act in ways different from their normal behavior. The Hawthorne and John Henry effects are reactive responses to being in a study.

Multiple-Treatment Interference

Sometimes the same research participants receive more than one treatment in succession. Multiple-treatment interference occurs when carryover effects from an earlier treatment make it difficult to assess the effectiveness of a later treatment. For example, suppose you were interested in comparing two different approaches to improving classroom behavior: behavior modification and corporal punishment (admittedly an extreme example we’re using to make a point!). For 2 months, behavior modification techniques were systematically applied to the participants, and at the end of this period you found behavior to be significantly better than before the study began. For the next 2 months, the same participants were physically punished (with hand slappings, spankings, and the like) whenever they misbehaved, and at the end of the 2 months behavior was as good as after the 2 months of behavior modification. Could you then conclude that behavior modification and corporal punishment are equally effective methods of behavior control? Certainly not. In fact, the goal of behavior modification is to produce self-maintaining behavior, that is, behavior that continues after direct intervention is stopped. The good behavior exhibited by the participants at the end of the corporal punishment period could well be due to the effectiveness of previous exposure to behavior modification; this good behavior could exist in spite of, rather than because of, exposure to corporal punishment. If it is not possible to select a design in which each group receives only one treatment, the researcher should try to minimize potential multiple-treatment interference by allowing sufficient time to elapse between treatments and by investigating distinctly different types of independent variables.
Multiple-treatment interference may also occur when participants who have already participated in a study are selected for inclusion in another, apparently unrelated study. If the accessible population for a study is one whose members are likely to have participated in other studies (e.g., psychology majors), then information on previous participation should be collected and evaluated before subjects are selected for the current study. If any members of the accessible population are eliminated from consideration because of previous research activities, a note should be made in the research report.

Selection–Treatment Interaction

Selection–treatment interaction, another threat to population validity, occurs when study findings apply only to the (nonrepresentative) groups involved and are not representative of the treatment effect in the extended population. This interaction occurs when study participants at one level of a variable react differently to a treatment than other potential participants in the population, at another level, would have reacted. For example, a researcher may conduct a study on the effectiveness of microcomputer-assisted instruction on the math achievement of junior high students. Classes available to the researcher (i.e., the accessible population) may represent an overall ability level at the lower end of the ability spectrum for all junior high students (i.e., the target population). If so, a positive effect shown by the participants in the sample may be valid only for lower ability students, rather than for the target population of all junior high students. Similarly, if microcomputer-assisted instruction appears ineffective for this sample, it may still be effective for the target population.

Selection–treatment interaction, like the problem of differential selection of participants associated with internal validity, mainly occurs when participants are not randomly selected for treatments, but this threat can occur in designs involving randomization as well, and the way a given population becomes available to a researcher may threaten generalizability, no matter how internally valid an experiment may be. For example, suppose that, in seeking a sample, a researcher is turned down by nine school systems before finally being accepted by a tenth. The accepting system is very likely to be different from the other nine systems and also from the population of schools to which the researcher would like to generalize the results. Administrators and instructional personnel in the tenth school may have higher morale, less fear of being inspected, or more zeal for improvement than personnel in the other nine schools. In the research report, researchers should describe any problems they encountered in acquiring participants, including the number of times they were turned down, so that the reader can judge the seriousness of a possible selection–treatment interaction.

Specificity of Variables

Like selection–treatment interaction, specificity of variables is a threat to generalizability of research results regardless of the particular experimental design. Any given study has specificity of variables; that is, the study is conducted with a specific kind of participant, using specific measuring instruments, at a specific time, and under a specific set of circumstances. We have discussed the need to describe research procedures in sufficient detail to permit another researcher to replicate the study. Such detailed descriptions also permit interested readers to assess how applicable findings are to their situations. When studies that supposedly manipulated the same independent variable get quite different results, it is often difficult to determine the reasons for the differences because researchers have not provided clear, operational descriptions of their independent variables. When operational descriptions are available, they often reveal that two independent variables with the same name were defined quite differently in the separate studies. Because such terms as discovery method, whole language, and computer-based instruction mean different things to different people, it is impossible to know what a researcher means by these terms unless they are clearly defined. Generalizability of results is also tied to the clear definition of the dependent variable, although in most cases the dependent variable is clearly operationalized as performance on a specific measure. When a researcher has a choice of measures to select from, he or she should address the comparability of these instruments and the potential limits on generalizability arising from their use.

Generalizability of results may also be affected by short- or long-term events that occur while the study is taking place. This threat is referred to as the interaction of history and treatment effects and
describes the situation in which events extraneous to the study alter the research results. Short-term, emotion-packed events, such as the firing of a superintendent, the release of district test scores, or the impeachment of a president may affect the behavior of participants. Usually, however, the researcher is aware of such happenings and can assess their possible impact on results, and accounts of such events should be included in the research report. The impact of long-term events, such as wars and economic depressions, however, is more subtle and tougher to evaluate.

Another threat to external validity is the interaction of time of measurement and treatment effect. This threat results from the fact that posttesting may yield different results depending on when it is done. A posttest administered immediately after the treatment may provide evidence for an effect that does not show up on a posttest given some time after treatment. Conversely, a treatment may have a long-term but not a short-term effect. The only way to assess the generalizability of findings over time is to measure the dependent variable at various times following treatment.

To summarize, to deal with the threats associated with specificity, the researcher must operationally define variables in a way that has meaning outside the experimental setting and must be careful in stating conclusions and generalizations.

Treatment Diffusion

Treatment diffusion occurs when different treatment groups communicate with and learn from each other. When participants in one treatment group know about the treatment received by a different group, they often borrow aspects from that treatment; when such borrowing occurs, the study no longer has two distinctly different treatments but rather has two overlapping ones. The integrity of each treatment is diffused. Often, the more desirable treatment (the experimental treatment or the treatment with additional resources) is diffused into the less desirable treatment. For example, suppose Mr. Darth’s and Ms. Vader’s classes were trying two different treatments to improve spelling. Mr. Darth’s class received videos, new and colorful spelling texts, and prizes for improved spelling. In Ms. Vader’s class, the students were asked to list words on the board, copy them into notebooks, use each word in a sentence, and study at home. After the first week of treatments, the students began talking to their teachers about the different spelling classes. Ms. Vader asked Mr. Darth if she could try the videos in her class, and her students liked them so well that she incorporated them into her spelling program. The diffusion of Mr. Darth’s treatment into Ms. Vader’s treatment produced two overlapping treatments that did not represent the initial intended treatments. To reduce treatment diffusion, a researcher may ask teachers who are implementing different treatments not to communicate with each other about the treatments until the study is completed or may carry out the study in more than one location, thus allowing only one treatment per school.

Experimenter Effects

Researchers themselves also present potential threats to the external validity of their own studies. A researcher’s influences on participants or on study procedures are known as experimenter effects. Passive experimenter effects occur as a result of characteristics or personality traits of the experimenter, such as gender, age, race, anxiety level, and hostility level. These influences are collectively called experimenter personal-attributes effects. Active experimenter effects occur when the researcher’s expectations of the study results affect his or her behavior and contribute to producing certain research outcomes. This effect is referred to as the experimenter bias effect. An experimenter may unintentionally affect study results, typically in the desired direction, simply by looking, feeling, or acting a certain way.

One form of experimenter bias occurs when the researcher affects participants’ behavior or is inaccurate in evaluating behavior because of previous knowledge of the participants. For example, suppose a researcher hypothesizes that a new reading approach will improve reading skills. If the researcher knows that Suzy Shiningstar is in the experimental group and that Suzy is a good student, she may give Suzy’s reading skills a higher rating than they actually warrant. This example illustrates another way a researcher’s expectations may contribute to producing those outcomes: Knowing or even believing that participants are in the experimental or the control group may cause the researcher unintentionally to evaluate their performances in a way consistent with the expectations for that group.

It is difficult to identify experimenter bias in a study, which is all the more reason for researchers to be aware of its consequences on the external
validity of a study. The researcher should strive to avoid communicating emotions and expectations to participants in the study. Additionally, experimenter bias effects can be reduced by blind scoring, in which the researcher doesn’t know whose performance is being evaluated.

Reactive Arrangements

Reactive arrangements, also called participant effects, are threats to validity that are associated with the way in which a study is conducted and the feelings and attitudes of the participants involved. As discussed previously, to maintain a high degree of control and obtain internal validity, a researcher may create an experimental environment that is highly artificial and not easily generalizable to nonexperimental settings; this is a reactive arrangement.

Another type of reactive arrangement results from participants’ knowledge that they are involved in an experiment or their feeling that they are in some way receiving special attention. The effect that such knowledge or feelings can have on the participants was demonstrated at the Hawthorne Plant of the Western Electric Company in Chicago some years ago. As part of a study to investigate the relation between various working conditions and productivity, researchers investigated the effect of light intensity on worker output. The researchers increased light intensity and production went up. They increased it some more and production went up some more. The brighter the place became, the more production rose. As a check, the researchers decreased the light intensity, and guess what, production went up! The darker it got, the more workers produced. The researchers soon realized that it was the attention given the workers, not the illumination, that was affecting production. To this day, the term Hawthorne effect is used to describe any situation in which participants’ behavior is affected not by the treatment per se but by their awareness of participating in a study.

A related reactive effect, known as compensatory rivalry or the John Henry effect, occurs when members of a control group feel threatened or challenged by being in competition with an experimental group and they perform way beyond what would normally be expected. Folk hero John Henry, you may recall, was a “steel drivin’ man” who worked for a railroad. When he heard that a steam drill was going to replace him and his fellow steel drivers, he challenged and set out to beat the machine. Through tremendous effort he managed to win the ensuing contest, dropping dead at the finish line. In the John Henry effect, research participants who are told that they will form the control group for a new, experimental method start to act like John Henry. They decide to challenge the new method by putting extra effort into their work, essentially saying (to themselves), “We’ll show them that our old ways are as effective as their newfangled ways!” By doing this, however, the control group performs atypically; their performance provides a rival explanation for the study results. When the John Henry effect occurs, the treatment under investigation does not appear to be very effective because posttest performance of the experimental group is not much (if at all) better than that of the control group.

As an antidote to the Hawthorne and John Henry effects, educational researchers often attempt to achieve a placebo effect. The term comes from medical researchers who discovered that any apparent medication, even sugar and water, could make subjects feel better; any beneficial effect caused by a person’s expectations about a treatment rather than the treatment itself became known as the placebo effect. To counteract this effect, a placebo approach was developed in which half the subjects in an experiment receive the true medication and half receive a placebo (e.g., sugar and water). The use of a placebo is, of course, not known by the participants; both groups think they are taking real medicine. The application of the placebo effect in educational research is that all groups in an experiment should appear to be treated the same. Suppose, for example, you have four groups of ninth graders, two experimental and two control, and the treatment is a film designed to promote a positive attitude toward a vocational career. If the experimental participants are to be excused from several classes to view the film, then the control participants should also be excused and shown another film whose content is unrelated to the purpose of the study (e.g., Drugs and You: Just Say No!). As an added control, all participants may be told that there are two movies and that eventually everyone will see both movies. In other words, it should appear as if all the students are doing the same thing.

Another reactive arrangement, or participant effect, is the novelty effect, which refers to the increased interest, motivation, or engagement participants develop simply because they are doing something different. In other words, a
treatment may be effective because it is different, not because it is better. To counteract the novelty effect, a researcher should conduct a study over a period of time long enough to allow the treatment novelty to wear off, especially if the treatment involves activities very different from the subjects’ usual routine.

Obviously there are many internal and external threats to the validity of an experimental (or causal–comparative) study. You should be aware of likely threats and strive to nullify them. One main way to overcome threats to validity is to choose a research design that controls for such threats. We examine some of these designs in the following sections.

GROUP EXPERIMENTAL DESIGNS

The validity of an experiment is a direct function of the degree to which extraneous variables are controlled. If such variables are not controlled, it is difficult to interpret the results of a study and to identify the groups to which results can be generalized. The term confounded is sometimes used to describe a situation in which the effects of the independent variable are so intertwined with those of extraneous variables that it becomes difficult to determine the unique effects of each. Experimental design strives to reduce this problem by controlling extraneous variables. Good designs control many sources that affect validity; poor designs control few.

As discussed in previous chapters, two types of extraneous variables in need of control are participant variables and environmental variables. Participant variables include both organismic variables and intervening variables. Organismic variables are characteristics of the participants that cannot be altered but can be controlled for; the sex of a participant is an example. Intervening variables intrude between the independent and the dependent variable and cannot be directly observed but can be controlled for; anxiety and boredom are examples.

Control of Extraneous Variables

Randomization is the best way to control for many extraneous variables simultaneously; this procedure is effective in creating equivalent, representative groups that are essentially the same on all relevant variables. The underlying rationale for randomization is that if subjects are assigned at random (by chance) to groups, there is no reason to believe that the groups will be greatly different in any systematic way. In other words, they should be about the same on participant variables such as ability, gender, or prior experience, and on environmental variables as well. If the groups are the same at the start of the study and if the independent variable makes no difference, the groups should perform essentially the same on the dependent variable. On the other hand, if the groups are the same at the start of the study but perform differently after treatment, the difference can be attributed to the independent variable.

As noted previously, the use of randomly formed treatment groups is a unique characteristic of experimental research; this control factor is not possible with causal–comparative research. Thus, randomization is used whenever possible: participants are randomly selected from a population and randomly assigned to treatment groups. If subjects cannot be randomly selected, those available should at least be randomly assigned. If participants cannot be randomly assigned to groups, then at least treatment conditions should be randomly assigned to the existing groups. Additionally, the larger the groups, the more confidence the researcher can have in the effectiveness of randomization. Randomly assigning 6 participants to two treatments is much less likely to equalize extraneous variables than randomly assigning 50 participants to two treatments.

To ensure random selection and assignment, researchers use tools such as a table of random numbers and other randomization methods that rely on chance. For example, a researcher could flip a coin or use odd and even numbers on a die to assign participants to two treatments; heads or an even number would signal assignment to Treatment 1, and tails or an odd number would signal assignment to Treatment 2.

If groups cannot be randomly formed, a number of other techniques can be used to try to equate groups. Certain environmental variables, for example, can be controlled by holding them constant for all groups. Recall the example of the student tutor versus parent tutor study. In that example, help time was an important variable that had to be held constant, that is, made the same for both groups for them to be fairly compared. Other environmental variables that may need to be held
constant include learning materials, prior exposure, is randomly assigned to one group and the other
meeting place and time (e.g., students may be more member to the other group. The next two highest
alert in the morning than in the afternoon), and ranked participants (i.e., third and fourth ranked)
years of teacher experience. are the second pair, and so on. The major advantage
of this approach is that no participants are lost. The
In addition, participant variables should be major disadvantage is that it is a lot less precise
held constant, if possible. Techniques to equate than pair-wise matching.
groups based on participant characteristics include
matching, comparing homogeneous groups or sub- Comparing Homogeneous
groups, participants serving as their own controls, Groups or Subgroups
and analysis of covariance.
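The chance-based assignment described earlier in this chapter (a coin flip, a die, or a table of random numbers) can also be done in software, and a simulation makes the point about group size concrete. The sketch below is illustrative only and rests on assumptions not found in the text: a shuffle-and-split stands in for the coin flip so that the two groups come out equal in size, the extraneous variable is an invented ability score drawn from a normal distribution with mean 100 and standard deviation 15, and the names `assign_by_shuffle` and `mean_gap` are hypothetical.

```python
import random
import statistics


def assign_by_shuffle(participants, rng):
    """Randomly split participants into two equal-sized treatment groups."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def mean_gap(n, rng, trials=2000):
    """Average absolute difference between group means on an extraneous
    variable after random assignment of n participants to two groups."""
    gaps = []
    for _ in range(trials):
        # Hypothetical ability scores; the distribution is an assumption.
        scores = [rng.gauss(100, 15) for _ in range(n)]
        group_1, group_2 = assign_by_shuffle(scores, rng)
        gaps.append(abs(statistics.mean(group_1) - statistics.mean(group_2)))
    return statistics.mean(gaps)


rng = random.Random(42)
gap_small = mean_gap(6, rng)
gap_large = mean_gap(50, rng)
print(f"Average gap between group means, 6 participants:  {gap_small:.1f}")
print(f"Average gap between group means, 50 participants: {gap_large:.1f}")
```

The gap for 6 participants should come out several times larger than the gap for 50, which is the sense in which larger groups give the researcher more confidence that randomization has equalized extraneous variables.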
Another previously discussed way to control an
Matching extraneous variable is to compare groups that are
homogeneous with respect to that variable. For
Matching is a technique for equating groups on one example, if IQ were an identified extraneous vari-
or more variables, usually ones highly related to able, the researcher may select only participants
performance on the dependent variable. The most with IQs between 85 and 115 (i.e., average IQ).
commonly used approach to matching involves The researcher would then randomly assign half
random assignment of pairs, one participant to the selected participants to the experimental group
each group. In other words, the researcher attempts and half to the control group. This procedure also
to find pairs of participants similar on the variable lowers the number of participants in the popula-
or variables to be controlled. If the researcher is tion and additionally restricts the generalizability
matching on gender, obviously the matched pairs of the findings to participants with IQs between
must be of the same gender. If the researcher is 85 and 115. As noted in the discussion of causal–
matching on variables such as pretest, GRE, or abil- comparative research, a similar, more satisfactory
ity scores, the pairing can be based on similarity of approach is to form different subgroups represent-
scores. Note, however, that unless the number of ing all levels of the control variable. For example,
participants is very large, it is unreasonable to try the available participants may be divided into sub-
to make exact matches or matches based on more groups with high (i.e., 116 and above), average
than one or two variables. (i.e., 85 to 115), and low (i.e., 84 and below) IQ.
Half the participants from each subgroup could
Once a matched pair is identified, one mem- then be randomly assigned to the experimental
ber of the pair is randomly assigned to one group and half to the control group. This proce-
treatment group and the other member to the dure should sound familiar; it describes stratified
other treatment group. A participant who does sampling. If the researcher is interested not just in
not have a suitable match is excluded from the controlling the variable but also in seeing if the in-
study. The resulting matched groups are identical dependent variable affects the dependent variable
or very similar with respect to the variable being differently at different levels of IQ, the best ap-
controlled. proach is to build the control variable right into the
design. Thus, the research design would have six
A major problem with such matching is that cells: two treatments by three IQ levels. Diagram
invariably some participants will not have a match the design for yourself, and label each cell with its
and must be eliminated from the study. One way treatment and IQ level.
to combat loss of participants is to match less strin-
gently. For example, the researcher may decide that Participants as Their Own Controls
if two ability test scores are within 20 points, they
constitute an acceptable match. This approach may When participants serve as their own controls, the
increase the number of subjects, but it can defeat design of the study involves a single group of par-
the purpose of matching if the criteria for a match ticipants who are exposed to multiple treatments,
are too broad. one at a time. This strategy helps to control for par-
ticipant differences because the same participants
A related matching procedure is to rank all get both treatments. In situations in which the
the participants from highest to lowest, based on
their scores on the variable to be matched. The
two highest ranking participants, regardless of raw
score, are the first pair. One member of the first pair
effect on the dependent variable disappears quickly after treatment, or in which a single participant is the focus of the research, participants can serve as their own controls.

This approach is not always feasible; you cannot teach the same algebraic concepts to the same group twice using two different methods of instruction (well, you could, but it would not make much sense). Furthermore, a problem with this approach is a carryover effect from one treatment to the next. To use a previous example, it would be very difficult to evaluate the effectiveness of corporal punishment for improving behavior if the group receiving corporal punishment were the same group that had previously been exposed to behavior modification. If only one group is available, a better approach, if feasible, is to divide the group randomly into two smaller groups, each of which receives both treatments but in a different order. The researcher could at least get some idea of the effectiveness of corporal punishment because one group would receive it before behavior modification.

Analysis of Covariance

The analysis of covariance is a statistical method for equating randomly formed groups on one or more variables. Analysis of covariance adjusts scores on a dependent variable for initial differences on some other variable, such as pretest scores, IQ, reading readiness, or musical aptitude. The covariate should be related to performance on the dependent variable.

Analysis of covariance is most appropriate when randomization is used; the results are weakened when a study deals with intact groups, uncontrolled variables, and nonrandom assignment to treatments. Nevertheless, in spite of randomization, the groups may still differ significantly prior to treatment. Analysis of covariance can be used in such cases to adjust posttest scores for initial pretest differences. However, the relation between the covariate and the dependent variable must be linear (i.e., represented by a straight line).

Types of Group Designs

The experimental design to a great extent dictates the specific procedures of a study. Selection of a given design influences factors such as whether a control group will be included, whether participants will be randomly selected and assigned to groups, whether the groups will be pretested, and how data will be analyzed. Particular combinations of such factors produce different designs that are appropriate for testing different types of hypotheses. In selecting a design, first determine which designs are appropriate for your study and for testing your hypothesis, then determine which of these are also feasible given the constraints under which you may be operating. If, for example, you must use existing groups, a number of designs will be automatically eliminated. From the designs that are appropriate and feasible, select the one that will control the most threats to internal and external validity and will yield the data you need to test your hypothesis or hypotheses. Designs vary widely in the degree to which they control various threats to internal and external validity, although no design can control for certain threats, such as experimenter bias.

There are two major classes of experimental designs: single-variable designs and factorial designs. A single-variable design is any design that involves one manipulated independent variable; a factorial design is any design that involves two or more independent variables, at least one of which is manipulated. Factorial designs can demonstrate relations that a single-variable design cannot. For example, a variable found not to be effective in a single-variable study may interact significantly with another variable.

Single-Variable Designs

Single-variable designs are classified as pre-experimental, true experimental, or quasi-experimental, depending on the degree of control they provide for threats to internal and external validity. Pre-experimental designs do not do a very good job of controlling threats to validity and should be avoided. In fact, the results of a study based on a pre-experimental design are so questionable they are not useful for most purposes except, perhaps, to provide a preliminary investigation of a problem. True experimental designs provide a very high degree of control and are always to be preferred. Quasi-experimental designs do not control as well as true experimental designs but do a much better job than the pre-experimental designs. The less useful designs are discussed here only so that you will know what not to do and so that you will recognize their use in published research reports and be appropriately critical of their findings.
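As a rough illustration of the analysis of covariance described earlier in this section, the following Python sketch adjusts each group's posttest mean for its pretest mean by using the pooled within-group slope. The scores, the group labels, and the function name adjusted_means are invented for illustration and do not come from any study. The sketch computes adjusted means only; it does not carry out the accompanying significance test.

```python
from statistics import mean


def adjusted_means(groups):
    """Return covariate-adjusted outcome means for each group.

    groups maps a group label to a list of (pretest, posttest) pairs.
    The adjustment uses the pooled within-group regression slope of
    posttest on pretest, which assumes a linear relation between them.
    """
    all_pretests = [x for pairs in groups.values() for x, _ in pairs]
    grand_pretest_mean = mean(all_pretests)

    # Pool the within-group sums of products and sums of squares.
    sum_xy = 0.0
    sum_xx = 0.0
    group_means = {}
    for label, pairs in groups.items():
        mx = mean([x for x, _ in pairs])
        my = mean([y for _, y in pairs])
        group_means[label] = (mx, my)
        sum_xy += sum((x - mx) * (y - my) for x, y in pairs)
        sum_xx += sum((x - mx) ** 2 for x, _ in pairs)
    slope = sum_xy / sum_xx

    # Shift each posttest mean to what it would be at the grand pretest mean.
    return {
        label: my - slope * (mx - grand_pretest_mean)
        for label, (mx, my) in group_means.items()
    }


# Hypothetical (pretest, posttest) scores for two groups.
scores = {
    "A": [(10, 20), (12, 22), (14, 24)],
    "B": [(14, 22), (16, 24), (18, 26)],
}
result = adjusted_means(scores)
print(result)
```

With these made-up scores, group B has the higher raw posttest mean (24 versus 22), but group A has the higher adjusted mean (24 versus 22) once group B's pretest advantage is taken into account.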
Pre-Experimental Designs

Here is a research riddle for you: Can you do an experiment with only one group? The answer is . . . yes, but not a really good one. As Figure 10.1 illustrates, none of the pre-experimental designs does a very good job of controlling extraneous variables that jeopardize validity.

The One-Shot Case Study. The one-shot case study involves a single group that is exposed to a treatment (X) and then posttested (O). No threats to validity are controlled in this design except those that are automatically controlled because they are irrelevant in this design (see Figure 10.1). The threats that are relevant, such as history, maturation, and mortality, are not controlled. Even if the research participants score high on the posttest, you cannot attribute their performance to the treatment because you do not know what they knew before you administered the treatment. If you have a choice between using this design and not doing a study, don’t do the study. Do a different study with a better controlled design.

The One-Group Pretest–Posttest Design. The one-group pretest–posttest design involves a single
FIGURE 10.1 • Sources of invalidity for pre-experimental designs

Internal sources of invalidity: History, Maturation, Testing, Instrumentation, Regression, Selection, Mortality, Selection Interactions.
External sources of invalidity: Pretest-X Interaction, Multiple-X Interference.

Designs | History | Maturation | Testing | Instrumentation | Regression | Selection | Mortality | Selection Interactions | Pretest-X Interaction | Multiple-X Interference
One-Shot Case Study | – | – | (+) | (+) | (+) | (+) | – | (+) | (+) | (+)
    X O
One-Group Pretest–Posttest Design | – | – | – | – | – | (+) | + | (+) | – | (+)
    O X O
Static-Group Comparison | + | – | (+) | (+) | (+) | – | – | – | (+) | (+)
    X1 O
    X2 O

Each line of Xs and Os represents a group.
Note: Symbols: X or X1 = unusual treatment; X2 = control treatment; O = test, pretest, or posttest; + = factor controlled for; (+) = factor controlled for because not relevant; and – = factor not controlled for.
Figures 10.1 and 10.2 basically follow the format used by Campbell and Stanley and are
presented with a similar note of caution: The figures are intended to be supplements to, not
substitutes for, textual discussions. You should not totally accept or reject designs because of their
pluses and minuses; you should be aware that the design most appropriate for a given study is
determined not only by the controls provided by the various designs but also by the nature of the
study and the setting in which it is to be conducted.
Although the symbols used in these figures, and their placement, vary somewhat from
Campbell and Stanley's format, the intent, interpretations, and textual discussions of the two
presentations are in agreement (personal communication with Donald T. Campbell, April 22, 1975).
group that is pretested (O), exposed to a treatment (X), and then tested again (O). The success of the treatment is determined by comparing pretest and posttest scores. This design controls some threats to validity not controlled by the one-shot case study, but a number of additional factors relevant to this design are not controlled. For example, history and maturation are not controlled. If participants do significantly better on the posttest than on the pretest, the improvement may or may not be due to the treatment. Something else may have happened to the participants that affected their performance, and the longer the study takes, the more likely it is that this “something” will threaten validity. Testing and instrumentation also are not controlled; the participants may learn something on the pretest that helps them on the posttest, or unreliability of the measures may be responsible for the apparent improvement. Statistical regression is also not controlled. Even if subjects are not selected on the basis of extreme scores (i.e., high or low), a group may do very poorly on the pretest just by poor luck. For example, participants may guess badly on a multiple-choice pretest and improve on a posttest simply because, this time, their guessing produces a score that is more in line with an expected score. Finally, the external validity threat of pretest–treatment interaction is not controlled in this design. Participants may react differently to the treatment than they would have if they had not been pretested.

To illustrate the problems associated with this design, consider a hypothetical study. Suppose a professor teaches a statistics course and is concerned that the high anxiety level of students interferes with their learning. The professor prepares a 100-page booklet in which she explains the course, tries to convince students that they will have no problems, and promises all the help they need to successfully complete the course, even if they have a poor math background. The professor wants to see if the booklet helps to reduce anxiety. At the beginning of the term, she administers an anxiety test and then gives each student a copy of the booklet with instructions to read it as soon as possible. Two weeks later she administers the anxiety scale again, and the students’ scores indicate much less anxiety than at the beginning of the term. The professor is satisfied and prides herself on the effectiveness of the booklet for reducing anxiety. However, a number of alternative factors or threats may explain the students’ decreased anxiety. For example, students are typically more anxious at the beginning of a course because they do not know exactly what they are in for (i.e., fear of the unknown). After a couple of weeks in the course, students may find that it is not as bad as they imagined, or if it turns out to be as bad or worse, they will drop it (i.e., mortality). In addition, the professor doesn’t know whether the students read the booklet!

The only situation for which the one-group pretest–posttest design is appropriate is when the behavior to be measured is not likely to change all by itself. Certain prejudices, for example, are not likely to change unless a concerted effort is made.

The Static-Group Comparison. The static-group comparison involves at least two nonrandomly formed groups, one that receives a new or unusual treatment (i.e., the experimental treatment) and another that receives a traditional treatment (i.e., the control treatment). Both groups are posttested. The purpose of the control group is to indicate what the performance of the experimental group would have been if it had not received the experimental treatment. This purpose is fulfilled only to the degree that the control group is equivalent to the experimental group.

In static-group comparisons, although the terms experimental and control are commonly used to describe the groups, it is probably more appropriate to call them both comparison groups because each serves as the comparison for the other. Each group receives some form of the independent variable (i.e., the treatment). For example, if the independent variable is type of drill and practice, the experimental group (X1) may receive computer-assisted drill and practice, and the control group may receive worksheet drill and practice. Occasionally, but not often, the experimental group may receive something while the control group receives nothing. For example, a group of teachers may receive some type of in-service education while the comparison group of teachers receives nothing. In this case, X1 is in-service training, and X2 is no in-service training.

The static-group comparison design can be expanded to deal with any number of groups. For three groups, the design takes the following form:

X1 O
X2 O
X3 O
Each group serves as a control or comparison group for the other two. For example, if the independent variable were number of minutes of review at the end of math lessons, then X1 may represent 6 minutes of review, X2 may represent 3 minutes of review, and X3 may represent no minutes of review. Thus X3 would help us to assess the impact of X2, and X2 would help us to assess the impact of X1.

Again, the degree to which the groups are equivalent is the degree to which their comparison is reasonable. In this design, because participants are not randomly assigned to groups and no pretest data are collected, it is difficult to determine the extent to which the groups are equivalent. That is, posttest differences may be due to initial group differences in maturation, selection, and selection interactions, rather than the treatment effects. Mortality is also a problem; if you lose participants from the study, you have no information about what you have lost because you have no pretest data. On the positive side, the presence of a comparison group controls for history because events occurring outside the experimental setting should equally affect both groups.

In spite of its limitations, the static-group comparison design is occasionally employed in a preliminary or exploratory study. For example, one semester, early in the term, a teacher wondered if the kind of test items given to educational research students affects their retention of course concepts. For the rest of the term, students in one section of the course were given multiple-choice tests, and students in another section were given short-answer tests. At the end of the term, group performances were compared. The students receiving short-answer test items had higher total scores than students receiving the multiple-choice items. On the basis of this exploratory study, a formal investigation of this issue was undertaken, with randomly formed groups.

True Experimental Designs

True experimental designs control for nearly all threats to internal and external validity. As Figure 10.2 indicates, all true experimental designs have one characteristic in common that the other designs do not have: random assignment of participants to treatment groups. Ideally, participants should be randomly selected and randomly assigned; however, to qualify as a true experimental design, random assignment (R) must be involved. Additionally, all the true designs have a control group (X2). Finally, although the posttest-only control group design looks like the static-group comparison design, random assignment in the former makes it very different in terms of control.

The Pretest–Posttest Control Group Design. The pretest–posttest control group design requires at least two groups, each of which is formed by random assignment. Both groups are administered a pretest, each group receives a different treatment, and both groups are posttested at the end of the study. Posttest scores are compared to determine the effectiveness of the treatment. The pretest–posttest control group design may also be expanded to include any number of treatment groups. For three groups, for example, this design takes the following form:

R O X1 O
R O X2 O
R O X3 O

The combination of random assignment and the presence of a pretest and a control group serve to control for all threats to internal validity. Random assignment controls for regression and selection factors; the pretest controls for mortality; randomization and the control group control for maturation; and the control group controls for history, testing, and instrumentation. Testing is controlled because if pretesting leads to higher posttest scores, the advantage should be equal for both the experimental and control groups. The only weakness in this design is a possible interaction between the pretest and the treatment, which may make the results generalizable only to other pretested groups. The seriousness of this potential weakness depends on the nature of the pretest, the nature of the treatment, and the length of the study. When this design is used, the researcher should assess and report the probability of a pretest–treatment interaction. For example, a researcher may indicate that possible pretest interaction was likely to be minimized by the nonreactive nature of the pretest (e.g., chemical equations) and by the length of the study (e.g., 9 months).

The data from this and other experimental designs can be analyzed to test the research hypothesis regarding the effectiveness of the treatments in
FIGURE 10.2 • Sources of invalidity for true experimental designs and quasi-experimental designs

Internal sources of invalidity: History, Maturation, Testing, Instrumentation, Regression, Selection, Mortality, Selection Interactions.
External sources of invalidity: Pretest-X Interaction, Multiple-X Interference.

Designs | History | Maturation | Testing | Instrumentation | Regression | Selection | Mortality | Selection Interactions | Pretest-X Interaction | Multiple-X Interference

TRUE EXPERIMENTAL DESIGNS
1. Pretest–Posttest Control Group Design | + | + | + | + | + | + | + | + | – | (+)
    R O X1 O
    R O X2 O
2. Posttest-Only Control Group Design | + | + | (+) | (+) | (+) | + | – | + | (+) | (+)
    R X1 O
    R X2 O
3. Solomon Four-Group Design | + | + | + | + | + | + | + | + | + | (+)
    R O X1 O
    R O X2 O
    R X1 O
    R X2 O

QUASI-EXPERIMENTAL DESIGNS
4. Nonequivalent Control Group Design | + | + | + | + | – | + | + | – | – | (+)
    O X1 O
    O X2 O
5. Time-Series Design | – | + | + | – | + | (+) | + | (+) | – | (+)
    O O O O X O O O O
6. Counterbalanced Design | + | + | + | + | + | + | + | – | – | –
    X1O X2O X3O
    X3O X1O X2O
    X2O X3O X1O

Note: Symbols: X or X1 = unusual treatment; X2 = control treatment; O = test, pretest, or posttest; R = random assignment of subjects to groups; + = factor controlled for; (+) = factor controlled for because not relevant; and – = factor not controlled for. This figure is intended to be a supplement to, not substitute for, textual discussions. See note that accompanies Figure 10.1.
several different ways. The best way is to compare the posttest scores of the two treatment groups. The pretest is used to see if the groups are essentially the same on the dependent variable at the start of the study. If they are, posttest scores can be directly compared using a statistic called the t test. If the groups are not essentially the same on the pretest (i.e., random assignment does not guarantee equality), posttest scores can be analyzed using analysis of covariance, which adjusts posttest scores for initial differences on any variable, including pretest scores. This approach is superior to using gain or difference scores (i.e., posttest minus pretest) to determine the treatment effects.

A variation of the pretest–posttest control group design involves random assignment of members of matched pairs to the treatment groups. There is really no advantage to this technique, however, because any variable that can be controlled through matching can be better controlled using other procedures such as analysis of covariance.

Another variation of this design involves one or more additional posttests. For example:

R O X1 O O
R O X2 O O

This variation has the advantage of providing information about the effect of the independent variable both immediately following treatment and at a later date. Recall that the interaction of time of measurement and treatment effects is a threat to external validity because posttesting may yield different results depending on when it is done; a treatment effect (or lack of one) that is based on the administration of a posttest immediately following the treatment may not be found if a delayed posttest is given after treatment. Although adding multiple posttests does not completely solve this problem, it greatly minimizes it by providing information about group performance subsequent to the initial posttest.

The Posttest-Only Control Group Design. The posttest-only control group design is the same as the pretest–posttest control group design except there is no pretest: participants are randomly assigned to at least two groups, exposed to the different treatments, and posttested. Posttest scores are then compared to determine the effectiveness of the treatment. As with the pretest–posttest control group design, the posttest-only control group design can be expanded to include more than two groups.

The combination of random assignment and the presence of a control group serves to control for all threats to internal validity except mortality, which is not controlled because of the absence of pretest data on participants. However, mortality may or may not be a problem, depending on the duration of the study. If it isn’t a problem, the researcher may report that although mortality is a potential threat to validity with this design, it did not prove to be a threat because the group sizes remained constant or nearly constant throughout the study. If the probability of differential mortality is low, the posttest-only design can be very effective. However, if the groups may be different with respect to pretreatment knowledge related to the dependent variable, the pretest–posttest control group design should be used. Which design is best depends on the study. If the study is short, and if it can be assumed that neither group has any knowledge related to the dependent variable, then the posttest-only design may be the best choice. If the study is to be lengthy (i.e., good chance of mortality), or if the two groups potentially differ on initial knowledge related to the dependent variable, then the pretest–posttest control group design may be the best.

A variation of the posttest-only control group design involves random assignment of matched pairs to the treatment groups, one member to each group, to control for one or more extraneous variables. However, there is really no advantage to this technique, because any variable that can be controlled by matching can better be controlled using other procedures.

What if you face the following dilemma: The study is going to last 2 months; information about initial knowledge is essential; the pretest is an attitude test, and the treatment is designed to change attitudes. This is a classic case where pretest–treatment interaction is probable. One solution is to select the lesser of the two evils by taking our chances that mortality will not be a threat. Another solution, if enough participants are available, is to use the Solomon four-group design, which we discuss next.

The Solomon Four-Group Design. As Figure 10.2 shows, the Solomon four-group design is a combination of the pretest–posttest control group
design and the posttest-only control group design. The Solomon four-group design involves random assignment of participants to one of four groups. Two groups are pretested and two are not; one of the pretested groups and one of the groups not pretested receive the experimental treatment; and all four groups are posttested. The combination of the pretest–posttest control group design and the posttest-only control group design in this way results in a design that controls for pretest–treatment interaction and for mortality.

In this example, the design has two independent variables, each with two levels: group assignment (i.e., treatment or control) and pretest status (i.e., yes or no). The correct way to analyze the data resulting from this application of the Solomon four-group design is to use a 2 × 2 factorial analysis of variance. The 2 × 2 factorial analysis tells the researcher several things. First, if the participants who received the treatment (regardless of whether they took the pretest) perform differently than the participants who did not receive treatment (i.e., were in a control group), the researcher can conclude the treatment has an effect. Second, if the participants who took the pretest (regardless of whether they were in the treatment or the control groups) perform differently than the participants who did not take the pretest, the researcher can conclude that simply taking the pretest affects the dependent variable. Finally, if the participants who took the pretest AND received the treatment perform differently on the posttest than those who received the treatment but did not take the pretest, a pretest–treatment interaction is likely present. If the two experimental groups perform equally well on the posttest (i.e., no pretest–treatment interaction) but better than the two control groups, the researcher can more confidently conclude that the treatment had an effect that can be generalized to the population.

A common misconception is that because the Solomon four-group design controls for so many threats to validity, it is always the best design to choose. It isn’t; this design introduces other challenges that must be considered. For example, it requires twice as many participants as most other true experimental designs, and participants are often hard to find. If mortality is not likely to be a problem and pretest data are not needed, then the posttest-only design may be the best choice. If pretest–treatment interaction is unlikely and testing is a normal part of the subjects’ environment (such as when classroom tests are used), then the pretest–posttest control group design may be best. Thus, which design is the best depends on the nature of the study and the conditions under which it is to be conducted.

Quasi-Experimental Designs

Sometimes it is just not possible to assign individual participants to groups randomly. For example, to receive permission to include schoolchildren in a study, a researcher often has to agree to keep existing classrooms intact. In other words, entire classrooms, not individual students, are assigned to treatments. When random assignment is not possible, a researcher may choose from a number of quasi-experimental designs that provide adequate controls. As you review the following discussion of three quasi-experimental designs, keep in mind that designs such as these are to be used only when it is not feasible to use a true experimental design.

The Nonequivalent Control Group Design. This design is very much like the pretest–posttest control group design discussed previously. In the nonequivalent control group design, two (or more) treatment groups are pretested, administered a treatment, and posttested. The difference is that it involves random assignment of intact groups to treatments, not random assignment of individuals. For example, suppose a school volunteered six intact classrooms for a study. Three of the six classrooms may be randomly assigned to the experimental group (X1) and the remaining three assigned to the control group (X2). The inability to assign individuals to treatments randomly (as opposed to assigning whole classes) adds validity threats such as regression and interactions between selection, maturation, history, and testing.

To reduce some of the threats and strengthen the study, the researcher should make every effort to include groups that are as equivalent as possible. Comparing an advanced algebra class to a remedial algebra class, for example, would not be comparing equivalent groups. If differences between the groups on any major extraneous variable are identified, analysis of covariance can be used to equate the groups statistically. An advantage of the nonequivalent control group design is that because established classes or groups are selected, possible effects from reactive arrangements are minimized. Groups may not even be aware that they are involved in a study.
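The three comparisons made in the 2 × 2 factorial analysis described above for the Solomon four-group design can be sketched in Python. The posttest scores and the function name solomon_contrasts are invented for illustration. The sketch only compares cell means; an actual analysis would use a factorial analysis of variance to decide whether the differences are statistically significant.

```python
from statistics import mean


def solomon_contrasts(cells):
    """Compare cell means for a Solomon four-group design.

    cells maps (treated, pretested) to a list of posttest scores,
    where both keys are booleans. Returns the treatment difference,
    the pretest difference, and the pretest-treatment interaction.
    """
    m = {cell: mean(values) for cell, values in cells.items()}

    # Treated versus untreated, regardless of pretest status.
    treatment = (mean([m[(True, True)], m[(True, False)]])
                 - mean([m[(False, True)], m[(False, False)]]))

    # Pretested versus not pretested, regardless of treatment.
    pretest = (mean([m[(True, True)], m[(False, True)]])
               - mean([m[(True, False)], m[(False, False)]]))

    # Does the treatment difference depend on having been pretested?
    interaction = ((m[(True, True)] - m[(False, True)])
                   - (m[(True, False)] - m[(False, False)]))

    return treatment, pretest, interaction


# Hypothetical posttest scores for the four groups.
posttests = {
    (True, True): [18, 20, 22],    # pretested, treated
    (True, False): [14, 16, 18],   # not pretested, treated
    (False, True): [11, 12, 13],   # pretested, control
    (False, False): [10, 12, 14],  # not pretested, control
}
treatment, pretest, interaction = solomon_contrasts(posttests)
print(treatment, pretest, interaction)
```

With these made-up scores the treatment difference is 6, the pretest difference is 2, and the interaction contrast is 4; a nonzero interaction of this kind is the pattern that would suggest a pretest–treatment interaction.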
The Time-Series Design. This design is an elaboration of the one-group pretest–posttest design. In the time-series design, one group is repeatedly pretested until pretest scores are stable. The group is then exposed to a treatment and, after treatment implementation, repeatedly posttested. If a group performs essentially the same on a number of pretests and then significantly improves following a treatment, the researcher can be more confident about the effectiveness of the treatment than if just one pretest and one posttest were administered. For example, if the statistics professor we discussed earlier measured anxiety several times before giving the students her booklet, she could see if anxiety was declining even before receiving her booklet.

History is a problem with the time-series design because some event or activity may occur between the last pretest and the first posttest. Instrumentation may also be a problem, but only if the researcher changes measuring instruments during the study. Pretest–treatment interaction is certainly a possibility; if one pretest can interact with a treatment, more than one pretest can only make matters worse. If instrumentation or pretest–treatment interaction threatens validity, however, you will probably be aware of the problem because scores will change prior to treatment.

Determining the effectiveness of the treatment involves analysis of the pattern of the test scores, although statistical analyses appropriate for a time-series design are quite advanced. Figure 10.3 illustrates some of the possible patterns that may be found. The vertical line between O4 and O5 indicates the point at which the treatment was introduced. Pattern A suggests no treatment effect; performance was increasing before the treatment was introduced and continued to increase at the same rate following introduction of the treatment. In fact, Pattern A represents the reverse situation to that encountered by our statistics professor with her booklet. Patterns B and C indicate a treatment effect, with Pattern C more permanent than in Pattern B. Pattern D does not indicate a treatment effect even though student scores are higher on O5 than O4; the pattern is too erratic to make a decision about treatment effect. Scores appear to be fluctuating up and down, so the O4 to O5 fluctuation cannot be attributed to the treatment. These four patterns illustrate that comparing O4 and O5 is not sufficient; in all four cases, O5 indicates a higher score than O4, but only in two of the patterns does it appear that the difference is due to a treatment effect.

FIGURE 10.3 • Possible patterns for the results of a study based on a time-series design
[Line graph: scores on the dependent variable (vertical axis) for Patterns A, B, C, and D across observations O1 to O7 (horizontal axis). The treatment X is introduced between the pretest scores (O1 to O4) and the posttest scores (O5 to O7).]

A variation of the time-series design is the multiple time-series design, which involves the addition of a control group to the basic design, as shown:

    O O O O X1 O O O O
    O O O O X2 O O O O

This variation eliminates history and instrumentation as validity threats and thus represents a design with no likely threats to internal validity. The multiple time-series design can be used most effectively in situations where testing is a naturally occurring event, such as in research involving school classrooms.

Counterbalanced Designs. In a counterbalanced design, all groups receive all treatments but in a different order, and groups are posttested after each treatment. Although the example of a counterbalanced design in Figure 10.2 includes three groups and three treatments, any number of groups (more
than one) may be studied. The only restriction is that the number of groups be equal to the number of treatments. The order in which the groups receive the treatments is randomly determined. This design is usually employed with intact groups when administration of a pretest is not possible, although participants may be pretested. The pre-experimental static group comparison also can be used in such situations, but the counterbalanced design controls for several additional threats to validity.

Figure 10.2 shows the sequence for three treatment groups and three treatments. The first horizontal line indicates that Group A receives Treatment 1 and is posttested, then receives Treatment 2 and is posttested, and finally receives Treatment 3 and is posttested. The second line indicates that Group B receives Treatment 3, then Treatment 1, and then Treatment 2, and is posttested after each treatment. The third line indicates that Group C receives Treatment 2, then Treatment 3, then Treatment 1, and is posttested after each treatment. To put it another way, the first column indicates that at Time 1, while Group A is receiving Treatment 1, Group B is receiving Treatment 3 and Group C is receiving Treatment 2. All three groups are posttested, and the treatments are shifted as shown in the second column: at Time 2, while Group A is receiving Treatment 2, Group B is receiving Treatment 1 and Group C is receiving Treatment 3. The groups are then posttested again, and the treatments are again shifted so that at Time 3, Group A receives Treatment 3, Group B receives Treatment 2, and Group C receives Treatment 1. All groups are posttested again. To determine the effectiveness of the treatments, the average performance of the groups on each treatment can be calculated and compared. In other words, the posttest scores for all the groups for the first treatment can be compared to the posttest scores of all the groups for the second treatment, and so forth, depending on the number of groups and treatments. Sophisticated analysis procedures that are beyond the scope of this text can be applied to determine both the effects of treatments and the effects of the order of treatments.

A unique weakness of the counterbalanced design is potential multiple-treatment interference that results when the same group receives more than one treatment. Thus, a counterbalanced design should be used only when the treatments are such that exposure to one will not affect the effectiveness of another. Unfortunately, there are few situations in education where this condition can be met. You cannot, for example, teach the same geometric concepts to the same group using several different methods of instruction.

Factorial Designs

Factorial designs are elaborations of single-variable experimental designs to permit investigation of two or more variables, at least one of which is manipulated by the researcher. After a researcher has studied an independent variable using a single-variable design, it is often useful to study that variable in combination with one or more other variables because some variables work differently when paired with different levels of another variable. For example, one method of math instruction may be more effective for high-aptitude students, whereas a different method may be more effective for low-aptitude students. The purpose of a factorial design is to determine whether the effects of an independent variable are generalizable across all levels or whether the effects are specific to particular levels.

The term factorial refers to a design that has more than one independent variable (or grouping variable), also known as a factor. In the preceding example, method of instruction is one factor and student aptitude is another. Method of instruction has two levels (there are two types of instruction); student aptitude also has two levels, high aptitude and low aptitude. Thus, a 2 × 2 (two by two) factorial design has two factors, and each factor has two levels. This four-celled design is the simplest possible factorial design. As another example, a 2 × 3 factorial design has two factors; one factor has two levels, and the other factor has three levels (e.g., high, average, and low aptitude). A study with three factors (homework: required, voluntary, or none; ability: high, average, or low; gender: male or female) is a 3 × 3 × 2 factorial design. Note that multiplying the numbers of levels of the factors yields the total number of cells (i.e., groups) in the factorial design. For example, a 2 × 2 design will have four cells, and a 3 × 3 × 2 design will have 18 cells.

Figure 10.4 illustrates the simplest 2 × 2 factorial design. One factor, type of instruction, has two levels: personalized and traditional. The other factor, IQ, also has two levels: high and low. Each group represents a combination of a level of one factor
and a level of the other factor. Thus, Group 1 is composed of high-IQ students receiving personalized instruction (PI), Group 2 is composed of high-IQ students receiving traditional instruction (TI), Group 3 is composed of low-IQ students receiving PI, and Group 4 is composed of low-IQ students receiving TI. To implement this design, high-IQ students would be randomly assigned to either Group 1 or Group 2, and a similar number of low-IQ students would be randomly assigned to either Group 3 or Group 4. This approach should be familiar; it involves stratified sampling. In fact, this study does not necessarily require four classes; it could include only two classes, the personalized class and the traditional class, and each class could be subdivided to obtain similar numbers of high- and low-IQ students.

FIGURE 10.4 • An example of the basic 2 × 2 factorial design

                     Type of Instruction
               Personalized     Traditional
    High IQ    Group 1          Group 2
    Low IQ     Group 3          Group 4

In a 2 × 2 design, both variables may be manipulated, or one may be a manipulated variable and the other a nonmanipulated variable. The nonmanipulated variable is often referred to as a control variable. Control variables are usually physical or mental characteristics of the subjects (e.g., gender, years of experience, or aptitude); in the example shown here, IQ is a nonmanipulated control variable. When describing and symbolizing factorial designs, the manipulated variable is traditionally placed first. Thus, a study with two factors, type of instruction (three types, manipulated) and gender (male, female), would be symbolized as 3 × 2, not 2 × 3.

FIGURE 10.5 • Illustration of interaction and no interaction in a 2 × 2 factorial experiment

    NO INTERACTION
                      Method A    Method B    Row average
    High IQ              80          40           60
    Low IQ               60          20           40
    Column average       70          30

    INTERACTION
                      Method A    Method B    Row average
    High IQ              80          60           70
    Low IQ               20          40           30
    Column average       50          50

[Graphs: average score (0 to 100) plotted for high- and low-IQ students, with one line for Method A and one for Method B. The two lines are parallel in the no-interaction example and cross in the interaction example.]

Figure 10.5 represents two possible outcomes for an experiment involving a 2 × 2 factorial
design. The number in each box, or cell, represents the average posttest score for that group. Thus, in both examples, the high-IQ students under Method A had an average posttest score of 80. The row and column numbers outside the boxes represent average scores across boxes, or cells. In the top example, the average score for high-IQ students was 60 (i.e., the average of the scores for all high-IQ subjects regardless of treatment; (80 + 40)/2 = 60), and the average score for low-IQ students was 40. The average score for students under Method A was 70 (i.e., the average of the scores of all the subjects under Method A regardless of IQ level; (80 + 60)/2 = 70), and for students under Method B, 30. The cell averages suggest that Method A was better than Method B for high-IQ students (i.e., 80 vs. 40), and Method A was also better for low-IQ students (i.e., 60 vs. 20). Thus, Method A was better, regardless of IQ level; there was no interaction between method and IQ. The high-IQ students in each method outperformed the low-IQ students in each method, and the subjects in Method A outperformed the subjects in Method B at each IQ level. The parallel lines in the top graph in Figure 10.5 illustrate the lack of interaction.

The bottom example of Figure 10.5 shows an interaction. For high-IQ students, Method A was better (i.e., 80 vs. 60); for low-IQ students, Method B was better (i.e., 20 vs. 40). Even though high-IQ students did better than low-IQ students regardless of method, how well they did depended on which method they were in. Neither method was generally better; rather, one method was better for students with high IQs, and one was better for students with low IQs. Note that if the researcher had simply compared two groups of subjects, one group receiving Method A and one group receiving Method B, without separating high- and low-IQ students in the factorial design, the researcher would likely have concluded that Method A and Method B were equally effective because the overall average score for both Methods A and B was 50. The factorial design allowed the researcher to see the interaction between the variables: the methods were differentially effective depending on the IQ level of the participants. The crossed lines in the bottom graph in Figure 10.5 illustrate the interaction.

Many factorial designs are possible, depending on the nature and the number of independent variables. Theoretically, a researcher could simultaneously investigate 10 factors in a 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 design. In reality, however, more than 3 factors are rarely used because each additional factor increases the number of participants needed to complete the study. A 2 × 2 design with 20 participants per cell (a relatively small number) requires at least 80 participants (2 × 2 × 20 = 80). It is easy to see that as the number of cells increases, things quickly get out of hand. Reducing the number per cell doesn't help because as sample size decreases, so does representativeness. Moreover, interactions involving many factors are difficult if not impossible to interpret. For example, how would you interpret a five-way interaction between teaching method, IQ, gender, aptitude, and anxiety?

An example of experimental research appears at the end of this chapter. Identify the experimental design that was used. Also, don't be concerned if you don't understand the statistics; focus on the problem, the procedures, and the conclusions.
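
The arithmetic behind the Figure 10.5 discussion can be sketched in a few lines of Python. This is only an illustration, not an analysis procedure from the text: the function names (marginal_means, interaction_contrast, participants_needed) are hypothetical, the cell averages are the ones quoted in the chapter, and equal numbers of participants per cell are assumed, as in the chapter's own calculations. A real study would test an interaction with a factorial analysis of variance rather than by inspecting cell means.

```python
from math import prod

# Cell averages quoted in the Figure 10.5 discussion, keyed by (IQ level, method).
NO_INTERACTION = {("high", "A"): 80, ("high", "B"): 40,
                  ("low", "A"): 60, ("low", "B"): 20}
INTERACTION = {("high", "A"): 80, ("high", "B"): 60,
               ("low", "A"): 20, ("low", "B"): 40}


def marginal_means(cells):
    """Average the cell means across each row (IQ level) and each column (method).

    Assumes equal numbers of participants per cell.
    """
    rows = {iq: (cells[(iq, "A")] + cells[(iq, "B")]) / 2 for iq in ("high", "low")}
    cols = {m: (cells[("high", m)] + cells[("low", m)]) / 2 for m in ("A", "B")}
    return rows, cols


def interaction_contrast(cells):
    """Method A's advantage for high-IQ students minus its advantage for low-IQ students.

    A value of zero corresponds to parallel lines (no interaction in the cell means).
    """
    high_gap = cells[("high", "A")] - cells[("high", "B")]
    low_gap = cells[("low", "A")] - cells[("low", "B")]
    return high_gap - low_gap


def participants_needed(levels, per_cell):
    """Cells = product of the number of levels of each factor; multiply by cell size."""
    return prod(levels) * per_cell


if __name__ == "__main__":
    for name, cells in (("no interaction", NO_INTERACTION), ("interaction", INTERACTION)):
        rows, cols = marginal_means(cells)
        print(name, rows, cols, interaction_contrast(cells))
    print(participants_needed([2, 2], 20))  # the 2 x 2 design with 20 per cell
    print(prod([3, 3, 2]))                  # number of cells in a 3 x 3 x 2 design
```

In the interaction example the two column averages come out equal, which is why comparing the methods without the IQ factor would hide the difference between them.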
SUMMARY

EXPERIMENTAL RESEARCH: DEFINITION AND PURPOSE

1. In an experimental study, the researcher manipulates at least one independent variable, controls other relevant variables, and observes the effect on one or more dependent variables.

2. The independent variable, also called the experimental variable, cause, or treatment, is that process or activity believed to make a difference in performance. The dependent variable, also called the criterion variable, effect, or posttest, is the outcome of the study, the measure of the change or difference resulting from manipulation of the independent variable.

3. When conducted well, experimental studies produce the soundest evidence concerning hypothesized cause–effect relations.

The Experimental Process

4. The steps in an experimental study include selecting and defining a problem, selecting participants and measuring instruments, preparing a research plan, executing procedures, analyzing the data, and formulating conclusions.

5. An experimental study is guided by at least one hypothesis that states an expected causal relation between two variables.

6. In an experimental study, the researcher forms or selects the groups, decides how to allocate treatments to each group, controls extraneous variables, and observes or measures the effect on the groups at the end of the study.

7. The experimental group typically receives a new treatment, and the control group either receives a different treatment or is treated as usual.

8. The two groups that are to receive different treatments are equated on all other variables that influence performance on the dependent variable.

9. After the groups have been exposed to the treatment for some period, the researcher measures the dependent variable and tests for a significant difference in performance.

Manipulation and Control

10. Direct manipulation by the researcher of at least one independent variable is the characteristic that differentiates experimental research from other types of research.

11. Control refers to efforts to remove the influence of any variable, other than the independent variable, that may affect performance on the dependent variable.

12. Two different kinds of variables need to be controlled: participant variables, on which participants in the different groups may differ, and environmental variables, variables in the setting that may cause unwanted differences between groups.

THREATS TO EXPERIMENTAL VALIDITY

13. Any uncontrolled extraneous variables that affect performance on the dependent variable are threats to the validity of an experiment. An experiment is valid if results obtained are due only to the manipulated independent variable and if they are generalizable to situations outside the experimental setting.

14. Internal validity is the degree to which observed differences on the dependent variable are a direct result of manipulation of the independent variable, not some other variable. External validity is the degree to which study results are generalizable to groups and environments outside the experimental setting.

15. The researcher must strive for a balance between control and realism, but if a choice is involved, the researcher should err on the side of control.
Threats to Internal Validity

16. History refers to any event occurring during a study that is not part of the experimental treatment but may affect performance on the dependent variable.

17. Maturation refers to physical, intellectual, and emotional changes that naturally occur within individuals over a period of time and affect participants' performance on a measure of the dependent variable.

18. Testing refers to the possibility that participants show improved performance on a posttest because they took a pretest.

19. Instrumentation refers to unreliability, or lack of consistency, in measuring instruments that may result in invalid assessment of performance.

20. Statistical regression refers to the tendency of participants who score highest on a pretest to score lower on a posttest and the tendency of those who score lowest on a pretest to score higher on a posttest.

21. Differential selection is the selection of subjects who have differences at the start of a study that may influence posttest differences. It usually occurs when already-formed groups are used.

22. Mortality, or attrition, refers to a reduction in the number of research participants as individuals drop out of a study. Mortality can affect validity because it may alter the characteristics of the treatment groups.

23. Selection may interact with factors related to maturation, history, and testing. If already-formed groups are included in a study, one group may profit more (or less) from a treatment or have an initial advantage (or disadvantage) because of maturation, history, or testing factors.

Threats to External Validity

24. Threats affecting to whom research results can be generalized make up threats to population validity.

25. Pretest–treatment interaction occurs when subjects respond or react differently to a treatment because they have been pretested. The pretest may provide information that influences the posttest results.

26. Multiple-treatment interference occurs when the same subjects receive more than one treatment in succession and when the effects from an earlier treatment influence a later treatment.

27. Selection–treatment interaction occurs when findings apply only to the (nonrepresentative) groups involved and are not representative of the treatment effect in the extended population.

28. Specificity is a threat to generalizability when the treatment variables are not clearly pre-rationalized, making it unclear to whom the variables generalize.

29. Generalizability of results may be affected by short-term or long-term events that occur while the study is taking place. This potential threat is referred to as interaction of history and treatment effects.

30. Interaction of time of measurement and treatment effects results from the fact that posttesting may yield different results depending on when it is done.

31. Treatment diffusion occurs when different treatment groups communicate with and learn from each other.

32. A researcher's influences on participants or on study procedures are known as experimenter effects; these effects can be passive or active.

33. Reactive arrangements are threats to external validity that are associated with participants performing atypically because they are aware of being in a study. The Hawthorne, John Henry, and novelty effects are examples of reactive arrangements.

34. The placebo effect is sort of the antidote for the Hawthorne and John Henry effects. Its application in educational research is that all groups in an experiment should appear to be treated the same.

GROUP EXPERIMENTAL DESIGNS

35. The validity of an experiment is a direct function of the degree to which extraneous variables are controlled.

36. Participant variables include organismic variables and intervening variables. Organismic variables are characteristics of the subject or organism that cannot be altered but
can be controlled for. Intervening variables intrude between the independent variable and the dependent variable and cannot be directly observed but can be controlled for.

Control of Extraneous Variables

37. Randomization is the best single way to control for extraneous variables and should be used whenever possible; participants should be randomly selected from a population and randomly assigned to groups, and treatments should be randomly assigned to groups.

38. Certain environmental variables can be controlled by holding them constant for all groups.

39. Matching commonly involves finding pairs of similar participants and randomly assigning each member of a pair to a different group. Subjects who do not have a match must be eliminated from the study.

40. Another way of controlling an extraneous variable is to compare groups that are homogeneous with respect to that variable. A similar but more satisfactory approach is to form subgroups representing all levels of the control variable.

41. If the researcher is interested not just in controlling the variable but also in seeing if the independent variable affects the dependent variable differently at different levels of the control variable, the best approach is to build the control variable right into the design.

42. Participants can serve as their own controls if the same group is exposed to the different treatments, one treatment at a time.

43. The analysis of covariance is a statistical method for equating randomly formed groups on one or more variables. It adjusts scores on a dependent variable for initial differences on some other variable related to the dependent variable.

Types of Group Designs

44. Selection of a given design dictates such factors as whether participants will be randomly selected and assigned to groups, whether the groups will be pretested, and how data will be analyzed.

45. Single-variable designs involve one independent variable (which is manipulated) and are classified as pre-experimental, true experimental, or quasi-experimental, depending on the control they provide for sources of internal and external invalidity.

Pre-Experimental Designs

46. The one-shot case study involves one group that is exposed to a treatment (X) and then posttested (O). No relevant threat to validity is controlled.

47. The one-group pretest–posttest design involves one group that is pretested (O), exposed to a treatment (X), and tested again (O).

48. The static-group comparison involves at least two groups; one receives a new or unusual treatment, and both groups are posttested. Because participants are not randomly assigned to groups and there are no pretest data, it is difficult to determine whether the treatment groups are equivalent.

True Experimental Designs

49. True experimental designs control for nearly all threats to internal and external validity. True experimental designs have one characteristic in common that no other design has: random assignment of participants to groups. Ideally, participants should be randomly selected and randomly assigned to treatments.

50. The pretest–posttest control group design involves at least two groups, both of which are formed by random assignment. Both groups are administered a pretest, one group receives a new or unusual treatment, and both groups are posttested. A variation of this design seeks to control extraneous variables more closely by randomly assigning members of matched pairs to the treatment groups.

51. The posttest-only control group design is the same as the pretest–posttest control group design except there is no pretest. Participants are randomly assigned to at least two groups, exposed to the independent variable, and posttested to determine the effectiveness of
the treatment. A variation of this design is random assignment of matched pairs.

52. The Solomon four-group design involves random assignment of subjects to one of four groups. Two of the groups are pretested, and two are not; one of the pretested groups and one of the unpretested groups receive the experimental treatment. All four groups are posttested. This design controls all threats to internal validity.

53. The best way to analyze data resulting from the Solomon four-group design is to use a 2 × 2 factorial analysis of variance. This procedure indicates whether there is an interaction between the treatment and the pretest.

Quasi-Experimental Designs

54. When it is not possible to assign subjects to groups randomly, quasi-experimental designs are available to the researcher. They provide adequate control of threats to validity.

55. The nonequivalent control group design is like the pretest–posttest control group design except that the nonequivalent control group design does not involve random assignment. If differences between the groups on any major extraneous variable are identified, analysis of covariance can be used to statistically equate the groups.

56. In the time-series design, one group is repeatedly pretested, exposed to a treatment, and then repeatedly posttested. If a group scores essentially the same on a number of pretests and then significantly improves following a treatment, the researcher has more confidence in the effectiveness of the treatment than if just one pretest and one posttest were administered.

57. The multiple time-series design is a variation that involves adding a control group to the basic design. This variation eliminates history and instrumentation as threats to internal validity, leaving no likely threats.

58. In a counterbalanced design, all groups receive all treatments but in a different order, the number of groups equals the number of treatments, and groups are posttested after each treatment. This design is usually employed when intact groups are included and when administration of a pretest is not possible.

Factorial Designs

59. Factorial designs involve two or more independent variables, at least one of which is manipulated by the researcher. The 2 × 2 is the simplest factorial design. Factorial designs rarely include more than three factors.

60. A factorial design is used to test whether the effects of an independent variable are generalizable across all levels or whether the effects are specific to particular levels (i.e., there is an interaction between the variables).
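
The counterbalanced ordering summarized in item 58, and described for three groups in the discussion of Figure 10.2, can be sketched in Python. This is a minimal illustration under stated assumptions, not a procedure given in the text: the function names are hypothetical, and the cyclic rotation is just one convenient way to build a schedule in which every group receives every treatment once and no two groups receive the same treatment at the same time. The random assignment of groups to rows stands in for the chapter's statement that the order is randomly determined.

```python
import random


def cyclic_schedule(treatments):
    """Return one row per group; row g lists that group's treatment at Time 1, 2, ...

    Each row is the treatment list rotated one step further than the row above,
    so every treatment appears once in each row (group) and each column (time).
    """
    n = len(treatments)
    return [[treatments[(t - g) % n] for t in range(n)] for g in range(n)]


def is_counterbalanced(schedule):
    """True if the number of groups equals the number of treatments, each group
    gets every treatment once, and each time period covers every treatment."""
    expected = set(schedule[0])
    rows_ok = all(len(row) == len(expected) and set(row) == expected for row in schedule)
    cols_ok = all(set(col) == expected for col in zip(*schedule))
    return len(schedule) == len(expected) and rows_ok and cols_ok


def assign_groups(groups, schedule, seed=None):
    """Randomly decide which group follows which row of the schedule."""
    rows = list(schedule)
    random.Random(seed).shuffle(rows)
    return dict(zip(groups, rows))


if __name__ == "__main__":
    schedule = cyclic_schedule([1, 2, 3])
    print(schedule)                      # rows follow the Group A, B, C orders described for Figure 10.2
    print(is_counterbalanced(schedule))
    print(assign_groups(["A", "B", "C"], schedule, seed=0))
```

A schedule of this kind balances the position of each treatment but does nothing about multiple-treatment interference, so the caution in item 26 still applies.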
Go to the topic “Experimental Research” in the MyEducationLab (www.myeducationlab.com) for your course,
where you can:
◆ Find learning outcomes.
◆ Complete Assignments and Activities that can help you more deeply understand the chapter content.
◆ Apply and practice your understanding of the core skills identified in the chapter with the Building
Research Skills exercises.
◆ Check your comprehension of the content covered in the chapter by going to the Study Plan. Here you
will be able to take a pretest, receive feedback on your answers, and then access Review, Practice, and
Enrichment activities to enhance your understanding. You can then complete a final posttest.
Effects of Mathematical Word Problem–Solving Instruction on Middle School
Students with Learning Problems
YAN PING XIN ASHA K. JITENDRA ANDRIA DEATLINE-BUCHMAN
Purdue University Lehigh University Easton Area School District
ABSTRACT This study investigated the differential effects reasoning and problem solving (Cawley, Parmar, Yan, &
of two problem–solving instructional approaches—schema- Miller, 1998). Students with learning disabilities often
based instruction (SBI) and general strategy instruction manifest serious deficits in mathematics, especially prob-
(GSI)—on the mathematical word problem–solving per- lem solving (Carnine, Jones, & Dixon, 1994; Cawley &
formance of 22 middle school students who had learning Miller, 1989; Cawley, Parmar, Foley, Salmon, & Roy,
disabilities or were at risk for mathematics failure. Results 2001; Parmar, Cawley, & Frazita, 1996). Specifically, these
indicated that the SBI group significantly outperformed students perform at significantly lower levels than stu-
the GSI group on immediate and delayed posttests as well dents without disabilities on all problem types, especially
as the transfer test. Implications of the study are discussed problems that involve indirect language, extraneous in-
within the context of the new IDEA amendment and access formation, and multisteps (Briars & Larkin, 1984; Cawley
to the general education curriculum. et al., 2001; Englert, Culatta, & Horn, 1987; Lewis & Mayer,
1987; Parmar et al., 1996). While problems in reading and
Mathematics is integral to all areas of daily life; it affects basic computation skills may account for these students’
successful functioning on the job, in school, at home, poor performance, difficulties in problem representation
and in the community. The importance of mathemat- and failure to identify relevant information and opera-
ics literacy and problem solving is emphasized in the tion may exacerbate their poor performance (Hutchinson,
Goals 2000: Educate America Act of 1994 and National 1993; Judd & Bilsky, 1989; Parmar, 1992).
Council of Teachers of Mathematics’ Principles and Stan-
dards for School Mathematics (NCTM, 2000; Goldman, In addition, ineffective instructional strategies may ex-
Hasselbring, & the Cognition and Technology Group at plain the poor problem-solving performance of students
Vanderbilt, 1997). Increasing evidence suggests that high with learning disabilities. One commonly used instruc-
levels of mathematical and technical skills are needed for tional approach is the “key word” strategy, in which
most jobs in the 21st century. Therefore, it is important students are taught key words that cue them as to what
to ensure that all students, not just those planning to pursue higher education, have sufficient
skills to meet the challenges of the 21st century (National Education Goals Panel, 1997). In
addition, one of the provisions of the 1997 amendments to the Individuals with Disabilities
Education Act (IDEA) is that students with disabilities have meaningful access to the general
education curriculum. In fact, these students are held accountable to the same high academic
standards required of all students (No Child Left Behind Act, 2002).

As part of the mathematics reform and standards-based reform movements, the NCTM (2000)
developed the Principles and Standards for School Mathematics. The focus of the NCTM standards
is on “conceptual understanding rather than procedural knowledge or rule-driven computation”
(Maccini & Gagnon, 2002, p. 326). This emphasis has significant implications for classroom
practice because special education typically has focused on arithmetic computation rather than
higher-order skills such as

Address: Yan Ping Xin, Purdue University, Beering Hall of Liberal Arts and Education, Dept. of
Educational Studies, 100 North University St., West Lafayette, IN 47907-2098; email: [email protected]

operation to use in solving problems. For example, students learn that altogether indicates the
use of the addition operation, whereas left indicates subtraction. Similarly, the word times
calls for multiplication, and among indicates the need to divide. However, Parmar et al. (1996)
argued that “the outcome of such training is that the student reacts to the cue word at a
surface level of analysis and fails to perform a deep-structure analysis of the
interrelationships among the word and the context in which it is embedded” (p. 427). That is,
the focus is on whether to add, subtract, multiply, or divide rather than whether the problem
makes sense. Another commonly employed problem-solving strategy is the four-step (read, plan,
solve, and check) general heuristic procedure. Unfortunately, this procedure may not facilitate
problem solution for students with learning disabilities, especially when the domain-specific
conceptual and procedural knowledge is not adequately elaborated upon (Hutchinson, 1993;
Montague, Applegate, & Marquard, 1993).

For students with learning disabilities, explicit teaching for conceptual understanding is
critical to establish the necessary knowledge base for problem solution. Recent reviews provide
empirical support for problem-solving instruction, such as a schema-based strategy instruction,
that emphasizes conceptual understanding of the problem structure, or schemata (Xin & Jitendra,
1999).
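
The key word strategy discussed above can be illustrated with a minimal sketch. The four cue words and their operations come from the text; the function name, the word-matching rule, and the sample problem are hypothetical and are not taken from the study. The sketch shows how a cue-word lookup reacts only to the surface word and can select an operation that does not fit the problem.

```python
import re
from typing import Optional

# Cue words and the operations they are taught to signal, as listed in the text.
CUE_WORDS = {
    "altogether": "add",
    "left": "subtract",
    "times": "multiply",
    "among": "divide",
}


def pick_operation_by_cue_word(problem: str) -> Optional[str]:
    """Return the operation signalled by the first cue word, ignoring context."""
    for word in re.findall(r"[a-z]+", problem.lower()):
        if word in CUE_WORDS:
            return CUE_WORDS[word]
    return None


# Hypothetical problem (not from the study). The correct operation is
# addition (5 + 3), yet the cue word "left" points to subtraction.
problem = "Ann gave away 5 apples and has 3 left. How many apples did she start with?"
print(pick_operation_by_cue_word(problem))  # prints "subtract"
```

In this assumed example the lookup returns subtraction although the problem calls for addition, which is the kind of surface-level reaction that Parmar et al. (1996) describe.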
Successful problem solvers typically create a complete mental representation of the problem
schema, which, in turn, facilitates the encoding and retrieval of information needed to solve
problems (Didierjean & Cauzinille-Marmeche, 1998; Fuson & Willis, 1989; Marshall, 1995; Mayer,
1982). Problem schema acquisition allows the learner to use the representation to solve a range
of different (i.e., containing varying surface features) but structurally similar problems
(Sweller, Chandler, Tierney, & Cooper, 1990).

Schema-based strategy instruction is known to benefit both special education students (e.g.,
Jitendra & Hoff, 1996; Jitendra, Hoff, & Beck, 1999) and students at risk for math failure
(e.g., Jitendra et al., 1998; Jitendra, DiPipi, & Grasso, 2001) in solving arithmetic word
problems. However, previous research on the effects of schema-based strategy instruction is
limited, for the most part, to algebra problems (Hutchinson, 1993) and addition and subtraction
(e.g., change, combine, additive compare) arithmetic problems. Although the effects of semantic
representation training in facilitating problem solving have been demonstrated with college
students with and without disabilities, the studies are limited to a sample of comparison
problems only (Lewis, 1989; Zawaiza & Gerber, 1993). Furthermore, neither the study by Lewis
nor the study by Zawaiza and Gerber emphasized key components (compared, referent, and scalar
function) pertinent to the compare problem schemata. In addition, the rules for figuring out
the operation (e.g., if the unknown quantity is to the right of the given quantity on the
number line, then addition or multiplication should be applied) cannot be directly applied to
solve multiplication or division compare problems when the relational statement involves a
fraction or when the unknown is the scalar function (i.e., the multiple or partial relation
between two comparison quantities).

A more recent exploratory study by Jitendra, DiPipi, and Perron-Jones (2002) employed a
single-subject design to teach four students with learning disabilities to solve word problems
involving multiplication and division using the schema-based strategy. However, one of the
limitations of the study is that “the single-subject design employed in this investigation does
not help clarify whether the study findings are attributable to specific schema-based nature of
the instruction” (p. 37) or to the generally carefully designed one-on-one intensive
instruction on two problem types. The purpose of the present investigation was to evaluate and
compare the effectiveness of two problem-solving instructional approaches, schema-based and
general strategy instruction, in teaching multiplication and division word problems to middle
school students with learning disabilities or at risk for mathematics failure.

Method

Design

A pretest–posttest comparison group design with random assignment of subjects to groups was
used to examine the effects of the two word problem–solving instructional procedures—schema-based
instruction (SBI) and general strategy instruction (GSI)—on the word problem–solving
performance of middle school students with learning problems.

Participants

Participants were 22 students with learning problems, including 18 who were school-identified
as having a learning disability, 1 with severe emotional disorders, and 3 who were at risk for
mathematics failure, attending a middle school in the northeastern United States. Specifically,
participant selection was based on (a) teacher identification of students who were experiencing
substantial problems in mathematics word problem solving and (b) a score of 70% or lower on the
word problem–solving criterion pretest involving multiplication and division word problems. To
determine sample size, a power analysis using an alpha level of .05 and an effect size based on
existing schema-based instruction research studies (e.g., Jitendra et al., 1998) was conducted,
which indicated that a minimum of 10 participants in each group is sufficient to obtain a power
of .90 for a 2 × 4 repeated-measures analysis of variance (Friendly, 2000). Table 1 presents
demographic information with respect to participants’ gender, grade, age, ethnicity, special
education classification, IQ level, and standardized achievement scores in math and reading. It
is important to note that IQ and achievement data from school records were available for only
nine students.

Procedure

Instructors were two doctoral students in special education and two experienced special
education teachers. The two doctoral students taught the first cohort of 8 students (4 in each
treatment group), and the two special education teachers taught the second cohort of 14
students (7 in each treatment group). Students in both cohorts were randomly assigned to the
two treatment groups. To control for teacher effects, each pair of instructors (i.e., the two
doctoral students or the two special education teachers) were randomly assigned to the two
conditions, and they switched treatment groups midway through the intervention. The first
author developed the teaching scripts for both conditions and piloted them prior to employing
them in the study. Instructors received two 1-hour training sessions to familiarize them with
lesson formats, the suggested teacher wording, and lesson materials when implementing the two
instructional approaches.

Students in both conditions received their assigned strategy instruction three to four times a
week, each session lasting approximately an hour. The SBI group received 12 sessions of
instruction, with 4 sessions each on solving multiplicative compare and proportion problems and
4 sessions on solving mixed word problems that included both types. Students in the GSI group
also received 12 sessions of instruction, but they solved both types of problems in each
session. Unlike the SBI group,
students in the GSI group did not receive instruction in recognizing the two different word
problem types. Students in the two groups solved the same number and type of problems.

Table 1
Demographic Information

Variable                     SBI group      GSI group
Gender
  Male                       5              6
  Female                     6              5
Grade
  6                          6              4
  7                          2              6
  8                          3              1
Mean age in months (SD)      153.8 (8.6)    156.7 (8.7)
Ethnicity
  Caucasian                  4              3
  Hispanic                   5              7
  African American           2              1
Classification
  LD                         10             8
  SED                        0              1
  NL                         1              2
IQ (a)
  Verbal
    M                        95             93
    SD                       8.5            5.7
  Performance
    M                        92             92
    SD                       2.5            2.1
  Full Scale
    M                        92             92
    SD                       2.9            3.1
Achievement (b)
  Math
    M                        84             88
    SD                       10.4           3.9
  Reading
    M                        90             93
    SD                       2.0            2.4

Note: SBI = schema-based instruction; GSI = general strategy instruction; LD = learning
disabled; SED = seriously emotionally disturbed; NL = not labeled.
(a) IQ scores were obtained from the Wechsler Intelligence Scales for Children-Revised
(Wechsler, 1974).
(b) Achievement scores in math and reading were obtained from the Metropolitan Achievement
Test (Balow, Farr, & Hogan, 1992), with the exception of scores for one student that were
obtained from the Stanford Achievement Test, 9th ed. (1996). IQ and achievement scores were
available for only 9 of the 22 students.

Both Conditions

Across both SBI and GSI conditions, the teacher first modeled the assigned strategy with
multiple examples. Explicit instruction was followed by teacher-guided practice and independent
student work. Corrective feedback and additional modeling were provided as needed during
practice sessions. It should be noted that students in both groups were allowed to use
calculators during instruction and testing conditions, because computation skills were not the
focus of this study. Table 2 summarizes the problem-solving strategy steps across two
conditions. Overall, both groups were taught to follow the four-step general problem-solving
procedure of reading to understand, representing the problem, and planning, solving, and
checking. However, the fundamental differences between the two conditions involved the second
and third steps, with regard to how to plan and solve the problem. Specifically, the SBI group
was taught to identify the problem structure and use a schema diagram to represent and solve
the problem, whereas the GSI group learned to draw semiconcrete pictures to represent
information in the problem and facilitate problem solving. A detailed description of the two
instructional conditions, with an emphasis on how to “plan” and “solve” the problem, is
presented in the next section.

Schema-Based Instruction Condition

Instruction for the SBI group occurred in two phases: problem schemata instruction and problem
solution instruction. During problem schemata instruction, students learned to identify the
problem type or structure and represent the problem using a schematic diagram. In this phase,
story situations with no unknown information were presented. The purpose of presenting story
situations was to provide students with a complete representation of the problem structure of a
specific problem type. In contrast, the problem solution instruction phase used story problems
with unknown information. Below is a general description of instruction employed to teach the
two problem types investigated in this study.

Multiplicative Compare Problems. When teaching the multiplicative compare problem schema,
instruction emphasized several salient features. That is, students learned that a
multiplicative compare problem always includes (a) a referent set, including its identity and
its corresponding quantity; (b) a compared set, including its identity and corresponding
quantity; and (c) a statement that relates the compared set to the referent set. In short, the
multiplicative compare problem describes one object as the referent and expresses the other as
a part or multiple of it. Students first learned to identify the problem type using story
situations such as the following: “Vito earned $12 from shoveling snow over the weekend. He
earned 1/3 as much as his friend Guy did. Guy earned $36 from shoveling snow.” This story
situation, because the amount Vito earned (compared set) was compared to what Guy earned
(referent set), was deemed to be