
Adaptive thinking: rationality in the real world

Gerd Gigerenzer


proposed as a normative or descriptive model of how animals infer the pres-
ence of predators and prey (Stephens & Krebs, 1986), as well as how humans
reason, categorize, and memorize (Anderson, 1990). The problem with this
view is that in any sufficiently rich environment, Bayesian computations be-
come mathematically so complex that one needs to assume that minds are
"Laplacean demons" that have unlimited computational power, time, and
knowledge. To find a way out of this problem, researchers often make unreal-
istic assumptions about the structure of natural environments, namely as-
sumptions that reduce the Bayesian computations. Anderson (1990), for in-
stance, found himself forced to make the assumption that environmental
features would be generally independent, in order to save the fiction of the
Bayesian homunculus in the mind (Gigerenzer, 1991c). Despite their psycho-
logical implausibility, Laplacean demons are bustling in contemporary theories
of the mind: as models of choice, categorization, estimation, and inference,
among others. The rationale seems to be this: Cognition is rational, Bayes's
rule defines rationality, ergo cognition works with Bayes's rule. The same ra-
tionale seems to hold for other statistical tools that turned into theories of
mind, such as multiple regression, Neyman-Pearson decision techniques, and
analysis of variance (Chapter 1).

The second view, the one I propose here, is that modules of social intelli-
gence, including the triggering algorithms, work with fast and frugal strategies
instead of the costly "optimal" algorithms. There are several reasons that favor
simple and specifically designed heuristics rather than expensive and general
ones. First, there is, in fact, no single method of inference—statistical or log-
ical—that works best in all real-world contexts. Second, as mentioned before,
in real-world situations, "optimal" computations can quickly become so com-
plex that one is forced to make highly simplifying assumptions about the en-
vironment. Third, algorithms for social intelligence need to work under con-
straints of limited time and knowledge—for instance, one may not have the
time to search for further information. Fourth, the means and ends of social
intelligence are broader than consistency (coherence) and accuracy—the ac-
cepted norms of logic and statistics. Social intelligence can involve being in-
consistent (e.g., adaptive unpredictability may be optimal in competitive sit-
uations: the opponent will be unable to predict one's behavior), taking high
risks in trying to come out first (that is, options with low probabilities rather
than those that maximize expected value), responding quickly rather than ac-
curately (e.g., to make too long a pause in a conversation in order to think of
the best answer can be embarrassing and seen as impolite), and making deci-
sions that one can justify or defend afterward (Todd, 2000a, b).

The key argument against fast and frugal heuristics is that their simplicity
raises the suspicion that they are really bad. But this need not be the case:
Simple principles can be quick-and-clean rather than quick-and-dirty. Con-
sider a simple heuristic for choice called "Take The Best" (Gigerenzer & Gold-
stein, 1996). Take The Best infers which of two alternatives scores higher on
some criterion, such as which of two food items is more dangerous or which
of two cities has a higher population. Take The Best bases decisions on one


single reason, namely on the first good reason on which two alternatives differ.
The first good reason can be simply that the individual does not recognize (has
never heard of) one of the two alternatives. This "recognition heuristic" seems
to operate in domains in which recognition is correlated with the variable that
needs to be inferred. For instance, rats who can choose between food that they
recognize and food that is new to them do not accept the new food unless they
have smelled it on the breath of a fellow rat (Gallistel et al., 1991). The sur-
prising result is that simple heuristics such as Take The Best can make as
accurate inferences about real-world environments as costly statistical algo-
rithms of the Laplacean demon type (Martignon & Laskey, 1999).
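
To make this one-reason logic concrete, here is a minimal sketch in Python. It illustrates the idea rather than reproducing Gigerenzer and Goldstein's (1996) implementation; the cue representation, the example cities, and the cue values are my own hypothetical choices.

    def take_the_best(a, b, recognized, cues):
        # Recognition heuristic: if exactly one alternative is recognized,
        # infer that the recognized one scores higher on the criterion.
        if (a in recognized) != (b in recognized):
            return a if a in recognized else b
        if a not in recognized and b not in recognized:
            return None  # neither alternative recognized: guess

        # One-reason decision making: search cues in order of validity and
        # stop at the first cue on which the two alternatives differ.
        for cue in cues:  # each cue maps an alternative to +1 or -1
            va, vb = cue.get(a), cue.get(b)
            if va is not None and vb is not None and va != vb:
                return a if va > vb else b
        return None  # no cue discriminates: guess

    # Hypothetical example: which city has the larger population?
    recognized = {"Munich", "Hamburg"}
    cues = [{"Munich": -1, "Hamburg": -1},   # national capital? (both no)
            {"Munich": +1, "Hamburg": -1}]   # a further cue (values made up)
    print(take_the_best("Munich", "Hamburg", recognized, cues))  # -> Munich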

If short-sighted evolution has equipped us with adaptive heuristics rather
than with the collected works of logic and probability theory, this result in-
dicates that we need to rethink human rationality. The challenges are to un-
derstand what these heuristics are and to describe the structure of the envi-
ronments in which they can perform well and in which they cannot. My
proposal is that both the triggering algorithms and the mechanisms of the mod-
ule can be modeled as fast and frugal heuristics (Gigerenzer et al., 1999).

Modules Combine Cognitive, Emotional, and Motivational
Tools to Guide Inference and Behavior

In each module, various cognitive, emotional, behavioral, and motivational
processes are wired together. A social contract module, for instance, includes
perceptual machinery to recognize different individuals; a long-term memory
that stores the history of past exchanges with other individuals in order to
know when to cooperate, when to defect, and when to punish for defection;
knowledge about what constitutes a benefit and what a cost for oneself; and
emotional reactions such as anger that signal to others that one will go ruth-
lessly after cheaters (Cosmides & Tooby, 1992; Gigerenzer, 1995). Intelligent
modules can use cognitive, emotional, or motivational processes vicariously
as means to achieve a goal. For instance, emotional reactions (such as disgust
in matters of food) can substitute for learning from experience when events
are too rare or too deadly for individual learning.

The very challenge of a modular concept of social intelligence is that it
crosses the established boundaries of horizontal faculties. Thereby, the notion
of modularity questions the institutionalized subdisciplines of psychology. It
opens up a conception of intelligence that integrates cognition with, rather
than setting it apart from, other adaptive functions such as motivations and
emotions.

Summary

In 1976, Nick Humphrey wrote a stimulating essay on the social function of
intellect that provided "the single most important seed" (Byrne & Whiten,


1988, p. 1) for Machiavellian intelligence. Humphrey's paper contains thought-
provoking ideas about social intelligence but also the seed of a sterile research
methodology. In the postscript we are told that his "central thesis" demands
"that there should be a positive correlation across species between 'social com-
plexity' and 'individual intelligence' " (Humphrey, 1976/1988, p. 26). A labo-
ratory test of social skill, Humphrey proposes, is urgently needed, which
"ought, if I am right, to double as a test of 'high-level intelligence' " (p. 26).
This proposal, as innocent as it looks, steers toward repeating a grave error
that has swamped research on (nonsocial) intelligence. The error is to start
with no theory but with seductive everyday concepts such as social skill and
creative intelligence, then go on to design tests to measure these vague con-
cepts, and to pray to heaven that these tests miraculously transform loose
thinking into precise mechanisms of intelligence.1 Despite prayers that were
backed up by statistics, some 90 years of factor analyzing and correlating IQ
tests has not noticeably increased our understanding of the mechanisms of
human intelligence. I fear that Humphrey's proposal to look for correlations
between some tests for social complexity, social skill, and individual intelli-
gence will be doomed to the same failure.

The alternative is to start boldly and theoretically. The challenge is to design
the possible mechanisms of social intelligence and to test these by means of
experiment, observation, and simulation. The mechanisms of a modular social
intelligence that I have outlined, as speculative as they are, can serve as a start.
We will have to take up some hard questions. What are the domains (proper
and actual) for a given species? What is the mechanism of a module? What is
the algorithm that triggers a module? If we join forces, we can do it.

1. The notion of test intelligence, left undefined in its content, has had many faces,
including the moral and the social (Daston, 1992). For instance, the creators of the first
intelligence tests, Binet and Simon (1914), asked questions about social skills, such as
"Why should one judge a person by his acts rather than by his words?" In the first
edition of the Stanford-Binet Test, Lewis Terman (1916) expressed the intimate link
between lack of intelligence and morally inappropriate behavior in no uncertain terms:
"Every feeble-minded woman is a prostitute." In the 1937 revision of the text (with M. A.
Merrill), this sentence was deleted. Piece by piece, IQ tests became pure and puritan.


V

COGNITIVE ILLUSIONS AND
STATISTICAL RITUALS

The "discovery" of cognitive illusions was not the first assault on human

rationality. Sigmund Freud's attack is probably the best known: According to
him, the unconscious wishes and desires of the human id are a steady
source of intrapsychical conflict that manifests itself in all kinds of irrational
fears, beliefs, and behavior. But the cognitive-illusion assault is stronger than
the psychoanalytic one. It does not need to invoke a conflict between ra-
tional judgment and unconscious wishes and desires to explain humans' ap-
parent irrationality: Judgment is itself fundamentally deficient. Homo sap-
iens appears to be a misnomer. During the last few decades, cognitive
illusions have become fodder for classroom demonstrations and textbooks.
Isn't it fun to show how dumb everyone else is, and after all, aren't they?

In the spring of 1990, I gave a talk in the Department of Psychology at
Stanford University entitled "Beyond heuristics and biases: How to make
cognitive illusions disappear." The first chapter in this section is an updated
version of this talk. It became the fountainhead of an ongoing, heated debate
over the nature of human rationality and the litany of sins people seem to
commit routinely against reason (e.g., Gigerenzer, 1996a; Kahneman & Tver-
sky, 1996), the so-called "rationality wars" (Samuels, Stich, & Bishop, in
press). Cognitive illusions have been linked to perceptual illusions, suggest-
ing that they are "inevitable illusions" (Piattelli-Palmarini, 1994). The politi-
cal implications of this view are not hard to see. Given the message that or-
dinary citizens are unable to estimate uncertainties and risks, one might
conclude that a government would be well advised to keep these nitwits out
of important decisions regarding new technologies and environmental risks.
In Chapter 12, I criticize the narrow norms that make humans look irrational
and show how to make inevitable illusions "evitable."

Research on cognitive illusions is but one example of the more general phe-
nomenon of replacing statistical thinking with narrow, simplistic norms. The
statistical practices institutionalized in many social and medical sciences are
another case in point; they have little to do with statistical thinking and instead


promote statistical rituals. Textbooks teach our students the equivalent of com-
pulsive hand washing, the result being confusion and anxiety.

I once asked a well-known author who was busy preparing the latest edi-
tion of his best-selling statistical text for psychologists why he promoted the
usual incoherent mishmash of Fisherian and Neyman-Pearsonian prescrip-
tions for testing hypotheses (see Chapter 13). He did not try to deny the
problem, but he told me whom to blame for it. First, there was his publisher,
who had insisted that he supply a statistical cookbook and take out anything
that hinted at the existence of alternative tools for statistical inference,
which he did. Next, there were his fellow researchers, who did not aim to
truly understand statistics but to get their papers published. Finally, he
passed the blame on to the editors who demanded a statistical ritual and to
his university administration, which determined salary increases by counting
the number of papers published. When I asked him in what statistical meth-
ods he himself believed, he said that deep in his heart he was a Bayesian. I
was shocked. What a Faustian pact—an author successfully sells a method
in which he does not believe, which students and researchers then naively
mistake for the moral guidelines of doing science.

This is not to say that the many textbook authors who borrowed from
him are as aware of the confusion behind the ritual of null hypothesis test-
ing as he was. In my experience, many authors are innocent because igno-
rant, which is one way to maintain one's intellectual integrity. I wrote Chap-
ter 13 as an antidote to mindless statistics for both students and future
textbook writers.

The larger social and intellectual background for this chapter can be
found in The empire of chance: How probability changed science and every-
day life (Gigerenzer et al., 1989). This book was written by an interdiscipli-
nary group of scholars who studied the probabilistic revolution in the sci-
ences at the Center for Interdisciplinary Research in Bielefeld, a place where
nothing could distract us from work. Geoffrey Loftus reviewed the book in
Contemporary Psychology in 1991. When he became the editor of Memory &
Cognition in 1993, in his editorial statement he asked researchers to stop
submitting manuscripts with legions of p-values, t-values, and F-values and
instead present sound descriptive statistics, such as figures with error bars
around means. I admire him for having the courage to stand up against
mindless null hypothesis testing. A few years later, I asked Geoffrey how his
crusade was going. To Geoffrey's surprise, the resistance was coming from
the researchers. Most of them insisted on going through with the ritual. As
this case illustrates, editors alone cannot be blamed for psychologists' con-
tinued reliance on misguided statistical procedures. I might add that I have
never had a problem publishing experimental papers without null hypothe-
sis tests.

The story of null hypothesis testing in psychology is reminiscent of Hans
Christian Andersen's tale of the emperor's new clothes. In a sense, the proce-
dure has no clothes: Its outcome, the p-value, does not inform the reader
about the size of the effect, the probability that the null hypothesis is true,


the probability that the alternative hypothesis is true, or the probability that
the result is replicable. Nevertheless, studies in the United States, Great Brit-
ain, and Germany indicate that some 80% to 90% of academic psychologists
"see" one or more of these attractive garments on the p-value.

I have been asked what we should do instead of null hypothesis testing.
The answer is: not a new ritual. Chapter 6 illustrated one alternative. First,
the data is tested against multiple hypotheses—alternative models of cogni-
tive strategies—rather than against one null hypothesis. Second, multiple hy-
potheses are tested against the judgments of each individual participant
rather than against the average across individuals, thus enabling detection of
multiple strategies. Third, hypotheses are tested against both outcome and
process data rather than outcome data only.
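
As a rough sketch of the first two steps, one can score each candidate strategy against each individual participant's responses and classify the participant by the best-scoring model. The data layout and the simple hit-rate criterion below are my own simplifications, not the models tested in Chapter 6.

    def classify_participant(responses, strategies):
        """Compare one participant's responses with the predictions of several
        candidate strategies (multiple hypotheses) instead of testing a single
        null hypothesis against the group average.

        responses:  dict mapping item -> observed judgment
        strategies: dict mapping strategy name -> {item: predicted judgment}
        """
        scores = {
            name: sum(responses[item] == predicted[item] for item in responses)
                  / len(responses)
            for name, predicted in strategies.items()
        }
        best = max(scores, key=scores.get)
        return best, scores

    # Classifying each participant separately (rather than averaging across
    # individuals) is what allows different strategies to be detected in
    # different people.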

The most important thing is to define candidate hypotheses before start-
ing the business of hypothesis testing. Null hypothesis testing encourages
theoretical laziness. To use it, one does not need to specify one's research
hypothesis or a substantive alternative except "chance." This scant require-
ment allows surrogates for theories to grow like weeds (Chapter 14). We
need statistical thinking, not statistical rituals.


12

How to Make Cognitive Illusions Disappear

Social psychology was transformed by the "cognitive revolution." Cognitive
imperialism has been both praised and lamented. But a second revolution has
transformed most of the sciences so fundamentally that it is now hard to see
that it could have been different before. It has made concepts such as proba-
bility, chance, and uncertainty indispensable for understanding nature, society,
and the mind. This sweeping conceptual change has been called the "proba-
bilistic revolution" (Kriiger, Daston, & Heidelberger, 1987; Kriiger, Gigerenzer,
& Morgan, 1987). The probabilistic revolution differs from the cognitive rev-
olution in its genuine novelty and its interdisciplinary scope. Statistical me-
chanics, Mendelian genetics, Brownian motion, radioactive decay, random
drift, randomized experimental design, statistical inference—these are some of
the fruits of that transformation. Social psychology was no exception. It cur-
rently bears the marks of both the cognitive revolution and the probabilistic
revolution.

Probabilistic and statistical concepts were piggybacked onto cognitive con-
cepts. Some of the most popular theories and research programs owed their
genesis to an analogy between social cognition and "intuitive statistics." In
1967, for instance, Harold Kelley proposed that the layperson attributes a cause
to an effect in the same way as a statistician of the Fisherian school would, by
(unconsciously) calculating an analysis of variance (ANOVA). Research on the
ANOVA mind soon became mainstream social psychology (Kelley & Michela,
1980). It is well documented in the history of science that statistics trans-
formed almost everything it touched. So has causal attribution (Chapter 1). Just
as statistical calculations are those of an individual statistician, attribution and
social cognition were investigated as the calculations of individual minds, con-
firming the individualism in social psychology (Newcombe & Rutter, 1982).

More recently, Bayesian statistics, rather than Fisherian statistics, has been
used as a yardstick to evaluate social cognition, and as measured by this new
yardstick, many people's judgments seemed to be flawed by fallacies and errors
in statistical reasoning. "Hot" motivational terms were replaced by the "cold"



cognitive language of intuitive statistics. Self-serving perceptions and attribu-
tions, ethnocentric beliefs, and many types of human conflict were analyzed
as passionless information-processing errors, due to basic shortcomings in in-
tuitive statistical reasoning (e.g., Borgida & Brekke, 1981; Nisbett & Ross, 1980;
Sherman, Judd, & Park, 1989). Social cognitive psychologists started to study
(what they believed to be) errors in probabilistic reasoning, such as the base-
rate fallacy, the conjunction fallacy, and overconfidence bias, and adopted the
explanatory language of Kahneman and Tversky's "heuristics," such as rep-
resentativeness and availability. Some, such as Strack (1988), even pointed to
Kahneman and Tversky's heuristics as primary evidence of the end of the "cri-
sis" of social psychology and of new, rising confidence and decisive progress
in the field.

Heuristics and Biases

The "heuristics and biases" program of Kahneman, Tversky, and others has
generated two main results concerning judgment under uncertainty: (1) a list
of so-called biases, fallacies, or errors in probabilistic reasoning, such as the
base-rate fallacy and the conjunction fallacy, and (2) explanations of these bi-
ases in terms of cognitive heuristics such as representativeness. Table 12.1
gives a taste of the conclusions drawn from this program.

Kahneman and Tversky (1982) see the study of systematic errors in proba-
bilistic reasoning, also called "cognitive illusions," as similar to that of visual
illusions. "The presence of an error of judgment is demonstrated by comparing
people's responses either with an established fact (e.g., that the two lines are
equal in length) or with an accepted rule of arithmetic, logic, or statistics"
(p. 493). Their distinction between "correct" and "erroneous" judgments un-
der uncertainty has been echoed by many social psychologists: "We follow
conventional practice by using the term 'normative' to describe the use of a
rule when there is a consensus among formal scientists that the rule is appro-
priate for the particular problem" (Nisbett & Ross, 1980, p. 13).

Social psychology is not the only area in which the "heuristics and biases"
program has made strong inroads. Experimental demonstrations of "fallacious"
judgments have entered law (e.g., Saks & Kidd, 1980), economics (e.g., Frey,
1990), management science (e.g., Bazerman, 1990), medical diagnosis (e.g.,
Casscells, Schoenberger, & Grayboys, 1978), behavioral auditing (see Shanteau,
1989), philosophy (e.g., Stich, 1990), and many other fields. There is no doubt
that understanding judgment under uncertainty is essential in all these fields.
It is the achievement of the "heuristics and biases" program to have finally
established this insight as a central topic of psychology. Earlier pioneers who
studied intuitive statistics (Hofstätter, 1939; Peirce & Jastrow, 1884; Wendt,
1966) had little impact. Even Ward Edwards and his colleagues (e.g., Edwards,
1968), who started the research from which Kahneman and Tversky's "heuristics
and biases" program emerged, had no comparable influence on cognitive and
social psychology.


Table 12.1 A sample of conclusions from the heuristics-and-biases program

In making predictions and judgments under uncertainty, people do not appear to fol-
low the calculus of chance or the statistical theory of prediction. Instead, they rely on
a limited number of heuristics which sometimes yield reasonable judgments and some-
times lead to severe and systematic errors.

Daniel Kahneman & Amos Tversky, 1973, p. 237

It appears that people lack the correct programs for many important judgmental tasks.
. . . We have not had the opportunity to evolve an intellect capable of dealing concep-
tually with uncertainty.

Paul Slovic, Baruch Fischhoff, & Sarah Lichtenstein, 1976, p. 174

The genuineness, the robustness, and the generality of the base-rate fallacy are matters
of established fact.

Maya Bar-Hillel, 1980, p. 215

The biases of framing and overconfidence just presented suggest that individuals are
generally affected by systematic deviations from rationality.

Max Bazerman & M. A. Neale, 1986, p. 317

[Overconfidence bias] has proved so robust that it is hard to acquire much insight into
the psychological processes underlying it.

Baruch Fischhoff, 1988, p. 172

. . . mental illusions should be considered the rule rather than the exception.
Richard H. Thaler, 1991, p. 4

[We are] a species that is uniformly probability-blind, from the humble janitor to the
Surgeon General. . . . We should not wait until A. Tversky and D. Kahneman receive a
Nobel prize for economics. Our self-liberation from cognitive illusions ought to start
even sooner.

Massimo Piattelli-Palmarini, 1991, p. 35


Despite its influence, I will argue that the "heuristics and biases" program
is merely an important transitional stage, which must be transformed if long-
term progress is to be made. I will review some serious shortcomings of that
research program and show how they can be overcome.

In this chapter I do three things. First, I discuss the validity of the normative
yardstick that is used to define people's judgments as systematic errors in prob-
abilistic reasoning. I will argue that most so-called errors or cognitive illusions
are, contrary to the assertions in the literature, in fact not violations of prob-
ability theory. In their normative claims, Tversky and Kahneman, and social
psychologists following in their footsteps, have neglected conceptual distinc-
tions that are fundamental to probability and statistics. Second, I show that if
we pay attention to these conceptual distinctions, we can make apparently
stable "cognitive illusions" disappear, reappear, or even invert. Third, the in-
teresting fact that intuitive reasoning is highly sensitive to conceptual distinc-
tions made by statisticians (but ignored by many psychologists) leads to a re-
vised understanding of judgment under uncertainty.


Why Biases Are Not Biases

In the "heuristics and biases" program, a bias or error in probabilistic reason-
ing is defined as a systematic discrepancy between a person's judgment and a
norm. What is that norm? It is often referred to as "the normative theory of
prediction" (Kahneman & Tversky, 1973, p. 243),as the "normative principles
of statistical prediction" (Ajzen, 1977, p. 304), or simply as an "accepted rule"
of statistics. Many have understood this rhetoric to imply that there exists
precisely one "correct" answer to the cab problem, engineer-lawyer problem,
Linda problem, and other problems posed to participants—an answer sanc-
tioned by the authority of the eminent mathematicians, probabilists, and stat-
isticians of this century. The claim that all these problems have one correct
answer is crucial. If they did not have one and only one answer, it would make
little sense first to identify "errors" and "cognitive illusions" and then to use
these cognitive illusions to understand the principles of inductive reasoning,
in the way that visual illusions are used to understand the principles of normal
perception. This two-step program, identifying errors and explaining them, in
analogy to perceptual research, is the basic idea behind the heuristics-and-
biases program (Kahneman & Tversky, 1982, p. 493).

But what does the "heuristics and biases" investigation of judgment under
uncertainty have to do with probability and statistics? The short answer to this
question is: all too little. The probabilistic rules against which cognitive and
social psychologists have measured the proficiency of their participants are in
fact a highly (and, I shall argue, often misleadingly) selected sample of those
routinely used, consulted, and discussed by working probabilists and statisti-
cians. When claiming "errors" and "fallacies," cognitive and social psychol-
ogists have largely ignored conceptual and technical distinctions fundamental
to probability and statistics.

What in the heuristics-and-biases literature is called the "normative theory
of probability" or the like is in fact a very narrow kind of neo-Bayesian view
that is shared by some theoretical economists and cognitive psychologists and
to a lesser degree by practitioners in business, law, and artificial intelligence.
It is not shared by proponents of the frequentist view of probability that dom-
inates today's statistics departments, nor by proponents of many other views;
it is not even shared by all Bayesians, as I shall show shortly. By this narrow
standard of "correct" probabilistic reasoning, the most distinguished proba-
bilists and statisticians of our century—figures of the stature of Richard von
Mises and Jerzy Neyman—would be guilty of "biases" in probabilistic reason-
ing.1 Let me illustrate this point with some of the best-known demonstrations
of "fallacies."

1. Despite the widespread rhetoric of a single "normative theory of prediction," it
should be kept in mind that the problem of inductive reasoning still has no universal
solution (the "scandal of philosophy") but many competing ones. The controversies
between the Fisherians, the Neyman-Pearsonians, and the Bayesians are evidence of this
unresolved rivalry. For the reader who is not familiar with the fundamental issues, two
basic themes may help introduce the debate (for more, see Hacking, 1965). The first
issue relevant for our topic is whether probability is additive (that is, satisfies the Kol-
mogorov axioms, e.g., that the probabilities of all possible events sum up to 1) or not.
The above-mentioned points of view (including that of the heuristics-and-biases pro-
gram) subscribe to additivity, whereas L. J. Cohen's (e.g., 1982) Baconian probabilities
are nonadditive (for more on nonadditive theories, see Shafer, 1976). In my opinion,
Cohen correctly criticizes the normative claims in the heuristics-and-biases program
insofar as not all uses of "probability" that refer to single events must be additive—but
this does not imply that Baconian probability is the only alternative, nor that one should
assume, as Cohen did, that all minds reason rationally (or at least are competent to do
so) in all situations. I do not deal with this issue in this chapter (but see Gigerenzer,
1991d). The second fundamental issue is whether probability theory is about relative
frequencies in the long run or (also) about single events. For instance, the question "What
is the relative frequency of women over 60 who have breast cancer?" refers to frequen-
cies, whereas "What is the probability that Ms. Young has breast cancer?" refers to a
single event. Bayesians usually assume that (additive) probability theory is about single
events, whereas frequentists hold that statements about single cases have nothing to do
with probability theory (they may be dealt with by cognitive psychology, but not by
probability theory).


Overconfidence Bias

Confidence in general knowledge is typically studied with questions of the
following kind:

Which city has more inhabitants?
(a) Hyderabad, (b) Islamabad

How confident are you that your answer is correct?
50% / 60% / 70% / 80% / 90% / 100%

The participant chooses what he or she believes is the correct answer and
then rates his or her confidence that the answer is correct. After many partic-
ipants answer many questions, the experimenter counts how many answers in
each of the confidence categories were actually correct. The typical finding is
that in all the cases in which participants said, "I am 100% confident that my
answer is correct," the relative frequency of correct answers was only about
80%; in all the cases in which they said, "I am 90% confident that my answer
is correct," the relative frequency of correct answers was only about 75%, and
so on (for an overview, see Lichtenstein, Fischhoff, & Phillips, 1982). This
systematic discrepancy between confidence and relative frequency is termed
"overconfidence."

Little has been achieved in explaining this "bias." A common proposal is
to explain "biases" by other, deeper mental flaws. For instance, Koriat, Lich-
tenstein, and Fischhoff (1980) proposed that the overconfidence bias is caused
by a "confirmation bias." Their explanation was this: After one alternative is
chosen, the mind searches for further information that confirms the answer
given, but not for information that could falsify it. This selective information
search artificially increases confidence. The key idea in this explanation is that
the mind is not a Popperian. Despite the popularity of the confirmation bias
explanation in social psychology, there is little or no support for this hypoth-
esis in the case of confidence judgments (see Chapter 7).

As with many "cognitive illusions," overconfidence bias seems to be a ro-
bust fact waiting for a theory. This "fact" was quickly generalized to account
for human disasters of many kinds, such as deadly accidents in industry (Spet-
tell & Liebert, 1986), confidence in clinical diagnosis (Arkes, 1981), and short-
comings in management and negotiation (Bazerman, 1990) and in the legal
process (Saks & Kidd, 1980), among others.

The Normative Issue Is overconfidence bias really a "bias" in the sense of a
violation of probability theory? Let me rephrase the question: Has probability
theory been violated if one's degree of belief (confidence) in a single event (i.e.,
that a particular answer is correct) is different from the relative frequency of
correct answers one generates in the long run? The answer is "no." It is in fact
not a violation according to several interpretations of probability.

Let us look first at the now dominant school of probability: the frequentists
(the frequentist interpretation of probability has been dominant since about
1840; see Daston, 1988; Porter, 1986). Most readers of this chapter will have
been trained in the frequentist tradition and, for instance, will have been taught
that the probabilities of Type I and Type II errors are long-run frequencies of
errors in repeated experiments, not probabilities of single outcomes or hy-
potheses. For a frequentist like the mathematician Richard von Mises, the term
"probability," when it refers to a single event, "has no meaning at all for us"
(1928/1957, p. 11). For predictions of single events, as studied in present-day
overconfidence research, he put the issue in crystal-clear terms: "Our proba-
bility theory has nothing to do with questions such as: 'Is there a probability
of Germany being at some time in the future involved in a war with Liberia?' "
(p. 9). In this view, probability theory is about frequencies, not about single
events. To compare the two means comparing apples with oranges.

Even the major opponents of the frequentists—subjectivists such as Bruno
de Finetti—would not generally think of a discrepancy between confidence
and relative frequency as a "bias," albeit for different reasons. For a subjectiv-
ist, probability is about single events, but rationality is identified with the
internal consistency of subjective probabilities. As de Finetti emphasized,
"however an individual evaluates the probability of a particular event, no ex-
perience can prove him right, or wrong; nor, in general, could any conceivable
criterion give any objective sense to the distinction one would like to draw,
here, between right and wrong" (1931/1989, p. 174).

Other theories and interpretations of probability are also at odds with the
claim that overconfidence is a bias, that is, a violation of probability theory.
But I will stop here and summarize the normative issue. A discrepancy be-
tween confidence in single events and relative frequencies in the long run is
not an error or a violation of probability theory from many experts' points of


view. It only looks like it from a narrow interpretation of probability that blurs
the distinction between single events and frequencies fundamental to proba-
bility theory. (The choice of the word "overconfidence" for the discrepancy
put the "fallacy" message into the term itself.)

How to Make the Cognitive Illusion Disappear If there are any robust cognitive
biases at all, overconfidence in one's knowledge would seem to be a good
candidate. "Overconfidence is a reliable, reproducible finding" (von Winter-
feldt & Edwards, 1986, p. 539). "Can anything be done? Not much" (Edwards
& von Winterfeldt, 1986, p. 656). "Debiasing" methods, such as warning the
participants of the overconfidence phenomenon before the experiment and of-
fering them money to avoid it, have had little or no effect (Fischhoff, 1982).

Setting the normative issue straight has important consequences for under-
standing confidence judgments. Let us go back to the metaphor of the mind as
an intuitive statistician. I now take the term "statistician" to refer to a statis-
tician of the dominant school in this (and in the last) century, not one adopting
the narrow perspective some psychologists and economists have suggested.
Assume that the mind is a frequentist. Like a frequentist, the mind should be
able to distinguish between single-event confidences and frequencies in the
long run.

This view has testable consequences. Ask people for their estimated fre-
quencies of correct answers and compare them with true frequencies of correct
answers, instead of comparing the latter frequencies with confidences. We are
now comparing apples with apples. Ulrich Hoffrage, Heinz Kleinbölting, and
I carried out such experiments. Participants answered several hundred ques-
tions of the Islamabad-Hyderabad type (see above), and, in addition, estimated
their frequencies of correct answers.

Table 12.2 (top row) shows the usual "overconfidence bias" when single-
event confidences are compared with actual relative frequencies of correct
answers. In both experiments, the difference was around 13 to 15 percentage
points, which is a large discrepancy.

Table 12.2 How to make the overconfidence bias disappear

                                                Experiment 1    Experiment 2
Difference between                              (n = 80)        (n = 97)

Mean confidence and relative frequency          +13.8           +15.4
of correct answers ("overconfidence bias")

Estimated frequency and frequency of            -2.4            -4.2
correct answers

Note: To make values for frequency and confidence judgments comparable, all frequen-
cies were transformed to relative frequencies. Values shown are differences multiplied
by a factor of 100. Positive values denote "overconfidence" (Gigerenzer, Hoffrage, &
Kleinbölting, 1991).


After each set of 50 general knowledge questions, we asked the same participants, "How many of these 50 questions
do you think you got right?" Comparing their estimated frequencies with actual
frequencies of correct answers made "overconfidence" disappear. Table 12.2
(second row) shows that estimated frequencies were practically identical with
actual frequencies, with even a small tendency toward underestimation. The
"cognitive illusion" was gone. Similar results were obtained when participants
estimated the relative frequencies of correct answers in each confidence cate-
gory. In all cases in which participants said they were "100% (90%,
80%, . . .) confident," they estimated that, in the long run, they had a lower
percentage of answers correct, and their estimates were close to the true rela-
tive frequencies of correct answers (May, 1987, reported similar results). Elim-
inating the experimenter's normative confusion between single events and fre-
quencies made the participants' "overconfidence bias" disappear.

The general point is (i) a discrepancy between probabilities of single events
(confidences) and long-run frequencies need not be framed as an "error" and
called "overconfidence bias," and (ii) judgments need not be "explained" by
a flawed mental program at a deeper level, such as "confirmation bias." Rather,
people seem to be able intuitively to make conceptual distinctions similar to
those that professional statisticians make. How they do it can be accounted for
by the theory of "probabilistic mental models" (PMM), which explains both
confidence and frequency judgments in terms of frequentist probability cues
(Chapter 7). PMM theory is a frequentist theory of judgment and uncertainty;
it can predict overconfidence, good calibration, and underestimation within
the same participant.

Conjunction Fallacy

The original demonstration of the "conjunction fallacy" was with problems of
the following kind (Tversky & Kahneman, 1983, p. 299):

Linda is 31 years old, single, outspoken and very bright. She majored in
philosophy. As a student, she was deeply concerned with issues of dis-
crimination and social justice, and also participated in antinuclear dem-
onstrations.

Participants were asked which of two alternatives was more probable:

Linda is a bank teller (T)
Linda is a bank teller and is active in the feminist movement (T&F)

Eighty-five percent of the participants chose T&F in the Linda problem (see
Table 12.3). Tversky and Kahneman, however, argued that the "correct" an-
swer is T, because the probability of a conjunction of two events, such as T&F,
can never be greater than that of one of its constituents. They explained this
"fallacy" as induced by the representativeness heuristic. They assumed that
judgments were based on the match (similarity, representativeness) between
the description of Linda and the two alternatives T and T&F. That is, since
Linda was described as if she were a feminist and T&F contains the term "fem-
inist," people believe that T&F is more probable.

Table 12.3 Linda problem: How to make the conjunction fallacy disappear

                                        Single-event
Linda problem                           probability     Frequency

Tversky & Kahneman (1983)
  Which is more probable?               85              —
  Probability ratings                   82              —
  Probability ratings T*                57              —
  Betting                               56              —

Fiedler (1988)
  Exp. 1                                91              22
  Exp. 2                                83              17

Hertwig & Gigerenzer (1999)
  Studies 1 and 3                       83               0
  Studies 2 and 4                       88              13

Note: Numbers are violations (in %) of the conjunction rule. The various versions
of the Linda problem are (i) which is more probable (see text), (ii) probability
ratings on a 9-point scale, (iii) probability ratings using the alternative "Linda is
a bank teller whether or not she is active in the feminist movement" (T*) instead
of "Linda is a bank teller" (T), (iv) hypothetical betting, that is, participants were
asked "If you could win $10 by betting on an event, which of the following would
you choose to bet on?" Fiedler asked participants to rank order T, T&F, and other
alternatives with respect to their probability. In his first frequency version the
population size was always 100, in the second it varied. Hertwig and Gigerenzer
asked participants to rank order T, T&F, and F with respect to their probability,
or estimate their frequency. Tversky and Kahneman (1983, p. 309) had reported
a facilitating effect of frequency judgments for a different problem.


This alleged demonstration of human irrationality in the Linda problem has
been widely publicized in psychology, philosophy, economics, and beyond.
Stephen J. Gould (1992, p. 469) put the message clearly:

I am particularly fond of [the Linda] example, because I know that the
[conjunction] is least probable, yet a little homunculus in my head con-
tinues to jump up and down, shouting at me, "but she can't just be a
bank teller; read the description." . . . Why do we consistently make this
simple logical error? Tversky and Kahneman argue, correctly I think, that
our minds are not built (for whatever reason) to work by the rules of
probability.

I suggest that Gould should have had more trust in the rationality of his ho-
munculus.

The Normative Issue Is the "conjunction fallacy" a violation of probability
theory, as has been claimed in the literature? Has a person who chooses T&F
as the more probable alternative violated probability theory? Again, the answer


is "no." Choosing T&Fis not a violation of probability theory, and for the same
reason given previously. For a frequentist, this problem has nothing to do with
probability theory. Participants were asked for the probability of a single event
(that Linda is a bank teller), not for frequencies. For instance, the statistician
Barnard (1979) commented thus on subjective probabilities for single events:
"If we accept it as important that a person's subjective probability assessments
should be made coherent, our reading should concentrate on the works of
Freud and perhaps Jung rather than Fisher and Neyman" (p. 171).

Note that problems that are claimed to demonstrate the "conjunction fal-
lacy" are structurally different from "confidence" problems. In the former, sub-
jective probabilities (that Linda is a bank teller or a bank teller and a feminist)
are compared with one another; in the latter, they are compared with frequen-
cies.

To summarize the normative issue, what is called the "conjunction fallacy"
looks like a violation of some subjective theories of probability, including Bay-
esian theory. It is not, however, a violation of a major view of probability, the
frequentist conception.

How to Make the Cognitive Illusion Disappear What if the mind were a fre-
quentist? If the untutored mind is as sensitive to the distinction between single
cases and frequencies as a statistician of the frequentist school is, then we
should expect dramatically different judgments if we pose the above problem
in a frequentist mode, such as the following:

There are 100 persons who fit the description above (i.e., Linda's).
How many of them are:
(a) bank tellers
(b) bank tellers and active in the feminist movement.

Participants are now asked for frequency judgments rather than for single-
event probabilities. If the mind solves the Linda problem by using a represen-
tativeness heuristic, changes in information representation should not matter
because they do not change the degree of similarity. The description of Linda
is still more representative of (or similar to) the conjunction "teller and femi-
nist" than of "teller." Participants therefore should still exhibit the conjunction
fallacy. Table 12.3, however, shows that with frequency judgments, the "con-
junction fallacy" largely disappears. The effect is dramatic, from some 80% to
90% conjunction violations in probability judgments to 10% to 20% in fre-
quency judgments, with one study even reporting 0%.

What accounts for this striking effect of frequency judgments? Hertwig and
Gigerenzer (1999) analyzed how participants understood the phrase "which is
more probable?", for instance, by asking them to paraphrase the problem to
another person who is not a native speaker of the language in which the prob-
lem was presented. The results indicate that most participants did not under-
stand "probability" in the sense of mathematical probability but as one of the
many other legitimate meanings that are listed in, for example, the Oxford
English Dictionary (e.g., meaning credibility, typicality, or that there is evi-


dence). The term frequency, unlike probability, narrows down the spectrum of
possible interpretations to meanings that follow mathematical probability.

The results in Table 12.3 are consistent with the earlier work by Inhelder
and Piaget (1969), who showed children a box containing wooden beads, most
of them brown, but a few white. They asked the children, "Are there more
wooden beads or more brown beads in this box?" By the age of eight, a majority
of children responded that there were more wooden beads, indicating that they
understand conjunctions (class inclusions). Note that Inhelder and Piaget
asked children for frequency judgments, not probability judgments.

Base-Rate Fallacy

Among all cognitive illusions, the "base-rate fallacy" has probably received
the most attention. The neglect of base rates seems in direct contradiction to
the widespread belief that judgments are unduly affected by stereotypes (Land-
man & Manis, 1983), and for this and other reasons it has generated a great
deal of interesting research on the limiting conditions for the "base-rate fal-
lacy" in attribution and judgment (e.g., Ajzen, 1977; Borgida & Brekke, 1981).
For instance, in their review, Borgida and Brekke argue for the pervasiveness
of the "base-rate fallacy" in everyday reasoning about social behavior, ask the
question "Why are people susceptible to the base-rate fallacy?" (1981, p. 65),
and present a list of conditions under which the "fallacy" is somewhat re-
duced, such as "vividness," "salience," and "causality" of base-rate informa-
tion.

My analysis is different. Again I first address the normative claims that
people's judgments are "fallacies" using two examples that reveal two different
aspects of the narrow understanding of good probabilistic reasoning in much
of this research.

The first is from Casscells, Schoenberger, and Grayboys (1978, p. 999) and
presented by Tversky and Kahneman (1982b, p. 154) to demonstrate the gen-
erality of the phenomenon:

If a test to detect a disease whose prevalence is 1/1000 has a false positive
rate of 5%, what is the chance that a person found to have a positive
result actually has the disease, assuming you know nothing about the
person's symptoms or signs?

Sixty students and staff at Harvard Medical School answered this medical
diagnosis problem. Almost half of them judged the probability that the person
actually had the disease to be 0.95 (modal answer), the average answer was
0.56, and only 18% of participants responded 0.02. The latter was considered
to be the correct answer. Note the enormous variability in judgments. Little
has been achieved in explaining how people make these judgments and why
the judgments are so strikingly variable.

The Normative Issue But do statistics and probability give one and only one
"correct" answer to that problem? The answer is again "no." And for the same


reason, as the reader will already have guessed. As in the case of confidence
and conjunction judgments, participants were asked for the probability of a
single event, that is, that "a person found to have a positive result actually has
the disease." If the mind is an intuitive statistician of the frequentist school,
such a question has no necessary connection to probability theory. Further-
more, even for a Bayesian, the medical diagnosis problem has several possible
answers. One piece of information necessary for a Bayesian calculation is miss-
ing: the test's long-run frequency of correctly diagnosing persons who have the
disease (admittedly a minor problem if we can assume a high "true positive
rate"). A more serious difficulty is that the problem does not specify whether
or not the person was randomly drawn from the population to which the base
rate refers. Clinicians, however, know that patients are usually not randomly
selected—except in screening and large survey studies—but rather "select"
themselves by exhibiting symptoms of the disease. In the absence of random
sampling, it is unclear what to do with the base rates specified. The modal
response, 0.95, would follow from applying the Bayesian principle of indif-
ference (i.e., same prior probabilities for each hypothesis), whereas the answer
0.02 would follow from using the specified base rates and assuming random
sampling. In fact, the range of actual answers corresponds quite well to the
range of possible solutions.
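
For concreteness, both the modal answer and the "correct" answer can be reconstructed with Bayes's rule; the calculation below is my rendering and assumes, as noted above, a true positive rate of 1. With the specified base rate and random sampling,

    \[
    P(D \mid +) \;=\; \frac{P(+ \mid D)\,P(D)}
                           {P(+ \mid D)\,P(D) \,+\, P(+ \mid \neg D)\,P(\neg D)}
                \;=\; \frac{1 \times 0.001}{1 \times 0.001 \,+\, 0.05 \times 0.999}
                \;\approx\; 0.02,
    \]

whereas the principle of indifference, \(P(D) = P(\neg D) = 0.5\), yields

    \[
    P(D \mid +) \;=\; \frac{1 \times 0.5}{1 \times 0.5 \,+\, 0.05 \times 0.5}
                \;=\; \frac{1}{1.05} \;\approx\; 0.95.
    \]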

How to Make the Cognitive Illusion Disappear The literature overflows with
assertions of the generality and robustness of the "base-rate fallacy," such as:
"the base-rate effect appears to be a fairly robust phenomenon that often results
from automatic or unintentional cognitive processes" (Landman & Manis,
1983, p. 87); and "many (possibly most) subjects generally ignore base rates
completely" (Pollard & Evans, 1983, p. 124; see also Table 12.1). Not only are
the normative claims often simplistic and, therefore, misleading, but so too are
the robustness assertions.

What happens if we do something similar as for the "overconfidence bias"
and the "conjunction fallacy," that is, rephrase the medical diagnosis problem
in a frequency format? Cosmides and Tooby (1996) did so. They compared the
original problem (above) with a frequency format, in which the same infor-
mation was given:

One out of 1000 Americans has disease X. A test has been developed to
detect when a person has disease X. Every time the test is given to a
person who has the disease, the test comes out positive. But sometimes
the test also comes out positive when it is given to a person who is
completely healthy. Specifically, out of every 1000 people who are per-
fectly healthy, 50 of them test positive for the disease.

Imagine that we have assembled a random sample of 1000 Americans.
They were selected by a lottery. Those who conducted the lottery had
no information about the health status of any of these people. How many
people who test positive for the disease will actually have the disease?

_____ out of _____


In this frequentist version of the medical diagnosis problem, both the in-
formation and the question are phrased in terms of frequencies. (In addition,
the two pieces of information missing in the original version [see above] are
supplied. In numerous other versions of the medical diagnosis problem, Cos-
mides and Tooby showed that the striking effect [see Table 12.4] on partici-
pants' reasoning is mainly due to the transition from a single-event problem
to a frequency format, and only to a lesser degree to the missing information.)
Participants were Stanford University undergraduates.

If the question was rephrased in natural frequencies, as shown above, then
the Bayesian answer of 0.02—that is, the answer "one out of 50 (or 51)"—was
given by 76% of the participants. The "base-rate fallacy" disappeared. By com-
parison, the original single-event version elicited only 12% Bayesian answers
in Cosmides and Tooby's study. Chapter 6 provides an explanation for this
effect.
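
The natural-frequency arithmetic behind "one out of 50 (or 51)" is transparent (my rendering): of 1,000 people, 1 has the disease and tests positive, while about 50 of the 999 healthy people also test positive, so

    \[
    P(D \mid +) \;\approx\; \frac{1}{1 + 50} \;=\; \frac{1}{51} \;\approx\; 0.02.
    \]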

Cosmides and Tooby identified one condition in which almost every par-
ticipant found the Bayesian answer of 0.02. Participants received the frequen-
tist version of the medical diagnosis problem (except that it reported a random
sample of "100 Americans" instead of "1000 Americans"), and in addition a
page with 100 squares (10 × 10). Each of these squares represented one Amer-
ican. Before the frequentist question "How many people who test positive . . . "
was put, participants were asked to (i) circle the number of people who will
have the disease and (ii) fill in squares to represent people who will test
positive. After that, 23 out of 25 participants came up with the Bayesian an-
swer (see frequency format, pictorial, in Table 12.4).

All three examples point in the same direction: The mind acts as if it were
a frequentist; it distinguishes between single events and frequencies in the long
run—just as probabilists and statisticians do. Despite the fact that researchers
in the "heuristics and biases" program routinely ignore this distinction fun-
damental to probability theory when they claim to have identified "errors," it
would be foolish to label these judgments "fallacies." These results not only
point to a truly new understanding of judgment under uncertainty, but they
also seem to be relevant for teaching statistical reasoning.

Table 12.4 How to make the "base-rate fallacy" disappear: The
medical diagnosis problem

                                                        Bayesian
Medical diagnosis problem                     N         answers (%)

Original single-event version                 60        18
(Casscells, Schoenberger, & Grayboys, 1978)

Single-event version, replication             25        12
(Cosmides & Tooby, 1996)

Frequency format                              50        76
(Cosmides & Tooby, 1996)

Frequency format, pictorial                   25        92
(Cosmides & Tooby, 1996)



Selected versus Random Sampling: More on the
Base-Rate Fallacy

Another conceptual distinction routinely used by probabilists and statisticians
is that between random sampling and selected sampling. Again, little attention
has been given to that distinction when intuitive statistical reasoning is inves-
tigated. The original medical diagnosis problem is silent about whether the
patient was randomly selected from the population. That this crucial infor-
mation is missing is not atypical. For instance, in the "Tom W." problem (Kah-
neman & Tversky, 1973), no information is given about how the personality
sketch of Tom W. was selected, whether randomly or not. The same holds for
the personality sketches of Gary W. and Barbara T. in Ajzen's (1977) base-rate
studies.

But the issue is not necessarily resolved simply by asserting random sam-
pling verbally in the problem. Consider the following famous demonstration
of base-rate neglect in which random sampling is actually mentioned. A group
of students had to solve the engineer-lawyer problem (Kahneman & Tversky,
1973, pp. 241-242):

A panel of psychologists have interviewed and administered personality
tests to 30 engineers and 70 lawyers, all successful in their respective
fields. On the basis of this information, thumbnail descriptions of the 30
engineers and 70 lawyers have been written. You will find on your forms
five descriptions, chosen at random from the 100 available descriptions.
For each description, please indicate your probability that the person
described is an engineer, on a scale from 0 to 100.

Two of these thumbnail descriptions were:

Jack is a 45-year-old man. He is married and has four children. He is
generally conservative, careful, and ambitious. He shows no interest in
political and social issues and spends most of his free time on his many
hobbies which include home carpentry, sailing, and mathematical puz-
zles. The probability that Jack is one of the 30 engineers in the sample
of 100 is ___%.

Dick is a 30-year-old man. He is married with no children. A man of
high ability and high motivation, he promises to be quite successful in
his field. He is well liked by his colleagues.

A second group of students received the same instructions and the same
descriptions, but were told that the base rates were 70 engineers and 30 law-
yers (as opposed to 30 engineers and 70 lawyers). Kahneman and Tversky
found that the mean response in both groups of students was for the most part
the same, and concluded that base rates were largely ignored. Their explanation was that participants use a representativeness heuristic, that is, they judge the probability by the similarity (representativeness) between a description and
their stereotype of an engineer. Kahneman and Tversky believed that their
participants were violating "one of the basic principles of statistical predic-
tion," the integration of prior probability with specific evidence by Bayes's
rule. The result was given much weight: "The failure to appreciate the rele-
vance of prior probability in the presence of specific evidence is perhaps one
of the most significant departures of intuition from the normative theory of
prediction" (p. 243).2

The Normative Issue The phrase "the normative theory of prediction," or
probability, is standard rhetoric in the "heuristics and biases" program. But
what is this normative theory? Certainly it is not frequentist—to name, for
example, only the most popular theory of probability. So let us infer what the
authors mean by "the normative theory" from what they want their partici-
pants to do. This seems to be simply to apply a formula—Bayes's rule—to the
engineer-lawyer problem. But there is more to good probabilistic reasoning
than applying formulas mechanically. There are assumptions to be checked
(see Mueser, Cowan, & Mueser, 1999). Is the structure of the problem the same
as the structure of the statistical model underlying the formula?

One important structural assumption is random sampling. If the descrip-
tions of Jack, Dick, and the others were not randomly sampled but selected,
the base rates of engineers and lawyers specified were indeed irrelevant. In
fact, the descriptions were made up and not randomly sampled from a popu-
lation with the base rates specified—although the participants were told the
contrary. Whether the single word "random" in the instruction is enough to
commit participants to this crucial structural assumption is a problem in it-
self—particularly because we cannot assume that people are familiar with sit-
uations in which profession guessing is about randomly drawn people. For
instance, both in the United States and in Germany there is a popular TV
program in which a panel of experts guesses the profession of a candidate,
who answers only "yes" or "no" to their questions. Here, the experts would
perform badly if they started out with the known base rates of professions, say
in the United States, and revised them according to Bayes's rule. The candi-
dates were selected, not randomly drawn.

2. The terms "prior probabilities" and "base rates" are frequently used interchange-
ably in the psychological literature. But these concepts are not identical. It is the prior
probabilities that are fed into Bayes's rule, and these priors may be informed by base
rates. Base rates are just one piece of information among several that a person can con-
sider relevant for making up her prior probabilities. Equating prior probabilities with
one particular kind of base-rate information would be a narrow understanding of Bayes-
ian reasoning. Such reasoning might be defensible in those situations in which one
knows very little, but not in real-life situations in which one can base judgments on rich
knowledge.


Random Sampling Increases Use of Base Rates One way to understand partic-
ipants' judgments is to assume that the engineer-lawyer problem activates ear-
lier knowledge associated with profession guessing, which can be used as an
inferential framework—a "mental model"—to solve the problem.3 But, as I
have argued, we cannot expect random sampling to be part of this mental
model. If my analysis is correct, then base-rate use can be increased if we take
care to commit the participants to the crucial property of random sampling—
that is, break apart their mental models and insert the new structural assump-
tion. In contrast, if the true explanation is that participants rely on the repre-
sentativeness heuristic, then the participants should continue to neglect base
rates.

There is a simple method of making people aware of random sampling in
the engineer-lawyer problem, which we used in a replication of the original
study (Gigerenzer, Hell, & Blank, 1988). The participants themselves drew each description (blindly) out of an urn, unfolded the description, and gave their probability judgments. There was no need to tell them about random sampling
because they did it themselves. This condition increased the use of base rates.
Participants' judgments were closer to Bayesian predictions than to base-rate
neglect. When we used, for comparison, the original study's version of the
crucial assumption—as a one-word assertion—neglect of base rates appeared
again (although less intensely than in Kahneman and Tversky's study).

Worthless Specific Evidence and the Base-Rate Fallacy The description of
"Dick" (see above) is a particularly interesting case. It was constructed to be
totally uninformative for distinguishing engineers from lawyers. Kahneman
and Tversky (1973) reported that the median probabilities were the same (0.50)
in both base-rate groups. That people neglect base rates even when only
"worthless specific evidence" is given has been taken by the authors to dem-
onstrate the "strength" (p. 242) of the representativeness heuristic. This strik-
ing result led to many a speculation:

The fact that the base rate is ignored even when the individuating infor-
mation is useless (for example, the target is 'ambitious' or 'well liked')
suggests that the preference for specific-level evidence is so great that
the base rate or high-level default information is not even retrieved once
the subject tries to make a prediction on the basis of the specific infor-
mation. (Holland et al., 1986, p. 218)

Such statements need to be corrected. First, if the crucial structural assumption of random sampling is made clear, base rates are no longer ignored in participants' judgments about the "uninformative" description of Dick, just as for "informative" descriptions such as Jack's. Second, and equally striking, I show that even with Kahneman and Tversky's original "verbal assertion method," that is, a one-word assertion of random sampling, there is in fact no support for the claim that judgments about an uninformative description are guided by a general representativeness heuristic—contrary to assertions in the literature.

3. I use the term "mental model" in a sense that goes beyond Johnson-Laird's (1983). As in the theory of probabilistic mental models (Chapter 7), a mental model is an inferential framework that generalizes the specific task to a reference class (and probability cues defined on it) that a person knows from his or her environment.

Table 12.5 lists all studies of the uninformative description "Dick" that I
am aware of—all replications of Kahneman and Tversky's (1973) verbal asser-
tion method. The two base-rate groups were always 30% and 70% engineers.
According to Kahneman and Tversky's argument, the difference between the
two base-rate groups should approach the difference between the two base
rates, that is, 40% (or somewhat less, if the description of Dick was not per-
ceived as totally uninformative by the participants). The last column shows
their result mentioned above, a zero difference, which we (Gigerenzer, Hell, &
Blank, 1988) could closely replicate. Table 12.5 also shows, however, that sev-
eral studies found substantial mean differences up to 37%, which comes very
close to the actual difference between base-rate groups.
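The 40% benchmark follows directly from the odds form of Bayes's rule. For a perfectly uninformative description D, the likelihood ratio is 1, so the posterior equals the prior:

p(E|D) / p(L|D) = [p(D|E) / p(D|L)] × [p(E) / p(L)] = p(E) / p(L).

Hence p(E|D) = p(E) in each group, and the difference between groups should be p70(E|D) − p30(E|D) = .70 − .30 = .40, that is, the 40% just mentioned.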

Seen together, the studies seem to be as inconsistent as it is possible to be:
Every result between zero difference (base-rate neglect) and the actual base-
rate difference has been obtained. This clearly contradicts the rhetoric of ro-
bustness and generality of the base-rate fallacy, such as: "Regardless of what
kind of information is presented, subjects pay virtually no attention to the base
rate in guessing the profession of the target" (Holland et al., 1986, p. 217). And
it contradicts the explanation of the so-called fallacy: the proposed general
representativeness heuristic.

Table 12.5 How to make the "base-rate fallacy" disappear: The uninformative
description "Dick" in the engineer-lawyer problem

                                       No. of         "Dick" encountered   Mean difference
Study                                  descriptions   first (relative      between base-
                                                      frequency)           rate groups^a

Gigerenzer, Hell, & Blank (1988)^b          6              1/6                  1.2
Kahneman & Tversky (1973)^c                 5              1/5                  0.0
Wells & Harvey (1978)                       2              1/2                 18.0
Ginossar & Trope (1987)^d                   1              1                   24.0
Ginossar & Trope (1980)^c                   1              1                   31.0
Gigerenzer, Hell, & Blank (1988)^e          1              1                   37.0

a. Entries are [p70(E|D) − p30(E|D)] × 100, where p70(E|D) is the mean probability judgment that
"Dick" is an engineer, given the description and the 70% base rate.
b. Order of descriptions systematically varied.
c. Medians (no means reported).
d. Three descriptions were used, but "Dick" was always encountered first.
e. Separate analysis for all participants who encountered "Dick" first.

How to explain these apparently inconsistent results? Table 12.5 gives us a
clue. There is a striking correlation between the number of descriptions each
participant read and judged and the mean difference between base-rate groups.
The key variable seems to be the relative frequency with which participants
encountered "Dick" first, which is a direct function of the number of descrip-
tions. In all studies in which only Dick was used (i.e., the number of descrip-
tions was 1), or in which a separate analysis was performed for all participants
who encountered Dick first, there is a strong base-rate effect. If Dick and one
informative description (Jack) were used, as in Wells and Harvey (1978), then
the base-rate effect is in between, because of averaging across participants who
encountered Dick either before or after the informative description. Thus Table
12.5 supports the following conclusions. (1) Contrary to claims in the litera-
ture, participants did make use of the base rates if only uninformative infor-
mation ("Dick") was presented. (2) The neglect of base rates occurred only in
a specific condition, that is, when the participants had encountered one or
more informative descriptions before they judged "Dick"—in other words,
when "Dick" occurred in the second, third, fourth, or later position. (3) The
more descriptions a participant encountered, the less often "Dick" was in the
first position, and—because of averaging across positions (such as in Kahne-
man and Tversky's study)—the smaller the difference was between base-rate
groups.

This result should not occur if the intuitive statistician operates with a rep-
resentativeness heuristic. Again, an explanatory framework using mental mod-
els based on knowledge about a particular domain (here, profession guessing)
seems to be superior. If an informative description is encountered first, a men-
tal model is activated that contains probability cues for professions, such as
hobbies and political attitudes. Once the mental model is activated, the mind
uses it as an inferential framework for similar-looking problems, that is, when
"Dick" is encountered as the second or subsequent problem. Carrying over
mental models to similar problems is analogous to perceptual judgment. We
watch the first few steps and then proceed on the hypothesis that the rest are
like the first (Gregory, 1974). This practice can sometimes make us stumble,
but this kind of uncertain and "risky" inference is what makes our perceptual
apparatus superior to any computer available today.

If the uninformative description is encountered first, however, then such a
mental model is not activated, because its probability cues would not discrim-
inate, and participants fall back on the only information available, the base
rates (which are, as I argued above, not part of the mental model of profession
guessing).

To summarize: (1) There is little justification for calling participants' judg-
ments in the engineer-lawyer problem an "error" in probabilistic reasoning,
because (aside from the frequentist argument) participants were not committed
to random sampling. (2) If one lets the participants do the random drawing,
base-rate neglect disappears. (3) That participants are sensitive to the distinc-
tion between random and selected (nonrandom) drawings shows again that the
framework of so-called "heuristics and biases" is much too narrow for understanding judgments under uncertainty (for similar results see Ginossar & Trope, 1987; Grether, 1980; Hansen & Donoghue, 1977; Wells & Harvey, 1977; but see
Nisbett & Borgida, 1975).

Note that the critical variable here is the content of a problem. There seems
to be a class of contents for which participants know from their environment
that base rates are relevant (as do birds; see Caraco, Martindale, & Whittam,
1980) or that random sampling is common (though they need not represent
these concepts explicitly), whereas in other contents this is not the case. Pro-
fession guessing seems to belong to the latter category. In contrast, predictions
of sports results, such as those of soccer games, seem to belong to the former.
For instance, we found that participants revised information about the pre-
vious performance of soccer teams (base rates) in light of new information
(half-time results) in a way that is indistinguishable from Bayesian statistics
(Gigerenzer, Hell, & Blank, 1988). Here verbal assertion of random drawing
was sufficient—there was no need for strong measures to break apart mental
models.

Heuristics

The concept of a "heuristic" has various meanings and a long history—from
Descartes's 21 heuristic rules for the direction of the mind to Duncker's heu-
ristic methods that guide the stepwise reformulation of a problem until it is
solved (Groner, Groner, & Bischof, 1983). The cognitive revolution has rein-
troduced the concept of a heuristic into psychology, in particular in the work
of Herbert Simon (1957). Because of limited information-processing abilities,
Simon argued, humans have to construct simplified models of the world. Heu-
ristics are a product of these: They are shortcuts that can produce efficient
decisions. Simon understood heuristics such as satisficing (i.e., selecting the
first option available that meets an aspiration level) as adaptive strategies in a
complex environment, in which alternatives for action are not given but must
be sought out.

In the 1970s, Kahneman and Tversky borrowed the term "heuristic" from
artificial intelligence to explain "errors" in probabilistic reasoning: "People
rely on a limited number of heuristic principles which reduce the complex
tasks of assessing probabilities and predicting values to simpler judgmental
operations. In general, these heuristics are quite useful, but sometimes they
lead to severe and systematic errors" (Tversky & Kahneman, 1974, p. 1124).
Although they repeatedly asserted that these heuristics are useful, almost all
of their work focused on how they lead to "errors." The three heuristics pro-
posed in the early 1970s—representativeness, availability, and anchoring and
adjustment—were a first, promising step to connect the rather atheoretical
Bayesian research of the 1960s with cognitive theory. But in the 30 years of
"heuristics and biases" research since then, a lack of theoretical progress is
possibly the most striking result. The absence of a general theory or even of
specific models of underlying cognitive processes has been repeatedly criticized (e.g., Jungermann, 1983; Wallsten, 1983), but to no avail. Why is this? I
believe that particular features of the use of the term "heuristic" have led to
the present conceptual dead end, and more research in a cul-de-sac will not
help. In my opinion, these features are the following.

The Function of Heuristics

In artificial intelligence research one hopes that heuristics can make computers
smart; in the "heuristics and biases" program one hopes that heuristics can
tell why humans are not smart. The fundamental problem with the latter is
that most "errors" in probabilistic reasoning that one wants to explain by heu-
ristics are in fact not errors, as I have argued above. Thus heuristics are meant
to explain what does not exist. Rather than explaining a deviation between
human judgment and allegedly "correct" probabilistic reasoning, future re-
search has to get rid of simplistic norms that evaluate human judgment instead
of explaining it.

Simon, and earlier Egon Brunswik, emphasized that cognitive functions are adaptations to a given environment and that we have to study the structure
of environments in order to infer the constraints they impose on reasoning.
Heuristics such as representativeness have little to say about how the mind
adapts to the structure of a given environment.

Redescription as a Substitute for Theorizing

Several of the explanations using heuristics are hardly more than redescrip-
tions of the phenomena reported. Take, for instance, the explanation of base-
rate neglect in the engineer-lawyer problem (and similar base-rate problems)
by the representativeness heuristic. Representativeness here means the per-
ceived similarity between a personality description and the participants'
stereotype of an engineer. In the vocabulary of Bayes's rule, this similarity is
a likelihood: that is, the probability of a description given that the person is
an engineer. Now we can see that Bayes's rule, in particular its concepts of
base rates (prior probabilities) and likelihoods, provides the vocabulary for
both the phenomenon and its purported explanation. The phenomenon is ne-
glect of base rates and use of likelihoods. The "explanation" is that participants
use representativeness (likelihoods) and do not use base rates. What is called
a representativeness heuristic here is nothing more than a redescription of the
phenomenon (Gigerenzer & Murray, 1987, pp. 153-155).
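Written out in Bayes's rule, the circularity is plain. The posterior odds that the person is an engineer (E) rather than a lawyer (L), given the description D, are

p(E|D) / p(L|D) = [p(D|E) / p(D|L)] × [p(E) / p(L)],

where the first factor on the right is the likelihood ratio, that is, "representativeness," and the second is the ratio of base rates. To say that participants judge by representativeness and neglect base rates is to say that they use the first factor and not the second: a restatement of the finding in the vocabulary of the very formula it is supposed to explain.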

Heuristics Are Largely Undefined Concepts

Representativeness means similarity. Although there are numerous specific
models of similarity (including Tversky, 1977), the relationship between the
representativeness heuristic and these models has never been worked out.
Fiedler (1983), for instance, has analyzed the theoretical weakness of explain-
ing estimated frequencies of events by the availability heuristic. All three heuristics, representativeness, availability, and anchoring and adjustment, are
largely undefined concepts and can post hoc be used to explain almost every-
thing. After all, what is similar to what (representativeness), what comes into
your mind (availability), and what comes first (anchoring) have long been
known to be important principles of the mind.

More Undefined Concepts, Less Theory

Instead of giving up the program of explaining deviations of human judgment
from simplistic norms by means of redescription and largely undefined heu-
ristics, the last 25 years have witnessed the effort to keep that program going
and to add further undefined concepts such as "causal base rates" and "viv-
idness" to account for contradictory results (for an analysis of the "causal base
rate" concept, see Gigerenzer & Murray, 1987, pp. 157-162). Heuristics such
as representativeness are by now riddled with exceptions, but all this tinkering
has not given us much purchase in understanding judgment under uncertainty.

Beyond Heuristics and Biases

I have argued that what have been widely accepted to be the "normative prin-
ciples of statistical prediction" (e.g., Ajzen, 1977, p. 304), against which hu-
man judgment has been evaluated as "fallacious," are a caricature of the pres-
ent state of probability theory and statistics. I have shown that several so-called
fallacies are in fact not violations of probability theory. Conceptual distinctions
routinely used by probabilists and statisticians were just as routinely ignored
in the normative claims of "fallacies." Most strikingly, in the experimental
research reviewed, "fallacies" and "cognitive illusions" tend to disappear if
we pay attention to these fundamental distinctions. I am certainly not the first
to criticize the notion of "robust fallacies." The only novelty in my research
is that the variables that bring "cognitive illusions" under experimental control
are those important from the viewpoint of probability and statistics (as op-
posed to, say, whether participants were given more or less "vivid" or "caus-
ally relevant" information).

Together, these results point to several ways to develop an understanding
of judgment under uncertainty that goes beyond the narrow notion of a "bias"
and the largely undefined notion of a "heuristic."

Use Different Statistical Models as Competing
Explanatory Models

The existence of different statistical models of inference is a rich resource for
developing theories about intuitive inference. This resource has been rarely
touched, possibly because of the misleading normative view that statistics
speaks with one voice.


For instance, despite the quantity of empirical data that has been gathered
on the cab problem, the lack of a theory of the cognitive processes involved
in solving it is possibly the most striking result. Tversky and Kahneman
claimed that the cab problem has one "correct answer" (1980, p. 62). They
attempted to explain the extent to which people's judgments deviated from
that "norm" by largely undefined terms such as "causal base rates." But sta-
tistics gives several interesting answers to the cab problem, rather than just
one "correct" answer (e.g., Birnbaum, 1983; Gigerenzer, 1998c; Levi, 1983). If
progress is to be made and people's cognitive processes are to be understood,
one should no longer try to explain the difference between people's judgments
and Tversky and Kahneman's "normative" Bayesian calculations. People's
judgments have to be explained. Statistical theories can provide highly inter-
esting models of these judgments. The only theoretically rich account of the
cognitive processes involved in solving the cab problem (or similar "eyewit-
ness testimony" problems) was in fact derived from a frequentist framework:
Birnbaum (1983) combined Neyman-Pearson theory with psychological mod-
els of judgments such as range-frequency theory.

Future research should use competing statistical theories as competing ex-
planatory models, rather than pretending that statistics speaks with one voice
(see also Cohen, 1982; Wallendael & Hastie, 1990).

Explore the Metaphor of the Mind as a Frequentist

I reported earlier the striking effect of participants' judging frequencies rather
than probabilities for single events. These results suggest that the mind distin-
guishes between frequencies and other meanings of probability, just as a stat-
istician of the frequentist school does. Because "cognitive illusions" tend to
disappear in frequency judgments, it is tempting to think of the intuitive sta-
tistics of the mind as frequentist statistics.

Processing of frequencies seems to be fairly automatic, like encoding of time
and space (e.g., Hasher & Zacks, 1979)—whereas probabilities are in evolu-
tionary terms recent tools of the mind that seem to be processed less auto-
matically. The theory of probabilistic mental models (Chapter 7) seems to be
the first frequentist theory of confidence judgments that integrates Brunswik's
frequency-learning view with the notion of mental models. The general theo-
retical point is that both single-case and frequency judgments are explained
by learned frequencies (the probability cues), albeit by frequencies that relate
to different reference classes and different networks of cues—in short, to dif-
ferent mental models.

Intuitive Statisticians Need to Check the Structure
of the Environment

Good judgment under uncertainty is more than mechanically applying a for-
mula, such as Bayes's rule, to a real-world problem. The intuitive statistician,
like his professional counterpart, must first check the structure of the environment (or of a problem) in order to decide whether to apply a statistical algo-
rithm at all, and if so, which (see Gigerenzer & Murray, 1987, pp. 162-174).
There is no good (applied) probabilistic reasoning that ignores the structure of
the environment and mechanically uses only one (usually mathematically con-
venient) algorithm. I illustrate this point with a thought experiment by Nisbett
and Ross (1980, p. 15), which I have shortened and slightly changed here (in
respects unimportant to my argument).

(i) You wish to buy a new car. Today you must choose between two
alternatives: to purchase either a Volvo or a Saab. You use only one cri-
terion for that choice, the car's life expectancy. You have information
from Consumer Reports that in a sample of several hundred cars the
Volvo has the better record. Just yesterday a neighbor told you that his
new Volvo broke down. Which car do you buy?

Nisbett and Ross comment that after the neighbor's information "the num-
ber of Volvo-owners has increased from several hundred to several hundred
and one" and that the Volvo's record "perhaps should be changed by an iota"
(p. 15). The moral of their thought experiment is that good probabilistic rea-
soning is applying an algorithm (here, updating of base rates) to the world.
There is some truth to this message of resisting the temptation of the vivid and
personal, but that is only half the story. Good intuitive statistics is more than
calm calculation; first and foremost, the structure of the environment has to
be examined. I will now vary the content of Nisbett and Ross's thought exper-
iment to make the point intuitively immediate. Here is the same problem, but
with a different content (Gigerenzer, 1990):

(ii) You live in a jungle. Today you must choose between two alterna-
tives: to let your child swim in the river, or to let it climb trees instead.
You use only one criterion for that choice, your child's life expectancy.
You have information that in the last 100 years there was only one ac-
cident in the river, in which a child was eaten by a crocodile, whereas
a dozen children have been killed by falling from trees. Just yesterday
your neighbor told you that her child was eaten by a crocodile. Where
do you send your child?

If good probabilistic reasoning means applying the same algorithm again
and again, the neighbor's testimony should make no difference. The base rates
would be updated by the testimony from one to two cases in 100 years, and
by this reasoning one would send the child into the river. The mind of a parent,
however, might use the new information to reject the updating algorithm in-
stead of inserting the new information into the algorithm. A parent may sus-
pect that the small river world has changed—crocodiles may now inhabit the
river.
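The contrast can be made concrete with a minimal Python sketch. The counts follow the thought experiment (1 river death and 12 tree deaths in 100 years); the decisive step is not the update itself but the judgment about whether the old reference class still applies:

river_deaths, tree_deaths, years = 1, 12, 100

# Mechanical updating: fold the neighbor's report into the old base rates
# as if it were one more independent drawing from the same reference class.
updated_river_rate = (river_deaths + 1) / years   # 0.02 deaths per year
tree_rate = tree_deaths / years                   # 0.12 deaths per year
# By this calculation the river still looks six times safer than the trees.

# The structural response: take the report as a sign that the river world
# has changed, reject the updating algorithm, and discard the old counts.
world_unchanged = False   # a judgment about the environment, not a calculation
send_child_to_river = world_unchanged and updated_river_rate < tree_rate
print(send_child_to_river)   # False: do not send the child into the river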

Why do we have different intuitions for the Volvo and the crocodile prob-
lems? In the Volvo problem, the prospective buyer may assume that the Volvo
world is stable and that the important event (good or bad Volvo) can be con-
sidered as an independent random drawing from the same reference class. In
the crocodile problem, the parents may assume that the river world has changed and that the important event (being eaten or not) can no longer be
considered as an independent random drawing from the same reference class.
Updating "old" base rates may be fatal for the child.

The question of whether some part of the world is stable enough to use
statistics has been posed by probabilists and statisticians since the inception
of probability theory in the mid-seventeenth century—and the answers have
varied and will vary, as is well documented by the history of insurance (Das-
ton, 1987). Like the underwriter, the layperson has to check structural as-
sumptions before entering into calculations. For instance, the following struc-
tural assumptions are all relevant for the successful application of Bayes's rule:
independence of successive drawings, random sampling, an exhaustive and
mutually exclusive set of hypotheses, and independence between prior prob-
abilities and likelihoods.

How can the intuitive statistician judge whether these assumptions hold?
One possibility is that the mind generalizes the specific content to a broader
mental model that uses implicit domain-dependent knowledge about these
structural assumptions. If so, then the content of problems is of central im-
portance for understanding judgment—it embodies implicit knowledge about
the structure of an environment.

The Surplus Structure of the Environment

Analyzing the environment (problem) using structural properties of a given
statistical model is one way to understand its structure. But natural environ-
ments often have surplus structure, that is, a structure that goes beyond prior
probabilities and likelihoods (the Bayesian structure) or entailment and con-
tradiction (the structure of binary propositional logic). Surplus structure in-
cludes space and time (Bjorkman, 1984), cheating options, perspective, and
social contracts (Cosmides, 1989), among others. Surplus structure is the rea-
son that the notion of "structural isomorphs" has only limited value.

The idea of studying inductive reasoning using structural isomorphs (i.e., using a particular statistical model or formal principle and constructing problems that all have the same formal structure but different contents) is implicit in
much research on reasoning; it postulates that if two problems have different
contents, but the same formal structure (say, Bayesian probability-revision
structure), then judgments should be the same. But the structure of natural
environments is usually richer than what Bayes's rule has to offer, and two
structural isomorphs may differ on relevant surplus structure. If we understand
reasoning as an adaptation to the environment, then it should be sensitive to
surplus structure.

One way to deal with this is to devise theories that combine statistical the-
ory and psychological principles—just as the most distinguished statisticians
of this century, R. A. Fisher, J. Neyman, and E. S. Pearson, emphasized that
good statistical reasoning always consists of mathematics and personal judg-
ment. Birnbaum (1983) gave several examples of how the Neyman-Pearson
theory can be combined with psychological principles to give a theoretically rich account of intuitive inference. Developing such integrated models is a
challenging task for future research on judgments under uncertainty.

The Social Context of Judgment and Decision

Judgment under uncertainty occurs in a social environment in which there are
other "players" who make a person's response more or less rational. Here is
an anecdote to illustrate this point.

A small town in Wales has a village idiot. He once was offered the choice
between a pound and a shilling, and he took the shilling. People came
from everywhere to witness this phenomenon. They repeatedly offered
him a choice between a pound and a shilling. He always took the shil-
ling.

Seen as a single choice (and by all monotone utility functions), this choice
would seem irrational. Seen in its social context, in which a surprising choice
increases the probability of getting to choose again, this behavior looks differ-
ent.

The following are several aspects of the social context of judgment and
decision that have been explored recently. First, human judgment seems to be
domain specific rather than guided by some general mental logic. In particular,
reasoning about social contracts seems to have its own laws. The striking
changes of judgment depending on people's perspective and cheating options
in a social contract were shown by Cosmides (1989) and Gigerenzer and Hug
(1992). Second, the role of conversational principles in social interactions,
such as that participants assume the experimenter's contribution will be co-
operative (Adler, 1984; Grice, 1975), has sometimes been acknowledged by,
but never been integrated into, the judgment under uncertainty literature.
Third, humans share knowledge and decisions, and sharing imposes con-
straints on information processing and judgment as postulated by shareability
theory (Freyd, 1983). Fourth, research on group decision making and judg-
ments negotiated by two or more people is still largely disconnected from
"individualistic" social cognition research (see Scholz, 1983).

Conclusion

A key metaphor for understanding inductive reasoning is probability theory.
Since its origins in the mid-seventeenth century and throughout the Enlight-
enment, probability theory was viewed as a mathematical codification of ra-
tionality. In Pierre Laplace's famous phrase: probability theory is "only good
sense reduced to calculus" (1814/1951, p. 196). When there was a striking
discrepancy between the judgment of reasonable people and what probability
theory dictated—as with the famous St. Petersburg paradox—then the mathe-
maticians went back to the blackboard and changed the equations (Daston,
1980).


Those good old days have gone, although the eighteenth-century link be-
tween probability and rationality is back in vogue in cognitive and social psy-
chology. If, in studies on social cognition, researchers find a discrepancy be-
tween human judgment and what probability theory seems to dictate, the
blame is now put on the human mind, not on the statistical model.

I have used classical demonstrations of overconfidence bias, conjunction
fallacy, and base-rate neglect to show that what have been called "errors" in
probabilistic reasoning are in fact not violations of probability theory. They
only look so from a narrow understanding of good probabilistic reasoning that
ignores conceptual distinctions fundamental to probability and statistics.
These so-called cognitive illusions largely disappear when one pays attention
to these conceptual distinctions. The intuitive statistician seems to be highly
sensitive to them—a result unexpected from the view that "mental illusions
should be considered the rule rather than the exception" (see Table 12.1).

Why do cognitive illusions largely disappear? The examples in this chapter
have illustrated three reasons:

1. Polysemy: not all probabilities are mathematical probabilities. Asking
a frequency as opposed to a probability question can reduce the mul-
tiple meanings (polysemy) of the English terms "probable" and
"likely." Frequency questions clarify that the question is actually
about mathematical probability and not about one of the other legit-
imate meanings (see the Oxford English Dictionary), which are often
suggested by the cover story of a problem. Reducing polysemy seems
to be the major reason the conjunction fallacy in the Linda problem
largely disappears (Hertwig & Gigerenzer, 1999).

2. A mathematical probability refers to a reference class, which may
differ depending on the task. Asking a frequency (as opposed to a
probability) question can systematically cue different reference classes
(and therefore, different probabilistic mental models). Changing reference
classes seems to be the reason overconfidence bias appears in
probability judgments and disappears in frequency judgments (Chapter 7).

3. Natural frequencies facilitate Bayesian reasoning. When information
is represented in natural frequencies rather than in conditional probabilities
(or relative frequencies), Bayesian computations become simpler,
as the schematic comparison below shows. Using natural frequencies
is a powerful tool to reduce people's mental confusion and foster
Bayesian reasoning (Chapter 6).
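In schematic form, the simplification is this. With conditional probabilities, Bayes's rule requires

p(H|D) = p(D|H) p(H) / [p(D|H) p(H) + p(D|¬H) p(¬H)],

whereas with natural frequencies, where a is the number of cases in the sample with both the datum and the hypothesis (e.g., true positives) and b the number of cases with the datum but not the hypothesis (e.g., false positives), the same posterior reduces to

p(H|D) = a / (a + b).

The base rates need not be attended to separately, because they are already contained in the raw counts.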

This is not to say that frequencies always improve judgment. For instance,
the theory of probabilistic mental models specifies conditions under which
frequency judgments systematically underestimate actual frequencies, and
Chapter 6 explains why natural frequencies but not other kinds of frequencies
facilitate Bayesian reasoning. The question is not whether or not, or how often,
"cognitive illusions" disappear, but why. We need precise models of heuristics
that make surprising (and falsifiable) predictions, not vague terms that, post
hoc, explain everything and nothing. Future progress will be in understanding,
not debunking, human thinking.

13

The Superego, the Ego, and the Id
in Statistical Reasoning

Piaget worked out his logical theory of cognitive development, Köhler the Ge-
stalt laws of perception, Pavlov the principles of classical conditioning, Skin-
ner those of operant conditioning, and Bartlett his theory of remembering and
schemata—all without rejecting null hypotheses. But by the time I took my
first course in psychology at the University of Munich in 1969, null hypothesis
tests were presented as the indispensable tool, as the sine qua non of scientific
research. Post-World War II German psychology mimicked a revolution of re-
search practice that had occurred between 1940 and 1955 in American psy-
chology.

What I learned in my courses and textbooks about the logic of scientific
inference was not without a touch of moralizing, a scientific version of the Ten
Commandments: Thou shalt not draw inferences from a nonsignificant result.
Thou shalt always specify the level of significance before the experiment; those
who specify it afterward (by rounding up obtained p values) are cheating. Thou
shalt always design thy experiments so that thou canst perform significance
testing.

The Inference Revolution

What happened between the time of Piaget, Köhler, Pavlov, Skinner, and Bart-
lett and the time I was trained? In Kendall's (1942) words, statisticians "have
already overrun every branch of science with a rapidity of conquest rivalled
only by Attila, Mohammed, and the Colorado beetle" (p. 69).

What has been termed the probabilistic revolution in science (Gigerenzer et
al., 1989) reveals how profoundly our understanding of nature changed when
concepts such as chance and probability were introduced as fundamental the-
oretical concepts. The work of Mendel in genetics, that of Maxwell and Boltz-
mann on statistical mechanics, and the quantum mechanics of Schrödinger
and Heisenberg that built indeterminism into its very model of nature are key
examples of that revolution in thought.


Psychology did not resist the probabilistic revolution, and psychologists in
turn actively contributed to the growth of statistics. But psychology is none-
theless a peculiar case. In psychology and in other social sciences, probability
and statistics were typically not used to revise the understanding of our subject
matter from a deterministic to some probabilistic view (as in physics, genetics,
or evolutionary biology) but rather to mechanize the experimenters' infer-
ences—in particular, their inferences from data to hypothesis. Of course, there
have been several attempts to revise our theories as well—for example, to
transform Piaget's logical determinism into a more Darwinian view, in which
variability and irregularity are seen as the motor of evolution rather than as
an annoyance (Gruber, 1977; Gruber & Voneche, 1977) or to transform Skin-
ner's theory into a probabilistic learning theory (Estes, 1959). But the real,
enduring transformation came with statistical inference, which became insti-
tutionalized and used in a dogmatic and mechanized way. This use of statis-
tical theory contrasts sharply with physics, in which statistics and probability
are indispensable in theories about nature, whereas mechanized statistical in-
ference such as null hypothesis testing is almost unknown.

So what happened with psychology? David Murray and I described the
striking change in research practice and named it the inference revolution in
psychology (Gigerenzer & Murray, 1987). It happened between approximately
1940 and 1955 in the United States and led to the institutionalization of one
brand of inferential statistics as the method of scientific inference in university
curricula, textbooks, and the editorial policies of major journals.

The ground for the inference revolution was prepared by a dramatic shift
in experimental practice. During the 1920s, 1930s, and 1940s, the established
tradition of experimenting with single participants—from Wundt to Pavlov—
was replaced in the United States by the treatment group experiment, in which
group means are compared. For instance, between 1915 and 1950, the per-
centage of empirical studies reporting only group data in the American Journal
of Psychology rose from 25% to 80%, and the reporting of only individual
data decreased from 70% to 17% (Danziger, 1990). Danziger argued that this
shift was in part due to the pressure felt by American academic psychologists
to legitimize their work through showing its practical utility. The Wundtian
type of experiment was useless to educational administrators, the largest market
for psychological products. The treatment group experiment, however, ap-
peared to fit their needs exactly, for example, by allowing them to compare
mean performance in two classrooms that were using different instruction
methods. After this change in experimental practice, null hypothesis testing
of group means appeared to be tailor-made to the new unit of research, the
group aggregate. Consistent with Danziger's argument, the institutionalization
of both the treatment group and null hypothesis testing spread from the applied
fields to the laboratories (Lovie, 1979). The contrast with Germany is inform-
ative. German academic psychologists of the early 20th century had to legit-
imize their work before a different tribunal, the values of a well-entrenched
intellectual elite (Danziger, 1990). In contrast to the United States, the German
educational system, driven by tradition rather than by experimentation, provided only a limited market for psychologists. No comparable shift in exper-
imental practice happened in German psychology. It was only after World War
II that a new generation of German psychologists began to assimilate the meth-
odological imperatives imported from their colleagues in the United States.

The figures are telling. Before 1940, null hypothesis testing using analysis
of variance or t tests was practically nonexistent: Rucci and Tweney (1980)
found only 17 articles in all from 1934 through 1940. By 1955, more than 80%
of the empirical articles in four leading journals used null hypothesis testing
(Sterling, 1959). Today, the figure is close to 100%. By the early 1950s, half of
the psychology departments in leading U.S. universities had made inferential
statistics a graduate program requirement (Rucci & Tweney, 1980). Editors and
experimenters began to measure the quality of research by the level of signif-
icance obtained. For instance, in 1962, the editor of the Journal of Experimen-
tal Psychology, A. W. Melton (1962, pp. 553-554), stated his criteria for ac-
cepting articles. In brief, if the null hypothesis was rejected at the .05 level but
not at the .01 level, there was a "strong reluctance" to publish the results,
whereas findings significant at the .01 level deserved a place in the journal.
The Publication Manual of the American Psychological Association (APA,
1974) prescribed how to report the results of significance tests (but did not
mention other statistical methods) and used, as Melton did, the label negative
results synonymously with "not having rejected the null" and the label posi-
tive results with "having rejected the null."

It is likely that Piaget's, Köhler's, Bartlett's, Pavlov's, and Skinner's experi-
mental work would have been rejected for publication under such editorial
policies—these men did not set up null hypotheses and try to refute them.
Some of them were actively hostile toward institutionalized statistics. For his
part, Skinner (1972) disliked the intimate link Fisher had established between
statistics and the design of experiments: "What the statistician means by the
design of experiments is design which yields the kind of data to which his
techniques are applicable" (p. 122). And, "They have taught statistics in lieu
of scientific method" (p. 319). Skinner continued to investigate one or a few
pigeons under well-controlled conditions, rather than run 20 or more pigeons
under inevitably less well-controlled conditions to obtain a precise estimate
for the error variance. In fact, the Skinnerians were forced to found a new
journal, the Journal of the Experimental Analysis of Behavior, in order to pub-
lish their kind of experiments (Skinner, 1984, p. 138). Their focus was on ex-
perimental control, that is, on minimizing error beforehand, rather than on
large samples, that is, on measuring error after the fact.

This is not an isolated case, nor one peculiar to behaviorists. The Journal of Mathematical Psychology is another. One of the reasons for launching this new journal was again to escape the editors' pressure to perform institutionalized null hypothesis testing.1 One of its founders, Luce (1988), called the institutionalized practice a "wrongheaded view about what constituted scientific progress" and "mindless hypothesis testing in lieu of doing good research: measuring effects, constructing substantive theories of some depth, and developing probability models and statistical procedures suited to these theories" (p. 582).

1. R. Duncan Luce, personal communication, April 4, 1990. See also Luce's (1989) autobiography, p. 270 and pp. 281-282.

Who is to blame for the present state of mindless hypothesis testing? Fisher
was blamed by Skinner, as well as by Meehl: "Sir Ronald has befuddled us, mes-
merized us, and led us down the primrose path. I believe that the almost univer-
sal reliance on merely refuting the null hypothesis . . . is ... one of the worst
things [that] ever happened in the history of psychology" (Meehl, 1978, p. 817).

I share the sentiments expressed by Luce and Meehl. But to blame Fisher,
as Meehl and Skinner did, gives us at best a spurious understanding of the
inference revolution. Fisher declared that a significance test of a null hypoth-
esis is only a "weak" argument. That is, it is applicable only in those cases in
which we have very little knowledge or none at all. For Fisher, significance
testing was the most primitive type of argument in a hierarchy of possible
statistical analyses (see Gigerenzer et al., 1989, chap. 3). In this chapter I argue
the following points:

1. What has become institutionalized as inferential statistics in psy-
chology is not Fisherian statistics. It is an incoherent mishmash of
some of Fisher's ideas on the one hand and some of the ideas of Ney-
man and E. S. Pearson on the other. I refer to this blend as the "hybrid
logic" of statistical inference. Fisher, Neyman, and Pearson would all
have rejected it, although for different reasons.

2. The institutionalized hybrid carries the message that statistics is sta-
tistics is statistics, that is, that statistics is a single integrated structure
that speaks with a single authoritative voice. This entails the claim
that the problem of inductive inference in fact has an algorithmic
answer (i.e., the hybrid logic) that works for all contents and contexts.
Both claims are wrong, and it is time to go beyond this institution-
alized illusion. We must write new textbooks and change editorial
practices. Students and researchers should be exposed to different
approaches (not one) to inductive inference and be trained to use
these in a constructive (not mechanical) way. A free market of several
good ideas is better than a state monopoly for a single confused idea.

The "Parents" and Their Conflicts

To understand the structure of the hybrid logic that has been taught in psy-
chology for some 50 years, I briefly sketch those ideas of Fisher on the one
hand and Neyman and Pearson on the other that are relevant to understanding
the hybrid structure of the logic of inference.

Fisher's first book, Statistical Methods for Research Workers, published in
1925, was successful in introducing biologists and agronomists to the new
techniques. However, it had the agricultural odor of issues like the weight of
pigs and the effect of manure, and, such alien topics aside, it was technically
far too difficult to be understood by most psychologists.


Fisher's second statistical book, The Design of Experiments, first published
in 1935, was most influential on psychology. At the very beginning of this
book, Fisher rejected the theory of inverse probability (Bayesian theory) and
congratulated the Reverend Bayes for having been so critical of his own theory
as to withhold it from publication (Bayes's treatise was published posthu-
mously in 1763). Bayes's rule is attractive for researchers because it allows one
to calculate the probability p(H|D) of a hypothesis H given some data D, also
known as inverse probability. A frequentist theory, such as Fisher's null hy-
pothesis testing or Neyman-Pearson theory, however, does not. It deals with
the probabilities p(D|H) of some data D given a hypothesis H, such as the
level of significance.
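The two kinds of probability are easy to confuse but not interchangeable. Bayes's rule yields the inverse probability

p(H|D) = p(D|H) p(H) / p(D),

which cannot be computed without the prior p(H), whereas a level of significance is a probability of the opposite kind, roughly p(D or more extreme data | H0), which requires no prior at all.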

Fisher was not satisfied with an approach to inductive inference based on
Bayes's rule. The use of Bayes's rule presupposes that a prior probability dis-
tribution over the set of possible hypotheses is available. For a frequentist,
such as Fisher, this prior distribution must theoretically be verifiable by actual
frequencies, that is, by sampling from its reference set. These cases are rare.
But if we are ignorant and have no a priori distributional information, then
every researcher can express that ignorance in different numbers leading, for
Fisher, to an unacceptable subjectivism. As we shall see, however, Fisher
wanted to both reject the Bayesian cake and eat it, too.

Fisher proposed several alternative tools for inductive inference. In The
Design of Experiments (1935), he started with null hypothesis testing, also
known as significance testing, and he gave that tool the most space in his book.
It eventually became the backbone of institutionalized statistics in psychology.
In a test of significance, one confronts a null hypothesis with observations, to
find out whether the observations deviate far enough from the null hypothesis
to conclude that the null is implausible. The specific techniques of null hy-
pothesis testing, such as the t test (devised by Cosset, using the pseudonym
"Student," in 1908) or the Ftest (Ffor Fisher, e.g., in analysis of variance) are
so widely used that they may be the lowest common denominator of what
psychologists today do and know.

The topic of this chapter is the logic of inference rather than specific tech-
niques. Just as with Bayes's rule, the problems we encounter do not concern
the formula—the rule is a simple consequence of the definition of conditional
probability. The problems arise with its application to inductive inference in
science. To what aspect of inductive inference does a particular algorithm, or
technique, refer? What do the calculations mean? These are questions that
pertain to what I call the logic of inference.

Concerning my account of Fisher's logic of significance testing, one thing
must be said in advance: Fisher's writings and polemics had a remarkably
elusive quality, and people have read his work quite differently. During
Fisher's long and acrimonious controversy with Neyman and Pearson, which
lasted from the 1930s to his death in 1962, he changed, and sometimes even
reversed, parts of his logic of inference. Thus the following brief account of
Fisher's logic of inference represents one possible reading (for a more detailed
analysis, see Gigerenzer et al., 1989, chap. 3).


How Do We Determine the Level of Significance?

In The Design, Fisher suggested that we think of the level of significance as a
convention: "It is usual and convenient for experimenters to take 5 per cent
as a standard level of significance, in the sense that they are prepared to ignore
all results which fail to reach this standard" (1935, p. 13). Fisher's assertion
that 5% (in some cases, 1%) is a convention that is adopted by all experi-
menters and in all experiments and that nonsignificant results are to be ignored
became part of the institutionalized hybrid logic.

But Fisher had second thoughts, which he stated most clearly in the mid-
1950s. These did not become part of the hybrid logic. One of the reasons for
that revision was his controversy with Neyman and Pearson and Neyman's
(e.g., 1950) insistence that one has to specify the level of significance (which
is denoted as a in Neyman-Pearson theory) before the experiment, in order to
be able to interpret it as a long-run frequency of error. Neyman and Pearson
took the frequentist position more seriously than Fisher. They argued that the
meaning of a level of significance such as 5% is the following: If the null
hypothesis is correct and if the experiment is repeated many times, then the
experimenter will wrongly reject the null in 5% of the cases. To reject the null
if it is correct is called an error of the first kind (Type I error) in Neyman-
Pearson theory, and its probability is called alpha (a). In his last book, Statis-
tical Methods and Scientific Inference (1956), Fisher ridiculed this definition
as "absurdly academic, for in fact no scientific worker has a fixed level of
significance at which from year to year, and in all circumstances, he rejects
hypotheses; he rather gives his mind to each particular case in the light of his
evidence and his ideas" (p. 42). Fisher rejected the Neyman-Pearson logic of
repeated experiments (repeated random sampling from the same population)
and thereby rejected his earlier proposal to have a conventional standard level
of significance, such as .05 or .01. What researchers should do, according to
Fisher's second thoughts, is to publish the exact level of significance, say, p = .03 (not p < .05), and communicate this result to their fellow research workers.
This means that the level of significance is determined after the experiment,
not, as Neyman and Pearson proposed, before the experiment.
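Neyman's long-run reading can be made concrete with a small simulation. The following Python sketch (the sample size of 30 and the 10,000 repetitions are arbitrary illustrative choices) draws data from a world in which the null hypothesis is true and counts how often a t test at the 5% level rejects it:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_experiments, n_per_experiment = 0.05, 10_000, 30

false_rejections = 0
for _ in range(n_experiments):
    # The null hypothesis is true: the population mean really is zero.
    sample = rng.normal(loc=0.0, scale=1.0, size=n_per_experiment)
    _, p_value = stats.ttest_1samp(sample, popmean=0.0)
    if p_value < alpha:
        false_rejections += 1

print(false_rejections / n_experiments)   # close to 0.05 in the long run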

Thus the phrase "level of significance" has three meanings: (a) the standard
level of significance, a conventional standard for all researchers (early Fisher),
(b) the exact level of significance, a communication to research fellows, deter-
mined after the experiment (late Fisher), and (c) the alpha level, the relative
frequency of Type I errors in the long run, to be decided on using cost-benefit
considerations before the experiment (Neyman & Pearson). The basic differ-
ence is this: For Fisher, the exact level of significance is a property of the data
(i.e., a relation between a body of data and a theory); for Neyman and Pearson,
alpha is a property of the test, not of the data. Level of significance and alpha
are not the same thing.

Neyman and Pearson thought their straightforward long-run frequentist in-
terpretation of the significance test—and the associated concepts of power and
of stating two statistical hypotheses (rather than only one, the null)—would be an improvement on Fisher's theory and make it more consistent. Fisher
disagreed. Whereas Neyman and Pearson thought of mathematical and con-
ceptual consistency, Fisher thought of ideological differences. He accused Ney-
man, Pearson, and their followers of confusing technology with knowledge:
Their focus on Type I and Type II errors, on cost-benefit considerations that
determine the balance between the two, and on repeated sampling from the
same population has little to do with scientific practice, but it is characteristic
for quality control and acceptance procedures in manufacturing. Fisher (1955,
p. 70) compared the Neyman-Pearsonians to the Soviets, their five-year plans,
and their ideal that "pure science can and should be geared to technological
performance." He also compared them to Americans, who confuse the process
of attaining knowledge with speeding up production or saving money. (Inci-
dentally, Neyman was born in Russia and went to Berkeley, California, after
Fisher made it difficult for him to stay on at University College in London.)

What Does a Significant Result Mean?

The basic differences are these: Fisher attached an epistemic interpretation to
a significant result, which referred to a particular experiment. Neyman rejected
this view as inconsistent and attached a behavioral meaning to a significant
result that did not refer to a particular experiment but to repeated experiments.
(Pearson found himself somewhere in between.)

In The Design, Fisher talked about how to "disprove" a null hypothesis
(e.g., 1935, pp. 16-17). Whatever phraseology he used, he always held that a
significant result affects our confidence or degree of belief that the null hy-
pothesis is false. This is what I refer to as an epistemic interpretation: Signif-
icance tells us about the truth or falsehood of a particular hypothesis in a
particular experiment. Here we see very clearly Fisher's quasi-Bayesian view
that the exact level of significance somehow measures the confidence we
should have that the null hypothesis is false. But from a more consistent fre-
quentist viewpoint, as expressed by Neyman, a level of significance does not
tell us anything about the truth of a particular hypothesis; it states the relative
frequency of Type I errors in the long run.

Neyman (1957, 1977) called his frequentist interpretation behavioristic: To
accept or reject a hypothesis is a decision to take a particular action. Imagine
a typical application of Neyman-Pearson theory: quality control. Imagine you
have chosen the probability of Type I errors (false alarms) to be .10 and that
of Type II errors (misses) to be .01, because misses are much more costly to
your firm than false alarms. Every day you take a random sample from the
firm's production. Even if the production is normal, you will expect a signifi-
cant result (false alarm) on 10% of all days. Therefore, if a significant result
occurs, you will act as if the null hypothesis were false, that is, stop production
and check for a malfunction; but you will not necessarily believe that it is
false—because you expect a lot of false alarms in the long run.
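
A brief simulation of this scenario (a sketch; the daily sample size and the
production parameters are my invented assumptions, and only the .10
false-alarm rate is taken from the example) shows the behavioral reading at
work:

```python
# Sketch of the quality-control example: alpha = .10 chosen on cost grounds.
# Sample size and production parameters are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, days, n = 0.10, 10_000, 50

stops = 0
for _ in range(days):
    # Production is in fact normal every day: target weight 100 units.
    sample = rng.normal(loc=100.0, scale=2.0, size=n)
    _, p = stats.ttest_1samp(sample, popmean=100.0)
    if p < alpha:
        # Act as if the null were false: stop and check for a malfunction,
        # without believing that this particular alarm is genuine.
        stops += 1

print(stops / days)  # roughly 0.10: a false alarm on about 10% of days
```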

Fisher rejected Neyman's arguments for "inductive behavior" as "childish"
(1955, p. 75), stemming from "mathematicians without personal contact with
the Natural Sciences" (p. 69). And he maintained his epistemic view: "From
a test of significance . . . we have a genuine measure of the confidence with
which any particular opinion may be held, in view of our particular data"
(p. 74). For all his anti-Bayesian talk, Fisher adopted a very similar-sounding
line of argument (Johnstone, 1987).

Does "Significant" Imply That There Is a Causal Effect?

Of course not. It is useful to distinguish between the statistical null hypothesis
and the substantive null hypothesis.2 Only the latter refers to the absence of a
particular cause. What is rejected in significance testing is the statistical hy-
pothesis, not the existence or absence of a cause. But in Fisher's writings we
can read both "yes" and "no" as answers to the aforementioned question.
Sometimes Fisher formulated the null hypothesis as "the treatment has no
effect, period," whereas in other places he formulated it as a statistical null
hypothesis (see Gigerenzer et al., 1989, pp. 95-97). In the famous Tea-Tasting
Experiment (Fisher, 1935), for instance, he stated clearly that we cannot con-
clude from a significant result (disproving the null) that the opposite hypoth-
esis (which is not formulated as an exact statistical hypothesis in null hypoth-
esis testing) is proven. (This experiment was designed to test a lady's claim
that she could tell whether the milk or the tea infusion was first added to a
cup.) That is, we cannot infer the existence of a causal process from a signif-
icant result—here, that the lady can discriminate between whether the milk
or the tea infusion was first added to the cup. For instance, there exist other
causal mechanisms (someone told the lady in which cups the tea infusion had
been poured first) that are consistent with rejecting the null hypothesis.
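
For concreteness, the null distribution in the Tea-Tasting Experiment can be
written down directly. The sketch below assumes Fisher's original design of
eight cups, four with milk poured first, with the lady informed that there
are four of each kind:

```python
# Sketch of the Tea-Tasting null distribution under Fisher's design
# (8 cups, 4 milk-first; the lady knows there are 4 of each kind).
from math import comb
from scipy.stats import hypergeom

# Under the null hypothesis of pure guessing, the number of milk-first cups
# she identifies correctly follows a hypergeometric distribution.
rv = hypergeom(M=8, n=4, N=4)

p_all_correct = rv.pmf(4)             # P(all four correct)
print(p_all_correct, 1 / comb(8, 4))  # both ~0.014, i.e., 1/70

# A perfect score is thus significant at the 5% level, but rejecting the
# null of guessing does not establish the causal claim that she can taste
# the difference; other mechanisms (someone told her the order) would
# produce the same result.
```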

What Does a Nonsignificant Result Mean?

In The Design, Fisher (1935) proposed asymmetry: A null hypothesis can be
disproved, but "never proved or established" (p. 16), so "experimenters . . . are
prepared to ignore all [nonsignificant] results" (p. 13). This has been under-
stood by many textbook writers as saying that no conclusions can be drawn
from a nonsignificant result. And several textbook authors laid down the com-
mandment that I was taught: "Thou shalt not draw inferences from a nonsig-
nificant result." This made nonsignificance appear a negative, worthless, and
disappointing result. In Neyman-Pearson theory, in contrast, there is symme-
try, and a conclusion is drawn from nonsignificance: Act as if the null hy-
pothesis were true. The reason is that Neyman and Pearson start with a dis-
junction of two symmetric hypotheses (either H0 or H1 is true) and proceed by
induction through elimination.

2. On the distinction between statistical and substantive hypotheses, see Hager and
Westermann (1983) and Meehl (1978).

Fisher (1955) again had second thoughts: "It is a fallacy . . . to conclude
from a test of significance that the null hypothesis is thereby established; at
most it may be said to be confirmed or strengthened" (p. 73). Thus, although
nonsignificant results cannot establish null hypotheses, according to his sec-
ond thoughts, we can do more than just "ignore" them: We may say that a
nonsignificant result "confirms" but does not "establish" the null hypothesis.
Now Fisher suggested that a nonsignificant result might indeed support the
null hypothesis, but he did not explain how.

Power

In null hypothesis testing, only one kind of error is defined: rejecting the null
hypothesis when it is in fact true. In their attempt to supply a logical basis for
Fisher's ideas and make them consistent, Neyman and Pearson replaced
Fisher's single null hypothesis by a set of rival hypotheses. In the simplest
case, two hypotheses, H0 and H1, are specified, and it is assumed that one of
them is true. This assumption allows us to determine the probability of both
Type I errors and Type II errors, indicated in Neyman-Pearson theory by α
and β, respectively. If H1 is rejected although H1 is true, a Type II error has
occurred. α is also called the size of a test, and 1 - β is called its power. The
power of a test is the long-run frequency of accepting H1 if it is true. The
concept of power makes explicit what Fisher referred to as "sensitivity."
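
In symbols, both error probabilities follow once the two hypotheses, the
test, and the sample size are fixed. Here is a minimal numeric sketch (my
numbers, chosen purely for illustration) for a one-sided test between two
simple hypotheses about a mean:

```python
# Sketch: alpha, beta, and power for H0: mu = 0 vs. H1: mu = 1,
# known sigma = 2, sample size n = 25 (all numbers hypothetical).
import math
from scipy.stats import norm

mu0, mu1, sigma, n, alpha = 0.0, 1.0, 2.0, 25, 0.05
se = sigma / math.sqrt(n)                      # standard error of the mean

crit = norm.ppf(1 - alpha, loc=mu0, scale=se)  # reject H0 if mean > crit
beta = norm.cdf(crit, loc=mu1, scale=se)       # P(accept H0 | H1 true)
power = 1 - beta                               # P(reject H0 | H1 true)

print(f"critical mean = {crit:.3f}")              # ~0.658
print(f"beta = {beta:.3f}, power = {power:.3f}")  # beta ~0.20, power ~0.80
```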

Fisher (1935) pointed out two ways to make an experiment more sensitive:
by enlarging the number of repetitions and by qualitative methods, such as
experimental refinements that minimize the error in the measurements
(pp. 21-25). Nevertheless, he rejected the concept of Type II error and calcu-
lations of power on the grounds that they are inappropriate for scientific in-
duction. In his view, calculations of power, although they look harmless, reflect
the "mental confusion" between technology and scientific inference (Fisher,
1955, p. 73). If someone designs a test for acceptance procedures in quality
control, the goal of which is to minimize costs due to decision errors, calcu-
lations of power based on cost-benefit considerations in situations of repetitive
tests are quite appropriate. But scientific inference and discovery, in Fisher's
view, are about gaining knowledge, not saving money.
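
Both of Fisher's routes to a more sensitive experiment can be read off such
a calculation. Continuing the sketch above (same illustrative numbers),
power rises with more repetitions and with smaller measurement error:

```python
# Sketch: power as a function of repetitions (n) and measurement error
# (sigma), for the same illustrative hypotheses as above.
import math
from scipy.stats import norm

def power(n, sigma, mu0=0.0, mu1=1.0, alpha=0.05):
    se = sigma / math.sqrt(n)
    crit = norm.ppf(1 - alpha, loc=mu0, scale=se)
    return 1 - norm.cdf(crit, loc=mu1, scale=se)

print(power(n=25, sigma=2.0))   # baseline: ~0.80
print(power(n=100, sigma=2.0))  # more repetitions: ~0.999
print(power(n=25, sigma=1.0))   # experimental refinement: ~0.999
```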

Fisher always rejected the concept of power. Neyman, for his part, pointed
out that some of Fisher's tests "are in a mathematical sense 'worse than use-
less,' " because their power is less than their size (see Hacking, 1965, p. 99).
Even in the Tea-Tasting Experiment, used by Fisher to introduce the logic of
null hypothesis testing in The Design, the power is only a little higher than
the level of significance (.05) or cannot be calculated at all, depending on the
conditions (see Neyman, 1950).

Random Sampling from Known Populations?

Acceptance procedures involve random sampling from a known population
(say, a firm's daily production). They also allow for repeated random sampling
(every day a random sample may be taken). Recall that Neyman and Pearson
based their theory on the concept of repeated random sampling, which defined
the probability of Type I and Type II errors as long-run frequencies of wrong
decisions in repeated experiments.

Fisher, in contrast, held that in scientific applications there is no known
population from which repeated sampling can be done. There are always many
populations to which a sample may belong. "The phrase 'repeated sampling
from the same population' does not enable us to determine which population
is to be used to define the probability level, for no one of them has objective
reality, all being products of the statistician's imagination" (Fisher, 1955,
p. 71). Fisher proposed to view any sample (such as the sample of participants
in a typical psychological experiment, which is not drawn randomly from a
known population) as a random sample from an unknown hypothetical infinite
population. "The postulate of randomness thus resolves into the question, 'Of
what population is this a random sample?' which must frequently be asked by
every practical statistician" (Fisher, 1922, p. 313). But how can the practical
statistician find out? The concept of an unknown hypothetical infinite popu-
lation has puzzled many: "This is, to me at all events, a most baffling concep-
tion" (Kendall, 1943, p. 17).

Mechanical Scientific Inference

One way of reading The Design suggests that null hypothesis testing is a fairly
mechanical procedure: Set up a null hypothesis, use a conventional level of
significance, calculate a test statistic, and disprove the null hypothesis, if you
can. Fisher later made clear that he did not mean it to be so. For instance, he
pointed out that the choice of the test statistic and deciding which null hy-
potheses are worth testing cannot be reduced to a mechanical process. You
need constructive imagination and much knowledge based on experience
(Fisher, 1933). Statistical inference has two components: informed judgment
and mathematical rigor.

Similarly, Neyman and Pearson always emphasized that the statistical part
has to be supplemented by a subjective part. As Pearson (1962) put it: "We left
in our mathematical model a gap for the exercise of a more intuitive process
of personal judgment in such matters—to use our terminology—as the choice
of the most likely class of admissible hypotheses, the appropriate significance
level, the magnitude of worthwhile effects and the balance of utilities"
(pp. 395-396).

In Neyman and Pearson's theory, once all judgments are made, the decision
(reject or accept) results mechanically from the mathematics. In his later writ-
ings, Fisher opposed these mechanical accept/reject decisions, which he be-
lieved to be inadequate in science, in which one looks forward to further data.
Science is concerned with the communication of information, such as exact
levels of significance. Again, Fisher saw a broader context, the freedom of the
Western world. Communication of information (but not mechanical decisions)
recognizes "the right of other free minds to utilize them in making their own
decisions" (Fisher, 1955, p. 77).

But Neyman reproached Fisher with the same sin—mechanical statistical
inference. As a statistical behaviorist, Neyman (1957) looked at what Fisher
actually did in his own research in genetics, biology, and agriculture, rather
than at what he said one should do. He found Fisher using .01 as a conven-
tional level of significance, without giving any thought to the choice of a par-
ticular level dependent on the particular problem or the probability of an error
of the second kind; he accused Fisher of drawing mechanical conclusions,
depending on whether or not the result was significant. Neyman urged a
thoughtful choice of the level of significance, not using .01 for all problems
and contexts.

Both camps in the controversy accused the other party of mechanical,
thoughtless statistical inference; thus I conclude that here at least they
agreed—statistical inference should not be automatic.

These differences between what Fisher proposed as the logic of significance
testing and what Neyman and Pearson proposed as the logic of hypothesis
testing suffice for the purpose of this chapter. Both have developed further
tools for inductive inference, and so have others, resulting in a large toolbox
that contains maximum likelihood, fiducial probability, confidence interval ap-
proaches, point estimation, Bayesian statistics, sequential analysis, and ex-
ploratory data analysis, to mention only a few. But it is null hypothesis testing
and Neyman-Pearson hypothesis-testing theory that have transformed exper-
imental psychology and part of the social sciences.

The Offspring: Hybrid Logic

The conflicting views presented earlier are those of the parents of the hybrid
logic. Not everyone can tolerate unresolved conflicts easily and engage in a
free market of competing ideas. Some long for the single truth or search for a
compromise that could at least suppress the conflicts. Kendall (1949) com-
mented on the desire for peace negotiations among statisticians:

If some people asserted that the earth rotated from east to west and others
that it rotated from west to east, there would always be a few well-
meaning citizens to suggest that perhaps there was something to be said
for both sides, and maybe it did a little of one and a little of the other;
or that the truth probably lay between the extremes and perhaps it did
not rotate at all. (p. 115)

The denial of the existing conflicts and the pretense that there is only one
statistical solution to inductive inference were carried to an extreme in psy-
chology and several neighboring sciences. This one solution was the hybrid
logic of scientific inference, the offspring of the shotgun marriage between
Fisher and Neyman and Pearson. The hybrid logic became institutionalized in
experimental psychology (see Gigerenzer, 1987a), personality research (see
Schwartz & Dangleish, 1982), clinical psychology and psychiatry (see Meehl,
1978), education (see Carver, 1978), quantitative sociology (see Morrison &
Henkel, 1970), and archaeology (see Cowgill, 1977; Thomas, 1978), among oth-
ers. Nothing like this happened in physics, chemistry, or molecular biology.

The Hybrid Logic Is Born

Before World War II, psychologists drew their inferences about the validity of
hypotheses by many means—ranging from eyeballing to critical ratios. The
issue of statistical inference was not of primary importance. Note that this was
not because techniques were not yet available. On the contrary: As early as
1710, John Arbuthnot proved the existence of God by a kind of significance
test; astronomers had used such tests during the 19th century for rejecting
outliers (Swijtink, 1987); and Fechner (1897) wrote a book on statistics that
included inference techniques, to give just a few examples. Techniques of
statistical inference were known and sometimes used, but experimental method was not
yet dominated by and almost equated with statistical inference.

Through the work of the statisticians George W. Snedecor at Iowa State
College, Harold Hotelling at Columbia University, and Palmer Johnson at the
University of Minnesota, Fisher's ideas spread in the United States. Psychol-
ogists began to cleanse the Fisherian message of its agricultural odor and its
mathematical complexity and to write a new genre of textbooks featuring null
hypothesis testing. Guilford's Fundamental Statistics in Psychology and Edu-
cation, first published in 1942, was probably the most widely read textbook
in the 1940s and 1950s. In the preface, Guilford credited Fisher for the logic
of hypothesis testing taught in a chapter that was "quite new to this type of
text" (p. viii). The book does not mention Neyman, E. S. Pearson, or Bayes.
What Guilford teaches as the logic of hypothesis testing is Fisher's null hy-
pothesis testing, deeply colored by "Bayesian" thinking: Null hypothesis test-
ing is about the probability that the null hypothesis is true. "If the result comes
out one way, the hypothesis is probably correct, if it comes out another way,
the hypothesis is probably wrong" (p. 156). Null hypothesis testing is said to
give degrees of doubt such as "probable" or "very likely" a "more exact mean-
ing" (p. 156). Its logic is explained via surprising headings such as "Probability
of hypotheses estimated from the normal curve" (p. 160).

Guilford's logic is not consistently Fisherian, nor does it consistently use
"Bayesian" language of probabilities of hypotheses. It wavers back and forth
and beyond. Phrases such as "we obtained directly the probabilities that the
null hypothesis was plausible" and "the probability of extreme deviations from
chance" are used interchangeably for the same thing: the level of significance.
And when he proposed his own "somewhat new terms," his intuitive Bayesian
thinking became crystal clear. A p value of .015 for a hypothesis of zero
difference in the population "gives us the probability that the true difference
is a negative one, and the remainder of the area below the point, or .985, gives
us the probability that the true difference is positive. The odds are therefore
.985 to .015 that the true difference is positive" (p. 166). In Guilford's hands,
p values that specify probabilities p(D|H) of some data (or test statistic) D
given a hypothesis H turn miraculously into Bayesian posterior probabilities
p(H|D) of a hypothesis given data.
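
The conflation can be made numerically explicit. A small Bayesian sketch
(all numbers hypothetical) shows how far p(D|H) and p(H|D) can come apart:

```python
# Sketch: p(D|H) is not p(H|D). All numbers are hypothetical.
p_H0 = 0.9           # prior probability of the null hypothesis
p_D_given_H0 = 0.03  # the "p value": probability of data this extreme
                     # (or more so) if the null is true
p_D_given_H1 = 0.30  # probability of such data under the rival hypothesis

# Bayes's rule yields the posterior that Guilford's language implies:
p_D = p_H0 * p_D_given_H0 + (1 - p_H0) * p_D_given_H1
p_H0_given_D = p_H0 * p_D_given_H0 / p_D

print(f"p(D|H0) = {p_D_given_H0:.2f}")  # 0.03
print(f"p(H0|D) = {p_H0_given_D:.2f}")  # ~0.47
# Reading the level of significance (.03) as the probability that the null
# is true (~.47 here) gets the numbers badly wrong.
```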

Guilford's confusion is not an exception. It marks the beginning of a genre
of statistical texts that vacillate between the researcher's "Bayesian" desire for
probabilities of hypotheses and what Fisher is willing to give them.

This first phase of teaching Fisher's logic soon ran into a serious compli-
cation. In the 1950s and 1960s, the theory of Neyman and E. S. Pearson also
became known. How were the textbook writers to cope with two logics of
scientific inference? How should the ideological differences and personal in-
sults be dealt with? Their solution to this conflict was striking. The textbook
writers did not side with Fisher. That is, they did not go on to present null
hypothesis testing as scientific inference and add a chapter on hypothesis test-
ing outside science, introducing the Neyman-Pearson theory as a logic for
quality control and related technological problems. Nor did they side with
Neyman and Pearson, teaching their logic as a consistent and improved version
of Fisher's and dispensing entirely with Fisherian null hypothesis testing.

Instead, textbook writers started to add Neyman-Pearsonian concepts on
top of the skeleton of Fisher's logic. But acting as if they feared Fisher's re-
venge, they did it without mentioning the names of Neyman and Pearson. A
hybrid logic of statistical inference was created in the 1950s and 1960s. Neither
Fisher nor Neyman and Pearson would have accepted this hybrid as a theory
of statistical inference. The hybrid logic is inconsistent from both perspectives
and burdened with conceptual confusion. Its two most striking features are (a)
it hides its hybrid origin and (b) it is presented as the monolithic logic of
scientific inference. Silence about its origin means that the respective parts of
the logic are not identified as part of two competing and partly inconsistent
theoretical frameworks. For instance, the idea of testing null hypotheses with-
out specifying alternative hypotheses is not identified as part of the Fisherian
framework, and the definition of the level of significance and the power of a
test as long-run frequencies of false and correct decisions, respectively, in re-
peated experiments is not identified as part of the Neyman-Pearson frame-
work. And, as a consequence, there is no mention of the fact that each of these
parts of the hybrid logic were rejected by the other party, and why, and what
the unresolved controversial issues are.

The Structure of Hybrid Logic

To capture the emotional tensions associated with the hybrid logic, I use a
Freudian analogy.3

3. Here I am elaborating on a metaphor suggested by Acree (1978). In a different
context, Devereux (1967) talked about the relation between anxiety and elimination of
subjectivity by method.

The Neyman-Pearson logic of hypothesis testing functions as the Superego
of the hybrid logic. It demands the specification of precise alternative hypoth-
eses, significance levels, and power in advance to calculate the sample size
necessary, and it teaches the doctrine of repeated random sampling. The fre-
quentist Superego forbids epistemic statements about particular outcomes or
intervals, and it outlaws the interpretation of levels of significance as the de-
gree of confidence that a particular hypothesis is true or false.

The Fisherian theory of significance testing functions as the Ego. The Ego
gets things done in the laboratory and gets papers published. The Ego deter-
mines the level of significance after the experiment, and it neither specifies
power nor calculates the sample size necessary. The Ego avoids precise pre-
dictions from its research hypothesis; that is, it does not specify the exact
predictions of the alternative hypothesis but claims support for it by rejecting
a null hypothesis. The Ego makes abundant epistemic statements about par-
ticular results. But it is left with feelings of guilt and shame for having violated
the rules.

Censored by both the frequentist Superego and the pragmatic Ego are state-
ments about probabilities of hypotheses given data. These form the Bayesian
Id of the hybrid logic. Some direct measure of the validity of the hypotheses
under question—quantitatively or qualitatively—is, after all, what researchers
really want.

The Freudian metaphor suggests that the resulting conceptual confusion in
the minds of researchers, editors, and textbook writers is not due to limited
intelligence. The metaphor brings the anxiety and guilt, the compulsive and
ritualistic behavior, and the dogmatic blindness associated with the hybrid
logic into the foreground. It is as if the raging personal and intellectual conflicts
between Fisher and Neyman and Pearson and between these frequentists and
the Bayesians were projected onto an "intrapsychic" conflict in the minds of
researchers. And the attempts of textbook writers to solve this conflict by de-
nying it have produced remarkable emotional, behavioral, and cognitive dis-
tortions.

Anxiety and Guilt

Editors and textbook writers alike have institutionalized the level of signifi-
cance as a measure of the quality of research. As mentioned earlier, Melton,
after 12 years editing one of the most prestigious journals in psychology, said
in print that he was reluctant to publish research with significance levels below
.05 but above .01, whereas p < .01 made him confident that the results would
be repeatable and deserved publication (1962, pp. 553-554). In Nunnally's In-
troduction to Statistics for Psychology and Education (1975), the student is
taught similar values and informed that the standard has been raised: "Up until
20 years ago, it was not uncommon to see major research reports in which
most of the differences were significant only at the 0.05 level. Now, such re-
sults are not taken very seriously, and it is more customary today to see results
reported only if they reach the 0.01 or even lower probability levels" (p. 195).
Not accidentally, both Melton and Nunnally show the same weak understand-
ing of the logic of inference and share the same erroneous belief that the level
of significance specifies the probability that a result can be replicated (dis-
cussed later). The believers in the divinatory power of the level of significance
set the standards.
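
The replication belief can itself be put to a test. In the following
simulation sketch (built on an assumption made purely for illustration: that
the true effect equals the effect observed in the original "significant"
study), a result significant at roughly the 1% level is replicated far less
often than 99% of the time:

```python
# Sketch of the replication fallacy. Assumption (for illustration only):
# the true standardized effect equals the one observed, d = 0.5 with n = 30,
# which yields p ~ .01 in the original study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_effect, n, alpha, runs = 0.5, 30, 0.05, 20_000

significant = 0
for _ in range(runs):  # exact replications of the original study
    sample = rng.normal(loc=true_effect, scale=1.0, size=n)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    if p < alpha:
        significant += 1

# The replication rate equals the power at the true effect: roughly 0.75,
# not the 0.95 or 0.99 suggested by reading p as a replication probability.
print(significant / runs)
```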

The researcher's Ego knows that these publish-or-perish standards exist in
the outside world and knows that the best way to adapt is to round up the
obtained p value after the experiment to the nearest conventional level, say to
round up the value p = .006 and publish p < .01. But the Superego has higher
moral standards: If you set alpha to 5% before the experiment, then you must
report the same finding (p = .006) as "significant at the 5% level." Mostly, the
Ego gets its way but is left with feelings of dishonesty and of guilt at having
violated the rules. Conscientious experimenters have experienced these feel-
ings, and statisticians have taken notice. The following comment was made in
a panel discussion among statisticians; Savage remarked on the statisticians'
reluctance to take responsibility for once having built up the Superego in the
minds of the experimenters:

I don't imagine that anyone in this room will admit ever having taught
that the way to do an experiment is first carefully to record the signifi-
cance level then do the experiment, see if the significance level is at-
tained, and if so, publish, and otherwise, perish. Yet, at one time we
must have taught that; at any rate it has been extremely well learned in
some quarters. And there is many a course outside of statistics depart-
ments today where the modern statistics of twenty or thirty years ago is
taught in that rigid way. People think that's what they're supposed to do
and are horribly embarrassed if they do something else, such as do the
experiment, see what significance level would have been attained, and
let other people know it. They do the better thing out of their good in-
stincts, but think they're sinning. (Barnard, Kiefer, LeCam, & Savage,
1968, p. 147)

Statistics has become more tolerant than its offspring, the hybrid logic.

Denial of the Parents

The hybrid logic attempts to solve the conflict between its parents by denying
its parents. It is remarkable that textbooks typically teach hybrid logic without
mentioning Neyman, E. S. Pearson, and Fisher—except in the context of tech-
nical details, such as specific tables, that are incidental to the logic. In 25 out
of 30 textbooks I have examined, Neyman and E. S. Pearson do not appear to
exist. For instance, in the introduction to his Statistical Principles of Experi-
mental Design (1971), Winer credits Fisher with inspiring the "standard work-
ing equipment" (p. 3) in this field, but a few pages later he presents the Ney-
man-Pearson terminology of Type I error, Type II error, power, two precise
statistical hypotheses, cost-benefit considerations, and rejecting and accepting
hypotheses. Yet nowhere in the book do the names of Neyman and E. S. Pear-
son appear (except in a "thank you" note to Pearson for permission to repro-
duce tables), although quite a few other names can be found in the index. No
hint is given to the reader that there are different ways to think about the logic
of inference. Even in the exceptional case of Hays's textbook (1963), in which
all parents are mentioned by their names, the relationship of their ideas is
presented (in a single sentence) as one of cumulative progress, from Fisher to
Neyman and Pearson (p. 287).4 Both Winer's and Hays's are among the best
texts, without the confusions that abound in Guilford's, Nunnally's, and a mass
of other textbooks. Nevertheless, even in these texts the parents' different ways
of thinking about statistical inference and the controversial issues are not
pointed out.

Denial of Conflicts between Parents

Thus the conflicting views are almost unknown to psychologists. Textbooks
are uniformly silent. (Some statistics teachers protest that airing these disputes
would only confuse students. I believe that pointing out the conflicting views
would make statistics much more interesting to students who enjoy thinking
rather than being told what to do next.) As a result of this silence, many a text
muddles through the conflicting issues leaving confusion and inconsistency in
its wake—at least, among the more intelligent and alert students. For instance,
Type I and Type II errors are often defined in terms of long-run frequencies of
erroneous decisions in repeated experiments, but the texts typically stop short
of Neyman's behavioral interpretation and fall back to epistemic interpreta-
tions of the two errors as levels of confidence about the validity of the hy-
potheses. In fact, the poorer texts overflow with amazing linguistic contortions
concerning what a level of significance means. For instance, within three pages
of text, Nunnally explained that "level of significance" means all of the fol-
lowing: (1) "If the probability is low, the null hypothesis is improbable"
(p. 194); (2) "the improbability of observed results being due to error" (p. 195);
(3) "the probability that an observed difference is real" (p. 195); (4) "the
statistical confidence . . . with odds of 95 out of 100 that the observed differ-
ence will hold up in investigations" (p. 195); (5) the degree to which experi-
mental results are taken "seriously" (p. 195); (6) "the danger of accepting a
statistical result as real when it is actually due only to error" (p. 195); (7) the
degree of "faith [that] can be placed in the reality of the finding" (p. 196); (8)
"the null hypothesis is rejected at the 0.05 level"; and (9) "the investigator can
have 95 percent confidence that the sample mean actually differs from the
population mean" (p. 196). And, after the last two versions, the author assured
his readers: "All of these are different ways to say the same thing" (Nunnally,
1975, p. 196).

Nunnally did not spell out the differences between the logics of Fisher,
Neyman and Pearson, and the Bayesians. He avoided the conflicting interpre-

4. In the third edition (1981), however, Hays's otherwise excellent text falls back to
common standards: J. Neyman and E. S. Pearson no longer appear in the book.

