The Algebra of Expectations 275

More generally, for random variables X1, X2, and so on up to Xn and constants
w1 up to wn,

E( Σ_{i=1}^n wi Xi ) = Σ_{i=1}^n wi E(Xi).

In words, the expected value of a weighted sum of random variables is the
weighted sum of their respective expected values. Because of this property,
E() is said to be a linear operator. Loosely speaking, the expectations oper-
ator can be carried through the expression. The square and natural log are
examples of nonlinear operators. After all, (2 + 3)^2 does not equal 2^2 + 3^2,
and ln(100 + 100) is not equivalent to ln(100) + ln(100).
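The linearity of E() is easy to check numerically. The sketch below (my own example, not from the text) compares the simulated expected value of a weighted sum of two independent dice with the weighted sum of their expected values; the weights are arbitrary:

```python
import random

# Hypothetical check of linearity: E(w1*X1 + w2*X2) = w1*E(X1) + w2*E(X2)
# for two independent six-sided dice with arbitrary weights.
random.seed(1)
w1, w2 = 2.0, 3.0
n = 200_000
draws = [w1 * random.randint(1, 6) + w2 * random.randint(1, 6) for _ in range(n)]
simulated = sum(draws) / n        # Monte Carlo estimate of the expected value
exact = (w1 + w2) * 3.5           # E(one die) = 3.5
```

Because E() is linear, no independence assumption is needed here; the expected values add regardless of how the two dice are related.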

We now turn to formulas for the variance of sums of random variables.
Although the formulas for the expected values listed above apply to all ran-
dom variables, independent or not, the variance depends on whether the
random variables are independent. If not, the formula for the variance is
more complicated. Here are facts about the variance and SD for sums and
weighted sums of independent random variables. We stress that these facts
apply only to independent random variables.

V (X + Y ) = V (X ) + V (Y )
V (a X + bY ) = a2V (X ) + b2V (Y ) .

More generally, for mutually independent random variables X1 to Xn and
constants w1 to wn,

V( Σ_{i=1}^n wi Xi ) = Σ_{i=1}^n wi^2 V(Xi)

and

SD( Σ_{i=1}^n wi Xi ) = √( Σ_{i=1}^n wi^2 V(Xi) ).
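A numeric sketch of the two-variable case (the weights and distributions are my own choices, not from the text): for independent X and Y, the sample variance of aX + bY should come close to a^2 V(X) + b^2 V(Y).

```python
import random
import statistics

# Hypothetical check: V(aX + bY) = a^2 V(X) + b^2 V(Y) for independent
# X, Y uniform on [0, 1], each with variance 1/12.
random.seed(2)
a, b = 2.0, -1.0
n = 200_000
xs = [random.random() for _ in range(n)]
ys = [random.random() for _ in range(n)]
sample_var = statistics.pvariance([a * x + b * y for x, y in zip(xs, ys)])
exact_var = a**2 * (1 / 12) + b**2 * (1 / 12)   # = 5/12
```

Note that b is negative here; the formula still works because the weights enter squared.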

We can apply these facts about the expected value of a weighted sum of
random variables and the variance of a weighted sum of random variables
to a situation we have already studied: repeated draws with replacement
from a box. Each draw from the box is the realization of a random variable.
Making the draws with replacement means that we are always drawing from
the same box, and so the random variables have the same distribution (in the
standard terminology, they are identically distributed). Because the draws are
with replacement and each time a draw is made every ticket is equally likely
to come out of the box, the draws are independent. In looking at the draws

from the box, we are considering either a sum (with weights wi all equal to
1) or an average (with weights wi all equal to 1/n) of random variables.

In this very important special case, the results (for the Sample Average)
translate to

E(Sample Average) = E( Σ_{i=1}^n (1/n) Xi )
                  = Σ_{i=1}^n (1/n) E(Xi)
                  = n × (Average of the Box) / n
                  = Average of the Box.

This result relies on the expected value of each and every draw from the box,
or E(Xi ) for the ith draw from the box, being the average of the contents of
the box. As for the SD of the average of independent draws from the same
box,

V( Σ_{i=1}^n (1/n) Xi ) = Σ_{i=1}^n (1/n^2) V(Xi)
                        = (1/n) V(Box)
                        = (1/n) SD(Box)^2.

The second line follows from the variances being all equal to each other
because the draws are from the same box. Taking the square root, we arrive
at

SD( Σ_{i=1}^n (1/n) Xi ) = SD(Box)/√n,

or, put in more familiar language, the SD of the Sample Average is the SD
of the box divided by the square root of the number of draws.
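This square-root law can be demonstrated by simulation. The sketch below (my own example, not from the workbook) repeatedly averages 25 draws with replacement from a box containing the tickets 1 through 5 and compares the SD of those sample averages with SD(Box)/√n:

```python
import random
import statistics

# Box of tickets {1, 2, 3, 4, 5}: SD(Box) = sqrt(2), about 1.414.
# With n = 25 draws, theory says SD(Sample Average) is about 1.414/5 = 0.283.
random.seed(3)
box = [1, 2, 3, 4, 5]
n, reps = 25, 50_000
averages = [statistics.mean(random.choices(box, k=n)) for _ in range(reps)]
theory = statistics.pstdev(box) / n ** 0.5
empirical = statistics.pstdev(averages)
```

Raising n from 25 to 100 cuts the theoretical SD in half, which is worth verifying by rerunning the sketch.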

Summary

This concludes our presentation of the expectations operator and the algebra
of expectations. We will occasionally use E() in the rest of this book, but our
primary mode of exposition will be Monte Carlo simulation. The algebra
of expectations is used to derive formulas or analytical expressions for the
expected value and SD of a data generation process. The Monte Carlo method
can demonstrate, but never prove, that a result is true.

10.8. Conclusion

This chapter has served as a quick review of statistical inference but also as an
introduction to one of the basic metaphors of this book, the box model. We use
box models to clarify our thinking about the data generation process. A great
danger in econometrics is the failure to explicitly model the way in which
the data were generated. Absent careful work at this initial, crucial stage,
subsequent analyses, no matter how much time they take or how sophisticated
they appear, are worthless.

This chapter has also developed an algebraic framework usable for working
out the expected values of sample statistics. This framework, called the alge-
bra of expectations, can be extremely useful. You should think of the algebra
of expectations as a complementary tool to the Monte Carlo simulation skills
you will develop through the rest of this book. You can use the algebra of
expectations to work out the expected value or SD of a statistic of interest
and Monte Carlo simulation to check your work. Alternatively, you can use
Monte Carlo simulation to clarify your view of the data generating process
and to suggest results that should be provable via the algebra of expectations.

With Chapter 9’s explanation of Monte Carlo simulation and this chapter’s
review of statistical inference, we are ready to embark on the study of regres-
sion for inference. We will develop a series of box models to firmly ground
the DGP that underlies the probabilistic interpretation of regression.

10.9. Exercises

Use the BoxModel.xls workbook to analyze the properties of a five-sided die. Unlike
a conventional six-sided die, the five-sided die has five faces, with 1, 2, 3, 4, and 5
dots. Suppose we throw it 25 times and average the 25 throws.

1. Draw the box model for this DGP.

2. Properly configure BoxModel.xls to represent this DGP. What does BoxModel.xls
display as the average and SD of the box?

3. What are the exact chances of getting an average of 3.6 or more? Describe your
procedure.

4. Does Monte Carlo simulation give similar results? Describe your procedure.

5. If the DGP changes so that we take 100 draws instead of 25 draws, what happens
to the chances of getting an average of 3.6 or more? Describe your procedure.
HINT: You can directly change the Setup sheet in BoxModel.xls.

6. Return to the Setup sheet and click the Draw a Sample from the Box button.
Suppose you did not know the contents of the box. With your sample, test the
claim that the average of the box is 3.6. Describe your procedure.

7. Open the Consistency.xls workbook. Is the Sample Average an unbiased estimator
of the Average of the Box? If not, is the Sample Average a consistent estimator of
the Average of the Box? Describe your procedure.

8. In a new workbook, we drew standard normal random variables (average 0 and
SD 1) in cells A1 and A2 and added them together in cell A3. Then we ran a
1,000-repetition Monte Carlo and got the results in Figure 10.9.1.

            Sheet1!$A$1    Sheet1!$A$3
Average        −0.008         0.021
SD              1.0092        1.3976
Max             3.619         4.225
Min            −4.046        −4.449

[Histogram of Sheet1!$A$1 and Sheet1!$A$3 over the range −4.5 to 3.5 omitted.]

Figure 10.9.1. Monte Carlo simulation of adding two normal RVs.

Use the algebra of expectations to show that the Monte Carlo results for the aver-
age and SD for cell A3, the sum of the two standard normally distributed random
variables, are reasonable.

References

As we have mentioned before, we borrowed the box model metaphor from an
excellent statistics textbook and its corresponding instructor’s manual:

Freedman, D., R. Pisani, and R. Purves (1998). Statistics, Third Edition. (New
York: W.W. Norton & Company).

Freedman, D., R. Pisani, and R. Purves (1998). Instructor’s Manual for Statistics,
Third Edition. (New York: W.W. Norton & Company).

The following two econometrics textbooks do a good job of covering statistical
inference:

Amemiya, T. (1994). Introduction to Statistics and Econometrics. Cambridge, MA:
Harvard University Press.

Goldberger, A. S. (1998). Introductory Econometrics. Cambridge, MA: Harvard
University.

Appendix: The Normal Approximation

The normal approximation is used to compute an estimate of a particular area
under distributions which more or less resemble the normal curve. For example,
many biological characteristics like height and weight have histograms for the
general population which look a lot like the normal curve. The method can thus be
used to find quick, roughly accurate answers to questions about how many people
fall into certain height or weight categories.

Center (Average or EV): 202           Upper cutoff in SUs: −1.024
Spread (SD or SE): 41                 Lower cutoff in SUs: Minus Infinity
Upper Cutoff in actual units: 160
Lower Cutoff in actual units: (none)
Area: 15.28%

[Standard normal curve over the range −3 to 3 omitted.]

Figure 10.A.1. Applying the Normal Approximation.
Source: [BoxModel.xls]NormalApprox.

Consider the following example: one wishes to know the fraction of the adult
male U.S. population which has a serum total cholesterol level below 160
milligrams per deciliter of blood. Cholesterol levels are pretty much normally
distributed, so the normal approximation should be reasonably accurate.

Here is the procedure for using the normal approximation: First, one identifies
the interval of interest. Second, one converts the interval in question into standard
units. Standard units measure how far a particular value is from the average in
terms of standard deviations. The mean total cholesterol level for adult males is 202
and the SD is 41.⁸ Thus, a male with an actual level of 243 has a cholesterol level of
1 in terms of standard units, while a male whose cholesterol level is −2 in standard
units has an actual level of 120. Finally, one finds the area under the standard
normal curve (with mean 0 and SD 1) for the interval in question. This is the
approximate fraction of the population which falls into that interval.

Let us apply the normal approximation to this example. A serum total cholesterol
level of 160 converts to about –1 in terms of standard units. The area beneath the
normal curve from negative infinity to –1 is about 16 percent. The National Center
for Health Statistics reports that a level of 160 was in fact the 15th percentile for the
sample. The normal approximation works quite well in this particular case.
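The three-step procedure can be carried out directly. The sketch below redoes the cholesterol example in Python; normal_cdf is a helper built from the standard error function, not part of the book's workbooks:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve from minus infinity to z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Step 1: the interval of interest is everything below 160 mg/dl.
mean, sd, cutoff = 202, 41, 160
z = (cutoff - mean) / sd          # step 2: convert to standard units, about -1.02
area = normal_cdf(z)              # step 3: area below the cutoff, about 0.153
```

The computed area of roughly 15.3 percent matches the 15.28 percent shown in Figure 10.A.1 and is close to the reported 15th percentile.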

In this book, the normal approximation proves useful in answering questions
about sampling distributions. The central limit theorem says that the sampling
distribution (or probability histogram) for many sample statistics approaches the
normal curve as the sample size increases. The normal approximation is used to

8 These are figures derived from a survey conducted between 1988 and 1994 of 6587 males, ages 20 to 74
years old by the National Center for Health Statistics. The survey is the National Health and Nutrition
Examination Survey. See www.cdc.gov/nchs/about/major/nhanes/datatblelink.htm.

estimate P-values for observed sample statistics. The P-value is the area under the
sampling distribution for the null hypothesis corresponding to results as extreme as
or more extreme than the one observed in the sample.
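As a sketch of this use (my own example, tying back to the five-sided-die exercises in Section 10.9 rather than anything computed in the text), the normal approximation turns an observed sample average into an approximate P-value:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve from minus infinity to z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Five-sided die box {1,...,5}: average 3, SD sqrt(2). With 25 throws,
# the SE of the sample average is sqrt(2)/5. P-value for 3.6 or more:
avg_box, sd_box, n = 3.0, sqrt(2), 25
se = sd_box / sqrt(n)
z = (3.6 - avg_box) / se          # about 2.12 standard units
p_value = 1 - normal_cdf(z)       # upper-tail area, roughly 1.7 percent
```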

The NormalApprox sheet in BoxModel.xls can be used to implement the normal
approximation. Figure 10.A.1 below shows how this sheet can be filled in to work
the example in this appendix. The P Value Calculator add-In (introduced in
Section 10.5) can also be used to apply the normal approximation.

11

The Measurement Box Model

It has generally been customary certainly to regard as an axiom the hypothesis that
if any quantity has been determined by several direct observations, made under the
same circumstances and with equal care, the arithmetical mean of the observed values
affords the most probable value. . . .

Carl Friedrich Gauss¹

11.1. Introduction
Regression is the dominant method of empirical analysis in economics. It has
two basic applications: description and inference. The first eight chapters of
this book use regression for description. Chapters 9 and 10 introduce and
review tools for making statistical inference. We are now ready to see how
regression is used when the data are a sample from a population.

The next few chapters prepare the ground for the study of regression as a
tool for inference and forecasting. Inference in general means reasoning from
factual knowledge or evidence. In statistics, we have a sample drawn from a
population and use the sample to infer something about the population.

For example, suppose we have data on 1,178 people in the United States
in 1989 selected at random from the adult working population. We have the
level of experience and the wages of these people. Part 1 discusses the use
of regression to provide a summary of the bivariate wage-experience data.
Statistical inference aims at a much more ambitious goal. Instead of simply
describing the relationship for those 1,178 people, we wish to discover the
relationship between wage and experience for all of the adult workers in the
United States. Our aim is to make educated guesses about the population
based on information gathered from the sample.

Throughout our study of regression applied to inferential questions, we
will emphasize the importance that chance and sampling error play in our
educated guesses, which we will call estimates. Although the details require

1 Gauss (1857, Article 177), cited by Lee (n.d., p. 96).

concentration and effort, the main idea – that an estimate based on a particular
sample is likely to differ from the true, unknown, population value – is not
difficult to grasp.

We stress the importance of understanding the role of chance in an infer-
ential setting because regression for inference requires an explicit model of
the chance process. We do not want the student to memorize a list of rules
that must be met or, worse, assumed. Instead, our goal is true understanding
of different models of chance and their implications for regression analysis in
an inferential setting. Thus, much of the presentation in the rest of the book
is built on the idea of sampling and sampling error. Although proceeding
with caution over some difficult terrain, we do count on prior knowledge of
elementary statistical inference.

In this chapter we discuss a simple model for the data generation process
first used by astronomers as a way of combining measurements of celestial
bodies to estimate their true orbits. The problem these scientists faced
was that, despite strong theoretical evidence that planets ought to orbit along
smooth curves, their measurements did not all fit on a single curve. They real-
ized that the data resulted from imperfect measurements of the exact location
of the planets. The scientists’ task was somehow to reconcile the data to come
up with a single best estimate of the true orbit. In this endeavor astronomers
realized that, in general, it was a good practice to make use of all the obser-
vations. The question was how. The solution ultimately depended on arriving
at a satisfactory model of the data generation process.

We begin with this model in a book dedicated to econometrics because it
serves as an easily understandable bridge from the data generation processes
of basic statistics (what we have called the coin-flip and polling box mod-
els) to the classical econometric model of Chapter 13. Sections 11.2 through
11.5 discuss a univariate problem in which we measure a single quantity
repeatedly. We will show how the basic models of the data generating pro-
cess reviewed in Chapter 10 can be modified to work out the properties of
the sample average in this measurement problem. In Section 11.6, a crucial
conceptual leap is made by extending the measurement box model to the
problem of the relationship between two variables estimated via a bivariate
regression.

Chapters 11 through 13 present three different descriptions of the data gen-
eration process. In Chapter 13, we point out that, mathematically speaking,
the measurement box model of this chapter and the classical econometric
model of Chapter 13 are identical. Why do we distinguish between them?
We do so because we wish to stress that one must have a coherent, plausible
explanation for the data generation process before one proceeds to statistical
inference. The measurement box model of this chapter assigns very different
roles to chance error than does the classical econometric model.

This chapter also demonstrates two complementary approaches applied
throughout the rest of the book: the box model, which facilitates compre-
hension of the data generating process, and Monte Carlo simulation, which
enables us to approximate the distribution of estimates obtained according
to a specified data generating process.

11.2. Introducing the Problem

We will start with a hypothetical example designed to illustrate the problem
of estimating a physical quantity using more than one measurement. Suppose
you wanted to know the distance between two mountain peaks because such
knowledge was extremely important to you. Each mountain reached a sharp
point in the horizon. You took a picture of the two peaks and then ingeniously
used geometry to calculate angles and so forth and eventually came up with
an answer of 107.23 miles.

The answer seemed reasonable and everything was OK, but then a nagging
doubt occurred: 107.23 miles seemed fairly precise (given the two decimal
places), but that is equal to 566,174.4 feet, which, in turn, is equal to 6,794,092.8
inches. Thinking about the distance in millions of inches made you doubt
your measurement. Surely, you thought, the measured distance could not
have been that precise.

Figuring that there is only one way to find out, you measured again. You
took another picture, carefully measured the distance on the photo with a
fancy image scanner hooked to a computer, applied the same complicated
geometric algorithm, and got . . . 106.41 miles.

“What will I do now?” you pondered. Having little else to do and a large
quantity of film available, you decided to measure again and again and again!
All told, you measured that distance 25 times. You did exactly the same thing
every time, taking care to record each step accurately in a log book and
double check your calculations. Not once did you obtain exactly the same
measurement. Figure 11.2.2 contains the data you collected.

We are facing a problem of statistical inference. We do not know the true,
exact distance between the mountain peaks. It exists, but our measuring

Distance Measured = 107.23 miles

Peak A Peak B

Figure 11.2.1. A distance measuring problem.

Observation Distance Measured (miles)

1 107.23
2 106.41
3 105.97
4 106.13
5 108.35
6 105.60
7 105.55
8 105.64
9 106.80
10 105.57
11 108.77
12 108.56
13 108.65
14 105.99
15 105.48
16 106.83
17 107.12
18 105.51
19 106.19
20 106.71
21 106.59
22 107.71
23 106.82
24 106.18
25 105.95

Figure 11.2.2. Hypothetical distance measurements.

strategy is imperfect. The best we can do is use the data in the figure to
infer an answer.

The first thing we have to do is figure out why the numbers are different.
It is not that the mountains are moving. They may be on shifting, tectonic
plates, but that could not possibly account for variation in the observed dis-
tances of a mile or so. Neither is it a case of mistake – like writing down the
numbers incorrectly. The spread in the observed distances is being caused
by the measuring strategy itself. Even when applied perfectly, there is ran-
domness in the measurement process. This is a general property of measure-
ment. The variation in observed measures has come to be called measurement
error.

The Idea of Measurement Error

Sometimes, when you measure something, you can get an exact answer like
the number of eggs in a carton or the number of days a person is out of work.
Other times, however, you are measuring quantitative, continuous variables
like your height or weight for which an exact answer is simply impossible.
You cannot just say, “I am precisely 6 feet tall” because that is not exactly

right. No one is exactly 6 feet tall as in

6.0000000000. . . .

If we very, very carefully tried to measure height, say to five decimal places,
by using special equipment, we would come up with a slightly different num-
ber for each measurement such as

6.00134, 6.00146, 6.00121, 6.00130, and so on.

Measurement error is pure chance error, which cannot be removed. The
wind, air pressure, and dust generate extremely small random variations that
give ever so slightly different answers. Clearly, more accurate devices can
reduce measurement error, but it is impossible to eliminate measurement
error entirely.

We must emphasize: in this context, measurement error does not mean
there is a mistake in the measuring process. Measurement error does not
refer to poor coding, a misreading, or various other “silly mistakes.” In most
situations, even if you measure as carefully as possible, you will still obtain dif-
ferent results each time the measurement is made. All of your measurements
will still be different from the truly, perfectly, ideally exact answer.²

Summary

Once we realize that our measurements contain a component driven by pure
chance, we are led to modeling the chance process generating the observed
data. We are about to see that the situation just described has much in com-
mon with other chance processes such as coin flipping, free-throw shooting,
games with dice, and polling voters. All of these situations are characterized
by a common core idea of the role of chance in generating the observed
outcome.

What we have to do to interpret the distance data as generated by a chance
process is to model the chance process at work explicitly. For that, we need a
box model. This is the topic of the next section.

11.3. The Measurement Box Model

Box models are visual analogies that help us understand the chance process
at work. We must understand the way the data are generated before we

2 In this chapter we discuss errors in measuring the dependent variable. Econometricians more commonly
use the term “measurement error” when referring to errors in measuring independent variables. This
situation, which is also called “errors-in-variables,” results in complications that are beyond the scope
of this book.

[Diagram omitted: a coin-flip box containing tickets 0 and 1 (average = 0.5,
SD = 0.5), drawn with replacement, and a polling box of 190 million tickets
with unknown values (average and SD unknown), drawn without replacement.]

Figure 11.3.1. The coin-flip and polling box models.

can begin to apply the logic of statistical inference. Without a box model of
the chance process that generated the data, you cannot use the methods of
inference described in this chapter. Inferential analysis should always begin
with an explicit statement of the chance process you assert generated your
data.

Chapter 10 reviews two basic types of box models used to describe different
data generation processes, which we summarize in Figure 11.3.1.³ Both of the
box models in Figure 11.3.1 assert that chance is at work in the observed
results. Because chance is also at work via the idea of measurement error in
our problem of measuring the distance between two peaks, we should be able
to model the process – just like we have modeled other chance processes.

A Box Model for Measurement Error

The new box model is associated with Carl Friedrich Gauss (German, rhymes
with “house,” 1777–1855), who tackled the problem of how to combine astro-
nomical measurements to obtain good estimates of the orbits of celestial

3 We recommend a review of the issues underlying these box models because we are about to introduce
a new box model. There are essential similarities we consider extremely helpful in understanding the
material that follows.

bodies. We will describe the model in words and then draw a picture of what
is going on before turning to a more mathematical presentation. Our idea
is that the words and picture will help you develop intuition about the mea-
surement box model that will enhance your understanding of the material.

Having decided that chance is at work in the observed measurements, what
do we have to do to link this idea to a box model? We have to be able to say
that the measurements are like draws from a certain kind of box. If it is
possible to say this, then it is possible to perform statistical inference based
on the specific box model to which we have made the analogy. Without a box
model, this is not possible.

Here is how the measurement error model can be applied to this situation.
Each observation (distance measured in miles in our example) is equal to the
true distance plus a number written on a ticket that was drawn at random
with replacement from the error box. Thus, the observed distance is actually
a composite number because it is made up of two parts: the true distance plus
the random draw. The data are interpreted as follows:

Measurement#1 = True Distance + 1st Draw from Error Box
Measurement#2 = True Distance + 2nd Draw from Error Box

...
Measurement#25 = True Distance + 25th Draw from Error Box

The box, from which a random draw is being taken each time we measure,
has the following characteristics:

r an unknown, possibly infinite number of tickets;
r the average of the tickets is zero; and
r the SD of the box is unknown.

Note that each measurement has exactly the same true distance component.
Yet the observed measurements are different because each measurement has a
different random draw value added to the same true distance value.
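The model can be mimicked in a few lines of simulation. In the sketch below, the true distance and the spread of the error box are invented values (in reality both are unknown); each measurement is the true distance plus an independent draw from an error box whose average is zero:

```python
import random
import statistics

random.seed(6)
true_distance = 106.5     # hypothetical "exact" distance; unknown in practice
error_sd = 1.0            # hypothetical SD of the error box; also unknown
# 25 measurements = same true distance + independent zero-mean draws
measurements = [true_distance + random.gauss(0, error_sd) for _ in range(25)]
estimate = statistics.mean(measurements)   # sample average as the estimate
```

Because the errors in the box average to zero, the sample average tends to land near the true distance, though in any one sample it will miss by a chance amount.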

The sheet Measuring in the Measure.xls workbook enables you to disentangle
the true distance and random draw values. Click on the Take a Measurement
button a few times. Excel takes a random draw each time you measure and adds
it to the true distance to generate the observed distance. This is the heart of
the measurement box model. If necessary, you can click on the Reset button to
clear out all the measurements and start again.

The Measurement Box Model in a Picture

Figure 11.3.2 captures the essential features of the measurement box model.
Each measurement is like taking a draw from the box and adding it to the

[Diagram omitted: an error box containing a possibly infinite number of tickets
with different values, whose average is 0. Tickets are drawn with replacement,
and each observed measurement in the sample equals the exact value plus a
draw from the error box.]

Figure 11.3.2. The measurement box model.

true, exact value. We assume the average of the box is zero and that each
draw is independent of every other draw. Violations of these assumptions
are discussed in Chapters 18, 19, and 20. Finally, note that no assumption is
made about the exact distribution of the errors. In other words, the histogram
of the box contents could have many different shapes so long as the average
of the box is zero.⁴ More detailed comments about the assumptions can be
found in the pages that follow.

Although the average of the box producing the errors is zero, the sample
average of the errors actually drawn is almost certainly not zero. The smaller
in absolute value the sample average of the errors, the closer our estimate is
to the true parameter.

The Measurement Box Model in Equation Form

More formally, the measurement box model is represented like this:

yi = µ + εi,   i = 1, . . . , n,
E(εi) = 0,   i = 1, . . . , n,
SD(εi) = σ,   i = 1, . . . , n,
εi is distributed independently of εj for all i, j, i ≠ j,

where
µ is the true, unknown, exact distance between two mountain peaks,
yi is the ith observation (measurement),
εi is the ith measurement error, and
σ is some nonnegative constant.

4 One technical point: We must also assume that the fourth moment of the error distribution is finite. The
fourth moment of a random variable is the expected value of the fourth power of the random variable.
In practice this means we are ruling out error term distributions like the Cauchy, which has no expected
value and no variance.

We can see each yi, but we do not know µ, σ , or each individual value of εi .
The εi ’s are like tickets drawn from the box. There are n observations in total
corresponding to the n draws from the box. The number of measurements (n)
tells us the number of draws from the box. The first equation, yi = µ + εi ,
i = 1, . . . , n, tells us that each observation is the sum of the true distance
plus a random variable. The second equation, E (εi ) = 0, i = 1, . . . , n, tells
us that the expected value of the random variable is zero. The third equation
tells us that each and every error term has the same spread (SD, i.e., the
errors are homoskedastic). The fourth equation (or rather statement) says
that the error terms are independent of one another. Putting the first two
equations together, recognizing that the true distance is a constant and taking
expectations, we arrive at

E (yi ) = µ, i = 1, . . . , n,

or, in words, the expected value of each observation is the true distance. In
plain English, the measuring device is on average right. The first equation
plus the statement about independence of the error terms tell us that the
measurements themselves are independent of one another.

Now, we do not presume to know the actual contents of the box containing
the errors, and so we do not know how big the errors might be in absolute
size. Obviously, the more precise the measuring device, the smaller in absolute
value the numbers on the tickets will be (i.e., the smaller σ will be). When it
is assumed, however, that the process that generated our observations is like
drawing tickets from a measurement box, we are making some very important
assumptions about the data generating process:

r We assume that the measurement process is unbiased when we say that the average
of the box is zero. (In equation form, unbiased means E (yi ) = µ, i = 1, . . . , n.)

r We assume that each measurement is independent of every other measurement
when we say that we are drawing with replacement from the box.

r We assume that all measurements are alike in the sense that each measurement
faces the same array of possible errors when we say that we are drawing from the
same box every time. (In statistical jargon, the errors are identically distributed;
they all have the same expected value and the same SD.)

If these three assumptions do not hold, the statistical results that follow
in the next two sections are wrong. Computer software may give you an SE,
but it will not be valid. All is not lost, however. It may be possible to come
up with a more complicated box model of the data generation process and to
compute SEs based on the new model. In fact, violations of these assumptions
are an important part of econometrics and are discussed in Chapters 18, 19,
and 20.

Comparing the Measurement and Chapter 10 Box Models

We should take a moment to stress the differences between the measurement
box model and the two standard box models of basic statistics. In the box
models of Chapter 10, the tickets are observable once they are drawn. In the
measurement box model, they are not.

We assume that the average of the box is zero for the measurement box.
In the two earlier box models, the average of the box may be unknown
and not necessarily zero. There may be an infinite number of tickets in the
measurement box. In previous box models, the number of tickets in the box is
simply the possible outcomes of a game (like two for coin flipping) or the size
of the population from which the sample is drawn. In previous box models,
we often want to estimate the average of the box. With the measurement box,
it is assumed the average is zero and we instead want to estimate something
else: a parameter, such as the distance between two objects.

Notice as well the philosophical difference between the measurement
model and the polling model. In the latter we are interested in the pop-
ulation average, but we recognize that individuals differ from one another.
Some people are taller, and some people shorter. In the measurement model,
on the other hand, we believe that all observations measure the same value;
the reason they differ is related to the measurement process.

Although it is important to distinguish the measurement box model from
other box models, do not forget that all box models share a crucial common
bond – chance is at work in generating the observed outcome. This common
bond does not merely allow us to organize the world in a convenient fashion,
but it facilitates the application of basic ideas of statistical inference to any
data that are generated via a chance process.

Summary

Having established a box model for the observed distances, we are ready to
make inferences about the true, exact distance between the two peaks. We
will follow two routes to statistical inference, using Monte Carlo simulation
in Section 11.4 and statistical theory in Section 11.5.

11.4. Monte Carlo Simulation

Workbook: Measure.xls

We are interested in the true, exact distance between the two mountain peaks,
but we can only get an estimate from our sample. Because the data can be
modeled as if they resulted from a simple random sample of measurement
errors, we will be able to apply the methods of statistical inference. On the
assumption that our measurement of the distance between two peaks was
unbiased (so that the errors in the box average to zero), each observed distance
in our sample is a composite of the true distance plus a draw from the
error box:

Individual Measurement = True Distance + Chance Error.

The bad news is that we cannot get rid of the chance error component
of measurement error. In other words, it is not possible to solve for the true
distance as follows:

True Distance = Individual Measurement − Chance Error.

The reason for this is that the chance error of any individual measurement
is unknown. But what we can do is take many measurements and use the
distribution of individual measurements to make a good guess about the true
distance and the spread of observed sample average values. Thus although
the components of an individual measurement cannot be disentangled, we
can apply the box model to make inferences about the exact value of the
unknown parameter and the variation in potential parameter estimates.

Monte Carlo Simulation

In this section Monte Carlo simulation is used to drive home the point that the
sample average has a distribution (the probability histogram for the sample
average) and to show how that distribution depends on the basic parameters
of the model. The Excel workbook Measure.xls shows how. As you explore
the sheet called LiveSample, click on the cells in the Measured Distance
column to see the cell formulas.5 Notice how the measurement box model is
being applied. Hit F9 a few times to draw a new sample of 25 measurements.
Notice that the sample average changes every time you hit F9. Chance is
involved in determining the sample average. The average bounces around the
true, exact value. What we need is a measure of this variation in the sample
average, which is called the SE of the sample average.

Click on the Show Monte Carlo Simulation button to take many samples,
calculate their average, and get an approximation of the SE of the sample
average by calculating the SD of the sample averages. It is an SD, not an SE,
because it is based on a finite number of sample averages. The true SE of
the sample average is based on an infinite number of samples. Figure 11.4.1
shows the output of one Monte Carlo experiment with 1,000 repetitions.

5 In this workbook the errors are normally distributed. One of Gauss’s contributions was to point out
that the distribution of the tickets in the error box is immaterial so long as the mean is zero. Thus, we
could have used the uniform distribution, for example, and nothing essential would change. More on
this is presented in Chapter 14.


[Figure 11.4.1 shows an empirical histogram of the 1,000 sample average
distances, roughly spanning 104 to 111. Summary statistics: Average 107.162,
SD 0.8218, Max 108.867, Min 105.090. Parameters: True Distance 107.1165,
SD of Errors 4.]

Figure 11.4.1. The Monte Carlo approximation to the probability histogram in the
distance between two peaks measurement problem.
Source: [Measure.xls]LiveSample.

The True Distance and the precision of the measuring instrument, both of
which would be unknown to our scientists, are given in the upper right in red
on the Excel sheet. The Average is the average of the 1,000 separate estimates
of the distance between the two peaks. Each estimate is the sample average
of 25 individual measurements. The SD is the Monte Carlo approximation to
the exact SE of the sample average.

As you work with the Monte Carlo simulation, keep in mind that it cannot
be used to actually estimate the true distance between the two mountain
peaks. Instead, Monte Carlo simulation is a kind of testing ground. Assuming
you know the true distance, you can see the kinds of sample results obtained
and, of utmost importance, the variability of the sample results.

You can run interesting experiments by changing the precision of the
measuring instrument in cell D16 of the LiveSample sheet and then rerun-
ning the Monte Carlo simulation. Qualitatively speaking, how does the spread
of the empirical histogram pictured in Figure 11.4.1 depend on the precision
of the measuring instrument?
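For readers who want to try the same experiment outside Excel, here is a
sketch in Python. The true distance matches the value reported in Figure
11.4.1; the list of error-box SDs to try is our own choice:

```python
import random
import statistics

TRUE_DISTANCE = 107.1165   # true distance in km, as in Figure 11.4.1
N, REPS = 25, 1000         # 25 measurements per sample, 1,000 repetitions

random.seed(0)
results = {}
for sd_errors in (1, 4, 8):
    # Each repetition: draw 25 measurements = true distance + chance error,
    # then record the sample average.
    averages = [
        statistics.mean(TRUE_DISTANCE + random.gauss(0, sd_errors)
                        for _ in range(N))
        for _ in range(REPS)
    ]
    # The SD of the 1,000 sample averages approximates the SE of the
    # sample average for this instrument precision.
    results[sd_errors] = statistics.stdev(averages)
    print(f"SD of errors = {sd_errors}: "
          f"spread of sample averages = {results[sd_errors]:.3f}")
```

Running the sketch shows the spread of the sample averages growing in step
with the SD of the error box.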

Summary

By running Monte Carlo simulations, you should be able to convince yourself,
first, that the sample average obtained from a single sample is likely to be a
good estimate of the True Distance and, second, that the typical discrepancy
between the sample average obtained and the True Distance depends directly
on the spread of the tickets in the error box. In more precise statistical terms,
the Expected Value of the sample average is the True Distance, and the SE
of the sample average depends directly on the SD of the error box. It is to
the SE of the sample average that we now turn.

11.5. Applying the Box Model

Workbook: Measure.xls

In the previous section we used Monte Carlo simulation to get a better feel
for how the measurement box model works. We saw how Monte Carlo sim-
ulation, or drawing many, many samples from a known box, allows us to
approximate the probability histogram for the sample averages. In practical
applications, however, the key parameters employed to produce the Monte
Carlo simulation are unknown – that is, the exact value of the thing we are try-
ing to measure and the SD of the box representing the measurement errors.
These parameters allow us to construct the probability histogram via statisti-
cal theory or to approximate it via Monte Carlo simulation. Without knowl-
edge of the true parameter values, it would seem that statistical inference
based on data from a single sample cannot accomplish very much.

Given the measurement box model for the data generation process, how-
ever, it turns out that a great deal can be said about the true value we are
trying to measure. To appreciate why, we need to understand the three areas
of this box model, as shown in Figure 11.5.1.

The Measure.xls workbook demonstrates these three areas. The LiveSam-
ple sheet shows both Area 1, the assumptions about the error box, and the
data generating process, as captured in the Excel formulas that generate each
observation, and Area 3, a single sample.6 The MCSim sheet approximates
the probability histogram for the sample average, that is, Area 2.

Statistical theory tells us that the Expected Value of the sample average,
the center of the probability histogram, is the exact value (true distance) we
are trying to measure. Statistical theory also says that

SE(Sample Average) = SD(Box)/√n,

where n is the number of observations. Measure.xls can be used to obtain
suggestive evidence in support of both propositions.

6 The formula is “=ROUND(True Distance+NORMALRANDOM(0,Error SD),2).” This says that
each observation is the sum of the true distance plus a normally distributed random variable that
averages 0 and has a given SD. The resulting sum is rounded to two decimal places.
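The square-root formula can also be checked with a short simulation. The
parameter values below are assumptions chosen to match the earlier example;
the SD of many simulated sample averages should track SD(Box)/√n as n
grows:

```python
import random
import statistics

random.seed(1)
TRUE_DISTANCE, SD_BOX = 107.1165, 4   # assumed values for illustration

checks = {}
for n in (25, 100, 400):
    # Simulate 2,000 samples of size n and record each sample average.
    averages = [
        statistics.mean(TRUE_DISTANCE + random.gauss(0, SD_BOX)
                        for _ in range(n))
        for _ in range(2000)
    ]
    approx_se = statistics.stdev(averages)   # Monte Carlo approximation
    theory_se = SD_BOX / n ** 0.5            # SD(Box)/sqrt(n)
    checks[n] = (approx_se, theory_se)
    print(f"n = {n}: simulated SE = {approx_se:.3f}, "
          f"SD(Box)/sqrt(n) = {theory_se:.3f}")
```

With 2,000 repetitions, the simulated SEs land within a few percent of the
theoretical values.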


[Figure 11.5.1 shows the three areas of the model. Area 1, THE BOX: an
error box holding an infinite number of tickets with different values; the
average of the box is 0, the SD of the box is unknown (??), and tickets are
drawn with replacement. Area 2, the PROBABILITY HISTOGRAM for the
sample average: centered at EV = exact value, with spread given by the SE
of the sample average. Area 3, THE SAMPLE: observed measurement_i =
exact value + error_i, for i = 1 to n.]

Figure 11.5.1. The three areas of the measurement box model.

This reasoning says that a plausible estimate of the True Distance is just
the sample average. We still seem at a loss, however, in trying to determine
how far away our sample average is likely to be from the exact value we are
trying to measure. We need to know the spread of the error box in order to
make this calculation, but the errors are not observed.

A crucial insight of statistical inference is that we can make do with using
the SD of the measurements in the sample to estimate the SD of the errors
in the box. Students often ask, “But how can one know the spread of the
chance error when it cannot even be seen?” The answer is that the spread
of the measurements is used to reveal the spread of the chance errors. The
key is that we apply a property of the SD of a list of numbers: subtracting the
same number from each number in the list, in this case an unknown exact
value, leaves the SD unchanged. For a demonstration of this property in the
context of this measurement problem, go to the EstimatingSDBox sheet of
Measure.xls.
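The property is easy to demonstrate directly. In this sketch (with an assumed
true distance and error SD), the SD of the observable measurements equals
the SD of the hidden errors exactly:

```python
import random
import statistics

random.seed(2)
TRUE_DISTANCE = 107.1165                            # hidden from the scientist
errors = [random.gauss(0, 4) for _ in range(25)]    # unobservable chance errors
measurements = [TRUE_DISTANCE + e for e in errors]  # what is actually recorded

# Subtracting the same constant (the unknown true distance) from every
# number in a list leaves its SD unchanged, so the SD of the measurements
# reveals the SD of the hidden errors.
print(round(statistics.stdev(errors), 6))
print(round(statistics.stdev(measurements), 6))
```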

Let us apply this thinking to our concrete example of the distance between
two mountain peaks. One simple random sample, if explicitly tied to a box
model, can be used to make inferences about an unknown population param-
eter. In Figure 11.5.2, we estimate the true distance between the two peaks.
It is important when applying this procedure to keep in mind that the point
estimate itself is not the only important piece of information needed – we


[Figure 11.5.2 lists a single sample of 25 measured distances (in miles) and
applies the box model to it: sample average 106.652, sample SD 1.043, and
estimated SE of the sample average 0.209. The estimate for the true distance
is the sample average, 106.652 miles; the typical discrepancy between this
estimate and the unobserved true distance is 0.209 miles.]

Figure 11.5.2. Estimating the true distance and the SE based on a single sample.
Source: [Measure.xls]DeadSample.

also need to know the spread in the estimate (i.e., the SE of the sample
average). To find this SE we need to know the SD of the box (i.e., the precision
of the measuring instrument). We do not observe the SD of the box, but it
is possible to estimate it via the SD of the measurements. Armed with an
estimated SD of the box, we can estimate the SE of the sample average via
the standard formula. In the data described by Figure 11.5.2, the sample SD
was 1.043. Therefore, the estimated SE of the sample average is
1.043/√25 ≈ 0.209. This SE can be estimated only if the box model is
specified and its conditions are met.
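The arithmetic behind that estimate can be reproduced in a couple of lines:

```python
# Estimate the SE of the sample average as in Figure 11.5.2, using the
# sample SD in place of the unknown SD of the error box.
sample_sd = 1.043               # SD of the 25 recorded distances
n = 25                          # number of measurements in the sample
estimated_se = sample_sd / n ** 0.5
print(round(estimated_se, 3))   # 0.209
```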

Summary

We conclude this section by discussing what could cause statistical inference
to fail in the context of the measurement box model. Statistical inference can
break down

- If the measurement process is biased. If you did not notice that your ruler had the
  first inch snapped off, so that when you read off 4 1/8 inches it was actually
  3 1/8 inches, that would be bias.


- If the errors are not independent of one another. This is called autocorrelation or
  serial correlation. If you used a machine to read off the distance from the photo and
  it somehow kept the last measurement in memory, the next measurement would
  depend on the previous measurement.

- If the measurements are not all alike – that is, we were not always drawing from
  the same box and thus the errors are not identically distributed. This condition is
  known as heteroskedasticity.

Here is an example of heteroskedasticity: If you ran out of film on the tenth
measurement and substituted a high-powered laser beam device to measure
the peaks, you would not want simply to mix the observations together – the
more precise laser beam observations should carry more weight.

We explore all of these problems in later chapters. In all three cases the
box model does not apply, and inference using computed estimates and SEs
gives incorrect answers.7 The computer program used to analyze your data
is unable to catch these violations of the measurement box model. The com-
puter software assumes the data do not violate the requirements. Human
judgment is required to determine how the data were generated before the
data are submitted to the computer. This is worth remembering.

11.6. Hooke’s Law

Workbook: HookesLaw.xls

In this section, we take an important step by extending the measurement
model to cover the case of bivariate regression. Our fictional example comes
from the world of physics. Robert Hooke (1635–1703, British) hypothesized
that the “stretchiness” of a spring is proportional to the load placed on it.
Expressed as an equation, Hooke’s law relates the length of a spring to the
load placed on it like this:

Length of spring (in cm) = Length with no load on it (in cm)
                           + m · Weight of the load (in kg),

where m is a constant of proportionality measured in cm/kg known as the
spring constant. Let us accept this as true – that is, it is absolutely true that
the springiness of a spring is proportional to the weight placed on it.8

Every spring has an intrinsic value of m. Now, suppose you were asked to
estimate the constant of proportionality for a particular spring. You enter the
laboratory and are given a spring of some unknown springiness, some weights,

7 There is one exception to this statement: heteroskedasticity in the univariate case (see Section 19.2).
However, in the bivariate and multivariate cases, heteroskedasticity does impair inference.

8 Actually, for you physics experts out there, Hooke’s law is merely a linear approximation that works
well within certain bounds. Physicists also point out that Hooke’s law can be applied to any object.
Hanging a board in the air and placing a weight at the bottom “stretches” the board (albeit not much!)
and puts “stress” (vertical pressure) and “strain” (horizontal pressure) on the board.


[Figure 11.6.1 shows a yardarm-like laboratory device with a spring (and a
weight on it) hanging from it, a metric ruler, and other weights ready to go
(the value of each weight is known exactly), all sitting on a special,
heavy-duty scientific table.]

Figure 11.6.1. Experimental apparatus for testing Hooke’s law.

and a ruler. Your job is to figure out the springiness of that particular spring.
It has a constant of proportionality but its value is unknown. You proceed
by hanging the spring from a small yardarm-type lab device and placing a
weight on it, as in Figure 11.6.1.

You carefully measure and record the length of the spring with different
weights (of known values) on the end of the spring. Obviously, when a weight
is placed on the spring, it stretches. Thus, you end up with a measurement of
the length of the spring for each given weight. When measuring the different
lengths of the spring with the differing weights, you are careful to prevent
one measurement from influencing another and ensure that the measuring
process used is the same for every weight.

To analyze the data and arrive at an estimate of the springiness of our
apparatus, we need a model of the data generation process. We will adapt the
measurement box model to this new situation. The data generation process
looks like Figure 11.6.2.

[Figure 11.6.2 shows the data generation process. Errors are drawn from a
Gaussian error box with Average = 0 and SD = ?:

Observed length_1 = Intercept + Slope · Weight_1 + ε_1
Observed length_2 = Intercept + Slope · Weight_2 + ε_2
...
Observed length_n = Intercept + Slope · Weight_n + ε_n]

Figure 11.6.2. Box model for Hooke’s law experiment.


As before, the errors are independent and identically distributed. They
have an average of zero and an unknown SD. Their being identically dis-
tributed implies that the errors are independent of the weights. In plain
English, this independence assumption says that a heavier than average
weight does not mean, for example, that the measurement error is more likely
to be positive, nor does a heavier than average weight mean that the spread
of the measurement error increases. We highlight the independence of the
error terms and the X variables because this turns out to be a very important
assumption in models of data generation processes.

You can see a virtual version of the problem by exploring the sheet called
OneObs in the HookesLaw.xls workbook. Do this now. Follow the instruc-
tions in the OneObs sheet to get a complete understanding of the concept
that there is one, unchanging true length of the spring determined by the
intercept and slope parameters.

By convention, econometricians represent the parameters with Greek
letters:

True Length of Spring_i = β0 + β1 · Weight_i, i = 1, . . . , n.

The “i” subscript indexes observations. There are n observations. The
OneObs sheet uses color as a guide. Red text means the number cannot be
seen in real-world situations. But just because a variable cannot be directly
observed does not mean it is unimportant or irrelevant. In fact, we know
there is a true length built on the unknown intercept and slope we are trying
to estimate.

The OneObs sheet also drives home the point that the observed length
of the spring is different from the true length of the spring. The reason for
this discrepancy is measurement error. Every time the spring is measured, an
error resulting from that particular measurement is added to its true length.
Thus, if we denote the error for the ith measurement by εi ,

Observed Length of Spring_i = True Length_i + ε_i.

This formulation highlights the similarity between the more complicated
bivariate model of this section and the univariate measurement error model
considered in earlier sections of this chapter. Equivalently,

Observed Length of Spring_i = β0 + β1 · Weight_i + ε_i.

Click the Get One Measurement button several times to build a data set. Each
time you click the button, think about the data generation process. The
crucial concept is that the observed length of the spring is composed of a
fixed component (Intercept + Slope · Weight) plus an error term. Proceed to
the OneSample sheet to see a data set with 100 observations. Each observation
was generated as before. Figure 11.6.3 shows the generation of one sample.


[Figure 11.6.3 shows one sample from the OneSample sheet. Parameters:
Intercept 1.235 (length of spring in cm with no weight on it), Slope 0.2
(constant of proportionality in cm/kg), WeightStep 1, SDBox 10. For this
sample the fitted trend line is y = 0.191x + 0.746, so b1 = 0.191, with
RMSE = 9.084. A table lists Weight, Error, True Length, and Observed
Length for each observation, and a chart, “Estimating the Constant of
Proportionality,” plots Length of Spring (cm) against Weight (kg).]

Figure 11.6.3. One sample.
Source: [HookesLaw.xls]OneSample.
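The data generation process behind this sheet can be sketched in Python. The
intercept, slope, and error SD follow the figure; spacing the 100 weights 1 kg
apart is our assumption:

```python
import random

random.seed(3)

# Parameters as in the Hooke's law example: intercept 1.235 cm,
# slope 0.2 cm/kg, error-box SD of 10.
BETA0, BETA1, SD_BOX, N = 1.235, 0.2, 10, 100
weights = list(range(N))   # assumed: 100 weights spaced 1 kg apart

# Each observed length = fixed component + a draw from the error box.
observed = [BETA0 + BETA1 * w + random.gauss(0, SD_BOX) for w in weights]

# Fit the least squares line by the textbook formulas.
mean_w = sum(weights) / N
mean_y = sum(observed) / N
sxx = sum((w - mean_w) ** 2 for w in weights)
sxy = sum((w - mean_w) * (y - mean_y) for w, y in zip(weights, observed))
b1 = sxy / sxx              # estimated slope (bounces around 0.2)
b0 = mean_y - b1 * mean_w   # estimated intercept
print(f"b1 = {b1:.3f}, b0 = {b0:.3f}")
```

Rerunning with a new seed plays the role of hitting F9: the estimates bounce
while the parameters stay fixed.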

Hitting F9 recalculates the entire workbook and draws 100 new observations
in a flash. The fundamental idea demonstrated in this sheet is that the
estimated slope is a random variable. Click the Get One Sample button (or
simply hit the F9 key) and watch the dots, black line, and the m and b values
in the y = mx + b trend line display bounce around. The red line, however,
stays perfectly still because it is based on the fixed parameters, not the
estimated coefficients. The distinction between the bouncing behavior of the
sample and the fixed red line is a crucial concept.

It is clear that the dots are dancing on the screen because they contain
measurement error (via the data generation process explained in detail in
the OneObs sheet). Because the fitted black line is based on the sample data,
its intercept and slope will also be bouncing. The red line, however, contains
no measurement error at all. It is a fixed, unchanging truth that we are trying
to discern.

In inferential analysis, it is important to keep straight what is a parameter
and what is an estimate. Econometricians use Greek letters to represent
parameters and unobservable variables (in this chapter, we have seen µ, β, ε,
and σ). We will use lowercase English alphabet letters to designate estimated
parameters. For example, we use the symbol b1 to designate our estimate of
β1. Many econometricians use hats (“circumflexes”) to indicate an estimate
of a parameter; thus, β̂1 would indicate the estimated value of β1.

We present estimates of the regression slope, b1, and the RMSE. It should
be obvious that the slope estimate is fluctuating around the true value of the
spring constant, which in this example is 0.2. The RMSE oscillates around
the true value of the SD of the measurement box, which is 10 in this case.

The MCSim sheet drives home the notion that the estimated slope, b1, is
a random variable by running a Monte Carlo simulation. In each repetition,


[Figure 11.6.4 shows an empirical histogram for 1,000 repetitions of the
estimated slopes (b1), roughly spanning 0.1 to 0.3. Sample slope summary
statistics: Average 0.2019, SD 0.0336, Max 0.2983, Min 0.1007. Population
parameters: Slope 0.2, Intercept 1.234567, SDBox 10, Exact SE Slope 0.0346.]

Figure 11.6.4. Monte Carlo simulation of the estimated slope.
Source: [HookesLaw.xls]MCSim.

100 observations are taken and a least squares line is fitted to the data. The
estimated slope for each of the first 100 repetitions is recorded in column
B. The sheet provides summary statistics and an empirical histogram of the
1,000 estimated slopes, as shown in Figure 11.6.4. The empirical histogram
is an approximation to the probability histogram or sampling distribution of
the slope estimate.

The Monte Carlo simulation makes clear that the estimated slope is a ran-
dom variable. The good news is that it is apparently centered on the true, exact
constant of proportionality, which suggests we have an unbiased estimator.
The SD of the 1,000 estimated slopes, in this case 0.0336, is an approximation
to the exact SE of the estimated slope, which is 0.0346 in this example. (The
exact SE can be computed analytically, which is how the Exact SE Slope is
being calculated in cell H7.) We explain the concepts of bias and the exact
SE in detail in Chapter 14.
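A few lines of Python reproduce this Monte Carlo experiment. The
parameters follow the MCSim sheet; the weights (100 values spaced 1 kg
apart) are an assumption, though with that choice the standard analytic
formula for the slope's SE, SD(Box)/√Sxx, works out to about 0.0346, the
value reported in Figure 11.6.4:

```python
import random
import statistics

random.seed(4)

BETA0, BETA1, SD_BOX, N, REPS = 1.235, 0.2, 10, 100, 1000
weights = list(range(N))                 # assumed: 1-kg steps from 0 to 99
mean_w = sum(weights) / N
sxx = sum((w - mean_w) ** 2 for w in weights)

slopes = []
for _ in range(REPS):
    # Generate a fresh sample of 100 observed lengths and fit the slope.
    observed = [BETA0 + BETA1 * w + random.gauss(0, SD_BOX) for w in weights]
    mean_y = sum(observed) / N
    sxy = sum((w - mean_w) * (y - mean_y)
              for w, y in zip(weights, observed))
    slopes.append(sxy / sxx)

exact_se = SD_BOX / sxx ** 0.5   # analytic SE of the slope: SD(Box)/sqrt(Sxx)
print(f"average of slopes: {statistics.mean(slopes):.4f}")
print(f"SD of slopes (Monte Carlo): {statistics.stdev(slopes):.4f}")
print(f"exact SE of slope: {exact_se:.4f}")
```

The Monte Carlo SD of the slopes should land close to the analytic SE, and
the average of the slopes close to the true slope of 0.2.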

Summary

We used Hooke’s law to show how the measurement box model can be applied
in the context of a bivariate regression. The sample coefficients b0 and b1 from
a regression of observed length of spring on weight are random variables
with a probability histogram (or sampling distribution). The parameters β0
and β1 can be estimated from the sample data. The measurement error box
model as applied to bivariate regression looks very similar to the box model
as applied to univariate measurement. An important refinement arises from
the presence of an independent variable: As noted, a crucial assumption is
that the error terms are independent of the X variables.

11.7. Conclusion

The measurement box model described in this chapter was originally devel-
oped to handle the problem of modeling the data generating process for astro-
nomical observations. Astronomy and economics may seem to be only dis-
tantly related fields. Nonetheless, the measurement box model is very closely
related to the classical econometric model, which, as its name implies, has
been the standard model for the data generating process for economic vari-
ables. The mathematical features of the measurement model – errors that are
mean zero and are independently and identically distributed and observed
values that are the sum of the error term and functions that are linear in the
parameters of one or more independent variables – are shared by the classical
econometric model. In Chapter 14 we will see that these common features
imply that, for both models, the ordinary least squares estimator has certain
optimal properties.

The difference between the measurement model and the classical econo-
metric model has to do with the explanation of the data generating process.
In the measurement model, the only reason univariate data differ from one
another and the only reason bivariate data do not all lie on the same single
regression line is the imperfection of the measurement process. In the classi-
cal econometric model there are other, more complicated, reasons for these
discrepancies.

Although the measurement model provides concepts and intuition that will
serve us well, we are not ready to jump into the classical econometric model.
Chapter 12 considers an alternative means of describing the data generating
process for economic variables. That chapter continues to use the basic box
model metaphor, but the interpretation of the box contents is different.

11.8. Exercises

1. In this book we develop two different languages for describing data generation
processes. The first uses the box model metaphor, whereas the second employs
formal mathematical symbols. What box model concepts correspond to each of
these formal mathematical symbols, statements, and equations?
a. εi
b. σ
c. E(εi ) = 0, i = 1, . . . , n
d. SD(εi ) = σ, i = 1, . . . , n
e. εi is distributed independently of εj, for all i, j with i ≠ j.
f. yi = µ + εi , i = 1, . . . , n.


Figure 11.8.1. Results from Monte Carlo experiment. Summary Statistics:
Average 105.035, SD 0.8127, Max 107.630, Min 102.597.
Source: [Measure.xls]MCSim.

2. Suppose that the measuring device described in Sections 11.2 to 11.5 was sys-
tematically biased – in particular that the measurements on average were 0.5 km
too big but all the other assumptions about the box model still held true. How
would Figure 11.5.1, which shows the three areas of the measurement box model,
change?

3. In the univariate measurement model described in Measure.xls, the residual is
defined as the difference between the individual measurement and the sample
average. No matter how many times you make 25 new measurements, in the
LiveSample sheet you will notice that the residuals always average to zero (see
cell E21). Why does this happen? The answer requires a little algebra.

4. Suppose you obtained the data in Figure 11.8.1 from the Measure.xls workbook.
Note that the true distance is not revealed. You are told that in the Monte Carlo
experiment there were 1,000 repetitions. In each repetition, 25 measurements
were taken of the unknown distance. You are asked to give your best estimate
of the true distance. What will it be and why?

5. Reconsider the hypothetical Galileo story of Section 6.2. Write down a measure-
ment box model of the data generation process for Galileo’s data on time and
distance of a falling object.

References

Galileo actually plays a role in the development of the measurement model.
According to Hald (1998, p. 33), the ideas that there is a true value to be estimated,
that all observations suffer from error, and that errors are distributed
symmetrically about zero were clearly put forth by Galileo in 1632:

Hald, Anders (1998). A History of Mathematical Statistics from 1750 to 1930. New
York: John Wiley and Sons.

Two hundred years later, Gauss perfected the measurement model and worked out
the properties of the ordinary least squares estimator when applied to data
generated according to the model. The quotation from Gauss comes from

Gauss, Carl Friedrich (1857, 1963). Theory of the Motion of the Heavenly Bodies
Moving about the Sun in Conic Sections: A Translation of Gauss’s “Theoria
motus.” With an appendix. Boston, Little, Brown and Company, 1857.
Reissued by Dover Publications, New York, 1963.

We found the quotation on page 96 of

Lee, Peter M. (n.d.) Lectures on the History of Statistics. Manuscript available from
Peter Lee’s History of Statistics page: <www.york.ac.uk/depts/maths/teaching/
pml/hos/welcome.htm>.

This chapter owes a great deal to Freedman et al. (1998). All of the key ideas of this
chapter, other than the use of Monte Carlo simulations, can be found in Chapters 6,
12, and 24 of their book.

12

Comparing Two Populations

Never will we know if the value of a statistic for a particular set of data is correct.
David Salsburg1

12.1. Introduction

In this brief chapter, we introduce yet another data generation process called
the two box model. We will see how the difference of sample averages is
distributed, using both Monte Carlo simulation and analytical methods.

The two box model is an extension of the polling box model (explained in
detail in Chapter 10) and provides further practice with inferential methods.
Although the rapidly expanding list of box models may seem daunting, do
not despair. The same basic principles about variability of sample statistics
and understanding the sampling distribution underlie all data generation
processes.

Our approach in presenting the various box models is meant to illustrate
the point that a properly configured box model can represent a wide variety
of chance processes. We are also slowly building toward the box model that
underlies regression analysis in an inferential setting.

Section 12.2 introduces the two box model, and Section 12.3 offers a Monte
Carlo simulation to explore the sampling distribution of the sample average
difference. Section 12.4 presents a real-world application of the two box model.

12.2. Two Boxes

The two box model is essentially two polling box models combined. Instead
of estimating a parameter or testing a hypothesis about a single population –
for instance, the average wage of California residents – we are interested in

1 Salsburg (2001, p. 66).



[Figure 12.2.1 shows Box A (Population A), from which n_A draws form
Sample A, and Box B (Population B), from which n_B draws form Sample B.]

Figure 12.2.1. A picture of the two box model.

a comparison of two populations (e.g., the difference in the average wages of
California and Nevada residents).

If we want to estimate the difference in average wages between men and
women or test whether men have higher wages than women, the two box
model might apply. Another example would be to estimate the difference
in average SAT scores between the 1985 and 1995 test-taking cohorts2 or
determine whether students’ scores on the SAT are statistically significantly
different between 1985 and 1995 (on the assumption that the test has not
changed in difficulty or scoring during that time).

Notice how these examples focus on estimating the difference in population
averages or testing a hypothesis about the difference. In either case, the SE
of the difference of the sample average will play a prominent role. We need
a box model as well as data to obtain an estimate of the SE of the difference.
Without a box model, we cannot obtain the SE.

A Two Box Model for Comparing Populations

Because we are comparing two different populations, we have two different
boxes. Sample A is drawn from Box A, and Sample B is drawn from Box B.
For the methods of this section to apply, the samples must be simple random
samples that are independent of one another. They may be drawn without
replacement, but we will assume that the number of tickets is large relative
to the number of draws and thus that no correction factor is needed.

The two box model is depicted in Figure 12.2.1. Each box has a large, but
finite, number of tickets representing each person in the population. There
is a fixed, unknown average of each box and, therefore, a fixed, unknown
difference of the population averages. This, of course, is what we are trying
to estimate or infer. In addition, each box has a fixed, unknown standard
deviation.

We will use Sample A to calculate the sample average of A and do the
same for B. The difference between the two sample averages is our estimate
of the difference between the two population averages. Chance or sampling
error is present in a realized sample average because not all of the tickets are
drawn. Draw another sample and it will have a different set of tickets and thus a
different sample average. Because the sample averages are random variables,
the difference of the sample averages is also a random variable.

2 In economics and demographics, a cohort is a group of people who all enter the scene at the same point
in time.

We need a new SE, the SE of the difference of the sample averages, to
provide a give or take number on our estimate of the difference between the
two population averages. To test a hypothesis about the difference between
the population averages, this SE is also required to construct a z-statistic
and get the P-value for a hypothesis test. The calculation of the SE is a little
different than in the one-box case, but otherwise the process is the same.

Getting the SE of the Difference

We construct the SE of each sample average, SEA for sample A and SEB
for sample B, as discussed in Section 10.4 (estimating the SD of each box, if
necessary). Then, the SE of the difference of the sample averages is a function
of SEA and SEB, following a square-root law, like this:

SEDifference = √(SEA² + SEB²).

This formula assumes that the samples are independent, simple random sam-
ples. If the samples are dependent or are not simple random samples, the SE
will not be correct, and analyses that use the SE (such as hypothesis testing)
will not be reliable.3

Once we have the SE of the difference of the sample averages, we can use
it as the give or take number on our estimate, generate confidence intervals,
and run hypothesis tests. We can find the z-statistic, in the usual way, like
this:

z = (observed difference − hypothesized difference) / SE of the difference.

The P-value is the probability of drawing a sample that has this z-statistic, or
one more extreme, if the null hypothesis is true. In large samples, the P-value
can be computed using the normal distribution.
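In code, the two formulas above amount to only a few lines. A minimal Python sketch – the function names and the illustrative numbers are ours, not the workbook's:

```python
from math import sqrt
from statistics import NormalDist

def se_difference(se_a, se_b):
    # SE of the difference of two independent sample averages
    return sqrt(se_a**2 + se_b**2)

def z_statistic(observed_diff, hypothesized_diff, se_diff):
    return (observed_diff - hypothesized_diff) / se_diff

# Illustrative numbers only: SE_A = 0.3, SE_B = 0.4, observed difference 1.5,
# and a null hypothesis of no difference
se = se_difference(0.3, 0.4)              # ≈ 0.5
z = z_statistic(1.5, 0, se)               # ≈ 3.0
p_value = 2 * NormalDist().cdf(-abs(z))   # two-sided P-value, large samples
```

For a one-sided alternative (the difference is positive), the P-value would instead be NormalDist().cdf(-z).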

Summary

This section introduced the two box model and presented a formula for the SE
of the difference of the sample averages. The next section presents an example
of the two box model in Excel that provides a concrete demonstration of how
the samples are generated. It makes clear that the difference of the sample
averages is a random variable with a sampling distribution.

3 The formula for the SE of the difference of the sample averages can be derived via the algebra of
expectations. Apply the rule for the variance and SD of a sum of independent random variables as
shown in Section 10.7.

12.3. Monte Carlo Simulation of a Two Box Model

Workbook: TwoBoxModel.xls

In this section we explore the sampling properties of the two box model. We
focus on explicitly demonstrating the data generation process and empha-
size the variability of the difference of the sample averages. When you draw
two simple random samples from two separate populations and compare the
sample averages, it is possible to create an estimate of the difference of the
population averages.

Unfortunately, because you do not have the population averages them-
selves, it is not possible to use the observed sample difference to make a
definitive statement about the difference of the averages in the population.
The difference of the sample averages is a good guess, but it has an inherent
variability captured by the SE of the difference. This fundamental lesson is
the heart of the TwoBoxModel.xls workbook.

Setting Up the Two Box Model

Go to the HighSchool sheet in the TwoBoxModel.xls workbook. Click the
Make a Box Model button and provide the necessary information. The first time
you create a box model, make the population small – for example, 100.
Choose the default log normal distribution because wage distributions usually
have long right-hand tails. Make the average wage in the population $10 per
hour and the SD $5 per hour. A population histogram appears. Set the sample
size at 25. The resulting parameters should look like Figure 12.3.1.

Figure 12.3.1. Parameter settings for the two box model simulation.
  Number Tickets: 100
  Average of the Box: $10.00
  SD: $5.00
  Number of Draws: 25
  Distribution of Box: Log Normal
Source: [TwoBoxModel.xls]HighSchool.

Figure 12.3.2. Parameter settings for the two box model simulation.
  Number Tickets: 100
  Average of the Box: $15.00
  SD: $10.00
  Number of Draws: 25
  Distribution of Box: Log Normal
Source: [TwoBoxModel.xls]College.

Clicking the Draw a Sample One Ticket at a Time button takes one draw, without
replacement, from the population. The chosen observation is then reported, its
cell is colored green in column A, and the value is written in column J. Scroll
down (if needed) until you spot the observation that was drawn. It is now out of
the population and cannot be drawn again. Click the Draw a Sample One Ticket
at a Time button a few times until you get the idea of how the sample is being
generated. When you tire of drawing tickets one at a time, click the Cancel
button on the message box.

Clicking the Draw a Sample button takes an entire sample, drawn without
replacement, from the population and places it in column J (and column
A of the Difference sheet). The sample average and SD are reported in cells
G11 and G13. The correction factor (which is increasingly important as the
number of draws approaches the total number of tickets in the population) is
displayed in cell G23. The estimated (using the sample SD) and exact (using
the population SD) standard errors of the sample average are reported in
cells G25 and G26.
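The correction factor reported in cell G23 is the standard finite-population correction, √((N − n)/(N − 1)). A minimal sketch of how the exact SE could be computed (our own function names, not the workbook's cells; the parameters match the Figure 12.3.1 box):

```python
from math import sqrt

def correction_factor(n_tickets, n_draws):
    # Standard finite-population correction for draws without replacement
    return sqrt((n_tickets - n_draws) / (n_tickets - 1))

def exact_se_sample_average(sd_box, n_tickets, n_draws):
    # Exact SE of the sample average, using the known SD of the box
    return (sd_box / sqrt(n_draws)) * correction_factor(n_tickets, n_draws)

cf = exact_cf = correction_factor(100, 25)    # about 0.87
se = exact_se_sample_average(5.0, 100, 25)    # about 0.87 here, since 5/sqrt(25) = 1
```

With 25 draws from 100 tickets, the correction shrinks the SE noticeably; with a few draws from a huge box it is essentially 1, which is why Section 12.2 could ignore it.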

Click the Draw a Sample button several times. Note that the tickets in the
population and parameters (in red) remain fixed, whereas the sample itself and
all statistics based on the sample vary. The exact SE remains constant (because
the SD of the box does not change), but the estimated SE bounces (because
it is based on a bouncing sample SD).

Proceed to the College sheet and click on the Make a Box Model button. Create
a box model with the parameters shown in Figure 12.3.2. As with the HighSchool
sheet, you can create a sample one draw at a time or generate an entire
sample by simply clicking the Draw a Sample button. The sample is displayed in
column J and column B of the Difference sheet.

The Difference sheet shows the observed High School and College wages.
Confirm that column A in the Difference sheet is identical to column J in
the HighSchool sheet. Column B, of course, is a copy of the College sample.
The Difference sheet displays the average and SD for each group and
the difference of the sample averages. We obtained the results reported
in Figure 12.3.3. Your results will be different because you have different
samples.

The difference of the sample averages of $1.40 per hour is an estimate of
the difference of the true or population averages, which we know is $5 per
hour. The estimate is off the true value because of sampling or chance error.


                     Sample Average   Sample SD
  High School           $12.10          $5.17
  College               $13.50          $9.26
  Difference (C-HS)      $1.40

Figure 12.3.3. High school and college sample outcomes.
Source: [TwoBoxModel.xls]Difference.

There are a few more high-wage high schoolers and a few more low-wage
college grads in these two particular samples than one might have expected.

Of course, the presence of chance error in the sample flows into the sample
averages and the difference of the sample averages. Thus, the difference of
the sample averages is a random variable with a sampling distribution.

Monte Carlo Simulation

Click the Draw a Sample from Each Box button in the Difference sheet to see
the difference of the sample averages (in cell E4) bounce around. There is no
doubt about it – each set of new samples pulled from the high school and college
populations generates new sample averages and a new difference of the sample
averages. We could build up an approximation to the probability histogram of
the difference of the sample averages by tracking each cell E4 result after
drawing new samples, but that would be slow and tedious.

Proceed to the MCSim sheet to see a much faster and easier approach.
Each repetition consists of drawing a high school and college sample and
then computing the difference of the averages of the two sample groups.
With many repetitions, an approximation to the sampling distribution or
probability histogram of the difference of the sample averages emerges.
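The same repetition loop can be sketched outside Excel. Here is a Python version under our own assumptions – lognormal boxes built to match the Figure 12.3.1 and 12.3.2 parameters, so the realized averages of the tickets will only be close to $10 and $15, not exact:

```python
import numpy as np

rng = np.random.default_rng(0)

def lognormal_box(mean, sd, n_tickets):
    # Lognormal tickets with approximately the requested mean and SD
    sigma2 = np.log(1 + (sd / mean) ** 2)
    mu = np.log(mean) - sigma2 / 2
    return rng.lognormal(mu, np.sqrt(sigma2), n_tickets)

high_school = lognormal_box(10, 5, 100)   # HighSchool box
college = lognormal_box(15, 10, 100)      # College box

diffs = np.empty(10_000)
for i in range(10_000):
    # Each repetition: 25 draws without replacement from each box
    a = rng.choice(high_school, 25, replace=False)
    b = rng.choice(college, 25, replace=False)
    diffs[i] = b.mean() - a.mean()

# Center and spread of the empirical sampling distribution
print(diffs.mean(), diffs.std())
```

With many repetitions, the mean of diffs lands near the realized difference of the two box averages, and the SD of diffs approximates the exact SE of the difference.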

Run your own Monte Carlo simulation by clicking on the Run Monte Carlo
Simulation button and compare your results to those reported in Figure 12.3.4.
Notice that the average of 10,000 differences of the sample averages, 4.994, is
close to the difference of the population averages, 5.00. This suggests that the
sampling distribution is unbiased – that is, it is centered on the true population
difference. The SD of the 10,000 differences, 1.932, is also a good approximation
of the exact SE of the difference of the sample averages, 1.946. The latter was
computed using the square-root formula and the known population SDs. The
Monte Carlo results support the formula for the exact SE of the difference
of the sample averages.

That the sampling distribution is approximately normal even though the
populations are log normally distributed demonstrates the central limit
theorem. Finally, note that the minimum difference in 10,000 repetitions was
−$1.64 per hour. For that particular realization of the chance process, the
high school average wage was actually greater than the college average. The
empirical histogram shows that there were a few samples in which the
difference of the sample averages was negative.

Figure 12.3.4. Two box Monte Carlo simulation results.
  Sample average difference summary statistics: Average 4.994; SD 1.932; Max 12.880; Min −1.640.
  Exact SE of the difference: 1.946.
  Population parameters: difference $5.00; average of the box $15.00 (College), $10.00 (High School);
  SD of the box $10.00 (College), $5.00 (High School); population size 100 each; distribution of the
  box Log Normal; sample size 25 each.
  [Empirical histogram for 10,000 repetitions of the difference of the sample averages]
Source: [TwoBoxModel.xls]MCSim.

Summary

This section has demonstrated a two box model by using a fictional scenario
in which we sampled from artificial high school and college populations. We
showed that the difference of the sample averages is a random variable and
verified the formula for calculating the SE (the square root of the sum of the
squared individual sample SEs).

The TwoBoxModel.xls workbook allows for experimentation. Explore how
the SE of the difference changes as the number of draws increases or the
underlying SDs of the boxes change. Try changing the population distribu-
tions to see how the sampling distribution is affected.

The next section uses a real-world example to illustrate the two box model.
We will not know the true parameters, and thus they will need to be estimated.
It will not be possible to take repeated samples, but we know the one sample
we have is a single outcome from the random process that generated the data.

12.4. A Real Example: Education and Wages

Workbook: CPS90Workers.xls

This section explores an actual example of the two box model. We will
use samples from two populations to make inferences about the difference
between the two population averages. This example will be used to show the
logical order involved in a hypothesis test. We present the research question
and then proceed through a series of steps to answer the question.

The Research Question

Does education increase a person’s wage? More precisely, if other factors that
influence wages are held constant, do college-educated workers earn more
than workers who have only a high school education? Human capital the-
ory says that education increases the productive capacity of individuals. An
individual obtaining an education is like a firm investing in physical capital:
current outlays (tuition and foregone earnings) increase future returns (earn-
ings from future jobs). If education increases productivity, it should increase
wages.

Of course, we cannot simply compare the sample average wages of workers
with college and high school degrees because we know the sample averages
and sample difference are random variables. An average wage for college-
educated workers might be obtained that is higher than the average wage of
their high school counterparts just by pure chance. In other words, perhaps
the population average wages are in fact the same and it was just luck that
we drew a few more highly paid college workers and a few more lower paid
high school workers. A hypothesis test will enable us to handle this issue.

The Null and Alternative Hypotheses

To conduct a hypothesis test, we need to define a null hypothesis and an
alternative hypothesis.

Null Hypothesis: The average wage of workers with a college degree is equal to the
average wage of workers with only a high school degree.

Alternative Hypothesis: The average wage of workers with a college degree is higher
than the average wage of workers with only a high school degree.

Notice that the null represents the default answer that there really is no
difference. Observed sample differences are caused by chance alone. We will
test the null and decide to reject or not reject it.

The Data

The data that we will use to investigate this question come from the
March 1990 Current Population Survey. They consist of two random samples
of workers from the entire population of those people who had a job in
March 1990. For each observation (corresponding to an individual person),
we have values for the following variables:

Education = highest grade completed, in years
Wage = reported hourly wage, in $ per hour

                   Education = 12   Education = 16   Grand Total
  Average of Wage      $9.31           $14.06          $10.61
  StdDev of Wage       $4.99            $7.14           $6.04
  Max of Wage         $40.00           $44.23          $44.23
  Min of Wage          $1.00            $1.38           $1.00
  Count of Wage          819              308            1127

Figure 12.4.1. Summarizing wage data by high school and college.
Source: [CPS90Workers.xls]PivotTable.

The first sample, which has 819 observations, is a random sample of all
those working in March 1990 who had 12 years of education (i.e., a high
school degree). The second sample, which has 308 observations, is a random
sample of all those working in March 1990 who had 16 years of education
(i.e., a college degree).

A PivotTable in the CPS90Workers.xls workbook, shown in Figure 12.4.1,
summarizes the data from the two groups.

Is the Difference Real or Due to Chance?

The college grad sample has a higher average wage than the high school
sample. Have we found the answer to our question? Should we conclude
that the population of people with 16 years of education has a higher average
wage than the population of people with 12 years of education? Not yet.
Although the sample averages support an affirmative answer, there is still the
possibility that the population averages are actually equal and the difference
observed is simply due to the luck of the draw. To determine if this difference
is real or due to chance, we need to use a test of significance. To do this,
we need to construct a box model that represents the data generation process.

Setting Up the Box Model

The first box contains individual wages of workers who have 12 years of
education. The second box contains the wages of workers who have 16 years
of education. Figure 12.4.2 depicts the two box model and actual samples
in this case. We will argue that the data were generated according to
Figure 12.4.2. In fact, the CPS uses a cluster sampling scheme, not the pure
simple random sampling design required by the two box model. The cluster
sampling means our computation of the SE of the difference in the sample
averages is a little off. We can legitimately argue, however, that the data were
generated by a random process. For the purposes of illustrating the two box
model, we will proceed as if the two box model applies.

High School Box: Avg = ?, SD = ?  →  819 draws  →  Sample Avg = 9.31, Sample SD = 4.99
College Box:     Avg = ?, SD = ?  →  308 draws  →  Sample Avg = 14.06, Sample SD = 7.14

Figure 12.4.2. The box model for comparing high school and college wages.

We believe it is of utmost importance to tie the data to a box model. You
cannot use the sophisticated methods explained in this book to determine the
variability of a sample statistic unless the data generation process is explicitly
connected to a box model. Often, the tie will not be exact. In that case, it
is best to state how the actual DGP departs from the ideal box model being
used to justify the application of statistical methods.

Figure 12.4.2 allows us to recast our null and alternative hypotheses in the
language of the box model.

Null Hypothesis: Both boxes have the same average, or, the difference between the
averages is equal to zero.

Alternative Hypothesis: The college box has a higher average than the high school
box, or, the difference between the average for the college box and the average of
the high school box is positive.

Constructing the Test Statistic and Interpreting the Results

With a box model that reflects the data generating process, explicit state-
ments of the null and alternative hypotheses, and sample data, we are ready
to construct the test statistic. We will use the z-statistic because the sample
sizes are large enough that we know, from the central limit theorem, the sam-
pling distribution of the difference of the sample averages is approximately
normal. We know the observed difference: it is 4.75 (=14.06 − 9.31). The null
hypothesis gives us the hypothesized difference, which is zero. To find the
z-statistic, we still need the SE of the difference.

The sample of 819 workers with 12 years of education has an SD of 4.99.
Using this SD as the SD of the high school population, we get an estimated
SE of the sample average equal to

4.99/√819 = 4.99/28.62 ≈ 0.17.

The sample of 308 workers with 16 years of education has an SD of 7.139.
Using this SD as the SD of the college box, we obtain an estimated SE of the
sample average equal to

7.14/√308 = 7.14/17.55 ≈ 0.41.

The SE of the difference between these two sample averages is

√(SEA² + SEB²) = √(0.17² + 0.41²)
               = √(0.0289 + 0.1681)
               ≈ 0.44.

The z-statistic, then, is

4.75/0.44 ≈ 10.7.

The z-statistic tells us that, if the null is true, the observed difference is 10.7
standard errors away from the hypothesized difference of zero.

The P-value for this z-statistic is tiny. We reject the null that there is no
difference in the average wage of high school and college-educated people
in the United States in March 1990 because our sample result (or one more
extreme) is ridiculously unlikely to have been observed if there really were
no difference.
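The arithmetic of the whole test fits in a few lines; a Python check of the calculation above (variable names are ours):

```python
from math import sqrt
from statistics import NormalDist

# Sample summaries from the March 1990 CPS samples in the text
n_hs, avg_hs, sd_hs = 819, 9.31, 4.99       # high school (12 years)
n_col, avg_col, sd_col = 308, 14.06, 7.14   # college (16 years)

se_hs = sd_hs / sqrt(n_hs)                  # ≈ 0.17
se_col = sd_col / sqrt(n_col)               # ≈ 0.41
se_diff = sqrt(se_hs**2 + se_col**2)        # ≈ 0.44

z = (avg_col - avg_hs - 0) / se_diff        # hypothesized difference is zero
p_value = NormalDist().cdf(-z)              # one-sided P-value

print(round(z, 1))                          # 10.7
```

The P-value is so small that it underflows to essentially zero in double precision, which matches the verdict in the text.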

A Brief Note on Confounding

The data confirm that college-educated workers earn more than workers with
only a high school education. Suppose, however, we want to know whether
getting a college education will increase your wage if everything else is held
constant. If we take someone who currently has a high school education
and send that person to college, will that person’s hourly wage increase by
5 dollars? This is a more difficult question to answer, and we may not be able
to answer it if there is confounding. If the two populations are not alike in
every way except for their level of education, then we may have confounding.
The test of our null hypothesis does not tell us whether there is confounding
and may lead us to the wrong conclusion if confounding exists.

Virtually every study shows that better educated workers have higher
wages. Much controversy, however, remains over the interpretation of this
result. Do better educated workers have higher wages because the schooling
improves their productivity or because they were more talented in the first
place? Perhaps people who are less talented get less schooling because they
do not like education. In other words, another factor, innate ability, may be
confounding the comparison between the wages of the two groups. Econometricians
also use the term omitted variable bias to describe this situation.
We discuss this issue in more detail in Chapter 18.

Summary

This section has been devoted to a real-world application of the two box
model with an emphasis on the logic of hypothesis testing. We estimated
the difference in the average wage between college and high school edu-
cated workers with the sample difference. Given a concern that the observed
difference might be due to chance alone, we ran a test of significance.

We began the hypothesis test by explicitly tying the data from the CPS sam-
ple to the two box model. The tie was not perfect, but it was close enough.
The important lesson is that we made an argument for using the two box
model. Without this argument, we cannot justify the use of the formula for
the SE of the difference in the sample averages. The rest of the testing pro-
cedure was fairly mechanical. We constructed a test statistic, computed the
corresponding P-value, and made a decision to reject the null.

12.5. Conclusion

This brief chapter serves as a stepping stone to our eventual goal, the classical
econometric model. By examining a data generation process in which two
groups are being compared (for example, wages of high school versus college
educated people), we are taking a small step toward a regression that explores
the effect of education on earnings.

This chapter especially emphasizes the idea that a sample difference
between two groups is a random variable that changes with each new sample.
Although the sample difference is important, remember that the SE of the
sample difference is also crucial. Without this give or take number, we have
no way of knowing whether the observed sample difference reflects an actual
difference between the two population averages. Every statistic derived from
a data generation process has a sampling distribution, and much effort is
focused on determining the center, spread, and shape of the statistic’s prob-
ability histogram. Of course, if the sample was not generated by a random
process, you have no business applying the methods demonstrated here.

These lessons remain in force as we turn to the next chapter, which intro-
duces the classical econometric model.


12.6. Exercises

Workbook: CPS90ExpWorkers.xls

These exercises are organized around the research question: Does experience
increase a person’s wage in the United States?

Open the Excel workbook, CPS90ExpWorkers.xls. The Intro sheet explains the
variables.

1. Report the average wage for Experienced and Inexperienced workers.
2. The difference between the average wage of the Experienced workers and the
   average wage of the Inexperienced workers is $3.10 per hour. Why can we not
   conclude that experience raises a person's wage based on this fact?
3. Draw a two box model that represents the data generation process.
4. State the null and alternative hypotheses.
5. Find the SE of the difference of the sample averages. Show your work.
6. On the assumption that the null is true, draw a rough sketch of the sampling
   distribution of the difference of the sample averages. Mark the location of the
   $3.10 per hour difference we observed in our sample.
7. Would you reject the null hypothesis? Explain.

References

We drew on Freedman et al. (1998), Chapter 27 for our discussion of the two box
model. The two box model is nonstandard terminology describing a test of
differences between means. Hypothesis testing has long been a source of confusion.
For a well written account of the logic behind hypothesis testing and a fun
introduction to the history of statistics, we recommend

Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in
the Twentieth Century. New York: Henry Holt and Company.

Salsburg’s explanation of the source of the term significant is worth quoting as
follows:

Somewhere early in the development of this general idea, the word significant came to be
used to indicate that the probability was low enough for rejection. Data became significant if
they could be used to reject a proposed distribution. The word was used in its late-nineteenth-
century English meaning, which is simply that the computation signified or showed something.
As the English language entered the twentieth century, the word significant began to take on
other meanings, until it developed its current meaning, implying something very important.
Statistical analysis still uses the word significant to indicate a very low probability computed
under the hypothesis being tested. In that context, the word has an exact mathematical mean-
ing. Unfortunately, those who use statistical analysis often treat a significant test statistic as
implying something much closer to the modern meaning of the word (p. 98).

13

The Classical Econometric Model

. . . the class of populations we are dealing with does not consist of an infinity of
different individuals, it consists of an infinity of possible decisions which might be
taken with respect to the value of y.

Trygve Haavelmo1

13.1. Introduction
This chapter will introduce and discuss the classical econometric box model.
We will use CEM as our acronym for this fundamental model. In other books
and articles, you might see this model referred to as the classical linear model
or the classical regression model. The name is not as important as the content.

The CEM has been by far the most commonly used description of the
data generation process in econometrics. Understanding the requirements,
functioning, and characteristics of the CEM is extremely important because
modeling the data generation process is a crucial step in econometric analy-
sis. Without a model of how the data were generated, inference is impossible.
Subsequent chapters present more complicated box models designed to han-
dle some of the situations in which this basic model deals inadequately with
the data generation process.

Sections 13.2 and 13.3 present a hypothetical example designed to provide
an intuitive understanding of the CEM, and Sections 13.4 and 13.5 describe
the CEM in a more formal way.

13.2. Introducing the CEM via a Skiing Example
Workbook: Skiing.xls

The heart of this chapter, and a crucial idea in econometrics, is the data
generation process (DGP) specified by the CEM. This section uses an extended
hypothetical example to illustrate the DGP embedded in the model. We
could instead have launched into a dry, abstract description of the model and
its requirements, but we think you will have more fun and learn more by
beginning with an example that makes intuitive sense.

1 Haavelmo (1944) in Hendry and Morgan (1995, p. 488).

Super G at the 1998 Nagano Olympics

Olympic skier Picabo (pronounced PEEK-a-boo) Street is poised to come
shooting out of the gate. She will reach speeds in excess of 70 mph as she
completes her Super G run. Her competitors will try to beat her time. One
after the other they come rocketing down the mountain. In the 1998 Winter
Olympic Games in Nagano, Japan, the final standings for the Super G medals
are presented in Figure 13.2.1.

Women Super Giant Slalom Medal Results in the 1998 Winter Olympics

  Medal    Athlete                 Country   Time (min:s.00)   Time (s)
  Gold     Picabo Street           USA       01:18.02          78.02
  Silver   Michaela Dorfmeister    AUT       01:18.03          78.03
  Bronze   Alexandra Meissnitzer   AUT       01:18.09          78.09

Figure 13.2.1. Women super giant slalom results, 1998 Winter Olympics.
Source: [Skiing.xls]Picabo.

We are going to consider ways to model the outcome of this and other
imaginary races from an econometric perspective. Our goal is to give an
informative example of the CEM. We will work toward that model by starting
with one we have already encountered. How can the measurement box model be
used to interpret each individual time?
The Measurement Box Model

A simplistic application of the measurement error DGP would give a cynical
and clearly false explanation of what happened at Nagano as follows. Each
skier actually had the exact same time on her run in the Super G, but the
official clock sometimes registered a faster time, sometimes a slower time.
Picabo Street happened to have been the luckiest skier, and so she got the
gold medal!

Now this story is nonsense, but let us write down the model anyway for
purposes of comparison to more realistic models:

Model 1: Observed Time_i = True Time + ε_i, for i = 1, . . . , n.

Subscript i indexes skiers, and thus, for example, Observed Time_9 would be
the time for the ninth skier. There are n skiers in all. The observed time
for skier i is the true time, which is the same for all skiers (you can tell
because there is no subscript on True Time), plus ε_i, a draw from a
measurement box. All the draws are independent, meaning, for example, that a
clock that was too slow for one skier tells us nothing about the likely amount
of the timing error for any other skier. The SD of the error box depends
on the precision of the timing system, and that precision did not vary during
the competition (e.g., a better timing device was not installed after the second
skier’s run).

Figure 13.2.2. A measurement box model for observed ski times.

Model 1 applies the univariate measurement box model not to the distance
between two mountains, but to the time taken by world-class skiers hurtling
down a mountain. The observed time is composed of two unobservables: (1) a
true, unchanging value, plus (2) a random, chance error term generated by
the measuring device itself. Figure 13.2.2 is a picture of Model 1.

Notice that the measurement box model, as currently implemented, is
based on no information about each skier. It is assumed that the skiers are
identical and that, owing to the vagaries of the timing system, some pick
positively numbered tickets from the box on their way down the mountain
(which is bad because they want to get down there fast!), whereas others draw
negatively numbered tickets. The parameter β0 indicates the true, unknown
time for each skier.

One way to estimate the fixed, unknown True Time would be to take the
sample average of all 11 skiers’ times. Furthermore, if we wanted to pre-
dict any individual skier’s time, we would guess the sample average, give
or take the SD of the sample (which would be our estimate of the SD of
the box). We repeat that Model 1 is just the univariate measurement box
model.
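For concreteness, Model 1's DGP can be sketched in code. Everything here is hypothetical – the true time, the SD of the error box, and the normal shape of the box are our own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

true_time = 78.0    # hypothetical common True Time, in seconds
sd_box = 0.05       # hypothetical SD of the measurement-error box
n_skiers = 11

# Model 1: Observed Time_i = True Time + epsilon_i
epsilon = rng.normal(0, sd_box, n_skiers)   # independent draws from the box
observed = true_time + epsilon

estimate_true_time = observed.mean()        # estimate of the fixed True Time
estimate_sd_box = observed.std(ddof=1)      # estimate of the SD of the box
```

The sample average of the 11 observed times estimates True Time, and the sample SD estimates the SD of the box, just as the paragraph above describes.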


But we ought to reject Model 1. Why is Model 1 unsatisfactory? There
are three good reasons. First, owing to modern technology and the scrutiny
of the entire skiing world, the Olympic timing system is actually quite pre-
cise. Measurement errors for the clock in the Super G are considerably less
than a hundredth of a second (Picabo Street’s winning margin over Michaela
Dorfmeister at Nagano). Second, it seems likely that, even among the very
best skiers in the world, some are better than others, and so the assumption
that the true time was the same for all skiers is silly. Third, even if all the skiers
were equally talented, there is no way they would take the same time coming
down the mountain. Snow, wind, and the path left by other skiers must have
an impact on each skier’s time. Our conclusion is that the measurement box
model, as implemented in Model 1, does not describe this data generation
process.

Notice that we do not blindly assume a data generation process. In this
case, we have rejected Model 1 because it does not accurately depict the way
the observed times were generated. Similarly, we would reject a model of
the data generation process based on the polling box model because the way
observed ski times are generated is not from a box with a fixed average and
a finite number of tickets each representing a single skier.

A New DGP to Describe Observed Ski Times

To represent the data generation process in the skiing example correctly,
we are going to need a new box model. We will call it the classical econo-
metric box model or CEM for short. Before we begin, let us think more
realistically about what causes differences in skiing times. Athletic perfor-
mance clearly depends on raw talent, the time spent practicing the sport,
and luck. Now there is no way to influence raw talent and luck, but it is pos-
sible to adjust time spent practicing. Skiers know that the more you train,
the lower your time, but no one knows how much you can improve by
increasing training. Furthermore, it is possible to measure how much time
an athlete spends training, but it is very hard to measure either talent or
luck.

The box model for skiing must correct the three flaws in the simple measure-
ment error model. First, we will eliminate measurement error as an important
explanation for differences in observed times. There is undoubtedly some
measurement error in the observed ski time, but it is so small in this case
(compared with the other sources of variation in times of skiers) that mea-
surement error can be safely ignored. Second, we explicitly model training
time as a variable that helps to explain differences in the true time of each
skier – more training, we think, yields a lower true time ceteris paribus.

320 The Classical Econometric Model

Third, we will allow observed time to be influenced by two other factors as well: luck
and pure talent.

In this more realistic model, each skier has a true, exact, but unknown time
on a given hill on a given day. That time is determined by his or her training and
talent. What each observed time represents is a composite number formed
according to Model 2:

Model 2: Timei = β0 + β1 · Trainingi + β2 · Talenti + νi , i = 1, . . . , n.

In Model 2, the error term, νi , represents luck: good luck is a negative error
term, reducing the observed time, whereas bad luck is a positive error term.
The observed time is different from the true time because of luck.

Although Model 2 is a much more satisfactory description of the data
generating process, it cannot be estimated. The problem is that Talent is
unobserved: medical science is incapable of measuring raw skiing talent.2
For purposes of estimating the model, talent must be dumped into the error
box. A model we actually could estimate is Model 3:

Model 3: Timei = β0 + β1 · Trainingi + εi , i = 1, . . . , n.

The source of the chance error (εi ) in Model 3 can be found in two places.
First, each error term in part represents the impact of omitted variables. An
omitted variable is an independent variable that influences the dependent
variable, but is not included in the regression model. Although we have high-
lighted natural ability (talent) as an obvious determinant of performance,
there are potentially many more omitted variables such as the motivation
and health of each skier. The second component of the error in each obser-
vation reflects the inherent randomness in the world (slight wind shifts while
flying down the mountain, bumps, etc.) or, in other words, just plain luck.
Thus, Model 3’s error term εi is really the sum of all omitted variables (like
talent), measurement error (which is small compared to the other two sources
of error), and just plain luck (the ν term in Model 2). The i subscript reminds
us that the value of the error term varies from one skier to the next.
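A minimal simulation of this DGP, assuming normally distributed Talent and luck: the β's and the SD of ν match the values used later in Skiing.xls, while the Talent SD and the Training levels are our own made-up numbers.

```python
import random

random.seed(1)

beta0, beta1, beta2 = 100.0, -0.5, -0.2   # parameter values from Skiing.xls
nu_sd, talent_sd = 0.5, 5.0               # talent_sd is an assumed value
n = 25

training = [random.uniform(0, 9) for _ in range(n)]      # observed X
talent = [random.gauss(0, talent_sd) for _ in range(n)]  # unobserved
nu = [random.gauss(0, nu_sd) for _ in range(n)]          # just plain luck

# Model 2 generates the data, but because Talent is unobserved, Model 3's
# error term is the composite epsilon_i = beta2 * Talent_i + nu_i.
epsilon = [beta2 * t + v for t, v in zip(talent, nu)]
time = [beta0 + beta1 * x + e for x, e in zip(training, epsilon)]
```

The econometrician sees only `training` and `time`; `talent` and `nu` stay inside the error box.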

Figure 13.2.3 is a graphical representation of the CEM treatment of
Model 3. Each ticket drawn from the box is a composite error term rep-
resenting the effect of talent, luck, measurement error, and other factors.
Because each draw from the error box is made at random and with replace-
ment, the chance errors are independent of the included X variables. In this
case, that means the amount of training a skier has tells us nothing about his
or her talent. Because the draws are with replacement, the box is the same
for all of the skiers. That rules out situations in which, for example, skiers with more training have more consistent times, which would imply that there is a smaller spread in better trained skiers' error terms.

2 Researchers, however, do try to measure the relationship between exercise and how well the body performs using regression techniques. For example, see Winter, Eston, and Lamb (2001).

Figure 13.2.3. The classical econometric box model. [The figure shows Observed Timei = β0 + β1 · Trainingi + a draw from an error box reflecting the impact of omitted variables and luck; the box has Average = 0 and SD = ?.]

Comparing Box Models

Suppose we knew the values of the four unknown parameters in Model 2: the
intercept, two slopes, and SD of the error box. Then for each individual skier
we could predict the amount of time he or she would take coming down the
mountain given the number of hours that skier trained and his or her talent
level. Mathematically speaking, this situation would be just like the Hooke’s
law example. If we knew the exact physical characteristics of the spring (the
intercept and slope parameter values), we could compute the exact length of
the spring for different weights. The skier’s true time (with a zero draw from
the box) and the exact length of the spring are deterministic functions of the
parameters and independent variables. In both cases, the observed values of
the dependent variables are different from the true, deterministic values. This
is due to the presence of the error terms in the models.

Despite the formal similarity, there is a big philosophical difference
between the two models. That difference lies in the explanation for the pres-
ence of the error terms. In the CEM, the error term no longer reflects an
imperfect measurement instrument. Instead it summarizes the impact of all
the factors not explicitly included in the model: motivation, quality of train-
ing, equipment, and plain old luck. For economists, this idea that the error
term reflects factors influencing the outcome that they cannot measure is an
appealing concept. Because it is usually impossible for economists to collect
information on all the relevant variables, we attribute variation in observed
dependent variables to factors that have not been measured.


Summary

This section introduced our hypothetical skiing example. It forms the backbone of our presentation of the CEM. This discussion may seem overly abstract; therefore, let us implement these notions in a concrete, visual presentation to explain more clearly what is going on. The next section shows how Excel can be used to simulate the ideas presented here and gives you a chance to literally see the data generation process of the CEM in action.

13.3. Implementing the CEM via a Skiing Example

Workbook: Skiing.xls

To demonstrate the operation of the classical econometric box model, let
us imagine an experimental setup that could generate the data. Suppose the
Austrian Ski Federation, stung by its defeat at Nagano, is determined to figure out the effect of training on performance. Therefore, the federation
has decided to perform a series of tests designed to determine the effect of
training on ski times. They will take groups of 25 skiers and apply a training
regimen to each skier. One will train 8 hours per day, whereas another might
train only 2 hours per day. After 6 months, the 25 skiers will race and their
times will be recorded.

Open the Excel file Skiing.xls. Go to the sheet called EstimatingBeta1,
which implements the DGP of the classical econometric model. The purpose
of this workbook is to clarify the roles played by the observed and unobserved
variables in the data generating process of the CEM.

Let us take a tour through the EstimatingBeta1 sheet, as depicted in Fig-
ure 13.3.1. As you work on understanding the information presented on this
sheet, we suggest clicking on cells to reveal formulas and noting which cells
bounce and which remain constant as we simulate the data generation pro-
cess. As usual, all of the parameters and variables that would be unknown to
the econometrician are in red text. Training and Observed Time are in black
text because they are observed.

The key parameter of interest is β1, the coefficient on Training Time, which
has been set at −0.5 and is in units of seconds/hours per day. This means
that, for every additional hour of training per day, the skier’s time falls by
0.5 seconds. Although this may not seem like much, when you consider that
races are won by hundredths of a second, maybe training that extra hour
every day really is worth it. Of course, we have cooked all of these data
and really do not know how much training affects skiing performance; how-
ever, we do think training, whether for skiing or in the classroom, really
matters.

Implementing the CEM via a Skiing Example 323

Figure 13.3.1. The skiing example.
Source: [Skiing.xls]EstimatingBeta1.
[The EstimatingBeta1 sheet displays the population parameters (β0 = 100 s, β1 = −0.5 s/hr per day, β2 = −0.2 s/index points, SD of Nus = 0.5 s), the sample correlation between Training and Talent (0.29 in this sample), a regression of Observed Time on Training for Model 3 (b1 = −0.637, est. SE(b1) = 0.088, b0 = 100.612, est. SE(b0) = 0.472, R2 = 0.695, RMSE = 1.270), a table of each skier's Training, Talent, True Time, composite error, and Observed Race Time, and a chart of Observed Time as a function of Training.]

Click on one of the Observed Time cells in column G. We clicked on cell
G18 to reveal the following formula:

= ROUND(Beta0 + Beta1*Training + Beta2*Talent + Nu, 2)

The formula puts Model 3 of the previous section into play. Nu is an error term
reflecting luck and measurement error. Because we do not observe talent,
our error term is actually Beta2*Talent+Nu. The values of this composite
error term are given in column F. The ROUND function is used to force the
computed result to be rounded to the second decimal place.
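In Python terms, the spreadsheet formula for a single skier's observed time amounts to the following; the training, talent, and luck values plugged in here are placeholder examples, not the sheet's actual draws.

```python
# Mirror of the Excel formula:
#   = ROUND(Beta0 + Beta1*Training + Beta2*Talent + Nu, 2)
beta0, beta1, beta2 = 100.0, -0.5, -0.2   # parameter values from the sheet
training = 7.0     # observed hours of training per day (example value)
talent = 1.503     # unobserved talent index (example value)
nu = -0.447        # unobserved luck (example value)

observed_time = round(beta0 + beta1 * training + beta2 * talent + nu, 2)
print(observed_time)  # → 95.75
```

Only `training` and `observed_time` would be visible to the econometrician; `talent` and `nu` are folded into the composite error.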

A key assumption of the CEM is that the omitted X ’s that help make up
the tickets in the error box must be independent of the included X ’s. For this
to be true when we implement the skiing example, the correlation between
Talent and Training must be zero on average. Cell B6, which reports the sam-
ple correlation between Talent and Training, shows that in each individual
sample the correlation will not be exactly zero. As you repeatedly draw sam-
ples, however, you will observe that this correlation bounces around zero, as
claimed by the CEM. Use the Monte Carlo simulation add-in to demonstrate that the correlation between Training and Talent is indeed zero on average.
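A rough stand-in for that Monte Carlo exercise in Python, with made-up Training levels held fixed and a fresh set of normally distributed Talents drawn for each sample:

```python
import random
import statistics

def correlation(xs, ys):
    """Sample (population-SD) correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

random.seed(2)
training = [random.uniform(0, 9) for _ in range(25)]  # fixed across samples

# Draw 1,000 new sets of 25 Talents and record each sample correlation.
corrs = []
for _ in range(1000):
    talent = [random.gauss(0, 5) for _ in range(25)]
    corrs.append(correlation(training, talent))

print(round(statistics.mean(corrs), 3))  # bounces around zero
```

Any single sample correlation can be noticeably different from zero, but the average across many samples settles near zero, as the CEM requires.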


Figure 13.3.2. Results of one race.
Source: [Skiing.xls]EstimatingBeta1.
[The chart plots each of the 25 skiers' Observed Time (in seconds, roughly 92 to 104 on the vertical axis) against Training (0 to 10 hours per day), showing a downward-sloping scatter of lettered points.]

In Chapter 18, we will see how a nonzero long-run correlation between the error term and the X variable affects the sampling distributions of regression slopes, causing biased estimates of the slopes.

Another condition required by the CEM is that the composite error terms
shown in column F, consisting of Talent and luck, vary with every new sample
of 25 skiers. Thus, we cannot describe the DGP as taking the same 25 people
with the same talents and racing them over and over. If, for example, the first
skier always had more talent than the second skier, then the two skiers would
not be drawing from the same error box. To meet the requirements of the
CEM, we must imagine that each time the experiment is run, the Austrian Ski Federation gets a set of 25 new skiers and forces the training protocol (in column B) upon them; thus, we are getting 25 new Talents in every sample.3

Having seen how the observed times are generated, let us race. Click on
the Race button. A typical outcome of a race from a set of 25 skiers might
look like Figure 13.3.2.

There are several things to notice about this chart. First, there seems to
be a negative relationship between time in the race and training time. This is
not an accident because the value of β1 has been set to −0.5, meaning that, if everything else is held constant, an increase of 1 hour per day in training time results in a decrease in time on the course of 0.5 seconds. Second, even
though several skiers have the same amount of training time, they do not
have the same race time.
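Both observations can be checked on simulated race data. The sketch below generates one hypothetical race using the Skiing.xls parameter values (the Talent SD is our own assumption) and computes the OLS slope and intercept by hand; the slope estimate should land in the neighborhood of β1 = −0.5.

```python
import random
import statistics

random.seed(3)
beta0, beta1, beta2, nu_sd = 100.0, -0.5, -0.2, 0.5  # Skiing.xls values
talent_sd = 5.0                                      # assumed value

training = [random.uniform(0, 9) for _ in range(25)]
time = [beta0 + beta1 * x
        + beta2 * random.gauss(0, talent_sd)  # unobserved talent
        + random.gauss(0, nu_sd)              # just plain luck
        for x in training]

# OLS by hand: b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
xbar, ybar = statistics.mean(training), statistics.mean(time)
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(training, time))
      / sum((x - xbar) ** 2 for x in training))
b0 = ybar - b1 * xbar
print(round(b1, 3), round(b0, 3))
```

Because the composite error varies from skier to skier, the fitted slope differs from −0.5 in any one race, just as the estimated slope in Figure 13.3.1 does.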

3 Data sets in which the same individuals are observed more than once are called panel data. Models for
the DGP appropriate to panel data are beyond the scope of this book. See Wooldridge (2003).

