WHY THE FORMAL METHOD IN STATISTICS IS USUALLY THEORETICALLY INFERIOR
Julian L. Simon
You are standing in the warehouse of a playing-card factory
that has been hit by a tornado. Cards are scattered everywhere,
some not yet wrapped and others ripped out of their packages.
The factory makes a variety of decks - for poker without a joker,
poker with a joker, and pinochle; magician's decks; decks made of
paper and others of plastic; cards of various sizes; and so on.
Two hours from now a friend will join you for a game of
near-poker with these cards. Each hand will be chosen as randomly
as possible from the huge heap of cards, and then burned. What
odds should you attach to getting the combination two-of-a-kind -
two cards of different or the same suit but of the same number or
picture - in a five-card draw?
Ask this question of a professional probabilist or
statistician, and - based on the small sample I have taken - s/he
is likely to say "I don't have enough information". There is
even a name for this sort of question: Problems lacking
structure.
Ask the same question of a class of high-school students or
college freshmen and you will quickly get the suggestion, "Draw
hands from the card pile the same way you will draw them when you
play later, and see how often you get two-of-a-kind".
Who produces the better (more useful) reply - the "naive"
students, or the learned statistician/probabilist?
(If the question had been framed as the probability of
getting [say] the jack of spades in a poker hand drawn from the
pile, the probabilist probably would think of suggesting a
sample. Apparently it is the combination of elements that leads
the trained person to say that the job cannot be done.)
This case reminds one of the three-door problem, in which
resampling immediately produces the correct answer whereas
trained intellects almost uniformly arrive at the wrong answer.
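The try-it approach to the three-door problem can be sketched in
a few lines. This is only an illustration under the usual
statement of the puzzle, which the text above does not spell out:
the host knows where the prize is, always opens an empty door
other than the player's pick, and then offers a switch. The names
play and win_rate are made up for this sketch.

```python
import random

def play(switch, rng):
    """Play one round; return True if the player ends up with the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # Assumed rule: the host opens a door that is neither the pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one door that is neither the first pick nor the opened one.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

def win_rate(switch, n_trials=10_000, seed=None):
    """Proportion of rounds won under a fixed switch-or-stay policy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(n_trials)) / n_trials

print("stay:  ", win_rate(switch=False, seed=1))
print("switch:", win_rate(switch=True, seed=1))
```

Under these assumed rules the switching policy should win about
two times in three, and the staying policy about one time in
three.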
The untutored person's try-it procedure is, in this case,
not only as good as any procedure can be, but better than any
formal procedure can be, even in principle. One reason is that
the probability of any given hand in the warehouse is affected by
the physical properties of the cards - their sizes and materials.
(The various cards are not perfectly alike, just as a die cannot
be perfectly true; even a bit of purposeful shaving of a die's
edge can affect the odds enough to enable a gambler to cheat
successfully.) But an empirical estimation with an actual
sample-and-deal procedure includes the effect of these physical
influences, whereas any more abstract approach has great
difficulty doing so.
Another issue: You might also want to estimate the chance of
a three-of-a-kind hand. You quickly recognize that this event
does not happen very often, and it will take many hands to
estimate its probability. So you consider this procedure: take a
sample of (say) 1000 cards, record their values, transform those
values to a form that a computer can read, then program the
computer to choose (with replacement, now) five cards at random
from the 1000, and examine many trial hands (say 10,000) to see
whether there are three-of-a-kind. The computer procedure should
be as close an analog as possible to physically shuffling and
dealing five-card hands from the 1000 sampled cards. Please
notice that one need never know how many of each type (that is,
face value) of card the sample contains. Rather, as each of the
cards is examined, its value is transmitted to the computer. It
is unnecessary to calculate any sample space or any partition of
it; one never needs to know that there are 2,598,960 or
whatever number of possible poker hands (Goldberg, 1960, p. 305).
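The computer procedure just described might look something like
the following. It is a sketch, not a definitive implementation.
The list sample stands in for the 1000 recorded card values and
is invented here for illustration, and "three-of-a-kind" is
assumed to mean exactly three cards of one value with the other
two unlike each other (so a full house does not count).

```python
import random
from collections import Counter

def is_three_of_a_kind(hand):
    """True when one value appears exactly three times and the other two differ."""
    return sorted(Counter(hand).values()) == [1, 1, 3]

def estimate_three_of_a_kind(sampled_values, n_hands=10_000, seed=None):
    """Deal n_hands five-card hands, with replacement, and count the successes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_hands):
        hand = rng.choices(sampled_values, k=5)  # drawing with replacement
        if is_three_of_a_kind(hand):
            hits += 1
    return hits / n_hands

# Hypothetical stand-in for the 1000 card values recorded in the warehouse.
values = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A", "Joker"]
sample = random.Random(0).choices(values, k=1000)

print(estimate_three_of_a_kind(sample, seed=1))
```

Note that the program never tallies how many cards of each value
the sample contains, and never enumerates a sample space.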
A probabilist might suggest computing the chance of
three-of-a-kind from the same 1000 pieces of information by using
probability theory. Both these procedures will arrive at much
the same result. Both fail to take account of physical factors -
size, and type of material - that might affect physical trials
with the 1000 sampled cards. The simulation will be slightly
less "exact" than the theoretical calculation, the lesser
exactness being made as small as desired by increasing the number
of computer trials; the loss of accuracy surely will be very
small relative to the sampling error deriving from choosing the
1000 cards from the huge pile - including both the random-
sampling error and the bias due to not drawing the sample
randomly. And of course the formal calculation in this case will
be quite tricky and prone to error. It must assess the size of
the sample space of three-of-a-kind hands when the numbers of
cards of various values will differ, both because the factory
makes different numbers (no jacks, queens, and kings in some
decks, for example) as well as because of the inaccuracy due to
the sample of only 1000 cards. In contrast, the sample space
need never be known for physical or computer resampling.
EXPLANATION OF THE ADVANTAGE OF RESAMPLING
Lighter Conceptual Burden
In general, the conceptual burden in resampling is much
slighter than in probability theory; this is one of resampling's
main advantages. One does not need to be able to add or even to
count in order to conduct individual experimental trials. One
only needs to know the concept of counting, and also the concept
of a ratio, so as to (first) keep a record of the numbers of
successful and unsuccessful trials, and (second) add to get the
total trials and divide to get the ratio of successful to
total. Certainly the discipline that applauds the likes of Peano,
Russell, and Whitehead for boiling down mathematics to its most
fundamental elements should have some appreciation for an
intellectual method that gets along so successfully with so
little recourse to higher abstractions.
Consider, for example, the case of the probabilities of
various numbers of points when throwing two dice (refer to
Goldberg, 1960, p. 158ff). When specifying the sample space,
etc., one needs to add the two top faces of the dice to determine
the range of the function. With simulation it is not necessary
to ever determine this range; one simply tosses the two dice and
inspects the outcomes. One can ask the probability of getting
"13" (or any other number) and get an answer experimentally
without knowing the range in advance.
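As an illustration, the two-dice experiment can be imitated on a
computer. The sketch below assumes two fair six-sided dice, which
is an idealization of any real pair; the function name is made up
for the example. The program simply tallies whatever totals turn
up and is never told the range of possible sums.

```python
import random
from collections import Counter

def estimate_sum_probabilities(n_throws=10_000, seed=None):
    """Throw two dice n_throws times; return the observed share of each total."""
    rng = random.Random(seed)
    tally = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_throws))
    return {total: count / n_throws for total, count in tally.items()}

probs = estimate_sum_probabilities(seed=0)
print(probs.get(7, 0.0))   # estimated chance of a 7
print(probs.get(13, 0.0))  # 0.0, because a 13 never turns up
```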
Reducing the Extent of Abstraction from Actual Experience
Robert Shannon, in a book on Systems Simulation, constructs
a continuum from "Physical models" to "Scaled models" to "Analogy
models" to "Computer simulation" to "Mathematical models" (1975,
p. 8). (I would add experimentation with the actual material of
interest as a stage even less abstract than Physical models.) At
each successive stage of translation to greater abstraction one
runs the risk of losing some important aspect of experiential
reality, and of introducing misleading assumptions and
simplifications. This argues for abstracting as little as
possible, doing so only to the extent that it is necessary.
As Shannon's continuum suggests, simulation methods in
statistics (with or without a computer) are less abstract than
are distributional and formulaic methods, and they should be less
at risk of error. This speculation jibes with the experimental
evidence that people can attain more correct answers to numerical
problems with resampling methods than with formulaic methods,
when given equal amounts of instruction (Simon, Atkinson, and
Shevokas, 1976).
Of course the optimal level of abstraction depends upon the
circumstances. If one wants to estimate the probability of a
given sum with four dice in order to maximize one's chance of
winning with those particular dice, experimenting with those very
dice is likely to be optimum, but if one wants to know the odds
with four dice in other circumstances, a more abstract approach
may be better. However, there are very few circumstances in
which the formulaic and distributional abstractions are likely to
be better than Monte Carlo methods (lack of data being one such
circumstance, and low probability being another).
Operationalizing the Problem
A third virtue of resampling may be stated as: If you
understand the posing of the problem operationally, you
automatically will obtain the correct answer. For example,
consider this probability puzzle from Lewis Carroll's Pillow
Problems (by way of Martin Gardner, correspondence, May, 1993):
A bag contains one counter, known to be either white or
black. A white counter is put in, the bag shaken, and
a counter drawn out, which proves to be white. What is
now the chance of drawing a white counter?
The issue is, do I state the problem correctly in steps 1-4
below? If I do, that implies that the repetition of the process
in those steps will lead to a correct answer to the problem.
1. Put a white counter (later have the computer call it "7"
to avoid confusion) or a black counter (call it "8") in the urn
with probability .5.
2. Put in a white and shuffle.
3. Take out a counter. If black, stop.
4. (If result of (3) is white): Take out the remaining
counter, examine, and record its color.
5. Repeat steps 1-4 (say) 1000 times.
6. Count how many trials yielded a white first.
7. Count the whites ("7s") recorded in step 4 and compute their
proportion among the "white first" trials.
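The steps above can be sketched as a short program. The codes 7
and 8 follow the text; the function name and the use of a list
for the bag are choices made for this illustration only.

```python
import random

WHITE, BLACK = 7, 8  # the numeric codes suggested in step 1

def estimate_white_remaining(n_trials=1000, seed=None):
    """Proportion of 'white first' trials in which the remaining counter is white."""
    rng = random.Random(seed)
    white_first = 0
    white_remaining = 0
    for _ in range(n_trials):                 # step 5: repeat the trial
        bag = [rng.choice([WHITE, BLACK])]    # step 1: unknown counter, 50/50
        bag.append(WHITE)                     # step 2: add a white and shuffle
        rng.shuffle(bag)
        first = bag.pop()                     # step 3: draw a counter
        if first == BLACK:
            continue                          # black first: stop this trial
        white_first += 1                      # step 6: tally "white first"
        if bag[0] == WHITE:                   # step 4: examine the remaining counter
            white_remaining += 1
    # step 7: proportion of whites among the "white first" trials
    return white_remaining / white_first if white_first else float("nan")

print(estimate_white_remaining(n_trials=1000, seed=0))
```

With enough trials the proportion should settle near 2/3, which
is the value a Bayes' rule calculation gives for this puzzle.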
The benefits of the operationalization of problems that
occurs with simulation can be seen in a different way in another
problem of Lewis Carroll's:
Given that there are 2 counters in a bag, as to which all that
was originally known was that each was either white or
black. Also given that the experiment has been tried a
certain number of times, of drawing a counter, looking
at it, and replacing it; that it has been white every
time...What would then be the chance of drawing white?
(p. 15).
This problem was an eye-opening experience for me. First I
wrote down a set of steps to handle the problem with white and
black balls ("counters"). But I did not actually execute the
procedure. Instead, while I was waiting for an associate to
write a computer program to solve the problem, following the
steps I had outlined, I set out to explain the problem logically.
I wrote five nice pages of what I thought to be clear
explanation.
A few days later I reread the steps I had written down. But
now I found that I could not understand the logic. This
experience shows how easy it is to get confused with Bayesian
problems of this sort if one works analytically rather than with
simulation. So I tried harder to create a simulation - and
harder - and harder. And then I found that I simply could not
create a simulation that would model the problem as Carroll wrote
it (and as I understood it). Apparently I was as confused as
anyone could be.
What to do? I decided to go back to my very basic
principle: There must be a way to physically model every
meaningful question in probability and statistics. If one cannot
find a way to model a simulation for the problem, maybe there is
something wrong with the problem rather than with my modeling.
And indeed, when we examine it closely, we may see that Carroll's
problem is not operational and hence not meaningful.
The difficulty turns out to lie in Carroll's phrase "given
that the experiment has been tried a certain number of times, of
drawing a counter, looking at it, and replacing it; that it has
been white every time". In Carroll's solution he indicates that
he believes that it is possible to infer a probability for the
next trial on the basis of a series of trials that are all
successes. This is a famous formula in probability theory - that
the probability is (n+1)/(n+2), where n is the number of observed
successes. But probability theorists such as Feller have argued
(correctly, in my view) that this formula is not meaningful. And
the fact that it is not possible to model the formula
meaningfully in this context confirms that theoretical analysis.
So once again the act of attempting to create an operational
simulation of a problem and then actually executing the procedure
has kept our feet on solid ground and off the slippery slope into
confusion or meaninglessness.
LIMITS OF THE RESAMPLING METHOD
Low Probabilities
Can the formal method be better in any respect? Yes, it
can. If you want to estimate the chance of a royal flush in
poker, which probably would happen only once in hundreds of
thousands or millions of trial hands, taking samples by sitting
on the floor of the warehouse for a few hours and dealing hands
will not produce a sound estimate. And even computer sampling
might be much less accurate than analysis without an inordinate
amount of computer time devoted to the problem.
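A rough calculation shows how much sampling a rare hand demands.
The sketch below assumes an ordinary 52-card deck dealt without
replacement, which the warehouse pile is not, so the figures are
indicative only. It uses the standard binomial formula for the
standard error of an estimated proportion; the function name is
made up for the example.

```python
from math import comb, sqrt

# Four royal flushes among all five-card hands from a standard 52-card deck.
p_royal = 4 / comb(52, 5)

def relative_standard_error(p, n_trials):
    """Standard error of the estimated proportion, as a fraction of p itself."""
    return sqrt((1 - p) / (n_trials * p))

print(f"about 1 hand in {1 / p_royal:,.0f}")
for n in (10_000, 1_000_000, 100_000_000):
    expected_hits = n * p_royal
    rse = relative_standard_error(p_royal, n)
    print(f"{n:>11,} hands: expect {expected_hits:8.2f} hits, relative error {rse:.2f}")
```

With ten thousand hands the expected number of royal flushes is
far below one, so the estimate is dominated by chance; only with
many millions of hands does the relative error become small.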
But will the formal method surely be better for the royal
flush? No. There is an excellent chance that anyone except a
very skilled probabilist will use the wrong calculating formula,
and the erroneous answer might well be worse than no answer at
all, and worse than computer sampling or perhaps even sampling by
hand. This realistic possibility of conceptual analytic error
cannot be ignored in any practical situation. It is as much a
source of possible error as the sampling procedure, physical
characteristics of the cards, and unsound computer programming if
a computer is used. Just as with the calculation of the
possibility of a disaster at a nuclear reactor, each possible
source of trouble must be gauged and allowed for in proportion to
its likely importance. None can be dismissed as being avoidable
"in principle" by proper handling.
Small Samples
Imagine a sample of the heights of four persons. You wish
to estimate a confidence interval for the population mean or
median. It is rather obvious that the interval should go beyond
the range of the four observations, but a resampling procedure
will never give that result. Does this mean that resampling is
inferior here to the conventional method using (say) the t test?
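The limitation is easy to see by trying it. In the sketch below
the four heights are invented for illustration, and the simple
percentile interval is only one of several ways to turn resamples
into confidence limits. Whatever the data, the mean of a resample
drawn from four observations cannot fall outside those four
observations.

```python
import random
from statistics import mean

def bootstrap_means(data, n_resamples=10_000, seed=None):
    """Resample the data with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [mean(rng.choices(data, k=n)) for _ in range(n_resamples)]

def percentile_interval(values, level=0.95):
    """Cut off equal tails of the sorted resample statistics."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    lo = ordered[int(tail * len(ordered))]
    hi = ordered[int((1 - tail) * len(ordered)) - 1]
    return lo, hi

heights_cm = [162.0, 170.5, 175.0, 181.5]  # hypothetical observations

means = bootstrap_means(heights_cm, seed=0)
low, high = percentile_interval(means)
print(low, high)
# True: a resample mean cannot leave the observed range
print(min(heights_cm) <= low and high <= max(heights_cm))
```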
Implicit in the conventional method is an assumption about
the shape of the distribution. Making this assumption is in no
way different in principle from a Bayesian prior. And the nature
of the assumption is crucial. An assumption that would be
appropriate for heights would not be appropriate for incomes.
Once we have established that it is necessary to bring
outside information and judgment to bear, we can then consider
doing so with the resampling method as well as the conventional
method. We need not enter into technical details here, but there
are many possible ways to coordinate the observations to any
shape of distribution in such fashion as to estimate its
dispersion, and then to draw samples from the distribution to
estimate confidence limits. This would not seem inferior to the
conventional method. And if one made the assumption of a
peculiar type of distribution, the advantage would seem to be
with the resampling method, though this subject needs more
exploration.
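One of the many possible ways of doing this can be sketched as
follows. Everything specific here is an assumption made for
illustration: the heights are invented, a normal shape is assumed
for the population, and its dispersion is taken to be the sample
standard deviation. Samples of four are then drawn from that
assumed distribution rather than from the four observations
themselves.

```python
import random
from statistics import mean, stdev

def model_based_interval(data, n_resamples=10_000, level=0.95, seed=None):
    """Confidence limits for the mean from samples drawn out of an assumed normal."""
    rng = random.Random(seed)
    center, spread = mean(data), stdev(data)  # fit the assumed shape to the data
    n = len(data)
    simulated = sorted(
        mean([rng.gauss(center, spread) for _ in range(n)])
        for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lo = simulated[int(tail * n_resamples)]
    hi = simulated[int((1 - tail) * n_resamples) - 1]
    return lo, hi

heights_cm = [162.0, 170.5, 175.0, 181.5]  # hypothetical observations
print(model_based_interval(heights_cm, seed=0))
```

This simple version treats the fitted dispersion as if it were
known exactly, so it probably understates the uncertainty with
only four observations. A different assumed shape would be
handled by replacing the call to rng.gauss with a draw from that
distribution.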
WHAT ABOUT "USUALLY"?
The title of this article says that the formal method is
"usually" inferior. This assertion assumes that most
applications of probability and statistics deal with situations
and probabilities that lend themselves well to direct physical
sampling and/or to the resampling procedure on the computer.
This very general assertion, of course, might be refuted by
systematically-gathered evidence. What is most important,
however, is not the general assertion but rather choosing the
method that is right for each particular situation.
The card-warehouse example lacks realism. But estimating
the probability that there will be two faults in a particular
piece of machine output, where the faults seem to occur
independently of each other, is not very dissimilar,
though the probability model is rather different. And a quite
analogous realistic set of problems was the basis for Galileo's,
and then Pascal's and Fermat's, foundational work with dice games
in formal probability theory that proceeded by assessing the
sample space and partitions of it. But experimentally estimating
the odds as gamblers previously had done had led to sounder
answers than even such great minds as Gottfried Leibniz had
arrived at with deductive methods (cited by Hacking, 1975, p.
52).
Why argue that formal methods are often inferior in
principle? One of the objections to resampling in statistics is
that it is "only" an imperfect substitute for formal methods, and
that the passage to formal methods represents an advance over
simulation methods. For example, when William Kruskal compared
the early statement of resampling methods in the stark terms of
the necessary operational procedures, versus developments in the
literature later on, he dismissed the importance and value of the
former by saying that the latter embodies "real mathematics"
(personal correspondence, 1984).
There is an important analogy between the lack of exactness
in resampling and the movement in modern physics and mathematics,
since Poincare and Bohr, away from Newtonian deterministic
analysis of closed systems and toward non-deterministic analysis
of open systems. (See Ekeland, 1988, for an illuminating
discussion of this movement.) Probability theory is a set of
exact closed-form replicas of inexact open physical situations,
of which the card warehouse is an example. (Taking a sample of
1000 cards from the warehouse and then converting them to equally
weighted entities converts the open system to a closed system.)
That is, when calculating the probability of two-of-a-kind in a
poker hand, the sample space and the partition containing that
subset are exact numbers even though in any actual situation
there are incalculable elements such as the different weights of
the cards due to the different amounts of ink on them, their
slightly different sizes, and so on.
I am not criticizing the exact model for not being an
inexact replica, any more than a photograph should be criticized
for not being a perfect replica of the scene it portrays. But to
claim that the photograph is a truer form than is the scene
itself, or to claim that probability theory is more exact than a
physical manipulation which is the very subject of interest -
that is, to claim that the calculation of getting a pair of "2s"
with two given dice is more exact than a million throws of the
same two dice - is hardly supportable.
The probabilist will reply that the calculation does not
refer to a particular pair of dice. But the scientist and the
decision maker are always interested in some particular physical
reality - a given comet, or the price of corn tomorrow - and if
probability theory is to be judged by other than an esthetic
test, it must be judged on its helpfulness in these particular
situations.
In contrast, resampling - especially physical experiments
with the elements that constitute the situation to be
estimated - is inescapably inexact. It is ironic that it is
criticized for that mirroring of reality.
REFERENCES
Ekeland, Ivar, Mathematics and the Unexpected (Chicago: U.
of Chicago Press, 1988).
Feller, William, An Introduction to Probability Theory and
Its Applications (New York: Wiley, 1950).
Goldberg, Samuel, Probability - An Introduction (New York:
Dover Publications, Inc., 1960).
Hacking, Ian, The Emergence of Probability (New York:
Cambridge U. P., 1975), pp. 166-171.
Shannon, Robert, Systems Simulation (Englewood Cliffs:
Prentice-Hall, 1975).