WHY THE FORMAL METHOD IN STATISTICS IS USUALLY THEORETICALLY INFERIOR
        
                                 Julian L. Simon
        
        
             You are standing in the warehouse of a playing-card factory 
        
        that has been hit by a tornado.  Cards are scattered everywhere, 
        
        some not yet wrapped and others ripped out of their packages.  
        
        The factory makes a variety of decks - for poker without a joker, 
        
        poker with a joker, and pinochle; magician's decks; decks made of 
        
        paper and others of plastic; cards of various sizes; and so on.
        
             Two hours from now a friend will join you for a game of 
        
        near-poker with these cards. Each hand will be chosen as randomly 
        
        as possible from the huge heap of cards, and then burned. What 
        
        odds should you attach to getting the combination two-of-a-kind - 
        
        two cards of different or the same suit but of the same number or 
        
        picture - in a five-card draw?
        
             Ask this question of a professional probabilist or 
        
        statistician, and - based on the small sample I have taken - s/he 
        
        is likely to say "I don't have enough information".  There is 
        
        even a name for this sort of question: Problems lacking 
        
        structure.  
        
             Ask the same question of a class of high-school students or 
        
        college freshmen and you will quickly get the suggestion, "Draw 
        
        hands from the card pile the same way you will draw them when you 
        
        play later, and see how often you get two-of-a-kind". 
        
             Who produces the better (more useful) reply - the "naive" 
        
        students, or the learned statistician/probabilist? 
        
             (If the question had been framed as the probability of 
        
        getting [say] the jack of spades in a poker hand drawn from the 
        

        pile, the probabilist probably would think of suggesting a 
        
        sample.  Apparently it is the combination of elements that leads 
        
        the trained person to say that the job cannot be done.)
        
             This case reminds one of the three-door problem, in which 
        
        resampling immediately produces the correct answer whereas 
        
        trained intellects almost uniformly arrive at the wrong answer.
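
             The try-it approach to the three-door problem can be sketched 
        in a few lines of Python.  This is only an illustration; it assumes 
        the usual rules (the host always opens an unchosen door that hides no 
        prize), which the text above does not spell out, and the function 
        names are invented for the example.

```python
import random

def three_door_trial(rng, switch):
    """Play one round; return True if the contestant ends up with the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # Assumed rule: the host opens a door that is neither the pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

def win_rate(switch, trials=20_000, seed=1):
    """Proportion of rounds won when always staying or always switching."""
    rng = random.Random(seed)
    wins = sum(three_door_trial(rng, switch) for _ in range(trials))
    return wins / trials

stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"stay: {stay_rate:.3f}  switch: {switch_rate:.3f}")
```

        Under these assumed rules the two proportions should settle near 
        one-third for staying and two-thirds for switching.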
        
             The untutored person's try-it procedure is, in this case, 
        
        not only as good as any procedure can be, but better than any 
        
        formal procedure can be, even in principle.  One reason is that 
        
        the probability of any given hand in the warehouse is affected by 
        
        the physical properties of the cards - their sizes and materials. 
        
        (The various cards are not perfectly alike, just as a die cannot 
        
        be perfectly true; even a bit of purposeful shaving of a die's 
        
        edge can affect the odds enough to enable a gambler to cheat 
        
        successfully.)  But an empirical estimation with an actual 
        
        sample-and-deal procedure includes the effect of these physical 
        
        influences, whereas any more abstract approach has great 
        
        difficulty doing so. 
        
             Another issue: You might also want to estimate the chance of 
        
        a three-of-a-kind hand. You quickly recognize that this event 
        
        does not happen very often, and it will take many hands to 
        
        estimate its probability.  So you consider this procedure: take a 
        
        sample of (say) 1000 cards, record their values, transform those 
        
        values to a form that a computer can read, then program the 
        
        computer to choose (with replacement, now) five cards at random 
        
        from the 1000, and examine many trial hands (say 10,000) to see 
        
        how often three-of-a-kind occurs.  The computer procedure should 
        
        be as close an analog as possible to physically shuffling and 


        dealing five-card hands from the 1000 sampled cards.  Please 
        
        notice that one need never know how many of each type (that is, 
        
        face value) of card the sample contains.  Rather, as each of the 
        
        cards is examined, its value is transmitted to the computer.  It 
        
        is unnecessary to calculate any sample space or any partition of 
        
        it; one never needs to know that there are 2,598,960 or 
        
        whatever number of possible poker hands. (Goldberg, 1960, p. 305) 
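
             The computer procedure just described might look roughly like 
        the sketch below.  The list of 1000 card values is a hypothetical 
        stand-in generated by the program (in practice each value would be 
        entered as the card is examined), and "three-of-a-kind" is taken here 
        to mean exactly three cards of one value with the other two 
        different, which is an assumption about the scoring rule.

```python
import random
from collections import Counter

rng = random.Random(0)

# Hypothetical stand-in for the 1000 recorded card values. The uneven weights
# only mimic a pile in which picture cards are somewhat scarcer.
VALUES = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
WEIGHTS = [4] * 10 + [3] * 3
sampled_cards = rng.choices(VALUES, weights=WEIGHTS, k=1000)

def is_three_of_a_kind(hand):
    """True when one value appears exactly three times and the other two differ."""
    return sorted(Counter(hand).values()) == [1, 1, 3]

def estimate(cards, trials=10_000):
    """Proportion of simulated five-card hands that are three-of-a-kind."""
    hits = 0
    for _ in range(trials):
        hand = rng.choices(cards, k=5)  # five cards drawn with replacement
        if is_three_of_a_kind(hand):
            hits += 1
    return hits / trials

p_three = estimate(sampled_cards)
print(f"estimated chance of three-of-a-kind: {p_three:.4f}")
```

        Nothing in the program counts how many cards of each value the sample 
        holds or enumerates possible hands; it only deals and tallies.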
        
             A probabilist might suggest computing the chance of 
        
        three-of-a-kind from the same 1000 pieces of information by using 
        
        probability theory.  Both these procedures will arrive at much 
        
        the same result.  Both fail to take account of physical factors - 
        
        size, and type of material - that might affect physical trials 
        
        with the 1000 sampled cards.  The simulation will be slightly 
        
        less "exact" than the theoretical calculation, the lesser 
        
        exactness being made as small as desired by increasing the number 
        
        of computer trials; the loss of accuracy surely will be very 
        
        small relative to the sampling error deriving from choosing the 
        
        1000 cards from the huge pile - including both the random-
        
        sampling error and the bias due to not drawing the sample 
        
        randomly.  And of course the formal calculation in this case will 
        
        be quite tricky and prone to error.  It must assess the size of 
        
        the sample space of three-of-a-kind hands when the numbers of 
        
        cards of various values will differ, both because the factory 
        
        makes different numbers (no jacks, queens, and kings in some 
        
        decks, for example) and because of the inaccuracy due to 
        
        the sample of only 1000 cards.  In contrast, the sample space 
        
        need never be known for physical or computer resampling.
        
        
                   EXPLANATION OF THE ADVANTAGE OF RESAMPLING
        
        Lighter Conceptual Burden
        
             In general, the conceptual burden in resampling is much 
        
        slighter than in probability theory; this is one of resampling's 
        
        main advantages.  One does not need to be able to add or even to 
        
        count in order to conduct individual experimental trials.  One 
        
        only needs to know the concept of counting, and also the concept 
        
        of a ratio, so as to (first) keep a record of the numbers of 
        
        successful and unsuccessful trials, and (second) add to get the 
        
        total trials and divide to get the ratio of successful to 
        
        total. Certainly the discipline that applauds the likes of Peano, 
        
        Russell, and Whitehead for boiling down mathematics to its most 
        
        fundamental elements should have some appreciation for an 
        
        intellectual method that gets along so successfully with so 
        
        little recourse to higher abstractions.
        
             Consider, for example, the case of the probabilities of 
        
        various numbers of points when throwing two dice (refer to 
        
        Goldberg, 1960, p. 158ff).  When specifying the sample space, 
        
        etc., one needs to add the two top faces of the dice to determine 
        
        the range of the function.  With simulation it is not necessary 
        
        to ever determine this range; one simply tosses the two dice and 
        
        inspects the outcomes.  One can ask the probability of getting 
        
        "13" (or any other number) and get an answer experimentally 
        
        without knowing the range in advance.  
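
             A computer stand-in for the two-dice experiment, offered as a 
        sketch only: it assumes ideal dice produced by a random-number 
        generator rather than any physical pair, and the helper name is 
        invented for the example.

```python
import random
from collections import Counter

rng = random.Random(2)
TRIALS = 36_000

# Toss two dice and record the total; no sample space is written down first.
totals = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(TRIALS))

def proportion(total):
    """Share of tosses whose two top faces summed to `total`."""
    return totals[total] / TRIALS

p7 = proportion(7)
p13 = proportion(13)
print(f"7: {p7:.3f}   13: {p13:.3f}")
```

        Asking for "13" needs no special handling; the tally simply never 
        records one, so the estimated proportion is zero.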
        
        
        Reducing the Extent of Abstraction from Actual Experience
        
             Robert Shannon, in a book on Systems Simulation, constructs 
        
        a continuum from "Physical models" to "Scaled models" to "Analogy 
        

        models" to "Computer simulation" to "Mathematical models" (1975, 
        
        p. 8).  (I would add experimentation with the actual material of 
        
        interest as a stage even less abstract than Physical models.)  At 
        
        each successive stage of translation to greater abstraction one 
        
        runs the risk of losing some important aspect of experiential 
        
        reality, and of introducing misleading assumptions and 
        
        simplifications.  This argues for abstracting as little as 
        
        possible, doing so only to the extent that it is necessary.
        
             As Shannon's continuum suggests, simulation methods in 
        
        statistics (with or without a computer) are less abstract than 
        
        are distributional and formulaic methods, and they should be less 
        
        at risk of error.  This speculation jibes with the experimental 
        
        evidence that people can attain more correct answers to numerical 
        
        problems with resampling methods than with formulaic methods, 
        
        when given equal amounts of instruction (Simon, Atkinson, and 
        
        Shevokas, 1976).  
        
             Of course the optimal level of abstraction depends upon the 
        
        circumstances.  If one wants to estimate the probability of a 
        
        given sum with four dice in order to maximize one's chance of 
        
        winning with those particular dice, experimenting with those very 
        
        dice is likely to be optimum, but if one wants to know the odds 
        
        with four dice in other circumstances, a more abstract approach 
        
        may be better.  However, there are very few circumstances in 
        
        which the formulaic and distributional abstractions are likely to 
        
        be better than Monte Carlo methods (lack of data being one such 
        
        circumstance, and low probability being another).  
        
        
        Operationalizing the Problem
        

             A third virtue of resampling may be stated as:  If you 
        
        understand the posing of the problem operationally, you 
        
        automatically will obtain the correct answer.  For example, 
        
        consider this probability puzzle from Lewis Carroll's Pillow 
        
        Problems (by way of Martin Gardner, correspondence, May, 1993):
        
             A bag contains one counter, known to be either white or 
             black.  A white counter is put in, the bag shaken, and 
             a counter drawn out, which proves to be white.  What is 
             now the chance of drawing a white counter?  
        
             The issue is, do I state the problem correctly in steps 1-4 
        
        below?  If I do, that implies that the repetition of the process 
        
        in those steps will lead to a correct answer to the problem.
        
             1.  Put a white counter (later have the computer call it "7" 
        
        to avoid confusion) or a black counter (call it "8") in the urn 
        
        with probability .5.
        
             2.  Put in a white and shuffle.
        
             3.  Take out a counter.  If black, stop.  
        
             4.  (If result of (3) is white):  Take out the remaining 
        
        counter, examine, and record its color.
        
             5.  Repeat steps 1-4 (say) 1000 times.
        
             6.  Compute how many trials yielded a white first.  
        
             7.  Count the number and compute the proportion of whites 
        
        ("7s") among the "white first" trials.
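
             Steps 1-7 translate almost line for line into a short program.  
        The version below is one possible rendering, not the program my 
        associate wrote; it uses the codes "7" and "8" from step 1 and runs 
        more repetitions than the 1000 suggested in step 5.

```python
import random

rng = random.Random(3)
WHITE, BLACK = 7, 8   # the codes suggested in step 1

white_first = 0       # step 6: trials in which the first counter drawn is white
white_second = 0      # step 7: of those, trials whose remaining counter is white

for _ in range(10_000):                 # step 5
    bag = [rng.choice([WHITE, BLACK])]  # step 1: unknown counter, probability .5 each
    bag.append(WHITE)                   # step 2: put in a white and shuffle
    rng.shuffle(bag)
    first = bag.pop()                   # step 3: take out a counter
    if first == BLACK:
        continue                        # if black, stop this trial
    white_first += 1
    if bag[0] == WHITE:                 # step 4: examine the remaining counter
        white_second += 1

proportion_white = white_second / white_first
print(f"{white_second} of {white_first} white-first trials: {proportion_white:.3f}")
```

        The proportion should settle near two-thirds, the value that Bayes' 
        rule gives for this problem.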
        
             The benefits of the operationalization of problems that 
        
        occurs with simulation can be seen in a different way in another 
        
        problem of Lewis Carroll's:
        
             Given that there are 2 counters in a bag, as to which all 
             that was originally known was that each was either white 
             or black.  Also given that the experiment has been tried a 
             certain number of times, of drawing a counter, looking 
             at it, and replacing it; that it has been white every 
             time...What would then be the chance of drawing white? 
             (p. 15). 
        
             This problem was an eye-opening experience for me.  First I 
        
        wrote down a set of steps to handle the problem with white and 
        
        black balls ("counters").  But I did not actually execute the 
        
        procedure.  Instead, while I was waiting for an associate to 
        
        write a computer program to solve the problem, following the 
        
        steps I had outlined, I set out to explain the problem logically. 
        
        I wrote five nice pages of what I thought to be clear 
        
        explanation.
        
             A few days later I reread the steps I had written down.  But 
        
        now I found that I could not understand the logic.  This 
        
        experience shows how easy it is to get confused with Bayesian 
        
        problems of this sort if one works analytically rather than with 
        
        simulation.  So I tried harder to create a simulation - and 
        
        harder - and harder.  And then I found that I simply could not 
        
        create a simulation that would model the problem as Carroll wrote 
        
        it (and as I understood it).  Apparently I was as confused as 
        
        anyone could be.
        
             What to do?  I decided to go back to my very basic 
        
        principle:  There must be a way to physically model every 
        
        meaningful question in probability and statistics.  If one cannot 
        
        find a way to model a simulation for the problem, maybe there is 
        
        something wrong with the problem rather than with my modeling.  
        
        And indeed, when we examine it closely, we may see that Carroll's 
        
        problem is not operational and hence not meaningful.  
        
             The difficulty turns out to lie in Carroll's phrase "given 
        
        that the experiment has been tried a certain number of times, of 
        
        drawing a counter, looking at it, and replacing it; that it has 


        been white every time".  In Carroll's solution he indicates that 
        
        he believes that it is possible to infer a probability for the 
        
        next trial on the basis of a series of trials that are all 
        
        successes.  This is a famous formula in probability theory - that 
        
        the probability is n/(n+1), where n is the number of observed 
        
        successes.  But probability theorists such as Feller have argued 
        
        (correctly, in my view) that this formula is not meaningful.  And 
        
        the fact that it is not possible to model the formula 
        
        meaningfully in this context confirms that theoretical analysis.
        
             So once again the act of attempting to create an operational 
        
        simulation of a problem and then actually executing the procedure 
        
        has kept our feet on solid ground and off the slippery slope into 
        
        confusion or meaninglessness.
        
        
                         LIMITS OF THE RESAMPLING METHOD
        
        Low Probabilities
        
             Can the formal method be better in any respect?  Yes, it 
        
        can.  If you want to estimate the chance of a royal flush in 
        
        poker, which probably would happen only once in hundreds of 
        
        thousands or millions of trial hands, taking samples by sitting 
        
        on the floor of the warehouse for a few hours and dealing hands 
        
        will not produce a sound estimate.  And even computer sampling 
        
        might be much less accurate than analysis unless an inordinate 
        
        amount of computer time is devoted to the problem. 
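
             To give a rough sense of the scale involved, the sketch below 
        works the arithmetic for an ordinary 52-card deck (an assumption; the 
        warehouse heap would differ) and asks how many simulated hands it 
        would take for the estimate's standard error to shrink to one-tenth 
        of the probability being estimated.  The function name and the 
        ten-percent target are choices made for the example.

```python
import math

hands = math.comb(52, 5)   # number of distinct five-card hands
p_royal = 4 / hands        # one royal flush per suit

def trials_needed(p, rel_se):
    """Trials at which the estimate's standard error equals rel_se * p."""
    # Standard error of a proportion is sqrt(p * (1 - p) / n); solve for n.
    return math.ceil((1 - p) / (p * rel_se ** 2))

n_10pct = trials_needed(p_royal, 0.10)
print(f"about 1 in {round(1 / p_royal):,} hands")
print(f"about {n_10pct:,} simulated hands for a 10% relative standard error")
```

        The required number of hands runs to tens of millions, which is far 
        beyond hand dealing and a real cost even for a computer.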
        
             But will the formal method surely be better for the royal 
        
        flush?  No.  There is an excellent chance that anyone except a 
        
        very skilled probabilist will use the wrong calculating formula, 
        
        and the erroneous answer might well be worse than no answer at 
        

        all, and worse than computer sampling or perhaps even sampling by 
        
        hand.  This realistic possibility of conceptual analytic error 
        
        cannot be ignored in any practical situation.  It is as much a 
        
        source of possible error as the sampling procedure, physical 
        
        characteristics of the cards, and unsound computer programming if 
        
        a computer is used.  Just as with the calculation of the 
        
        possibility of a disaster at a nuclear reactor, each possible 
        
        source of trouble must be gauged and allowed for in proportion to 
        
        its likely importance.  None can be dismissed as being avoidable 
        
        "in principle" by proper handling.
        
        
        Small Samples
        
             Imagine a sample of the heights of four persons.  You wish 
        
        to estimate a confidence interval for the population mean or 
        
        median.  It is rather obvious that the interval should go beyond 
        
        the range of the four observations, but a resampling procedure 
        
        will never give that result.  Does this mean that resampling is 
        
        inferior here to the conventional method using (say) the t test?
        
             Implicit in the conventional method is an assumption about 
        
        the shape of the distribution.  Making this assumption is in no 
        
        way different in principle from a Bayesian prior.  And the nature 
        
        of the assumption is crucial.  An assumption that would be 
        
        appropriate for heights would not be appropriate for incomes.
        
             Once we have established that it is necessary to bring 
        
        outside information and judgment to bear, we can then consider 
        
        doing so with the resampling method as well as the conventional 
        
        method. We need not enter into technical details here, but there 
        
        are many possible ways to coordinate the observations to any 
        

        shape of distribution in such fashion as to estimate its 
        
        dispersion, and then to draw samples from the distribution to 
        
        estimate confidence limits.  This would not seem inferior to the 
        
        conventional method.  And if one made the assumption of a 
        
        peculiar type of distribution, the advantage would seem to be 
        
        with the resampling method, though this subject needs more 
        
        exploration. 
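
             One way this might be done is sketched below, under loudly 
        stated assumptions: the four heights are invented, the assumed shape 
        is a normal curve fitted to the sample mean and standard deviation, 
        and the interval is a simple percentile interval of simulated means.  
        It is an illustration of the idea, not a recommended procedure.

```python
import random
import statistics

rng = random.Random(4)
heights = [160.0, 168.0, 171.0, 183.0]  # hypothetical heights in cm
n = len(heights)

def percentile_interval(means, level=0.95):
    """Central `level` share of the simulated sample means."""
    ordered = sorted(means)
    lo = ordered[int(len(ordered) * (1 - level) / 2)]
    hi = ordered[int(len(ordered) * (1 + level) / 2) - 1]
    return lo, hi

# Plain resampling: draw four values with replacement from the four observed.
plain = [statistics.mean(rng.choices(heights, k=n)) for _ in range(5_000)]

# Resampling from an assumed shape: a normal curve (an assumption that might
# suit heights but not incomes) fitted to the sample mean and standard deviation.
mu, sigma = statistics.mean(heights), statistics.stdev(heights)
shaped = [
    statistics.mean([rng.gauss(mu, sigma) for _ in range(n)])
    for _ in range(5_000)
]

plain_ci = percentile_interval(plain)
shaped_ci = percentile_interval(shaped)
print("plain: ", plain_ci, "means span", min(plain), "to", max(plain))
print("shaped:", shaped_ci, "means span", min(shaped), "to", max(shaped))
```

        The plain version can never produce a simulated mean outside the 
        range of the four observations, whereas the shaped version can, 
        because the outside judgment about the distribution's shape has been 
        brought into the sampling.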
        
        
                              WHAT ABOUT "USUALLY"?
        
             The title of this article says that the formal method is 
        
        "usually" inferior.  This assertion assumes that most 
        
        applications of probability and statistics deal with situations 
        
        and probabilities that lend themselves well to direct physical 
        
        sampling and/or to the resampling procedure on the computer.  
        
        This very general assertion, of course, might be refuted by 
        
        systematically-gathered evidence.  What is most important, 
        
        however, is not the general assertion but rather choosing the 
        
        method that is right for each particular situation.  
        
             The card-warehouse example lacks realism.  But estimating 
        
        the probability that there will be two faults in a particular 
        
        piece of machine output, where the probability of each fault 
        
        seems to be independent of each other, is not very dissimilar, 
        
        though the probability model is rather different.  And a quite 
        
        analogous realistic set of problems was the basis for Galileo's, 
        
        and then Pascal's and Fermat's, foundational work with dice games 
        
        in formal probability theory that proceeded by assessing the 
        
        sample space and partitions of it.  But experimentally estimating 
        
        the odds as gamblers previously had done had led to sounder 
        

        answers than even such great minds as Gottfried Leibniz had 
        
        arrived at with deductive methods (cited by Hacking, 1975, p. 
        
        52).
        
             Why argue that formal methods are often inferior in 
        
        principle?  One of the objections to resampling in statistics is 
        
        that it is "only" an imperfect substitute for formal methods, and 
        
        that the passage to formal methods represents an advance over 
        
        simulation methods.  For example, when William Kruskal compared 
        
        the early statement of resampling methods in the stark terms of 
        
        the necessary operational procedures, versus developments in the 
        
        literature later on, he dismissed the importance and value of the 
        
        former by saying that the latter embodies "real mathematics" 
        
        (personal correspondence, 1984). 
        
             There is an important analogy between the lack of exactness 
        
        in resampling and the movement in modern physics and mathematics, 
        
        since Poincare and Bohr, away from Newtonian deterministic 
        
        analysis of closed systems and toward non-deterministic analysis 
        
        of open systems.  (See Ekeland, 1988, for an illuminating 
        
        discussion of this movement.)  Probability theory is a set of 
        
        exact closed-form replicas of inexact open physical situations, 
        
        of which the card warehouse is an example.  (A sample of 1000 
        
        cards taken from the warehouse, and then converted to equally-
        
        weighted entities converts the open system to a closed system.)  
        
        That is, when calculating the probability of two-of-a-kind in a 
        
        poker hand, the sample space and the partition containing that 
        
        subset are exact numbers even though in any actual situation 
        
        there are incalculable elements such as the different weights of 
        
        the cards due to the different amounts of ink on them, their 


        slightly different sizes, and so on.  
        
             I am not criticizing the exact model for not being an 
        
        inexact replica, any more than a photograph should be criticized 
        
        for not being a perfect replica of the scene it portrays.  But to 
        
        claim that the photograph is a truer form than is the scene 
        
        itself, or to claim that probability theory is more exact than a 
        
        physical manipulation which is the very subject of interest - 
        
        that is, to claim that the calculation of getting a pair of "2s" 
        
        with two given dice is more exact than a million throws of the 
        
        same two dice - is hardly supportable.  
        
             The probabilist will reply that the calculation does not 
        
        refer to a particular pair of dice.  But the scientist and the 
        
        decision maker are always interested in some particular physical 
        
        reality - a given comet, or the price of corn tomorrow - and if 
        
        probability theory is to be judged other than by an esthetic 
        
        test, it must be judged on its helpfulness in these particular 
        
        situations.
        
             In contrast, resampling - especially physical experiments 
        
        with the elements that constitute the situation to be 
        
        estimated - is inescapably inexact.  It is ironic that it is 
        
        criticized for that mirroring of reality.
        

                                   REFERENCES
        
        
             Ekeland, Ivar, Mathematics and the Unexpected (Chicago:  U. 
        
        of Chicago Press, 1988).
        
             Feller, William, An Introduction to Probability Theory and 
        
        Its Applications (New York: Wiley, 1950).
        
             Goldberg, Samuel, Probability - An Introduction (New York: 
        
        Dover Publications, Inc., 1960). 
        
             Hacking, Ian, The Emergence of Probability (New York: 
        
        Cambridge U. P., 1975), pp. 166-171.
        
             Shannon, Robert, Systems Simulation (Englewood Cliffs:  
        
        Prentice-Hall, 1975).
        
        
