Chapter 13. Sampling and Simulation

Contents

  • 13.1 Overview of Sampling and Simulation 311

  • 13.2 Simulate Tossing a Coin 312

  • 13.3 Simulate a Coin-Tossing Game 314

    • 13.3.1 Distribution of Outcomes 314

    • 13.3.2 Compute Statistics for the Simulation 316

    • 13.3.3 Efficiency of the Simulation 319

  • 13.4 Simulate Rolling Dice 321

  • 13.5 Simulate a Game of Craps 322

    • 13.5.1 A First Approach 323

    • 13.5.2 A More Efficient Approach 324

  • 13.6 Random Sampling with Unequal Probability 327

  • 13.7 A Module for Sampling with Replacement 329

  • 13.8 The Birthday Matching Problem 332

    • 13.8.1 A Probability-Based Solution for a Simplified Problem 332

    • 13.8.2 Simulate the Birthday Matching Problem 335

  • 13.9 Case Study: The Birthday Matching Problem for Real Data 338

    • 13.9.1 An Analysis of US Births in 2002 338

    • 13.9.2 The Birthday Problem for People Born in 2002 340

    • 13.9.3 The Matching Birth Day-of-the-Week Problem 340

    • 13.9.4 The 2002 Matching Birthday Problem 344

  • 13.10 Calling C Functions from SAS/IML Studio 346

  • 13.11 References 347

13.1 Overview of Sampling and Simulation

This chapter describes how to use the techniques in this book to generate random samples from a specified distribution. In particular, the chapter describes how to use SAS/IML software to generate random samples from a set of outcomes with known probabilities.

Previous chapters used the RANDGEN subroutine to sample from continuous distributions such as the normal or uniform distribution. This chapter samples from discrete distributions. In particular, assume that some event can have k outcomes: outcome X1 occurs with probability p1, outcome X2occurs with probability p2, and so on, where Σki=1 Pi = 1.

For example, the outcomes of tossing a coin can be "Heads" or "Tails," with associated probabilities p1 = p2 = 0.5. This is known as sampling with equal probability. Alternatively, you could have a jar that contains marbles of various colors. The probability of pulling a particular color is equal to the proportion of marbles of that color. This is known as sampling with unequal probabilities.

A random sample is a sample of n elements that is chosen randomly from a population according to some scheme. If a given element can be selected more than once, the scheme is known as random sampling with replacement; otherwise, the scheme is known as random sampling without replacement. The term "with replacement" means that drawing a particular element does not change the probability of choosing that element for the next draw. A classic example is pulling colored marbles from a jar. When sampling with replacement, each marble that you pull from the jar is examined, the color is recorded, the marble is replaced in the jar, and the jar is well shaken prior to pulling another marble. The examples in this chapter all use sampling with replacement. A widely used text on survey sampling is Lohr (2009), which includes a section on simple random sampling and an appendix on probability concepts that are used in sampling.

A simulation is a model of a real-world outcome that involves randomness. Some simulations presented in this chapter include the following:

  • simulating a game that involves tossing a coin

  • simulating the game of craps, which involves rolling a pair of dice

  • simulating selecting items when some items have a higher probability of being selected than other items

  • simulating the probability of two people in a room having the same birthday

13.2 Simulate Tossing a Coin

The simplest game of chance is the coin toss. For a fair coin, the probability of a flipped coin landing on a given side ("heads") is 0.5. You can simulate a coin toss by generating a pseudorandom number from the uniform distribution on the open interval (0, 1). You can classify the number as "heads" if it is in the range [0.5,1) and as "tails" otherwise. Alternatively, you can use the CEIL or FLOOR function to produce integers that represent heads and tails, as shown in the following statements:

/* simulate a coin toss */
call randseed(1234);                /* set random number seed     */
N = 10;
c = j(N, 1);                        /* allocate Nx1 vector        */
call randgen(c, "uniform");         /* fill with values in (0,1)  */
c = floor(2*c);                     /* convert to integers 0 or 1 */
print c;
Simulated Coin Flips

Figure 13.1. Simulated Coin Flips

Notice that no loop is used to generate the pseudorandom numbers: a vector is obtained by setting the size of the vector c and then calling the RANDGEN subroutine a single time. The resuts of the simulation are shown in Figure 13.1. Heads are represented by ones and tails by zeros.

It is often useful to encapsulate a simulated event into a module, as shown in the following statements:

/* Module to simulate the flip of a fair coin.
 * Call the RANDSEED function before calling this module.
 * The module returns a vector of ones (heads) and zeros (tails).
 */
 start TossCoin(N);
    c = j(N, 1);                       /* allocate Nx1 vector        */
    call randgen(c, "uniform");        /* fill with values in (0,1)  */
    c = floor(2*c);                    /* convert to integers 0 or 1 */
    return ( c );
 finish;

The module can be used to simulate the event while hiding the specifics of the simulation. For example, the following statements simulate tossing a coin 100 times and count the total number of heads and tails:

/* simulate many coin tosses */
call randseed(1234);                  /* set random number seed     */
N = 100;
g = TossCoin(N);                      /* flip a coin N times        */
NumHeads = sum(g);                    /* count the number of heads  */
NumTails = N - NumHeads;              /* the other tosses are tails */
PropHeads = g[:];                     /* proportion of heads        */
print NumHeads NumTails PropHeads[format=percent8.3];
Simulated Coin Tosses

Figure 13.2. Simulated Coin Tosses

As expected, roughly half of the simulated coin tosses result in heads.

13.3 Simulate a Coin-Tossing Game

In a game of chance, you can sometimes use probability theory to compute the expected gain (or loss) for playing the game. Sometimes the strategy you use and the choices you make while playing affect the game's outcome; other times the outcome is completely determined by the rules of the game. Even if you cannot calculate the probabilities exactly, you can estimate the expected gain by simulating the game many times and averaging the gains and losses of the simulated games.

This section analyzes a coin game with the following rules:

  1. Toss a fair coin.

  2. If the coin comes up heads, toss again.

  3. If the coin comes up tails, the game is over.

In a multiplayer version of this game, the player with the longest string of tosses wins. The next section simulates the game. Subsequent sections estimate the expected gain for playing the game when there are certain costs and rewards that are associated with each outcome.

13.3.1 Distribution of Outcomes

The simplest way to analyze this game is to determine the probability for each outcome, as shown in Table 13.1. In general, the probability of the game ending after k tosses is 2−k, k = 1, 2,....

Table 13.1. Outcomes and Associated Probabilities for a Coin Game

Outcome

Tosses

Probability

T

1

1/2

HT

2

1/4

HHT

3

1/8

HHHT

4

1/16

Outcomes and Associated Probabilities for a Coin Game

Outcomes and Associated Probabilities for a Coin Game

Outcomes and Associated Probabilities for a Coin Game

H...HT

k

1/2k

A simulation of this game is straighforward and uses the TossCoin module developed in the previous section.

/* Simulate a game: flip a coin until tails appears.
   Keep count of the number of tosses.
   First approach: not optimally efficient. */
call randseed(1234);                /* set random number seed       */
NumGames = 1e5;                     /* repeat the game many times   */
Results = j(NumGames, 1, .);        /* number of tosses per game    */
do i = 1 to NumGames;
   count = 1;                       /* number of tosses so far      */
   toss = TossCoin(1);              /* toss the coin                */
   do while(toss=1);                /* continue until tails appears */
      count = count + 1;
      toss = TossCoin(1);
   end;
   Results[i] = count;              /* record the results           */
end;

The program allocates the Results vector prior to the main loop. This vector holds the results of the simulation. The i th element of the vector contains the number of tosses for the i th game. If the simulation is accurate, then the distribution of simulated tosses should approximate the distribution shown in Table 13.1.

The simulation uses classic language features and functions of SAS/IML software. One way to visualize the distribution of outcomes is to use the following IMLPlus statements to create a bar chart:

/* visualize the distribution of simulated outcomes */
declare BarChart bar;
bar = BarChart.Create("coin", Results);
bar.SetOtherThreshold(0);
bar.ShowPercentage();
bar.ShowBarLabels();
Outcomes of Simulated Coin Flips

Figure 13.3. Outcomes of Simulated Coin Flips

Figure 13.3 shows that the distribution of simulated outcomes for the coin game is very close to the theoretical distribution shown in Table 13.1.

13.3.2 Compute Statistics for the Simulation

The previous section simulated many possible outcomes for the coin-tossing game. If the simulation is correct, a statistic computed from the simulated outcomes is an estimate for a corresponding probability.

13.3.2.1 What is the expected number of coin tosses in a game?

If you play the coin game, what is the expected number of coin tosses in a game?

For this problem it is possible to use elementary probability theory to compute the expected value. As stated earlier, the probability of a game that consists of exactly k flips is 2k. Therefore, the expected value for the number of flips is Σk = 1k2k. You can use techniques from calculus to show that this series converges to the value 2.

For readers who prefer computations more than calculus, the following statements estimate the expected number of coin tosses per game by numerically summing a finite number of terms:

/* compute expected number of tosses per game: theoretical result   */
flips =1:50;                /* approximate series by first 50 terms */
prob = 2##(-flips);         /* probability for each game            */
e = sum(flips#prob);        /* expected number of coin flips        */
print e;
Numerical Summation of Probabilities

Figure 13.4. Numerical Summation of Probabilities

For this game, it is not hard to show that the expected number of coin tosses for a game is 2. However, in more complicated scenarios, it might not be possible to compute the probability exactly. If that is the case, you can estimate the expected value by computing the mean of the (simulated) number of tosses for a large number of games. The following statements compute the average number of tosses per game in the simulated data:

/* calculate average number of tosses per game in simulated data */
AvgFlipsPerGame = Results[:];             /* mean over all games */
print AvgFlipsPerGame;
Average Number of Flips per Game

Figure 13.5. Average Number of Flips per Game

The results, shown in Figure 13.5, indicate that the average number of coin tosses in the simulated data is about 2, which is in close agreement with the theoretical value.

Notice that the average number of coin tosses in the simulated data depends on the random number seed and on the number of simulated games. If you rerun the simulation with different seeds (but still use 100,000 games for each simulation), the mean number of coin tosses will vary. Perhaps one simulation produces 2.006 for the mean, another produces 1.992, and yet another produces 1.998. You can imagine that if you rerun the simulation many times, you might produce a set of means that are distributed as shown in Figure 13.6. Chapter 14 shows how you can use this distribution of statistics to obtain a standard error for the mean statistic and confidence intervals for the mean.

Mean Number of Coin Tosses for 100 Simulations

Figure 13.6. Mean Number of Coin Tosses for 100 Simulations

Figure 13.6 shows that a simulation that uses 100,000 games results in a mean estimate that does not vary very much. Of course, if you use a smaller number of games for each simulation, you would expect greater variation in the statistic.

13.3.2.2 What is the expected gain (loss) for playing the game?

For many games of chance, it is interesting to calculate the expected gain (or loss) for the game, given a particular cost to play and a reward for certain outcomes.

Suppose a huckster approaches you on a street corner, describes the coin game, and offers you the following proposition:

  • It costs one dollar to play the game.

  • You keep tossing until the coin comes up on the tails side.

    • – If the coin shows tails on the first or second toss, you lose.

    • – If your toss shows heads two or more times before it shows tails, you win a dollar for each head.

Should you accept the bet? If the game outcome is T or HT, you lose a dollar. If the outcome is HHT, you get two dollars—but only one dollar of that is a gain, since it costs a dollar to play. The outcome HHHT results in a net gain of two dollars, and so on, as shown in Table 13.2.

Table 13.2. Gain or Loss for Each Outcome

Outcome

Probability

Gain

T

1/2

−1

HT

1/4

−1

HHT

1/8

1

HHHT

1/16

2

Gain or Loss for Each Outcome

Gain or Loss for Each Outcome

Gain or Loss for Each Outcome

You can compute the expected gain (loss) by using the simulated data. It is convenient to examine the number of heads in each game, which is simply one less than the number of tosses. You can use the CHOOSE function (see Section 3.3.3) to compute the gain of each game in the simulation. After you have computed the gain for each simulated outcome, you can compute the mean of these quantities, which is an estimate for the expected gain. The following statements perform these computations:

/* compute gain = payout for game - cost to play */
NumHeads = Results - 1;               /* each game ends with tails   */
Gain = choose(NumHeads<2, 0, NumHeads)-1;   /* proposed gain/loss    */
AvgGain = Gain[:];                    /* average gain for many games */
print AvgGain;
Evaluation of a Betting Scheme on Simulated Games

Figure 13.7. Evaluation of a Betting Scheme on Simulated Games

The CHOOSE function enables you to return a vector of results based on the value of each element of the NumHeads vector. For each value less than 2, the CHOOSE function returns zero. For each value greater than or equal to 2, it returns the corresponding element of the NumHeads vector. Consequently, the CHOOSE function returns the amount of money that the huckster pays you for each simulated game. Since it costs a dollar to play the game, that amount is subtracted from the payout. Therefore, the vector Gain contains the gain or loss for each simulated game. The average gain is shown in Figure 13.7. For this betting scheme, the expected loss is twenty-five cents per game. You would be foolish to accept the bet!

13.3.3 Efficiency of the Simulation

As a general rule, it is better to call a function one time to compute a vector of values, as compared with calling the same function many times. This rule also holds for generating random numbers.

An alternative to the algorithm in the previous section is to simulate many coin tosses, store those results in a vector, and then analyze the tosses to determine the outcomes of the games. This is the approach used in this section. With this approach, you call the TossCoin module with an argument that is large enough to complete a desired number of games. (Knowing how many coin tosses are needed might be difficult.) The following statements use this approach:

/* Simulate a game: flip a coin until tails appears.
   Keep count of the number of tosses.
   Second approach: more efficient. */
call randseed(1234);              /* set random number seed         */
NumGames = 1e5;
N = int(2.5 * NumGames);          /* 1. Determine number of rolls   */
toss = TossCoin(N);               /* 2. generate all random numbers */
Results = j(NumGames, 1, .);      /* 3. Allocate results vector     */
tossIdx = 1;
do i = 1 to NumGames;
   count = 1;                     /* number of tosses so far        */
   do while(toss[tossIdx]=1);     /* continue until tails           */
      count = count + 1;
      tossIdx = tossIdx + 1;      /* 4. Move to next toss           */
   end;
   Results[i] = count;            /* 5. Record the results          */
   tossIdx = tossIdx + 1;
end;

The algorithm in this section has a few differences from the previous algorithm:

  1. The purpose of the program is to simulate a certain number of games, but how can you determine how many tosses you will need for the simulation? The previous section showed that the average number of tosses per games is 2. To be on the safe side, increase this number to 2.5 (or more if you are ultra-conservative or intend to simulate a small number of games) and estimate that 2.5*NumGames tosses are sufficient to simulate the requested number of games.

  2. Generate all random numbers with a single call. The toss vector is a long vector of zeros and ones.

  3. Allocate a vector to hold the results (the number of heads for each game). The allocation is outside of the main loop.

  4. Use the integer tossIdx to keep track of the current toss. The integer is an index into the toss vector that contains the simulated coin tosses. Every time the program needs a new toss, it increments tossIdx.

  5. Every time that a game is determined, the number of heads for that game is recorded. At the end of the simulation, the Results vector contains the number of tosses for each game.

The algorithm is only slightly more complicated than the algorithm in the previous section, but the increased efficiency results in the revised program being about 40% faster than the previous approach. This is quite an improvement for a slight reorganization of the statements.

13.4 Simulate Rolling Dice

You can use the ideas in the previous section to simulate rolling a pair of six-sided dice. The main idea is simple: geneate random uniform variates and transform those values into integers in the range one through six. The following statements present a module that uses the RANDGEN subroutine to simulate rolling a pair of dice a specified number of times:

/* Module to roll two six-sided dice N times.
 * Call the RANDSEED function before calling this module.
 * The return value is the sum of the values of the roll.
 * This module calls the uniform distribution.
 */
start RollDice(N);
    d = j(N, 2);                    /* allocate Nx2 vector (2 dice) */
    call randgen(d, "uniform");     /* fill with values in (0,1)    */
    d = ceil(6*d);                  /* convert to integers in 1-6   */
    return ( d[,+] );               /* values of the N rolls        */
finish;

The RollDice module takes a single argument: the number of dice rolls to simulate. The RANDGEN subroutine creates an N × 2 matrix of values in the range (0,1), so d consists of pairs of integers with the values 1-6. The module returns the sum of the values showed on each roll.

Notice that the vector returned by the RollDice module has N elements. The following statements simulate the results of rolling dice several times:

/* simulate rolling a pair of six-sided dice */
call randseed(54321);
rolls = RollDice(10);
print rolls;
Simulated Rolls for a Pair of Six-Sided Dice

Figure 13.8. Simulated Rolls for a Pair of Six-Sided Dice

Actually, there is an even more efficient way to simulate these dice rolls. The RANDGEN subroutine supports a "Table" distribution that enables you to specify a table of probabilities for each of k outcomes. In this case, k = 6 and the probability for each outcome is 1/6. The following statements use the RANDGEN function to sample from six outcomes with equal probability:

/* Module to roll two six-sided dice N times.
 * Call the RANDSEED function before calling this module.
 * The return value is the sum of the values of the roll.
 * This module calls the Table distribution with equal probability.
 */
start RollDice(N);
    d = j(N, 2);                    /* allocate Nx2 vector          */
    prob = j(6, 1, 1)/6;            /* equal prob. for six outcomes */
    call randgen(d, "Table", prob); /* fill with integers in 1-6    */
    return ( d[,+] );               /* values of the N rolls        */
finish;

This second definition of the RollDice module is about 40% faster than the first definition.

13.5 Simulate a Game of Craps

If you can simulate the rolling of dice, you can simulate games that use dice. A well-known dice game is the gambling game of craps. This section presents a program that simulates the game of craps and analyzes the results.

The rules of craps are straighforward: a player (called the "shooter") rolls two six-sided dice and adds their values.

  • If the first roll sums to 7 or 11, the shooter wins. (This is called a "natural.")

  • If the first roll sums to 2, 3, or 12, the shooter loses. (This is called "craps.")

  • If the first roll sums to anything else (4, 5, 6, 8, 9, or 10), that sum becomes the player's "point." The shooter continues rolling the dice until he does either of the following:

    • – "makes his point" (that is, rolls the point again) and wins, or

    • – rolls a 7 and loses.

As noted in the previous section, there are two approaches to simulate the game. The first is to loop over a large number of games, rolling (simulated) dice for each game until the game is decided. This mimics the way that the game is physically played. The second approach is to simulate a large number of dice rolls, and then loop over the rolls to determine how many games were decided and which ones were won or lost. The first approach is easier to program and explain, but is less efficient because it requires a large number of calls to a random number routine. The second approach is more efficient because it requires a single call to a random number routine, but it is slightly harder to program and requires more memory to store the results.

13.5.1 A First Approach

The following algorithm is one that many programmers would use to simulate playing many games of craps. It simulates each game by calling the RollDice module with an argument of 1.

/* simulate the game of craps: first approach
   Not optimally efficient: many calls to the random number routine  */
call randseed(54321);
NumGames = 1e5;
Results = j(NumGames, 1, 0);         /* win/loss status of each game */
do i = 1 to NumGames;
   roll = RollDice(1);
   if roll=7 | roll=11 then do;
      Results[i] = 1;                /* won by a "natural"           */
   end;
   else if roll=2 | roll=3 | roll=12 then do;
      Results[i] = 0;                /* lost (rolled "craps")        */
   end;
   else do;
      game = .;                      /* game is not yet determined   */
      thePoint = roll;               /* establish the point          */
      do while (game=.);             /* keep rolling                 */
         roll = RollDice(1);
         if roll=7 then
            game = 0;                /* roll seven? Lost!            */
         else if roll=thePoint then
            game = 1;                /* make the point? Won!         */
      end;
      Results[i] = game;             /* record results (won/lost)    */
   end;
end;

The algorithm is straightforward. The program allocates the (Results) vector prior to the main loop. This vector holds the results of the simulation. For each game, the appropriate entry for the Results vector is set to zero or one according to whether the game was lost or won. The following statements use the values in this vector to show simple statistics about the game of craps. The results are shown in Figure 13.9.

won = sum(Results);                  /* count the number of 1's */
lost = NumGames-won;
pctWon = Results[:];
print won lost pctWon[format=PERCENT8.3];
Simulated Wins and Losses

Figure 13.9. Simulated Wins and Losses

Notice that the average number of games won is slightly less than 50%. Assuming that the simulation accurately reflects the true probabilities in the game, this indicates that the shooter's chance of winning is roughly 1.4% less than the chance of losing. You can use probability theory to show that the exact odds are 244/495 ≈ 0.4929 (Grinstead and Snell 1997). The estimate based on the simulated data is very close to this value.

You could also allocate and fill other vectors that record quantities such as the number of rolls for each game, the value of the "point," and whether each game was determined by a "natural," "craps," or by rolling for a "point." These extensions are examined in the next section.

13.5.2 A More Efficient Approach

An alternative to the algorithm in the previous section is to simulate many rolls of the dice, store those results, and then analyze the rolls to determine which games were won or lost. With this approach, you call the RollDice module with an argument that is large enough to complete the desired number of games. (Again, the potential problem is that it might not be obvious how many rolls are sufficient.) The following statements use this approach:

/* simulate the game of craps: second approach
   More efficient: one call to the random number routine*/
call randseed(54321);
NumGames = 1e5;
N = 4*NumGames;                   /* 1. Determine number of rolls  */
rolls = RollDice(N);
                                  /* 2. Allocate results vectors   */
Results = j(NumGames, 1, 0);      /* win/loss status of each game  */
rollsInGame = j(NumGames, 1, 1);  /* how many rolls in each game?  */
Points      = j(NumGames, 1, .);  /* "point" for each game?        */
howEnded = j(NumGames, 1, "       "); /* "natural", "craps", ...   */
rollIdx = 1;
do i = 1 to NumGames;
   count = 1;
   roll = rolls[rollIdx];         /* 3. Keep track of current roll */
   if roll=7 | roll=11 then do;
      game = 1;                   /* won by a "natural"            */
      howEnded[i] = "natural";
   end;
   else if roll=2 | roll=3 | roll=12 then do;
      game = 0;                   /* lost (rolled "craps")         */
      howEnded[i] = "craps";
   end;
   else do;
      game = .;                   /* game is not yet determined    */
      thePoint = roll;            /* establish the point           */
      do while (game = .);        /* keep rolling                  */
         rollIdx = rollIdx + 1;   /* examine next roll             */
         roll = rolls[rollIdx];
         if roll = 7 then
            game = 0;             /* roll seven? Lost!             */
         else if roll = thePoint then
            game = 1;             /* make the point? Won!          */
         count = count + 1;
      end;
      Points[i] = thePoint;       /* 4. Record results (won/lost)  */
      howEnded[i] = choose(game,"point-Won","point-Lost");
   end;
   Results[i] = game;
   rollsInGame[i] = count;
   rollIdx = rollIdx + 1;
end;

The revised program is about 40% faster than the previous approach. The algorithm in this section has a few differences from the previous algorithm:

  1. How can you determine how many rolls you will need for the simulation? You can either use probability theory or you can simulate a small number of games using the algorithm in the previous section and estimate the average number of rolls per game. In either case, you can determine that the expected number of rolls per game is about 3.37. To be on the safe side, round this up to 4 and estimate that 4*NumGames rolls is sufficient to simulate the requested number of games.

  2. Allocate vectors to hold the results. This is done outside of the main loop. In addition to keeping track of the wins and losses, this program also counts how many rolls were required to determine the game (in the vector rollsInGame), what the "point" (if any) was for each game (in Points), and whether the game was determined by rolling a natural, craps, winning a point, or losing a point (in howEnded).

  3. Use the integer rollIdx to keep track of the current roll.

  4. Every time that a game is determined, the various results for that game are recorded in the relevant vectors.

You can estimate the probability of winning the game by calculating the proportion of games won in the simulation. The expected number of rolls is estimated by the mean number of rolls per game in the simulation. These estimates are computed by the following statements and are shown in Figure 13.10.

PctGamesWon = Results[:];              /* percentage of games won */
uHow = unique(howEnded);               /* "natural", "craps", ... */
pct = j(1, ncol(uHow));                /* percentage for howEnded */
do i = 1 to ncol(uHow);
   pct[i] = sum(howEnded = uHow[i]) / NumGames;
end;
AvgRollsPerGame = rollsInGame[:];      /* average number of rolls */
print PctGamesWon, pct[colname=uHow], AvgRollsPerGame;
Simulation-Based Estimates

Figure 13.10. Simulation-Based Estimates

Similarly, for each value of the point, you can estimate the conditional expectation that a shooter will make the point prior to throwing a seven and losing the game. The following statements estimate the conditional expectations:

/* examine conditional expectations: Given that the point
   is (4, 5, 6, 8, 9, 10), what is the expectation to win? */
pts = {4 5 6 8 9 10};                  /* given the point...        */
PctWonForPt = j(1, ncol(pts));         /* allocate results vector   */
do i = 1 to ncol(pts);
   idx = loc(Points=pts[i]);           /* find the associated games */
   wl = Results[idx];                  /* results of those games    */
   PctWonForPt[i] = wl[:];             /* percentage of games won   */
end;
print PctWonForPt[colname=(char(pts,2)) format=6.4;
            label="Pct of games won, given point"];

The previous statements use the LOC function to find all games for which a given point was established, and then compute the percentage of those games that were won. The results are shown in Figure 13.11.

Estimates for Conditional Probabilitites

Figure 13.11. Estimates for Conditional Probabilitites

For comparison, the exact odds of making each point is given in Table 13.3.

Table 13.3. Probabilities of Making a Given Point

Point

4 or 10

5 or 9

6 or 8

Exact Probability

3/9

4/10

5/11

Approximation

0.333

0.4

0.455

13.6 Random Sampling with Unequal Probability

Previous sections have simulated events in which each event had the same probability of occurrence. For tossing a coin, the probability of either side appearing is 1/2; for rolling a single six-sided die, the probability of any face appearing is 1/6. These are examples of random sampling with equal probability.

However, sometimes the probability that is associated with one event is greater than the probability of another event. For example, suppose that there are ten socks in a drawer: five are black, two are brown, and three are white. If you draw a sock at random, the probability of the sock being black is 0.5, the probability of brown is 0.2, and the probability of white is 0.3. This is an example of random sampling from the set {Black, Brown, White} with unequal probability.

If you are not a sock lover, you can mentally substitute other scenarios. For example, the ideas in this section also apply to political affiliation (for example, Democrats, Republicans, and Independents) and marital status (for example, single, married, separated/divorced).

The following steps simulate the process of drawing socks (with replacement) from a drawer:

  1. Choose a sock at random. The probability of choosing a particular color is proportional to the number of socks with that color.

  2. Record the color of the sock.

  3. Replace the sock in the drawer.

  4. Repeat the process.

How can you sample from the set {Black, Brown, White} with unequal probability? As shown in Section 13.4, you can use the RANDGEN subroutine to generate outcomes with specified probabilities. In this case, there are three outcomes with probabilities p = {0.5,0.2,0.3}. The RANDGEN subroutine generates integers 1, 2, or 3 according to those probabilities. This is shown in the following statements:

/* simulate drawing a sock from a drawer */
call randseed(123);
outcome = {"Black" "Brown" "White"};
p = {0.5 0.2 0.3};                  /* probability of events        */
N = 1e5;                            /* number of simulated draws    */

results = j(N, 1);
call randgen(results, "table", p);  /* fill with values 1..ncol(p)) */

Notice that the results vector is numeric. When the program ends, the results vector contains indices into the outcome vector. That is, results[i] is an integer (1, 2, or 3) that indicates the color of the i th sock.

Figure 13.12 shows the results for one run of the algorithm. The frequencies of the simulated draws are very close to the theoretical probabilities. The figure is created by the following IMLPlus statements:

declare BarChart bar;
bar = BarChart.Create("Socks", outcome[results]);
bar.ShowPercentage();
bar.ShowBarLabels();
Outcomes of Drawing from a Set with Unequal Probability

Figure 13.12. Outcomes of Drawing from a Set with Unequal Probability

You can analyze statistics of the simulated results to estimate probabilities such as the probability that three draws in a row result in the same color, or the expected number of draws before you obtain a matching pair in some color. These are left as exercises for the reader.

13.7 A Module for Sampling with Replacement

In each of the previous sections, you used SAS/IML statements to choose from some set of events with some probability. For simulating a coin toss, you sample from the set {Heads, Tails} with equal probability 1/2. For simulating a roll of two dice, you sample twice (independently) from the set {1, 2, 3, 4, 5, 6} with equal probability 1/6. For simulating a choice of socks, you sample from the set {Black, Brown, White} with associated probabilities {0.5,0.2,0.3}.

Each of these examples is an instance of the general problem of sampling with replacement from a set of events. It is both useful and instructive to construct a SAS/IML module that samples with replacement from a given set. The module will be a function that returns a matrix that contains random samples from a given set of events. A convenient syntax for the module might be

sample = SampleWithReplace(A, numSamples, prob);

where the arguments are as follows:

A

is the set of events from which to sample.

NumSamples

specifies the number of times to sample.

prob

specifies the probabilities for sampling. Because sampling with equal probability is an important subcase, it is useful to adopt the convention that if this argument is a missing value, then the module should assume that each element of A has equal probability of being selected.

For example, to simulate tossing a coin 1000 times, you could call the module as follows:

EQUAL = .;       /* convention: missing implies uniform probability */
tosses = SampleWithReplace({"H" "T"}, 1000, EQUAL);

In this call, EQUAL is explicitly set to be a missing value. The module must check whether the third argument is a missing value. If so, it should sample from the events with equal probability.

The following statements define the module Sample WithReplacement:

/* Random sampling with replacement. The arguments are:
 * A           events (sample space)
 *             numSamples  number of times to sample from A.
 *             numSamples[1] specifies the number of rows returned.
 *             If numSamples is a vector, then numSamples[2]
 *             specifies the number of repeated draws from A
 *             that are contained in each sample.
 * prob        specifies the probabilities that are associated with
 *             each element of A. If prob=. (missing), then equal
 *             probabilities are used.
 * The module returns a matrix that contains elements of A. The matrix
 * has numSamples[1] rows. It has either 1 or numSamples[2] columns.
 */
start SampleWithReplace(A, numSamples, prob);
   ;x = rowvec(A);                                       /* 1 */
   /k = ncol(x);

   if prob = . then
      p=j(1,k,1)/k;                                      /*2*/
   else do;
      p = rowvec(prob);
      if ncol(p) ^= k then do;                           /* 3 */
         msg = "ERROR: The probability vector must have the same
                number of elements as the set being sampled.";
         /* Runtime.Error(msg); */         /* use in SAS/IML Studio   */
         reset log; print msg; reset nolog;/* use in PROC IML         */
         stop;
      end;
   ;end;

   /* overload the numSamples argument:
      if a vector, then sumSamples[2] is a repetition factor */
   if nrow(numSamples)=1 & ncol(numSamples)=1 then do;
      nSamples = numSamples;                             /* 4 */
      nRep = 1;
   end;
   else do;
      nSamples = numSamples[1];
      nRep = numSamples[2];
   end;

   results = j(nSamples, nRep);                          /* 5 */
   call randgen(results, "Table", p);

   return (shape(A[results], nSamples));                 /* 6 */
finish;

This module includes basic error checking and logic to handle both equal and unequal probabilities. The main steps of the module are as follows:

  1. The ROWVEC module (in IMLMLIB) transforms the first argument (A, the matrix of events) to a row vector. This enables the module to uniformly handle the events without knowing whether A was passed in as a row vector or as a column vector. For example, the next statement defines k to be the number of events, and this statement is independent of the shape of A.

  2. The prob argument is examined. If the prob argument is missing, the module defines p to be a row vector of equal probabilities. Otherwise, p is defined by the probabilities passed into the module.

  3. Is the prob argument properly specified? It should have the same number of elements as there are events. If not, you can print a message to the SAS log and use the STOP statement to stop execution of the module. (If you are running the program in SAS/IML Studio, you can use the Runtime.Error method to accomplish the same thing.) You could also add additional error checking, such as using the CUSUM function to check whether the probabilities sum to unity, but that is not done in this module.

  4. The astute reader might wonder how the module will simulate rolling two (or more) dice. After all, there is only a single argument (numSamples) that specifies the number of times to roll the dice, but no argument to specify the number of dice. However, the numSamples argument is not constrained to be a scalar! You can overload the argument: if numSamples is vector, then use the second element to specify how many times to repeatedly toss additional coins, roll additional dice, or draw additional socks. Within the module, the local variables nSamples and nRep contain this information.

  5. The samples are generated as described in Section 13.6. A results matrix is created and filled with integers in the range [1, k], according to the specified probabilities.

  6. The results matrix contains integers in the range [1, k]. Therefore, the matrix A[results] is a column vector that contains the corresponding elements of A. The SHAPE function reshapes that matrix so that it has nSamples rows and nRep columns. This reshaped matrix is returned to the caller of the module.

This one module can do all of the work of the programs in the previous sections of this chapter. For example, the following statements call the SampleWithReplace module to simulate a few coin tosses, dice rolls, and sock draws:

call randseed(321);
N = 5;
EQUAL = .;
coinTosses = SampleWithReplace({"H" "T"}, N, EQUAL);
diceRolls = SampleWithReplace(1:6, N||2, EQUAL);       /* two dice */
p = {0.5 0.2 0.3};
sockDraws = SampleWithReplace({"Black" "Brown" "White"}, N, p);
print coinTosses, diceRolls, sockDraws;
Results of Calling the SampleWithReplace Module

Figure 13.13. Results of Calling the SampleWithReplace Module

By encapsulating the complexities of sampling into a module, programs become simpler to support, modify, and maintain. The following statement stores the module for later use:

store module=SampleWithReplace;

The SampleWithReplace module will be used in the rest of the chapter and in Chapter 14, "Bootstrap Methods." (If you intend to use the module from PROC IML, you might want to use the RESET STORAGE statement to store the module to a permanent library, as described in Section 3.4.6.)

13.8 The Birthday Matching Problem

The birthday matching problem is a well-known problem in elementary probability. A statement of the problem is, "If there are N > 2 people in a room, what is the probability that two or more share the same birthday?" This problem has been studied and presented in many places. The ideas in this section are inspired by the excellent presentation by Trumbo, Suess, and Schupp (2005).

One of the reasons this problem is often discussed is because many people find the probabilities in the problem to be surprising. For example, if N = 25, the chance of at least two people sharing a birthday is about 57%!

13.8.1 A Probability-Based Solution for a Simplified Problem

The birthday matching problem is usually presented under the following assumptions:

  • The people in the room are randomly chosen

  • A year consists of 365 days; no one in the room was born on February 29

  • Birthdays are uniformly distributed throughout the year

This last assumption is called the "uniformity assumption." Under these three assumptions, you can use elementary probability theory to compute the probability of matching birthdays. (Actually, it is simpler to compute the complement: the probability that no one in the room has a matching birthday.) Number the people in the room 1,..., N. The probability that the second person does not share a birthday with the first person is 364/365. The probability that the third person's birthday does not coincide with either of the first two people's birthdays is 363/365. In general, the probability that the i th person's birthday does not coincide with any of the birthdays of the first i 1 people is (365 − i + 1)/365 = 1 − (i − 1)/365, i = 1... 365.

Let QN be the probability that no one in the room has a matching birthday. By the assumption that the people in the room are chosen randomly from the population, the birthdays are independent of each other, and QN is just the product of the probabilities for each person:

13.8.1 A Probability-Based Solution for a Simplified Problem

If RN is the probability that two or more people share a birthday, then RN = 1 − QN.

You can use the index operator (:) to create a vector whose i th element is the probability that there is not a common birthday among the first i individuals. It is then easy to use the subscript reduction product operator (#) to form the product of the probabilities. For example, if there are N = 25 people in a room, then the following statements compute R25, the probability that at least two people share a birthday:

/* compute probability that at least two people share a birthday
   in a room that contains N people */
N = 25;                                /* number of people in room  */
i = 1:N;
iQ = 1 -(i-1)/365;                     /* individual probabilities  */
Q25 = iQ[#];                           /* product: prob of no match */
R25 = 1 -Q25;                          /* probability of match      */
print R25
Probability of a Shared Birthday (N = 25)

Figure 13.14. Probability of a Shared Birthday (N = 25)

For 25 people in a room, the computation shows that there is about a 57% chance that two or more share a birthday.

You can compute the quantity RN for a variety of values for N. Notice that the value of Qi is simply (1 − (i − 1)/365) times the value of Qi − 1. Therefore, you can compute the value of Qi iteratively, as shown in the following statements:

/* compute vector of probabilities: R_N for N=1,2,3... */
/maxN = 50;                                   /* maximum value of N */
/Q = j(maxN, 1, 1);                           /* note that Q[1] = 1 */
/do i = 2 to maxN;
    Q[i] = Q[i-1] * (1 -(i-1)/365);
end;
ProbR = 1 -Q;

The vector Q contains the value of QN and the vector ProbR contains the value of RN for N = 1, 2,..., 50. You can query the ProbR vector to find a specific probability or to determine how many people need to be in the room before the probability of a shared birthday exceeds some value. For example, the following statements show that if there are at least 32 people in a room, then the chance of a matching birthday exceeds 75%:

/* Compute how many people need to be in the room before the
   probability of two people having the same birthday exceeds 75% */
N75 = loc(ProbR > 0.75)[1];
print N75[label="N such that probability of match exceeds 75%"];
75% Chance of a Match When 32 People Are in a Room

Figure 13.15. 75% Chance of a Match When 32 People Are in a Room

You can also create an IMLPlus graph that plots elements of the ProbR vector versus the number of people in the room. Figure 13.16 shows how the probability depends on N. The following statements create Figure 13.16:

/* create plot of probability versus number of people in the room   */
np = t(1:maxN);                 /* number in the room (t=transpose) */
declare DataObject dobj;
dobj = DataObject.Create("Birthday", {"NumPeople" "ProbR"}, np||ProbR);

declare ScatterPlot plot;
plot = ScatterPlot.Create(dobj, "NumPeople", "ProbR");
plot.ShowReferenceLines();

The figure shows that the probability increases rapidly as the number of people in a room increases. By the time there are 23 people in a room, the chance of a common birthday is more than 50%. With 40 people in a room, the chance is more than 90%.

Probability of a Matched Birthday versus the Number of People in a Room

Figure 13.16. Probability of a Matched Birthday versus the Number of People in a Room

13.8.2 Simulate the Birthday Matching Problem

The birthday matching problem readily lends itself to simulation. To simulate the problem, represent the birthdays by the integers {1,2,..., 365} and do the following:

  1. Choose a random sample of size N (with replacement) from the set of birthdays. That sample is a "room."

  2. Determine whether any of the birthdays in the room match.

  3. Repeat this process for thousands of rooms and tabulate the results.

The following statements create a single "room" with 25 people in it. The 25 birthdays are randomly generated with uniform probability by using the SampleWithReplace module that is described in Section 13.7:

/* create room with 25 people; find number of matching birthdays */
call randseed(123);
load module=SampleWithReplace;        /* not required in IMLPlus    */
N = 25;                               /* number of people in room   */
EQUAL = .;                            /* convention                 */
bday = SampleWithReplace(1:365, 1||N, EQUAL);
u = unique(bday);                     /* number of unique birthdays */
numMatch = N - ncol(u);               /* number of common birthdays */
print bday, numMatch;

The vector that contains all birthdays of people in the room is shown in Figure 13.17, along with the number of matching birthdays. The vector birthday contains 25 elements. The UNIQUE function removes any duplicate birthdays and returns (in u) a sorted row vector of the unique birthdays. As pointed out in Trumbo, Suess, and Schupp (2005), this can be used to determine the number of matching birthdays: simply use the NCOL function to count the number of columns in u and subtract this number from 25.

Probability of a Shared Birthday (N = 25)

Figure 13.17. Probability of a Shared Birthday (N = 25)

For this sample, there are actually two pairs of matching birthdays. As shown in Figure 13.17, the third and twenty-third persons share a birthday on day 29 (that is, 29JAN), and the seventh and seventeenth persons share a birthday on day 124 (that is, 04MAY).

As seen in this example, it is possible to have a sample for which there is more than one birthday in common. How often does this happen? Or, said differently, in a room with N people, what is the distribution of the number of matching birthdays?

Simulation can answer this question. You can create many "rooms" of 25 people. For each room, record the number of matching birthdays in that room. The following statements present one way to program such a simulation:

/* simulation: record number of matching birthdays for many samples */
N = 25;                               /* number of people in room   */
NumRooms = 1e5;                       /* number of simulated rooms  */
bday = SampleWithReplace(1:365, NumRooms||N, EQUAL);
match = j(NumRooms, 1);               /* allocate results vector    */
do j = 1 to NumRooms;
   u = unique(bday[j,]);              /* number of unique birthdays */
   match[j] = N - ncol(u);            /* number of common birthdays */
end;

You can use the contents of the match vector to estimate the probability of certain events. For example, the following statements estimate the probability that a room with 25 people contains at least one pair of matching birthdays, and the probabilities of exactly one and exactly two matching birthdays. The results are shown in Figure 13.18.

/* estimate probability of a matching birthday */
SimP = sum(match>0) / NumRooms;              /* any match           */
print SimP[label="Estimate of match (N=25)"];

SimP1 = sum(match=1) / NumRooms;             /* exactly one match   */
print SimP1[label="Estimate of Exactly One Match (N=25)"];

SimP2 = sum(match=2) / NumRooms;             /* exactly two matches */
print SimP2[label="Estimate of Exactly Two Matched (N=25)"];
Estimated Probabilities of Matching Birthdays (N = 25)

Figure 13.18. Estimated Probabilities of Matching Birthdays (N = 25)

Recall from the theoretical analysis that there is about a 57% chance that two or more persons share a birthday when there are 25 people in a room. The estimate based on simulation is very close to this theoretical value. According to the simulation, there is exactly one match about 38% of the time, whereas there are two matches about 15% of the time.

Figure 13.19 shows a bar chart of the elements in the match vector. This graph uses the simulated outcomes to estimate the distribution of the number of matching birthdays in a room with 25 people.

Distribution of Matching Birthdays in Simulation (N = 25)

Figure 13.19. Distribution of Matching Birthdays in Simulation (N = 25)

13.9 Case Study: The Birthday Matching Problem for Real Data

It is interesting to examine one of the main assumptions of the birthday matching problem, namely, are birthdays really distributed uniformly thoughout the year?

This is discussed briefly in Trumbo, Suess, and Schupp (2005), and the article notes that data for US births by day are available from the National Center for Health Statistics (NCHS) at the Centers for Disease Control and Prevention (CDC) Web site. The data for live US births by day in 2002 is available in the data set Birthdays2002.

13.9.1 An Analysis of US Births in 2002

Figure 13.20 shows data for the 4,021,726 live births for the year 2002. The mean number of live births each day in 2002 was about 11,018, and the plot shows the percentage of births for each day, relative to the mean: 100 × births/(mean births). The NCHS calls this quantity the index of occurrence. The data range from a minimum value of 6,629 births (an index of about 60) on 25DEC2002, to a maximum value of 13,982 births (an index of 127) on 12SEP2002.

Data for US Births (2002)

Figure 13.20. Data for US Births (2002)

The shapes of the plot markers in Figure 13.20 correspond to days of the week. These data indicate that the number of births varies according to the day of the week. Tuesday through Friday is when most US babies are born. Mondays account for slightly fewer births, compared with the other weekdays. The graph shows that the fewest babies are born on weekends.

Figure 13.21 presents the same data as a box plot, with separate boxes for each day of the week. Again, the graph indicates a sizeable difference between the mean number of births for each day of the week.

Distribution of Births by Day of Week (2002)

Figure 13.21. Distribution of Births by Day of Week (2002)

The graphs indicate that the day of the week has a large effect on the number of births for that day. Researchers at the NCHS commented on this effect in their analysis of 2004 data, which shows the same qualitative features as the 2002 data (Martin et al. 2006, p. 10):

Patterns in the average number of births by day of the week may be influenced by the scheduling of induction of labor and cesarean delivery. . . . The relatively narrow range for spontaneous vaginal births contrasts sharply with that of . . . cesarean deliveries that ranged from 32.7 on Sunday to 130.5 on Tuesday.

In other words, the NCHS researchers attribute the day-of-week effect to induced labor and cesarean deliveries, as contrasted with "spontaneous" births. The numbers quoted are all indices of occurrence. The 2002 data also indicate a "holiday effect" (notice the outliers in Figure 13.21) and seasonality with a peak rate during the months of July, August, and September.

13.9.2 The Birthday Problem for People Born in 2002

Suppose we have a room full of randomly chosen people who were born in the US in 2002. The previous section demonstrated that the birthdays of the people in the room are not uniformly distributed throughout the year. There is a strong effect due to the day of the week. This section simulates the birthday problem for a room of people who were born in 2002. Rather than assume that every birthday is equally probable, the analysis uses the actual birth data from 2002 to assign the probability that a person in the room has a given birthday.

The analysis consists of two parts. Section 13.9.3 presents and analyzes the matching "birth day-of-the-week" problem; Section 13.9.4 analyzes the matching birthday problem. For each problem, the analysis consists of two different estimates:

  • a probability-based estimate

    • – assumes a uniform distribution of birthdays

    • – solves the matching problem exactly under the assumption of uniformity

  • a simulation-based estimate

    • – uses the actual distribution of US birthdays in 2002

    • – uses simulation and unequal sampling to estimate the probability of matching

13.9.3 The Matching Birth Day-of-the-Week Problem

Before trying to simulate the birthday matching problem with a nonuniform distribution of birthdays, consider a simpler problem. In light of the day-of-the-week effect noted in Figure 13.20 and Figure 13.21, you can ask: "If there are N people in a room (N = 2,...,!), what is the probability that two or more were born on the same day of the week?"

13.9.3.1 A Probability-Based Estimate

Section 13.8.1 uses elementary probability theory to compute the probability of matching birthdays under the assumption of a uniform distribution of birthdays. The same approach can be used for matching the day of week on which someone was born. For simplicity, let the acronym BDOW mean the "birth day of the week." A person's BDOW is one of the days Sunday, Monday, ..., Saturday. Number the people in the room 1 ... N. The probability that the ith person's BDOW does not coincide with any of the BDOW of the first i − 1 people is 1 − (i − 1)/7, i = 1,..., 7.

Let SN be the probability that no one in the room has a matching BDOW. Then

13.9.3.1 A Probability-Based Estimate

If RN is the probability that two or more people share a BDOW, then RN = 1 −SN.

As explained in Section 13.8.1, you can compute the quantity RN for a variety of values for N, as shown in the following statements:

/* compute vector of probabilities: R_N for N=1..7 */
maxN =7;                                      /* maximum value of N */
S = j(maxN, 1, 1);                            /* note that S[1] =1 */
do i = 2 to maxN;
   S[i] = S[i-1] * (1 - (i-1)/7);
end;
ProbR = 1 - S;
rlabl = 'N1':'N7';
print ProbR[rowname=rlabl];
Probability of a Shared BDOW, Assuming Uniform Distribution

Figure 13.22. Probability of a Shared BDOW, Assuming Uniform Distribution

Figure 13.22 shows how the probability of a matching BDOW depends on N, the number of people in the room, under the assumption that BDOWs are uniformly distributed. These estimates are lower bounds for the more general matching BDOW problem with a nonuniform distribution of BDOWs (Munford 1977).

13.9.3.2 A Simulation-Based Estimate

As shown in Figure 13.21, the BDOWs are not uniformly distributed. You can use simulation to determine how the departure from uniformity changes the probabilities shown in Figure 13.22.

The first step is to compute the proportion of the 2002 birthdays that occur for each day of the week. The following statements read the data and compute the empirical proportions:

/* compute proportion of the birthdays that occur for each DOW */
use Sasuser.Birthdays2002;
read all var {Births} into y;
read all var {DOW} into Group;
close Sasuser.Birthdays2002;
u = unique(Group);
GroupMeans = j(1, ncol(u));
do i = 1 to ncol(u);
   idx = loc(Group = u[i]);
   GroupMeans[i] = (y[idx])[:]; /* mean births for each day of week */
end;

proportions = GroupMeans/GroupMeans[+]; /* rescale so the sum is 1 */
labl = {"Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"};
print proportions[colname=labl format=5.3];
Proportion of Birthdays for each Day of the Week

Figure 13.23. Proportion of Birthdays for each Day of the Week

You can use the proportions shown in Figure 13.23 as the values for the prob argument of the SampleWithReplace module. That is, you can generate samples where the probability that a person has a birthday on Sunday is proportional to the number of babies who were born on Sunday in the 2002 data, and similarly for Monday, Tuesday, and so on.

The following statements use the proportions vector to simulate the probability of a matching BDOW in a room of N people who were born in the US in 2002, for N = 2, 3,..., 7:

/* simulate matching BDOW by using empirical proportions from 2002  */
load module=(SampleWithReplace);      /* not required in IMLPlus    */
p = proportions;                      /* probability of events      */
call randseed(123);

NumRooms = 1e5;                       /* number of simulated rooms  */
NumPeople = 1:7;                      /* number of people in room   */

SimR = j(7, 1, 0);                    /* est prob of matching BDOW  */
match = j(NumRooms, 1);               /* allocate results vector    */

do N = 2 to 7;                        /* for rooms with N people... */
   bdow = SampleWithReplace(1:7, NumRooms||N, p);       /* simulate */
   do j = 1 to NumRooms;
      u = unique(bdow[j,]);           /* number of unique BDOW      */
      match[j] = N -ncol(u);          /* number of common BDOW      */
   end;
   SimR[N] = (match>0)[:];            /* proportion of matches      */
end;

rlabel = "N1":"N7";
;print SimR[rowname=rlabel label="Estimate"];

The results are shown in Figure 13.24. The ith row of the SimR vector contains the estimated probability that there is at least one matching BDOW in a room of i people who were born in 2002.

Estimated Probability of Matching BDOWs (2002 Proportions)

Figure 13.24. Estimated Probability of Matching BDOWs (2002 Proportions)

13.9.3.3 Compare the Probability-based and Simulation-Based Estimates

You can compare the probability-based and simulation-based estimates of a matching BDOW. The simulation-based estimate is contained in the SimR vector; the probability-based estimate is contained in the ProbR vector shown in Figure 13.22. Figure 13.25 is a line plot that compares these quantitites. The results are almost identical. The solid line (which represents the probability-based estimates) is slightly below the dashed line (the simulation-based estimate).

Comparison of Matching BDOW for Equal and Unequal Probabilities (2002)

Figure 13.25. Comparison of Matching BDOW for Equal and Unequal Probabilities (2002)

This close agreement is somewhat surprising in view of Figure 13.21, which shows stark differences between births on weekdays and on weekends. Munford (1977) showed that any nonuniform distribution increases the likelihood of a matching BDOW. Consequently, the curve of probability-based estimates shown in Figure 13.24 is lower than the simulation-based curve. The difference between the two quantities is shown in Figure 13.26. The largest difference occurs in a room with four people. However, the difference in the estimated chance of a matching BDOW is at most 1.1%.

diff = SimR - ProbR;
print diff[rowname=rlabel];
Difference between Estimates

Figure 13.26. Difference between Estimates

In conclusion, although the probability-based estimates assume a uniform distribution of BDOWs, the probability estimates are not much different from the simulation-based estimates that use a distribution of BDOWs that is based on actual 2002 data.

13.9.4 The 2002 Matching Birthday Problem

Suppose a room contains N randomly chosen people born in the US in 2002. What is the probability of a matching birthday? The values shown in Figure 13.16 are estimates for the probabilities. Recall that Figure 13.16 was created for a probability-based model that assumes a uniform distribution of birthdays.

You can also simulate the matching birthday problem. The ideas and the programs that are described in the previous section extend directly to the matching birthday problem:

  • The probabilities used for the simulation are the proportions of the total birthdays that occurred for each day in 2002.

  • Samples are drawn (with replacement) from the set f1; {2,.... 365}.

  • The number of people in the room varies from 2 to 50.

The algorithm for simulating matching birthdays is the same as for simulating matching BDOWs. However, the birthday problem is much bigger. The sample space contains 365 elements, and you need to generate many rooms that each contain between 2 and 50 people. Whereas each simulation that generates Figure 13.25 is almost instantaneous, the corresponding simulations for the matching birthday problem take longer. Consequently, the following program uses only 104 rooms for each of the 49 simulations:

/* simulate matching birthdays by using empirical proportions from 2002 */
use Sasuser.Birthdays2002;
read all var {Births};
close Sasuser.Birthdays2002;
call randseed(123);
load module=SampleWithReplace;        /* not required in IMLPlus    */
p = Births/Births[+];                 /* probability of events      */
NumRooms = 1e4;                       /* number of simulated rooms  */
maxN = 50;
NumPeople = 1:maxN;                   /* number of people in room   */

SimR = j(maxN, 1, 0);
match = j(NumRooms, 1);               /* allocate results vector    */

do N = 2 to maxN;                     /* for rooms with N people... */
   bday = SampleWithReplace(1:365, NumRooms||N, p);     /* simulate */
   do j = 1 to NumRooms;
      u = unique(bday[j,]);           /* number of unique birthdays */
      match[j] = N - ncol(u);         /* number of common birthdays */
   end;
   /* estimated prob of >= 1 matching birthdays */
   SimR[N] = (match>0)[:];
end;

You can plot the estimates from the following simulation on the same graph as the probability-based estimates shown in Figure 13.16. The results are shown in Figure 13.27.

Comparison of Probability Estimates for Matching Birthdays

Figure 13.27. Comparison of Probability Estimates for Matching Birthdays

The results are similar to the matching BDOW problem. Once again, the solid line (which represents the probability-based estimates) is slightly below the dashed line (the simulation-based estimate). The largest difference occurs in the room with 25 people; the difference is 2.6%. This close agreement between the simulation-based estimates and the probability-based estimates is surprising, given the magnitude of the departures from uniformity shown in Figure 13.20. The dashed curve for the simulation-based estimates is jagged because each point on the curve is based on only 104 simulated rooms.

13.10 Calling C Functions from SAS/IML Studio

In most cases, you can use the techniques presented in this chapter to implement simulations that are written entirely in the SAS/IML language. As the examples in this chapter demonstrate, the SAS/IML language enables you to simulate complex events such as dice games and the matching birthday problem. However, as indicated in Section 13.9.4, some simulation problems require generating many millions of samples. Furthermore, the analysis that you perform on each sample might be very time consuming. (In Section 13.9.4, the "analysis" consists of simply counting the number of matching birthdays.) In circumstances such as these, you might decide to write a portion of your program in the C programming language, compile the code into a Windows DLL, and call the C function from within SAS/IML Studio.

The process of how to compile and call a C function from a Windows DLL is explained in the SAS/IML Studio online Help. To view the documentation in the SAS/IML Studio Help:

  1. Display the help by selecting Help13.10 Calling C Functions from SAS/IML Studio Help Topics from the SAS/IML Studio main menu.

  2. Expand the link for the SAS/IML Studio Help topic.

  3. Expand the link for the chapter "Extending IMLPlus."

  4. Expand the link for the section "Using C."

Figure 13.28 shows the topics in the "Using C" portion of the documentation. Notice that there is also a section entitled "Using FORTRAN," which explains how to compile a FORTRAN function into a DLL and how to call it from an IMLPlus program.

The SAS/IML Studio Help

Figure 13.28. The SAS/IML Studio Help

13.11 References

[bibch13_01] C. M. Grinstead, and J. L. Snell, (1997), Introduction to Probability, Second revised Edition, Providence, RI: American Mathematical Society.

[bibch13_02] S. L. Lohr, (2009), Sampling: Design and Analysis, Second Edition, Pacific Grove, CA: Duxbury Press.

[bibch13_03] J. A. Martin,, B. E. Hamilton,, P. D. Sutton,, S. J. Ventura,, F. Menacker,, and S. Kirmeyer, (2006), "Births: Final Data for 2004," National Vital Statistics Reports, 55(1). URL http://www.cdc.gov/nchs/data/nvsr/nvsr55/nvsr55_01.pdf

[bibch13_04] A. G. Munford, (1977), "A Note on the Uniformity Assumption in the Birthday Problem," The American Statistician, 31(3), 119.

[bibch13_05] B. Trumbo,, E. Suess,, and C. Schupp, (2005), "Computing the Probabilities of Matching Birthdays," STATS: The Magazine for Students of Statistics, Spring(43), 3-7.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset