Chapter 18

Introducing Probability

IN THIS CHAPTER

Defining probability

Working with probability

Dealing with random variables and their distributions

Focusing on the binomial distribution

Throughout this book, I toss around the concept of probability because it’s the basis of hypothesis testing and inferential statistics. Most of the time, I represent probability as the proportion of area under part of a distribution. For example, the probability of a Type I error (a.k.a. α) is the area in a tail of the standard normal distribution or the t distribution.

In this chapter, I explore probability in greater detail, including random variables, permutations, and combinations. I examine probability’s fundamentals and applications and then zero in on a couple of specific probability distributions. Then, after telling you about probability concepts, I discuss probability-related Excel worksheet functions.

What Is Probability?

Most of us have an intuitive idea about what probability is all about. Toss a fair coin and you have a 50-50 chance it comes up “heads.” Toss a fair die (one of a pair of dice) and you have a one-in-six chance it comes up “2.”

If you wanted to be more formal in your definition, you’d most likely say something about all the possible things that could happen, and the proportion of those things you care about. Two things can happen when you toss a coin, and if you only care about one of them (heads), the probability of that event happening is one out of two. Six things can happen when you toss a die, and if you only care about one of them (2), the probability of that event happening is one out of six.

Experiments, trials, events, and sample spaces

Statisticians and others who work with probability refer to a process like tossing a coin or throwing a die as an experiment. Each time you go through the process, that’s a trial.

This might not fit your personal definition of an experiment (or of a trial, for that matter), but for a statistician, an experiment is any process that produces one of at least two distinct results (like heads or tails).

Another piece of the definition of an experiment: You can’t predict the result with certainty. Each distinct result is called an elementary outcome. Put a bunch of elementary outcomes together and you have an event. For example, with a die, the elementary outcomes 2, 4, and 6 make up the event “even number.”

Put all the possible elementary outcomes together and you’ve got yourself a sample space. The numbers 1, 2, 3, 4, 5, and 6 make up the sample space for a die. “Heads” and “tails” make up the sample space for a coin.

Sample spaces and probability

How does all this play into probability? If each elementary outcome in a sample space is equally likely, the probability of an event is

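$$\mathrm{pr}(\text{Event}) = \frac{\text{Number of elementary outcomes in the event}}{\text{Number of elementary outcomes in the sample space}}$$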

So the probability of tossing a die and getting an even number is

$$\mathrm{pr}(\text{Even number}) = \frac{3}{6} = .5$$

If the elementary outcomes are not equally likely, you find the probability of an event in a different way. First, you have to have some way of assigning a probability to each one. Then you add up the probabilities of the elementary outcomes that make up the event.

A couple of things to bear in mind about outcome probabilities: Each probability has to be between zero and one. All the probabilities of elementary outcomes in a sample space have to add up to 1.00.

How do you assign those probabilities? Sometimes you have advance information — such as knowing that a coin is biased toward coming up heads 60 percent of the time. Sometimes you just have to think through the situation to figure out the probability of an outcome.

Here’s a quick example of “thinking through.” Suppose a die is biased so that the probability of an outcome is proportional to the numerical label of the outcome: A 6 comes up six times as often as a 1, a 5 comes up five times as often as a 1, and so on. What is the probability of each outcome? All the probabilities have to add up to 1.00, and all the numbers on a die add up to 21 (1+2+3+4+5+6 = 21), so the probabilities are: pr(1) = 1/21, pr(2) = 2/21, … , pr(6) = 6/21.
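If you'd like to check this kind of "thinking through" outside of Excel, here's a minimal Python sketch of the biased-die example. The variable names (faces, biased_die, pr_even) are just illustrative, and exact fractions are used so that nothing gets lost to rounding.

```python
from fractions import Fraction

# Faces of the die; each face's probability is proportional to its label.
faces = range(1, 7)
total = sum(faces)  # 1 + 2 + 3 + 4 + 5 + 6 = 21

# Divide each label by the total so that the probabilities add up to 1.
biased_die = {face: Fraction(face, total) for face in faces}

# An event's probability is the sum over its elementary outcomes.
pr_even = sum(biased_die[face] for face in (2, 4, 6))

print(biased_die[6])  # 2/7 (that is, 6/21 in lowest terms)
print(pr_even)        # 4/7 (that is, 12/21 in lowest terms)
```

Notice that on this biased die, "even number" is no longer a 50-50 proposition.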

Compound Events

Some rules for dealing with compound events help you “think through.” A compound event consists of more than one event. It’s possible to combine events by either union or intersection (or both).

Union and intersection

On a toss of a fair die, what’s the probability of rolling a 1 or a 4? Mathematicians have a symbol for or. It’s called union, and it looks like this: ∪. Using this symbol, the probability of a 1 or a 4 is pr(1 ∪ 4).

In approaching this kind of probability, it’s helpful to keep track of the elementary outcomes. One elementary outcome is in each event, so the event “1 or 4” has two elementary outcomes. With a sample space of six outcomes, the probability is 2/6 or 1/3. Another way to calculate this is

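$$\mathrm{pr}(1 \cup 4) = \mathrm{pr}(1) + \mathrm{pr}(4) = \frac{1}{6} + \frac{1}{6} = \frac{2}{6} = \frac{1}{3}$$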

Here’s a slightly more involved one: What’s the probability of rolling a number between 1 and 3 or a number between 2 and 4?

Just adding the elementary outcomes in each event won’t get it done this time. Three outcomes are in the event “between 1 and 3,” and three are in the event “between 2 and 4.” The probability can’t be 3 + 3 divided by the six outcomes in the sample space, because that’s 1.00, leaving nothing for pr(5) and pr(6). For the same reason, you can’t just add the probabilities.

The challenge arises in the overlap of the two events. The elementary outcomes in “between 1 and 3” are 1, 2, and 3. The elementary outcomes in “between 2 and 4” are 2, 3, and 4. Two outcomes overlap: 2 and 3. In order to not count them twice, the trick is to subtract them from the total.

A couple of things will make life easier as I proceed. I abbreviate “between 1 and 3” as A and “between 2 and 4” as B. Also, I use the mathematical symbol for “overlap.” The symbol is ∩ and it’s called intersection.

Using the symbols, the probability of “between 1 and 3” or “between 2 and 4” is

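$$\mathrm{pr}(A \cup B) = \frac{(\text{Number of outcomes in } A) + (\text{Number of outcomes in } B) - (\text{Number of outcomes in } A \cap B)}{\text{Number of outcomes in the sample space}}$$
$$\mathrm{pr}(A \cup B) = \frac{3 + 3 - 2}{6} = \frac{4}{6} = \frac{2}{3}$$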

You can also work with the probabilities:

$$\mathrm{pr}(A \cup B) = \frac{3}{6} + \frac{3}{6} - \frac{2}{6} = \frac{4}{6} = \frac{2}{3}$$

The general formula is

$$\mathrm{pr}(A \cup B) = \mathrm{pr}(A) + \mathrm{pr}(B) - \mathrm{pr}(A \cap B)$$

Why was it okay to just add the probabilities together in the earlier example? Because pr(1 ∩ 4) is zero: It’s impossible to roll a 1 and a 4 in the same toss of a die. Whenever pr(A ∩ B) = 0, A and B are said to be mutually exclusive.
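One way to convince yourself of the union rule is to let a computer do the counting. The following Python sketch (not an Excel feature; the helper pr and the set names are my own illustrative choices) treats events as sets of elementary outcomes on a fair die.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one toss of a fair die
A = {1, 2, 3}                       # "between 1 and 3"
B = {2, 3, 4}                       # "between 2 and 4"

def pr(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# Count the union directly, then apply the general formula.
direct = pr(A | B)
by_formula = pr(A) + pr(B) - pr(A & B)

print(direct, by_formula)  # 2/3 2/3
print(pr({1} & {4}))       # 0, so "1" and "4" are mutually exclusive
```

Python's | and & operators on sets play the roles of ∪ and ∩.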

Intersection again

Imagine throwing a coin and rolling a die at the same time. These two experiments are independent because the result of one has no influence on the result of the other.

What’s the probability of getting Heads and a 4? You use the intersection symbol and write this as pr(Heads ∩ 4).

Start with the sample space. Table 18-1 lists all elementary outcomes.

TABLE 18-1 The Elementary Outcomes in the Sample Space for Throwing a Coin and Rolling a Die

Heads, 1    Tails, 1
Heads, 2    Tails, 2
Heads, 3    Tails, 3
Heads, 4    Tails, 4
Heads, 5    Tails, 5
Heads, 6    Tails, 6

As the table shows, 12 outcomes are possible. How many outcomes are in the event “Heads and 4”? Just one. So

$$\mathrm{pr}(\text{Heads} \cap 4) = \frac{1}{12}$$

You can also work with the probabilities:

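$$\mathrm{pr}(\text{Heads} \cap 4) = \mathrm{pr}(\text{Heads}) \times \mathrm{pr}(4) = \frac{1}{2} \times \frac{1}{6} = \frac{1}{12}$$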

In general, if A and B are independent,

$$\mathrm{pr}(A \cap B) = \mathrm{pr}(A) \times \mathrm{pr}(B)$$
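To see the counting approach and the multiplication approach agree, here's a small Python sketch that builds the coin-and-die sample space. It's only an illustration; the names coin, die, and sample_space are assumptions of mine, not anything from Excel.

```python
from fractions import Fraction
from itertools import product

coin = ["Heads", "Tails"]
die = [1, 2, 3, 4, 5, 6]

# Every coin result pairs with every die result: 2 x 6 = 12 outcomes.
sample_space = list(product(coin, die))

# Count the outcomes in the event "Heads and 4".
event = [outcome for outcome in sample_space if outcome == ("Heads", 4)]
by_counting = Fraction(len(event), len(sample_space))

# Multiply the separate probabilities, which is valid under independence.
by_multiplying = Fraction(1, len(coin)) * Fraction(1, len(die))

print(len(sample_space), by_counting, by_multiplying)  # 12 1/12 1/12
```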

Conditional Probability

In some circumstances, you narrow the sample space. For example, suppose I toss a die and I tell you the result is greater than 2. What’s the probability that it’s a 5?

Ordinarily, the probability of a 5 would be 1/6. In this case, however, the sample space isn’t 1, 2, 3, 4, 5, and 6. When you know the result is greater than 2, the sample space becomes 3, 4, 5, and 6. The probability of a 5 is now 1/4.

This is an example of conditional probability. It’s “conditional” because I’ve given a “condition” — the toss resulted in a number greater than 2. The notation for this is

$$\mathrm{pr}(5 \mid \text{Greater than 2})$$

The vertical line is shorthand for the word given, and you read that notation as “the probability of a 5 given Greater than 2.”

Working with the probabilities

In general, if you have two events A and B,

$$\mathrm{pr}(A \mid B) = \frac{\mathrm{pr}(A \cap B)}{\mathrm{pr}(B)}$$

as long as pr(B) isn’t zero.

For the intersection in the numerator on the right, this is not a case where you just multiply probabilities together. In fact, if you could do that, you wouldn’t have a conditional probability, because that would mean A and B are independent. If they’re independent, one event can’t be conditional on the other.

You have to think through the probability of the intersection. In a die, how many outcomes are in the event “5 ∩ Greater than 2”? Just one, so pr(5 ∩ Greater than 2) is 1/6, and

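$$\mathrm{pr}(5 \mid \text{Greater than 2}) = \frac{\mathrm{pr}(5 \cap \text{Greater than 2})}{\mathrm{pr}(\text{Greater than 2})} = \frac{1/6}{4/6} = \frac{1}{4}$$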
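The same calculation can be sketched in a few lines of Python, again treating events as sets. The function name pr_given is hypothetical; it simply applies the conditional probability rule and refuses to divide by a zero pr(B).

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
five = {5}
greater_than_2 = {x for x in sample_space if x > 2}

def pr(event):
    """Probability of an event on a fair die."""
    return Fraction(len(event), len(sample_space))

def pr_given(a, b):
    """pr(a | b) = pr(a and b) / pr(b); undefined when pr(b) is zero."""
    if pr(b) == 0:
        raise ValueError("pr(B) must not be zero")
    return pr(a & b) / pr(b)

print(pr_given(five, greater_than_2))  # 1/4
```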

The foundation of hypothesis testing

All the hypothesis testing I show you in previous chapters involves conditional probability. When you calculate a sample statistic, compute a statistical test, and then compare the test statistic against a critical value, you’re looking for a conditional probability. Specifically, you’re trying to find

$$\mathrm{pr}(\text{obtained test statistic or a more extreme value} \mid H_0 \text{ is true})$$

If that conditional probability is low (less than .05 in all the examples I show you in hypothesis-testing chapters), you reject H0.

Large Sample Spaces

When dealing with probability, it’s important to understand the sample space. In the examples I show you, the sample spaces are small. With a coin or a die, it’s easy to list all the elementary outcomes.

The world, of course, isn’t that simple. In fact, probability problems that live in statistics textbooks aren’t even that simple. Most of the time, sample spaces are large and it’s not convenient to list every elementary outcome.

Take, for example, rolling a die twice. How many elementary outcomes are in the sample space consisting of both tosses? You can sit down and list them, but it’s better to reason it out: Six possibilities for the first toss, and each of those six can pair up with six possibilities on the second. So the sample space has 6 × 6 = 36 possible elementary outcomes. (This is similar to the coin-and-die sample space in Table 18-1, where the sample space consists of 2 × 6 = 12 elementary outcomes. With 12 outcomes, it is easy to list them all in a table. With 36 outcomes, it starts to get … well … dicey.)

Events often require some thought, too. What’s the probability of rolling a die twice and totaling 5? You have to count the number of ways the two tosses can total 5, and then divide by the number of elementary outcomes in the sample space (36). You total a 5 by getting any of these pairs of tosses: 1 and 4, 2 and 3, 3 and 2, or 4 and 1. That totals four ways, and they don’t overlap (excuse me — intersect), so

$$\mathrm{pr}(5) = \frac{4}{36} = \frac{1}{9} \approx .11$$
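If listing 36 outcomes by hand sounds tedious, a short Python sketch can do the listing for you. This is only a check on the reasoning, with illustrative names (rolls, ways) that I've made up for the purpose.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first toss, second toss) of a fair die.
rolls = list(product(range(1, 7), repeat=2))

# Keep the pairs whose total is 5.
ways = [pair for pair in rolls if sum(pair) == 5]

pr_total_5 = Fraction(len(ways), len(rolls))

print(len(rolls))   # 36
print(ways)         # [(1, 4), (2, 3), (3, 2), (4, 1)]
print(pr_total_5)   # 1/9
```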

Listing all the elementary outcomes for the sample space is often a nightmare. Fortunately, shortcuts are available, as I show in the upcoming subsections. Because each shortcut quickly helps you count a number of items, another name for a shortcut is a counting rule.

Believe it or not, I just slipped one counting rule past you. A couple of paragraphs ago, I say that in two tosses of a die you have a sample space of 6 × 6 = 36 possible outcomes. This is the product rule: If $N_1$ outcomes are possible on the first trial of an experiment, and $N_2$ outcomes on the second trial, the number of possible outcomes is $N_1 N_2$. Each possible outcome on the first trial can associate with all possible outcomes on the second. What about three trials? That’s $N_1 N_2 N_3$.

Now for a couple more counting rules.

Permutations

Suppose you have to arrange five objects into a sequence. How many ways can you do that? For the first position in the sequence, you have five choices. After you make that choice, you have four choices for the second position. Then you have three choices for the third, two for the fourth, and one for the fifth. The number of ways is (5)(4)(3)(2)(1) = 120.

In general, the number of sequences of N objects is N(N-1)(N-2) … (2)(1). This kind of computation occurs fairly frequently in probability world, and it has its own notation, N! You don’t read this by screaming out “N” in a loud voice. Instead, it’s “N factorial.” By definition, 1! = 1, and 0! = 1.

Now for the good stuff. If you have to order the 26 letters of the alphabet, the number of possible sequences is 26!, a huge number. But suppose the task is to create 5-letter sequences so that no letter repeats in the sequence. How many ways can you do that? You have 26 choices for the first letter, 25 for the second, 24 for the third, 23 for the fourth, 22 for the fifth, and that’s it. So that’s (26)(25)(24)(23)(22). Here’s how that product is related to 26!:

$$(26)(25)(24)(23)(22) = \frac{26!}{21!}$$

Each sequence is called a permutation. In general, if you take permutations of N things r at a time, the notation is NPr (the P stands for permutation). The formula is

$${}_N P_r = \frac{N!}{(N-r)!}$$

Just for completeness, here’s another wrinkle. Suppose that I allow repetitions in these sequences of 5. That is, aabbc is a permissible sequence. In that case, the number of sequences is 26 × 26 × 26 × 26 × 26, or as mathematicians would say, “26 raised to the fifth power.” Or as mathematicians would write, “$26^5$.”

Combinations

In the preceding example, these sequences are different from one another: abcde, adbce, dbcae, and on and on and on. In fact, you could come up with 5! = 120 of these different sequences just for the letters a, b, c, d, and e.

Suppose that I add the restriction that one of these sequences is no different from another, and all I’m concerned about is having sets of five nonrepeating letters in no particular order. Each set is called a combination. For this example, the number of combinations is the number of permutations divided by 5!:

$$\frac{{}_{26}P_5}{5!} = \frac{26!}{5!\,21!} = 65780$$

In general, the notation for combinations of N things taken r at a time is NCr (the C stands for combination). The formula is

$${}_N C_r = \frac{N!}{r!\,(N-r)!}$$

Now for that completeness wrinkle again. Suppose that I allow repetitions in these sets. How many sets would I have? It turns out to be equivalent to N+r-1 things taken N-1 at a time, or $_{N+r-1}C_{N-1}$. For this example, that would be $_{30}C_{25}$.
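The counting rules in this section are easy to verify with Python's standard math module (math.perm and math.comb require Python 3.8 or later). This sketch is just a cross-check on the arithmetic for the 26-letter example; N and r are the same symbols used in the formulas.

```python
import math

N, r = 26, 5   # 26 letters, taken 5 at a time

print(math.factorial(5))            # 120 orderings of five distinct letters
print(math.perm(N, r))              # 7893600 permutations, no repeats
print(N ** r)                       # 11881376 sequences, repeats allowed
print(math.comb(N, r))              # 65780 combinations, no repeats
print(math.comb(N + r - 1, N - 1))  # 142506 combinations, repeats allowed
```

Dividing the permutation count by 5! gives the combination count, which matches the reasoning above.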

Worksheet Functions

Excel provides functions that help you with factorials, permutations, and combinations.

FACT

FACT, which computes factorials, is surprisingly not categorized as Statistical. Instead, you’ll find it on the Math & Trig Functions menu. It’s easy to use. Supply it with a number, and it returns the factorial. Here are the steps:

  1. Select a cell for FACT's answer.
  2. From the Math & Trig Functions menu, select FACT to open its Function Arguments dialog box.
  3. In the Function Arguments dialog box, enter the appropriate value for the argument.

    In the Number box, I typed the number whose factorial I want to compute.

    The answer appears in the dialog box. If I enter 5, for example, 120 appears.

  4. Click OK to put the answer into the selected cell.

PERMUT and PERMUTATIONA

You find these two on the Statistical Functions menu. As its name suggests, PERMUT enables you to calculate NPr. Here’s how to use it to find 26P5, the number of 5-letter sequences (no repeating letters) that you can create from the 26 letters of the alphabet. In a permutation, remember, abcde is considered different from bcdae. Follow these steps:

  1. Select a cell for PERMUT's answer.
  2. From the Statistical Functions menu, select PERMUT to open its Function Arguments dialog box. (See Figure 18-1.)
  3. In the Function Arguments dialog box, type the appropriate values for the arguments.

    In the Number box, I entered the N in NPr. For this example, N is 26.

    In the Number_chosen box, I entered the r in NPr. That would be 5.

    With values entered for both arguments, the answer appears in the dialog box. For this example, the answer is 7893600.

  4. Click OK to put the answer into the selected cell.

FIGURE 18-1: The Function Arguments dialog box for PERMUT.

PERMUTATIONA does the same thing, but with repetitions allowed. Its Function Arguments dialog box looks exactly like the one for PERMUT. Its answer is equivalent to $N^r$. For this example, by the way, that answer is 11881376.

COMBIN and COMBINA

COMBIN works pretty much the same way as PERMUT. Excel categorizes COMBIN and COMBINA as Math & Trig functions.

Here’s how you use them to find 26C5, the number of ways to construct a 5-letter sequence (no repeating letters) from the 26 letters of the alphabet. In a combination, abcde is considered equivalent to bcdae.

  1. Select a cell for COMBIN’s answer.
  2. From the Math & Trig Functions menu, select COMBIN to open its Function Arguments dialog box.
  3. In the Function Arguments dialog box, type the appropriate values for the arguments.

    In the Number box, I entered the N in NCr. Once again, N is 26.

    In the Number_chosen box, I entered the r in NCr. And again, r is 5.

    With values entered for both arguments, the answer appears in the dialog box. For this example, the answer is 65780.

  4. Click OK to put the answer into the selected cell.

If you allow repetitions, use COMBINA. Its Function Arguments dialog box looks just like COMBIN’s. For this example, its answer is equivalent to 30C25 (142506).

Random Variables: Discrete and Continuous

Return to tosses of a fair die, where six elementary outcomes are possible. If I use x to refer to the result of a toss, x can be any whole number from 1 to 6. Because x can take on a set of values, it’s a variable. Because x’s possible values correspond to the elementary outcomes of an experiment (meaning you can’t predict its values with absolute certainty), x is called a random variable.

Random variables come in two varieties. One variety is discrete, of which die-tossing is a good example. A discrete random variable can take on only what mathematicians like to call a countable number of values — like the numbers 1 through 6. Values between the whole numbers 1 through 6 (like 1.25 or 3.1416) are impossible for a random variable that corresponds to the outcomes of die-tosses.

The other kind of random variable is continuous. A continuous random variable can take on an infinite number of values. Temperature is an example. Depending on the precision of a thermometer, having temperatures like 34.516 degrees is possible.

Probability Distributions and Density Functions

Back to die-tossing again. Each value of the random variable x (1–6, remember) has a probability. If the die is fair, each probability is 1/6. Pair each value of a discrete random variable like x with its probability, and you have a probability distribution.

Probability distributions are easy enough to represent in graphs. Figure 18-2 shows the probability distribution for x.

FIGURE 18-2: The probability distribution for x, a random variable based on the tosses of a fair die.

A random variable has a mean, a variance, and a standard deviation. Calculating these parameters is pretty straightforward. In the random-variable world, the mean is called the expected value, and the expected value of random variable x is abbreviated as E(x). Here’s how you calculate it:

$$E(x) = \sum x \, \mathrm{pr}(x)$$

For the probability distribution in Figure 18-2, that’s

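$$E(x) = (1)\left(\frac{1}{6}\right) + (2)\left(\frac{1}{6}\right) + (3)\left(\frac{1}{6}\right) + (4)\left(\frac{1}{6}\right) + (5)\left(\frac{1}{6}\right) + (6)\left(\frac{1}{6}\right) = 3.5$$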

The variance of a random variable is often abbreviated as V(x), and the formula is

$$V(x) = \sum x^2 \, \mathrm{pr}(x) - [E(x)]^2$$

Working with the probability distribution in Figure 18-2 once again,

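$$V(x) = (1^2)\left(\frac{1}{6}\right) + (2^2)\left(\frac{1}{6}\right) + (3^2)\left(\frac{1}{6}\right) + (4^2)\left(\frac{1}{6}\right) + (5^2)\left(\frac{1}{6}\right) + (6^2)\left(\frac{1}{6}\right) - (3.5)^2 = 2.917$$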

The standard deviation is the square root of the variance, which in this case is 1.708.
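Here's a hedged Python sketch of these calculations for the fair die. It assumes the usual computational form of the variance (the expected value of x squared, minus the square of the expected value); the dictionary name dist is just a convenient label for the probability distribution.

```python
from fractions import Fraction
import math

# Probability distribution of a fair die: each value has probability 1/6.
dist = {x: Fraction(1, 6) for x in range(1, 7)}

# Expected value: sum of each value times its probability.
expected = sum(x * p for x, p in dist.items())

# Variance: sum of x squared times probability, minus the squared mean.
variance = sum(x**2 * p for x, p in dist.items()) - expected**2

standard_deviation = math.sqrt(variance)

print(expected, variance, round(standard_deviation, 3))  # 7/2 35/12 1.708
```

The exact variance, 35/12, is the 2.917 you get when you round to three decimal places.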

For continuous random variables, things get a little trickier. You can’t pair a value with a probability, because you can’t really pin down a value. Instead, you associate a continuous random variable with a mathematical rule (an equation) that generates probability density, and the distribution is called a probability density function. To calculate the mean and variance of a continuous random variable, you need calculus.

In Chapter 8, I show you a probability density function — the standard normal distribution. I reproduce it here as Figure 18-3.

FIGURE 18-3: The standard normal distribution: a probability density function.

In the figure, f(x) represents the probability density. Because probability density can involve some heavyweight mathematical concepts, I won’t go into it. As I mention in Chapter 8, think of probability density as something that turns the area under the curve into probability.

Although you can’t speak of the probability of a specific value of a continuous random variable, you can work with the probability of an interval. To find the probability that the random variable takes on a value within an interval, you find the proportion of the total area under the curve that’s inside that interval. Figure 18-3 shows this. The probability that x is between 0 and 1σ is .3413.

For the rest of this chapter, I deal only with discrete random variables. A specific one is up next.

The Binomial Distribution

Imagine an experiment that has these five characteristics:

  • The experiment consists of N identical trials.

    A trial could be a toss of a die or a toss of a coin.

  • Each trial results in one of two elementary outcomes.

    It’s standard to call one outcome a success and the other a failure. For die-tossing, a success might be a toss that comes up 3, in which case a failure is any other outcome.

  • The probability of a success remains the same from trial to trial.

    Again, it’s pretty standard to use p to represent the probability of a success, and 1-p (or q) to represent the probability of a failure.

  • The trials are independent.
  • The discrete random variable x is the number of successes in the N trials.

This type of experiment is called a binomial experiment. The probability distribution for x follows this rule:

$$\mathrm{pr}(x) = \frac{N!}{x!\,(N-x)!}\, p^x (1-p)^{N-x}$$

On the extreme right, $p^x(1-p)^{N-x}$ is the probability of one combination of x successes in N trials. The term to its immediate left is NCx, the number of possible combinations of x successes in N trials.

This is called the binomial distribution. You use it to find probabilities like the probability you’ll get four 3’s in ten tosses of a die:

$$\mathrm{pr}(4) = \frac{10!}{4!\,6!} \left(\frac{1}{6}\right)^4 \left(\frac{5}{6}\right)^6 = .054$$
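The binomial rule translates almost word for word into Python. The function below is a minimal sketch with a name of my own choosing (binomial_pr); it isn't how Excel computes the value internally, just a direct application of the formula.

```python
import math

def binomial_pr(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    ways = math.comb(n, x)                 # number of combinations, NCx
    one_way = p**x * (1 - p)**(n - x)      # probability of one combination
    return ways * one_way

# Four 3's in ten tosses of a fair die.
pr_4 = binomial_pr(4, 10, 1/6)
print(round(pr_4, 3))  # 0.054
```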

The negative binomial distribution is closely related. In this distribution, the random variable is the number of failures before the xth success. For example, you use the negative binomial to find the probability of five tosses that result in anything but a 3 before the fourth time you roll a 3.

For this to happen, in the eight tosses before the fourth 3, you have to get five non-3’s and three successes (tosses when a 3 comes up). Then the next toss results in a 3. The probability of a combination of four successes and five failures is $p^4(1-p)^5$. The number of ways you can have a combination of five failures and four-minus-one successes is $_{5+4-1}C_{4-1}$. So the probability is

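$$\mathrm{pr}(\text{5 failures before the 4th success}) = \frac{8!}{3!\,5!} \left(\frac{1}{6}\right)^4 \left(\frac{5}{6}\right)^5 = .017$$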

In general, the negative binomial distribution (sometimes called the Pascal distribution) is

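$$\mathrm{pr}(f \text{ failures before the } x\text{th success}) = \frac{(f+x-1)!}{(x-1)!\,f!}\, p^x (1-p)^f$$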
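As a cross-check on the die example, here's a small Python sketch of the negative binomial rule. The function name negative_binomial_pr and its argument names are hypothetical; f is the number of failures and x is the number of successes, as in the text.

```python
import math

def negative_binomial_pr(f, x, p):
    """Probability of exactly f failures before the x-th success."""
    # Arrange the f failures and the first x - 1 successes in any order;
    # the final trial must be the x-th success.
    ways = math.comb(f + x - 1, x - 1)
    return ways * p**x * (1 - p)**f

# Five non-3's before the fourth 3 on a fair die.
pr_example = negative_binomial_pr(5, 4, 1/6)
print(round(pr_example, 3))  # 0.017
```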

Worksheet Functions

These distributions are computation intensive, so I get to the worksheet functions right away.

BINOM.DIST and BINOM.DIST.RANGE

These are Excel’s worksheet functions for the binomial distribution. Use BINOM.DIST to calculate the probability of getting four 3’s in ten tosses of a fair die:

  1. Select a cell for BINOM.DIST’s answer.
  2. From the Statistical Functions menu, select BINOM.DIST to open its Function Arguments dialog box. (See Figure 18-4.)
  3. In the Function Arguments dialog box, type the appropriate values for the arguments.

    In the Number_s box, I entered the number of successes. For this example, the number of successes is 4.

    In the Trials box, I entered the number of trials. The number of trials is 10.

    In the Probability_s box, I entered the probability of a success. I entered 1/6, the probability of a 3 on a toss of a fair die.

    In the Cumulative box, one possibility is FALSE for the probability of exactly the number of successes entered in the Number_s box. The other is TRUE for the probability of getting that number of successes or fewer. I entered FALSE.

    With values entered for all the arguments, the answer appears in the dialog box.

  4. Click OK to put the answer into the selected cell.

FIGURE 18-4: The BINOM.DIST Function Arguments dialog box.

To give you a better idea of what the binomial distribution looks like, I use BINOM.DIST (with FALSE entered in the Cumulative box) to find pr(0) through pr(10), and then I use Excel’s graphics capabilities (refer to Chapter 3) to graph the results. Figure 18-5 shows the data and the graph.

FIGURE 18-5: The binomial distribution for x successes in ten tosses of a die, with p = 1/6.

Incidentally, if you type TRUE in the Cumulative box, the result is .984 (and some more decimal places), which is pr(0) + pr(1) + pr(2) + pr(3) + pr(4).

Figure 18-5 is helpful if you want to find the probability of getting between four and six successes in ten trials. Find pr(4), pr(5), and pr(6) and add the probabilities.

A much easier way, especially if you don’t have a chart like Figure 18-5 handy or if you don’t want to apply BINOM.DIST three times, is to use BINOM.DIST.RANGE. Figure 18-6 shows the dialog box for this function, supplied with values for the arguments. After all the arguments are entered, the answer (0.069460321) appears in the dialog box.

FIGURE 18-6: The Function Arguments dialog box for BINOM.DIST.RANGE.

Tip: If you don’t put a value in the Number_s2 box, BINOM.DIST.RANGE returns the probability of whatever you entered into the Number_s box. If you don’t put a value in the Number_s box, the function returns the probability of, at most, the number of successes in the Number_s2 box (that is, the cumulative probability).
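To see where the cumulative and range probabilities come from, you can add up individual binomial probabilities yourself. This Python sketch is an illustration under my own naming (binomial_pr, binomial_range); it mimics the idea of summing pr(x) over a range, not Excel's internal method.

```python
import math

def binomial_pr(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_range(n, p, low, high):
    """Probability that the number of successes is between low and high."""
    return sum(binomial_pr(x, n, p) for x in range(low, high + 1))

at_most_4 = binomial_range(10, 1/6, 0, 4)     # pr(0) + ... + pr(4)
four_to_six = binomial_range(10, 1/6, 4, 6)   # pr(4) + pr(5) + pr(6)

print(f"{at_most_4:.4f}")    # 0.9845
print(f"{four_to_six:.6f}")  # 0.069460
```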

NEGBINOM.DIST

As its name suggests, NEGBINOM.DIST handles the negative binomial distribution. I use it here to work out the earlier example — the probability of getting five failures (tosses that result in anything but a 3) before the fourth success (the fourth 3). Here are the steps:

  1. Select a cell for NEGBINOM.DIST's answer.
  2. From the Statistical Functions menu, select NEGBINOM.DIST to open its Function Arguments dialog box. (See Figure 18-7.)
  3. In the Function Arguments dialog box, type the appropriate values for the arguments.

    In the Number_f box, I entered the number of failures. The number of failures is 5 for this example.

    In the Number_s box, I entered the number of successes. For this example, that’s 4.

    In the Probability_s box, I entered 1/6, the probability of a success.

    In the Cumulative box, I entered FALSE. This gives the probability of exactly that number of failures before the specified number of successes. If I enter TRUE, the result is the probability of at most that number of failures.

    With values entered for all the arguments, the answer appears in the dialog box. The answer is 0.017 and some additional decimal places.

  4. Click OK to put the answer into the selected cell.

FIGURE 18-7: The NEGBINOM.DIST Function Arguments dialog box.

Hypothesis Testing with the Binomial Distribution

Hypothesis tests sometimes involve the binomial distribution. Typically, you have some idea about the probability of a success, and you put that idea into a null hypothesis. Then you perform N trials and record the number of successes. Finally, you compute the probability of getting that many successes or a more extreme amount if your H0 is true. If the probability is low, reject H0.

When you test in this way, you’re using sample statistics to make an inference about a population parameter. Here, that parameter is the probability of a success in the population of trials. By convention, Greek letters represent parameters. Statisticians use π (pi), the Greek equivalent of p, to stand for the probability of a success in the population.

Continuing with the die-tossing example, suppose you have a die and you want to test whether or not it’s fair. You suspect that if it’s not, it’s biased toward 3. Define a toss that results in 3 as a success. You toss it ten times. Four tosses are successes. Casting all this into hypothesis-testing terms:

H0: π ≤ 1/6

H1: π > 1/6

As I usually do, I set α = .05.

To test these hypotheses, you have to find the probability of getting at least four successes in ten tosses with p = 1/6. That probability is pr(4) + pr(5) + pr(6) + pr(7) + pr(8) + pr(9) + pr(10). If the total is less than .05, reject H0.

That’s a lot of calculating. You can use BINOM.DIST to take care of it all (as I did when I set up the worksheet shown earlier in Figure 18-5), or you can take a different route. You can find a critical value for the number of successes, and if the number of successes is greater than the critical value, reject H0.

How do you find the critical value? You can use a convenient worksheet function that I’m about to show you.

BINOM.INV

This function is tailor-made for binomial-based hypothesis testing. Give BINOM.INV the number of trials, the probability of a success, and a criterion cumulative probability. BINOM.INV returns the smallest value of x (the number of successes) for which the cumulative probability is greater than or equal to the criterion.

Here are the steps for the hypothesis testing example I just showed you:

  1. Select a cell for BINOM.INV’s answer.
  2. From the Statistical Functions menu, select BINOM.INV and click OK to open its Function Arguments dialog box. (See Figure 18-8.)
  3. In the Function Arguments dialog box, enter the appropriate values for the arguments.

    In the Trials box, I entered 10, the number of trials.

    In the Probability_s box, I entered the probability of a success. In this example it’s 1/6, the value of π according to H0.

    In the Alpha box, I entered the cumulative probability to exceed. I entered .95 because I want to find the critical value that cuts off the upper 5 percent of the binomial distribution.

    With values entered for the arguments, the critical value, 4, appears in the dialog box.

  4. Click OK to put the answer into the selected cell.

FIGURE 18-8: The BINOM.INV Function Arguments dialog box.

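As it happens, the critical value is equal to the number of successes in the sample. Because four successes is not greater than the critical value (the probability of at least four successes is about .07, which is more than .05), the decision is to not reject H0.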
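The rule that BINOM.INV follows, as described above, can be sketched in Python by accumulating binomial probabilities until the criterion is reached. The function name smallest_x_reaching is hypothetical, and this is only a sketch of the stated rule, not Excel's actual implementation.

```python
import math

def binomial_pr(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def smallest_x_reaching(n, p, criterion):
    """Smallest x whose cumulative probability is >= the criterion."""
    cumulative = 0.0
    for x in range(n + 1):
        cumulative += binomial_pr(x, n, p)
        if cumulative >= criterion:
            return x
    return n  # guards against floating-point shortfall when criterion is 1

critical_value = smallest_x_reaching(10, 1/6, 0.95)
print(critical_value)  # 4
```

The cumulative probability through three successes is about .930, which falls short of .95, so the function moves on to four.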

More on hypothesis testing

In some situations, the normal distribution approximates the binomial distribution. When this happens, you use the statistics of the normal distribution to answer questions about the binomial distribution.

Those statistics involve z-scores, which means that you have to know the mean and the standard deviation of the binomial. Fortunately, they’re easy to compute. If N is the number of trials and π is the probability of a success, the mean is

$$\mu = N\pi$$

the variance is

$$\sigma^2 = N\pi(1-\pi)$$

and the standard deviation is

$$\sigma = \sqrt{N\pi(1-\pi)}$$

The normal approximation to the binomial is appropriate when N π ≥ 5 and N(1 – π) ≥ 5.

When you test a hypothesis, you’re making an inference about π, and you have to start with an estimate. You run N trials and get x successes. The estimate is

$$P = \frac{x}{N}$$

To create a z-score, you need one more piece of information — the standard error of P. This sounds harder than it is, because this standard error is just

$$\sigma_P = \sqrt{\frac{\pi(1-\pi)}{N}}$$

Now you’re ready for a hypothesis test.

Here’s an example. The CEO of FarKlempt Robotics, Inc., believes that 50 percent of FarKlempt robots are purchased for home use. A sample of 1,000 FarKlempt customers indicates that 550 of them use their robots at home. Is this significantly different from what the CEO believes? The hypotheses:

H0: π = .50

H1: π ≠ .50

I set α = .05

N π = 500, and N(1 - π) = 500, so the normal approximation is appropriate.

First, calculate P:

$$P = \frac{550}{1000} = .55$$

Now create a z-score:

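$$z = \frac{P - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{N}}} = \frac{.55 - .50}{\sqrt{\dfrac{(.50)(1 - .50)}{1000}}} = 3.162$$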

With α = .05, is 3.162 a large enough z-score to reject H0? An easy way to find out is to use the worksheet function NORM.S.DIST. (See Chapter 8.) If you do, you’ll find that this z-score cuts off less than .01 of the area in the upper tail of the standard normal distribution. The decision is to reject H0.
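If you want to reproduce the FarKlempt calculation outside of Excel, here's a minimal Python sketch. It assumes the standard relationship between the normal upper-tail area and the complementary error function (math.erfc); the variable names are illustrative only.

```python
import math

n = 1000          # customers sampled
successes = 550   # customers who use their robots at home
pi_0 = 0.50       # probability of a success according to H0

p_hat = successes / n                               # the estimate, P
standard_error = math.sqrt(pi_0 * (1 - pi_0) / n)   # standard error of P
z = (p_hat - pi_0) / standard_error

# Area in the upper tail of the standard normal distribution beyond z.
upper_tail = 0.5 * math.erfc(z / math.sqrt(2))

print(round(z, 3))        # 3.162
print(upper_tail < 0.01)  # True
```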

The Hypergeometric Distribution

Here’s another distribution that deals with successes and failures.

I start with an example. In a set of 16 light bulbs, 9 are good and 7 are defective. If you randomly select 6 light bulbs out of these 16, what’s the probability that 3 of the 6 are good? Consider selecting a good light bulb as a “success.”

When you finish selecting, your set of selections is a combination of three of the nine good light bulbs together with a combination of three of the seven defective light bulbs. The probability of getting three good bulbs is a … well … combination of counting rules:

$$\mathrm{pr}(3) = \frac{({}_{9}C_{3})({}_{7}C_{3})}{{}_{16}C_{6}} = \frac{(84)(35)}{8008} = .367$$

Each outcome of the selection of the good light bulbs can associate with all outcomes of the selection of the defective light bulbs, so the product rule is appropriate for the numerator. The denominator (the sample space) is the number of possible combinations of 6 items in a group of 16.

This is an example of the hypergeometric distribution. In general, with a small population that consists of N1 successes and N2 failures, the probability of x successes in a sample of m items is

$$\mathrm{pr}(x) = \frac{({}_{N_1}C_{x})({}_{N_2}C_{m-x})}{{}_{N_1+N_2}C_{m}}$$

The random variable x is said to be a hypergeometrically distributed random variable.
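The light bulb example is easy to check with Python's math.comb. In this sketch, the function name hypergeometric_pr and its parameter names are my own; n1 and n2 stand for the numbers of successes and failures in the population, and m is the sample size.

```python
import math

def hypergeometric_pr(x, m, n1, n2):
    """Probability of x successes in a sample of m items drawn without
    replacement from a population of n1 successes and n2 failures."""
    favorable = math.comb(n1, x) * math.comb(n2, m - x)   # product rule
    possible = math.comb(n1 + n2, m)                      # sample space
    return favorable / possible

# 3 good bulbs in a sample of 6, from 9 good and 7 defective bulbs.
pr_3 = hypergeometric_pr(3, 6, 9, 7)
print(round(pr_3, 3))  # 0.367
```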

HYPGEOM.DIST

This function calculates everything for you when you deal with the hypergeometric distribution. Here’s how to use it to work through the preceding example:

  1. Select a cell for HYPGEOM.DIST’s answer.
  2. From the Statistical Functions menu, select HYPGEOM.DIST to open its Function Arguments dialog box. (See Figure 18-9.)
  3. In the Function Arguments dialog box, enter the appropriate values for the arguments.

    In the Sample_s box, I entered the number of successes in the sample. That number is 3 for this example.

    In the Number_sample box, I entered the number of items in the sample. The sample size for this example is 6.

    In the Population_s box, I entered the number of successes in the population. In this example that’s 9, the number of good light bulbs.

    In the Number_pop box, I entered the number of items in the population. The total number of light bulbs is 16, and that’s the population size.

    In the Cumulative box, I entered FALSE. This gives the probability of the number of successes I entered in the Sample_s box. If I enter TRUE, the function returns the probability of, at most, that number of successes (that is, the cumulative probability).

    With values entered for all the arguments, the answer appears in the dialog box. The answer is 0.367 and some additional decimal places.

  4. Click OK to put the answer into the selected cell.

FIGURE 18-9: The HYPGEOM.DIST Function Arguments dialog box.

As I do with the binomial, I use HYPGEOM.DIST to calculate pr(0) through pr(6) for this example. Then I use Excel’s graphics capabilities (refer to Chapter 3) to graph the results. Figure 18-10 shows the data and the chart. My objective is to help you visualize and understand the hypergeometric distribution.

FIGURE 18-10: The hypergeometric distribution for x successes in a six-item sample from a population that consists of nine successes and seven failures.
