Chapter 4: Probability Theory

Preview: Probability theory is the study of likelihoods that given events will occur. Probability theory plays a role in everything from operating a local casino to the techniques used to minimize side effects and negative outcomes in a medical setting.

No matter what the setting, probabilities can be represented by numbers between 0.0 and 1.0, where a probability of 0.0 means that there is no chance a given result will be achieved, while a probability of 1.0 means that the event will take place for certain. Probability theory often uses relative frequency to predict how often a given event will take place. That event can be anything, from the number of wins and losses for a soccer team to the number of times heads comes up in a coin toss. Probability theory also examines the distribution of all results, which often is represented via a bell-shaped curve. That curve, known as the normal distribution, plays a big role in probability theory and is an important concept for students to understand.

Learning Objectives: At the conclusion of this chapter, you should be able to:

Compute the expected value and variance of a probability distribution
Compute probabilities from binomial distributions
Solve business problems using binomial distributions
Compute probabilities from normal or continuous distributions
Solve business problems using normal or continuous distributions

Introduction

We will now switch gears and start involving probabilities in our discussions. Until now we talked about descriptive statistics, using numbers (mean, standard deviation), graphs (pie chart, histogram), or general concepts (skewed distribution) to describe data, whether from a population or from a sample. In subsequent chapters we want to introduce inferential statistics where we draw conclusions about a population based on properties of a sample and discuss the precision and accuracy of our conclusions in terms of probabilities. However, we will only use as much probability as necessary; we will not study probability theory in its own right here. This chapter will introduce the elements of probability theory that will be useful to us in subsequent chapters. Let us start with the basics.

Definition: A sample space S is the set of all possible outcomes of an experiment. An event is a subset of S. We will consider the probability of an event as the chance, or likelihood, that this event indeed takes place. All probabilities will be numbers between 0.0 and 1.0, inclusive, where a probability of 0 means that an event does not happen and a probability of 1.0 means that an event will happen for certain. We will often use the notation P(A) to denote the probability of event A occurring. The total probability of all events must be equal to 1.0, that is, P(S) = 1.

Sometimes a sample space, a set of possible events, and the probabilities assigned to each event are collectively called a probability space. We could make this more mathematically rigorous: a probability is a function that has as its domain a certain collection of sets that are subsets of some sample space S and associates with each set E ⊂ S a number between 0.0 and 1.0 so that the following properties are satisfied.

$P(\emptyset) = 0$ (probability of the empty set is zero) and $P(S) = 1$ (probability of all events is one).
$0 \le P(E) \le 1$ for every event E.
If $E = \bigcup_j E_j$ and all $E_j$ are mutually disjoint, then $P(E) = \sum_j P(E_j)$, that is, the probability of a union of disjoint sets equals the sum of the probabilities of each set.

These axioms are known as the Kolmogorov axioms, in honor of Andrei Kolmogorov, a famous Russian mathematician who lived from 1903 to 1987. If this more rigorous definition sounds somewhat abstract, you are right. If it actually sounds too abstract for comfort, very good! In a true probability theory course we would use this above abstract definition and then continue to derive various properties of it. But for this course we will be content with saying that probabilities of events are numbers between 0 and 1 that determine the likelihood of events occurring. That should sound much simpler. In many cases these probabilities are determined by counting or as proportions.

Example: Let us say our experiment consists of tossing a fair coin once. List the sample space. What is the probability of obtaining a head? Suppose our experiment consists of rolling a die. What is the probability of getting a 5 or a larger number? What is the probability of two dice adding to 4 when tossing them simultaneously? If we throw a dart randomly into a square with side length 1 m, what is the probability of landing in a circle of radius 10 cm in the middle of the square (the bull’s-eye)?

The first experiment consists of throwing a single coin. There are two possible outcomes, heads or tails (coins do not land on their side). Thus, the sample space S is {H, T }. Whenever all outcomes of an experiment are equally likely, we can compute probabilities simply by counting. We have for any event E:

$$P(E) = \frac{\#\ \text{elements in } E}{\#\ \text{elements in } S}$$

where S is the sample space, as usual. In tossing a coin, for example, there are two possible outcomes, head (H ) or tail (T ), both equally likely (if the coin is fair). Thus, our sample space is the set {H, T } and the probability of obtaining a head should be (# elements in {H })/(# elements in {H, T }), or 1/2. Another way of saying this is that the chance of a head in tossing a fair coin is 1 out of 2, which in mathematics simply means “1 divided by 2.” Thus: P({H }) = 0.5.

Similarly, for a die there are six possible outcomes, all equally likely. Thus, our sample space consists of the set S = {1, 2, 3, 4, 5, 6}, and the event of obtaining a number 5 or more is composed of the event of getting a 5 or a 6. Thus, the corresponding probability should be 2 out of 6, 2/6, or 1/3. In other words: P({5 or 6}) = 2/6 = 1/3 = 0.3333.

Next, if we throw two dice simultaneously, each could show a number from 1 to 6. If we record their sum, the sample space is S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. To compute the probabilities of these numbers occurring, we create a table where each entry inside the table denotes the sum of the die in that column and that row (see Table 4.1). Using that table to establish probabilities is again a simple exercise in counting: There are a total of 36 possible ways to throw two dice; we are interested in their sum being 4; from the table we see that there are three possible throws adding up to 4 (a 3 + 1, 2 + 2, and 1 + 3) so that our probability is 3 out of 36, or 3/36, which reduces to 1/12. Thus: P({sum of two dice = 4}) = 3/36 = 1/12 = 0.0833.

For the final example we need to compute areas. We assume that every dart thrown will land inside the large square. Then the chance of hitting the little circle at random is the ratio of the area of that circle over the area of the square (see Figure 4.1). Recall that the area of a square with side x is $x^2$ and the area of a circle with radius r is $\pi r^2$. With both lengths measured in meters (r = 0.1, x = 1), the probability of hitting a bull’s-eye at random is

$$P(\text{bull's-eye}) = \frac{\pi r^2}{x^2} = \frac{\pi (0.1)^2}{1^2} \approx 0.0314.$$

Table 4.1 Sum of two dice

	1	2	3	4	5	6
1	2	3	4	5	6	7
2	3	4	5	6	7	8
3	4	5	6	7	8	9
4	5	6	7	8	9	10
5	6	7	8	9	10	11
6	7	8	9	10	11	12

Figure 4.1 Darts thrown at a dart board

Note that the probability remains the same regardless of where the little circle is located inside the square (as long as it is completely inside the square). Also, by practicing, a dart player can significantly increase the probability of hitting a bull’s-eye; the preceding number refers to darts thrown randomly at the board.
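The counting rule used in these examples can be sketched in a few lines of Python. This is only an illustration: the helper name `prob` and the variable names are our own and do not appear in the text, and the rule only applies when all outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product
import math

def prob(event, sample_space):
    """P(E) = (# outcomes in E) / (# outcomes in S), valid for equally likely outcomes."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]
two_dice = list(product(die, repeat=2))  # all 36 ordered pairs (first die, second die)

p_head = prob(lambda o: o == "H", coin)
p_five_or_more = prob(lambda o: o >= 5, die)
p_sum_four = prob(lambda o: o[0] + o[1] == 4, two_dice)

# Dart example: ratio of areas in meters (radius 10 cm = 0.1 m, side 1 m)
p_bullseye = math.pi * 0.1**2 / 1.0**2

print(p_head, p_five_or_more, p_sum_four, round(p_bullseye, 4))
```

Using `Fraction` keeps the answers in the same reduced form as the text (1/2, 1/3, 1/12) rather than as rounded decimals.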

In more real-life experiments it may be too time-consuming or simply impossible to list all possible outcomes or to count the ones we are interested in, but we can instead use experimentation or relative frequencies to come up with approximate probabilities.

Example: Suppose we have a weighted coin. Find the probability of obtaining a head (H) in one toss of the coin.

Since the coin is weighted, the chance of getting one head is no longer 50–50. So we toss the coin 100 times and find 71 heads (and consequently 29 tails). Thus, we proclaim that

P({H }) = 0.71 and consequently P({T }) = 1 − 0.71 = 0.29.

Example: Suppose that a (hypothetical) frequency distribution for the age of people in a survey is as shown in Table 4.2. What is the missing probability? What is the chance that a randomly selected person is 40 years or younger?

Table 4.2 Relative frequencies for age

Category	Probability
0–18	0.15
19–40	0.25
41–65
66 and older	0.3

Here we simply used decimal numbers instead of percentages, that is, the entry in the first row means that 15 percent of the people in the survey were between 0 and 18 years old. One number is missing in the table but since probabilities have to add up to 1.0, the missing number is 1.0 − (0.15 + 0.25 + 0.3) = 0.3.

The event of being 40 years or younger means that a person is either in the 0 to 18 category, with probability 0.15, or in the 19 to 40 category, with probability 0.25. Therefore, the total probability of a person being 40 years or younger is 0.15 + 0.25 = 0.40, or equivalently 40 percent.
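As a small sketch of the same arithmetic, the table can be stored as a Python dictionary. The dictionary layout and the use of `None` for the missing entry are our own choices for illustration.

```python
# Relative frequencies from Table 4.2; None marks the missing entry
age_probs = {"0-18": 0.15, "19-40": 0.25, "41-65": None, "66 and older": 0.3}

# All probabilities must add up to 1.0, so the missing one is the remainder
known = sum(p for p in age_probs.values() if p is not None)
age_probs["41-65"] = round(1.0 - known, 2)

# "40 years or younger" is the union of the first two (disjoint) categories
p_40_or_younger = round(age_probs["0-18"] + age_probs["19-40"], 2)

print(age_probs["41-65"], p_40_or_younger)
```

The calls to `round` only hide floating-point noise; they do not change the result.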

It is often helpful to consider probabilities in relation to frequency histograms graphically.

Example: The following data set consists of a number of variables related to the health records of 40 female patients, randomly selected. Construct a frequency histogram for the height of the 40 patients, including a chart. Then use that histogram to find:

The probability, approximately, that a woman is 60 in. or shorter
The probability, approximately, that a woman is 65 in. or taller
The probability, approximately, that a woman is between 60 and 65 in. tall

For each question, shade the part of the histogram chart that you used to answer the question.

Data file: www.betterbusinessdecisions.org/data/health_female.xls

Figure 4.2 Relative frequency distribution of height

We construct a frequency histogram using the appropriate Analysis ToolPak procedure as described in Chapter 2. We have manually specified the bin boundaries and modified the histogram table slightly to clarify the bin boundaries. We also computed the relative frequency for each row, defined as the number in that row divided by the total number of observations. The results are shown in Figure 4.2.

Using this chart it is now easy to answer the questions. Note that our bin boundaries do not exactly correspond to the boundaries posed in the questions, but we can use the closest bin boundary available to get the approximately right answer.

P(a woman is 60 in. or less) = (1 + 1 + 3)/40 = (0.025 + 0.025 + 0.075) = 0.125
P(a woman is 65 in. or more) = (3 + 7)/40 = (0.075 + 0.175) = 0.25
P(a woman is between 60 and 65 in.) = (6 + 8 + 11)/40 = (0.15 + 0.2 + 0.275) = 0.625
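The same relative-frequency calculation can be sketched in Python. The bin counts are the ones used in the three calculations above; the order of the counts within each group is an assumption on our part (it does not affect the sums), and the bin boundaries themselves are not reproduced here.

```python
# Bin counts taken from the calculations above, shortest heights first.
# The order inside each group of bins is assumed; only the group sums matter.
counts = [1, 1, 3, 6, 8, 11, 3, 7]
total = sum(counts)  # 40 patients

p_60_or_less = sum(counts[:3]) / total   # three leftmost bins
p_60_to_65 = sum(counts[3:6]) / total    # three middle bins
p_65_or_more = sum(counts[6:]) / total   # two rightmost bins

print(p_60_or_less, p_60_to_65, p_65_or_more)
```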

To illustrate these probabilities, we have shaded the respective portions in Figure 4.3.

To be sure, our probabilities are only approximate because the bin boundaries do not exactly match the questions. In addition, we have not really computed that the probability of a woman in general being between 60 and 65 in. tall is 62.5 percent. Instead, we computed that the probability that a randomly selected woman from our sample of 40 women is between 60 and 65 in. tall is 62.5 percent. But if the entire sample was in turn truly randomly selected, then it is a fair guess that the probability of any woman being between 60 and 65 in. tall is 62.5 percent, or, phrased differently, that 62.5 percent of all women are between 60 and 65 in. tall. Of course, we have generalized from the women in our sample to the set of all women, which might seem reasonable, but the big question is whether such an inference really works and how well it works. We will tackle that in the next chapter; first we need more background information.

Figure 4.3 Relevant categories of the probability distribution

The Normal Distribution

If you compute a lot of frequency histograms and their associated charts you will notice that most of them differ in detail but have somewhat similar shapes: the chart is usually “small” on the left and right sides with a “bump” in the middle. With a little bit of imagination you might say that such distributions look somewhat similar to a “church bell” (see Figure 4.4).

Figure 4.5 shows several histogram charts with the imagined “church bell” shape superimposed (all of the data come from the health_female.xls and health_male.xls data files).

These bell-shaped distributions differ from each other by the location of their hump and the width of the bell’s opening, and they have a special name.

Definition: A distribution that looks bell-shaped is called a normal distribution. The position of the hump is denoted by m and stands for the mean of the distribution, and its width is denoted by s and corresponds to the standard deviation. Thus, a particular normal distribution with mean m and standard deviation s is denoted by N(m, s).

The special normal distribution N(0, 1), that is, bell-shaped with mean 0 and standard deviation 1, is called the standard normal distribution.

Figure 4.4 The bell curve

Figure 4.5 Sample frequency histograms with superimposed normal curve

Figure 4.6 shows three normal distributions. Remember that they simply represent relative frequency charts, with the height of each bar corresponding to the probability of a randomly selected number falling in that bin.

Side note: The bell-shaped normal distribution is frequently called the Gaussian normal distribution, named after the famous German mathematician Carl Friedrich Gauss (1777–1855). It can be modeled mathematically by the exponential function $f(x) = \frac{1}{s\sqrt{2\pi}}\, e^{-\frac{(x-m)^2}{2s^2}}$, where m stands for the mean and s for the standard deviation of the distribution. Figure 4.7 shows four normal distributions with different parameters. For each, the mean m shows where the top of the hill is, which is also the axis of symmetry, and indicates the most likely occurrence. The standard deviation s specifies the width of the hill.
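For readers who like to experiment, here is a minimal Python sketch of the standard Gaussian density with mean m and standard deviation s. The function name `normal_density` is our own; the comparison uses `NormalDist` from Python's standard `statistics` module, and the parameters 63.2 and 2.74 are simply sample values.

```python
import math
from statistics import NormalDist

def normal_density(x, m, s):
    """Gaussian density with mean m and standard deviation s, evaluated at x."""
    return math.exp(-((x - m) ** 2) / (2 * s**2)) / (s * math.sqrt(2 * math.pi))

m, s = 63.2, 2.74

# The hand-written formula should agree with the library implementation
peak = normal_density(m, m, s)
library_peak = NormalDist(m, s).pdf(m)

# Crude Riemann sum over m - 6s .. m + 6s: the total area should be close to 1
step = 0.01
grid = [m - 6 * s + i * step for i in range(int(12 * s / step) + 1)]
area = sum(normal_density(x, m, s) for x in grid) * step

print(round(peak, 4), round(library_peak, 4), round(area, 3))
```

The curve is symmetric about m, so `normal_density(m - d, m, s)` and `normal_density(m + d, m, s)` return the same value for any d.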

Computing Normal Probabilities with Excel

Instead of creating a frequency histogram with (more or less) arbitrary bin boundaries, we can compute the mean and the standard deviation of the data and use the normal distribution with that particular mean and standard deviation to compute the probabilities we are interested in.

Example: Consider the Excel data set health_female.xls, showing a number of variables related to the health records of 40 female patients, randomly selected.

Data file: www.betterbusinessdecisions.org/data/health_female.xls

Compute the mean and standard deviation for the height variable of that data set and then use the corresponding normal distribution to visualize:

The probability, approximately, that a woman is 60 in. or shorter
The probability, approximately, that a woman is 65 in. or taller
The probability, approximately, that a woman is between 60 and 65 in. tall

Figure 4.6 Normal distributions with different means and standard deviations

Figure 4.7 Gaussian normal distribution for different parameters

As explained in Chapter 3, we can use Excel to quickly compute the mean and standard deviation: mean m = 63.2 and standard deviation s = 2.74. The normal distribution with these parameters is shown in Figure 4.8.

We can now use that graph to visualize the various probabilities by shading the appropriate area under that curve (see Figure 4.9).

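If you happen to have had calculus prior to this course, you might remember that the area under a curve is computed via integration (do not worry, if you have not had calculus or you do not care about it, just skip to the next paragraph). Therefore, writing $f(x) = \frac{1}{2.74\sqrt{2\pi}}\, e^{-\frac{(x-63.2)^2}{2 \cdot 2.74^2}}$ for the density of N(63.2, 2.74), the probabilities are:

$$P(x \le 60) = \int_{-\infty}^{60} f(x)\,dx \approx 0.1214$$

$$P(x \ge 65) = \int_{65}^{\infty} f(x)\,dx \approx 0.2556$$

$$P(60 \le x \le 65) = \int_{60}^{65} f(x)\,dx \approx 0.6230$$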

Figure 4.8 Graph of the normal distribution N(63.2, 2.74)

Figure 4.9 Shading the relevant portion of the normal distribution N(63.2, 2.74)

To evaluate these integrals is actually pretty difficult, even if you remember your calculus well; we have used an advanced computer program called Mathematica to get the answers. The good news is that Excel can easily compute these areas under a normal distribution as well, but there is a catch.

Definition: To compute probabilities under a normal distribution Excel provides the formula NORMDIST(X, m, s, true), where m and s are the mean and standard deviation, respectively, and the last parameter should always be set to “true.” The value of that formula always represents the probability (aka area under the curve) on the left side under the normal distribution up to the value of X: NORMDIST(X, m, s, true) = P(x ≤ X ), where x is N(m, s).

If you are using this formula in Excel, do not forget to start it with an equal sign, as you would do for any Excel formula. For example:

Excel formula	Mathematical notation	Value
= NORMDIST(0, 0, 1, true)	P (x ≤ 0), x standard normal N(0, 1)	0.5
= NORMDIST(4, 2, 3, true)	P (x ≤ 4), x normal N(2, 3)	0.7475
= NORMDIST(60, 63.2, 2.74, true)	P (x ≤ 60), x normal N(63.2, 2.74)	0.1214

Note that the last value happens to be exactly the area we need to answer the first of our three questions. Therefore: P (x ≤ 60) = NORMDIST(60, 63.2, 2.74, true) = 0.1214. The original method, using the actual frequency histogram, yields 0.125. Both computed values are close to each other, but using the normal distribution and Excel is much faster and allows for arbitrary boundary points to be used.
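If you do not have Excel at hand, the same left-tail probabilities can be reproduced in Python. The sketch below assumes Python 3.8 or later, where the standard `statistics` module provides `NormalDist`; the wrapper name `normdist` is our own and merely mimics the Excel call with its last argument set to true.

```python
from statistics import NormalDist

def normdist(x, m, s):
    """P(X <= x) for X ~ N(m, s), analogous to Excel's NORMDIST(x, m, s, TRUE)."""
    return NormalDist(mu=m, sigma=s).cdf(x)

print(round(normdist(0, 0, 1), 4))         # 0.5
print(round(normdist(4, 2, 3), 4))         # 0.7475
print(round(normdist(60, 63.2, 2.74), 4))  # 0.1214
```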

Other probabilities can be computed in a similar way, using the additional fact that the probability of everything must be 1. For example, suppose we want to use an N(63, 2) normal distribution to compute the probability P(height ≥ 65). We cannot simply use the Excel formula NORMDIST(65, 63, 2, true) because that formula computes, as always, P(x ≤ 65), not what we want (in fact, it is kind of the opposite). However, we know that the probability of everything is 1 so that:

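P(height ≤ 65) + P(height ≥ 65) = 1.

Therefore, P(height ≥ 65) = 1 − P(height ≤ 65) = 1 − NORMDIST(65, 63, 2, true) = 1 − 0.8413 = 0.1587.

To compute a probability like P(60 ≤ height ≤ 65), we can apply a similar trick, shown in Figure 4.10: subtract the area to the left of 60 from the area to the left of 65, that is, P(60 ≤ height ≤ 65) = P(height ≤ 65) − P(height ≤ 60).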

We can now use Excel to rapidly compute probabilities without ever constructing a frequency histogram. In fact, we do not even need access to the complete data set. All we need to know is the mean and the standard deviation of the data so that we can pick the right normal distribution.

Example: Consider the Excel data set health_male.xls, showing a number of variables related to the health records of 40 male patients, randomly selected. Without constructing a frequency histogram for the height of the 40 patients, find the following probabilities.

What is the probability, approximately, that a man is 60 in. or shorter?
What is the probability, approximately, that a man is 65 in. or taller?
What is the probability, approximately, that a man is between 60 and 65 in. tall?

Instead of constructing a complete frequency histogram, we quickly use Excel to compute the mean and the standard deviation of our data. Then we use the NORMDIST function, just as earlier, but of course using the mean and standard deviation for this data set. Here we go:

Figure 4.10 How to compute P(60 ≤ height ≤ 65)

mean height:	68.3
st. dev.	3.02
P(height <=60)=	0.002995	=NORMDIST(60,68.3,3.02,TRUE)
P(height >=65)=	0.862741	=1-NORMDIST(65,68.3,3.02,TRUE)
P(60 <= height <=65)=	0.134265	=NORMDIST(65,68.3,3.02,TRUE)-NORMDIST(60,68.3,3.02,TRUE)

Note that the probability of a man being less than 60 in. tall is now about 0.003, or 0.3 percent, much lower than the probability for a woman. That makes sense, since men are, on average, taller than women (68.3 in. versus 63.2 in.), so the probability of a man being less than 60 in. tall should indeed be lower than the comparable probability for women. The other figures equally make sense. Note also that all three probabilities add up to 1 (approximately). Again, that makes sense; explain why.

Important: The computed probabilities will be (approximately) correct under the assumption that the height of men is indeed normally distributed.
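The three probabilities for the male patients can be double-checked with a short Python sketch. It assumes, as stated above, that heights are normally distributed with mean 68.3 and standard deviation 3.02, and it uses `NormalDist` from the standard `statistics` module (Python 3.8+) in place of NORMDIST; the variable names are our own.

```python
from statistics import NormalDist

heights = NormalDist(mu=68.3, sigma=3.02)  # assumed model for male heights

p_60_or_less = heights.cdf(60)                    # left tail
p_65_or_more = 1 - heights.cdf(65)                # right tail via the complement
p_60_to_65 = heights.cdf(65) - heights.cdf(60)    # difference of two left tails

print(round(p_60_or_less, 6), round(p_65_or_more, 6), round(p_60_to_65, 6))
print(round(p_60_or_less + p_65_or_more + p_60_to_65, 6))
```

The three events split the whole number line into non-overlapping pieces, which is why the last line prints a total of 1.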

Now it should be clear how to use various normal distributions together with Excel to quickly compute probabilities. To practice, here are a few exercises for you to do. The answers are listed, but not how to get them. Remember, sometimes you need to use 1 − NORMDIST or subtract two NORMDIST values from each other: draw a picture of the normal curve, shade the desired area, and determine how that area relates to the Excel function NORMDIST.

Example: Find the indicated probabilities, assuming that the variable x has a distribution with the given mean and standard deviation.

x has mean 2.0 and standard deviation 1.0. Find P(x <= 3.0) [ = 0.8413].
x has mean 1.0 and standard deviation 2.0. Find P(x >= 1.5) [ = 0.4013].
x has mean −10 and standard deviation 5.0. Find P(−12 <= x <= −7) [ = 0.3812].
x is a standard normal variable. Find P(x <= −0.5) [ = 0.3085].
x is a standard normal variable. Find P(x >= −0.5) [ = 0.6915].
x is a standard normal variable. Find P(x >= 0.6) [ = 0.2743].
x is a standard normal variable. Find P(−0.3 <= x <= 0.4) [ = 0.2733].
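If you want to check your work on these exercises without Excel, the following Python sketch evaluates all seven in order. The helper `p_below` is a hypothetical name of ours, built on `NormalDist` from the standard `statistics` module; it plays the role of NORMDIST with the last argument set to true.

```python
from statistics import NormalDist

def p_below(x, m=0.0, s=1.0):
    """P(X <= x) for X ~ N(m, s); defaults give the standard normal."""
    return NormalDist(m, s).cdf(x)

answers = [
    p_below(3.0, 2.0, 1.0),                            # P(x <= 3.0)
    1 - p_below(1.5, 1.0, 2.0),                        # P(x >= 1.5)
    p_below(-7, -10, 5.0) - p_below(-12, -10, 5.0),    # P(-12 <= x <= -7)
    p_below(-0.5),                                     # P(x <= -0.5)
    1 - p_below(-0.5),                                 # P(x >= -0.5)
    1 - p_below(0.6),                                  # P(x >= 0.6)
    p_below(0.4) - p_below(-0.3),                      # P(-0.3 <= x <= 0.4)
]

for value in answers:
    print(round(value, 4))
```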

The Inverse Normal Problem

While we now can easily compute probabilities P(x < a) for given values of a, we sometimes want to do just the opposite: given a probability p, find a cut-off value a such that P(x < a) = p. Excel has just the function for us, as usual.

Definition: If x is N(m, s), that is, normal with mean m and standard deviation s, then the Excel function NORMINV(p, m, s) gives the value of a such that P(x < a) = p. Thus, the Excel functions NORMDIST and NORMINV are inverses of each other.

Of course, once we can find values of a such that P(x < a) = p, we can also find cut-off values if the prescribed probability has a different form. Perhaps an example will clarify this:

Example: If x is standard normal, find a such that P(x < a) = 0.4. What if x was N(4, 1.5)? Can you use the NORMINV function to find b such that P(x > b) = 0.05 if x is N(5, 1)?

For the first question, we know right away that a must be less than 0 because the distribution is normal with mean 0 and standard deviation 1. Thus, for a probability of the form P(x < a) to be less than 50 percent, a must be negative. In fact, a = NORMINV(0.4, 0, 1) = −0.2533. Indeed, we can check that NORMDIST(−0.2533, 0, 1, true) = 0.4, which is of course the inverse of the problem.

If the variable x was N(4, 1.5) instead of the standard normal and we again wanted to find a such that P(x < a) = 0.4, then it is easy to see that a has to be less than the mean of 4. In fact, a = NORMINV(0.4, 4, 1.5) = 3.6200. We could verify this again using NORMDIST but we will leave that to you.

Finally, with a little imagination and the picture of the normal distribution in our mind we can figure out that finding b such that P(x > b) = 0.05 is equivalent to finding b such that P(x < b) = 0.95, so that b = NORMINV(0.95, 5, 1) = 6.6449. Indeed, to double-check: P(x > 6.6449) = 1 − P(x < 6.6449) = 1 − NORMDIST(6.6449, 5, 1, true) = 0.05.
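The inverse problem can also be sketched in Python. Here `inv_cdf` from `statistics.NormalDist` (Python 3.8+) plays the role of NORMINV; the wrapper name `norminv` and the variable names are our own.

```python
from statistics import NormalDist

def norminv(p, m, s):
    """Cut-off a with P(X < a) = p for X ~ N(m, s), analogous to Excel's NORMINV(p, m, s)."""
    return NormalDist(m, s).inv_cdf(p)

a_standard = norminv(0.4, 0, 1)    # P(x < a) = 0.4 for the standard normal
a_shifted = norminv(0.4, 4, 1.5)   # same probability for N(4, 1.5)
b = norminv(0.95, 5, 1)            # P(x > b) = 0.05 rewritten as P(x < b) = 0.95

# Round trip: feeding the cut-off back into the cdf returns the probability
check = NormalDist(5, 1).cdf(b)

print(round(a_standard, 4), round(a_shifted, 4), round(b, 4), round(1 - check, 4))
```

The round trip at the end illustrates that computing a probability from a cut-off and computing a cut-off from a probability undo each other.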

Normal Distribution and Its Standard Deviation

While the mean of a normal distribution is easy to see (it is the line of symmetry through the top of the mountain), it seems harder to visualize the standard deviation. It is relatively simple to decide which of the two normal distributions has the smaller variance, but as it turns out even if you have only one normal distribution you can “see” the standard deviation.

Definition: If x is normally distributed with mean m and standard deviation s, then the three-sigma rule of thumb states (compare Figure 4.11):

The interval (m − s, m + s) contains ≈68 percent of the data.
The interval (m − 2s, m + 2s) contains ≈95 percent of the data.
The interval (m − 3s, m + 3s) contains ≈99.7 percent of the data.

This rule can be used to verify whether a given distribution of data is normal or not: check how much of the data is within one, two, and three standard deviations of the mean and compare it with the three-sigma rule of thumb: if there is an approximate match, the distribution is likely normal.

Figure 4.11 Normal distribution and standard deviation

Example: Bags of chips have an average weight of 425 g, with a standard deviation of 2.5 g. Assuming the weight is normally distributed, how many bags in a box of 500 bags weigh between 420 and 430 g?

We want to find:

P(420 ≤ x ≤ 430) = NORMDIST(430, 425, 2.5, true) − NORMDIST(420, 425, 2.5, true),

which works out to 0.9545. This matches with our observation that the interval (m − 2s, m + 2s) contains approximately 95 percent. Thus, we expect 0.95 ⋅ 500 = 475 bags will have the desired weight.
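The rule of thumb and the chips example can both be checked numerically. The following Python sketch uses `NormalDist` from the standard `statistics` module as a stand-in for NORMDIST; all variable names are our own.

```python
from statistics import NormalDist

standard = NormalDist()  # N(0, 1)

# Share of a normal distribution within k standard deviations of the mean
within = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}
for k, share in within.items():
    print(k, round(100 * share, 2))

# Chips example: weights assumed N(425, 2.5), so 420..430 g is mean +/- 2 s
chips = NormalDist(mu=425, sigma=2.5)
p_in_range = chips.cdf(430) - chips.cdf(420)
expected_bags = 500 * p_in_range

print(round(p_in_range, 4), round(expected_bags))
```

With the unrounded probability the expected count comes out slightly higher than with the rounded 95 percent used in the text, which is why the two estimates differ by a couple of bags.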

Incidentally, the preceding rule explains why we can approximate the standard deviation s as range/4, as we saw in Chapter 3: the interval m − 2s to m + 2s contains approximately 95 percent of the data, or in other words, the strip from m − 2s to m + 2s has a width of 4s and contains 95 percent of the data, approximately. Thus, 4s ≈ range or s ≈ range/4.

Converting to z-Scores

It turns out that you can easily convert one normal distribution into another. This is particularly handy when converting an arbitrary normal distribution to the standard normal N(0,1).

Transformation Formula for Normal Distributions: If x is normal with mean m and standard deviation s, then $z = \frac{x - m}{s}$ has the standard normal distribution, that is, the mean of z is 0 and its standard deviation is 1. The number $z = \frac{x - m}{s}$ is frequently called the z-score of x.

This formula allows us to compute probabilities of normally distributed variables in (at least) two ways.

Example: Suppose x is normally distributed as N(5, 2), that is, normal with mean 5 and standard deviation 2. Then compute P(2 < x < 6) using (a) the original parameters and (b) using z-scores.

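For part (a) we compute as usual:

P(2 < x < 6) = NORMDIST(6, 5, 2, true) − NORMDIST(2, 5, 2, true) = 0.6915 − 0.0668 = 0.6247.

For part (b), by the transformation formula the variable $z = \frac{x - 5}{2}$ is N(0, 1). If x = 2 then the z-score is $z = \frac{2 - 5}{2} = -1.5$, and the z-score of x = 6 is $z = \frac{6 - 5}{2} = 0.5$. Thus:

P(2 < x < 6) = P(−1.5 < z < 0.5) = NORMDIST(0.5, 0, 1, true) − NORMDIST(−1.5, 0, 1, true) = 0.6247.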

Thus, as long as you know how to compute probabilities of the standard normal distribution you can actually compute probabilities of any normal distribution. Therefore, Excel includes a special function to compute probabilities of the standard normal.

Definition: The Excel function =NORMSDIST(Z) computes the probability P(z < Z) if z has a standard normal distribution. In other words, NORMSDIST(Z) = NORMDIST(Z, 0, 1, true).

NORMSDIST has the advantage that it is somewhat simpler to use but offers no other benefits. Thus, we will stick with NORMDIST(Z, 0, 1, true) as a reminder that the standard normal distribution has mean 0 and standard deviation 1.
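To see that both routes give the same answer, here is a minimal Python sketch for the N(5, 2) example. It computes P(2 < x < 6) once with the original parameters and once after converting to z-scores; `NormalDist` from the standard `statistics` module stands in for the Excel functions, and the variable names are our own.

```python
from statistics import NormalDist

m, s = 5, 2

# (a) Directly with the original parameters
original = NormalDist(m, s)
p_direct = original.cdf(6) - original.cdf(2)

# (b) Convert both boundaries to z-scores and use the standard normal
z_low = (2 - m) / s    # -1.5
z_high = (6 - m) / s   #  0.5
standard = NormalDist(0, 1)
p_via_z = standard.cdf(z_high) - standard.cdf(z_low)

print(z_low, z_high, round(p_direct, 4), round(p_via_z, 4))
```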

Discrete and Continuous Random Variables

Previously we have used the term “variable” without properly defining it; we relied on common sense. Since we are currently adding a solid foundation to our discussion anyway, we might as well do the same for our most basic terminology.

Definition: A random variable is a variable whose values are numerical outcomes of an experiment. A discrete random variable can take only distinct values; a continuous one can take any value within a range.

Let us say we are tossing a single coin once. A random variable needs to assign numbers to the events in the sample space. Thus, we define a random variable x by saying, for example, that x({H }) = 0 and x({T }) = 1. If we toss a coin twice, a random variable could count the number of H’s, so that x({T,T }) = 0, x({T,H }) = x({H,T }) = 1, and x({H,H }) = 2. These two variables are discrete. As an example of a continuous variable, consider an experiment that measures the height of people. A random variable x could simply be the height of a person in inches. Discrete random variables are conveniently defined via a table of their values and the corresponding probabilities, while continuous random variables are often represented by the graph of a function called the probability density function.

Mean and Standard Deviation for Discrete Random Variables

Of course we can compute the mean or standard deviation of discrete random variables. It is similar to computing those parameters for frequency tables but we need to take into account that the distinct values of our random variable can occur with different probabilities.

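Definition: The mean m (or expected value E(x)) of a discrete random variable x with values x1, x2, ..., xn is:

$$m = E(x) = \sum_{i=1}^{n} x_i \, P(x = x_i)$$

The variance s2 of a discrete random variable x is:

$$s^2 = \sum_{i=1}^{n} (x_i - m)^2 \, P(x = x_i) = \sum_{i=1}^{n} x_i^2 \, P(x = x_i) - m^2$$

The standard deviation is, as usual, the square root of the variance:

$$s = \sqrt{s^2}$$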

This looks pretty intimidating but once you work through an example everything should clear up and you should have no problems. Note that interpreting the mean of a random variable as the expected value is particularly interesting.

Example: Suppose you want to open a new pizzeria. You do some research and you find that 30 percent of comparable pizzerias operate at a loss of $35,000, 40 percent break even, 20 percent make a profit of $25,000, and 10 percent make a profit of $95,000. How much money can you expect to make if you go through with your plans? What is the standard deviation?

First, we will convert the information into our new lingo: we define the random variable x to measure how much profit a pizzeria makes. Thus, x has four distinct values with the probabilities as shown in column 2 of Table 4.3.

Table 4.3 Finding the expected profit opening a pizzeria

x_i (Profit) ($)	P(x = x_i)	x_i P(x = x_i)	x_i^2 P(x = x_i)
−35,000	0.3	−10,500	367,500,000
0	0.4	0	0
25,000	0.2	5,000	125,000,000
95,000	0.1	9,500	902,500,000
Total		4,000	1,395,000,000

To find the mean, or the expected value, we multiply each value x_i by its probability P(x = x_i) and record the products in column 3 of the table. Since we also need to find the standard deviation, we add one more column containing x_i^2 P(x = x_i). The total of column 3 is the expected value of x. Thus, the expected value of x is $4,000, which means that statistically speaking you can expect a profit of $4,000 if you open the pizzeria.

To compute the variance (and hence the standard deviation), we add up the fourth column and subtract the square of the mean: s^2 = 1,395,000,000 − 4,000^2 = 1,395,000,000 − 16,000,000 = 1,379,000,000, so that the standard deviation becomes √1,379,000,000 ≈ $37,134.89.
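The table calculation translates directly into a few lines of Python. This is just a sketch of the arithmetic for the pizzeria example; the list `outcomes` and the other names are our own.

```python
import math

# (profit in dollars, probability) for the four scenarios in the example
outcomes = [(-35_000, 0.3), (0, 0.4), (25_000, 0.2), (95_000, 0.1)]

# Expected value: sum of value times probability
mean = sum(x * p for x, p in outcomes)

# Variance: sum of squared value times probability, minus the squared mean
second_moment = sum(x**2 * p for x, p in outcomes)
variance = second_moment - mean**2
std_dev = math.sqrt(variance)

print(round(mean, 2), round(second_moment), round(variance), round(std_dev, 2))
```

Note how large the standard deviation is compared with the expected profit: the "expected" $4,000 is an average over very different outcomes, not a typical result.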

Mean and Standard Deviation for Continuous Random Variables

Defining mean and standard deviation for continuous random variables requires integration of functions—that is, areas under curves—and is generally beyond the scope of this text. Still, for completeness, we list the definitions here as well.

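Definition: The mean m, or expected value E(x), of a continuous random variable x with density function p(x) is:

$$m = E(x) = \int_{-\infty}^{\infty} x \, p(x) \, dx$$

The variance s2 of the continuous random variable x is:

$$s^2 = \int_{-\infty}^{\infty} (x - m)^2 \, p(x) \, dx = \int_{-\infty}^{\infty} x^2 \, p(x) \, dx - m^2$$

The standard deviation is, as usual, the square root of the variance: $s = \sqrt{s^2}$.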

Even though we do not know how to integrate, here is a relatively simple example.

Example: Suppose we constructed a dial with a spinner, similar to a wheel of fortune, and spin it randomly (see Figure 4.12). Define the random variable x to be the angle at which the spinner comes to a rest. Compute the mean and variance of x.

The random variable x can take any value between 0 and 360: it is therefore a continuous variable. Since you are spinning randomly, every angle is equally likely, so x is called a uniformly distributed random variable. The probability density function for x must be constant, since every value between 0 and 360 is equally likely: p(x) = c for 0 ≤ x ≤ 360 (see Figure 4.13).

Figure 4.12 A continuous “wheel of fortune”

Figure 4.13 Uniform distribution

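We know that the total probability has to be 1, as always, so that the area of the rectangle with width 360 and height c must be 1. Therefore, 360 c = 1, so that $c = \frac{1}{360}$. Now we can find the mean and variance:

$$m = \int_{0}^{360} x \cdot \frac{1}{360} \, dx = \frac{360^2}{2 \cdot 360} = 180$$

$$s^2 = \int_{0}^{360} x^2 \cdot \frac{1}{360} \, dx - m^2 = \frac{360^3}{3 \cdot 360} - 180^2 = 43{,}200 - 32{,}400 = 10{,}800$$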
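If you have not had calculus, you can still approximate these areas numerically. The Python sketch below uses a simple midpoint rule, which is our own choice of method and not something the text prescribes; the helper `integrate` and the other names are hypothetical. For the spinner, the mean comes out at 180 degrees and the variance at 10,800.

```python
def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the area under f between a and b."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

c = 1 / 360  # constant density of the uniform distribution on 0..360

total_area = integrate(lambda x: c, 0, 360)                    # should be 1
mean = integrate(lambda x: x * c, 0, 360)                      # expected angle
variance = integrate(lambda x: x**2 * c, 0, 360) - mean**2

print(round(total_area, 6), round(mean, 3), round(variance, 3))
```

Taking the square root of the variance gives a standard deviation of roughly 104 degrees.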

Other Distributions

There are many different distributions. In fact, any function p(x) that is non-negative for all x and with the total area under the curve being 1 can generate a probability distribution. We already introduced the normal distributions and worked extensively with them, and in the previous section we introduced a uniform distribution for 0 ≤ x ≤ 360. Now we will introduce two additional ones.

Definition: Two other frequently used continuous distributions are the Student t-distribution and the F distribution. Both have complicated density functions but in terms of Excel they are defined via:

Student t-distribution: TDIST(x, df, tails), where df stands for degrees of freedom and tails is 1 (to compute one tail) or 2 (to compute two tails). TDIST(x, df, 1) = P(X > x) and TDIST(x, df, 2) = P(X > x) + P(X < −x) for positive x.
F distribution: FDIST(x, df1, df2), where df1 and df2 stand for degrees of freedom 1 and 2, respectively. FDIST(x, df1, df2) = P(X > x).

You can see their graphs in Figures 4.14 and 4.15.

The t-distribution looks similar to the standard normal distribution but its peak is not quite as high whereas its tails are wider. If the degrees of freedom are high, the t-distribution is nearly identical to the standard normal distribution. The F distribution, on the other hand, looks completely different (see Figure 4.14). In particular, it has no axis of symmetry. We will need these distributions in later chapters, at which point we will also explain the significance of the degrees of freedom. Right now we just want to familiarize ourselves with new ways to compute probabilities.

Note that the Excel definitions of both TDIST and FDIST give the probabilities at the tail end of the distribution whereas NORMDIST gives the probability to the left of x. See Figure 4.15.

Example: Suppose x is distributed according to a t-distribution with six degrees of freedom. Use Excel to find P(x > 1) and P(|x| ≤ 1). Also verify that for large degrees of freedom the t-distribution and the standard normal distribution are approximately the same. Finally, compare the one-tailed probabilities P(x ≥ 15) if x is distributed according to the F distribution with df1 = 10 and df2 = 2 with the standard normal one.

Figure 4.14 t-distribution versus standard normal (left) and two F distributions (right)

Figure 4.15 One- and two-tailed t-distribution versus standard normal and F distribution

Assuming x is distributed according to a t-distribution with df = 6, we have P(x > 1) = TDIST(1,6,1) = 0.1780. On the other hand, P(|x| ≤ 1) = 1 − TDIST(1,6,2) = 1 − 0.3559 = 0.6441.

To verify that a t-distribution with high degrees of freedom is about equal to the standard normal distribution, we compare TDIST(1,1000,1) = 0.15878 with 1 − NORMDIST(1,0,1,TRUE) = 0.15866 and repeat those calculations for different values of x. You will find that the probabilities agree very well indeed.

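If x is distributed according to the F distribution with df1 = 10 and df2 = 2 then P(x ≥ 15) = FDIST(15,10,2) = 0.0641. On the other hand, if x is N(0, 1) then P(x ≥ 15) = 1 − NORMDIST(15,0,1,TRUE), which is 0 to any practical number of decimal places.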
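The tail probabilities in this example can be cross-checked outside Excel. The following Python sketch is only an illustration under stated assumptions: the function names `tdist`, `fdist`, and `normal_tail` are hypothetical stand-ins that mimic the Excel conventions described above, and the areas are obtained by integrating the standard density formulas numerically with Simpson's rule rather than by any method the text prescribes.

```python
import math


def simpson(f, a, b, n=4000):
    """Composite Simpson's rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def t_pdf(x, df):
    """Density of the Student t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def f_pdf(x, df1, df2):
    """Density of the F distribution (this sketch assumes df1 > 2)."""
    if x <= 0:
        return 0.0
    log_beta = (math.lgamma(df1 / 2) + math.lgamma(df2 / 2)
                - math.lgamma((df1 + df2) / 2))
    log_pdf = ((df1 / 2) * math.log(df1 / df2) + (df1 / 2 - 1) * math.log(x)
               - ((df1 + df2) / 2) * math.log1p(df1 * x / df2) - log_beta)
    return math.exp(log_pdf)


def tdist(x, df, tails):
    """Mimics TDIST(x, df, tails) for x >= 0: one- or two-tailed area."""
    one_tail = 0.5 - simpson(lambda u: t_pdf(u, df), 0.0, x)
    return tails * one_tail


def fdist(x, df1, df2):
    """Mimics FDIST(x, df1, df2): the area to the right of x."""
    return 1.0 - simpson(lambda u: f_pdf(u, df1, df2), 0.0, x)


def normal_tail(x):
    """P(Z > x) for a standard normal variable Z."""
    return 0.5 * math.erfc(x / math.sqrt(2))


print(tdist(1, 6, 1), tdist(1, 6, 2))        # one- and two-tailed, df = 6
print(tdist(1, 1000, 1), normal_tail(1))     # large df versus standard normal
print(fdist(15, 10, 2), normal_tail(15))     # F tail versus normal tail
```

Integrating from 0 up to x and subtracting from the known total avoids having to integrate far out into the heavy tails.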

The Inverse Probability Problem

We have seen before that we can also solve the inverse probability problem: instead of finding the probability P(x < a) for a given value of a, we compute that value of a that results in a given probability p = P(x < a). If x is normal, then we can use the Excel function NORMINV. Similarly, Excel offers the functions TINV and FINV that are similar to NORMINV, but with slight differences in the interpretation of their inputs.

If x is distributed according to a t-distribution with degrees of freedom df, then the Excel function TINV(p, df) returns that value a such that P(x < −a) + P(x > a) = P(|x| > a) = p.
If x is distributed according to an F distribution with degrees of freedom df1 and df2, then the Excel function FINV(p, df1, df2) returns that value a such that P(x > a) = p.

The functions TINV and the two-tailed TDIST are inverses of each other, as are FINV and FDIST.
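To illustrate what "inverse" means here, the sketch below finds the cut-off a for a given two-tailed probability by bisection and checks that it undoes the two-tailed t probability. This is an assumption-laden illustration, not how Excel computes TINV: the names `tdist_two_tailed` and `tinv` are hypothetical, the area is computed by Simpson's rule, and the search assumes the answer lies between 0 and 100.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of the Student t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def tdist_two_tailed(a, df, n=2000):
    """P(|x| > a) for a >= 0, via Simpson's rule on [0, a] and symmetry."""
    h = a / n
    area = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, n):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * area * h / 3          # P(|x| <= a)
    return 1.0 - central


def tinv(p, df, lo=0.0, hi=100.0, tol=1e-8):
    """Find a with P(|x| > a) = p by bisection (assumes 0 < a < 100)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tdist_two_tailed(mid, df) > p:
            lo = mid                    # tails still too heavy: move right
        else:
            hi = mid
    return (lo + hi) / 2


p = tdist_two_tailed(1.0, 6)            # forward problem
a = tinv(p, 6)                          # inverse problem recovers a = 1
z = NormalDist(0, 1).inv_cdf(0.975)     # analogue of NORMINV(0.975, 0, 1)
print(p, a, z)
```

Note the difference in conventions: the normal inverse takes the probability to the left of the cut-off, while the t inverse sketched here takes the combined probability in both tails.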

The Central Limit Theorem

In the “Introduction” section we saw that we can use frequency distributions to compute probabilities of various events. Then we determined that we could use various normal distributions as a shortcut to compute those probabilities, which was very convenient. Using that technique we were able to compute all kinds of probabilities just based on the fact that we knew the mean and sample standard deviation of the distribution. We had to assume, however, that the (unknown) distribution of the variable in question was normal with the computed mean and standard deviation as parameters.

As it turns out, there is some mathematical justification for that; it says, in effect, that most distributions are, in some sense, “normal.” That theorem, called the Central Limit Theorem, is one of the cornerstones of statistics. It has many practical and theoretical implications, some of which we will explore in subsequent chapters.

In this course we will simply state the theorem without any proof. In more advanced courses we would provide a justification or mathematical proof, but for our current purposes it will be enough to understand the theorem and to apply it in subsequent chapters.

If we want to talk colloquially, we have actually already seen the Central Limit Theorem. We noted previously that “most histograms are (more or less) bell-shaped,” which is in fact one way to state the Central Limit Theorem. To state this theorem precisely, we need to specify, among other things, exactly which normal distribution we are talking about.

Central Limit Theorem for Means: Suppose x is a variable for a population whose distribution has a mean m and standard deviation s but whose shape is unknown. Suppose further we repeatedly select random samples of size N from that population and compute the sample mean each time. Finally, we plot the distribution (histogram) of all these sample means. Then the distribution of all sample means is approximately normal (bell-shaped) with mean m (the original mean) and standard deviation \( s / \sqrt{N} \).

This theorem is perhaps somewhat hard to understand, so here is a more colloquial restatement.

Central Limit Theorem, colloquial version: No matter what shape the distribution of a population has, the distribution of means computed for samples of size N is approximately bell-shaped (normal). The approximation gets better as N gets larger. Moreover, if we know the mean and standard deviation of the original distribution, the mean for the sample means will be the same as the original one, while the new standard deviation will be the original one divided by the square root of N.

The importance of this theorem is that it allows us to start with an arbitrary and possibly unknown distribution, yet use the normal distribution with appropriate mean and standard deviation to perform various computations, at least approximately.
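A short simulation can make the statement of the theorem concrete. The sketch below is an illustration under assumptions that are not part of the text: it uses an exponential population with mean 1 and standard deviation 1 (chosen because it is clearly not bell-shaped), a sample size of N = 36, 20,000 repetitions, and a fixed random seed. The function name `sample_means` is hypothetical.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable


def sample_means(draw, sample_size, repetitions):
    """Draw many samples and return the list of their means."""
    return [
        statistics.fmean(draw() for _ in range(sample_size))
        for _ in range(repetitions)
    ]


def draw():
    """One observation from a skewed population with m = 1 and s = 1."""
    return random.expovariate(1.0)


N = 36
means = sample_means(draw, N, 20_000)

mean_of_means = statistics.fmean(means)   # should be close to m = 1
sd_of_means = statistics.stdev(means)     # should be close to s / sqrt(N) = 1/6

print(round(mean_of_means, 3), round(sd_of_means, 3), round(1 / N ** 0.5, 3))
```

Changing `N` to 16, 25, 49, or 64 and rerunning shows the standard deviation of the sample means shrinking in step with the square root of the sample size.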

Example: Roll a single die once and record the number on the upper face. What is the distribution for this experiment? Now roll two dice and record the average of the numbers on the up faces. What is the distribution for this experiment? Finally, roll three dice, record the average, and determine the distribution. Relate your results to the Central Limit Theorem.

If we roll a single die, the numbers 1 to 6 are all equally likely to come up. Thus, the probability for each outcome is 1/6 so that the distribution looks like Figure 4.16. Note that for a single die the mean is m = 3.5 and the standard deviation is s = 1.7078.

Figure 4.16 Uniform distribution for tossing one die

If we throw two dice and record their average, we get the outcomes 2/2, 3/2, 4/2, 5/2, 6/2, 7/2, 8/2, 9/2, 10/2, 11/2, and 12/2. We can list these outcomes in Table 4.4, similar to what we did before.

Table 4.4 Average for tossing two dice

Die 1 \ Die 2	1	2	3	4	5	6
1	2/2	3/2	4/2	5/2	6/2	7/2
2	3/2	4/2	5/2	6/2	7/2	8/2
3	4/2	5/2	6/2	7/2	8/2	9/2
4	5/2	6/2	7/2	8/2	9/2	10/2
5	6/2	7/2	8/2	9/2	10/2	11/2
6	7/2	8/2	9/2	10/2	11/2	12/2

As before, we can determine the probabilities by counting to create the distribution in Figure 4.17.

Now we throw three dice and record the average. There are 6 × 6 × 6 = 216 total possibilities, with probabilities like P(average = 3/3) = P({1,1,1}) = 1/216, P(average = 4/3) = P({1,1,2}, {1,2,1}, {2,1,1}) = 3/216, and so on. We could show the outcomes in a three-dimensional table, but instead we simply show the resulting distribution in Figure 4.18. While we are at it, we also show the result of recording the average of four dice.

We can see that as the sample size N increases, the distribution looks more and more bell-shaped (normal), exactly as the Central Limit Theorem predicts.
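The counting argument for one, two, three, and four dice can be automated by listing every possible roll. The following Python sketch does exactly that; the function names `average_distribution` and `mean_and_sd` are hypothetical, and exact fractions are used so the probabilities match the hand count.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def average_distribution(num_dice):
    """Exact distribution of the average face value of num_dice fair dice."""
    counts = Counter(
        Fraction(sum(roll), num_dice)
        for roll in product(range(1, 7), repeat=num_dice)
    )
    total = 6 ** num_dice
    return {avg: Fraction(c, total) for avg, c in sorted(counts.items())}


def mean_and_sd(dist):
    """Mean and standard deviation of a discrete distribution."""
    mean = sum(avg * p for avg, p in dist.items())
    var = sum((avg - mean) ** 2 * p for avg, p in dist.items())
    return float(mean), float(var) ** 0.5


for n in (1, 2, 3, 4):
    m, s = mean_and_sd(average_distribution(n))
    # Compare s with the Central Limit Theorem value 1.7078 / sqrt(n).
    print(n, m, round(s, 4), round(1.7078 / n ** 0.5, 4))
```

In every case the mean stays at 3.5 while the standard deviation of the average equals the single-die value divided by the square root of the number of dice.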

Figure 4.17 Distribution for tossing two dice

Figure 4.18 Distribution for tossing three and four dice

The Central Limit Theorem Applet

If you want to see the Central Limit Theorem in action, check out the Central Limit Theorem applet (see Figure 4.19; it requires the latest version of the Java plug-in, which you can download for free).

Figure 4.19 Central Limit Theorem Applet

Source: http://www.mathcs.org/java/programs/CLT/clt.html

Try the following:

Open the Central Limit Theorem applet at the source address listed under Figure 4.19.
Click on the “Start CLT Applet” button (the applet might take a few seconds to initialize).

○ When you click “Start,” the program will pick a random sample from a population, compute the mean, and mark where that mean is on the x-axis to start a frequency distribution for the sample mean.

○ Then the program picks another random sample, computes its mean, marks it in blue, and continues in that fashion; check the “Slow Motion” checkbox to see what the program does in slow motion.

After the program has been running for a while, notice that the blue bars are slowly building up to a real frequency distribution (the yellow bars underneath show the distribution of the underlying population from which the random samples are selected).

Now try the following:

Let the program run (at regular speed) for a while. What shape is the distribution of the sample means (blue bars), at least approximately?
Experiment with different distributions (click on [Pick] to choose another distribution). What shape does the distribution of the sample means (blue chart) have when you pick other distributions for the population? Is that true regardless of the underlying population distribution (yellow chart)?
What is the mean for the distribution of the sample means (blue chart) in relation to the mean of the distribution of the original distribution (yellow chart)? The figures for the sample means are shown in the category “Sample Stats,” but make sure to run the program for a while before looking at the numbers. Note that these numbers represent the “sample mean” for the distribution of all sample means, and the “sample standard deviation” for the distribution of all sample means (yes, it sounds odd, but that is what it is).
Is there a relation between the standard deviation of the sample means (blue chart) and that of the original population (yellow chart)? Experiment with sample sizes 16, 25, 36, 49, and 64 to find the relation, but make sure to press the Reset button before using new parameters or sample sizes, and let the program run for a while before estimating the sample stats.

If you have done everything correctly, you have just discovered the Central Limit Theorem! Relax: if you have any trouble with that applet, or if you are not exactly sure what it shows and how it works, do not worry. In this class we are interested in the consequences of the Central Limit Theorem, coming up in the next chapter, and not in that theorem in and of itself.

Proportions and the Binomial Distribution

Of the continuous distributions the normal ones are the most important, but they require a numerical variable. Is there anything we can do for categorical variables? It turns out that if our data is such that it falls into exactly two categories, it can be modeled by a binomial distribution.

Definition: Suppose a random experiment has exactly two outcomes. We (arbitrarily) call one of them success (S) and the other one failure (F). Suppose further that the probability of success is p (and hence the probability of failure is q = 1 − p). Assume finally that this experiment is repeated independently N times and the random variable x counts the number of successes. Then x is a binomial random variable and has the binomial distribution B(p, N).

Note: A random experiment with exactly two outcomes where the probability of success does not change is sometimes called a Bernoulli Trial, named after the well-known Swiss mathematician Jacob Bernoulli (1654–1705). As so often, the preceding definition sounds really complicated but once you see examples it will become much clearer.

Example: Find the parameters for the following binomial distributions.

Flip a coin 33 times and count the number of heads.
It turns out that individuals with a certain gene have a 0.70 probability of contracting a certain disease. We conduct a study of 100 individuals with that gene to count the number of individuals who will contract the disease.
Consider a population of 25,000 voters in a given state. The proportion of voters who favor candidate A is equal to 0.40.

We can describe each of these situations using our terminology for a binomial distribution. In the first case of flipping a single coin, we (arbitrarily) consider heads to be a success (this would be our Bernoulli trial). Then the probability of success is p = 1/2. Since we repeat this experiment 33 times, the random variable x counting the number of successes is B(1/2, 33).

In the second case we consider it a success to contract the disease (which may sound odd). Then p = 0.7 and since we repeat this 100 times, we have a B(0.7, 100) distribution. We could just as well consider it a success not to contract the disease (avoiding a disease does sound more successful). In that case p = 1 − 0.7 = 0.3 and this variable, call it y, counting the number of successes, is B(0.3, 100).

For the last case we consider a vote for candidate A a success so that p = 0.4. Since we have 25,000 voters, our distribution is B(0.4, 25,000).

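It turns out that there is a (relatively) simple formula for a binomial distribution but it requires the formula \( \binom{n}{k} = \frac{n!}{k! \, (n-k)!} \), where \( n! = 1 \cdot 2 \cdot 3 \cdots n \) and 0! = 1. We pronounce n! as “n factorial” and \( \binom{n}{k} \) as “n choose k” or “choose k out of n.” For example, \( 4! = 1 \cdot 2 \cdot 3 \cdot 4 = 24 \) and \( \binom{4}{2} = \frac{4!}{2! \, 2!} = 6 \). Note that “n choose k” always comes out an integer even though at first glance that seems unlikely.

Definition: Suppose a random variable x is B(p, N). Then x is a discrete binomial random variable and its distribution is

\[ P(x = k) = \binom{N}{k} \, p^k \, (1 - p)^{N - k}, \quad k = 0, 1, \ldots, N \]

The mean of x is m = Np and the variance is \( s^2 = Np(1 - p) \).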

This is pretty nifty. It gives us the probability of getting k successes in N total tries, each of which has probability of success p.

Example: Create the probability distribution for counting heads in flipping a coin six times.

Some of the probabilities are easy to determine. For example, the probability of obtaining six heads should clearly be \( (0.5)^6 \). Also, the probability of obtaining no heads is equal to the probability of getting six tails, which again is \( (0.5)^6 \). For the probabilities in between we need to apply the preceding formula; see Table 4.5 and Figure 4.20 for the results.

Table 4.5 Binomial distribution B(1/2, 6)

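k	P(X = k)
0	1/64 ≈ 0.0156
1	6/64 ≈ 0.0938
2	15/64 ≈ 0.2344
3	20/64 = 0.3125
4	15/64 ≈ 0.2344
5	6/64 ≈ 0.0938
6	1/64 ≈ 0.0156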

Figure 4.20 Binomial distribution B(1/2, 6)
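The binomial formula is short enough to evaluate directly. As a minimal sketch (Python is used here only for illustration, and the function name `binomial_pmf` is a hypothetical choice), the following code computes P(x = k) for the six coin flips of this example and checks the mean and variance against Np and Np(1 − p).

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(x = k) for a binomial variable: n trials, success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 6, 0.5
table = [(k, binomial_pmf(k, n, p)) for k in range(n + 1)]

for k, prob in table:
    print(f"{k}\t{prob:.4f}")

total = sum(prob for _, prob in table)                        # must be 1
mean = sum(k * prob for k, prob in table)                     # N p
variance = sum((k - mean) ** 2 * prob for k, prob in table)   # N p (1 - p)
print(total, mean, variance)
```

`math.comb` returns the integer “n choose k,” so no factorials need to be computed by hand.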

Using Excel to Compute the Binomial Distribution

Excel, of course, includes functions to easily compute P(x = k) if x is B(p, N).

FACT(n)	computes n!
COMBIN(n, k)	computes “n choose k,” that is, the number of ways to select k objects from n objects
BINOMDIST(k, N, p, FALSE)	computes the probability of obtaining exactly k successes in N trials if the probability of success is p

Example: Use Excel to create the distribution chart for B(0.2, 40), that is, the distribution of selecting k successes out of 40 trials with probability of success p = 0.2. Describe the distribution. Find Q1 and Q3 as well as the mean and the median.

Using Excel’s BINOM.DIST (or BINOMDIST in Excel 2007) function it is easy to create the probability distribution for B(0.2, 40); see Figure 4.21. We can then create the cumulative probability chart to determine the quartiles, as explained in Chapter 3: the cumulative probability first reaches 0.25 at k = 6, 0.50 at k = 8, and 0.75 at k = 10, so that Q1 = 6, the median = 8, and Q3 = 10. The mean of B(0.2, 40) is 40 × 0.2 = 8. The distribution in Figure 4.21 looks approximately normal but is skewed slightly to the right.

Figure 4.21 Distribution for B(0.2, 40)
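The distribution chart and the cumulative probabilities for B(0.2, 40) can also be produced with a few lines of Python. The sketch below rests on one assumption that should be stated loudly: a quartile is taken to be the smallest k whose cumulative probability reaches the target level. Chapter 3 may use a slightly different convention, and a chart whose rows are counted from 1 rather than from k = 0 will appear shifted by one. The function name `quantile` is hypothetical.

```python
from itertools import accumulate
from math import comb

n, p = 40, 0.2

# Probability of exactly k successes for k = 0, 1, ..., 40.
pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

# Running totals: cdf[k] is the probability of at most k successes.
cdf = list(accumulate(pmf))


def quantile(level):
    """Smallest k whose cumulative probability is at least `level`."""
    return next(k for k, c in enumerate(cdf) if c >= level)


q1, median, q3 = quantile(0.25), quantile(0.50), quantile(0.75)
mean = sum(k * prob for k, prob in enumerate(pmf))

print(q1, median, q3, round(mean, 4))
```

Printing `pmf` and `cdf` side by side gives the same two columns one would build in a spreadsheet before reading off the quartiles.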

Let us conclude this chapter with an interesting application that might give you pause for thought.

Example: The National Aeronautics and Space Administration (NASA) flew a total of 135 space shuttle missions from 1981 to 2011. In 1986 the shuttle Challenger, the 25th shuttle launch, broke apart 73 seconds into its flight, leading to the deaths of all seven crew members. A subsequent investigation, whose members included the famous physicist Richard Feynman, found that a simple O-ring failure caused by cold weather resulted in this disaster. Assuming that space shuttle launches are independent and that the probability of each successful launch is approximately 99.2 percent, find the probability of 25 successful shuttle launches in a row. Determine how many successful launches it takes before the probability of N successful launches in a row drops to 50 percent. Should you then launch shuttle number N + 1? What is the probability of 135 successful launches in a row?


Let x be a variable that counts the number of successful shuttle launches. Because of our assumptions, x has a binomial distribution B(0.992, N). Thus, the probability of 25 successful launches (and no failures) is \( 0.992^{25} \approx 0.818 \), or about 82 percent. The probability of 135 successes in a row is similarly \( 0.992^{135} \approx 0.338 \), or only about 34 percent.

To find N such that the probability of N successful launches drops below 50 percent we need to solve \( 0.992^N = 0.5 \). We can take the natural logarithm on both sides to find that \( N = \ln(0.5) / \ln(0.992) \approx 86.3 \). Thus, the probability of 87 successful launches in a row has dropped to less than 50 percent. However, the probability that the 88th launch succeeds is again 99.2 percent, regardless of what happened before. Such is the nature of the binomial distribution: no matter how close the probability of success p is to 1, the probability of N successes in a row eventually drops to zero! Still, each trial has again a probability of success p, regardless of how many successes in a row already happened. To put this in words: in a binomial distribution it is certain that disaster will strike eventually, but you cannot predict when.
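The arithmetic behind the shuttle example is easy to reproduce. The following sketch assumes, as the example does, independent launches with a success probability of 0.992 each; the variable names are illustrative only.

```python
import math

p = 0.992  # assumed probability that a single launch succeeds

p_25 = p ** 25      # probability of 25 successes in a row
p_135 = p ** 135    # probability of 135 successes in a row

# Solve p**N = 0.5 for N by taking logarithms on both sides.
n_exact = math.log(0.5) / math.log(p)
n_first_below_half = math.ceil(n_exact)   # first whole N with p**N < 0.5

print(round(p_25, 3), round(p_135, 3))
print(round(n_exact, 1), n_first_below_half)
print(p ** (n_first_below_half - 1) > 0.5 > p ** n_first_below_half)
```

The final line confirms that the streak probability crosses 50 percent between one launch count and the next, even though each individual launch still succeeds with probability 0.992.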

Excel Demonstration

Recall that to solve a probability problem using a binomial distribution, you need the number of successes you are looking for, the number of trials, and the probability of success. The following is a quick tip sheet on the formulas to use in Excel for specific situations. Note that TRUE/FALSE indicates whether or not we are looking for the cumulative probability. For example, if we are looking for the probability of 5 successes and enter TRUE in our formula, we get the probability of everything up to and including 5 (the cumulative amount). If we enter FALSE (meaning we are saying NO to the cumulative), we get only the probability of exactly 5, not the cumulative value.

Rules for Finding Binomial Probabilities in Excel

If the question asks you to find the probability of exactly one number, use BINOMDIST(successes,trials,probability,FALSE).
If the question asks you to find the probability of up to and including a number (less than or equal to a number), use BINOMDIST(successes,trials,probability,TRUE).
If the question asks you to find the probability of less than a number, use BINOMDIST(successes,trials,probability,TRUE) − BINOMDIST(successes,trials,probability,FALSE).
If the question asks you to find the probability of at least a number (greater than or equal to a number), use (1 − BINOMDIST(successes,trials,probability,TRUE)) + BINOMDIST(successes,trials,probability,FALSE).
If the question asks you to find the probability of greater than a number, use 1 − BINOMDIST(successes,trials,probability,TRUE).
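The five rules above translate directly into code. In the following Python sketch, `binomdist` is a hypothetical stand-in written to behave like Excel's BINOMDIST(successes, trials, probability, cumulative); it is not an Excel or library function, and the five wrapper names are likewise illustrative. The values n = 10, p = 0.3, and k = 4 in the usage lines are made up for demonstration.

```python
from math import comb


def binomdist(k, n, p, cumulative):
    """Stand-in for BINOMDIST: P(x = k), or P(x <= k) if cumulative is True."""
    def pmf(j):
        return comb(n, j) * p ** j * (1 - p) ** (n - j)
    if cumulative:
        return sum(pmf(j) for j in range(k + 1))
    return pmf(k)


def prob_exactly(k, n, p):
    return binomdist(k, n, p, False)


def prob_at_most(k, n, p):
    return binomdist(k, n, p, True)


def prob_less_than(k, n, p):
    return binomdist(k, n, p, True) - binomdist(k, n, p, False)


def prob_at_least(k, n, p):
    return (1 - binomdist(k, n, p, True)) + binomdist(k, n, p, False)


def prob_more_than(k, n, p):
    return 1 - binomdist(k, n, p, True)


# Illustrative values only: 10 trials, success probability 0.3, threshold 4.
n, p, k = 10, 0.3, 4
pieces = prob_less_than(k, n, p) + prob_exactly(k, n, p) + prob_more_than(k, n, p)
print(pieces)  # "less than", "exactly", and "more than" together cover everything
```

A useful sanity check for any problem is the one in the last lines: the probabilities of less than, exactly, and more than a number must add up to 1.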

Example: Company P, the paper products manufacturer, has a customer service and return department. A customer service representative’s records show that the probability that a newly sold product needs to be returned in the first 90 days is 0.05. If a sample of three new products is selected:

a. What is the probability that none needs to be returned?

b. What is the probability that at least one needs to be returned?

c. What is the probability that more than one needs to be returned?

The key parameters to this problem are: sample size (trials) N = 3, probability of “success” p = 5 percent, and number of “successes” (a) exactly 0, (b) 1 or more, or (c) more than one. To solve in Excel, we would use the following formulas:

a. =BINOMDIST(0,3,0.05,FALSE) to get the result of 0.857375 or around 86 percent

b. =1 − BINOMDIST(0,3,0.05,FALSE) to get the result of 0.142625 or around 14 percent

c. =1 − BINOMDIST(1,3,0.05,TRUE) to get the result of 0.00725 or around 0.7 percent

Example: Company S, the accounting firm, knows that resolving client inquiries on the same day is highly important for maintaining client satisfaction. This means the client relations department works quickly to resolve inquiries on the same day. Past data from the customer relationship management database indicates that the likelihood is 0.70 that client inquiries that come in on a Monday (the busiest day of the week) will be resolved on the same day. For the first five inquiries submitted on a given Monday:

a. What is the probability that all 5 will be resolved on the same day?

b. What is the probability that at least 3 will be resolved on the same day?

c. What is the probability that fewer than 2 will be resolved on the same day?

The key parameters for this problem are: sample size (trials) N = 5, probability of success p = 70 percent, and number of successes (a) exactly 5, (b) 3 or more, or (c) fewer than two. To solve this in Excel, we would use the following formulas:

a. =BINOMDIST(5,5,0.7,FALSE) to get the result of 0.16807 or around 17 percent

b. =1 − BINOMDIST(3,5,0.7,TRUE) + BINOMDIST(3,5,0.7,FALSE) to get the result of 0.83692 or around 84 percent

c. =BINOMDIST(2,5,0.7,TRUE) − BINOMDIST(2,5,0.7,FALSE) to get the result of 0.03078 or about 3 percent
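Both worked examples can be recomputed independently of Excel as a cross-check. The sketch below again uses a hypothetical helper, `binomdist`, written to mirror the argument order of Excel's BINOMDIST; each line corresponds to one of the Excel formulas above.

```python
from math import comb


def binomdist(k, n, p, cumulative):
    """Stand-in for BINOMDIST: P(x = k), or P(x <= k) if cumulative is True."""
    terms = [comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1)]
    return sum(terms) if cumulative else terms[-1]


# Company P: three products, return probability 0.05.
p_none = binomdist(0, 3, 0.05, False)
p_at_least_one = 1 - binomdist(0, 3, 0.05, False)
p_more_than_one = 1 - binomdist(1, 3, 0.05, True)

# Company S: five inquiries, same-day resolution probability 0.70.
s_all_five = binomdist(5, 5, 0.7, False)
s_at_least_three = 1 - binomdist(3, 5, 0.7, True) + binomdist(3, 5, 0.7, False)
s_fewer_than_two = binomdist(2, 5, 0.7, True) - binomdist(2, 5, 0.7, False)

for label, value in [
    ("P a", p_none), ("P b", p_at_least_one), ("P c", p_more_than_one),
    ("S a", s_all_five), ("S b", s_at_least_three), ("S c", s_fewer_than_two),
]:
    print(label, round(value, 6))
```

Comparing parts b and c for Company P is instructive: "at least one" and "more than one" differ by the probability of exactly one return, so the two answers cannot coincide.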

Solving Probability Problems Using Normal Distribution Techniques

Company P determined that its truck drivers making deliveries are spending a lot of time on the road, and the trucks are becoming worn out quickly. Company P determined that the annual distance traveled per truck is normally distributed, with a mean of 72,000 miles and a standard deviation of 17,000 miles.

a. What proportion of trucks can be expected to travel between 38,000 and 62,000 miles in the year?

b. What percentage of the trucks travel less than 35,000 miles in the year?

c. What percentage of the trucks travel more than 57,000 miles in the year?

d. How many miles will be traveled by more than 72 percent of the trucks?

In Excel 2013, go to Formulas, and click Insert Function (which is the fx entry in the input line). Type in NORMDIST and select GO. Double-click on the NORMDIST option (see Figure 4.22).

To solve question A of finding the probability of traveling between 38,000 and 62,000 miles, find the cumulative probability for 62,000 and then subtract from it the cumulative probability for 38,000:

To find the probability for 62,000, input 62,000 for X, 72,000 for Mean, 17,000 for Standard_dev, and TRUE for Cumulative. The probability for 62,000 is 0.278 (see Figure 4.23).
Now change X to 38,000 and the probability is 0.023.
Subtract 0.023 from 0.278 to get 0.255, or about 26 percent, as the answer to A.

Figure 4.22 Dialog for NORMDIST function

Figure 4.23 The standard normal probability for x = 62,000

To solve question B, leave the same dialog box up and leave the mean, standard deviation, and cumulative value as they were. Replace X with 35,000 to get 0.015, or about 1.5 percent, for B.

To solve question C, find the probability of less than or equal to 57,000 first by replacing X with 57,000 and leaving Cumulative as TRUE. This gives you 0.19 or 19 percent as the probability of getting less than or equal to 57,000. To find the probability of getting more than 57,000, subtract 0.19 from 1 to get 0.81 or 81 percent, which is the solution.

To solve question D, we must use the inverse function of NORMDIST since we want to compute a cut-off value for x to get a specified probability. Go to Insert Function and type in NORMINV (see Figure 4.24).

You are asked to find how many miles will be traveled by more than 72 percent of the trucks, so you are looking for the cut-off value that 72 percent of the trucks exceed. NORMINV works with the probability to the left of the cut-off, which here is 1 − 0.72 = 0.28, that is, the bottom 28 percent. If the question had asked for the mileage that 72 percent of the trucks stay below, we would have entered 0.72 instead. Input 0.28 for Probability, 72,000 for Mean, and 17,000 for Standard_dev into the dialog box shown in Figure 4.24.

Figure 4.24 Dialog for NORMINV function

The answer to D is approximately 62,092 miles.
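The four truck questions can be double-checked with Python's standard library, which offers a normal distribution object with a cumulative function (playing the role of NORMDIST with Cumulative set to TRUE) and an inverse (playing the role of NORMINV). This is a sketch under the example's assumptions of a mean of 72,000 miles and a standard deviation of 17,000 miles; for part D it takes the cut-off to be the value with 1 − 0.72 = 0.28 of the probability to its left.

```python
from statistics import NormalDist

# Annual distance per truck: normal with mean 72,000 and sd 17,000 miles.
miles = NormalDist(mu=72_000, sigma=17_000)

# A: proportion between 38,000 and 62,000 miles.
answer_a = miles.cdf(62_000) - miles.cdf(38_000)

# B: proportion below 35,000 miles.
answer_b = miles.cdf(35_000)

# C: proportion above 57,000 miles.
answer_c = 1 - miles.cdf(57_000)

# D: mileage exceeded by 72 percent of the trucks (28 percent lie below it).
answer_d = miles.inv_cdf(1 - 0.72)

print(round(answer_a, 4), round(answer_b, 4), round(answer_c, 4), round(answer_d))
```

Keeping three or four decimal places in the intermediate probabilities avoids the rounding drift that appears when two-decimal values are subtracted from each other.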
