In this section, we will briefly examine the properties of various probability distributions. Many of these distributions are used in Bayesian analysis, so a brief synopsis is needed. We will also illustrate how to generate and display these distributions using matplotlib. To avoid repeating import statements for every code snippet, the following standard set of Python imports should be run before any of the code snippets in this section. You only need to run these imports once per session. The imports are as follows:
In [1]: import pandas as pd
        import numpy as np
        import matplotlib
        import matplotlib.pyplot as plt
        from matplotlib import colors
        %matplotlib inline
One of the steps that we have to take in a Bayesian analysis is to fit our data to a probability distribution. Selecting the correct distribution can be somewhat of an art and often requires statistical knowledge and experience, but we can follow a few guidelines to help us along the way; these are as follows:
A statistical trial is a repeatable experiment with a set of well-defined outcomes that is known as the sample space. A Bernoulli trial is a Yes/No experiment where the random variable X is assigned the value of 1 in the case of a Yes and 0 in the case of a No. The event of tossing a coin and seeing whether it turns up heads is an example of a Bernoulli trial.
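A Bernoulli trial such as this coin toss can be simulated in a few lines (a minimal sketch using numpy; the fair-coin probability p = 0.5 and the seed are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the results are reproducible

p = 0.5  # assumed fair coin for this example

# One Bernoulli trial: X = 1 (heads) with probability p, 0 (tails) otherwise
x = int(rng.random() < p)
print(x)  # either 0 or 1

# Over many repeated trials, the fraction of successes approaches p
samples = (rng.random(100_000) < p).astype(int)
print(samples.mean())  # close to 0.5
```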
There are two classes of probability distributions: discrete and continuous. In the following sections, we will discuss the differences between these two classes of distributions and take a tour of the main distributions.
In this scenario, the variable can take only certain distinct values such as integers. An example of a discrete random variable is the number of heads obtained when we flip a coin 5 times; the possible values are {0,1,2,3,4,5}. We cannot obtain 3.82 heads, for example. The probabilities of the values that the random variable can take are specified by what is known as a probability mass function (pmf).
The discrete uniform distribution is a distribution that models an event with a finite set of n possible outcomes where each outcome is equally likely to be observed. Each outcome has a probability of occurrence of 1/n.
An example of this is throwing a fair die. The probability of any of the six outcomes is 1/6. The pmf is given by f(x) = 1/n, and the expected value and variance are given by E(X) = (a + b)/2 and Var(X) = (n² − 1)/12 respectively, where a and b are the smallest and largest values the variable can take and n = b − a + 1.
In [13]: X = range(0, 11)
         Y = [1/6.0 if x in range(1, 7) else 0.0 for x in X]
         plt.plot(X, Y, 'go-', linewidth=0, drawstyle='steps-pre',
                  label="p(x)=1/6")
         plt.legend(loc="upper left")
         plt.vlines(range(1, 7), 0, max(Y), linestyle='-')
         plt.xlabel('x')
         plt.ylabel('p(x)')
         plt.ylim(0, 0.5)
         plt.xlim(0, 10)
         plt.title('Discrete uniform probability distribution with p=1/6')
         plt.show()
The Bernoulli distribution measures the probability of success in a trial; for example, the probability that a coin toss turns up a head or a tail. This can be represented by a random variable X that takes a value of 1 if the coin turns up as heads and 0 if it is tails. The probabilities of turning up heads and tails are denoted by p and q = 1 − p respectively.
This can be represented by the following pmf:

f(k) = p if k = 1, and f(k) = 1 − p = q if k = 0

The expected value and variance are given by E(X) = p and Var(X) = p(1 − p) = pq.
The reference for this information is at http://en.wikipedia.org/wiki/Bernoulli_distribution.
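These formulas can be sanity-checked with scipy.stats (p = 0.3 is an arbitrary illustrative value):

```python
from scipy.stats import bernoulli

p = 0.3  # an arbitrary success probability for illustration

# E(X) = p and Var(X) = p(1 - p) = pq
mean, var = bernoulli.stats(p, moments='mv')
print(float(mean), float(var))  # 0.3 0.21
```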
We now plot the Bernoulli distribution using matplotlib and scipy.stats as follows:
In [20]: from scipy.stats import bernoulli
         a = np.arange(2)
         colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
         plt.figure(figsize=(12, 8))
         for i, p in enumerate([0.0, 0.2, 0.5, 0.75, 1.0]):
             ax = plt.subplot(1, 5, i+1)
             plt.bar(a, bernoulli.pmf(a, p), label=p, color=colors[i],
                     alpha=0.5)
             ax.xaxis.set_ticks(a)
             plt.legend(loc=0)
             if i == 0:
                 plt.ylabel("PMF at $k$")
         plt.suptitle("Bernoulli probability for various values of $p$")
The binomial distribution is used to represent the number of successes in n independent Bernoulli trials; that is, X ~ B(n, p).
Using the coin toss example, this distribution models the chance of getting X heads over n trials. For 100 tosses, the binomial distribution models the likelihood of getting 0 heads (extremely unlikely) to 50 heads (highest likelihood) to 100 heads (also extremely unlikely). This ends up making the binomial distribution symmetrical when the odds are perfectly even and skewed when the odds are far less even. The pmf is given by the following expression:

P(X = k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ, for k = 0, 1, ..., n, where C(n, k) = n! / (k!(n − k)!)
The expectation and variance are given respectively by E(X) = np and Var(X) = np(1 − p).
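Applied to the 100-toss example, scipy.stats.binom confirms both the pmf values and the moments (a quick check; n = 100 and p = 0.5 as in the example):

```python
from scipy.stats import binom

n, p = 100, 0.5  # the 100-toss fair-coin example
rv = binom(n, p)

# 50 heads is the single most likely outcome; 0 heads is astronomically rare
print(rv.pmf(50))  # ~0.0796
print(rv.pmf(0))   # ~7.9e-31

# E(X) = np = 50 and Var(X) = np(1 - p) = 25
mean, var = rv.stats(moments='mv')
print(float(mean), float(var))
```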
In [5]: from scipy.stats import binom
        clrs = ['blue', 'green', 'red', 'cyan', 'magenta']
        plt.figure(figsize=(12, 6))
        k = np.arange(0, 22)
        for p, color in zip([0.001, 0.1, 0.3, 0.6, 0.999], clrs):
            rv = binom(20, p)
            plt.plot(k, rv.pmf(k), lw=2, color=color,
                     label="$p$=" + str(round(p, 1)))
        plt.legend()
        plt.title("Binomial distribution PMF")
        plt.tight_layout()
        plt.ylabel("PMF at $k$")
        plt.xlabel("$k$")
The Poisson distribution models the probability of a number of events within a given time interval, assuming that these events occur with a known average rate and that successive events occur independently of the time since the previous event.
A concrete example of a process that can be modeled by a Poisson distribution would be if an individual received an average of, say, 23 e-mails per day. If we assume that the arrival times for the e-mails are independent of each other, the total number of e-mails an individual receives each day can be modeled by a Poisson distribution.
Another example could be the number of trains that stop at a particular station each hour. The pmf for a Poisson distribution is given by the following expression:

P(X = k) = λᵏ e⁻ᴸ / k!, for k = 0, 1, 2, ... (with λ in the exponent, that is, λᵏe⁻ᵡ is λᵏ times e raised to −λ)

where λ is the rate parameter that represents the expected number of events/arrivals that occur per unit time, and k is the value of the random variable that represents the number of events/arrivals.
The expectation and variance are given respectively by E(X) = λ and Var(X) = λ.
For more information, refer to http://en.wikipedia.org/wiki/Poisson_process.
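The e-mail example above can be evaluated directly with scipy.stats.poisson (λ = 23 as in the example; the quiet-day threshold of 10 is an arbitrary choice):

```python
from scipy.stats import poisson

lam = 23  # average e-mails per day, from the example above

# Probability of receiving exactly the average number of e-mails
print(poisson.pmf(23, lam))  # ~0.083

# Probability of an unusually quiet day (10 or fewer e-mails)
print(poisson.cdf(10, lam))  # ~0.002

# E(X) = Var(X) = lambda
mean, var = poisson.stats(lam, moments='mv')
print(float(mean), float(var))  # 23.0 23.0
```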
The pmf is plotted using matplotlib for various values of λ as follows:
In [11]: from scipy.stats import poisson
         colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
         k = np.arange(15)
         plt.figure(figsize=(12, 8))
         for i, lambda_ in enumerate([1, 2, 4, 6]):
             plt.plot(k, poisson.pmf(k, lambda_), '-o',
                      label="$\lambda$=" + str(lambda_), color=colors[i])
         plt.legend()
         plt.title("Poisson distribution PMF for various $\lambda$")
         plt.ylabel("PMF at $k$")
         plt.xlabel("$k$")
         plt.show()
For independent Bernoulli trials, the geometric distribution measures the number of trials X needed to get one success. It can also represent the number of failures, Y = X − 1, before the first success.
The pmf is given by the following expression:

P(X = k) = (1 − p)ᵏ⁻¹ p, for k = 1, 2, 3, ...

The preceding expression makes sense since q = 1 − p, and if it takes k trials to get one success (with probability p), this means that we must have had k − 1 failures, each occurring with probability q.
The expectation and variance are given by E(X) = 1/p and Var(X) = (1 − p)/p².
The following code plots the pmf for various values of p:
In [12]: from scipy.stats import geom
         p_vals = [0.01, 0.2, 0.5, 0.8, 0.9]
         colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
         for p, color in zip(p_vals, colors):
             x = np.arange(geom.ppf(0.01, p), geom.ppf(0.99, p))
             plt.plot(x, geom.pmf(x, p), '-o', ms=8, color=color,
                      label='$p$=' + str(p))
         plt.legend(loc='best')
         plt.ylim(-0.5, 1.5)
         plt.xlim(0, 7.5)
         plt.ylabel("PMF at $k$")
         plt.xlabel("$k$")
         plt.title("Geometric distribution PMF")
Also for independent Bernoulli trials, the negative binomial distribution measures the number of trials, X = k, needed before a specified number of successes, r, occurs. An example would be the number of coin tosses it would take to obtain 5 heads. The pmf is given as follows:

P(X = k) = C(k − 1, r − 1) pʳ (1 − p)ᵏ⁻ʳ, for k = r, r + 1, ...
The expectation and variance are given respectively by E(X) = r/p and Var(X) = r(1 − p)/p².
We can see that the negative binomial is a generalization of the geometric distribution, with the geometric distribution being a special case of the negative binomial, where r = 1.
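This special case can be verified numerically. Note that scipy's nbinom counts the failures before the r-th success while geom counts the trials up to and including the first success, so the two arguments are offset by one (p = 0.4 is an arbitrary choice):

```python
import numpy as np
from scipy.stats import nbinom, geom

p = 0.4  # arbitrary success probability

# With r = 1 success, the negative binomial reduces to the geometric:
# nbinom.pmf(k, 1, p) gives P(k failures before the first success), which
# equals geom.pmf(k + 1, p), the probability of needing k + 1 trials
for k in range(10):
    assert np.isclose(nbinom.pmf(k, 1, p), geom.pmf(k + 1, p))

print(nbinom.pmf(3, 1, p))  # (1-p)^3 * p = 0.0864
```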
The code and plot are shown as follows:
In [189]: from scipy.stats import nbinom
          clrs = plt.rcParams['axes.prop_cycle'].by_key()['color']
          x = np.arange(0, 11)
          n_vals = [0.1, 1, 3, 6]
          p = 0.5
          for n, clr in zip(n_vals, clrs):
              rv = nbinom(n, p)
              plt.plot(x, rv.pmf(x), label="$n$=" + str(n), color=clr)
          plt.legend()
          plt.title("Negative Binomial Distribution PMF")
          plt.ylabel("PMF at $x$")
          plt.xlabel("$x$")
In a continuous probability distribution, the variable can take on any real number within its range. It is not limited to a finite set of values as with a discrete probability distribution. For example, the average weight of a healthy newborn baby ranges approximately between 6 and 9 lbs; its weight can be 7.3 lbs, for example. A continuous probability distribution is characterized by a probability density function (PDF).
The total probability that the random variable assumes any value in its range is 1; thus, the area under the graph of the probability density function is 1.
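This normalization can be checked numerically by integrating a density over its full support (a sketch using the standard normal density as the example):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# The area under any PDF over its full support is 1; here we verify it
# numerically for the standard normal density
area, abserr = quad(norm.pdf, -np.inf, np.inf)
print(area)  # ~1.0
```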
The uniform distribution models a random variable X that can take any value in the range [a, b] with equal probability.
The PDF is given by f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise.
The expectation and variance are given by E(X) = (a + b)/2 and Var(X) = (b − a)²/12.
A continuous uniform probability distribution is generated and plotted for various sample sizes in the following code and figure:
In [11]: np.random.seed(100)  # seed the random number generator
                              # so plots are reproducible
         fig, ax = plt.subplots(3, figsize=(10, 12))
         nsteps = 10
         for ctr in range(3):
             cud = np.random.uniform(0, 1, nsteps)  # generate distribution
             count, bins, ignored = ax[ctr].hist(cud, 15, density=True)
             ax[ctr].plot(bins, np.ones_like(bins), linewidth=2, color='r')
             ax[ctr].set_title('sample size=%s' % nsteps)
             nsteps *= 100
         fig.subplots_adjust(hspace=0.4)
         plt.suptitle("Continuous Uniform probability distributions "
                      "for various sample sizes", fontsize=14)
The exponential distribution models the waiting time between two events in a Poisson process, that is, a process in which events occur unpredictably but with a known average rate. The exponential distribution can be described as the continuous limit of the geometric distribution and is also Markovian (memoryless).
A memoryless random variable exhibits the property whereby its future state depends only on relevant information about the current time and not the information from further in the past. An example of modeling a Markovian/memoryless random variable is modeling short-term stock price behavior and the idea that it follows a random walk. This leads to what is called the Efficient Market hypothesis in Finance. For more information, refer to http://en.wikipedia.org/wiki/Random_walk_hypothesis.
The PDF of the exponential distribution is given by f(x) = λe⁻ᴸˣ (that is, λ times e raised to −λx) for x ≥ 0. The expectation and variance are given by E(X) = 1/λ and Var(X) = 1/λ².
For a reference, refer to the link at http://en.wikipedia.org/wiki/Exponential_distribution.
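The memoryless property mentioned above, P(X > s + t | X > s) = P(X > t), can be verified with the exponential's survival function (λ, s, and t are arbitrary illustrative values):

```python
import numpy as np
from scipy.stats import expon

lam = 0.5        # arbitrary rate parameter
s, t = 2.0, 3.0  # arbitrary waiting times
rv = expon(scale=1.0 / lam)

# Memorylessness: P(X > s + t | X > s) = P(X > t)
lhs = rv.sf(s + t) / rv.sf(s)  # sf is the survival function P(X > x)
rhs = rv.sf(t)
print(lhs, rhs)  # both equal exp(-lam * t) ~ 0.2231
```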
The plot of the distribution and code is given as follows:
In [15]: import scipy.stats
         from matplotlib import colors
         clrs = list(colors.cnames)
         x = np.linspace(0, 4, 100)
         expo = scipy.stats.expon
         lambda_ = [0.5, 1, 2, 5]
         plt.figure(figsize=(12, 4))
         for l, c in zip(lambda_, clrs):
             plt.plot(x, expo.pdf(x, scale=1./l), lw=2, color=c,
                      label="$\lambda = %.1f$" % l)
         plt.legend()
         plt.ylabel("PDF at $x$")
         plt.xlabel("$x$")
         plt.title("PDF of an Exponential random variable for various $\lambda$");
The most important distribution in statistics is arguably the normal/Gaussian distribution. It models the probability distribution around a central value with no left or right bias. There are many examples of phenomena that approximately follow the normal distribution, such as the heights of adults in a population and errors in physical measurements.
The normal distribution's importance is underlined by the central limit theorem, which states that the mean of many random variables drawn independently from the same distribution is approximately normal, regardless of the form of the original distribution. Its expected value and variance are E(X) = μ and Var(X) = σ², where μ is the mean and σ is the standard deviation.
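The central limit theorem can be illustrated numerically: means of samples drawn from a decidedly non-normal uniform distribution cluster into an approximately normal shape (the sample sizes and the seed here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 sample means, each of 50 draws from the (non-normal) Uniform(0, 1)
sample_means = rng.random((10_000, 50)).mean(axis=1)

# CLT: the means cluster around 0.5 with standard deviation
# sqrt(1/12) / sqrt(50) ~ 0.0408
print(sample_means.mean())  # ~0.5
print(sample_means.std())   # ~0.041
```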
The PDF of the normal distribution is given by the following expression:

f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
The following code and plot illustrate the formula:
In [54]: from scipy.stats import norm
         X = 2.5
         dx = 0.1
         R = np.arange(-X, X+dx, dx)
         L = list()
         sdL = (0.5, 1, 2, 3)
         for sd in sdL:
             f = norm.pdf
             L.append([f(x, loc=0, scale=sd) for x in R])
         colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
         for sd, c, P in zip(sdL, colors, L):
             plt.plot(R, P, zorder=1, lw=1.5, color=c,
                      label="$\sigma$=" + str(sd))
         plt.legend()
         ax = plt.gca()
         ax.set_xlim(-2.1, 2.1)
         ax.set_ylim(0, 1.0)
         plt.title("Normal distribution PDF")
         plt.ylabel("PDF at $\mu$=0, for various $\sigma$")
Reference for the Python code for the plotting of the distributions can be found at: http://bit.ly/1E17nYx.
The normal distribution can also be regarded as the continuous limit of the binomial distribution and other distributions as n → ∞. We can see this for the binomial distribution in the code and plots as follows:
In [18]: from scipy.stats import binom
         from matplotlib import colors
         cols = list(colors.cnames)
         n_values = [1, 5, 10, 30, 100]
         fig, ax = plt.subplots(len(n_values), figsize=(6, 12))
         p = 0.5
         ctr = 0
         for n, color in zip(n_values, cols):
             k = np.arange(0, n + 1)
             rv = binom(n, p)
             ax[ctr].plot(k, rv.pmf(k), lw=2, color=color)
             ax[ctr].set_title("$n$=" + str(n))
             ctr += 1
         fig.subplots_adjust(hspace=0.5)
         plt.suptitle("Binomial distribution PMF (p=0.5) "
                      "for various values of n", fontsize=14)
As n increases, the binomial distribution approaches the normal distribution; for n >= 30, the approximation is clearly visible in the preceding plots.