Bayesian statistics isn't just another method; it is an entirely different paradigm for practicing statistics. It uses probability models to make inferences, given the data that we have collected. This is captured in the fundamental expression P(H|D).
Here, H is our hypothesis, that is, the thing we're trying to prove, and D is our data or observations.
As a reminder from our previous discussion, the diachronic form of Bayes' theorem is as follows:

P(H|D) = P(H) P(D|H) / P(D)
Here, P(H) is an unconditional prior probability that represents what we know before we conduct our trial. P(D|H) is our likelihood function or probability of obtaining the data we observe, given that our hypothesis is true.
P(D) is the probability of the data, also known as the normalizing constant. It can be obtained by integrating the numerator over H:

P(D) = ∫ P(H) P(D|H) dH
The likelihood function is the most important piece of our Bayesian calculation and encapsulates all of the information about the unknowns in the data. It resembles a probability mass function read in reverse: the data is held fixed and P(D|H) is viewed as a function of the hypothesis.
One argument against adopting a Bayesian approach is that choosing the prior can be subjective. There are also many arguments in its favor; one of them, as mentioned previously, is that external prior information can be incorporated directly into the model.
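To make these pieces concrete, here is a minimal sketch of a Bayesian update for a toy two-hypothesis problem (the coin example and all of its numbers are purely illustrative and are not part of the FB analysis that follows):

# Toy example: is a coin fair or biased, given 7 heads in 10 tosses?
# H1: fair coin, P(heads) = 0.5; H2: biased coin, P(heads) = 0.8
priors = {'fair': 0.5, 'biased': 0.5}              # P(H): beliefs before the trial
likelihoods = {'fair': 0.5**7 * 0.5**3,            # P(D|H): probability of the observed data
               'biased': 0.8**7 * 0.2**3}

# P(D): the normalizing constant, obtained by summing P(H)*P(D|H) over H
p_data = sum(priors[h] * likelihoods[h] for h in priors)

# P(H|D): the posterior for each hypothesis
posteriors = {h: priors[h] * likelihoods[h] / p_data for h in priors}
print(posteriors)   # roughly {'fair': 0.37, 'biased': 0.63}

The posterior shifts toward the biased hypothesis because the observed data is more probable under it, while the prior keeps the fair hypothesis in play.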
The normalizing constant P(D) is an integral that, in simple cases, can be obtained by analytic integration. For more complicated models involving higher-dimensional integrals, Monte Carlo (MC) integration is needed to approximate it.
MC integration can be computed via a variety of sampling methods, such as uniform sampling, stratified sampling, and importance sampling. Writing the integrand f in terms of a proposal density g, we can express the integral as follows:

I = ∫ f(x) dx = ∫ [f(x)/g(x)] g(x) dx

We can approximate the integral by the following finite sum:

I ≈ (1/n) Σ f(x_i)/g(x_i),  for i = 1, ..., n

where the x_i are sample vectors drawn from g. The proof that this estimate is a good one follows from the law of large numbers, together with a check that the simulation error is small.
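Here is a minimal sketch of MC integration in plain NumPy, using a uniform proposal density g on [0, 1]; the integrand is arbitrary and chosen only for illustration, and the estimate improves as n grows, per the law of large numbers:

import numpy as np

np.random.seed(42)

def f(x):
    # Integrand chosen purely for illustration: integrate exp(-x**2) over [0, 1]
    return np.exp(-x**2)

n = 100000
x = np.random.uniform(0.0, 1.0, n)   # samples drawn from g (uniform on [0, 1], so g(x) = 1)
estimate = np.mean(f(x) / 1.0)       # (1/n) * sum of f(x_i)/g(x_i)
print(estimate)                      # approximately 0.7468, close to the analytic value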
To conduct Bayesian analysis in Python, we need a module that enables us to perform the Monte Carlo computations described earlier. The PyMC library fulfills that need. It provides a Monte Carlo method commonly known as Markov Chain Monte Carlo (MCMC). We will not delve further into the technical details of MCMC here, but the interested reader can find out more about the MCMC implementation in the PyMC documentation.
MCMC is not a panacea; there are some drawbacks to the approach, one of which is the slow convergence of the algorithm.
Here, we will use Bayesian inference to model an interesting dataset. The dataset in question consists of the author's Facebook (FB) post history over time. We have scrubbed the FB history data and saved the dates in the fb_post_dates.txt file. Here is what the data in the file looks like:
head -2 ../fb_post_dates.txt
Tuesday, September 30, 2014 | 2:43am EDT
Tuesday, September 30, 2014 | 2:22am EDT
Thus, we see a datetime series representing the dates and times at which the author posted on FB. First, we read the file into a DataFrame, separating the timestamp into Date and Time columns:
In [91]: filePath="./data/fb_post_dates.txt"
         fbdata_df=pd.read_csv(filePath, sep='|', parse_dates=[0],
                               header=None, names=['Date','Time'])
Next, we inspect the data as follows:
In [92]: fbdata_df.head()    #inspect the data
Out[92]:         Date        Time
         0 2014-09-30  2:43am EDT
         1 2014-09-30  2:22am EDT
         2 2014-09-30  2:06am EDT
         3 2014-09-30  1:07am EDT
         4 2014-09-28  9:16pm EDT
Now, we index the data by Date, creating a DatetimeIndex so that we can run resample on it to count by month, as follows:
In [115]: fbdata_df_ind=fbdata_df.set_index('Date')
          fbdata_df_ind.head(5)
Out[115]:                   Time
          Date
          2014-09-30  2:43am EDT
          2014-09-30  2:22am EDT
          2014-09-30  2:06am EDT
          2014-09-30  1:07am EDT
          2014-09-28  9:16pm EDT
We display information about the index as follows:
In [116]: fbdata_df_ind.index
Out[116]: <class 'pandas.tseries.index.DatetimeIndex'>
          [2014-09-30, ..., 2007-04-16]
          Length: 7713, Freq: None, Timezone: None
We now obtain the count of posts by month using resample:
In [99]: fb_mth_count_=fbdata_df_ind.resample('M', how='count')
         fb_mth_count_.rename(columns={'Time':'Count'},
                              inplace=True)   # Rename
         fb_mth_count_.head()
Out[99]:             Count
         Date
         2007-04-30      1
         2007-05-31      0
         2007-06-30      5
         2007-07-31     50
         2007-08-31     24
The Date is shown as the last day of each month. Now, we create a scatter plot of FB post counts from 2007 to 2015 in matplotlib, making the size of each dot proportional to its value:
In [108]: %matplotlib inline
          import matplotlib.pyplot as plt
          #Obtain the count data from the DataFrame as a dictionary,
          #keyed by the month-end dates in the DatetimeIndex
          year_month_count=fb_mth_count_.to_dict()['Count']
          size=len(year_month_count.keys())
          #The keys are already Timestamps, so they can be plotted directly
          xdates=list(year_month_count.keys())
          counts=list(year_month_count.values())
          plt.scatter(xdates, counts, s=counts)
          plt.xlabel('Year')
          plt.ylabel('Number of Facebook posts')
          plt.show()
The question we would like to investigate is whether there was a change in behavior at some point over the time period. Specifically, we wish to identify whether there was a specific period at which the mean number of FB posts changed. This is often referred to as the Switchpoint or changepoint in a time series.
We can make use of the Poisson distribution to model this. You might recall that the Poisson distribution can be used to model time series count data. (Refer to http://bit.ly/1JniIqy for more about this.)
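As a quick illustration of how the Poisson distribution generates count data, the following sketch draws two years of hypothetical monthly counts with made-up rate parameters (these numbers are illustrative only and are unrelated to the actual FB data):

import numpy as np

np.random.seed(0)

# Hypothetical rates: an "early" period averaging 5 posts/month,
# and a "late" period averaging 50 posts/month
early_counts = np.random.poisson(lam=5, size=12)
late_counts = np.random.poisson(lam=50, size=12)

print(early_counts)                              # small, non-negative integer counts
print(late_counts)                               # noticeably larger counts
print(early_counts.mean(), late_counts.mean())   # sample means track the rate parameters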
If we represent our monthly FB post count in month i by C_i, we can represent our model as follows:

C_i ~ Poisson(r_i)
The parameter r_i is the rate parameter of the Poisson distribution, but we don't know its value. If we examine the scatter plot of the FB time series count data, we can see that there was a jump in the number of posts sometime around mid to late 2010, perhaps coinciding with the start of the 2010 World Cup in South Africa, which the author attended.
The parameter s is the Switchpoint, which is when the rate parameter changes, while e and l are the values of the rate parameter before and after the Switchpoint, respectively. This can be represented as follows:

r_i = e  if i < s
r_i = l  if i >= s
Note that the variables specified above are all Bayesian random variables, representing our beliefs about their values, so we need to model them using probability distributions. We would like to infer the values of e, l, and s, which are unknown. In PyMC, we can represent random variables using the Stochastic and Deterministic classes. We note that the exponential distribution models the amount of time between Poisson events. Hence, in the case of e and l, we choose the exponential distribution to model them, since they can be any positive number:

e ~ Exponential(r_e)
l ~ Exponential(r_l)
In the case of s, we will choose to model it using the uniform distribution, which reflects our belief that the Switchpoint is equally likely to occur at any point within the entire time period. Hence, we have this:

s ~ DiscreteUniform(t_l, t_h)
Here, t_l and t_h correspond to the lower and upper boundaries of the time period. Let us now use PyMC to represent the model that we developed earlier and see whether we can detect a Switchpoint in the FB post data. In addition to the scatter plot, we can also display the data in a bar chart. To do that, we first need to obtain a count of FB posts ordered by month in a list:
In [69]: fb_activity_data = [year_month_count[k] for k in
                             sorted(year_month_count.keys())]
         fb_activity_data[:5]
Out[70]: [1, 0, 5, 50, 24]

In [71]: fb_post_count=len(fb_activity_data)
We render the bar plot using matplotlib:
In [72]: from IPython.core.pylabtools import figsize
         import numpy as np
         import matplotlib.pyplot as plt
         figsize(8, 5)
         plt.bar(np.arange(fb_post_count), fb_activity_data,
                 color="#49a178")
         plt.xlabel("Time (months)")
         plt.ylabel("Number of FB posts")
         plt.title("Monthly Facebook posts over time")
         plt.xlim(0,fb_post_count);
Looking at the preceding bar chart, can one conclude whether there was a change in FB posting frequency over time? We can use PyMC on the model that we have developed to help us find out, as follows:
In [88]: # Define data and stochastics
         import pymc as pm
         switchpoint = pm.DiscreteUniform('switchpoint',
                                          lower=0,
                                          upper=len(fb_activity_data)-1,
                                          doc='Switchpoint[month]')
         avg = np.mean(fb_activity_data)
         early_mean = pm.Exponential('early_mean', beta=1./avg)
         late_mean = pm.Exponential('late_mean', beta=1./avg)
         late_mean
Out[88]: <pymc.distributions.Exponential 'late_mean' at 0x10ee56d50>
Here, we define a function for the rate parameter, r, and we model the count data using a Poisson distribution as discussed previously:
In [89]: @pm.deterministic(plot=False)
         def rate(s=switchpoint, e=early_mean, l=late_mean):
             ''' Concatenate Poisson means '''
             out = np.zeros(len(fb_activity_data))
             out[:s] = e
             out[s:] = l
             return out

         fb_activity = pm.Poisson('fb_activity', mu=rate,
                                  value=fb_activity_data, observed=True)
         fb_activity
Out[89]: <pymc.distributions.Poisson 'fb_activity' at 0x10ed1ee50>
In the preceding code snippet, @pm.deterministic is a decorator that denotes that the rate function is deterministic, meaning that its values are entirely determined by other variables (in this case, e, s, and l). The decorator is necessary to tell PyMC to convert the rate function into a deterministic object; if we do not specify the decorator, an error occurs. (For more on Python decorators, refer to http://bit.ly/1zj8U0o.)
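For readers unfamiliar with decorators, the following toy example (unrelated to PyMC) shows the basic mechanism: a decorator is a function that takes a function and returns a wrapped object. @pm.deterministic does the same kind of thing, returning a Deterministic object in place of the plain rate function:

def shout(func):
    # A decorator: takes a function and returns a wrapped version of it
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout                        # equivalent to: greet = shout(greet)
def greet(name):
    return "hello, " + name

print(greet("bayes"))         # prints: HELLO, BAYES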
We now create a model with the FB count data (fb_activity) and the early_mean, late_mean, and rate parameters.
Next, using PyMC, we create an MCMC object that enables us to fit our data using Markov Chain Monte Carlo methods. We then call sample on the resulting MCMC object to do the fitting:
In [94]: fb_activity_model=pm.Model([fb_activity,early_mean,
                                     late_mean,rate])
In [95]: from pymc import MCMC
         fbM=MCMC(fb_activity_model)
In [96]: fbM.sample(iter=40000,burn=1000, thin=20)
         [-----------------100%-----------------] 40000 of 40000 complete in 11.0 sec
Fitting the model using MCMC involves using Markov Chain Monte Carlo methods to generate a probability distribution for the posterior, P(s,e,l | D). It uses the Monte Carlo process to repeatedly draw samples of the parameters and does this until the algorithm appears to have converged to a steady state, based on multiple criteria. This is a Markov process because successive samples depend only on the previous sample. (For further reference on Markov chain convergence, refer to http://bit.ly/1IETkhC.)
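To give a feel for what is happening under the hood, here is a minimal, purely illustrative Metropolis sampler for a one-dimensional problem. This is not PyMC's implementation; it is only a sketch of the Markov-chain idea that each proposal depends solely on the current sample, with a toy standard-normal target standing in for the posterior:

import numpy as np

np.random.seed(1)

def log_posterior(theta):
    # Toy un-normalized log posterior: a standard normal, for illustration only
    return -0.5 * theta**2

samples = []
current = 0.0
for _ in range(40000):
    proposal = current + np.random.normal(scale=1.0)    # depends only on the current state
    log_accept = log_posterior(proposal) - log_posterior(current)
    if np.log(np.random.uniform()) < log_accept:        # Metropolis acceptance rule
        current = proposal
    samples.append(current)

# Discard burn-in and thin, analogous to fbM.sample(iter=40000, burn=1000, thin=20)
kept = samples[1000::20]
print(np.mean(kept), np.std(kept))   # should be close to 0 and 1 for this toy target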
The generated samples are referred to as traces. We can see what the marginal posterior distribution of each parameter looks like by viewing a histogram of its trace:
In [97]: from pylab import hist,show
         %matplotlib inline
         hist(fbM.trace('late_mean')[:])
Out[97]: (array([  15.,   61.,  214.,  421.,  517.,  426.,  202.,   70.,   21.,    3.]),
          array([ 102.29451192,  103.25158404,  104.20865616,  105.16572829,
                  106.12280041,  107.07987253,  108.03694465,  108.99401677,
                  109.95108889,  110.90816101,  111.86523313]),
          <a list of 10 Patch objects>)

In [98]: plt.hist(fbM.trace('early_mean')[:])
Out[98]: (array([  20.,  105.,  330.,  489.,  470.,  314.,  147.,   60.,    3.,   12.]),
          array([ 49.19781192,  50.07760882,  50.95740571,  51.83720261,
                  52.71699951,  53.59679641,  54.47659331,  55.35639021,
                  56.2361871 ,  57.115984  ,  57.9957809 ]),
          <a list of 10 Patch objects>)
Here, we see what the Switchpoint looks like in terms of the number of months:
In [99]:  fbM.trace('switchpoint')[:]
Out[99]:  array([38, 38, 38, ..., 35, 35, 35])

In [150]: plt.hist(fbM.trace('switchpoint')[:])
Out[150]: (array([ 1899.,     0.,     0.,     0.,     0.,
                      0.,     0.,     0.,     0.,    51.]),
           array([ 35. ,  35.3,  35.6,  35.9,  36.2,  36.5,
                   36.8,  37.1,  37.4,  37.7,  38. ]),
           <a list of 10 Patch objects>)
We can see that the Switchpoint is in the neighborhood of months 35-38. Here, we use matplotlib to display the marginal posterior distributions of e, s, and l in a single figure:
In [141]: early_mean_samples=fbM.trace('early_mean')[:]
          late_mean_samples=fbM.trace('late_mean')[:]
          switchpoint_samples=fbM.trace('switchpoint')[:]

In [142]: from IPython.core.pylabtools import figsize
          figsize(12.5, 10)
          # histogram of the samples:
          fig = plt.figure()
          fig.subplots_adjust(bottom=-0.05)
          n_mths=len(fb_activity_data)

          ax = plt.subplot(311)
          ax.set_autoscaley_on(False)
          plt.hist(early_mean_samples, histtype='stepfilled',
                   bins=30, alpha=0.85, label="posterior of $e$",
                   color="turquoise", normed=True)
          plt.legend(loc="upper left")
          plt.title(r"""Posterior distributions of the variables $e, l, s$""",
                    fontsize=16)
          plt.xlim([40, 120])
          plt.ylim([0, 0.6])
          plt.xlabel("$e$ value", fontsize=14)

          ax = plt.subplot(312)
          ax.set_autoscaley_on(False)
          plt.hist(late_mean_samples, histtype='stepfilled',
                   bins=30, alpha=0.85, label="posterior of $l$",
                   color="purple", normed=True)
          plt.legend(loc="upper left")
          plt.xlim([40, 120])
          plt.ylim([0, 0.6])
          plt.xlabel("$l$ value", fontsize=14)

          plt.subplot(313)
          w = 1.0 / switchpoint_samples.shape[0] * np.ones_like(switchpoint_samples)
          plt.hist(switchpoint_samples, bins=range(0, n_mths), alpha=1,
                   label=r"posterior of $s$", color="green",
                   weights=w, rwidth=2.)
          plt.xlim([20, n_mths - 20])
          plt.xlabel(r"$s$ (in months)", fontsize=14)
          plt.ylabel("probability")
          plt.legend(loc="upper left")

          plt.show()
PyMC also has plotting functionality (it uses matplotlib). In the following plots, we display a time series plot, an autocorrelation plot (acorr), and a histogram of the samples drawn for the early mean, late mean, and the Switchpoint. The histogram is useful for visualizing the posterior distribution. The autocorrelation plot shows whether values in the previous period are strongly related to values in the current period.
In [100]: from pymc.Matplot import plot
          plot(fbM)
          Plotting late_mean
          Plotting switchpoint
          Plotting early_mean
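As an aside, if you would like a numerical check on what the acorr panels show, the sample autocorrelation of a trace can be computed directly with NumPy. This short sketch assumes the fbM object from the sampling step above:

import numpy as np

def autocorr(trace, lag):
    # Sample autocorrelation of a 1-D trace at the given positive lag
    x = np.asarray(trace, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

late_trace = fbM.trace('late_mean')[:]
print([round(autocorr(late_trace, lag), 3) for lag in (1, 5, 10)])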
The following is the late mean plot:
Here, we display the Switchpoint plot:
Here, we display the early mean plot:
From the output of PyMC, we can conclude that the Switchpoint is around 35-38 months from the start of the time series. This corresponds to sometime around March-July 2010. The author can testify that this was a banner year for him with respect to the use of FB since it was the year of the football (soccer) World Cup finals that were held in South Africa, which he attended.
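As a rough sanity check on this reading of the Switchpoint, we can translate the sampled month indices back into calendar dates, assuming month 0 corresponds to April 2007, the first month in the resampled series:

import numpy as np
import pandas as pd

start = pd.Timestamp('2007-04-30')                 # first month-end in fb_mth_count_
switchpoint_samples = fbM.trace('switchpoint')[:]
mean_month = int(round(np.mean(switchpoint_samples)))
print(start + pd.DateOffset(months=mean_month))    # lands in the first half of 2010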