Bayesian statistics isn't just another method; it is an entirely different paradigm for practicing statistics. It uses probability models to make inferences, given the data that we have collected. This is captured in the fundamental expression P(H|D).
Here, H is our hypothesis, that is, the thing we're trying to prove, and D is our data or observations.
As a reminder from our previous discussion, the diachronic form of Bayes' theorem is as follows:

P(H|D) = P(H) P(D|H) / P(D)
Here, P(H) is an unconditional prior probability that represents what we know before we conduct our trial. P(D|H) is our likelihood function or probability of obtaining the data we observe, given that our hypothesis is true.
P(D) is the probability of the data, also known as the normalizing constant. It can be obtained by integrating the numerator over H:

P(D) = ∫ P(H) P(D|H) dH
The likelihood function is the most important piece of our Bayesian calculation and encapsulates all of the information about the unknowns in the data. It resembles a probability mass function read in reverse: the data is held fixed and P(D|H) is viewed as a function of the hypothesis.
One argument against adopting a Bayesian approach is that choosing the prior can be subjective. There are also many arguments in its favor; one of them, as mentioned previously, is that external prior information can be incorporated directly into the model.
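To make these pieces concrete, here is a minimal sketch of a Bayesian update for a toy two-hypothesis problem (the coin example and all of its numbers are purely illustrative and are not part of the FB analysis that follows):

# Toy example: is a coin fair or biased, given 7 heads in 10 tosses?
# H1: fair coin, P(heads) = 0.5; H2: biased coin, P(heads) = 0.8
priors = {'fair': 0.5, 'biased': 0.5}              # P(H): beliefs before the trial
likelihoods = {'fair': 0.5**7 * 0.5**3,            # P(D|H): probability of the observed data
               'biased': 0.8**7 * 0.2**3}

# P(D): the normalizing constant, obtained by summing P(H)*P(D|H) over H
p_data = sum(priors[h] * likelihoods[h] for h in priors)

# P(H|D): the posterior for each hypothesis
posteriors = {h: priors[h] * likelihoods[h] / p_data for h in priors}
print(posteriors)   # roughly {'fair': 0.37, 'biased': 0.63}

The posterior shifts toward the biased hypothesis because the observed data is more probable under it, while the prior keeps the fair hypothesis in play.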
The normalizing constant P(D) is an integral that, in simple cases, can be obtained by analytic integration. For more complicated models involving higher-dimensional integrals, Monte Carlo (MC) integration is needed to approximate it.
MC integration can be computed via a variety of sampling methods, such as uniform sampling, stratified sampling, and importance sampling. Writing the integrand f in terms of a proposal density g, we can express the integral as follows:

I = ∫ f(x) dx = ∫ [f(x)/g(x)] g(x) dx

We can approximate the integral by the following finite sum:

I ≈ (1/n) Σ f(x_i)/g(x_i),  for i = 1, ..., n

where the x_i are sample vectors drawn from g. The proof that this estimate is a good one follows from the law of large numbers, together with a check that the simulation error is small.
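Here is a minimal sketch of MC integration in plain NumPy, using a uniform proposal density g on [0, 1]; the integrand is arbitrary and chosen only for illustration, and the estimate improves as n grows, per the law of large numbers:

import numpy as np

np.random.seed(42)

def f(x):
    # Integrand chosen purely for illustration: integrate exp(-x**2) over [0, 1]
    return np.exp(-x**2)

n = 100000
x = np.random.uniform(0.0, 1.0, n)   # samples drawn from g (uniform on [0, 1], so g(x) = 1)
estimate = np.mean(f(x) / 1.0)       # (1/n) * sum of f(x_i)/g(x_i)
print(estimate)                      # approximately 0.7468, close to the analytic value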
To conduct Bayesian analysis in Python, we need a module that enables us to perform the Monte Carlo computations described earlier. The PyMC library fulfills that need. It provides a Monte Carlo method commonly known as Markov Chain Monte Carlo (MCMC). We will not delve further into the technical details of MCMC here, but the interested reader can find out more about the MCMC implementation in the PyMC documentation.
MCMC is not a panacea; there are some drawbacks to the approach, one of which is the slow convergence of the algorithm.
Here, we will use Bayesian inference to model an interesting dataset. The dataset in question consists of the author's Facebook (FB) post history over time. We have scrubbed the FB history data and saved the dates in the fb_post_dates.txt file. Here is what the data in the file looks like:
head -2 ../fb_post_dates.txt
Tuesday, September 30, 2014 | 2:43am EDT
Tuesday, September 30, 2014 | 2:22am EDT
Thus, we see a datetime series representing the dates and times at which the author posted on FB. First, we read the file into a DataFrame, separating the timestamp into Date and Time columns:
In [91]: filePath="./data/fb_post_dates.txt"
         fbdata_df=pd.read_csv(filePath, sep='|', parse_dates=[0],
                               header=None, names=['Date','Time'])
Next, we inspect the data as follows:
In [92]: fbdata_df.head()    #inspect the data
Out[92]:         Date        Time
         0 2014-09-30  2:43am EDT
         1 2014-09-30  2:22am EDT
         2 2014-09-30  2:06am EDT
         3 2014-09-30  1:07am EDT
         4 2014-09-28  9:16pm EDT
Now, we index the data by Date, creating a DatetimeIndex so that we can run resample on it to count by month, as follows:
In [115]: fbdata_df_ind=fbdata_df.set_index('Date')
          fbdata_df_ind.head(5)
Out[115]:                   Time
          Date
          2014-09-30  2:43am EDT
          2014-09-30  2:22am EDT
          2014-09-30  2:06am EDT
          2014-09-30  1:07am EDT
          2014-09-28  9:16pm EDT
We display information about the index as follows:
In [116]: fbdata_df_ind.index
Out[116]: <class 'pandas.tseries.index.DatetimeIndex'>
          [2014-09-30, ..., 2007-04-16]
          Length: 7713, Freq: None, Timezone: None
We now obtain the count of posts by month using resample:
In [99]: fb_mth_count_=fbdata_df_ind.resample('M', how='count')
         fb_mth_count_.rename(columns={'Time':'Count'},
                              inplace=True)   # Rename
         fb_mth_count_.head()
Out[99]:             Count
         Date
         2007-04-30      1
         2007-05-31      0
         2007-06-30      5
         2007-07-31     50
         2007-08-31     24
The Date is shown as the last day of each month. Now, we create a scatter plot of FB post counts from 2007 to 2015 in matplotlib, making the size of each dot proportional to its value:
In [108]: %matplotlib inline
          import matplotlib.pyplot as plt
          #Obtain the count data from the DataFrame as a dictionary,
          #keyed by the month-end dates in the DatetimeIndex
          year_month_count=fb_mth_count_.to_dict()['Count']
          size=len(year_month_count.keys())
          #The keys are already Timestamps, so they can be plotted directly
          xdates=list(year_month_count.keys())
          counts=list(year_month_count.values())
          plt.scatter(xdates, counts, s=counts)
          plt.xlabel('Year')
          plt.ylabel('Number of Facebook posts')
          plt.show()
The question we would like to investigate is whether there was a change in behavior at some point over the time period. Specifically, we wish to identify whether there was a specific period at which the mean number of FB posts changed. This is often referred to as the Switchpoint or changepoint in a time series.
We can make use of the Poisson distribution to model this. You might recall that the Poisson distribution can be used to model time series count data. (Refer to http://bit.ly/1JniIqy for more about this.)
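As a quick illustration of how the Poisson distribution generates count data, the following sketch draws two years of hypothetical monthly counts with made-up rate parameters (these numbers are illustrative only and are unrelated to the actual FB data):

import numpy as np

np.random.seed(0)

# Hypothetical rates: an "early" period averaging 5 posts/month,
# and a "late" period averaging 50 posts/month
early_counts = np.random.poisson(lam=5, size=12)
late_counts = np.random.poisson(lam=50, size=12)

print(early_counts)                              # small, non-negative integer counts
print(late_counts)                               # noticeably larger counts
print(early_counts.mean(), late_counts.mean())   # sample means track the rate parameters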
If we represent our monthly FB post count in month i by C_i, we can represent our model as follows:

C_i ~ Poisson(r_i)
The parameter r_i is the rate parameter of the Poisson distribution, but we don't know its value. If we examine the scatter plot of the FB time series count data, we can see that there was a jump in the number of posts sometime around mid to late 2010, perhaps coinciding with the start of the 2010 World Cup in South Africa, which the author attended.
The parameter s is the Switchpoint, which is when the rate parameter changes, while e and l are the values of the rate parameter before and after the Switchpoint, respectively. This can be represented as follows:

r_i = e  if i < s
r_i = l  if i >= s
Note that the variables specified above are all Bayesian random variables, representing our beliefs about their values, so we need to model them using probability distributions. We would like to infer the values of e, l, and s, which are unknown. In PyMC, we can represent random variables using the Stochastic and Deterministic classes. We note that the exponential distribution models the amount of time between Poisson events. Hence, in the case of e and l, we choose the exponential distribution to model them, since they can be any positive number:

e ~ Exponential(r_e)
l ~ Exponential(r_l)
In the case of s, we will choose to model it using the uniform distribution, which reflects our belief that the Switchpoint is equally likely to occur at any point within the entire time period. Hence, we have this:

s ~ DiscreteUniform(t_l, t_h)
Here, t_l and t_h correspond to the lower and upper boundaries of the time period. Let us now use PyMC to represent the model that we developed earlier and see whether we can detect a Switchpoint in the FB post data. In addition to the scatter plot, we can also display the data in a bar chart. To do that, we first need to obtain a count of FB posts ordered by month in a list:
In [69]: fb_activity_data = [year_month_count[k] for k in
                             sorted(year_month_count.keys())]
         fb_activity_data[:5]
Out[70]: [1, 0, 5, 50, 24]

In [71]: fb_post_count=len(fb_activity_data)
We render the bar plot using matplotlib:
In [72]: from IPython.core.pylabtools import figsize
         import numpy as np
         import matplotlib.pyplot as plt
         figsize(8, 5)
         plt.bar(np.arange(fb_post_count), fb_activity_data,
                 color="#49a178")
         plt.xlabel("Time (months)")
         plt.ylabel("Number of FB posts")
         plt.title("Monthly Facebook posts over time")
         plt.xlim(0,fb_post_count);
Looking at the preceding bar chart, can one conclude whether there was a change in FB posting frequency over time? We can use PyMC on the model that we have developed to help us find out, as follows:
In [88]: # Define data and stochastics
         import pymc as pm
         switchpoint = pm.DiscreteUniform('switchpoint',
                                          lower=0,
                                          upper=len(fb_activity_data)-1,
                                          doc='Switchpoint[month]')
         avg = np.mean(fb_activity_data)
         early_mean = pm.Exponential('early_mean', beta=1./avg)
         late_mean = pm.Exponential('late_mean', beta=1./avg)
         late_mean
Out[88]: <pymc.distributions.Exponential 'late_mean' at 0x10ee56d50>
Here, we define a function for the rate parameter, r, and we model the count data using a Poisson distribution as discussed previously:
In [89]: @pm.deterministic(plot=False)
         def rate(s=switchpoint, e=early_mean, l=late_mean):
             ''' Concatenate Poisson means '''
             out = np.zeros(len(fb_activity_data))
             out[:s] = e
             out[s:] = l
             return out

         fb_activity = pm.Poisson('fb_activity', mu=rate,
                                  value=fb_activity_data, observed=True)
         fb_activity
Out[89]: <pymc.distributions.Poisson 'fb_activity' at 0x10ed1ee50>
In the preceding code snippet, @pm.deterministic is a decorator that denotes that the rate function is deterministic, meaning that its values are entirely determined by other variables (in this case, e, s, and l). The decorator is necessary to tell PyMC to convert the rate function into a deterministic object; if we do not specify the decorator, an error occurs. (For more on Python decorators, refer to http://bit.ly/1zj8U0o.)
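For readers unfamiliar with decorators, the following toy example (unrelated to PyMC) shows the basic mechanism: a decorator is a function that takes a function and returns a wrapped object. @pm.deterministic does the same kind of thing, returning a Deterministic object in place of the plain rate function:

def shout(func):
    # A decorator: takes a function and returns a wrapped version of it
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout                        # equivalent to: greet = shout(greet)
def greet(name):
    return "hello, " + name

print(greet("bayes"))         # prints: HELLO, BAYES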
We now create a model with the FB count data (fb_activity) and the early_mean, late_mean, and rate parameters.
Next, using PyMC, we create an MCMC object that enables us to fit our data using Markov Chain Monte Carlo methods. We then call sample on the resulting MCMC object to do the fitting:
In [94]: fb_activity_model=pm.Model([fb_activity,early_mean,
                                     late_mean,rate])
In [95]: from pymc import MCMC
         fbM=MCMC(fb_activity_model)
In [96]: fbM.sample(iter=40000,burn=1000, thin=20)
         [-----------------100%-----------------] 40000 of 40000 complete in 11.0 sec
Fitting the model using MCMC involves using Markov Chain Monte Carlo methods to generate a probability distribution for the posterior, P(s,e,l | D). It uses the Monte Carlo process to repeatedly draw samples of the parameters and does this until the algorithm appears to have converged to a steady state, based on multiple criteria. This is a Markov process because successive samples depend only on the previous sample. (For further reference on Markov chain convergence, refer to http://bit.ly/1IETkhC.)
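To give a feel for what is happening under the hood, here is a minimal, purely illustrative Metropolis sampler for a one-dimensional problem. This is not PyMC's implementation; it is only a sketch of the Markov-chain idea that each proposal depends solely on the current sample, with a toy standard-normal target standing in for the posterior:

import numpy as np

np.random.seed(1)

def log_posterior(theta):
    # Toy un-normalized log posterior: a standard normal, for illustration only
    return -0.5 * theta**2

samples = []
current = 0.0
for _ in range(40000):
    proposal = current + np.random.normal(scale=1.0)    # depends only on the current state
    log_accept = log_posterior(proposal) - log_posterior(current)
    if np.log(np.random.uniform()) < log_accept:        # Metropolis acceptance rule
        current = proposal
    samples.append(current)

# Discard burn-in and thin, analogous to fbM.sample(iter=40000, burn=1000, thin=20)
kept = samples[1000::20]
print(np.mean(kept), np.std(kept))   # should be close to 0 and 1 for this toy target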
The generated samples are referred to as traces. We can see what the marginal posterior distribution of each parameter looks like by viewing a histogram of its trace:
In [97]: from pylab import hist,show
         %matplotlib inline
         hist(fbM.trace('late_mean')[:])
Out[97]: (array([  15.,   61.,  214.,  421.,  517.,  426.,  202.,   70.,   21.,    3.]),
          array([ 102.29451192,  103.25158404,  104.20865616,  105.16572829,
                  106.12280041,  107.07987253,  108.03694465,  108.99401677,
                  109.95108889,  110.90816101,  111.86523313]),
          <a list of 10 Patch objects>)

In [98]: plt.hist(fbM.trace('early_mean')[:])
Out[98]: (array([  20.,  105.,  330.,  489.,  470.,  314.,  147.,   60.,    3.,   12.]),
          array([ 49.19781192,  50.07760882,  50.95740571,  51.83720261,
                  52.71699951,  53.59679641,  54.47659331,  55.35639021,
                  56.2361871 ,  57.115984  ,  57.9957809 ]),
          <a list of 10 Patch objects>)
Here, we see what the Switchpoint looks like in terms of the number of months:
In [99]:  fbM.trace('switchpoint')[:]
Out[99]:  array([38, 38, 38, ..., 35, 35, 35])

In [150]: plt.hist(fbM.trace('switchpoint')[:])
Out[150]: (array([ 1899.,     0.,     0.,     0.,     0.,
                      0.,     0.,     0.,     0.,    51.]),
           array([ 35. ,  35.3,  35.6,  35.9,  36.2,  36.5,
                   36.8,  37.1,  37.4,  37.7,  38. ]),
           <a list of 10 Patch objects>)
We can see that the Switchpoint is in the neighborhood of months 35-38. Here, we use matplotlib to display the marginal posterior distributions of e, s, and l in a single figure:
In [141]: early_mean_samples=fbM.trace('early_mean')[:]
          late_mean_samples=fbM.trace('late_mean')[:]
          switchpoint_samples=fbM.trace('switchpoint')[:]

In [142]: from IPython.core.pylabtools import figsize
          figsize(12.5, 10)
          # histogram of the samples:
          fig = plt.figure()
          fig.subplots_adjust(bottom=-0.05)
          n_mths=len(fb_activity_data)

          ax = plt.subplot(311)
          ax.set_autoscaley_on(False)
          plt.hist(early_mean_samples, histtype='stepfilled',
                   bins=30, alpha=0.85, label="posterior of $e$",
                   color="turquoise", normed=True)
          plt.legend(loc="upper left")
          plt.title(r"""Posterior distributions of the variables $e, l, s$""",
                    fontsize=16)
          plt.xlim([40, 120])
          plt.ylim([0, 0.6])
          plt.xlabel("$e$ value", fontsize=14)

          ax = plt.subplot(312)
          ax.set_autoscaley_on(False)
          plt.hist(late_mean_samples, histtype='stepfilled',
                   bins=30, alpha=0.85, label="posterior of $l$",
                   color="purple", normed=True)
          plt.legend(loc="upper left")
          plt.xlim([40, 120])
          plt.ylim([0, 0.6])
          plt.xlabel("$l$ value", fontsize=14)

          plt.subplot(313)
          w = 1.0 / switchpoint_samples.shape[0] * np.ones_like(switchpoint_samples)
          plt.hist(switchpoint_samples, bins=range(0, n_mths), alpha=1,
                   label=r"posterior of $s$", color="green",
                   weights=w, rwidth=2.)
          plt.xlim([20, n_mths - 20])
          plt.xlabel(r"$s$ (in months)", fontsize=14)
          plt.ylabel("probability")
          plt.legend(loc="upper left")

          plt.show()
PyMC also has plotting functionality (it uses matplotlib). In the following plots, we display a time series plot, an autocorrelation plot (acorr), and a histogram of the samples drawn for the early mean, late mean, and the Switchpoint. The histogram is useful for visualizing the posterior distribution. The autocorrelation plot shows whether values in the previous period are strongly related to values in the current period.
In [100]: from pymc.Matplot import plot
          plot(fbM)
          Plotting late_mean
          Plotting switchpoint
          Plotting early_mean
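As an aside, if you would like a numerical check on what the acorr panels show, the sample autocorrelation of a trace can be computed directly with NumPy. This short sketch assumes the fbM object from the sampling step above:

import numpy as np

def autocorr(trace, lag):
    # Sample autocorrelation of a 1-D trace at the given positive lag
    x = np.asarray(trace, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

late_trace = fbM.trace('late_mean')[:]
print([round(autocorr(late_trace, lag), 3) for lag in (1, 5, 10)])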
The following is the late mean plot:
Here, we display the Switchpoint plot:
Here, we display the early mean plot:
From the output of PyMC, we can conclude that the Switchpoint is around 35-38 months from the start of the time series. This corresponds to sometime around March-July 2010. The author can testify that this was a banner year for him with respect to the use of FB since it was the year of the football (soccer) World Cup finals that were held in South Africa, which he attended.
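As a rough sanity check on this reading of the Switchpoint, we can translate the sampled month indices back into calendar dates, assuming month 0 corresponds to April 2007, the first month in the resampled series:

import numpy as np
import pandas as pd

start = pd.Timestamp('2007-04-30')                 # first month-end in fb_mth_count_
switchpoint_samples = fbM.trace('switchpoint')[:]
mean_month = int(round(np.mean(switchpoint_samples)))
print(start + pd.DateOffset(months=mean_month))    # lands in the first half of 2010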