Conditional random fields

The conditional random field (CRF) is a discriminative machine learning algorithm introduced by John Lafferty, Andrew McCallum, and Fernando Pereira [7:9] at the turn of the century as an alternative to the HMM. The algorithm was originally developed to assign labels to sequences of observations.

Let's consider a concrete example to understand the conditional relation between the observations and the label data.

Introduction to CRF

Let's consider the problem of detecting a foul during a soccer game using a combination of video and audio. The objective is to assist the referee and analyze the behavior of the players to determine whether an action on the field is dangerous (red card), inappropriate (yellow card), in doubt (to be reviewed in a replay), or legitimate. The following image is an example of the segmentation of a video frame for image processing:


The analysis of the video consists of segmenting each video frame and extracting image features such as colors or edges [7:10]. A simple segmentation scheme consists of breaking down each video frame into tiles or groups of pixels indexed by their coordinates on the screen. A time sequence is then created for each tile Sij, as represented in the following image:

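As an illustration of this segmentation scheme, the following Python snippet breaks a frame into tiles indexed by their screen coordinates; the tile_frame helper and the tile size are hypothetical, not part of the original implementation:

```python
import numpy as np

def tile_frame(frame, tile_h, tile_w):
    """Split a frame (H x W array of pixel intensities) into tiles S_ij.

    Returns a dict mapping tile coordinates (i, j) to the corresponding
    block of pixels; a time sequence can then be built for each tile.
    """
    h, w = frame.shape
    tiles = {}
    for i in range(0, h, tile_h):
        for j in range(0, w, tile_w):
            tiles[(i // tile_h, j // tile_w)] = frame[i:i + tile_h, j:j + tile_w]
    return tiles

# A 4x4 frame split into 2x2 tiles yields four segments S_00 .. S_11
frame = np.arange(16).reshape(4, 4)
tiles = tile_frame(frame, 2, 2)
```

Feature extraction (colors, edges) would then be applied to each tile's pixel block independently.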

The image segment Sij is one of the labels associated with multiple observations. The same feature extraction process applies to the audio associated with the video. The relation between the video/image segment and the hidden state of the altercation between the soccer players is illustrated by the following model graph:


Undirected graph representation of CRF for soccer infraction detection

Conditional random fields (CRFs) are discriminative models that can be regarded as a structured output extension of logistic regression. CRFs address the problem of labeling a sequence of data, such as assigning a tag to each word in a sentence. The objective is to estimate the correlation among the output values (labels) Y conditional on the input values (observed features) X.

The correlation between the output and input values is described as a graph (also known as a graph-structured CRF). Cliques are a good example of such graph structures: a clique is a set of connected nodes in a graph in which each vertex has an edge connecting it to every other vertex in the clique.

Such models are complex, and their implementation is challenging. Most real-world problems related to time series or ordered sequences of data can be solved as a correlation between a linear sequence of observations and a linear sequence of input data, much like the HMM. Such a model is known as the linear chain structured graph CRF, or linear chain CRF for short.


One main advantage of the linear chain CRF is that the most likely sequence of labels, the one that maximizes the conditional probability p(Y|X, w), can be computed using dynamic programming techniques such as the Viterbi algorithm used in the HMM. From now on, this section focuses exclusively on the linear chain CRF to stay consistent with the HMM described in the previous section.
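As a sketch of this idea, the following Python snippet decodes the most likely label sequence of a linear chain CRF by dynamic programming. The function name and the score matrices are illustrative: the scores stand for the weighted sums of feature functions, so maximizing their sum maximizes p(Y|X, w).

```python
def viterbi_decode(scores_unary, scores_trans):
    """Return the most likely label sequence for a linear chain CRF.

    scores_unary[t][y]: weighted state feature score of label y at position t
    scores_trans[a][b]: weighted transition feature score from label a to b
    Since p(y|x, w) is proportional to exp(total score), maximizing the
    score sum maximizes the conditional probability.
    """
    n, k = len(scores_unary), len(scores_unary[0])
    delta = list(scores_unary[0])   # best score of any path ending in each label
    back = []                       # back pointers for path recovery
    for t in range(1, n):
        new_delta, ptr = [], []
        for y in range(k):
            best = max(range(k), key=lambda a: delta[a] + scores_trans[a][y])
            new_delta.append(delta[best] + scores_trans[best][y] + scores_unary[t][y])
            ptr.append(best)
        delta = new_delta
        back.append(ptr)
    # backtrack from the best final label
    y = max(range(k), key=lambda v: delta[v])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

# With dominant state scores, decoding follows them: [0, 1, 0]
path = viterbi_decode([[2, 0], [0, 2], [2, 0]], [[0, 0], [0, 0]])
```

The same dynamic programming recursion underlies the Viterbi decoder of the HMM; only the origin of the scores differs.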

Linear chain CRF

Let's consider a random variable X = {xi}0:n-1 representing n observations and a random variable Y representing a corresponding sequence of labels Y = {yj}0:n-1. The hidden Markov model estimates the joint probability p(X, Y), which, as with any generative model, requires the enumeration of all possible sequences of observations.

If each element yj of Y obeys the first-order Markov property, then (Y, X) is a CRF. The likelihood is defined as a conditional probability p(Y|X, w), where w is the vector of model parameters.

Note

Observation dependencies

The purpose of CRF models is to estimate the maximum likelihood of p(Y|X, w). Therefore, independence between observations X is not required.

A graphical model is a probabilistic model for which a graph denotes the conditional independence between random variables (vertices). The conditional and joint probabilities of random variables are represented as edges. The graph for generic conditional random fields can indeed be complex. The most common and simplest graph is the linear chain CRF.

A first order linear chain conditional random field can be visualized as an undirected graphical model, which illustrates the conditional probability of a label Yj given a set of observations X:


Linear chain conditional random field undirected graph

The Markov property simplifies the conditional probabilities of Y, given X, by considering only the neighboring labels: p(Y1|X, {Yj, j ≠ 1}) = p(Y1|X, Y0, Y2) and p(Yi|X, {Yj, j ≠ i}) = p(Yi|X, Yi-1, Yi+1).

The conditional random fields introduce a new set of entities and a new terminology:

  • Potential functions (fi): These are strictly positive, real-valued functions that represent a set of constraints on the configurations of random variables. They do not have any obvious probabilistic interpretation.
  • Identity potential functions: These are potential functions I(x, t) that take the value 1 if the condition on the feature x at time t is true, and 0 otherwise.
  • Transition feature functions: Simply known as feature functions, ti, these are potential functions that take a sequence of features {Xi}, the previous label Yt-1, the current label Yt, and an index i, and return a real value. In text analysis, a transition feature function would be defined by a sentence as a sequence of observed features, the previous word, the current word, and the position of the word in the sentence. Each transition feature function is assigned a weight, similar to the weights or parameters in logistic regression. Transition feature functions play a role similar to that of the state transition factors aij in the HMM, but without a direct probabilistic interpretation.
  • State feature functions: These are potential functions sj that take the sequence of features {Xi}, the current label Yi, and the index i. They play a role similar to that of the emission factors in the HMM.
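To make these definitions concrete, here is a small Python sketch for a hypothetical word-tagging task; the labels (START, NAME, NUM) and the specific conditions are invented for illustration:

```python
def identity(condition):
    """Identity potential: 1 if the condition holds, 0 otherwise."""
    return 1 if condition else 0

def t_cap_after_start(x, y_prev, y_curr, i):
    """Transition feature function: fires when a capitalized word at
    position i follows the START label and is tagged NAME."""
    return identity(y_prev == "START" and y_curr == "NAME" and x[i][0].isupper())

def s_digit(x, y_curr, i):
    """State feature function: fires when the token at position i is a
    digit tagged NUM (analogous to an HMM emission factor)."""
    return identity(x[i].isdigit() and y_curr == "NUM")

x = ["John", "scored", "2"]
fired = (t_cap_after_start(x, "START", "NAME", 0), s_digit(x, "NUM", 2))
```

Each such function would be paired with a learned weight; the weighted sum over all feature functions scores a candidate label sequence.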

A CRF defines the log probability of a particular label sequence Y, given a sequence of observations X as the normalized product of the transition feature and state feature functions. In other words, the likelihood of a particular sequence Y, given the observed features X, is a logistic regression.

The mathematical notation to compute the conditional probabilities in the case of a first order linear chain CRF is described in the following information box.

Note

CRF conditional distribution

  • The log probability of a label sequence y, given an observation sequence x:
    log p(y|x, w) = Σi Σj wj fj(yi-1, yi, x, i) - log Z(x)
  • Transition feature functions with I(a) = 1 if a is true, 0 otherwise; for example:
    fj(yi-1, yi, x, i) = I(yi-1 = a)I(yi = b)I(xi = c)
  • Using the notation:
    Fj(x, y) = Σi fj(yi-1, yi, x, i)
  • Conditional distribution of labels y, given x, using the Markov property:
    p(y|x, w) = exp(Σj wj Fj(x, y))/Z(x), with Z(x) = Σy' exp(Σj wj Fj(x, y'))

The weights wj are sometimes referred to as λj in scientific papers, which may confuse the reader. w is used to avoid any confusion with the λ regularization factor.
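To make the conditional distribution concrete, the following Python sketch computes p(y|x, w) for a tiny two-label problem by brute-force enumeration of the partition function Z(x). The score matrices are illustrative (they stand for the already-weighted feature sums), and practical implementations compute Z(x) with the forward algorithm rather than by enumeration:

```python
import itertools
import math

def sequence_score(y, w_unary, w_trans):
    """Sum of weighted state and transition feature scores for sequence y."""
    s = sum(w_unary[t][label] for t, label in enumerate(y))
    s += sum(w_trans[a][b] for a, b in zip(y, y[1:]))
    return s

def conditional_prob(y, w_unary, w_trans, n_labels):
    """p(y|x, w) = exp(score(y)) / Z(x), Z(x) summed over all label sequences."""
    n = len(y)
    z = sum(math.exp(sequence_score(seq, w_unary, w_trans))
            for seq in itertools.product(range(n_labels), repeat=n))
    return math.exp(sequence_score(y, w_unary, w_trans)) / z

# Two positions, two labels: scores already include the weights w_j
w_unary = [[1.0, 0.0], [0.0, 1.0]]
w_trans = [[0.0, 0.5], [0.5, 0.0]]
total = sum(conditional_prob(y, w_unary, w_trans, 2)
            for y in itertools.product(range(2), repeat=2))
# the normalized probabilities over all label sequences sum to 1
```

The enumeration over all label sequences grows exponentially with the sequence length, which is precisely why the linear chain structure and dynamic programming matter in practice.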

Now, let's get acquainted with the conditional random fields algorithm and its implementation by Sunita Sarawagi.
