CRF and text analytics

Most of the examples used to demonstrate the capabilities of conditional random fields are related to text mining, intrusion detection, or bioinformatics. Although these applications have great commercial merit, they are not suitable as introductory test cases because they usually require a lengthy description of the model and the training process.

The feature functions model

For our example, we will select a simple problem: how to collect and aggregate analysts' recommendations on any given stock from different sources and in different formats.

Analysts at brokerage firms and investment funds routinely publish lists of recommendations or ratings for stocks. These analysts use different rating schemes, from buy/hold/sell and A/B/C ratings to star ratings and market perform/neutral/market underperform. For this example, the ratings are normalized as follows:

  • 0 for a strong sell (or F, or a 1-star rating)
  • 1 for a sell (D, 2 stars, market underperform)
  • 2 for neutral (C, hold, 3 stars, market perform, and so on)
  • 3 for a buy (B, 4 stars, market overperform, and so on)
  • 4 for a strong buy (A, 5 stars, highly recommended, and so on)
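
A minimal sketch of this normalization, assuming a simple lookup table built from the schemes above (the entries and the normalize helper are illustrative only, not part of the CRF library):

// Illustrative lookup from raw rating labels (lowercase) to the normalized 0-4 scale
val normalizedRating: Map[String, Int] = Map(
  "strong sell" -> 0, "f" -> 0, "1 star" -> 0,
  "sell" -> 1, "d" -> 1, "2 stars" -> 1, "market underperform" -> 1,
  "neutral" -> 2, "c" -> 2, "hold" -> 2, "3 stars" -> 2, "market perform" -> 2,
  "buy" -> 3, "b" -> 3, "4 stars" -> 3, "market overperform" -> 3,
  "strong buy" -> 4, "a" -> 4, "5 stars" -> 4, "highly recommended" -> 4
)

def normalize(rating: String): Option[Int] = normalizedRating.get(rating.toLowerCase)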

Here is an example of recommendations by stock analysts:

Macquarie upgraded AUY from Neutral to Outperform rating

Raymond James initiates Ainsworth Lumber as Outperform

BMO Capital Markets upgrades Bear Creek Mining to Outperform

Goldman Sachs adds IBM to its conviction list

The objective is to extract the name of the financial institution that publishes the recommendation or rating, the stock rated, the previous rating, if available, and the new rating. The output can be inserted into a database for further trend analysis, prediction, or simply the creation of reports.

Note

Scope of the application

Ratings from analysts are updated every day through different protocols (feed, emails, blogs, web pages, and so on). The data has to be extracted from HTML, JSON, plain text, or XML format before being processed. In this exercise, we assume that the input has already been converted into plain text (ASCII) using a regular expression or another classifier.

The first step is to define the labels Y representing the categories or semantics of the rating. A segment or sequence is defined as a recommendation sentence. After reviewing the different recommendations, we are able to specify the following seven labels:

  • Source of the recommendation (Goldman Sachs and so on)
  • Action (upgrades, initiates, and so on)
  • Stock (either the company name or the stock ticker symbol)
  • From (optional keyword)
  • Rating (optional previous rating)
  • To
  • Rating (new rating for the stock)

The training set is generated from the raw data by tagging the different components of the recommendation. The first (or initiating) rating for a stock does not have fields 4 and 5 (the From keyword and the previous rating) defined.

For example:

Citigroup // Y(0) = 1
upgraded // Y(1)
Macys // Y(2)
from // Y(3)
Buy // Y(4)
to // Y(5)
Strong Buy // Y(6) = 7

Note

Tagging

Tagging a word may have a different meaning depending on the context. In natural language processing (NLP), tagging refers to the process of assigning an attribute (adjective, pronoun, verb, proper name, and so on) to a word in a sentence [7:11].

A training sequence can be visualized with the following undirected graph:

[Figure: An example of a recommendation as a CRF training sequence]

You may wonder why we need to tag the "From" and "To" labels in the creation of the training set. The reason is that these keywords are not always present, and their positions in the recommendation may differ from one source to another.

Software design

The implementation of the conditional random fields follows the design template for classifiers, as explained in the Design template for classifiers section in Appendix A, Basic Concepts.

Its key components are as follows:

  • A CrfModel model of the type Model is initialized through training during the instantiation of the classifier.
  • The predictive or classification routine is implemented as a data transformation that implements the PipeOperator trait.
  • The conditional random field classifier, Crf, has four parameters: the number of labels, nLabels; a configuration of the type CrfConfig; the sequence of delimiters of the type CrfSeqDelimiter; and the labeled (or tagged) observations, taggedObs.
  • The CrfTrainingSet class implements the DataSequence interface required by the CRF library. The class is used to access the observations and labels of a tagged sequence.
  • CrfSeqIter implements the DataIter iteration interface to traverse the labeled data sequence during training, as required by the CRF library.

The key software components of the conditional random fields are described in the following UML class diagram:

[Figure: UML class diagram for the conditional random fields]

The DataSequence and DataIter interfaces are grayed out to indicate that these are defined in the IITB's CRF Java library.

Implementation

The test case uses the IITB CRF Java implementation, developed at the Indian Institute of Technology Bombay by Sunita Sarawagi. The JAR files can be downloaded from SourceForge (http://sourceforge.net/projects/crf/).

The library is available as JAR files and source code. Some of the functionality, such as the selection of a training algorithm, is not available through the API. The components (JAR files) of the library are as follows:

  • CRF for the implementation of the CRF algorithm
  • LBFGS for limited-memory Broyden-Fletcher-Goldfarb-Shanno nonlinear optimization of convex functions (used in training)
  • CERN Colt library for manipulation of a matrix
  • GNU generic hash container for indexing

The training of the conditional random field for sequences requires defining a few key interfaces:

  • DataSequence to specify the mechanism to access observations and labels for training and test data
  • DataIter to iterate through the sequence of data created using the DataSequence interface
  • FeatureGenerator to aggregate all the feature types

These interfaces have default implementations bundled in the CRF Java library [7:12].

Tip

The scope of the IITB CRF Java library evaluation

The CRF library has been evaluated with three simple text analytics test cases. Although the library is certainly robust enough to illustrate the internal workings of the CRF, I cannot vouch for its scalability or applicability in other fields of interest such as bioinformatics or process control.

Building the training set

The first step is to implement the structure of the training sequence, which implements the DataIter interface. The training data consists of a pair of files:

  • Raw recommendations (such as Raymond James upgrades Gentiva Health Services from Underperform to Market perform)
  • Tagged recommendations (such as Raymond James [1] upgrades [2] Gentiva Health Services [3], from [4] Underperform [5] to [6] Market perform [7])

Let's define the model for the CRF classifier. As mentioned earlier, the model for the CRF is similar to the logistic regression model and consists of the weights parameter:

class CrfModel(val weights: DblVector) extends Model

The tagged recommendations file requires a delimiter class, CrfSeqDelimiter. It delineates the sequence of observations using the following parameters:

  • obsDelim is a regular expression to break down data input into a sequence of observations
  • labelsDelim generates a sequence of labels from the data input
  • trainingDelim generates a sequence of training tuples from the training set

The CrfSeqDelimiter class is defined as follows:

class CrfSeqDelimiter(val obsDelim: String, val labelsDelim: String, val trainingDelim:String)
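
For instance, a delimiter set matching the tagged file format used later in the Tests section (observations split on punctuation and whitespace, "//" between a word and its tag, and a newline between recommendations) could be instantiated as follows:

// Delimiters matching the rating.tagged file format (see the Tests section)
val delims = new CrfSeqDelimiter(",\t/ -():.;'?#`&_", "//", "\n")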

The main purpose of the IITB CRF Java library's DataIter interface is to define the methods to iterate through a sequence of data, tags, or observations. The three methods are as follows:

  • hasNext tests if the sequence has another entry
  • next returns the next data or entry in the sequence and increments the iterator cursor
  • startScan initializes the DataIter iterator

The CrfSeqIter sequence iterator uses the iitb.segment.DataCruncher class to read a training set from a file (a file with tagged words):

class CrfSeqIter(val nLabels: Int, val input: String, val delim: SeqDelimiter) extends DataIter {
   lazy val trainData = DataCruncher.readTagged(nLabels, input, input, delim.obsDelim, delim.labelsDelim, delim.trainingDelim, new labelMap)
   
   override def hasNext: Boolean = trainData.hasNext
   override def next: DataSequence = trainData.next
   override def startScan: Unit = trainData.startScan
}

The trainData training set is initialized only once when any of the DataIter overridden methods is invoked. The class is merely an adapter to the generation of the training set.
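
As a hedged illustration of how the adapter might be traversed (reusing the delims value sketched earlier and assuming CrfSeqDelimiter conforms to the library's SeqDelimiter type; the file path mirrors the one used in the Tests section):

// Hypothetical traversal of the tagged training set through CrfSeqIter
val seqIter = new CrfSeqIter(nLabels = 9, input = "resources/data/chap7/rating", delim = delims)
seqIter.startScan
while (seqIter.hasNext) {
  val dataSeq: DataSequence = seqIter.next // one tagged recommendation
  // process the sequence...
}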

Generating tags

The second step consists of selecting the mechanism and class to generate the feature observations. The extraction of features from any dataset requires an implementation of the FeatureGenerator interface, in order to access all the feature observations regardless of the feature type.

Our problem is a simple linear tagging of data sequences (recommendations from analysts). Therefore, we can use the iitb.Model.FeatureGenImpl default implementation. Our tagging class, TaggingGenerator, subclasses FeatureGenImpl and specifies the model as a CompleteModel. The IITB CRF library supports both the linear chain model, CompleteModel, with a single edge iterator, and the nested chain CRF model of the type NestedModel, with a nested edge iterator. The complete model does not make any assumption regarding the independence between labels Y:

val addFeature = true
class TaggingGenerator(val nLabels: Int)
  extends FeatureGenImpl(new CompleteModel(nLabels), nLabels, addFeature)

The class is defined within the scope of the Crf class and does not have to be exposed to the client code. The last parameter of FeatureGenImpl, addFeature, is set to true to allow the dictionary of tags to be built iteratively during training.

Extracting data sequences

The CrfTrainingSet class implements the DataSequence interface. It is used to access all the raw analysts' recommendations and ratings regarding stocks. The class needs to implement the following methods:

  • set_y to assign a label index to a position k
  • y to retrieve a label y at position k
  • x to retrieve an observed feature vector at position k
  • length to retrieve the number of entries in the sequence

The CrfTrainingSet class can be implemented as follows:

class CrfTrainingSet(val nLabels: Int, val entry: String, val delim: String) extends DataSequence {
    val words = entry.split(delim)
    val map = new Array[Int](words.size) // one label slot per token position
   
    override def set_y(k: Int, label: Int): Unit = map(k) = label
    override def y(k: Int): Int = map(k)
    override def length: Int = words.size
    override def x(k: Int): Object = words(k)
}

The class takes an analyst's recommendation regarding a stock, entry, as an input and breaks it down into words, using the delimiter or regular expression, delim.
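
For illustration, a raw recommendation can be wrapped into a data sequence as follows (the single-space delimiter is an assumption for this snippet):

// Hypothetical example: wrapping a raw recommendation into a DataSequence
val rec = "Macquarie upgraded AUY from Neutral to Outperform"
val dataSeq = new CrfTrainingSet(nLabels = 9, entry = rec, delim = " ")
(0 until dataSeq.length).foreach(k => println(dataSeq.x(k))) // one token per line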

CRF control parameters

The execution of the CRF algorithm is controlled by a wide variety of configuration parameters. For the sake of simplicity, we use the default configuration parameters, CrfConfig, to control the execution of the learning algorithm, with the exception of the following four variables:

  • Initialization of the weights, w0, using either a predefined value or a random value between 0 and 1 (default 0)
  • Maximum number of iterations used in the computation of the weights during the learning phase, maxIters (default 50)
  • The scaling factor lambda for the L2 penalty function, used to penalize weights with large values (default 1.0)
  • Convergence criteria, eps, used in computing the optimum values for the weights wj (default 1e-4)

Note

Advanced configuration

The CRF model of the iitb library is highly configurable. It allows developers to specify a state-label undirected graph with any combination of flat and nested dependencies between states. The source code includes several training algorithms such as the exponential gradient.

The test case does not assume any dependence between states:

class CrfConfig(w0: Double, maxIters: Int, lambda: Double, eps: Double) extends Config

Putting it all together

The objective of the training is to compute the weights wj that maximize the conditional log-likelihood, to which the L2 penalty function is added for regularization.

Note

Conditional log-likelihood

The conditional log-likelihood for a linear chain CRF training set $D = \{(x_i, y_i)\}$ is given as follows:

$$\mathcal{L}(w) = \sum_i \log p(y_i \mid x_i, w)$$

Learning, that is, the maximization of the log-likelihood with the L2 penalty, is given as follows:

$$w^{*} = \arg\max_w \; \sum_i \log p(y_i \mid x_i, w) \;-\; \frac{\lambda}{2} \sum_j w_j^2$$

Maximizing the log-likelihood function is equivalent to minimizing the loss with the L2 penalty. The function is convex, and therefore, any variant of the gradient descent (greedy) algorithm can be applied iteratively.
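
As a minimal illustration of the optimization (the IITB library actually relies on L-BFGS rather than this naive update), a single gradient-ascent step on the penalized log-likelihood can be sketched as follows; ascentStep, grad, and eta are assumptions standing for one update, the gradient of the log-likelihood (computed by the forward-backward algorithm), and a learning rate:

// Sketch only: one gradient-ascent step on the L2-penalized log-likelihood
// d/dw_j [ L(w) - (lambda/2)*sum(w_j^2) ] = grad_j - lambda*w_j
def ascentStep(w: Array[Double], grad: Array[Double], lambda: Double, eta: Double): Array[Double] =
  w.zip(grad).map { case (wj, gj) => wj + eta * (gj - lambda * wj) }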

The Crf class implements the learning, train, and classification methods. Like any other classifier, Crf implements the PipeOperator trait, so the classification can be included in a workflow. The class also implements the Supervised trait to force the developer to define a validation routine for the CRF:

class Crf(nLabels: Int, config: CrfConfig, delims: SeqDelimiter, taggedObs: String) extends PipeOperator[String, Double] with Supervised[String] {
  val features = new TaggingGenerator(nLabels) //1
  lazy val crf = new CRF(nLabels, features, config.params) //2
  val model: Option[CrfModel] = {
    features.train(seqIter) //3
    Some(new CrfModel(crf.train(seqIter))) //4
  }
  …
}

The computation of the CRF weights during training uses either methods defined in IITB's CRF library or methods described in the previous sections.

Once the features have been extracted from the data sequence input file (line 1), the CRF algorithm is instantiated (line 2) with the number of labels, extracted features, and the configuration. The model is trained using the iterator for features seqIter (line 3), and then returns a CrfModel instance (vector of weights) (line 4) if training succeeds, None otherwise.

The predictive method implements the data transformation operator, |>. It takes a new observation (analyst's recommendation on a stock) and returns the maximum likelihood, as shown here:

def |> : PartialFunction[String, Double] = {
  case obs: String if(obs.length > 1 && model != None) => {
    val dataSeq = new CrfTrainingSet(nLabels,obs,delims.obsDelim)
    crf.apply(dataSeq)
  }
}

The data transformation implements the Viterbi algorithm to extract the best sequence of labels for a newly observed recommendation, obs. It invokes the apply method of the iitb.crf.CRF class. The code that validates the arguments/parameters of the class and methods is omitted, along with the exception handler, for the sake of readability.
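
A hedged usage sketch, reusing the constants defined in the Tests section that follows (the companion apply method for Crf is assumed, as in the test client code):

// Hypothetical invocation of the classifier on a new recommendation
val crfClassifier = Crf(NLABELS, config, delimiters, PATH)
val likelihood = crfClassifier |> "Macquarie upgraded AUY from Neutral to Outperform"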

Tests

The client code to execute the test consists of defining the number of labels (tags for recommendation), the L2 penalty factor, LAMBDA, and the delimiting string:

val LAMBDA = 0.5; val EPS = 1e-3
val NLABELS = 9; val MAX_ITERS = 100; val W0 = 0.7
val PATH = "resources/data/chap7/rating"

val config = CrfConfig(W0, MAX_ITERS, LAMBDA, EPS)
val delimiters = CrfSeqDelimiter(",\t/ -():.;'?#`&_", "//", "\n")

Crf(NLABELS, config, delimiters, PATH).weights match {
  case Some(weights) => weights
  case None => { … }
}

For these tests, the initial value of the weights is set to 0.7, the maximum number of iterations for the maximization of the log-likelihood to 100, and the convergence criterion to 1e-3. The delimiters for the labels sequence, the observed features sequence, and the training set are customized for the format of the input data files, rating.raw and rating.tagged.

The training convergence profile

The first training run discovered 136 features from 34 analysts' stock recommendations. The algorithm converged after 21 iterations. The value of the log-likelihood for each of these iterations is plotted to illustrate the convergence toward a solution of optimum w:

[Figure: Visualization of the log conditional probability of CRF during training]

The training phase converges fairly quickly toward a solution. This can be explained by the fact that there is little variation in the six-field format of the analysts' recommendations. A loose or free-style format would have required a larger number of iterations during training to converge.

Impact of the size of the training set

The second test evaluates the impact of the size of the training set on the convergence of the training algorithm. It consists of computing the difference $\Delta w$ of the model parameters (weights) between two consecutive iterations, $\{w_i\}_{t+1}$ and $\{w_i\}_t$:

$$\Delta w = \left\| \{w_i\}_{t+1} - \{w_i\}_t \right\|$$
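
A small sketch of this metric, assuming the Euclidean norm:

// Assumed convergence metric: Euclidean norm of the difference between
// the weights of two consecutive iterations
def deltaW(wNext: Array[Double], w: Array[Double]): Double =
  math.sqrt(wNext.zip(w).map { case (a, b) => (a - b) * (a - b) }.sum)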

The test is run on 163 randomly chosen recommendations using the same model but with two different training sets:

  • 34 analyst stock recommendations
  • 55 stock recommendations

The larger training set is a superset of the 34-recommendation set. The following graph compares the features generated with 34 and 55 CRF training sequences:

[Figure: Comparison of features generated with 34 and 55 CRF training sequences]

The disparity between the test runs using the two training sets of different sizes is very small. This can be explained by the fact that there is little variation in format between the analysts' recommendations.

Impact of the L2 regularization factor

The third test evaluates the impact of the L2 regularization penalty on the convergence toward the optimum weights/features. The test is similar to the first test, but with different values of λ. The following charts plot log [p(Y|X, w)] for different values of λ = 1/σ² (0.2, 0.5, and 0.8):

[Figure: Impact of the L2 penalty on convergence of the CRF training algorithm]

The log of the conditional probability, and therefore the conditional probability itself, increases with the number of iterations. The lower the L2 regularization factor, the higher the conditional probability.

The variation of the analysts' recommendations within the training set is fairly small, which limits the risk of overfitting. A free-style recommendation format would have been more sensitive to overfitting.
