Most of the examples used to demonstrate the capabilities of conditional random fields are related to text mining, intrusion detection, or bioinformatics. Although these applications have a great commercial merit, they are not suitable as an introductory test case because they usually require a lengthy description of the model and the training process.
For our example, we will select a simple problem: how to collect and aggregate an analyst's recommendation on any given stock from different sources with different formats.
Analysts at brokerage firms and investment funds routinely publish lists of recommendations or ratings for stocks. These analysts use different rating schemes, from buy/hold/sell and A/B/C grades to star ratings and market perform/neutral/market underperform. For this example, the rating is normalized as follows:
Here is an example of recommendations by stock analysts:
Macquarie upgraded AUY from Neutral to Outperform rating
Raymond James initiates Ainsworth Lumber as Outperform
BMO Capital Markets upgrades Bear Creek Mining to Outperform
Goldman Sachs adds IBM to its conviction list
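The normalization mentioned above can be sketched as a simple lookup from each scheme's vocabulary onto a common ordinal scale. Both the vocabulary and the -1/0/+1 scale below are illustrative assumptions, not the actual normalization table:

```scala
// Hypothetical normalization of heterogeneous rating schemes onto a
// single ordinal scale: -1 (negative), 0 (neutral), +1 (positive)
val normalizedRating: Map[String, Int] = Map(
  "sell" -> -1, "underperform" -> -1, "market underperform" -> -1, "c" -> -1,
  "hold" -> 0, "neutral" -> 0, "market perform" -> 0, "b" -> 0,
  "buy" -> 1, "outperform" -> 1, "strong buy" -> 1, "a" -> 1
)

// Returns None for a term that does not belong to any known scheme
def normalize(rating: String): Option[Int] =
  normalizedRating.get(rating.trim.toLowerCase)
```

For instance, normalize("Outperform") and normalize("Strong Buy") both map to the positive class under this hypothetical scale.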
The objective is to extract the name of the financial institution that publishes the recommendation or rating, the stock rated, the previous rating, if available, and the new rating. The output can be inserted into a database for further trend analysis, prediction, or simply the creation of reports.
Scope of the application
Ratings from analysts are updated every day through different protocols (feed, emails, blogs, web pages, and so on). The data has to be extracted from HTML, JSON, plain text, or XML format before being processed. In this exercise, we assume that the input has already been converted into plain text (ASCII) using a regular expression or another classifier.
The first step is to define the labels Y representing the categories or semantics of the rating. A segment or sequence is defined as a recommendation sentence. After reviewing the different recommendations, we are able to specify the following seven labels:
The training set is generated from the raw data by tagging the different components of the recommendation. The first (or initiate) rating for a stock does not have the fields 4 and 5 defined.
For example:
Citigroup // Y(0) = 1
upgraded // Y(1)
Macys // Y(2)
from // Y(3)
Buy // Y(4)
to // Y(5)
Strong Buy // Y(6) = 7
A training sequence can be visualized with the following undirected graph:
You may wonder why we need to tag the "From" and "To" labels in the creation of the training set. The reason is that these keywords may not always be stated and/or their positions in the recommendation differ from one source to another.
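Based on the tagged example shown earlier, the seven labels Y(0) through Y(6) can be sketched as an enumeration. The label names here are inferred from the positions in the example and are not necessarily the exact names used in the original table:

```scala
// Hypothetical names for the seven labels Y(0)..Y(6), inferred from the
// tagged example: source, action, stock, "from" keyword, previous rating,
// "to" keyword, and new rating
object RecommendationLabel extends Enumeration {
  val Source, Action, Stock, FromKeyword, PreviousRating, ToKeyword, NewRating = Value
}

// maxId equals the number of labels in the enumeration
val nLabels: Int = RecommendationLabel.maxId
```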
The implementation of the conditional random fields follows the design template for classifiers, as explained in the Design template for classifiers section in Appendix A, Basic Concepts.
Its key components are as follows:
- CrfModel: The model of the Model type, initialized through training during the instantiation of the classifier
- Crf: The classifier, which implements the PipeOperator trait. It has four parameters: the number of labels (or number of features), nLabels; the configuration of the CrfConfig type; the sequence of delimiters of the CrfSeqDelimiter type; and the labeled (or tagged) observations, taggedObs
- CrfRecommendation: The class required by the CRF library to implement the DataSequence interface. It is used to recommend (or estimate) the next label
- CrfSeqIter: The class that implements the DataIter iteration interface to traverse the labeled data sequence during training, as required by the CRF library

The key software components of the conditional random fields are described in the following UML class diagram:
The DataSequence and DataIter interfaces are grayed out to indicate that they are defined in the IITB CRF Java library.
The test case uses the IITB CRF Java implementation developed at the Indian Institute of Technology Bombay by Sunita Sarawagi. The JAR files can be downloaded from SourceForge (http://sourceforge.net/projects/crf/).
The library is available as JAR files and source code. Some of the functionality, such as the selection of a training algorithm, is not available through the API. The components (JAR files) of the library are as follows:
The training of the conditional random field for sequences requires defining a few key interfaces:
- DataSequence to specify the mechanism to access observations and labels for training and test data
- DataIter to iterate through the sequence of data created using the DataSequence interface
- FeatureGenerator to aggregate all the feature types

These interfaces have default implementations bundled in the CRF Java library [7:12].
The scope of the IITB CRF Java library evaluation
The CRF library has been evaluated with three simple text analytics test cases. Although the library is certainly robust enough to illustrate the internal workings of the CRF, I cannot vouch for its scalability or applicability in other fields of interest such as bioinformatics or process control.
The first step is to implement the structure of the training sequence, which implements the DataIter interface. The training file consists of a pair of files:
Let's define the model for the CRF classifier. As mentioned earlier, the model for the CRF is similar to the logistic regression model and consists of the weights parameter:
class CrfModel(val weights: DblVector) extends Model
The tagged recommendations file requires a delimiter class, CrfSeqDelimiter. It delineates the sequence of observations using the following parameters:

- obsDelim is a regular expression to break down data input into a sequence of observations
- labelsDelim generates a sequence of labels from the data input
- trainingDelim generates a sequence of training tuples from the training set

The CrfSeqDelimiter class is defined as follows:
class CrfSeqDelimiter(val obsDelim: String, val labelsDelim: String, val trainingDelim:String)
The main purpose of the IITB CRF Java library's DataIter interface is to define the methods to iterate through a sequence of data, tags, or observations. The three methods are as follows:

- hasNext tests if the sequence has another entry
- next returns the next data or entry in the sequence and increments the iterator cursor
- startScan initializes the DataIter iterator

The CrfSeqIter sequence iterator uses the iitb.segment.DataCruncher class to read a training set from a file (a file with tagged words):
class CrfSeqIter(val nLabels: Int, val input: String, val delim: SeqDelimiter) extends DataIter {
   lazy val trainData = DataCruncher.readTagged(nLabels, input, input,
       delim.obsDelim, delim.labelsDelim, delim.trainingDelim, new labelMap)

   override def hasNext: Boolean = trainData.hasNext
   override def next: DataSequence = trainData.next
   override def startScan: Unit = trainData.startScan
}
The trainData training set is initialized only once, when any of the overridden DataIter methods is invoked. The class is merely an adapter for the generation of the training set.
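The contract of the three DataIter methods can be illustrated with a small in-memory iterator, independent of the IITB library. The class name and structure are hypothetical:

```scala
// Minimal in-memory illustration of the DataIter contract:
// hasNext/next traverse the sequences, startScan resets the cursor
class InMemorySeqIter[T](sequences: Vector[T]) {
  private var cursor = 0

  def hasNext: Boolean = cursor < sequences.size
  def next(): T = { val t = sequences(cursor); cursor += 1; t }
  def startScan(): Unit = cursor = 0
}

val it = new InMemorySeqIter(Vector(
  "Citigroup upgraded Macys",
  "BMO Capital Markets upgrades Bear Creek Mining"
))
```

After the iterator is exhausted, startScan allows the training algorithm to traverse the same sequences again on the next pass.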
The second step consists of selecting the mechanism and class used to generate the feature observations. The extraction of features from any dataset requires an implementation of the FeatureGenerator interface in order to access all the feature observations, regardless of the type of feature.
Our problem is a simple linear tagging of data sequences (recommendations from analysts). Therefore, we can use the default iitb.Model.FeatureGenImpl implementation. Our tagging class, TaggingGenerator, subclasses FeatureGenImpl and specifies the model as a CompleteModel. The IITB CRF library supports both the linear chain model, CompleteModel, with a single edge iterator, and the nested chain CRF model of the NestedModel type with a nested edge iterator. The complete model does not make any assumption regarding the independence between the labels Y:
val addFeature = true
class TaggingGenerator(val nLabels: Int)
  extends FeatureGenImpl(new CompleteModel(nLabels), nLabels, addFeature)
The class is defined within the scope of the Crf class and does not have to be exposed to the client code. The last parameter of FeatureGenImpl, addFeature, is set to true to allow the dictionary of tags to be built iteratively during training.
The CrfTrainingSet class implements the DataSequence interface. It is used to access all the raw analyst recommendations and ratings regarding stocks. The class needs to implement the following methods:
- set_y to assign a label index to a position k
- y to retrieve the label at position k
- x to retrieve the observed feature vector at position k
- length to retrieve the number of entries in the sequence

The CrfTrainingSet class can be implemented as follows:
class CrfTrainingSet(val nLabels: Int, val entry: String, val delim: String) extends DataSequence {
   val words = entry.split(delim)
   val map = new Array[Int](nLabels)

   override def set_y(k: Int, label: Int): Unit = map(k) = label
   override def y(k: Int): Int = map(k)
   override def length: Int = words.size
   override def x(k: Int): Object = words(k)
}
The class takes an analyst's recommendation regarding a stock, entry, as input and breaks it down into words using the delimiter or regular expression, delim.
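Since the IITB DataSequence interface is not on the classpath here, this behavior can be sketched against a minimal stand-in trait; the stub and class names are hypothetical:

```scala
// Minimal stand-in for the IITB DataSequence interface (illustration only)
trait DataSequenceStub {
  def set_y(k: Int, label: Int): Unit
  def y(k: Int): Int
  def length: Int
  def x(k: Int): Object
}

// Sketch of CrfTrainingSet-like behavior: split a recommendation into
// words and store one label index per word
class TrainingSetSketch(entry: String, delim: String) extends DataSequenceStub {
  val words: Array[String] = entry.split(delim).map(_.trim)
  private val labels = new Array[Int](words.length)

  override def set_y(k: Int, label: Int): Unit = labels(k) = label
  override def y(k: Int): Int = labels(k)
  override def length: Int = words.length
  override def x(k: Int): Object = words(k)
}

val seq = new TrainingSetSketch(
  "Citigroup // upgraded // Macys // from // Buy // to // Strong Buy", "//")
seq.set_y(2, 3)  // tag the third word with label index 3
```

Note that the sketch sizes the label array by the number of words, whereas the original class allocates nLabels entries, which implicitly assumes the sequence is no longer than the number of labels.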
The execution of the CRF algorithm is controlled by a wide variety of configuration parameters. For the sake of simplicity, we use the default configuration parameters, CrfConfig, to control the execution of the learning algorithm, with the exception of the following four variables:

- w0, the initial value of the weights
- maxIters, the maximum number of iterations (default 50)
- lambda, the scaling factor for the L2 penalty function, used to reduce observations with a high value (default 1.0)
- eps, the convergence criterion used in computing the optimum values for the weights wj (default 1e-4)

Advanced configuration
The CRF model of the iitb library is highly configurable. It allows developers to specify a state-label undirected graph with any combination of flat and nested dependencies between states. The source code includes several training algorithms, such as the exponential gradient.
The test case does not assume any dependence between states:
class CrfConfig(w0: Double, maxIters: Int, lambda: Double, eps: Double) extends Config
The objective of the training is to compute the weights wj that maximize the conditional log-likelihood without the L2 penalty function.
Maximizing the log-likelihood function is equivalent to minimizing the loss with L2 penalty. The function is convex, and therefore, any variant gradient descent (greedy) algorithm can be applied iteratively.
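Because the penalized loss is convex, gradient descent converges to its unique minimum. A minimal sketch on a toy convex loss (not the actual CRF log-likelihood) illustrates the roles of the initial weight, learning rate, and convergence criterion; all names and values here are illustrative:

```scala
// Toy convex loss with an L2 penalty: f(w) = (w - 3)^2 + lambda * w^2.
// Its closed-form minimum is w* = 3 / (1 + lambda).
val lambda = 0.5
def grad(w: Double): Double = 2.0 * (w - 3.0) + 2.0 * lambda * w

var w = 0.7          // initial weight, analogous to w0
val eta = 0.1        // learning rate
val eps = 1e-6       // convergence criterion on the weight update
var iters = 0
var converged = false
while (!converged && iters < 1000) {
  val dw = -eta * grad(w)  // gradient descent step
  w += dw
  converged = math.abs(dw) < eps
  iters += 1
}
// w approaches 3 / (1 + 0.5) = 2.0
```

A larger lambda pulls the optimum closer to zero, which is the shrinkage effect of the L2 penalty described above.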
The Crf class implements the learning, train, and classification methods. Like any other classifier, Crf implements the PipeOperator trait so that the classification can be included in a workflow. The class also implements the Supervised trait to force the developer to define a validation routine for the CRF:
class Crf(nLabels: Int, config: CrfConfig, delims: SeqDelimiter, taggedObs: String)
  extends PipeOperator[String, Double] with Supervised[String] {

   val features = new TaggingGenerator(nLabels) //1
   lazy val crf = new CRF(nLabels, features, config.params) //2
   val model: Option[CrfModel] = {
     features.train(seqIter) //3
     Some(new CrfModel(crf.train(seqIter))) //4
   }
   …
The computation of the CRF weights during training uses either methods defined in IITB's CRF library or methods described in the previous sections.
Once the features have been extracted from the data sequence input file (line 1), the CRF algorithm is instantiated (line 2) with the number of labels, the extracted features, and the configuration. The model is trained using the seqIter iterator over the features (line 3) and returns a CrfModel instance (a vector of weights) (line 4) if the training succeeds, None otherwise.
The predictive method implements the data transformation operator, |>. It takes a new observation (an analyst's recommendation on a stock) and returns the maximum likelihood, as shown here:
def |> : PartialFunction[String, Double] = {
  case obs: String if(obs.length > 1 && model != None) => {
    val dataSeq = new CrfTrainingSet(nLabels, obs, delims.obsDelim)
    crf.apply(dataSeq)
  }
}
The data transformation implements the Viterbi algorithm to extract the best sequence of labels for a newly observed recommendation, obs. It invokes the apply method of the iitb.crf.CRF class. The code that validates the arguments/parameters of the class and methods is omitted, along with the exception handler, for the sake of readability.
The client code to execute the test consists of defining the number of labels (tags for recommendations), the L2 penalty factor, LAMBDA, and the delimiting strings:
val LAMBDA = 0.5; val EPS = 1e-3
val NLABELS = 9; val MAX_ITERS = 100; val W0 = 0.7
val PATH = "resources/data/chap7/rating"

val config = CrfConfig(W0, MAX_ITERS, LAMBDA, EPS)
val delimiters = CrfSeqDelimiter(", / -():.;'?#`&_", "//", " ")
Crf(NLABELS, config, delimiters, PATH).weights match {
  case Some(weights) => weights
  case None => { … }
}
For these tests, the initial value for the weights, the maximum number of iterations for the maximization of the log-likelihood, and the convergence criterion are set to 0.7, 100, and 1e-3, respectively. The delimiters for the labels sequence, the observed features sequence, and the training set are customized for the format of the input data files, rating.raw and rating.tagged.
The first training run discovered 136 features from 34 analyst stock recommendations. The algorithm converged after 21 iterations. The value of the log-likelihood for each of those iterations is plotted to illustrate the convergence toward the optimum weights w:
The training phase converges fairly quickly toward a solution. It can be explained by the fact that there is little variation in the six-field format of the analyst's recommendations. A loose or free-style format would have required a larger number of iterations during training to converge.
The second test evaluates the impact of the size of the training set on the convergence of the training algorithm. It consists of computing the difference Δw of the model parameters (weights) between two consecutive iterations, {wi}t+1 and {wi}t:
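This convergence measure Δw can be sketched as the largest absolute per-component change between two consecutive weight vectors, one of several reasonable choices of norm; the function name is hypothetical:

```scala
// Delta between consecutive weight vectors {wi}(t+1) and {wi}(t),
// measured as the largest absolute per-component change
def deltaW(current: Array[Double], previous: Array[Double]): Double =
  current.zip(previous).map { case (c, p) => math.abs(c - p) }.max
```

Training can be considered converged once deltaW falls below the eps threshold of the configuration.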
The test is run on 163 randomly chosen recommendations using the same model but with two different training sets:
The larger training set is a super set of the 34 recommendations set. The following graph illustrates the comparison of features generated with 34 and 55 CRF training sequences:
The disparity between the test runs using two different sizes of training sets is very small. This can easily be explained by the small variation in format across the analysts' recommendations.
The third test evaluates the impact of the L2 regularization penalty on the convergence toward the optimum weights/features. The test is similar to the first test, with different values of λ. The following charts plot log [p(Y|X, w)] for different values of λ = 1/σ2 (0.2, 0.5, and 0.8):
The negative log of the conditional probability decreases (that is, the conditional probability increases) with the number of iterations. The lower the L2 regularization factor, the higher the conditional probability.
The variation of the analysts' recommendations within the training set is fairly small, which limits the risk of overfitting. A free-style recommendation format would have been more sensitive to overfitting.