Naïve Bayes and text mining

The multinomial Naïve Bayes classifier is particularly suited for text mining. Naïve Bayes is used to classify the following entities:

  • E-mails as legitimate versus spam
  • Business news stories
  • Movie reviews and scoring
  • Technical papers by field of expertise

This use case consists of predicting the direction of a stock, Tesla Motors Inc. (ticker symbol: TSLA), given the financial news. The features are the frequencies of occurrence of specific terms related to the stock. It is unclear how fast investors or traders react to the news and what influence, if any, a news release has on the value of a stock. Therefore, the delayed response time, as depicted in the following chart, should be a feature of the proposed model:

[Figure: the delayed market response to a news release]

The market response delay feature would play a role in the training only if the variance of the observations is significant. The distribution of the frequencies of the delay in the market response to any newsworthy article regarding TSLA shows that the stock price reacts within the same day in 82 percent of the cases, as seen here:

[Figure: distribution of the delay in the market response to news regarding TSLA]

The frequency peak for a market response delay of 1.75 days can be explained by the fact that some news is released over the weekend, and investors have to wait until the following Monday for it to impact the stock price. The second challenge is to attribute any shift in the stock price to a specific news release, taking into account that some news items can be redundant or simultaneous.

Therefore, the model features for predicting the stock price, $pr_{t+1}$, are the relative frequencies, $f_i$, of occurrence of terms $T_i$ within a time window $[t-n, t]$, where $t$ and $n$ are trading days.

The following graphical model formally describes the causal relation or conditional dependency of the direction of the stock price between two consecutive trading sessions t and t+1, given the relative frequency of appearance of some terms in the media:

[Figure: The Bayesian model for the prediction of stock movement given financial news]
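In terms of the multinomial Naïve Bayes formula, this graphical model translates into the generic Naïve Bayes decision rule (not a formula specific to this chapter): the probability that the stock moves up at session t+1, given the observed relative frequencies, is proportional to the prior times the product of the class-conditional likelihoods:

$$ p(\text{UP}_{t+1} \mid f_1, \dots, f_n) \propto p(\text{UP}_{t+1}) \prod_{i=1}^{n} p(f_i \mid \text{UP}_{t+1}) $$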

For this exercise, the observation set is the corpus of news feeds and articles released by the most prominent financial news organizations, such as Bloomberg or CNBC. The first step is to devise a methodology to extract and select the most relevant terms associated with a specific stock.

Basics of information retrieval

A full discussion of information retrieval and text mining is beyond the scope of this book [5:11]. For the sake of simplicity, the classifier relies on a very simple scheme for extracting relevant terms and computing their relative frequencies. The following 10-step sequence of actions describes one of numerous methodologies to extract the most relevant terms from a corpus (a minimal sketch of some of these steps follows the list):

  1. Create or extract the timestamp for each news article.
  2. Extract the title, paragraphs, and sentences of each article using a Markovian classifier.
  3. Extract the terms from each sentence using regular expressions.
  4. Correct terms for typos using a dictionary and a metric such as the Levenshtein distance.
  5. Remove the stop words.
  6. Perform stemming and lemmatization.
  7. Extract bags of words and generate a list of n-grams (as sequences of n terms).
  8. Apply a tagging model built using a maximum entropy or conditional random field classifier to extract nouns and adjectives (such as NN, NNP, and so on).
  9. Match the terms against a dictionary that supports word senses, hyponyms, and synonyms, such as WordNet.
  10. Disambiguate word senses using DBpedia [5:12].
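As an illustration, here is a minimal Scala sketch of steps 3, 5, and 6 only; the STOP_WORDS set and the stem function are hypothetical simplifications, not the components used later in this chapter:

// Sketch of steps 3, 5, and 6: term extraction, stop word removal, and a
// naive suffix-stripping stemmer. STOP_WORDS and stem are illustrative
// placeholders; production systems rely on curated lists and real stemmers.
val STOP_WORDS = Set("the", "a", "an", "and", "of", "to", "in", "is")

def extractTerms(sentence: String): Array[String] =        // step 3
  sentence.toLowerCase.split("[^a-z]+").filter(_.nonEmpty)

def removeStopWords(terms: Array[String]): Array[String] = // step 5
  terms.filterNot(STOP_WORDS.contains)

def stem(term: String): String =                           // step 6 (naive)
  if (term.endsWith("ing") && term.length > 5) term.dropRight(3)
  else if (term.endsWith("s") && term.length > 3) term.dropRight(1)
  else term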

Note

Text extraction from the web

The methodology discussed in this section does not include the process of searching and extracting news and articles from the Web, which requires additional steps such as searching, crawling, and scraping [5:13].

Implementation

Let's apply the text mining methodology template to predict the direction of a stock, given the financial news. The algorithm relies on a sequence of 8 simple steps:

  1. Extracting all news with a reference to a specific stock or company in the news feed.
  2. Extracting the timestamp or date of the article using a regular expression.
  3. Grouping all the news articles related to the stock for a specific date t into a document Dt.
  4. Ordering the documents Dt as per their timestamp.
  5. Extracting the terms {Tt,i} from each sentence of the document Dt and ranking them by their relative frequency.
  6. Aggregating the terms {Tt,i} for all the documents sharing the same release date t.
  7. Computing the relative frequency, rtf, of each term {Tt,i} for the date t, as the ratio of the number of its occurrences in all the articles released at t to the total number of its occurrences in the entire corpus.
  8. Normalizing the relative frequency by the average number of articles per date, nrtf.

Note

The relative term frequency for a term $T_i$ with $n_{i,a}$ occurrences in an article $a$ released on the date $t$ (the articles forming the document $D_t$) is given as:

$$ rtf_t(T_i) = \frac{\sum_{a \in D_t} n_{i,a}}{\sum_{a \in \text{corpus}} n_{i,a}} $$

The relative term frequency normalized by the average number of articles per day, $N_{a/D}$, is given as:

$$ nrtf_t(T_i) = rtf_t(T_i)\,\frac{N_{a/D}}{|D_t|} $$

where $|D_t|$ is the number of articles released at the date $t$.
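As a sanity check, the two formulas can be translated into a few lines of Scala; the function names and arguments below are assumptions made for this sketch only and are not part of the chapter's library:

// Illustrative computation of rtf and nrtf from raw counts.
// countsAt: number of occurrences of each term at the date t
// totalCounts: number of occurrences of each term in the entire corpus
def rtf(countsAt: Map[String, Int], totalCounts: Map[String, Int]): Map[String, Double] =
  countsAt.map { case (term, n) => term -> n.toDouble/totalCounts(term) }

// Normalization by the number of articles released at the date t, relative
// to the average number of articles per date (one interpretation of nrtf).
def nrtf(rtfAt: Map[String, Double], nArticles: Int, avgArticlesPerDate: Double): Map[String, Double] =
  rtfAt.map { case (term, f) => term -> f*avgArticlesPerDate/nArticles }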

Extraction of terms

First, let's define the features set for the financial terms as the NewsArticles class parameterized for the date type T. For the sake of simplicity, the date type is explicitly view bounded to Long. The NewsArticles class is a container for the news articles and press releases relevant to a specific stock. At its core, a news article is defined by its release or publication date, and the list of tuples of terms and their relative frequencies. The NewsArticles class is defined as follows:

@implicitNotFound("NewsArticles. Ordering not explicitly defined")
class NewsArticles[T <% Long](implicit val order: Ordering[T]) {
   val articles = new HashMap[T, Map[String, Double]]
   …
}

Tip

The @implicitNotFound annotation

I recommend using the implicitNotFound annotation for every implicit class and method parameter: a declaration that is obvious to one software developer may not be obvious to another.

The NewsArticles class uses the mutable HashMap data structure to manage the set of articles. An article is defined by:

  • Its release date (type T)
  • Its map of tuples {term contained in the article, relative frequency (or weight) of the term}, wTerms

The weight of a term is computed as the ratio of the number of occurrences of this term in the article, to the total number of occurrences in the entire corpus of articles related to the stock.

The implicit Ordering class parameter is required for sorting.

The map articles is populated with the overloaded operator +=:

def += (date: T, wTerms: Map[String, Double]): Unit = { //1
  def merge(m1: Map[String, Double], m2: Map[String, Double]): Map[String, Double] = { //2
    (m1.keySet ++ m2.keySet).foldLeft(new HashMap[String, Double])((m, x) => {
       var wt = 0.0
       if(m1.contains(x)) wt += m1(x)
       if(m2.contains(x)) wt += m2(x)
       m.put(x, wt)
       m 
    }).toMap
  }
  articles.put(date, if( articles.contains(date)) merge(articles(date), wTerms) else wTerms) //3
}

The += method adds a new set (mutable hash map) of pairs (terms, relative frequency), wTerms, released at a specific date, to the existing map of news articles (line 1). The terms related to different articles from the same date are merged using the local merge function (line 2). Finally, the merged map of (term, frequency) pairs is associated with its release date of the type T (line 3).

The second method, toOrderedArray, orders the articles by their release date:

def toOrderedArray: Array[(T, Map[String, Double])] = articles.toArray.sortWith( _._1 < _._1)
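A short hypothetical usage sequence illustrates both methods; the dates are encoded as Long values (for instance, 527 for May 27, the encoding produced by the toDate method introduced in the Testing section):

// Two releases on date 528; their term weights are merged by +=
val newsArticles = new NewsArticles[Long]
newsArticles += (528L, Map("China" -> 0.20, "Charger" -> 0.12))
newsArticles += (527L, Map("Tesla" -> 0.45))
newsArticles += (528L, Map("China" -> 0.10, "Risk" -> 0.08))
// toOrderedArray sorts by date: 527 first, then 528 with the merged map
// Map("China" -> 0.30, "Charger" -> 0.12, "Risk" -> 0.08)
val ordered = newsArticles.toOrderedArray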

Scoring of terms

The scoring of the terms is performed by the TermsScore class, parameterized by the date type, through its score method:

class TermsScore[T <% Long](toDate: String =>T, toWords: String => Array[String], lexicon: Map[String, String])(implicit val order: Ordering[T]) {
   def score(corpus: Corpus): Option[NewsArticles[T]]
}

The TermsScore class, parameterized for the type of the release date, has three parameters:

  • A toDate function to extract the date from each news article. The function can be implemented as a regular expression or a group of regular expressions.
  • A toWords function to extract the nonstop terms from the content of the article. The function can be quite elaborate, as described in the previous section. It may require creating classifiers to extract sentences, n-grams, and tags.
  • A lexicon function that simulates the lemmatization and stemming of the most common terms. The lexicon function is implemented as a map that attaches a semantic equivalent to each term as a poor man's lemmatization. For example, "China", "Chinese", and "Shanghai" are semantically associated with the term "China".

The type for date T is view bounded by the Long type because it is assumed that any date can be potentially converted into time in milliseconds. The Ordering[T] class is provided as an implicit attribute to order the news articles as per their release date.

The relative frequency of a term t is computed, somewhat arbitrarily, as the ratio of the number of occurrences of t at a specific date to the total number of its occurrences in the entire corpus.

Let's look at the scoring method:

type Corpus = Array[(String, String, String)] //1
def score(corpus: Corpus): Option[NewsArticles[T]] = Try {
  val docs = rank(corpus) //2

  val cnts = docs.map(doc => (doc._1, count(doc._3))) //3
  val totals = cnts
                  .map(_._2)  //4
                  .foldLeft(new Counter[String])((s, cnt) => s ++ cnt)
  val articles = new NewsArticles[T]
  cnts.foreach(cnt => articles += (cnt._1, (cnt._2/totals).toMap)) //5
  articles
} match { //7
  case Success(articles) => Some(articles)
  …

The score method processes the training set or corpus of the news articles related to a stock and returns the scored articles as a NewsArticles instance, wrapped in an Option.

The Corpus type (line 1) defines the training set as an array of news articles, each represented by three essential components: a timestamp, a title, and a body or content. The rank method (line 2) extracts the release date from each news article and orders the articles by increasing date.

The frequency of terms is computed for each document or group of news articles associated with a date (line 3) using the count method. The count method matches each term extracted from the news article against the entries of the lexicon map; the counters, of the Counter[String] type (a map of terms to their number of occurrences), collect the number of occurrences of each term. The next instruction (line 4) aggregates the counts for the entire corpus, which are then used to compute the relative frequencies (line 5).

The rank method uses the Scala methods map and sortWith to order the articles by date (line 6):

type CorpusType[T] = Array[(T, String, String)]
def rank(corpus: Corpus): CorpusType[T] =
   corpus.map(doc => (toDate(doc._1.trim), doc._2, doc._3))
         .sortWith( _._1 < _._1)  //6

The scoring method is protected by a Scala exception handler (line 7). Finally, the count method matches a term with an entry in the lexicon and updates the count if a match is found (line 8):

def count(text: String): Counter[String] = 
  toWords(text).foldLeft(new Counter[String])((cnt, w) => 
    if( lexicon.contains(w)) cnt + lexicon(w)  //8
    else cnt  
  )
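For example, with a lexicon that maps "tesla" to "Tesla" and both "china" and "chinese" to "China" (a hypothetical subset of the LEXICON defined in the Testing section), count("Chinese demand lifts Tesla") returns a counter with one occurrence each of "China" and "Tesla"; the words "demand" and "lifts" are ignored because they do not appear in the lexicon.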

Testing

For testing purposes, let's select the news articles mentioning Tesla Motors and its ticker symbol TSLA over a period of two months.

Retrieving textual information

First, you need to define the three parameters of the scoring TermsScore class: toDate, toWords, and lexicon.

The private toDate method converts a string into a date defined as a Long data type:

def toDate(date: String): Long = {
  val idx1 = date.indexOf(".")
  val idx2 = date.lastIndexOf(".")
  if(idx1 != -1 && idx2 != -1) 
    (date.substring(0, idx1) + date.substring(idx1+1, idx2)).toLong
  else -1L
}
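Assuming, for illustration, that the news files are named after their release date in a month.day format followed by a file extension, the method behaves as follows:

// Hypothetical examples, assuming a "month.day.extension" naming scheme
toDate("05.28.txt")  // returns 528L ("05" + "28", extension dropped)
toDate("invalid")    // returns -1L (no '.' delimiter found)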

The toWords method uses a simple regular expression, regExpr, to replace any punctuation or whitespace character with the &@ character sequence (line 1), which is used as a word delimiter (line 2). All words shorter than three characters are discarded (line 3):

def toWords(txt: String): Array[String] = {
  val regExpr = "['|,|.|?|!|:|\"|\\s]"
  txt.trim.toLowerCase
          .replaceAll(regExpr,"&@") //1
          .split("&@")  //2
          .filter(_.length > 2) //3
}
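For example:

toWords("Tesla's sales in China rise!")
// returns Array("tesla", "sales", "china", "rise"); "s" and "in" are
// dropped by the three-character filter (line 3)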

Finally, the lexicon contains the terms that need to be monitored. During this particular period of time, the news media were looking for any announcement regarding Tesla Motors' foray into the Chinese market, issues with the batteries, and any plan to deploy electric vehicle charging stations. The set of terms regarding these issues is limited, and therefore, the lexicon can be built manually:

val LEXICON = Map[String, String](
  "tesla"->"Tesla","tsla"->"TSLA","china"->"China","chinese"-> "China", ....) 

Note

The semantic analysis

This example uses a very primitive semantic map (lexicon) for the sake of illustrating the benefits and inner workings of the multinomial Naïve Bayes algorithm. Commercial applications involving sentiment analysis or topic analysis require a deeper understanding of semantic associations and extraction of topics using advanced generative models, such as the latent Dirichlet allocation.

The client code to train and validate the model executes the entire workflow, from extracting and scoring the news articles and press releases to generating the normalized labeled data and computing the F1 measure.

The output (or labeled data) TSLA_QUOTES consists of the stock price for Tesla Motors:

val TSLA_QUOTES = Array[Double](250.56, 254.84, … )

The first step is to load and clean all the articles (corpus) defined in the pathname directory (line 1). This task is performed by the DocumentsSource class (described in the Extraction of documents section under Scala programming in Appendix A, Basic Concepts):

val corpus: Corpus = DocumentsSource(pathName) |>  //1
val ts = new TermsScore[Long](toDate, toWords, LEXICON)
ts.score(corpus) match { //2
  case Some(terms) => {
    var prevQ = 0.0
    val diff = TSLA_QUOTES.map( q => {
       val delta = if(q > prevQ) 1 else 0
       prevQ = q; delta
    })
    val columns = LEXICON.values
                    .foldLeft(new HashSet[String])((hs, key) => {hs.add(key); hs})
                    .toArray
    val fqLabels = terms.toOrderedArray  //3
                        .zip(diff)  //4
                        .map(x => (x._1._2, x._2))
                        .map(lbl => (columns.map(f =>  //5
                            if( lbl._1.contains(f)) lbl._1(f) else 0.0), lbl._2))
    val xt = XTSeries[(Array[Double], Int)](fqLabels)
    val nb = NaiveBayes[Double](xt)  //6
    …

Next, the TermsScore.score method extracts and scores the most relevant terms from the corpus, using the normalized relative frequency defined in steps 7 and 8 of the information retrieval process (line 2). The terms are then ordered by date (line 3) and zipped with the labels (the direction of the next trading day's stock price) (line 4). The lexicon is used to generate the final labeled observations (features = relative frequencies of the terms, label = direction of the stock price) (line 5). Finally, the model is built by invoking the NaiveBayes.apply constructor (line 6), which consists of running the algorithm through the training set.

Evaluation

The following chart describes the frequency of occurrences of some of the terms related to either Tesla Motors or its stock ticker TSLA:

[Figure: Plot of the relative frequency of a partial list of stock-related terms]

The next chart plots the labeled data, which is the direction of the stock price for the day following the press release(s) or news article(s):

[Figure: Plot of the stock price and movement for Tesla Motors stock]

This chart displays the historical price of the TSLA stock with the direction (UP or DOWN). The classification of the 15 percent of the labeled data selected for validation has an F1 measure of 0.71. You need to keep in mind that no preprocessing or clustering was performed to isolate the most relevant features/keywords. The keywords were selected according to the frequency of their occurrence in the financial news.

It is fair to assume that some of the keywords have a more significant impact on the direction of the stock price than others. One simple but interesting exercise is to record the F1 score of a validation run that uses only the observations with a high number of occurrences of a specific keyword, as shown here:

[Figure: Bar chart representing predominant keywords in predicting TSLA stock movement]

The bar chart shows that the terms China, representing all the mentions of the activities of Tesla Motors in China, and Charger, which covers all the references to the charging stations, have a significant positive impact on the direction of the stock, with a probability averaging 75 percent. The terms under the category Risk have a negative impact on the direction of the stock with a probability of 68 percent, or equivalently, a positive impact with a probability of 32 percent. Of the remaining eight categories, 72 percent were unusable as predictors of the direction of the stock price.

This approach can be used to select features, as an alternative to mutual information, for more elaborate classifiers. However, it should not be regarded as the primary methodology for feature selection, but rather as a by-product of Naïve Bayes in case a very small number of features (less than 10 percent) predominate in the model. This result can always be validated by computing the principal components, for which the normalized cumulative variance (eigenvalues) of the most predominant features is 90 percent or more.
