Naïve Bayes and text mining

The multinomial Naïve Bayes classifier is particularly well suited for text mining. The Naïve Bayes formula is quite effective at classifying the following types of documents:

  • E-mail spam
  • Business news stories
  • Movie reviews
  • Technical papers, by field of expertise

This use case consists of predicting the direction of a stock, given financial news. There are two types of news that affect the stock of a particular company:

  • Macro trends: Economic or social news such as conflicts, economic trends, or labor market statistics
  • Micro updates: Financial or market news related to a specific company such as earnings, change in ownership, or press releases

Both types of news have the potential to affect the sentiment of investors toward the company and may lead to a sudden shift in the price of its stock. Another important feature is the average time it takes investors to react to the news and move the price of the stock.

  • Long-term investors may react within days or even weeks
  • Short-term traders adjust their positions within hours, sometimes within the same trading session

The average time the market takes to react to significant financial news regarding a company is illustrated in the following chart:

An illustration of the reaction of investors to a news release on the price of a stock

The delay in the market response is a relevant feature only if the variance of the response time is significant. The distribution of the frequencies of the delay in the market response to newsworthy articles regarding TSLA shows that the delay is fairly consistent: the stock price reacts within the same day in 82 percent of the cases, as seen in the following bar chart:

The distribution of the frequencies of the reaction of investors to a news release on the price of a stock

The frequency peak at a market response delay of 1.75 days can be explained by the fact that some news is released over the weekend, forcing investors to wait until the following Monday to drive the stock price higher or lower. Another challenge is to attribute a shift in a stock price to a specific news release, taking into account that some news can be redundant, confusing, or simultaneous.

Therefore, the model features for predicting the stock price $pr_{t+1}$ are the relative frequencies $f_i$ of the occurrences of a term $T_i$ within a time window $[t-n, t]$, where $t$ and $n$ are trading days.

The following graphical model formally describes the causal relation, or conditional dependency, of the relative change of the stock price between two consecutive trading sessions $t$ and $t+1$, given the relative frequency of appearance of certain key terms in the media:

The Bayesian model for the prediction of the stock movement given financial news
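Under the Naïve Bayes conditional independence assumption, the graphical model above translates into the familiar factorization sketched here, where $\Delta pr_{t+1}$ denotes the class (the direction of the price change between the sessions $t$ and $t+1$) and $f_1, \dots, f_n$ denote the relative term frequencies; the notation is illustrative:

$$p(\Delta pr_{t+1} \mid f_1, \dots, f_n) \propto p(\Delta pr_{t+1}) \prod_{i=1}^{n} p(f_i \mid \Delta pr_{t+1})$$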

For this exercise, the observation sets are the corpus of news feeds and articles released by the most prominent financial news organizations, such as Bloomberg or CNBC. The first step is to devise a methodology to extract and select the most relevant terms associated with a specific stock.

Basics of information retrieval

A full discussion of information retrieval and text mining is beyond the scope of this book [5:11]. For the sake of simplicity, the model relies on a very simple scheme for extracting relevant terms and computing their relative frequencies. The following 10-step sequence of actions describes one of the numerous methodologies used to extract the most relevant terms from a corpus (a minimal sketch of two of these steps follows the list):

  1. Create or extract the timestamp for each news article.
  2. Extract the title, paragraphs, and sentences of each article using a Markovian classifier.
  3. Extract the terms from each sentence using regular expressions.
  4. Correct terms for typos using a dictionary and a metric such as the Levenshtein distance.
  5. Remove the stop words.
  6. Perform stemming and lemmatization.
  7. Extract bags of words and generate a list of n-grams (as a sequence of n terms).
  8. Apply a tagging model built using maximum entropy or conditional random fields to extract nouns and adjectives (for example, NN, NNP, and so on).
  9. Match the terms against a dictionary that supports senses, hyponyms, and synonyms, such as WordNet.
  10. Disambiguate word sense using Wikipedia's repository DBpedia [5:12].
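To make a couple of these steps concrete, here is a minimal, self-contained Scala sketch of steps 3 (term extraction) and 5 (stop word removal); the regular expression, the stop word list, and the names are illustrative assumptions rather than the book's implementation:

object TermExtractionSketch extends App {
  // Illustrative stop word list; a real application loads a complete list
  val stopWords = Set("a", "an", "the", "and", "for", "with", "that")

  // Step 3: extract lowercase alphabetic terms with a regular expression
  def extractTerms(sentence: String): Array[String] =
    "[a-zA-Z]+".r.findAllIn(sentence.toLowerCase).toArray

  // Step 5: remove the stop words
  def removeStopWords(terms: Array[String]): Array[String] =
    terms.filterNot(stopWords.contains)

  val terms = removeStopWords(extractTerms("Tesla opens a new charging station"))
  println(terms.mkString(", "))  // tesla, opens, new, charging, station
}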

Note

Text extraction from the Web

The methodology discussed in this section does not include the process of searching and extracting news and articles from the Web, which requires additional steps such as searching, crawling, and scraping [5:13].

Implementation

Let's apply the text mining methodology template to predict the direction of a stock, given the financial news. The algorithm relies on a sequence of seven simple steps:

  1. Searching and loading the news articles related to a given company and its stock as documents Dt of the Document type.
  2. Extracting the timestamp (of the T type) of each article using a regular expression.
  3. Ordering the documents Dt by their timestamp.
  4. Extracting the terms {Ti,D} from the content of each document Dt.
  5. Aggregating the terms {Ti,D} for all the documents Dt that share the same publication date t.
  6. Computing the relative frequency rtf of each term {Ti,D} for the date t, as the ratio of the number of its occurrences in all the articles released at t to the total number of its occurrences in the entire corpus.
  7. Normalizing the relative frequency by the average number of articles per date, nrtf.

Note

Text analysis metrics

M9: The relative frequency of occurrences of a term (or keyword) $t_i$ with $n_i^a$ occurrences in an article $a$, where $A_t$ denotes the set of articles released on the date $t$ and $A$ the entire corpus, is defined as follows:

$$rtf(t_i, t) = \frac{\sum_{a \in A_t} n_i^a}{\sum_{a \in A} n_i^a}$$

M10: The relative frequency of occurrences of a term $t_i$ normalized by the daily average number of articles, for which $N_a$ is the total number of articles and $N_d$ is the number of days in the survey, is defined as follows:

$$nrtf(t_i, t) = rtf(t_i, t) \cdot \frac{N_d}{N_a}$$
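As a quick sanity check with made-up numbers: if a keyword occurs 6 times in the articles released on the date $t$ and 30 times in the entire corpus, then $rtf = 6/30 = 0.2$; if the corpus contains $N_a = 100$ articles spread over $N_d = 50$ days, the normalized frequency is $nrtf = 0.2 \times 50/100 = 0.1$.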

The news articles are minimalist documents with a timestamp, title, and content, as implemented by the Document class:

case class Document[T <: AnyVal]( //1
date: T, title: String, content: String)
(implicit f: T => Double)   

The T type of the date timestamp is bounded to AnyVal and is implicitly convertible to Double, so that it can hold, for instance, the current JVM time in milliseconds as a Long (line 1).
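For instance, a news article could be instantiated as follows; the title and content are made-up values, and the implicit Long to Double conversion is supplied by the Scala standard library:

val article = Document[Long](
  System.currentTimeMillis,             // timestamp in milliseconds
  "Tesla opens new charging stations",  // title (made-up)
  "Tesla Motors announced ...")         // content (made-up)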

Analyzing documents

This section is dedicated to the implementation of the simple text analyzer. Its purpose is to convert a set of documents of the Document type (in our case, news articles) into a distribution of the relative frequencies of keywords.

The TextAnalyzer class implements a data transformation of the ETransform type, as described in the Monadic data transformation section in Chapter 2, Hello World!. It transforms a sequence of documents into a sequence of relative frequency distribution.

The TextAnalyzer class has the following two arguments (line 4):

  • A simple text parser, parser, that extracts an array of keywords from the title and content of each news article (line 2).
  • A lexicon that lists the keywords used to monitor the news related to a company and their synonyms; the synonyms, or terms that are semantically similar to each keyword, are defined in an immutable map (line 3).

The code will be as follows:

type TermsRF = Map[String, Double]  
type TextParser = String => Array[String] //2
type Lexicon = immutable.Map[String, String]  //3
type Corpus[T] = Seq[Document[T]]

class TextAnalyzer[T <: AnyVal](  //4
     parser: TextParser, 
     lexicon: Lexicon)(implicit f: T => Double)
  extends ETransform[Lexicon](lexicon) {

  type U = Corpus[T]    //5
  type V = Seq[TermsRF] //6
  
  override def |> : PartialFunction[U, Try[V]] = {
     case docs: U => Try( score(docs) )
  }
  
  // The following three methods are implemented in the next subsections
  def score(corpus: Corpus[T]): Seq[TermsRF]  //7
  def quantize(termsRFSeq: Seq[TermsRF]): //8
          Try[(Array[String], XVSeries[Double])]
  def count(term: String): Counter[String] //9
}

The U type of an input into the data transformation |> is the corpus or sequence of news articles (line 5). The V type of the output from the data transformation is the sequence of relative frequency distribution of the TermsRF type (line 6).

The score private method does the heavy lifting for the class (line 7). The quantize method creates a homogeneous set of observed features (line 8), and the count method counts the number of occurrences of the keywords of the lexicon within a single document or news article (line 9).

The following diagram describes the different components of the text mining process:

An illustration of the components of the text mining procedure

Extracting the relative frequency of terms

Let's dive into the score method:

def score(corpus: Corpus[T]): Seq[TermsRF] = {
  val termsCount = corpus.map(doc =>  //10
      (doc.date, count(doc.content))) //Seq[(T, Counter[String])]

  val termsCountMap = termsCount.groupBy( _._1).map{ 
     case (t, seq) => (t, seq.aggregate(new Counter[String])
                         ((s, cnt) => s ++ cnt._2, _ ++ _)) //11
  }
  val termsCountPerDate = termsCountMap.toSeq
         .sortWith( _._1 < _._1).unzip._2  //12
  val allTermsCounts = termsCountPerDate
          .aggregate(new Counter[String])((s, cnt) => 
                               s ++ cnt, _ ++ _) //13

  termsCountPerDate.map( _ /allTermsCounts).map(_.toMap) //14
}

The first step in the execution of the score method is the computation of the number of occurrences of the keywords of the lexicon in each document/news article (line 10). The computation of the number of occurrences is implemented by the count method:

def count(term: String): Counter[String] = 
  parser(term)./:(new Counter[String])((cnt, w) =>   //16
    if(lexicon.contains(w)) cnt + lexicon(w) else cnt)

The method relies on the Counter counting class, which subclasses mutable.Map[String, Int], as described in the Counter section under Scala programming in Appendix A, Basic Concepts. It uses a fold to update the count of each keyword of the lexicon that matches a term of the document (line 16).

The next step consists of aggregating the counts of the keywords across the documents that share the same timestamp. The termsCountMap map, with the dates as keys and the keywords counters as values, is generated by invoking the groupBy higher-order method (line 11). Next, the score method extracts the sequence of keyword counts sorted by date, termsCountPerDate (line 12). The total count of each keyword over the entire corpus, allTermsCounts (line 13), is used to compute the relative, or normalized, keyword frequencies (formulas M9 and M10) (line 14).
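To illustrate the counting logic with a tiny, made-up lexicon in which two synonyms map to the same keyword, here is a self-contained sketch that uses an immutable Map in place of the library's Counter class:

// Made-up lexicon: both terms map to the same keyword, "Charger"
val lexicon = Map("charger" -> "Charger", "supercharger" -> "Charger")

// Same fold as in count, using an immutable Map instead of a Counter
def countSketch(words: Array[String]): Map[String, Int] =
  words.foldLeft(Map.empty[String, Int])((cnt, w) =>
    lexicon.get(w)
      .map(k => cnt.updated(k, cnt.getOrElse(k, 0) + 1))
      .getOrElse(cnt))

// countSketch(Array("new", "supercharger", "and", "charger"))
// returns Map("Charger" -> 2)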

Generating the features

There is no guarantee that every keyword of the lexicon appears in the news articles associated with a specific publication date. The quantize method assigns a relative frequency of 0.0 to the keywords that are missing from the news articles, as illustrated in the following table:

A table of the relative frequencies of keywords per publishing date

The quantize method transforms a sequence of term relative frequencies into a pair of keywords and observations:

def quantize(termsRFSeq: Seq[TermsRF]): 
             Try[(Array[String], XVSeries[Double])] = Try {
  val keywords = lexicon.values.toArray.distinct //15
  val relFrequencies = termsRFSeq.map( tf =>  //16
      keywords.map(key => tf.getOrElse(key, 0.0)))
  (keywords, relFrequencies.toVector) //17
}

The quantize method extracts the array of distinct keywords from the lexicon (line 15). The relFrequencies vector of features is generated by assigning a relative frequency of 0.0 to the keywords that are not detected in the news articles published on a given date (line 16). Finally, the method returns the pair of keywords and relative keyword frequencies (line 17).
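For instance, with a lexicon reduced to the two keywords Charger and China, a date whose aggregated relative frequencies are Map("Charger" -> 0.3) is quantized into the observation Array(0.3, 0.0); the missing China keyword is padded with 0.0.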

Note

Sparse relative frequencies vector

Text analysis and natural language processing deal with very large feature sets, with potentially hundreds of thousands of features or keywords. Such computations would be almost intractable were it not for the fact that the vast majority of keywords are not present in each document. It is a common practice to use sparse vectors and sparse matrices to reduce the memory consumption during training.
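As a minimal illustration of the idea (not the representation used by this book's library), a sparse vector can be encoded as a map from the feature index to its nonzero value:

// A sparse vector stores only the nonzero entries of a very
// high-dimension feature vector as index -> value pairs
type SparseVector = Map[Int, Double]

// Dot product that iterates only over the nonzero entries of x
def dot(x: SparseVector, y: SparseVector): Double =
  x.foldLeft(0.0){ case (s, (i, v)) => s + v*y.getOrElse(i, 0.0) }

val x: SparseVector = Map(3 -> 0.5, 42817 -> 1.0)
val y: SparseVector = Map(3 -> 2.0, 15 -> 4.0)
assert(dot(x, y) == 1.0)  // only the index 3 overlaps: 0.5*2.0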

Testing

For testing purposes, let's select the news articles that mention Tesla Motors and its ticker symbol, TSLA, over a period of two months.

Retrieving the textual information

Let's start by implementing and defining the two components of TextAnalyzer: the parse function and the LEXICON variable:

val pathLexicon = "resources/text/lexicon.txt"
val LEXICON = loadLexicon  //18

def parse(content: String): Array[String] = {
  val regExpr = "['|,|.|?|!|:|\"]"
  content.trim.toLowerCase.replaceAll(regExpr, " ") //19
    .split(" ") //20
    .filter( _.length > 2) //21
}

The lexicon is loaded from a file (line 18). The parse method uses the simple regExpr regular expression to replace any punctuation character with a space (line 19), which is then used as the word delimiter (line 20). All the words shorter than three characters are discarded (line 21).
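For example, parse("Tesla's stock rose!") lowercases the input, replaces the apostrophe and the exclamation mark with spaces, splits the result into tesla, s, stock, and rose, and finally discards the one-character token s, returning Array("tesla", "stock", "rose").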

Let's describe the workflow used to load, parse, and analyze the news articles related to the company Tesla Motors and its stock, with the ticker symbol TSLA.

The first step is to load and clean all the articles (the corpus) located in the pathCorpus directory (line 22). This task is performed by the DocumentsSource class, as described in the Data extraction section under Scala programming in Appendix A, Basic Concepts:

val pathCorpus = "resources/text/chap5/"   //22
val dateFormat = new SimpleDateFormat("MM.dd.yyyy")
val pfnDocs = DocumentsSource(dateFormat, pathCorpus) |>  //23

val textAnalyzer = TextAnalyzer[Long](parse, LEXICON)
val pfnText = textAnalyzer |>   //24

for {
  corpus <- pfnDocs(None)  //25
  termsFreq <- pfnText(corpus)  //26
  featuresSet <- textAnalyzer.quantize(termsFreq) //27
  expected <- Try(difference(TSLA_QUOTES, diffInt)) //28
  nb <- NaiveBayes[Double](1.0, 
             featuresSet._2.zip(expected))//29
} yield {
  show(s"Naive Bayes model${nb.toString(featuresSet._1)}")
   …
}

A document source is fully defined by the path of the input data files and the format of the timestamp (line 23). The text analyzer and its pfnText explicit data transformation are instantiated (line 24). The text processing pipeline is defined by the following steps:

  1. The transformation of an input source file into a corpus (a sequence of news articles) using the pfnDocs partial function (line 25).
  2. The transformation of the corpus into a sequence of termsFreq relative keyword frequency vectors using the pfnText partial function (line 26).
  3. The transformation of the sequence of relative keyword frequency vectors into a featuresSet using quantize (line 27).
  4. The extraction of the expected class values from the differences between consecutive daily stock prices using the difference function with the diffInt operator (line 28) (refer to The differential operator section under Time series in Scala in Chapter 3, Data Preprocessing).
  5. The creation of the binomial NaiveBayes model using the pair (featuresSet._2, expected) as training data (line 29).

The expected class values (0,1) are extracted from the daily stock price for Tesla Motors, TSLA_QUOTES:

val TSLA_QUOTES = Array[Double](250.56, 254.84, … )
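The difference function and the diffInt operator belong to the book's library (refer to Chapter 3). Under the assumption that the class value is 1 when the stock closes higher at the next session and 0 otherwise, a stand-alone equivalent could be sketched as follows:

// Hypothetical stand-alone equivalent of difference(TSLA_QUOTES, diffInt):
// label 1 if the next session's price is higher than the current one, else 0
def directionLabels(quotes: Array[Double]): Vector[Int] =
  quotes.zip(quotes.tail).map {
    case (prev, next) => if(next > prev) 1 else 0
  }.toVector

// directionLabels(Array(250.56, 254.84, 252.10)) == Vector(1, 0)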

Note

The semantic analysis

This example uses a very primitive semantic map (the lexicon) for the sake of illustrating the benefits and inner workings of the multinomial Naïve Bayes algorithm. Commercial applications involving sentiment analysis or topic analysis require a deeper understanding of semantic associations and the extraction of topics using advanced generative models, such as latent Dirichlet allocation.

Evaluating the text mining classifier

The following chart describes the frequency of occurrences of some of the keywords related to either Tesla Motors or its stock ticker TSLA:

A graph of the relative frequency of a partial list of stock-related terms

The following chart plots the expected change in the direction of the stock price for the trading day following the press release(s) or news article(s):

A graph of the stock price and movement of the Tesla Motors stock

The preceding chart displays the historical price of the TSLA stock along with its direction (UP or DOWN). The classification of the 15 percent of the labeled data selected for the validation of the classifier has an F1 score of 0.71. You need to keep in mind that no preprocessing or clustering was performed to isolate the most relevant features/keywords; the keywords were initially selected according to the frequency of their occurrences in the financial news.
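As a reminder, the F1 score is the harmonic mean of the precision $p$ and the recall $r$:

$$F_1 = \frac{2\,p\,r}{p + r}$$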

It is fair to assume that some of the keywords have a more significant impact on the direction of the stock price than others. One simple but interesting exercise is to record the value of the F1 score for a validation run that uses only the observations with a high number of occurrences of a specific keyword, as shown in the following graph:

A bar chart representing the predominant keywords in predicting the TSLA stock movement

The preceding bar chart shows that the term China, representing all the mentions of the activities of Tesla Motors in China, and the term Charger, which covers all the references to the charging stations, have a significant positive impact on the direction of the stock, with a probability averaging 75 percent. The terms under the Risk category have a negative impact on the direction of the stock with a probability of 68 percent or, equivalently, a positive impact with a probability of 32 percent. Of the remaining eight categories, 72 percent were unusable as predictors of the direction of the stock price.

This approach can be used to select features, as an alternative to mutual information, for more elaborate classifiers. However, it should not be regarded as the primary methodology for feature selection, but instead as a by-product of the Naïve Bayes formula applied to models with a very small number of relevant features. Techniques such as principal components analysis, as described in the Principal components analysis section under Dimension reduction in Chapter 4, Unsupervised Learning, are available to reduce the dimension of the problem and make Naïve Bayes a viable classifier.
