The multinomial Naïve Bayes classifier is particularly well suited for text mining. The Naïve Bayes formula is quite effective at classifying the following entities:
This third use case consists of predicting the direction of a stock given the financial news. There are two types of news that affect the stock of a particular company:
Macroeconomic news related to a specific company has the potential to affect the sentiment of investors toward the company and may lead to a sudden shift in the price of its stock. Another important feature is the average time it takes for investors to react to the news and move the price of the stock.
The average time the market takes to react to a significant financial news on a company is illustrated in the following chart:
The delay in the market response is a relevant feature only if the variance of the response time is significant. The distribution of the frequencies of the delay in the market response to newsworthy articles regarding TSLA is fairly constant: stock prices react within the same day in 82 percent of the cases, as seen in the following bar chart:
The frequency peak for a market response delay of 1.75 days can be explained by the fact that some news is released over the weekend and investors have to wait until the following Monday to drive the stock price higher or lower. Another challenge is to attribute any shift of a stock price to a specific news release, taking into account that some news items can be redundant, confusing, or simultaneous.
Therefore, the model features for predicting the stock price pr(t+1) are the relative frequencies f_i of occurrence of the terms T_i within a time window [t-n, t], where t and n are trading days.
The following graphical model formally describes the causal relation or conditional dependency of the relative change of the stock price between two consecutive trading sessions t and t + 1, given the relative frequency of appearance of some key terms in the media:
For this exercise, the observation sets are the corpus of news feeds and articles released by the most prominent financial news organizations, such as Bloomberg or CNBC. The first step is to devise a methodology to extract and select the most relevant terms associated with a specific stock.
A full discussion of information retrieval and text mining is beyond the scope of this book [5:11]. For the sake of simplicity, the model will rely on a very simple scheme for extracting relevant terms and computing their relative frequency. The following 10-step sequence of actions describes one of the numerous methodologies used to extract the most relevant terms from a corpus:
Let's apply the text mining methodology template to predict the direction of a stock, given the financial news. The algorithm relies on a sequence of seven simple steps:
Each news article is converted into an instance of the Document type, and its date: T timestamp is extracted from the article using a regular expression.

Text analysis metrics
M9: The relative frequency of occurrences of a term (or keyword) t_i with n_ia occurrences in an article a is defined as follows:
M10: The relative frequency of occurrences of a term t_i normalized by the daily average number of articles, where N_a is the total number of articles and N_d is the number of days in the survey, is defined as follows:
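The formulas themselves do not survive in this text (they were most likely rendered as images in the original). From the definitions of M9 and M10 above, they can plausibly be reconstructed as follows; treat this as a reconstruction, not the author's exact notation:

```latex
% M9: relative frequency of term t_i in article a,
% where n_{ia} is the number of occurrences of t_i in a
f_{ia} = \frac{n_{ia}}{\sum_{k} n_{ka}}

% M10: frequency of t_i normalized by the daily average number of
% articles N_a / N_d (N_a articles published over N_d days)
\bar{f}_{i} = f_{i} \cdot \frac{N_d}{N_a}
```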
The news articles are minimalist documents with a timestamp, title, and content, as implemented by the Document class:
case class Document[T <: AnyVal]( //1
date: T, title: String, content: String)
(implicit f: T => Double)
The type T of the date timestamp is bound to Long in this use case, so it can hold the current time in milliseconds on the JVM (line 1).
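A minimal, self-contained sketch shows the class in use. The explicit implicit conversion below stands in for the one Scala provides for Long, and the timestamp and text are purely illustrative:

```scala
// Reproduction of the Document class so the snippet compiles on its own.
case class Document[T <: AnyVal](
    date: T, title: String, content: String)(implicit f: T => Double)

// Explicit Long => Double conversion to satisfy the implicit parameter.
implicit val longToDouble: Long => Double = _.toDouble

// A Long timestamp in milliseconds since the epoch (illustrative value).
val doc = Document[Long](1393632000000L,
  "TSLA update", "Tesla Motors announces new charging stations in China")

println(doc.title)  // TSLA update
```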
This section is dedicated to the implementation of the simple text analyzer. Its purpose is to convert a set of documents of the Document type (in our case, news articles) into a distribution of relative keyword frequencies.
The TextAnalyzer class implements a data transformation of the ETransform type, as described in the Monadic data transformation section in Chapter 2, Hello World!. It transforms a sequence of documents into a sequence of relative frequency distributions.
The TextAnalyzer class has the following two arguments (line 4):

- parser: a function that extracts an array of keywords from the title and content of each news article (line 2)
- lexicon: a type that lists the keywords used to monitor news related to a company, together with their synonyms. The synonyms, or terms semantically similar to each keyword, are defined in an immutable map (line 3)

type TermsRF = Map[String, Double]
type TextParser = String => Array[String] //2
type Lexicon = immutable.Map[String, String] //3
type Corpus[T] = Seq[Document[T]]

class TextAnalyzer[T <: AnyVal]( //4
    parser: TextParser,
    lexicon: Lexicon)(implicit f: T => Double)
  extends ETransform[Lexicon](lexicon) {

  type U = Corpus[T] //5
  type V = Seq[TermsRF] //6

  override def |> : PartialFunction[U, Try[V]] = {
    case docs: U => Try( score(docs) )
  }
  def score(corpus: Corpus[T]): Seq[TermsRF] //7
  def quantize(termsRFSeq: Seq[TermsRF]): //8
    Try[(Array[String], XVSeries[Double])]
  def count(term: String): Counter[String] //9
}
The U type of the input into the data transformation |> is the corpus, or sequence of news articles (line 5). The V type of the output from the data transformation is the sequence of relative frequency distributions of the TermsRF type (line 6).
The score method does the heavy lifting for the class (line 7). The quantize method creates a homogeneous set of observed features (line 8), and the count method counts the number of occurrences of terms or keywords across the documents, or news articles, that share the same publication date (line 9).
The following diagram describes the different components of the text mining process:
Let's dive into the score method:
def score(corpus: Corpus[T]): Seq[TermsRF] = {
  val termsCount = corpus.map(doc => //10
    (doc.date, count(doc.content))) //Seq[(T, Counter[String])]

  val termsCountMap = termsCount.groupBy(_._1).map {
    case (t, seq) => (t, seq.aggregate(new Counter[String])(
      (s, cnt) => s ++ cnt._2, _ ++ _)) //11
  }
  val termsCountPerDate = termsCountMap.toSeq
    .sortWith(_._1 < _._1).unzip._2 //12
  val allTermsCounts = termsCountPerDate
    .aggregate(new Counter[String])((s, cnt) => s ++ cnt, _ ++ _) //13

  termsCountPerDate.map(_ / allTermsCounts).map(_.toMap) //14
}
The first step in the execution of the score method is the computation of the number of occurrences of the lexicon keywords in each document or news article (line 10). The computation of the number of occurrences is implemented by the count method:
def count(term: String): Counter[String] =
  parser(term)./:(new Counter[String])((cnt, w) => //16
    if (lexicon.contains(w)) cnt + lexicon(w) else cnt)
The method relies on the Counter counting class, a subclass of mutable.Map[String, Int] described in the Counter section under Scala programming in Appendix A, Basic Concepts. It uses a fold to update the count for each of the terms associated with a keyword (line 16). The term counts for the entire corpus are computed by aggregating the term counts of all the documents (line 11).
The next step consists of aggregating the counts of the keywords across the documents for each timestamp. A termsCountMap map, with the dates as keys and the keyword counters as values, is generated by invoking the groupBy higher-order method (line 11). Next, the score method extracts a sorted sequence of keyword counts, termsCountPerDate (line 12). The total count of each keyword over the entire corpus, allTermsCounts (line 13), is used to compute the relative or normalized keyword frequencies (formulas M9 and M10) (line 14).
There is no guarantee that all the news articles associated with a specific publication date are used in the model. The quantize method assigns a relative frequency of 0.0 to keywords that are missing from the news articles, as illustrated in the following table:
The quantize method transforms a sequence of term relative frequencies into a pair of keywords and observations:
def quantize(termsRFSeq: Seq[TermsRF]):
    Try[(Array[String], XVSeries[Double])] = Try {
  val keywords = lexicon.values.toArray.distinct //15
  val relFrequencies = termsRFSeq.map(tf => //16
    keywords.map(key => if (tf.contains(key)) tf.get(key).get else 0.0))
  (keywords, relFrequencies.toVector) //17
}
The quantize method extracts an array of keywords from the lexicon (line 15). The relFrequencies vector of features is generated by assigning a relative frequency of 0.0 to the keywords that are not detected in the news articles published on a given date (line 16). Finally, the method returns the key-value pair of keywords and relative keyword frequencies (line 17).
Sparse relative frequencies vector
Text analysis and natural language processing deal with very large feature sets, with potentially hundreds of thousands of features or keywords. Such computations would be almost intractable were it not for the fact that the vast majority of keywords are not present in each document. It is common practice to use sparse vectors and sparse matrices to reduce memory consumption during training.
For testing purposes, let's select the news articles that mention Tesla Motors and its ticker symbol, TSLA, over a period of two months.
Let's start by implementing and defining the two components of TextAnalyzer: the parse function and the lexicon variable:
val pathLexicon = "resources/text/lexicon.txt"
val LEXICON = loadLexicon //18

def parse(content: String): Array[String] = {
  val regExpr = "['|,|.|?|!|:|\"]"
  content.trim.toLowerCase.replaceAll(regExpr, " ") //19
    .split(" ") //20
    .filter(_.length > 2) //21
}
The lexicon is loaded from a file (line 18). The parse method uses the simple regExpr regular expression to replace any punctuation character with a space (line 19), which is used as the word delimiter (line 20). All the words shorter than three characters are discarded (line 21).
Let's describe the workflow used to load, parse, and analyze the news articles related to the company Tesla Motors and its stock, with the ticker symbol TSLA.
The first step is to load and clean all the articles (the corpus) stored in the pathCorpus directory (line 22). This task is performed by the DocumentsSource class, as described in the Data extraction section under Scala programming in Appendix A, Basic Concepts:
val pathCorpus = "resources/text/chap5/" //22
val dateFormat = new SimpleDateFormat("MM.dd.yyyy")

val pfnDocs = DocumentsSource(dateFormat, pathCorpus) |> //23
val textAnalyzer = TextAnalyzer[Long](parse, LEXICON)
val pfnText = textAnalyzer |> //24

for {
  corpus <- pfnDocs(None) //25
  termsFreq <- pfnText(corpus) //26
  featuresSet <- textAnalyzer.quantize(termsFreq) //27
  expected <- Try(difference(TSLA_QUOTES, diffInt)) //28
  nb <- NaiveBayes[Double](1.0, featuresSet._2.zip(expected)) //29
} yield {
  show(s"Naive Bayes model${nb.toString(featuresSet._1)}")
  …
}
A document source is fully defined by the path of the data input files and the format used for the timestamp (line 23). The text analyzer and its explicit pfnText data transformation are instantiated (line 24). The text processing pipeline consists of the following steps:

1. Load and clean the corpus of news articles using the pfnDocs partial function (line 25).
2. Generate the termsFreq relative keyword frequency vectors using the pfnText partial function (line 26).
3. Quantize the relative frequencies into the featuresSet observations using quantize (line 27).
4. Compute the expected class values from the daily stock prices (line 28) (refer to The differential operator section under Time series in Scala in Chapter 3, Data Preprocessing).
5. Train the NaiveBayes model using the pair (featuresSet._2, expected) as the training data (line 29).

The expected class values (0, 1) are extracted from the daily stock prices for Tesla Motors, TSLA_QUOTES:
val TSLA_QUOTES = Array[Double](250.56, 254.84, … )
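The difference and diffInt helpers belong to the book's library, so here is a hedged stand-in for the labeling step: each trading day is labeled 1 if the next close is higher than the current one, and 0 otherwise (the book's actual diffInt operator may differ in detail; only the first two prices below come from the listing above, the rest are made up for the example):

```scala
// Illustrative closing prices; the first two match the text above.
val TSLA_QUOTES = Array[Double](250.56, 254.84, 252.66, 252.94)

// Direction labels for consecutive trading sessions: 1 = up, 0 = down/flat.
val expected: Array[Int] =
  TSLA_QUOTES.sliding(2).map { case Array(prev, next) =>
    if (next > prev) 1 else 0
  }.toArray

println(expected.mkString(", "))  // 1, 0, 1
```

Note that n closing prices yield n - 1 labels, which is why the observations zipped with expected in the for-comprehension must be aligned on trading days.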
The semantic analysis
This example uses a very primitive semantic map (lexicon) for the sake of illustrating the benefits and inner workings of the multinomial Naïve Bayes algorithm. Commercial applications involving sentiment analysis or topic analysis require a deeper understanding of semantic associations and extraction of topics using advanced generative models, such as Latent Dirichlet Allocation.
The following chart describes the frequency of occurrences of some of the keywords related to either Tesla Motors or its stock ticker TSLA:
The following chart plots the expected change in the direction of the stock price for the trading day following the press release(s) or news article(s):
The preceding chart displays the historical price of the TSLA stock with its direction (UP and DOWN). The classifier, validated on the 15 percent of the labeled data set aside for validation, has an F1 score of 0.71. Keep in mind that no preprocessing or clustering was performed to isolate the most relevant features/keywords; the keywords were initially selected according to the frequency of their occurrence in the financial news.
It is fair to assume that some of the keywords have a more significant impact on the direction of the stock price than others. One simple but interesting exercise is to record the F1 score of a validation run that uses only the observations with a high number of occurrences of a specific keyword, as shown in the following graph:
The preceding bar chart shows that the term China, representing all the mentions of the activities of Tesla Motors in China, and the term Charger, which covers all the references to the charging stations, have a significant positive impact on the direction of the stock, with a probability averaging 75 percent. The terms under the Risk category have a negative impact on the direction of the stock with a probability of 68 percent or, equivalently, a positive impact with a probability of 32 percent. Of the remaining eight categories, 72 percent were unusable as predictors of the direction of the stock price.
This approach can be used for feature selection as an alternative to mutual information when using more elaborate classifiers. However, it should not be regarded as the primary methodology for selecting features, but rather as a by-product of the Naïve Bayes formula applied to models with a very small number of relevant features. Techniques such as principal components analysis, as described in the Principal components analysis section under Dimension reduction in Chapter 4, Unsupervised Learning, are available to reduce the dimension of the problem and make Naïve Bayes a viable classifier.