Most text analysis and data mining algorithms operate on vector data. We can use a vector space model to represent text data as a set of vectors. For example, we can build a vector space model by taking the set of all terms that appear in the dataset and assigning an index to each term in the term set. The number of terms in the term set is the dimensionality of the resulting vectors, and each dimension of a vector corresponds to a term. For each document, the vector contains the number of occurrences of each term at the index location assigned to that term. This creates a vector space model using the term frequencies in each document, similar to the result of the computation we perform in the Generating an inverted index using Hadoop MapReduce recipe of Chapter 7, Searching and Indexing.
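As a minimal illustration of this term-count vector space model (plain Python on a toy corpus, not part of the Mahout recipe itself), the index assignment and count vectors can be sketched as follows:

```python
# Toy illustration of the term-count vector space model (not Mahout code).
documents = [
    "the cat sat on the mat",
    "the dog sat",
]

# Assign an index to each distinct term; the term set size is the dimensionality.
terms = sorted({t for doc in documents for t in doc.split()})
index = {term: i for i, term in enumerate(terms)}

# Each document becomes a vector of term occurrence counts.
vectors = []
for doc in documents:
    v = [0] * len(terms)
    for t in doc.split():
        v[index[t]] += 1
    vectors.append(v)

print(terms)    # dimension -> term mapping
print(vectors)  # one count vector per document
```

Each position of a vector holds the count of the term assigned to that index, which is exactly the representation the Mahout tools below build at scale.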
However, creating vectors using the preceding term count model gives a lot of weight to terms that occur frequently across many documents (for example, the, is, a, are, was, who, and so on), even though these frequent terms contribute very little to defining the meaning of a document. The Term frequency-inverse document frequency (TF-IDF) model solves this issue by using inverse document frequencies (IDF) to scale the term frequencies (TF). The IDF of a term is typically calculated by counting the number of documents the term appears in (DF), inverting it (1/DF), normalizing it by multiplying by the total number of documents (N), and taking the logarithm of the resultant value, as shown roughly by the following equation:
TF-IDF_i = TF_i × log(N / DF_i)
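The equation above can be sketched in plain Python for a toy corpus (a simple illustration of the formula only; Mahout's actual implementation differs in details such as smoothing and normalization):

```python
import math

# Term frequency maps for a toy corpus of three documents (illustrative data).
docs = [
    {"the": 2, "cat": 1},
    {"the": 1, "dog": 1},
    {"the": 1, "cat": 2},
]
N = len(docs)

# DF_i: number of documents that contain term i.
df = {}
for doc in docs:
    for term in doc:
        df[term] = df.get(term, 0) + 1

def tf_idf(term, doc):
    # TF-IDF_i = TF_i * log(N / DF_i)
    return doc.get(term, 0) * math.log(N / df[term])

print(tf_idf("the", docs[0]))  # 0.0 -- "the" appears in every document
print(round(tf_idf("cat", docs[2]), 3))
```

Note how a term appearing in every document gets a weight of zero, which is exactly how TF-IDF suppresses words such as "the".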
In this recipe, we'll create TF-IDF vectors from a text dataset using a built-in utility tool of Apache Mahout.
Install and deploy Hadoop MapReduce and HDFS. Export the HADOOP_HOME environment variable to point to your Hadoop installation root folder.

Download and install Apache Mahout. Export the MAHOUT_HOME environment variable to point to your Mahout installation root folder. Refer to the Installing Mahout recipe of Chapter 5, Hadoop Ecosystem, for more information on installing Mahout.
The following steps show you how to build a vector model of the 20news dataset:

1. Download and extract the 20news dataset from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz.
2. Upload the extracted data to the HDFS:

>bin/hadoop fs -mkdir 20news-all
>bin/hadoop fs -put <extracted_folder> 20news-all

3. Go to MAHOUT_HOME. Generate Hadoop sequence files from the uploaded text data:

>bin/mahout seqdirectory -i 20news-all -o 20news-seq

4. Generate TF and TF-IDF sparse vector models from the text data in the sequence files:

>bin/mahout seq2sparse -i 20news-seq -o 20news-vector
This launches a series of MapReduce computations, as shown in the following screenshot; wait for the completion of these computations:
Check the output folder using the following command. The tfidf-vectors folder contains the TF-IDF model vectors, the tf-vectors folder contains the term count model vectors, and the dictionary.file-0 file contains the term to term-index mapping:

>bin/hadoop dfs -ls 20news-vector
Found 7 items
drwxr-xr-x   - u supergroup       0 2012-11-27 16:53 /user/u/20news-vector/df-count
-rw-r--r--   1 u supergroup    7627 2012-11-27 16:51 /user/u/20news-vector/dictionary.file-0
-rw-r--r--   1 u supergroup    8053 2012-11-27 16:53 /user/u/20news-vector/frequency.file-0
drwxr-xr-x   - u supergroup       0 2012-11-27 16:52 /user/u/20news-vector/tf-vectors
drwxr-xr-x   - u supergroup       0 2012-11-27 16:54 /user/u/20news-vector/tfidf-vectors
drwxr-xr-x   - u supergroup       0 2012-11-27 16:50 /user/u/20news-vector/tokenized-documents
drwxr-xr-x   - u supergroup       0 2012-11-27 16:51 /user/u/20news-vector/wordcount
You can use the Mahout seqdumper command to dump the TF-IDF vectors in the <term index>:<TF-IDF value> format:

>bin/mahout seqdumper -i 20news-vector/tfidf-vectors/part-r-00000
……
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Key: /54492: Value: {225:3.374729871749878,400:1.5389964580535889,321:1.0,324:2.386294364929199,326:2.386294364929199,315:1.0,144:2.0986123085021973,11:1.0870113372802734,187:2.652313232421875,134:2.386294364929199,132:2.0986123085021973,......}
……
Hadoop sequence files store data as binary key-value pairs and support data compression. Mahout's seqdirectory command converts the text files into a Hadoop SequenceFile by using the filename of each text file as the key and the contents of the text file as the value. The seqdirectory command stores all the text contents in a single SequenceFile. However, it's possible for us to specify a chunk size to control the actual storage of the SequenceFile data blocks in the HDFS. Following is a selected set of options for the seqdirectory command:
>bin/mahout seqdirectory -i <HDFS path to text files> -o <HDFS output directory for sequence file>
-ow                   If present, overwrite the output directory
-chunk <chunk size>   In megabytes. Defaults to 64 MB
-prefix <key prefix>  The prefix to be prepended to the key
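Conceptually, seqdirectory produces filename-to-content pairs. A rough Python sketch of that key-value scheme (purely illustrative; the real tool writes a binary, compressible SequenceFile rather than a dictionary) would be:

```python
import os

def to_key_value_pairs(directory):
    # Mimic seqdirectory's scheme: key = filename, value = full file contents.
    pairs = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8", errors="ignore") as f:
                # seqdirectory keys start with a "/" prefix by default.
                pairs["/" + name] = f.read()
    return pairs
```

This filename-keyed layout is why the seqdumper output later in this recipe shows keys such as /54492, which are the original document filenames.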
The seq2sparse command is an Apache Mahout tool that supports the generation of sparse vectors from SequenceFiles containing text data. It supports the generation of both TF and TF-IDF vector models. This command executes as a series of MapReduce computations. Following is a selected set of options for the seq2sparse command:
>bin/mahout seq2sparse -i <HDFS path to the text sequence file> -o <HDFS output directory>
-wt {tf|tfidf}
-chunk <max dictionary chunk size in MB to keep in memory>
--minSupport <minimum support>
--minDF <minimum document frequency>
--maxDFPercent <maximum percentage of documents for DF>
minSupport is the minimum frequency for a word to be considered as a feature. minDF is the minimum number of documents a word needs to appear in. maxDFPercent is the maximum value of the expression (document frequency of a word / total number of documents) for that word to be considered as a good feature in a document. This helps remove high-frequency features such as stop words.
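The effect of these three pruning options can be sketched as follows (illustrative Python, not Mahout's implementation; the term statistics are made-up example data):

```python
def select_features(term_stats, num_docs, min_support=2, min_df=1, max_df_percent=50):
    # term_stats: term -> (total frequency across the corpus, document frequency)
    selected = []
    for term, (freq, df) in term_stats.items():
        if freq < min_support:
            continue  # below minSupport: too rare overall to be a feature
        if df < min_df:
            continue  # below minDF: appears in too few documents
        if 100.0 * df / num_docs > max_df_percent:
            continue  # above maxDFPercent: too common (e.g. stop words)
        selected.append(term)
    return sorted(selected)

# (total frequency, document frequency) for three example terms in 10 documents.
stats = {"the": (100, 10), "hadoop": (8, 4), "xyzzy": (1, 1)}
print(select_features(stats, num_docs=10))  # "the" and "xyzzy" are pruned
```

Here "the" is dropped because it appears in 100% of the documents (above maxDFPercent), and "xyzzy" is dropped because its total frequency falls below minSupport.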
You can use the Mahout seqdumper command to dump the contents of a SequenceFile that uses the Mahout Writable data types as plain text:
>bin/mahout seqdumper -i <HDFS path to the sequence file> -o <output directory>
--count     Output only the number of key-value pairs
--numItems  Max number of key-value pairs to output
--facets    Output the value counts per key