Creating TF and TF-IDF vectors for the text data

Most text analysis and data mining algorithms operate on vector data. We can use a vector space model to represent text data as a set of vectors. For example, we can build a vector space model by taking the set of all terms that appear in the dataset and assigning an index to each term in the term set. The number of terms in the term set is the dimensionality of the resulting vectors, and each dimension of a vector corresponds to a term. For each document, the vector contains the number of occurrences of each term at the index location assigned to that term. This creates a vector space model based on the term frequencies in each document, similar to the result of the computation we perform in the Generating an inverted index using Hadoop MapReduce recipe of Chapter 7, Searching and Indexing.

Figure: The term frequencies and the resulting document vectors
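
To make the model concrete, the following is a minimal Java sketch (Java being the language Mahout itself is written in) that builds term count vectors for a tiny in-memory corpus. The documents and class name here are illustrative only; Mahout performs the equivalent computation at scale using MapReduce:

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the term count vector space model described above.
public class TermCountVectors {
    public static void main(String[] args) {
        List<String> documents = Arrays.asList(
                "hadoop stores data", "mahout builds vectors from data");

        // Assign an index to each distinct term; the dictionary size
        // becomes the dimensionality of the resulting vectors.
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String term : doc.split("\\s+")) {
                dictionary.putIfAbsent(term, dictionary.size());
            }
        }

        // For each document, count term occurrences at the term's index.
        for (String doc : documents) {
            int[] vector = new int[dictionary.size()];
            for (String term : doc.split("\\s+")) {
                vector[dictionary.get(term)]++;
            }
            System.out.println(doc + " -> " + Arrays.toString(vector));
        }
    }
}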

However, creating vectors using the preceding term count model gives a lot of weight to terms that occur frequently across many documents (for example, the, is, a, are, was, who, and so on), even though these frequent terms contribute very little to defining the meaning of a document. The term frequency-inverse document frequency (TF-IDF) model solves this issue by using the inverse document frequency (IDF) to scale the term frequencies (TF). The IDF of a term is typically calculated by counting the number of documents the term appears in (its document frequency, DF), inverting it (1/DF), normalizing it by multiplying by the total number of documents (N), and taking the logarithm of the resulting value, as shown roughly by the following equation:

TF-IDF_i = TF_i × log(N / DF_i)
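
As a rough illustration of this weighting, the following Java snippet computes the weight for two hypothetical terms using the natural logarithm; the numbers are made up, and Mahout's actual implementation applies its own normalization:

// Sketch of the TF-IDF formula above. A term that appears in few documents
// scores much higher than a stop word with a similar term frequency.
public class TfIdfExample {
    static double tfIdf(int tf, int numDocs, int df) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        System.out.println(tfIdf(3, 1000, 10));  // rare term: ~13.8
        System.out.println(tfIdf(5, 1000, 900)); // stop word: ~0.5
    }
}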

In this recipe, we'll create TF-IDF vectors from a text dataset using a built-in utility tool of Apache Mahout.

Getting ready

Install and deploy Hadoop MapReduce and HDFS. Export the HADOOP_HOME environment variable to point to your Hadoop installation root folder.

Download and install Apache Mahout. Export the MAHOUT_HOME environment variable to point to your Mahout installation root folder. Refer to the Installing Mahout recipe of Chapter 5, Hadoop Ecosystem, for more information on installing Mahout.

How to do it…

The following steps show you how to build a vector model of the 20news dataset:

  1. Download and extract the 20news dataset from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz.
  2. Upload the extracted data to the HDFS:
    >bin/hadoop fs -mkdir 20news-all
    >bin/hadoop fs -put <extracted_folder> 20news-all
    
  3. Go to MAHOUT_HOME. Generate Hadoop sequence files from the uploaded text data:
    >bin/mahout seqdirectory -i 20news-all -o 20news-seq
    
  4. Generate TF and TF-IDF sparse vector models from the text data in the sequence files:
    >bin/mahout seq2sparse -i 20news-seq -o 20news-vector
    

    This launches a series of MapReduce computations; wait for the completion of these computations.
  5. Check the output folder by using the following command. The tfidf-vectors folder contains the TF-IDF model vectors, the tf-vectors folder contains the term count model vectors, and the dictionary.file-0 file contains the term to term-index mapping.
    >bin/hadoop dfs -ls 20news-vector
    Found 7 items
    drwxr-xr-x   - u supergroup          0 2012-11-27 16:53 /user/u/20news-vector/df-count
    -rw-r--r--   1 u supergroup       7627 2012-11-27 16:51 /user/u/20news-vector/dictionary.file-0
    -rw-r--r--   1 u supergroup       8053 2012-11-27 16:53 /user/u/20news-vector/frequency.file-0
    drwxr-xr-x   - u supergroup          0 2012-11-27 16:52 /user/u/20news-vector/tf-vectors
    drwxr-xr-x   - u supergroup          0 2012-11-27 16:54 /user/u/20news-vector/tfidf-vectors
    drwxr-xr-x   - u supergroup          0 2012-11-27 16:50 /user/u/20news-vector/tokenized-documents
    drwxr-xr-x   - u supergroup          0 2012-11-27 16:51 /user/u/20news-vector/wordcount
    
  6. Optionally, you can dump the TF-IDF vectors as text by using the following command; a Java sketch that reads the same vectors programmatically follows these steps. The key is the filename, and the contents of each vector are in the format <term index>:<TF-IDF value>:
    >bin/mahout seqdumper -i 20news-vector/tfidf-vectors/part-r-00000
    ……
    Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
    Key: /54492: Value: {225:3.374729871749878,400:1.5389964580535889,321:1.0,324:2.386294364929199,326:2.386294364929199,315:1.0,144:2.0986123085021973,11:1.0870113372802734,187:2.652313232421875,134:2.386294364929199,132:2.0986123085021973,......}
    ……
    
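If you prefer to read these vectors from Java instead of using seqdumper, the following sketch iterates over the same SequenceFile of (document name, vector) pairs. It assumes the Hadoop and Mahout math libraries are on the classpath, uses the Hadoop 1.x SequenceFile.Reader API consistent with the rest of this recipe, and points at the part file from the previous step:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Reads the TF-IDF vectors the same way seqdumper does: each entry is a
// (document name, vector) pair.
public class DumpTfIdfVectors {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("20news-vector/tfidf-vectors/part-r-00000");
        SequenceFile.Reader reader =
                new SequenceFile.Reader(FileSystem.get(conf), path, conf);
        try {
            Text key = new Text();
            VectorWritable value = new VectorWritable();
            while (reader.next(key, value)) {
                Vector vector = value.get();
                System.out.println(key + " -> "
                        + vector.getNumNondefaultElements() + " non-zero terms");
            }
        } finally {
            reader.close();
        }
    }
}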

How it works…

Hadoop SequenceFiles store data as binary key-value pairs and support data compression. Mahout's seqdirectory command converts the text files into a Hadoop SequenceFile by using the filename of each text file as the key and the contents of the text file as the value. The seqdirectory command stores all the text contents in a single SequenceFile. However, it's possible to specify a chunk size to control the actual storage of the SequenceFile data blocks in the HDFS. The following is a selected set of options for the seqdirectory command, followed by a sketch of the conversion it performs:

> bin/mahout seqdirectory -i <HDFS path to text files>
   -o <HDFS output directory for sequence file>
   -ow                    If present, overwrite the output directory
   -chunk <chunk size>    Chunk size in megabytes. Defaults to 64 MB
   -prefix <key prefix>   The prefix to be prepended to the key
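
The following is a conceptual Java sketch of the conversion seqdirectory performs, writing (filename, contents) pairs into a SequenceFile with Text keys and values. This is an illustration rather than Mahout's actual code, and the local input directory and output path are made-up names:

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Conceptual sketch of what seqdirectory produces: a single SequenceFile
// whose keys are filenames and whose values are the file contents.
// "20news-all-local" and "20news-seq-sketch" are made-up example paths.
public class SeqDirectorySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("20news-seq-sketch"), Text.class, Text.class);
        try {
            for (File file : new File("20news-all-local").listFiles()) {
                String contents = new String(
                        Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
                writer.append(new Text("/" + file.getName()), new Text(contents));
            }
        } finally {
            writer.close();
        }
    }
}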

The seq2sparse command is an Apache Mahout tool that generates sparse vectors from SequenceFiles containing text data. It supports the generation of both TF and TF-IDF vector models, and it executes as a series of MapReduce computations. The following is a selected set of options for the seq2sparse command:

bin/mahout seq2sparse -i <HDFS path to the text sequence file>
  -o <HDFS output directory>
  -wt {tf|tfidf}
  -chunk <max dictionary chunk size in MB to keep in memory>
  --minSupport <minimum support>
  --minDF <minimum document frequency>
  --maxDFPercent <max percentage of documents for DF>

minSupport is the minimum frequency for a word to be considered as a feature. minDF is the minimum number of documents a word needs to appear in. maxDFPercent is the maximum value of the expression (document frequency of a word / total number of documents), expressed as a percentage, for that word to be considered a good feature in the document. This helps remove high-frequency features such as stop words; the pruning logic these options imply is sketched below.
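
As a rough sketch of this pruning (the method name and thresholds here are hypothetical, not Mahout's internal code), a term survives only if it is frequent enough overall but does not appear in too large a fraction of the documents:

// Hypothetical sketch of the pruning logic behind minSupport, minDF, and
// maxDFPercent; the names and values are illustrative, not Mahout's code.
public class FeaturePruning {
    static boolean isGoodFeature(long termFreq, long docFreq, long numDocs,
                                 long minSupport, long minDf, double maxDfPercent) {
        double dfPercent = 100.0 * docFreq / numDocs;
        return termFreq >= minSupport && docFreq >= minDf
                && dfPercent <= maxDfPercent;
    }

    public static void main(String[] args) {
        // A word in 990 of 1,000 documents is pruned as a stop word...
        System.out.println(isGoodFeature(50000, 990, 1000, 2, 1, 85.0)); // false
        // ...while a moderately frequent word is kept as a feature.
        System.out.println(isGoodFeature(40, 12, 1000, 2, 1, 85.0));     // true
    }
}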

You can use the Mahout seqdumper command to dump the contents of a SequenceFile that uses the Mahout Writable data types as plain text:

bin/mahout seqdumper -i <HDFS path to the sequence file>
   -o <output directory>
   --count             Output only the number of key-value pairs
   --numItems          Max number of key-value pairs to output
   --facets            Output the value counts per key

See also

The Generating an inverted index using Hadoop MapReduce recipe of Chapter 7, Searching and Indexing.