Most text search systems rely on an inverted index to look up the set of documents that contain a given word or term. In this recipe, we are going to build a simple inverted index that computes the list of terms in the documents, the set of documents that contain each term, and the term frequency in each of those documents. Retrieval of results from an inverted index can be as simple as returning the set of documents that contain the given terms, or it can involve much more complex operations such as returning a set of documents ordered according to a particular ranking.
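As a minimal illustration of the data structure itself (plain Java, independent of the Hadoop program built in this recipe), an inverted index can be modeled as a map from each term to a map of document IDs and term frequencies:

```java
import java.util.HashMap;
import java.util.Map;

public class SimpleInvertedIndex {
    // term -> (document ID -> term frequency)
    private final Map<String, Map<String, Integer>> index = new HashMap<>();

    // Tokenize a document and record each term occurrence under its document ID
    public void addDocument(String docId, String text) {
        for (String token : text.toLowerCase().split("[^a-zA-Z0-9]+")) {
            if (token.isEmpty()) continue;
            index.computeIfAbsent(token, t -> new HashMap<>())
                 .merge(docId, 1, Integer::sum);
        }
    }

    // Look up the documents (and frequencies) that contain the given term
    public Map<String, Integer> lookup(String term) {
        return index.getOrDefault(term.toLowerCase(), new HashMap<>());
    }

    public static void main(String[] args) {
        SimpleInvertedIndex idx = new SimpleInvertedIndex();
        idx.addDocument("one.txt", "as the story goes");
        idx.addDocument("two.txt", "the story, as told, is the story");
        System.out.println(idx.lookup("story").get("two.txt")); // prints 2
        System.out.println(idx.lookup("goes").get("one.txt"));  // prints 1
    }
}
```

The MapReduce program in this recipe computes the same mapping, but distributes the tokenizing across Map tasks and the per-term aggregation across Reduce tasks.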
You must have Apache Hadoop (preferably version 1.0.x) installed and configured to follow this recipe. You will also need Apache Ant to compile and build the source code.
In the following steps, we will use a MapReduce program to build an inverted index for a text dataset.
Upload the text documents to an HDFS directory, with the $HADOOP_HOME environment variable pointing to the root of your local Apache Hadoop installation:
> bin/hadoop dfs -mkdir input
> bin/hadoop dfs -put *.txt input
You can download the text versions of Project Gutenberg books by following the instructions given at http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages. Make sure to set the filetypes query parameter of the download request to txt. Unzip the downloaded files. You can use the unzipped text files as the text dataset for this recipe.
Compile the sample source code by running the ant build command from the unzipped directory. Copy the resulting build/c7-samples.jar to your Hadoop home directory.
Provide the HDFS directory where you uploaded the input data in step 2 as the first argument and provide a path to store the output as the second argument:
> bin/hadoop jar c7-samples.jar chapter7.TextOutInvertedIndexer input output
> bin/hadoop dfs -cat output/*
ARE three.txt:1,one.txt:1,four.txt:1,two.txt:1,
AS three.txt:2,one.txt:2,four.txt:2,two.txt:2,
AUGUSTA three.txt:1,
About three.txt:1,two.txt:1,
Abroad three.txt:2,
……
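Each line of this text output is a term followed by a comma-separated list of docid:frequency pairs (note the trailing comma). As a small plain-Java illustration of this format (not part of the recipe's sample code), one such line can be parsed back into a postings map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexLineParser {
    // Parse a line such as "ARE three.txt:1,one.txt:1,four.txt:1,two.txt:1,"
    // into (document ID -> frequency); the term itself is the first token.
    public static Map<String, Integer> parsePostings(String line) {
        Map<String, Integer> postings = new LinkedHashMap<>();
        String[] parts = line.split("\\s+", 2);
        if (parts.length < 2) return postings;
        for (String entry : parts[1].split(",")) {
            if (entry.isEmpty()) continue;   // skip the empty field after the trailing comma
            int colon = entry.lastIndexOf(':');
            postings.put(entry.substring(0, colon),
                         Integer.parseInt(entry.substring(colon + 1)));
        }
        return postings;
    }

    public static void main(String[] args) {
        Map<String, Integer> p =
            parsePostings("AS three.txt:2,one.txt:2,four.txt:2,two.txt:2,");
        System.out.println(p.get("three.txt")); // prints 2
    }
}
```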
The src/chapter9/InvertIndexer.java program uses Hadoop SequenceFiles and the MapWritable class to output an index that is friendlier for machine processing and more efficient to store. You can run this version of the program by substituting the command in step 6 with the following command:
> bin/hadoop jar c7-samples.jar chapter7.InvertedIndexer input output
The Map function receives a chunk of an input document as its input and outputs the term and a <docid, 1>
pair for each word. In the Map function, we first replace all the non-alphanumeric characters in the input text value with spaces before tokenizing it.
public void map(Object key, Text value, ……… {
  String valString = value.toString().replaceAll("[^a-zA-Z0-9]+", " ");
  StringTokenizer itr = new StringTokenizer(valString);

  FileSplit fileSplit = (FileSplit) context.getInputSplit();
  String fileName = fileSplit.getPath().getName();

  while (itr.hasMoreTokens()) {
    term.set(itr.nextToken());
    docFrequency.set(fileName, 1);
    context.write(term, docFrequency);
  }
}
We use the getInputSplit() method of MapContext to obtain a reference to the InputSplit assigned to the current Map task. The InputSplits for this computation are instances of FileSplit, due to the usage of a FileInputFormat-based InputFormat. We then use the getPath() method of FileSplit to obtain the path of the file containing the current split, and extract the filename from it. We use this extracted filename as the document ID when constructing the inverted index.
The Reduce function receives the IDs and frequencies of all the documents that contain the term (the key) as its input. It outputs the term together with a list of document IDs and the number of occurrences of the term in each document:
public void reduce(Text key, Iterable<TermFrequencyWritable> values, Context context) ………… {
  HashMap<Text, IntWritable> map = new HashMap<Text, IntWritable>();

  for (TermFrequencyWritable val : values) {
    Text docID = new Text(val.getDocumentID());
    int freq = val.getFreq().get();
    if (map.get(docID) != null) {
      map.put(docID, new IntWritable(map.get(docID).get() + freq));
    } else {
      map.put(docID, new IntWritable(freq));
    }
  }

  MapWritable outputMap = new MapWritable();
  outputMap.putAll(map);
  context.write(key, outputMap);
}
In the preceding model, we output a record for each word occurrence, generating a large amount of intermediate data between the Map tasks and the Reduce tasks. We use the following combiner to aggregate the terms emitted by the Map tasks, reducing the size of the Map-to-Reduce intermediate data transfer:
public void reduce(Text key, Iterable<TermFrequencyWritable> values …… {
  int count = 0;
  String id = "";

  for (TermFrequencyWritable val : values) {
    count++;
    if (count == 1) {
      id = val.getDocumentID().toString();
    }
  }

  TermFrequencyWritable writable = new TermFrequencyWritable();
  writable.set(id, count);
  context.write(key, writable);
}
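The effect of the combiner can be sketched in plain Java (this is an illustration of the aggregation idea, not code from the samples): since each Map task processes a single file, every <docid, 1> value for a given term within one task carries the same document ID, so the combiner can collapse them into a single <docid, count> record.

```java
import java.util.HashMap;
import java.util.Map;

public class CombinerSketch {
    // A Map task processing one file emits (term, 1) once per word occurrence.
    // A combiner collapses those into one (term, count) record per distinct term,
    // so far fewer records cross the network to the Reduce tasks.
    public static Map<String, Integer> combine(String[] emittedTerms) {
        Map<String, Integer> combined = new HashMap<>();
        for (String term : emittedTerms) {
            combined.merge(term, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        String[] emitted = {"the", "story", "the", "the", "story"};
        Map<String, Integer> combined = combine(emitted);
        System.out.println(combined.get("the")); // prints 3
        System.out.println(combined.size());     // prints 2 (5 records reduced to 2)
    }
}
```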
In the driver program, we set the Mapper, Reducer, and Combiner classes. We also specify both the output value and the map output value properties, as we use different value types for the Map tasks and the Reduce tasks:
Job job = new Job(conf, "Inverted Indexer");
…
job.setMapperClass(IndexingMapper.class);
job.setReducerClass(IndexingReducer.class);
job.setCombinerClass(IndexingCombiner.class);
…
job.setMapOutputValueClass(TermFrequencyWritable.class);
job.setOutputValueClass(MapWritable.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
The older MapReduce API of Apache Hadoop (org.apache.hadoop.mapred.*) supports a file format called MapFile, which can be used to store an index into the data stored in SequenceFiles. MapFile is very useful when we need random access to records stored in a large SequenceFile. We can utilize MapFiles to store a secondary index into our inverted index. You can use MapFileOutputFormat to output MapFiles, each of which consists of a SequenceFile containing the actual data and another file containing the index into that SequenceFile.
We can improve this indexing program by performing optimizations such as filtering stop words, substituting words with their word stems, and storing more information about the context of each word, although these make the indexing a much more complex problem. Luckily, there exist several open source indexing frameworks that we can use for such indexing purposes. In this chapter we'll be using the Apache Lucene-based Apache Solr and ElasticSearch for indexing purposes.
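As a hedged illustration of the first two optimizations (this is not code from the book's samples, and the suffix-stripping here is a crude stand-in for a real stemmer such as the Porter stemmer), tokens can be normalized before they are passed to the indexer:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TokenFilter {
    // A tiny illustrative stop-word list; real systems use much larger ones.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "as", "is", "of", "to"));

    // Returns null for stop words; otherwise applies a crude suffix-stripping
    // "stem" as a placeholder for a proper stemming algorithm.
    public static String normalize(String token) {
        String t = token.toLowerCase();
        if (STOP_WORDS.contains(t)) return null;
        if (t.endsWith("ing") && t.length() > 5) return t.substring(0, t.length() - 3);
        if (t.endsWith("s") && t.length() > 3) return t.substring(0, t.length() - 1);
        return t;
    }

    public static void main(String[] args) {
        System.out.println(normalize("the"));      // prints null (stop word dropped)
        System.out.println(normalize("indexing")); // prints index
        System.out.println(normalize("books"));    // prints book
    }
}
```

In the Map function of this recipe, such a filter would be applied to each token before term.set(), dropping null results; frameworks such as Apache Lucene provide production-quality analyzers for exactly this purpose.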
See also the Creating TF and TF-IDF vectors for the text data recipe of Chapter 9, Mass Text Data Processing.