Generating an inverted index using Hadoop MapReduce

Most text search systems rely on an inverted index to look up the set of documents that contain a given word or term. In this recipe, we are going to build a simple inverted index that computes the list of terms in the documents, the set of documents that contains each term, and the term frequency in each of the documents. Retrieval of results from an inverted index can be as simple as returning the set of documents that contains the given terms, or it can involve much more complex operations such as returning the set of documents ordered based on a particular ranking.
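
For example, given two small documents, one.txt containing "the quick brown fox" and two.txt containing "the quick dog and the lazy fox" (hypothetical filenames and contents), the inverted index maps each term to the set of documents containing it, together with the frequency of the term in each document:

brown -> one.txt:1
dog   -> two.txt:1
fox   -> one.txt:1, two.txt:1
the   -> one.txt:1, two.txt:2
……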

Getting ready

You must have Apache Hadoop (preferably version 1.0.x) installed and configured to follow this recipe. You will also need Apache Ant to compile and build the source code.

How to do it...

In the following steps, we will use a MapReduce program to build an inverted index for a text dataset.

  1. Export the $HADOOP_HOME environment variable, pointing it to the root of your local Apache Hadoop installation.
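    For example, assuming Hadoop is installed under /opt/hadoop-1.0.4 (a hypothetical path; adjust it to match your installation):
    > export HADOOP_HOME=/opt/hadoop-1.0.4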
  2. Create a directory in HDFS and upload a text data set. This data set should consist of one or more text files.
    > bin/hadoop dfs -mkdir input
    > bin/hadoop dfs -put *.txt input
    

    Note

    You can download the text versions of Project Gutenberg books by following the instructions given at http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages. Make sure to set the filetypes query parameter of the download request to txt. Unzip the downloaded files. You can use the unzipped text files as the text data set for this recipe.

  3. Unzip the resources bundle for this chapter and change to that directory.
  4. Compile the source by running the ant build command from the unzipped directory.
  5. Copy the resulting build/c7-samples.jar to your Hadoop home directory.
  6. Run the inverted indexing MapReduce job using the following command from the Hadoop home directory. Provide the HDFS directory where you uploaded the input data in step 2 as the first argument and provide a path to store the output as the second argument.
    >  bin/hadoop jar c7-samples.jar chapter7.TextOutInvertedIndexer input output
    
  7. Check the output directory for the results by running the following command. The output will consist of each term followed by a comma-separated list of filename:frequency pairs.
    >  bin/hadoop dfs -cat output/*
    ARE three.txt:1,one.txt:1,four.txt:1,two.txt:1,
    AS three.txt:2,one.txt:2,four.txt:2,two.txt:2,
    AUGUSTA three.txt:1,
    About three.txt:1,two.txt:1,
    Abroad three.txt:2,
    ……
    
  8. We used the text-outputting inverted indexing MapReduce program in step 6 for clarity of understanding the algorithm. The src/chapter7/InvertedIndexer.java program uses Hadoop SequenceFiles and MapWritable to output an index, which is friendlier for machine processing and more efficient for storage (see the sketch after these steps for reading that output back). You can run this version of the program by substituting the command in step 6 with the following command:
    >  bin/hadoop jar c7-samples.jar chapter7.InvertedIndexer input output
    
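The SequenceFile-based index produced in step 8 can be read back programmatically. The following is a minimal sketch, assuming the index stores Text keys and MapWritable values as described in the next section; the class name and argument handling are illustrative:

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class IndexReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Path to one of the part files produced by the InvertedIndexer job
        Path indexFile = new Path(args[0]);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, indexFile, conf);
        Text term = new Text();
        MapWritable postings = new MapWritable();
        // Each record maps a term to its postings: document ID -> term frequency
        while (reader.next(term, postings)) {
            StringBuilder line = new StringBuilder(term.toString());
            for (Map.Entry<Writable, Writable> entry : postings.entrySet()) {
                line.append(' ').append(entry.getKey()).append(':').append(entry.getValue());
            }
            System.out.println(line);
        }
        reader.close();
    }
}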

How it works...

The Map function receives a chunk of an input document as the input and outputs the term and a <docid, 1> pair for each word. In the Map function, we first replace all the non-alphanumeric characters in the input text value with spaces before tokenizing it.

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Replace all non-alphanumeric characters with spaces before tokenizing
    String valString = value.toString().replaceAll("[^a-zA-Z0-9]+", " ");
    StringTokenizer itr = new StringTokenizer(valString);

    // Use the name of the file containing the current split as the document ID
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    String fileName = fileSplit.getPath().getName();
    while (itr.hasMoreTokens()) {
        // term (Text) and docFrequency (TermFrequencyWritable) are reusable
        // instance fields of the Mapper; emit <term, (docid, 1)> for each word
        term.set(itr.nextToken());
        docFrequency.set(fileName, 1);
        context.write(term, docFrequency);
    }
}

We use the getInputSplit() method of the MapContext to obtain a reference to the InputSplit assigned to the current Map task. The InputSplits for this computation are instances of FileSplit due to the usage of a FileInputFormat-based InputFormat. Then we use the getPath() method of FileSplit to obtain the path of the file containing the current split and extract the filename from it. We use this extracted filename as the document ID when constructing the inverted index.

The reduce function receives the IDs and frequencies of all the documents that contain the term (key) as the input. The reduce function outputs the term together with the list of document IDs and the number of occurrences of the term in each document:

public void reduce(Text key, Iterable<TermFrequencyWritable> values, Context context)
        throws IOException, InterruptedException {
    // Aggregate the term frequency per document
    HashMap<Text, IntWritable> map = new HashMap<Text, IntWritable>();
    for (TermFrequencyWritable val : values) {
        Text docID = new Text(val.getDocumentID());
        int freq = val.getFreq().get();
        if (map.get(docID) != null) {
            map.put(docID, new IntWritable(map.get(docID).get() + freq));
        } else {
            map.put(docID, new IntWritable(freq));
        }
    }
    // Emit the term and its postings as a MapWritable of <docID, frequency> pairs
    MapWritable outputMap = new MapWritable();
    outputMap.putAll(map);
    context.write(key, outputMap);
}
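
The values iterated over in the reduce and combiner functions are TermFrequencyWritable instances, a custom Writable that pairs a document ID with a term frequency. Its actual implementation is included in the chapter's source bundle; the following is only a minimal sketch of such a class, assuming the set(), getDocumentID(), and getFreq() accessors used in the code above:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TermFrequencyWritable implements Writable {
    private Text documentID = new Text();
    private IntWritable freq = new IntWritable();

    public void set(String docID, int frequency) {
        documentID.set(docID);
        freq.set(frequency);
    }

    public Text getDocumentID() { return documentID; }

    public IntWritable getFreq() { return freq; }

    // Serialize and deserialize the two fields in a fixed order
    @Override
    public void write(DataOutput out) throws IOException {
        documentID.write(out);
        freq.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        documentID.readFields(in);
        freq.readFields(in);
    }
}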

In the preceding model, we output a record for each word occurrence, generating a large amount of intermediate data between the Map and Reduce tasks. We use the following combiner to aggregate the terms emitted by the Map tasks, reducing the size and amount of intermediate data transferred from the Map tasks to the Reduce tasks:

public void reduce(Text key, Iterable<TermFrequencyWritable> values, Context context)
        throws IOException, InterruptedException {
    int count = 0;
    String id = "";
    // All values in a single combiner invocation originate from the same input
    // split, and hence from the same document, so a single document ID suffices.
    // Summing the frequencies (rather than counting records) keeps the result
    // correct even if Hadoop applies the combiner more than once.
    for (TermFrequencyWritable val : values) {
        count += val.getFreq().get();
        if (id.isEmpty()) {
            id = val.getDocumentID().toString();
        }
    }
    TermFrequencyWritable writable = new TermFrequencyWritable();
    writable.set(id, count);
    context.write(key, writable);
}

In the driver program, we set the Mapper, Reducer, and Combiner classes. We also specify both the output value and the map output value properties, as we use different value types for the Map tasks and the Reduce tasks.

Job job = new Job(conf, "Inverted Indexer");
…
job.setMapperClass(IndexingMapper.class);
job.setReducerClass(IndexingReducer.class);
job.setCombinerClass(IndexingCombiner.class);
…
job.setMapOutputValueClass(TermFrequencyWritable.class);
job.setOutputValueClass(MapWritable.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);

There's more...

The older MapReduce API of Apache Hadoop (org.apache.hadoop.mapred.*) supports a file format called MapFile that can be used to store an index into the data stored in SequenceFiles. MapFile is very useful when we need random access to records stored in a large SequenceFile. We can utilize MapFiles to store a secondary index into our inverted index. You can use MapFileOutputFormat to output MapFiles, which consist of a SequenceFile containing the actual data and another file containing the index into the SequenceFile.
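
As an illustration of such random-access lookups, the following is a minimal sketch that assumes the indexing job was written against the older API with MapFileOutputFormat as its output format, keeping Text keys and MapWritable values; the class name and argument handling are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class MapFileLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path indexDir = new Path(args[0]); // job output directory containing the MapFiles
        Text term = new Text(args[1]);     // term to look up

        // One MapFile.Reader is opened per Reduce task output
        MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, indexDir, conf);
        // Use the same partitioner as the indexing job to locate the reader that holds the key
        HashPartitioner<Text, MapWritable> partitioner = new HashPartitioner<Text, MapWritable>();
        MapWritable postings = new MapWritable();
        if (MapFileOutputFormat.getEntry(readers, partitioner, term, postings) != null) {
            System.out.println(term + " appears in " + postings.size() + " documents");
        }
        for (MapFile.Reader reader : readers) {
            reader.close();
        }
    }
}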

We can improve this indexing program by performing optimizations such as filtering stop words, substituting words with their word stems, and storing more information about the context of the word, making the indexing a much more complex problem. Luckily, several open source indexing frameworks exist that we can use for indexing purposes. In this chapter, we'll be using the Apache Lucene-based Apache Solr and ElasticSearch for indexing purposes.

See also

The Creating TF and TF-IDF vectors for the text data recipe of Chapter 9, Mass Text Data Processing.
