Exporting data from HDFS into MongoDB

This recipe will use the MongoOutputFormat class to load data from an HDFS instance into a MongoDB collection.

Getting ready

The easiest way to get started with the Mongo Hadoop Adaptor is to clone the Mongo-Hadoop project from GitHub and build the project configured for a specific version of Hadoop. A Git client must be installed to clone this project.

This recipe assumes that you are using the CDH3 distribution of Hadoop.

The official Git client can be found at http://git-scm.com/downloads.

GitHub for Windows can be found at http://windows.github.com/.

GitHub for Mac can be found at http://mac.github.com/.

The Mongo Hadoop Adaptor can be found on GitHub at https://github.com/mongodb/mongo-hadoop. This project needs to be built for a specific version of Hadoop. The resulting JAR file must be installed on each node in the $HADOOP_HOME/lib folder.

The Mongo Java Driver must also be installed on each node in the $HADOOP_HOME/lib folder. It can be found at https://github.com/mongodb/mongo-java-driver/downloads.

How to do it...

Complete the following steps to copy data from HDFS into MongoDB:

  1. Clone the mongo-hadoop repository with the following command line:
    git clone https://github.com/mongodb/mongo-hadoop.git
  2. Switch to the stable release 1.0 branch:
    git checkout release-1.0
  3. Set the Hadoop version that mongo-hadoop should target. In the folder that mongo-hadoop was cloned to, open the build.sbt file with a text editor. Change the following line:
    hadoopRelease in ThisBuild := "default"

    to:

    hadoopRelease in ThisBuild := "cdh3"
  4. Build mongo-hadoop:
    ./sbt package

    This will create a file named mongo-hadoop-core_cdh3u3-1.0.0.jar in the core/targ

  5. Download the MongoDB Java Driver Version 2.8.0 from https://github.com/mongodb/mongo-java-driver/downloads.
  6. Copy mongo-hadoop and the MongoDB Java Driver to $HADOOP_HOME/lib on each node:
    cp mongo-hadoop-core_cdh3u3-1.0.0.jar mongo-2.8.0.jar $HADOOP_HOME/lib
  7. Create a Java MapReduce program that reads the weblog_entries.txt file from HDFS and writes its records to MongoDB using the MongoOutputFormat class:
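    import java.io.*;
    import org.apache.commons.logging.*;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.*;
    import org.bson.*;
    import org.bson.types.ObjectId;
    import com.mongodb.hadoop.*;
    import com.mongodb.hadoop.util.*;

    public class ExportToMongoDBFromHDFS {
       private static final Log log = LogFactory.getLog(ExportToMongoDBFromHDFS.class);

       // Map-only step: turns each tab-delimited weblog line into a BSON document.
       public static class ReadWeblogs extends Mapper<LongWritable, Text, ObjectId, BSONObject>{
          // The key must be LongWritable (the byte offset supplied by TextInputFormat);
          // with any other key type this method would not override Mapper.map().
          @Override
          public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
             System.out.println("Key: " + key);
             System.out.println("Value: " + value);
             String[] fields = value.toString().split("\t");
             String md5 = fields[0];
             String url = fields[1];
             String date = fields[2];
             String time = fields[3];
             String ip = fields[4];
             BSONObject b = new BasicBSONObject();
             b.put("md5", md5);
             b.put("url", url);
             b.put("date", date);
             b.put("time", time);
             b.put("ip", ip);
             // A fresh ObjectId becomes the _id of the inserted document.
             context.write( new ObjectId(), b);
          }
       }

       public static void main(String[] args) throws Exception{
          final Configuration conf = new Configuration();
          // Output collection for MongoOutputFormat. <HOST>, <PORT>, and the "test"
          // database are placeholders (assumptions); replace them with your own values.
          // "weblogs" is the collection verified in step 9.
          MongoConfigUtil.setOutputURI(conf, "mongodb://<HOST>:<PORT>/test.weblogs");
          System.out.println("Configuration: " + conf);
          final Job job = new Job(conf, "Export to Mongo");
          Path in = new Path("/data/weblogs/weblog_entries.txt");
          FileInputFormat.setInputPaths(job, in);
          job.setJarByClass(ExportToMongoDBFromHDFS.class);
          job.setMapperClass(ReadWeblogs.class);
          job.setOutputKeyClass(ObjectId.class);
          job.setOutputValueClass(BSONObject.class);
          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(MongoOutputFormat.class);
          // No reducers: mapper output is written directly to MongoDB.
          job.setNumReduceTasks(0);
          System.exit(job.waitForCompletion(true) ? 0 : 1 );
       }
    }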
  8. Export as a runnable JAR file and run the job:
    hadoop jar ExportToMongoDBFromHDFS.jar
  9. From the Mongo shell, verify that the weblogs MongoDB collection was populated:
    db.weblogs.find();

How it works...

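The Mongo Hadoop Adaptor provides two classes, MongoInputFormat and MongoOutputFormat, which implement Hadoop's InputFormat and OutputFormat abstractions. These make reading from and writing to MongoDB in a MapReduce job similar to working with files on any Hadoop compatible filesystem. In this recipe, the mapper emits one BSONObject per weblog line, and MongoOutputFormat writes each one as a document to the configured collection.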
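
The field mapping performed by the mapper in step 7 can be tried out without a cluster or a MongoDB instance. The following is a minimal sketch, not part of mongo-hadoop: the class name WeblogRecordParser and its parse method are hypothetical, a plain java.util.Map stands in for BSONObject, and the sample line in main is made up for illustration. It also adds a field-count check that the recipe's mapper does not have, on the assumption that malformed lines should be rejected with a clear message rather than an ArrayIndexOutOfBoundsException.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of mongo-hadoop): mirrors the field mapping
// of the ReadWeblogs mapper, with a plain Map standing in for BSONObject.
public class WeblogRecordParser {

    // Field names in the order they appear in a tab-delimited weblog line.
    private static final String[] FIELD_NAMES = {"md5", "url", "date", "time", "ip"};

    // Splits one line on tabs and returns the first five fields keyed by name.
    // Throws IllegalArgumentException if the line has fewer than five fields.
    public static Map<String, String> parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length < FIELD_NAMES.length) {
            throw new IllegalArgumentException(
                "Expected at least " + FIELD_NAMES.length
                + " tab-separated fields but found " + fields.length);
        }
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < FIELD_NAMES.length; i++) {
            record.put(FIELD_NAMES[i], fields[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        // Made-up sample line, for illustration only.
        String sample = "abc123\t/index.html\t2012-05-10\t21:25:44\t10.0.0.1";
        System.out.println(parse(sample));
    }
}
```

In the real job, the same five put calls would target a BasicBSONObject, and a line that fails the check could be logged and skipped instead of failing the task.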

