MongoDB is a NoSQL database designed for storing and retrieving large amounts of data, and it is often used to serve user-facing data. Such data must be cleaned and formatted before it is made available, and Apache Pig was designed, in part, with this kind of work in mind. The MongoStorage class makes it convenient to bulk process data in HDFS using Pig and then load the results directly into MongoDB. This recipe will use the MongoStorage class to store data from HDFS in a MongoDB collection.
The easiest way to get started with the Mongo Hadoop Adaptor is to clone the mongo-hadoop project from GitHub and build it for a specific version of Hadoop. A Git client must be installed to clone the project.
This recipe assumes that you are using the CDH3 distribution of Hadoop.
The official Git Client can be found at http://git-scm.com/downloads.
GitHub for Windows can be found at http://windows.github.com/.
GitHub for Mac can be found at http://mac.github.com/.
The Mongo Hadoop Adaptor can be found on GitHub at https://github.com/mongodb/mongo-hadoop. This project needs to be built for a specific version of Hadoop. The resulting JAR file must be installed on each node in the $HADOOP_HOME/lib
folder.
The Mongo Java Driver must also be installed on each node in the $HADOOP_HOME/lib folder. It can be found at https://github.com/mongodb/mongo-java-driver/downloads.
Complete the following steps to copy data from HDFS to MongoDB:
1. Clone the mongo-hadoop repository:

   git clone https://github.com/mongodb/mongo-hadoop.git

2. Switch to the stable release-1.0 branch:

   git checkout release-1.0
3. Set the version of Hadoop that mongo-hadoop should target. In the folder that mongo-hadoop was cloned to, open the build.sbt file with a text editor. Change the following line:

   hadoopRelease in ThisBuild := "default"

   to

   hadoopRelease in ThisBuild := "cdh3"
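The build.sbt change above can also be made non-interactively. A minimal sketch, using a stand-in file (the real build.sbt in the cloned repository contains more settings, so run the sed command against it rather than recreating it):

```shell
# Create a stand-in build.sbt containing only the line of interest.
cat > build.sbt <<'EOF'
hadoopRelease in ThisBuild := "default"
EOF

# Swap the target release in place (GNU sed syntax; BSD sed needs -i '').
sed -i 's/"default"/"cdh3"/' build.sbt

cat build.sbt
```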
4. Build mongo-hadoop:

   ./sbt package

   This will create a file named mongo-hadoop-core_cdh3u3-1.0.0.jar in the core/target folder. It will also create a file named mongo-hadoop-pig_cdh3u3-1.0.0.jar in the pig/target folder.
5. Copy mongo-hadoop-core, mongo-hadoop-pig, and the MongoDB Java Driver to $HADOOP_HOME/lib on each node:

   cp mongo-hadoop-core_cdh3u3-1.0.0.jar mongo-hadoop-pig_cdh3u3-1.0.0.jar mongo-2.8.0.jar $HADOOP_HOME/lib
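The cp command only installs the JARs on the local machine. A sketch for pushing all three files to every node in one pass; the node names are assumptions about your cluster, and the loop prints the scp commands as a dry run (drop the echo to actually copy):

```shell
# Hypothetical node list and the JARs built/downloaded in the earlier steps.
NODES="node1 node2 node3"
JARS="mongo-hadoop-core_cdh3u3-1.0.0.jar mongo-hadoop-pig_cdh3u3-1.0.0.jar mongo-2.8.0.jar"

for node in $NODES; do
  for jar in $JARS; do
    # Dry run: print the command that would install the JAR on this node.
    echo scp "$jar" "$node:"'$HADOOP_HOME/lib/'
  done
done
```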
6. Create a Pig script that loads the weblog_entries.txt file from HDFS and stores the records in a MongoDB collection:

   register /path/to/mongo-hadoop/mongo-2.8.0.jar
   register /path/to/mongo-hadoop/core/target/mongo-hadoop-core_cdh3u3-1.0.0.jar
   register /path/to/mongo-hadoop/pig/target/mongo-hadoop-pig_cdh3u3-1.0.0.jar

   define MongoStorage com.mongodb.hadoop.pig.MongoStorage();

   weblogs = load '/data/weblogs/weblog_entries.txt' as
       (md5:chararray, url:chararray, date:chararray,
        time:chararray, ip:chararray);

   store weblogs into 'mongodb://<HOST>:<PORT>/test.weblogs_from_pig'
       using MongoStorage();
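Before running the script against the full dataset, it can help to confirm that the input records actually carry the five tab-separated fields the load schema declares. A local sketch using made-up sample records (the field values are illustrative, not real weblog data):

```shell
# Build a two-record stand-in for weblog_entries.txt with tab-separated fields.
printf 'aabba123\t/home.html\t2012-01-01\t10:00:00\t127.0.0.1\n'  > sample_entries.txt
printf 'ccdde456\t/about.html\t2012-01-02\t11:30:00\t10.0.0.5\n' >> sample_entries.txt

# Count records whose field count differs from the five declared in the schema.
awk -F'\t' 'NF != 5 { bad++ } END { print bad+0, "malformed records" }' sample_entries.txt
```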
The Mongo Hadoop Adaptor provides new Hadoop-compatible InputFormat and OutputFormat implementations, MongoInputFormat and MongoOutputFormat. These abstractions make working with MongoDB similar to working with any Hadoop-compatible filesystem. MongoStorage converts Pig types to the BasicDBObjectBuilder object type, which is used by MongoDB.