Storing large data in Mongo using GridFS

A document in MongoDB can be up to 16 MB in size. But does that mean we cannot store data larger than 16 MB? There are cases where you would prefer to store videos and audio files in the database rather than in a filesystem, for a number of advantages: the metadata can be stored along with the file, the file can be accessed from an intermediate location, and the contents are replicated for high availability if replication is enabled on the MongoDB server instances. GridFS can be used to address such use cases in MongoDB. We will also see how GridFS manages large content that exceeds 16 MB and analyze the collections it uses for storing the content behind the scenes. For test purposes, we will not use data exceeding 16 MB but something smaller to see GridFS in action.

Getting ready

Look at the recipe Installing single node MongoDB in Chapter 1, Installing and Starting the Server, and start a single instance of Mongo. That is the only prerequisite for this recipe. Start a Mongo shell and connect to the started server. Additionally, we will use the mongofiles utility to store data in GridFS from the command line.
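Both the mongo shell and the mongofiles utility connect to a server on localhost:27017 and use the test database by default, which is what this recipe assumes. If your server runs elsewhere, pass the --host and --port options explicitly to either tool, for example:

$ mongo --host localhost --port 27017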

How to do it…

  1. Download the code bundle of the book and save the image file glimpse_of_universe-wide.jpg to your local drive (you may choose any other large file for that matter and substitute the appropriate filename in the commands we execute). For the sake of the example, the image is saved in the home directory. The steps that follow cover three operations: uploading the file (put), retrieving it (get), and deleting it (delete).
  2. With the server up and running, execute the following command from the operating system's shell, with the current directory being the home directory. There are two arguments here: the first is the name of the file on the local filesystem, and the second is the name under which the uploaded content will be stored in MongoDB.
    $ mongofiles put -l glimpse_of_universe-wide.jpg universe.jpg
    
  3. Let's now query the collections to see how this content is actually stored behind the scenes. With the shell open, execute the following two queries. In the second query, make sure you exclude the data field from the projection.
    > db.fs.files.findOne({filename:'universe.jpg'})
    > db.fs.chunks.find({}, {data:0})
    
  4. Now that we have put a file into GridFS from the operating system's local filesystem, we will see how to get it back to the local filesystem. Execute the following from the operating system shell:
    $ mongofiles get -l UploadedImage.jpg universe.jpg
    
  5. Finally, we will delete the file we uploaded. From the operating system shell, execute the following:
    $ mongofiles delete universe.jpg
    
  6. Confirm the deletion using the following queries again; the expected results are shown right after these steps:
    > db.fs.files.findOne({filename:'universe.jpg'})
    > db.fs.chunks.find({}, {data:0})
    
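If the delete succeeded, the first query should return null and the second should return no documents, assuming no other files were uploaded to GridFS in this database:

    > db.fs.files.findOne({filename:'universe.jpg'})
    null
    > db.fs.chunks.find({}, {data:0})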

How it works…

The Mongo distribution comes with a tool called mongofiles, which lets us upload large content to the Mongo server, where it gets stored using the GridFS specification. GridFS is not a different product but a standard specification followed by the various MongoDB drivers for storing data greater than 16 MB, which is the maximum document size. It can even be used for files less than 16 MB, as we did in our recipe, but there isn't really a good reason to do that. There is nothing stopping us from implementing our own way of storing these large files, but it is preferable to follow the standard: all drivers support it and do the heavy lifting of splitting a big file into small chunks and assembling the chunks back when needed.

We took the image downloaded from the Packt Publishing site and uploaded it to MongoDB using mongofiles. The command to do that is put, and the -l option gives the name of the file on the local drive that we want to upload. Finally, universe.jpg is the name under which the file is to be stored in GridFS.
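If you want to verify the upload from the command line as well, mongofiles provides the list and search commands, which print the names and sizes of the files stored in GridFS; against the default test database, that would be:

$ mongofiles list
$ mongofiles search universe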

On successful execution, we should see something like the following on the console:

connected to: 127.0.0.1
added file: { _id: ObjectId('5310d531d1e91f93635588fe'), filename: "universe.jpg", chunkSize: 262144, uploadDate: new Date(1393612082137), md5: "d894ec31b8c5addd0c02060971ea05ca", length: 2711259 }
done!

This gives us some details of the upload: the unique _id for the uploaded file, the name of the file, the chunk size (the size of the chunks this big file is broken into, 256 KB by default), the date of upload, the MD5 checksum of the uploaded content, and the total length of the upload. The checksum can be computed beforehand and then compared against this value after the upload to verify that the uploaded content was not corrupted.
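For example, on a Linux machine the local checksum can be computed with md5sum (md5 on OS X) and compared with the md5 value reported by mongofiles, or with the one stored in fs.files; a minimal check, assuming the image is in the current directory, would be:

$ md5sum glimpse_of_universe-wide.jpg
> db.fs.files.findOne({filename:'universe.jpg'}).md5

The two values should match if the upload was not corrupted.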

Execute the following query from the mongo shell in the test database:

> db.fs.files.findOne({filename:'universe.jpg'})

We see that the output of the put command of mongofiles is the same as the document queried above in the fs.files collection. This is the collection where the details of every uploaded file are put when data is added to GridFS; there will be one document per upload. Applications can later modify this document to add their own custom metadata alongside the standard details added by Mongo when the data was stored. For an image upload, an application could very well use this collection to add details such as the photographer, the location where the image was taken, and tags for the individuals in the image.
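As an illustration, the following shell update adds a custom metadata subdocument to our uploaded file; the photographer and location values are hypothetical examples and not part of the GridFS specification, which merely reserves the metadata field for application use:

> db.fs.files.update(
    {filename: 'universe.jpg'},
    {$set: {metadata: {photographer: 'Jane Doe', location: 'Atacama Desert'}}}
  )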

The fs.chunks collection is what actually contains the file's data. Let's execute the following query:

> db.fs.chunks.find({}, {data:0})

We have deliberately left out the data field from the selected result. Let's look at the structure of the result document:

{
_id: <Unique identifier of type ObjectId representing this chunk>,
files_id: <ObjectId of the document in fs.files for the file that this chunk belongs to>,
n: <The chunk's sequence number, starting at 0; this is useful for knowing the order of the chunks>,
data: <BSON binary content of the data uploaded for this chunk of the file>
}

For the file we uploaded, we have 11 chunks of a maximum of 256 KB each. When a file is requested, the fs.chunks collection is searched by files_id, which comes from the _id field of the fs.files collection, and by the field n, the chunk's sequence number. A unique index on these two fields is created automatically the first time a file is uploaded using GridFS, allowing fast retrieval of chunks by file ID sorted by chunk sequence number.
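To see this in action, the following exploratory queries fetch the chunks of our file in order (excluding the data field) and list the indexes on the chunks collection:

> var fileId = db.fs.files.findOne({filename: 'universe.jpg'})._id
> db.fs.chunks.find({files_id: fileId}, {data: 0}).sort({n: 1})
> db.fs.chunks.getIndexes()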

Similar to put, the get command is used to retrieve files from GridFS and put them on the local filesystem. The difference in the command is just the use of get instead of put; -l is still used to provide the name under which the file will be saved on the local filesystem, and the final command-line parameter is the name of the file as stored in GridFS, that is, the value of the filename field in the fs.files collection. Finally, the delete command of mongofiles simply removes the file's entries from the fs.files and fs.chunks collections. The name of the file given to delete is again the value present in the filename field of the fs.files collection.

Some important use cases for GridFS involve content such as large reports generated from static data that doesn't change too often and is expensive to generate frequently. Instead of generating such a report on every request, it can be generated once and stored until a change in the static data is detected, in which case the stored report is deleted and regenerated on the next request for the data. The filesystem may not always be available to the application to write files to, in which case GridFS is a good alternative. There are also cases where one is interested only in some intermediate portion of the stored data, in which case just the chunks containing the required data can be accessed. You also get some nice features, such as the MD5 checksum of the data, which is computed and stored automatically and is available for use by the application.
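For instance, if only a portion somewhere in the middle of the file is needed, just the chunk covering it can be read instead of the whole file. As a rough sketch, dividing the byte offset of interest by chunkSize gives the sequence number of the chunk to fetch (the offset of 1,000,000 below is an arbitrary example):

> var f = db.fs.files.findOne({filename: 'universe.jpg'})
> var n = Math.floor(1000000 / f.chunkSize)
> db.fs.chunks.findOne({files_id: f._id, n: n})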

Now that we have seen what GridFS is, let's see some scenarios where using GridFS might not be a very good idea. The performance of accessing content through MongoDB using GridFS and accessing it directly from the filesystem will not be the same. Direct filesystem access will be faster than GridFS, so a Proof of Concept (POC) for the system to be developed is recommended to measure the performance hit and check whether it is within acceptable limits; if so, the trade-off in performance might be worth the benefits we get. Also, if your application servers are fronted by a CDN, you might not actually need a lot of I/O for the static data stored in GridFS. Since GridFS stores the data across multiple documents in multiple collections, atomically updating the content is not possible. If we know the content is less than 16 MB, which is the case for a lot of user-generated content or small uploaded files, we may skip GridFS altogether and store the content in a single document, as BSON supports storing binary content in a document. Refer to the previous recipe, Storing binary data in Mongo, for more details.
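As a minimal illustration of that last point, a small binary payload can be stored directly in a regular document as a BinData value; the collection name and the base64 payload below are placeholders, and the previous recipe covers this approach in detail:

> db.thumbnails.insert({name: 'thumb.png', data: BinData(0, 'iVBORw0KGgo=')})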

We would rarely use mongofiles utility to store, retrieve, and delete data from GridFS. Though it may occasionally be used, we will mostly perform these operations from an application. In the next couple of recipes, we will see how to connect to GridFS to store, retrieve, and delete files using Java and Python clients.

There's more…

Though this doesn't have much to do with Mongo, OpenStack is an Infrastructure as a Service (IaaS) platform that offers a variety of services for compute, storage, networking, and so on. Its image storage service, called Glance, supports a number of persistent stores for images, one of which is MongoDB's GridFS. You can find more information on how to configure Glance to use GridFS at the following URL: http://docs.openstack.org/trunk/config-reference/content/ch_configuring-openstack-image-service.html.

See also

You can refer to the following recipes:

  • Storing data to GridFS from Java client
  • Storing data to GridFS from Python client