Storing large data in MongoDB using GridFS

A document in MongoDB can be a maximum of 16 MB in size, but does that mean we cannot store data larger than 16 MB? There are cases where you would prefer to store videos and audio files in a database rather than in the filesystem, for a number of advantages: storing metadata along with the files, accessing a file from an intermediate location, and replicating the contents for high availability if replication is enabled on the MongoDB server instances. GridFS is the way to address such use cases in MongoDB. We will see how GridFS manages large content that exceeds 16 MB and will analyze the collections it uses to store the content behind the scenes. For test purposes, we will not use data exceeding 16 MB but something smaller, just to see GridFS in action.

Getting ready

Refer to the Single node installation of MongoDB recipe in Chapter 1, Installing and Starting the MongoDB Server, and start a single instance of MongoDB. This is the only prerequisite for this recipe. Start a Mongo shell and connect to the running server. Additionally, we will use the mongofiles utility to store data in GridFS from the command line.

How to do it…

  1. Download the code bundle of the book from the book's website and save the image file named glimpse_of_universe-wide.jpg from it to your local drive (you may, in fact, choose any other large file and substitute its name in the commands we execute). For the sake of the example, the image is saved in the home directory.
  2. With the server up and running, execute the following command from the operating system's shell, with the current directory being the home directory. There are two arguments here: the first is the name of the file on the local filesystem, and the second is the name that will be attached to the uploaded content in MongoDB.
    $ mongofiles put -l glimpse_of_universe-wide.jpg universe.jpg
    
  3. Let us now query the collections to see how this content is actually stored behind the scenes. With the shell open, execute the following two queries. Make sure that in the second query, you exclude the data field:
    > db.fs.files.findOne({filename:'universe.jpg'})
    > db.fs.chunks.find({}, {data:0})
    
  4. Now that we have put a file into GridFS from the operating system's local filesystem, we will see how to get the file back to the local filesystem. Execute the following command from the operating system shell:
    $ mongofiles get -l UploadedImage.jpg universe.jpg
    
  5. Finally, we will delete the file we uploaded. From the operating system shell, execute the following command:
    $ mongofiles delete universe.jpg
    
  6. Confirm the deletion using the following queries again:
    > db.fs.files.findOne({filename:'universe.jpg'})
    > db.fs.chunks.find({}, {data:0})
    

How it works…

The MongoDB distribution comes with an out-of-the-box tool called mongofiles that lets us upload large content to the MongoDB server, where it gets stored using the GridFS specification. GridFS is not a separate product but a standard specification, followed by the various MongoDB drivers, for storing data greater than 16 MB, the maximum document size. It can even be used for files smaller than 16 MB, as we did in this recipe, but there isn't really a good reason to do that. There is nothing stopping us from implementing our own way of storing such large files, but it is preferable to follow the standard because all drivers support it; they do the heavy lifting of splitting a big file into small chunks and reassembling the chunks when needed.
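To get a feel for the chunking arithmetic a driver performs, here is a minimal mongo shell sketch; the file length (2711259 bytes) and default chunk size (262144 bytes) are the values we will see in the upload output shortly:

> var chunkSize = 262144  // default chunk size: 256 KB
> Math.ceil(2711259 / chunkSize)  // number of chunks for our 2711259-byte file
11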

We kept the image downloaded from the book's website and uploaded it to MongoDB using mongofiles. The command to do that is put, and the -l option gives the name of the file on the local drive that we want to upload. Finally, universe.jpg is the name by which we want the file to be stored in GridFS.

On successful execution, we should see the following output:

connected to: 127.0.0.1
added file: { _id: ObjectId('5310d531d1e91f93635588fe'), filename: "universe.jpg", chunkSize: 262144, uploadDate: new Date(1393612082137), md5: "d894ec31b8c5addd0c02060971ea05ca", length: 2711259 }
done!

This gives us some details of the upload, namely, the unique _id for the uploaded file, the name of the file, the chunk size (the size of each chunk this big file is broken into, which by default is 256 KB), the date of upload, the MD5 checksum of the uploaded content, and the total length of the upload in bytes. The checksum can be computed beforehand and then compared after the upload to verify that the uploaded content was not corrupted.
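MongoDB can also recompute this checksum on demand from the stored chunks, using the built-in filemd5 command; the ObjectId here is the one from the preceding output, so substitute the _id of your own upload:

> db.runCommand({filemd5: ObjectId('5310d531d1e91f93635588fe'), root: 'fs'})

The md5 value in the response should match the one reported by mongofiles; if it doesn't, the stored content was corrupted.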

We executed the following query from the Mongo shell in the test database:

> db.fs.files.findOne({filename:'universe.jpg'})

We see that the output of the put command of mongofiles matches the document we queried from the fs.files collection. This is the collection where the details of all uploaded files are put when data is added to GridFS; there will be one document per upload. Applications can later modify this document to add their own custom metadata alongside the standard details added by MongoDB. For example, if the document is for an image upload, we can add details such as the name of the photographer, the location where the image was taken, when it was taken, and tags for the individuals in the image.
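As a minimal sketch of adding such details from the shell, the following update stores them under a metadata subdocument, which the GridFS specification reserves for application-supplied data (the values here are, of course, made up for illustration):

> db.fs.files.update(
    {filename: 'universe.jpg'},
    {$set: {metadata: {photographer: 'J. Doe', location: 'Atacama Desert', tags: ['space', 'stars']}}}
  )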

The actual file content is stored in the fs.chunks collection. Let us execute the following query:

> db.fs.chunks.find({}, {data:0})

We have deliberately excluded the data field from the selection. Let us look at the structure of a result document:

{
  _id: <Unique identifier of type ObjectId representing this chunk>,
  files_id: <ObjectId of the document in fs.files for the file this chunk belongs to>,
  n: <The chunk's sequence number, starting from 0; this is useful for knowing the order of the chunks>,
  data: <BSON binary content of the data uploaded for the file>
}

For the file we uploaded, we have 11 chunks of a maximum of 256 KB each (2711259 bytes / 262144 bytes per chunk, rounded up). When a file is requested, the fs.chunks collection is searched by files_id, which comes from the _id field of the fs.files collection, and sorted by the n field, the chunk's sequence number. A unique index on these two fields is created automatically when a file is first uploaded using GridFS, allowing fast retrieval of the chunks of a given file, ordered by sequence number.
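We can see both the retrieval pattern and this index from the shell; a small sketch, assuming the universe.jpg file from this recipe is still present:

> var fileId = db.fs.files.findOne({filename: 'universe.jpg'})._id
> db.fs.chunks.find({files_id: fileId}, {data: 0}).sort({n: 1})  // the file's chunks, in order
> db.fs.chunks.getIndexes()  // shows the unique index on {files_id: 1, n: 1}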

Similar to put, the get command is used to retrieve files from GridFS and write them to the local filesystem. The -l option again provides a name, this time the name the file will be saved under on the local filesystem. The final parameter to the get command is the name of the file as stored in GridFS; this is the value of the filename field in the fs.files collection. Finally, the delete command of mongofiles simply removes the file's entries from the fs.files and fs.chunks collections. The name of the file given for deletion is, again, the value present in the filename field of the fs.files collection.
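Conceptually, what mongofiles delete does behind the scenes is equivalent to the following shell operations, shown here purely for illustration; use mongofiles or a driver for real deletions:

> var f = db.fs.files.findOne({filename: 'universe.jpg'})
> db.fs.chunks.remove({files_id: f._id})  // remove the file's chunks
> db.fs.files.remove({_id: f._id})        // remove the file's metadata document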

There's more…

An important use case for GridFS is user-generated content such as large reports over static data: data that doesn't change often and is expensive to generate frequently. Instead of running the report every time, it can be run once and stored; when a change in the underlying static data is detected, the stored report is deleted and regenerated on the next request. The filesystem may not always be available to the application for writing files, in which case GridFS is a good alternative. There are also cases where one is interested in only an intermediate portion of the stored data, in which case just the chunk containing the required data can be accessed. You also get some nice features for free; for instance, the MD5 checksum of the data is computed and stored automatically and is available for use by the application.
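For instance, to read data at a particular byte offset without fetching the whole file, we can fetch just the chunk containing it. A minimal sketch, assuming the default 256 KB chunk size and the fileId variable from the earlier query:

> var chunkSize = 262144
> var byteOffset = 1000000  // a hypothetical offset into the file
> db.fs.chunks.findOne({files_id: fileId, n: Math.floor(byteOffset / chunkSize)})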

Now that we have seen what GridFS is, let us look at some scenarios where using it might not be a very good idea. Accessing content from MongoDB through GridFS and reading it directly from the filesystem will not perform the same; direct filesystem access will be faster. A proof of concept (POC) of the system to be developed is recommended, to measure the performance and see whether it is within acceptable limits; if so, the trade-off in performance might be worth the benefits GridFS gives us. Also, if your application servers are fronted by a CDN, you might not actually need a lot of I/O for the static data stored in GridFS. Since GridFS stores the data across multiple documents in multiple collections, updating them atomically is not possible. And if we know the content is less than 16 MB, which is the case for a lot of user-generated content and small file uploads, we can skip GridFS altogether and store the content in a single document, as BSON supports storing binary content in a document. For more details, refer to the Storing binary data in MongoDB recipe.
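As a minimal shell sketch of that single-document approach (the collection name is hypothetical, and the BinData payload is just the Base64 encoding of a short string):

> db.user_uploads.insert({
    filename: 'hello.txt',
    uploadDate: new Date(),
    data: BinData(0, 'SGVsbG8sIEdyaWRGUyE=')  // the bytes of 'Hello, GridFS!'
  })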

We will rarely use the mongofiles utility to store, retrieve, and delete data from GridFS. Though it may occasionally be used, the majority of the time these operations will be performed from an application. Thus, in the next couple of recipes, we will see how to connect to GridFS to store, retrieve, and delete files using Java and Python clients.

Though this has not much to do with MongoDB itself, OpenStack is an Infrastructure as a Service (IaaS) platform that offers a variety of services for compute, storage, networking, and so on. Its image storage service, Glance, supports a number of persistent stores for images, one of which is MongoDB's GridFS. You can find more information on how to configure Glance to use GridFS at http://docs.openstack.org/trunk/config-reference/content/ch_configuring-openstack-image-service.html.

See also

  • The Storing data to GridFS from a Java client recipe
  • The Storing data to GridFS from a Python client recipe