Manually padding a document

Without going too deep into the internals of storage, MongoDB uses memory-mapped files, which means that the data is stored in files exactly as it would be in memory, and low-level OS services map these pages into memory. Documents are stored in contiguous locations in the Mongo data files, and a problem arises when a document grows and no longer fits in the space allotted to it. In such scenarios, Mongo rewrites the document, with the updated data, toward the end of the collection and clears the space where it was originally placed (note that this space is not released to the OS as free space).

This is not a big problem for applications that don't expect their documents to grow in size. However, it is a big performance hit for applications that foresee growth in document size over a period of time and, potentially, a lot of such document movements. The paddingFactor field, which we saw in the Viewing collection stats recipe, is adjusted by the server over time so that some buffer is allocated for documents to grow. However, this adjustment happens only after a lot of documents have already been moved across the collection. Moreover, at the time of writing, the padding factor cannot be set for a collection beforehand, based on your anticipated increase in document size, to counter these rewrites by Mongo; it starts at a default value of 1. However, there is a small trick that does let you reserve this space up front, and that is what we will see in this recipe. It is a commonly used practice for such requirements.

Getting ready

Nothing is particularly needed for this recipe, unless you plan to try out this simple technique; in which case, you would need a single instance up and running. Refer to the Single node installation of MongoDB recipe in Chapter 1, Installing and Starting the MongoDB Server, for how to start the server.

How to do it…

The idea of this technique is to add some dummy data to the document that is to be inserted. The size of this dummy data, together with the other data in the document, should be approximately the same as the anticipated size of the document.

For example, if the average size of the document is estimated to be around 1200 bytes over a period of time, and there are 300 bytes of data present in the document while inserting it, we will add a dummy field that is around 900 bytes in size, so that the total document size sums up to 1200 bytes.
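The arithmetic in this example can be sketched as a small helper. The name paddingBytes is hypothetical and not part of MongoDB; in the Mongo shell, the current size of a document could be measured with Object.bsonsize(doc) and passed in as the second argument.

```javascript
// Hypothetical helper (not a MongoDB API): number of dummy bytes needed
// so that the inserted document reaches its anticipated size.
function paddingBytes(anticipatedSize, currentSize) {
  // A document that is already larger than anticipated needs no padding.
  return Math.max(0, anticipatedSize - currentSize);
}

// The figures from the example: 1200 bytes anticipated, 300 bytes present.
var exampleBytes = paddingBytes(1200, 300); // 1200 - 300 = 900
```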

Once the document is inserted, we unset this dummy field, which leaves a hole in the file between two consecutive documents. This empty space will then be used when the document grows over a period of time, minimizing the document's movements. This is not a foolproof method, as any document growing beyond the anticipated size will still have to be copied by the server to the end of the collection. Also, documents that never grow to the anticipated size waste disk space.

The applications can come up with an intelligent strategy to, perhaps, adjust the size of the padding field based on a field in the document to take care of these shortcomings. However, this is something that is up to the application developers.
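One possible shape for such a strategy, purely as an illustration, is a lookup keyed on a field of the document. Everything named here is an assumption made for the sketch: the type field, the size table, its values, and the function anticipatedSizeFor are all hypothetical and would have to come from your own measurements.

```javascript
// Hypothetical table of anticipated final sizes (in bytes) per document type.
// The keys and values are made up for illustration only.
var ANTICIPATED_SIZE_BY_TYPE = { comment: 600, post: 4000 };

// Fallback used when the document's type is missing or unknown.
var DEFAULT_ANTICIPATED_SIZE = 1200;

// Hypothetical helper: choose the anticipated size from the document's
// (assumed) type field.
function anticipatedSizeFor(doc) {
  var size = ANTICIPATED_SIZE_BY_TYPE[doc.type];
  // The typeof check also rejects inherited properties such as 'constructor'.
  return typeof size === 'number' ? size : DEFAULT_ANTICIPATED_SIZE;
}
```

The returned value would then decide how much dummy data to add before the insert.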

Let us now see a sample of this approach:

  1. We define a small function that will add a field called padField with an array of string values to the document as follows:
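    function padDocument(doc) {
      // Add a dummy array field whose only purpose is to reserve space
      doc.padField = [];
      // Declare i locally so that it does not leak into the global scope
      for (var i = 0; i < 20; i++) {
        doc.padField[i] = 'Dummy';
      }
    }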
    

    It adds an array called padField to the document and fills it with the string Dummy 20 times. There is no restriction on the type you add to the document or on how many times it is added, as long as it consumes the space you desire. The preceding code snippet is just a sample.

  2. The next step is to insert a document. We will define another function called insert in the following manner:
    function insert(collection, doc) {
      // Declare _id locally so that it does not leak into the global scope
      var _id;
      //1. Pad the document with padField
      padDocument(doc);
      //2. Create or store the _id field that would be used later
      if (typeof doc._id === 'undefined') {
        _id = ObjectId();
        doc._id = _id;
      }
      else {
        _id = doc._id;
      }
      //3. Insert the document with the padded field
      collection.insert(doc);
      //4. Remove the padded field. Use the saved _id to find the document to be updated.
      collection.update({'_id': _id}, {$unset: {'padField': 1}});
    }
    
  3. We will now put this into action by inserting a document in the testCol collection in the following manner:
    insert(db.testCol, {i:1})
    
  4. You may query the testCol collection using the following query and check whether the inserted document exists or not:
    > db.testCol.findOne({i:1})
    

Note that on querying, you will not find padField in the document returned from the testCol collection. However, the space once occupied by the array remains between this document and the ones inserted after it, even though the field was unset.
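The padDocument function above always adds the same 20-element array. A variant that takes the padding size in bytes might look like the following sketch. The name padDocumentToSize is hypothetical, and the result is only approximate: the sketch assumes single-byte ASCII characters, and BSON adds a few bytes of overhead for the field name, type, and string length.

```javascript
// Hypothetical variant of padDocument: pads with a single string of roughly
// padBytes bytes instead of a fixed 20-element array.
function padDocumentToSize(doc, padBytes) {
  // Guard against negative or fractional sizes.
  var n = Math.max(0, Math.floor(padBytes));
  // Joining n + 1 empty array slots with 'x' yields a string of n characters;
  // each ASCII character occupies one byte when stored.
  doc.padField = new Array(n + 1).join('x');
  return doc;
}
```

In the insert function of this recipe, a call such as padDocumentToSize(doc, 900) could stand in for padDocument(doc).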

How it works…

The insert function is self-explanatory and has comments in it to tell you what it does. An obvious question is, how can we be sure this is indeed what we intended to do? For this purpose, we shall do a small activity as follows. We will work on a manualPadTest collection for this purpose. From the Mongo shell, execute the following commands:

> db.manualPadTest.drop()
> db.manualPadTest.insert({i:1})
> db.manualPadTest.insert({i:2})
> db.manualPadTest.stats()

Take note of the avgObjSize field in the stats. Next, execute the following commands from the Mongo shell:

> db.manualPadTest.drop()
> insert(db.manualPadTest , {i:1})
> insert(db.manualPadTest , {i:2})
> db.manualPadTest.stats()

Take note of the avgObjSize field in the stats again. This figure is much larger than the one we saw earlier for a regular insert without padding. The paddingFactor field is still 1 in both cases, but in the latter case each document has more buffer in which to grow.

One catch in the insert function we used in this recipe is that the insert into the collection and the update document operations are not atomic.
