Understanding and analyzing oplogs

The oplog is a special collection that forms the backbone of MongoDB replication. When any write operation or configuration change is performed on the replica set's primary, it is written to the oplog on the primary. All the secondary members then tail this collection to get the changes to be replicated. Tailing is analogous to the tail -f command in Unix and can only be done on a special type of collection called a capped collection. Capped collections are fixed-size collections that maintain the insertion order, just like a queue. When the collection's allocated space becomes full, the oldest data is overwritten. If you are not familiar with capped collections and tailable cursors, refer to the Creating and tailing capped collection cursors in MongoDB recipe in Chapter 5, Advanced Operations, for more details.
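If you want a quick taste of tailing before diving into the oplog itself, the following shell sketch creates a tiny capped collection and opens a tailable cursor on it (the collection name capped_demo and the 4096-byte size are arbitrary choices for illustration):

> db.createCollection('capped_demo', {capped:true, size:4096})
> db.capped_demo.insert({msg:'first'})
> var cur = db.capped_demo.find().addOption(DBQuery.Option.tailable)
> cur.next()

The cursor remains open after returning the last available document, so subsequent inserts into capped_demo can be read from it without issuing a new query.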

The oplog is a capped collection present in the non-replicated database called local. In the previous recipe, we saw what the local database is and what collections are present in it. The oplog is something we didn't discuss there, as it demands a lot more explanation and a dedicated recipe is needed to do it justice.

Getting ready

Refer to the Starting multiple instances as part of a replica set recipe in Chapter 1, Installing and Starting the MongoDB Server, for the prerequisites and replica set basics. Go ahead and set up a simple three-node replica set on your computer as described in that recipe. Then start the Mongo shell and connect to the primary member of the replica set.
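If you are not sure which member is currently the primary, the following helper, run from a shell connected to any member, prints the host and port of the primary:

> rs.isMaster().primary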

How to do it…

  1. Execute the following commands after connecting to the primary from the shell to get the timestamp of the last operation present in the oplog. We are interested in looking at the operations performed after this time.
    > use test
    > local = db.getSisterDB('local')
    > var cutoff = local.oplog.rs.find().sort({ts:-1}).limit(1).next().ts
    
  2. Execute the following command from the shell. Keep the output in the shell or copy it somewhere. We will analyze it later.
    > local.system.namespaces.findOne({name:'local.oplog.rs'})
    
  3. Insert 10 documents as follows:
    > for(i = 0; i < 10; i++) db.oplogTest.insert({'i':i})
    
  4. Execute the following update operation to set a string value for all documents with the value of i greater than 5, which are 6, 7, 8, and 9 in our case. It is a multiupdate operation:
    > db.oplogTest.update({i:{$gt:5}}, {$set:{val:'str'}}, false, true)
    
  5. Now create the index as follows:
    > db.oplogTest.ensureIndex({i:1}, {background:1})
    
  6. Execute the following query on the oplog:
    > local.oplog.rs.find({ts:{$gt:cutoff}}).pretty()
    

How it works…

For those aware of messaging and its terminologies, the oplog can be looked at as a topic in the messaging world with one producer, the primary instance, and multiple consumers, the secondary instances. The primary writes to the oplog all the content that needs to be replicated. Thus, any create, update, and delete operations, as well as any reconfigurations of the replica set, will be written to the oplog, and the secondary instances will tail the collection (continuously read its contents as they are added, similar to the tail command with the -f option in Unix) to get the documents written by the primary. If a secondary has slaveDelay configured, it will not apply an operation from the oplog until at least slaveDelay seconds have passed since that operation was performed on the primary.
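As a sketch of how such a delayed member is configured (this is not part of this recipe's steps, and the member index 2 is an arbitrary choice), the replica set can be reconfigured from the primary as follows:

> var cfg = rs.conf()
> cfg.members[2].priority = 0      // a delayed member should never become primary
> cfg.members[2].hidden = true     // and is typically hidden from clients
> cfg.members[2].slaveDelay = 3600 // apply operations only after they are an hour old
> rs.reconfig(cfg)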

We started by saving an instance of the local database in the variable called local and identified a cutoff time that we will use to query all the operations we will perform in this recipe from the oplog.

Executing a query on the system.namespaces collection in the local database shows us that the oplog is a capped collection with a fixed size. For performance reasons, capped collections are allocated contiguous space on the filesystem, and this space is preallocated. The size allocated by the server depends on the OS and CPU architecture. The oplogSize option can be provided when starting the server to specify the size of the oplog. The defaults are generally good enough for most cases; however, for development purposes, one may choose to override this value with a smaller one. Oplogs are capped collections that need space preallocated on the disk. This preallocation not only takes time when the replica set is first initialized, but also takes up a fixed amount of disk space. For development purposes, we generally start multiple MongoDB processes as part of the same replica set on the same machine and want them to be up and running as quickly as possible with minimal resource usage. Also, having the entire oplog in memory becomes possible if the oplog size is small. For all these reasons, it is advisable to start local instances for development purposes with a small oplog size.
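For instance, a member can be started with a 100 MB oplog as follows (the replica set name, data path, and port are placeholders; adapt them to your Chapter 1 setup):

mongod --replSet replSetTest --dbpath /data/n1 --port 27000 --oplogSize 100

Once connected from the shell, db.printReplicationInfo() reports the configured oplog size and the time range the oplog currently spans:

> db.printReplicationInfo()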

We performed some operations: inserting 10 documents, updating four documents using a multiupdate operation, and creating an index. If we query the oplog for entries after the cutoff we computed earlier, we see 10 documents, one for each insert. Such a document looks as follows:

{
        "ts" : Timestamp(1392402144, 1),
        "h" : NumberLong("-4661965417977826137"),
        "v" : 2,
        "op" : "i",
        "ns" : "test.oplogTest",
        "o" : {
                "_id" : ObjectId("52fe5edfd473d2f623718f51"),
                "i" : 0
        }
}

As seen in the previous example, we first look at three fields, namely op, ns, and o. These stand for the operation, the fully qualified name of the collection into which the data is being inserted, and the actual object to be inserted, respectively. The operation i stands for the insert operation. Note that the value of o, which is the document to be inserted, contains the _id field that was generated on the primary. We should see 10 such documents, one for each insert. What is interesting is what happens on a multiupdate operation. The primary puts four documents into the oplog, one for each document affected by the update. In this case, the op value is u, for update, and the query used to match the document is not the one we gave in the update function; rather, it is a query that uniquely finds the document based on its _id field (stored in the o2 field). As there is already an index in place for the _id field (created automatically for each collection), this operation to find the document to be updated is not expensive. The value of the o field is the same as the document we passed to the update function from the shell. The sample document in the oplog for the update is as follows:

{
    "ts" : Timestamp(1392402620, 1),
    "h" : NumberLong("-7543933489976433166"),
    "v" : 2,
    "op" : "u",
    "ns" : "test.oplogTest",
    "o2" : {
            "_id" : ObjectId("52fe5edfd473d2f623718f57")
    },
    "o" : {
            "$set" : {
                    "val" : "str"
            }
    }
}
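We can cross-check this from the shell by counting the update entries recorded after our cutoff; since the multiupdate matched four documents, the following should return 4:

> local.oplog.rs.find({ts:{$gt:cutoff}, op:'u', ns:'test.oplogTest'}).count()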

The update in the oplog is the same as the one we provided because the $set operation is idempotent, which means the same operation can safely be applied any number of times.

However, an update using the $inc operator is not idempotent. Let us execute the following update query:

> db.oplogTest.update({i:9}, {$inc:{i:1}})

In this case, the oplog will have the following output as the value of o:

"o" : {
    "$set" : {
           "i" : 10
     }
}

MongoDB smartly puts this non-idempotent operation into the oplog as an idempotent operation, with the value of i set to the value it is expected to have after the increment has been applied exactly once. Thus, it is safe to replay an oplog any number of times without corrupting the data.
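To convince yourself of this, replay the stored form of the operation against a scratch collection (idemTest is a hypothetical name used only for this illustration):

> db.idemTest.insert({i:9})
> db.idemTest.update({}, {$set:{i:10}})  // the form stored in the oplog
> db.idemTest.update({}, {$set:{i:10}})  // replaying it is harmless; i remains 10

Had the original {$inc:{i:1}} been stored instead, each replay would have incremented i again, leaving the secondary with a different value than the primary.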

Finally, we can see that the index creation process is put into the oplog as an insert operation into the system.indexes collection. However, there is something to remember about index creation up to version 2.4 of MongoDB: an index creation, whether run in the foreground or background on the primary, is always built in the foreground on a secondary, and thus, for that period, replication will not happen on that secondary instance. For large collections, index creation can take hours; the size of the oplog is therefore very important to let the secondary catch up on everything it hasn't replicated since the index creation started. However, since version 2.6, index creation initiated in the background on the primary will also be built in the background on secondary instances.
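For reference, the oplog entry for our index build looks roughly like the following on these versions (the ts, h, and v fields are elided here, and the values shown are illustrative):

{
        "op" : "i",
        "ns" : "test.system.indexes",
        "o" : {
                "v" : 1,
                "key" : { "i" : 1 },
                "ns" : "test.oplogTest",
                "name" : "i_1",
                "background" : true
        }
}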

For more details on the index creation on replica sets, visit http://docs.mongodb.org/master/tutorial/build-indexes-on-replica-sets/.
