Understanding and analyzing oplogs

The oplog is a special collection that forms the backbone of MongoDB replication. Whenever any write operation or configuration change is performed on the replica set's primary, it is written to the oplog on the primary. All the secondary members then tail this collection to get the changes to be replicated. Tailing is analogous to the tail -f command in Unix and can only be done on a special type of collection called a capped collection. Capped collections are fixed-size collections that maintain insertion order, just like a queue. When the collection's allocated space becomes full, the oldest data is overwritten. If you are not aware of capped collections and tailable cursors, refer to the recipe Creating and tailing a capped collection cursors in MongoDB in Chapter 5, Advanced Operations for more details.

The oplog is a capped collection that lives in the non-replicated database called local. In the previous recipe, we saw what the local database is and what collections are present in it. The oplog is something we didn't discuss in the last recipe, as it demands a lot more explanation and deserves a dedicated recipe to do it justice.

Getting ready

Refer to the recipe Starting multiple instances as part of a replica set from Chapter 1, Installing and Starting the Server for the prerequisites and for replica set basics. Go ahead and set up a simple three-node replica set on your computer as mentioned in that recipe. Then start the mongo shell and connect to the primary member of the replica set.

How to do it…

  1. Execute the following steps after connecting to the primary from the shell to get the timestamp of the last operation present in the oplog. We will be interested in the operations that occur after this time.
    > use test
    > local = db.getSisterDB('local')
    > var cutoff = local.oplog.rs.find().sort({ts:-1}).limit(1).next().ts
    
  2. Execute the following from the shell. Keep the output in the shell or copy it to some place. We will analyze it later:
    > local.system.namespaces.findOne({name:'local.oplog.rs'})
    
  3. Insert 10 documents as follows:
    > for(i = 0; i < 10; i++) db.oplogTest.insert({'i':i})
    
  4. Execute the following update to set a string value for all documents with a value of i greater than 5, which is 6, 7, 8, and 9 in our case (four documents). It will be a multi-update operation:
    > db.oplogTest.update({i:{$gt:5}}, {$set:{val:'str'}}, false, true)
    
  5. Now, create the index as follows:
    > db.oplogTest.ensureIndex({i:1}, {background:1})
    
  6. Execute the following query on oplog:
    > local.oplog.rs.find({ts:{$gt:cutoff}}).pretty()
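
    If other writes, such as replication noise from other databases, clutter the output of step 6, the same query can be narrowed to just the collection used in this recipe. This reuses the local and cutoff variables defined in step 1 and needs the same live mongo shell session, so it is shown only as a sketch:

    > local.oplog.rs.find({ts:{$gt:cutoff}, ns:'test.oplogTest'}).pretty()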
    

How it works…

For those aware of messaging terminology, the oplog can be looked at as a topic in the messaging world with one producer, the primary instance, and multiple consumers, the secondary instances. The primary writes to the oplog all the content that needs to be replicated. Thus, any create, update, or delete operation, as well as any reconfiguration of the replica set, is written to the oplog, and the secondary instances tail the collection (continuously read the contents being added to it, similar to the tail command with the -f option in Unix) to get the documents written by the primary. If a secondary has slaveDelay configured, it will not apply operations from the oplog until they are older than the configured slaveDelay, so it deliberately lags behind the primary by that duration.
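The capped, tailable behavior described above can be sketched in plain JavaScript, runnable in Node. The class and method names here are illustrative only, not the MongoDB driver API:

```javascript
// Minimal sketch of a capped collection: fixed capacity, insertion order
// preserved, oldest entries overwritten when full.
class CappedLog {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = [];                 // oldest first, like a queue
  }
  append(doc) {
    this.entries.push(doc);
    if (this.entries.length > this.capacity) {
      this.entries.shift();            // overwrite (drop) the oldest entry
    }
  }
  // "Tail" the log: return every entry newer than the given timestamp,
  // the way a secondary reads everything after the last op it applied.
  tailAfter(ts) {
    return this.entries.filter(e => e.ts > ts);
  }
}

const oplog = new CappedLog(3);
[1, 2, 3, 4].forEach(ts => oplog.append({ ts, op: 'i' }));

console.log(oplog.entries.map(e => e.ts));     // [ 2, 3, 4 ] -- ts 1 was overwritten
console.log(oplog.tailAfter(2).map(e => e.ts)); // [ 3, 4 ]
```

Note the consequence for replication: once an entry is overwritten before a secondary has read it, that secondary can no longer catch up from the oplog alone.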

We started by saving an instance of the local database in the variable called local and identified a cutoff time that we would use for querying all the operations we will perform in this recipe from the oplog.

Executing a query on the system.namespaces collection in the local database shows us that the oplog is a capped collection with a fixed size. For performance reasons, capped collections are allocated contiguous, preallocated space on the filesystem. The size the server allocates by default depends on the OS and CPU architecture; the oplogSize option can be provided at server startup to specify the size of the oplog explicitly. The defaults are generally good enough for most cases, but for development purposes you can override this with a smaller value. Since oplogs are capped collections, their space must be preallocated on disk; this preallocation not only takes time when the replica set is first initialized but also occupies a fixed amount of disk space. For development, we generally start multiple MongoDB processes of the same replica set on one machine and want them up and running as quickly as possible with minimum resource usage. A small oplog also makes it possible for the entire oplog to fit in memory. For all these reasons, it is advisable to start local instances for development with a small oplog size.
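As a concrete illustration, a development replica set member can be started with a deliberately small oplog. The 50 MB size, dbpath, and port below are arbitrary illustrative choices, not recommendations; oplogSize is specified in megabytes:

```
mongod --replSet replSetTest --oplogSize 50 --dbpath /data/node1 --port 27000
```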

We performed some operations: inserting 10 documents, updating four documents with a multi-update, and creating an index. If we query the oplog for entries after the cutoff we computed earlier, we see one document for each insert. Each looks something like this:

{
        "ts" : Timestamp(1392402144, 1),
        "h" : NumberLong("-4661965417977826137"),
        "v" : 2,        "op" : "i",
        "ns" : "test.oplogTest",
        "o" : {
                "_id" : ObjectId("52fe5edfd473d2f623718f51"),
                "i" : 0
        }
}

Let's first look at three fields: op, ns, and o. These stand for the operation, the fully qualified name of the collection into which the data is being inserted, and the actual object to be inserted. The operation i stands for insert. Note that the value of o, the document being inserted, contains the _id field that was generated on the primary. We should see 10 such documents, one for each insert. What is interesting is what happens on a multi-update operation: the primary puts four documents in the oplog, one for each document affected by the update. In this case, the value of op is u, for update, and the query used to match the document is not the one we gave in the update function, but a query that uniquely finds the document by its _id field. Since there is always an index in place on the _id field (created automatically for each collection), finding the document to be updated is not expensive. The value of the field o is the same modifier document we passed to the update function from the shell. A sample oplog document for the update is as follows:
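The way a secondary consumes these entries can be sketched as a tiny applier in plain JavaScript, runnable in Node. This is a sketch of the idea only, not the real replication code, and the entries below mirror the oplog documents shown in this recipe:

```javascript
// Apply one oplog-style entry to an in-memory "collection" keyed by _id.
// Handles the two ops seen above: 'i' (insert) and 'u' (update with $set).
function applyOp(collection, entry) {
  if (entry.op === 'i') {
    collection.set(entry.o._id, { ...entry.o });     // o is the full document
  } else if (entry.op === 'u') {
    const doc = collection.get(entry.o2._id);        // o2 finds the doc by _id
    Object.assign(doc, entry.o.$set);                // o carries the modifier
  }
  return collection;
}

const coll = new Map();
const ops = [
  { op: 'i', ns: 'test.oplogTest', o: { _id: 1, i: 6 } },
  { op: 'i', ns: 'test.oplogTest', o: { _id: 2, i: 7 } },
  // A multi-update is logged as one entry per matched document:
  { op: 'u', ns: 'test.oplogTest', o2: { _id: 1 }, o: { $set: { val: 'str' } } },
  { op: 'u', ns: 'test.oplogTest', o2: { _id: 2 }, o: { $set: { val: 'str' } } },
];
ops.forEach(e => applyOp(coll, e));

console.log(coll.get(1)); // { _id: 1, i: 6, val: 'str' }
```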

{
    "ts" : Timestamp(1392402620, 1),
    "h" : NumberLong("-7543933489976433166"),
    "v" : 2,
    "op" : "u",
    "ns" : "test.oplogTest",
    "o2" : {
            "_id" : ObjectId("52fe5edfd473d2f623718f57")
    },
    "o" : {
            "$set" : {
                    "val" : "str"
            }
    }
}

The update in the oplog is the same as the one we provided because the $set operation is idempotent, which means you can safely apply it any number of times with the same result.

However, an update using the $inc operator is not idempotent. Let's execute the following update:

> db.oplogTest.update({i:9}, {$inc:{i:1}})

In this case, the oplog will have the following as the value of o:

"o" : {
    "$set" : {
           "i" : 10
     }
}

MongoDB smartly puts this non-idempotent operation into the oplog as an idempotent one, with the value of i set to the value it is expected to have after the increment has been applied once. This makes it safe to replay an oplog any number of times without corrupting the data.
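The safety of this translation can be checked in plain JavaScript, a sketch rather than MongoDB code: applying the logged $set twice leaves the document unchanged, whereas replaying the original $inc twice would not.

```javascript
// The update we issued vs. the entry MongoDB logs for it.
const doc = { _id: 1, i: 9 };
const loggedOp = { $set: { i: 10 } };   // what the oplog records for {$inc: {i: 1}}

const applySet = (d, op) => Object.assign({}, d, op.$set);
const applyInc = (d, op) => {
  const out = { ...d };
  for (const [k, v] of Object.entries(op.$inc)) out[k] = (out[k] || 0) + v;
  return out;
};

// Replaying the idempotent $set any number of times is safe:
const once  = applySet(doc, loggedOp);
const twice = applySet(once, loggedOp);
console.log(once.i, twice.i);           // 10 10

// Replaying the raw $inc is not -- the value drifts on each replay:
const incTwice = applyInc(applyInc(doc, { $inc: { i: 1 } }), { $inc: { i: 1 } });
console.log(incTwice.i);                // 11
```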

Finally, we can see that the index creation process is put into the oplog as an insert operation into the system.indexes collection. For large collections, index creation can take hours, so the size of the oplog is very important: it must retain enough history for the secondary to catch up on everything it hasn't replicated since the index creation started. Since version 2.6, however, an index build initiated in the background on the primary is also built in the background on the secondary instances.

For more details on the index creation on replica sets, visit the following URL: http://docs.mongodb.org/master/tutorial/build-indexes-on-replica-sets/.
