Chapter 11. Replication

This chapter covers

  • Understanding basic replication concepts
  • Connecting a driver to a replica set
  • Administering replica sets and handling failover
  • Increasing the durability of writes with write concerns
  • Optimizing your reads with the read preference
  • Managing complex replica sets using tagging

Replication is central to most database management systems because of one inevitable fact: failures happen. If you want your live production data to be available even after a failure, you need to be sure that your production databases are available on more than one machine. Replication provides data protection, high availability, and disaster recovery.

We’ll begin this chapter by introducing replication and discussing its main use cases. Then we’ll cover MongoDB’s replication through a detailed study of replica sets. Finally, we’ll describe how to connect to replicated MongoDB clusters using the drivers, how to use write concerns, and how to load-balance reads across replicas.

11.1. Replication overview

Replication is the distribution and maintenance of data across multiple MongoDB servers (nodes). MongoDB can copy your data to one or more nodes and constantly keep them in sync when changes occur. This type of replication is provided through a mechanism called replica sets, in which a group of nodes is configured to automatically synchronize its data and fail over when a node disappears. MongoDB also supports an older method of replication called master-slave; it’s now considered deprecated but is still supported and can be used in MongoDB v3.0. For both methods, a single primary node receives all writes, and all secondary nodes read and apply those writes to themselves asynchronously.

Master-slave replication and replica sets use the same replication mechanism, but replica sets additionally ensure automated failover: if the primary node goes offline for any reason, one of the secondary nodes will automatically be promoted to primary, if possible. Replica sets provide other enhancements too, such as easier recovery and more sophisticated deployment topologies. For these reasons you’d rarely want to use simple master-slave replication.[1] Replica sets are thus the recommended replication strategy for production deployments; we’ll devote the bulk of this chapter to explanations and examples of replica sets, with only a brief overview of master-slave replication.

1

The only time you should opt for MongoDB’s master-slave replication is when you need more nodes than a replica set allows: a replica set can have no more than 50 members. This should never happen under normal circumstances.

It’s also important to understand the pitfalls of replication, most importantly the possibility of a rollback. In a replica set, data isn’t considered truly committed until it’s been written to a majority of member nodes, meaning more than 50% of the servers; in a replica set with only two servers, therefore, no server can ever be down. If the primary node in a replica set fails before its writes have been replicated, the other members will continue accepting writes, and any unreplicated data must be rolled back, meaning it can no longer be read. We’ll describe this scenario in detail next.

11.1.1. Why replication matters

All databases are vulnerable to failures of the environments in which they run. Replication provides a kind of insurance against these failures. What sort of failure are we talking about? Here are some of the more common scenarios:

  • The network connection between the application and the database is lost.
  • Planned downtime prevents the server from coming back online as expected. Most hosting providers must schedule occasional downtime, and the results of this downtime aren’t always easy to predict. A simple reboot will keep a database server offline for at least a few minutes. Then there’s the question of what happens when the reboot is complete. For example, newly installed software or hardware can prevent MongoDB or even the operating system from starting up properly.
  • There’s a loss of power. Although most modern datacenters feature redundant power supplies, nothing prevents user error within the datacenter itself or an extended brownout or blackout from shutting down your database server.
  • A hard drive fails on the database server. Hard drives have a mean time to failure of a few years and fail more often than you might think.[2] Even if it’s acceptable to have occasional downtime for your MongoDB, it’s probably not acceptable to lose your data if a hard drive fails. It’s a good idea to have at least one copy of your data, which replication provides.

    2

    You can read a detailed analysis of consumer hard drive failure rates in Google’s article “Failure Trends in a Large Disk Drive Population” (http://research.google.com/archive/disk_failures.pdf).

In addition to protecting against external failures, replication has been particularly important for MongoDB’s durability. When running without journaling enabled, MongoDB’s data files aren’t guaranteed to be free of corruption in the event of an unclean shutdown; with journaling enabled, the data files can’t become corrupted this way. Without journaling, replication should always be run to guarantee a clean copy of the data files if a single node shuts down hard.

Of course, replication is desirable even when running with journaling. After all, you still want high availability and fast failover. In this case, journaling expedites recovery because it allows you to bring failed nodes back online simply by replaying the journal. This is much faster than resyncing from an existing replica to recover from failure.

It’s important to note that although they’re redundant, replicas aren’t a replacement for backups. A backup represents a snapshot of the database at a particular time in the past, whereas a replica is always up to date. There are cases where a data set is large enough to render backups impractical, but as a general rule, backups are prudent and recommended even when running with replication. In other words, backups are there in case of a logical failure, such as accidental data loss or data corruption.

We highly recommend running a production MongoDB instance with both replication and journaling, unless you’re prepared to lose data; to do otherwise should be considered poor deployment practice. When (not if) your application experiences a failure, work invested in thinking through and setting up replication will pay dividends.

11.1.2. Replication use cases and limitations

You may be surprised at how versatile a replicated database can be. In particular, replication facilitates redundancy, failover, maintenance, and load balancing. Let’s take a brief look at each of these use cases.

Replication is designed primarily for redundancy. It ensures that replicated nodes stay in sync with the primary node. These replicas can live in the same datacenter as the primary, or they can be distributed geographically as an additional failsafe. Because replication is asynchronous, any sort of network latency or partition between nodes will have no effect on the performance of the primary. As another form of redundancy, replicated nodes can also be delayed by a constant number of seconds, minutes, or even hours behind the primary. This provides insurance against the case where a user inadvertently drops a collection or an application somehow corrupts the database. Normally, these operations will be replicated immediately; a delayed replica gives administrators time to react and possibly save their data.

Another use case for replication is failover. You want your systems to be highly available, but this is possible only with redundant nodes and the ability to switch over to those nodes in an emergency. Conveniently, MongoDB’s replica sets almost always make this switch automatically.

In addition to providing redundancy and failover, replication simplifies maintenance, usually by allowing you to run expensive operations on a node other than the primary. For example, it’s common practice to run backups on a secondary node to keep unnecessary load off the primary and to avoid downtime. Building large indexes is another example. Because index builds are expensive, you may opt to build on a secondary node first, swap the secondary with the existing primary, and then build again on the new secondary.

Finally, replication allows you to balance reads across replicas. For applications whose workloads are overwhelmingly read-heavy, this is the easiest, or if you prefer, the most naïve, way to scale MongoDB. But for all its promise, a replica set doesn’t help much if any of the following apply:

  • The allotted hardware can’t process the given workload. As an example, we mentioned working sets in the previous chapter. If your working data set is much larger than the available RAM, then sending random reads to the secondaries likely won’t improve your performance as much as you might hope. In this scenario, performance becomes constrained by the number of I/O operations per second (IOPS) your disk can handle—generally around 80–100 for non-SSD hard drives. Reading from a replica increases your total IOPS, but going from 100 to 200 IOPS may not solve your performance problems, especially if writes are occurring at the same time and consuming a portion of that number. In this case, sharding may be a better option.
  • The ratio of writes to reads exceeds 50%. This is an admittedly arbitrary ratio, but it’s a reasonable place to start. The issue here is that every write to the primary must eventually be written to all the secondaries as well. Therefore, directing reads to secondaries that are already processing a lot of writes can sometimes slow the replication process and may not result in increased read throughput.
  • The application requires consistent reads. Secondary nodes replicate asynchronously and therefore aren’t guaranteed to reflect the latest writes to the primary node. In pathological cases, secondaries can run hours behind. As we’ll explain later, you can guarantee your writes go to secondaries before returning to the driver, but this approach carries a latency cost.

Replica sets are excellent for scaling reads that don’t require immediate consistency, but they won’t help in every situation. If you need to scale and any of the preceding conditions apply, then you’ll need a different strategy, involving sharding, augmented hardware, or some combination of the two.

11.2. Replica sets

Replica sets are the recommended MongoDB replication strategy. We’ll start by configuring a sample replica set. We’ll then describe how replication works because this knowledge is incredibly important for diagnosing production issues. We’ll end by discussing advanced configuration details, failover and recovery, and best deployment practices.

11.2.1. Setup

The minimum recommended replica set configuration consists of three nodes, because in a replica set with only two nodes you can’t have a majority in case the primary server goes down. A three-member replica set can have either three members that hold data or two members that hold data and an arbiter. The primary is the only member in the set that can accept write operations. Replica set members go through a process in which they “elect” a new primary by voting. If a primary becomes unavailable, elections allow the set to recover normal operations without manual intervention. Unfortunately, if a majority of the replica set is inaccessible or unavailable, the replica set cannot accept writes and all remaining members become read-only. You may consider adding an arbiter to a replica set if it has an equal number of nodes in two places where network partitions between the places are possible. In such cases, the arbiter will break the tie between the two facilities and allow the set to elect a new primary.

In the minimal configuration, two of these three nodes serve as first-class, persistent mongod instances. Either can act as the replica set primary, and both have a full copy of the data. The third node in the set is an arbiter, which doesn’t replicate data but merely acts as a kind of observer. Arbiters are lightweight mongod servers that participate in the election of a primary but don’t replicate any of the data. You can see an illustration of the replica set you’re about to set up in figure 11.1. The arbiter is located at the secondary data center on the right.

Figure 11.1. A basic replica set consisting of a primary, a secondary, and an arbiter

Now let’s create a simple three-node replica set to demonstrate how to do it. Normally you would create a replica set with each member on a separate machine. To keep this tutorial simple, we’re going to start all three on a single machine. Each MongoDB instance we start is identified by its hostname and its port; running the set locally means that when we connect, we’ll use the local hostname for all three and start each on a separate port.

Begin by creating a data directory for each replica set member:

mkdir ~/node1
mkdir ~/node2
mkdir ~/arbiter

Next, start each member as a separate mongod. Because you’ll run each process on the same machine, it’s easiest to start each mongod in a separate terminal window:

mongod --replSet myapp --dbpath ~/node1 --port 40000
mongod --replSet myapp --dbpath ~/node2 --port 40001
mongod --replSet myapp --dbpath ~/arbiter --port 40002
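
If you’d rather not keep three terminal windows open, mongod’s standard --fork and --logpath options (available on Unix-like systems) let you run each process in the background instead; for example:

mongod --replSet myapp --dbpath ~/node1 --port 40000 \
    --fork --logpath ~/node1.log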

Note how we tell each mongod that it will be a member of the myapp replica set and that we start each mongod on a separate port. If you examine the mongod log output, the first thing you’ll notice are error messages saying that the configuration can’t be found. This is completely normal:

[rsStart] replSet info you may need to run replSetInitiate
    -- rs.initiate() in the shell -- if that is not already done
[rsStart] replSet can't get local.system.replset config from self
    or any seed (EMPTYCONFIG)

On MongoDB v3.0 the log message will be similar to the following:

2015-09-15T16:27:21.088+0300 I REPL     [initandlisten] Did not find local
replica set configuration document at startup;  NoMatchingDocument Did not
find replica set configuration document in local.system.replset

To proceed, you need to configure the replica set. Do so by first connecting to one of the non-arbiter mongods just started. These instances aren’t running on MongoDB’s default port, so connect to one by running

mongo --port 40000

These examples were produced running these mongod processes locally, so you’ll see the name of the example machine, iron, pop up frequently; substitute your own hostname.

Connect, and then run the rs.initiate() command:[3]

3

Some users have reported trouble with this step because they have the line bind_ip = 127.0.0.1 in their mongod.conf file at /etc/mongod.conf or /usr/local/etc/mongod.conf. If initiating the replica set prints an error, look for and remove that configuration.

> rs.initiate()
{
    "info2" : "no configuration explicitly specified -- making one",
    "me" : "iron.local:40000",
    "info" : "Config now saved locally.  Should come online in about a minute.",
    "ok" : 1
}

On MongoDB v3.0 the output will be similar to the following:

{
    "info2" : "no configuration explicitly specified -- making one",
    "me" : "iron.local:40000",
    "ok" : 1
}

Within a minute or so, you’ll have a one-member replica set. You can now add the other two members using rs.add():

> rs.add("iron.local:40001")
{ "ok" : 1 }
> rs.add("iron.local:40002", {arbiterOnly: true})
{ "ok" : 1 }

On MongoDB v3.0 you can also add an arbiter with the following command:

> rs.addArb("iron.local:40002")
{ "ok" : 1 }

Note that for the last node added, you specify the arbiterOnly option to create an arbiter. Within a minute, all members should be online. To get a brief summary of the replica set status, run the db.isMaster() command:

> db.isMaster()
{
  "setName" : "myapp",
  "ismaster" : true,
  "secondary" : false,
  "hosts" : [
    "iron.local:40001",
    "iron.local:40000"
  ],

  "arbiters" : [
    "iron.local:40002"
  ],
  "primary" : "iron.local:40000",
  "me" : "iron.local:40000",
  "maxBsonObjectSize" : 16777216,
  "maxMessageSizeBytes" : 48000000,
  "localTime" : ISODate("2013-11-06T05:53:25.538Z"),
  "ok" : 1
}

The same command produces the following output on a MongoDB v3.0 machine:

myapp:PRIMARY> db.isMaster()
{
    "setName" : "myapp",
    "setVersion" : 5,
    "ismaster" : true,
    "secondary" : false,
    "hosts" : [
        "iron.local:40000",
        "iron.local:40001"
    ],
    "arbiters" : [
        "iron.local:40002"
    ],
    "primary" : "iron.local:40000",
    "me" : "iron.local:40000",
    "electionId" : ObjectId("55f81dd44a50a01e0e3b4ede"),
    "maxBsonObjectSize" : 16777216,
    "maxMessageSizeBytes" : 48000000,
    "maxWriteBatchSize" : 1000,
    "localTime" : ISODate("2015-09-15T13:37:13.798Z"),
    "maxWireVersion" : 3,
    "minWireVersion" : 0,
    "ok" : 1
}

A more detailed view of the system is provided by the rs.status() method. You’ll see state information for each node. Here’s the complete status listing:

> rs.status()
{
  "set" : "myapp",
  "date" : ISODate("2013-11-07T17:01:29Z"),
  "myState" : 1,
  "members" : [
    {
      "_id" : 0,
      "name" : "iron.local:40000",
      "health" : 1,
      "state" : 1,
      "stateStr" : "PRIMARY",
      "uptime" : 1099,

      "optime" : Timestamp(1383842561, 1),
      "optimeDate" : ISODate("2013-11-07T16:42:41Z"),
      "self" : true
    },
    {
      "_id" : 1,
      "name" : "iron.local:40001",
      "health" : 1,
      "state" : 2,
      "stateStr" : "SECONDARY",
      "uptime" : 1091,
      "optime" : Timestamp(1383842561, 1),
      "optimeDate" : ISODate("2013-11-07T16:42:41Z"),
      "lastHeartbeat" : ISODate("2013-11-07T17:01:29Z"),
      "lastHeartbeatRecv" : ISODate("2013-11-07T17:01:29Z"),
      "pingMs" : 0,
      "lastHeartbeatMessage" : "syncing to: iron.1ocal:40000",
      "syncingTo" : "iron.local:40000"
    },
    {
      "_id" : 2,
      "name" : "iron.local:40002",
      "health" : 1,
      "state" : 7,
      "stateStr" : "ARBITER",
      "uptime" : 1089,
      "lastHeartbeat" : ISODate("2013-11-07T17:01:29Z"),
      "lastHeartbeatRecv" : ISODate("2013-11-07T17:01:29Z"),
      "pingMs" : 0
    }
  ],
        "ok" : 1
}

The rs.status() command produces a slightly different output on a MongoDB v3.0 server:

{
    "set" : "myapp",
    "date" : ISODate("2015-09-15T13:41:58.772Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "iron.local:40000",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 878,
            "optime" : Timestamp(1442324156, 1),
            "optimeDate" : ISODate("2015-09-15T13:35:56Z"),
            "electionTime" : Timestamp(1442323924, 2),
            "electionDate" : ISODate("2015-09-15T13:32:04Z"),
            "configVersion" : 5,
            "self" : true

        },
        {
            "_id" : 1,
            "name" : "iron.local:40001",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 473,
            "optime" : Timestamp(1442324156, 1),
            "optimeDate" : ISODate("2015-09-15T13:35:56Z"),
            "lastHeartbeat" : ISODate("2015-09-15T13:41:56.819Z"),
            "lastHeartbeatRecv" : ISODate("2015-09-15T13:41:57.396Z"),
            "pingMs" : 0,
            "syncingTo" : "iron.local:40000",
            "configVersion" : 5
        },
        {
            "_id" : 2,
            "name" : "iron.local:40002",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 360,
            "lastHeartbeat" : ISODate("2015-09-15T13:41:57.676Z"),
            "lastHeartbeatRecv" : ISODate("2015-09-15T13:41:57.676Z"),
            "pingMs" : 10,
            "configVersion" : 5
        }
    ],
    "ok" : 1
}

Unless your MongoDB database contains a lot of data, the replica set should come online within 30 seconds. During this time, the stateStr field of each node should transition from RECOVERING to PRIMARY, SECONDARY, or ARBITER.

Now even if the replica set status claims that replication is working, you may want to see some empirical evidence of this. Go ahead and connect to the primary node with the shell and insert a document:

$ mongo --port 40000
myapp:PRIMARY> use bookstore
switched to db bookstore
myapp:PRIMARY> db.books.insert({title: "Oliver Twist"})
myapp:PRIMARY> show dbs
bookstore 0.203125GB
local 0.203125GB

Notice how the MongoDB shell prints out the replica set membership status of the instance it’s connected to.

Initial replication of your data should occur almost immediately. In another terminal window, open a new shell instance, but this time point it to the secondary node. Query for the document just inserted; it should have arrived:

$ mongo --port 40001
myapp:SECONDARY> show dbs
bookstore 0.203125GB
local 0.203125GB
myapp:SECONDARY> use bookstore
switched to db bookstore
myapp:SECONDARY> rs.slaveOk()
myapp:SECONDARY> db.books.find()
{ "_id" : ObjectId("4d42ebf28e3c0c32c06bdf20"), "title" : "Oliver Twist" }

If replication is working as displayed here, you’ve successfully configured your replica set. By default, MongoDB attempts to protect you from accidentally querying a secondary because this data will be less current than the primary, where writes occur. You must explicitly allow reads from the secondary in the shell by running rs.slaveOk().

It should be satisfying to see replication in action, but perhaps more interesting is automated failover. Let’s test that now and kill a node. You could kill the secondary, but that merely stops replication, with the remaining nodes maintaining their current status. If you want to see a change of system state, you need to kill the primary. If the primary is running in the foreground of your shell, you can kill it by pressing Ctrl-C; if it’s running in the background, then get its process ID from the mongod.lock file in ~/node1 and run kill <process id> (this sends the default SIGTERM, which mongod traps to shut down cleanly). You can also connect to the primary using the shell and run commands to shut down the server:

$ mongo --port 40000

PRIMARY> use admin
PRIMARY> db.shutdownServer()

Once you’ve killed the primary, note that the secondary detects the lapse in the primary’s heartbeat. The secondary then elects itself primary. This election is possible because a majority of the original nodes (the arbiter and the original secondary) are still able to ping each other. Here’s an excerpt from the secondary node’s log:

Thu Nov  7 09:23:23.091 [rsHealthPoll] replset info iron.local:40000
    heartbeat failed, retrying
Thu Nov  7 09:23:23.091 [rsHealthPoll] replSet info iron.local:40000
    is down (or slow to respond):
Thu Nov  7 09:23:23.091 [rsHealthPoll] replSet member iron.local:40000
    is now in state DOWN
Thu Nov  7 09:23:23.092 [rsMgr] replSet info electSelf 1
Thu Nov  7 09:23:23.202 [rsMgr] replSet PRIMARY

If you connect to the new primary node and check the replica set status, you’ll see that the old primary is unreachable:

$ mongo --port 40001

> rs.status()
...
  {
    "_id" : 0,
    "name" : "iron.local:40000",
    "health" : 0,
    "state" : 8,
    "stateStr" : "(not reachable/healthy)",
    "uptime" : 0,
    "optime" : Timestamp(1383844267, 1),
    "optimeDate" : ISODate("2013-11-07T17:11:07Z"),
    "lastHeartbeat" : ISODate("2013-11-07T17:30:00Z"),
    "lastHeartbeatRecv" : ISODate("2013-11-07T17:23:21Z"),
    "pingMs" : 0
  },
...

Post-failover, the replica set consists of only two nodes. Because the arbiter has no data, your application will continue to function as long as it communicates with the primary node only.[4] Even so, replication isn’t happening, and there’s now no possibility of failover. The old primary must be restored. Assuming that the old primary was shut down cleanly, you can bring it back online, and it’ll automatically rejoin the replica set as a secondary. Go ahead and try that now by restarting the old primary node.

4

Applications sometimes query secondary nodes for read scaling. If that’s happening, this kind of failure will cause read failures, so it’s important to design your application with failover in mind. More on this at the end of the chapter.
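
Bringing the old primary back requires nothing special. Start it with the same parameters you used originally, and it will rejoin the set as a secondary:

mongod --replSet myapp --dbpath ~/node1 --port 40000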

That’s a quick overview of replica sets. Some of the details are, unsurprisingly, messier. In the next two sections, you’ll see how replica sets work and look at deployment, advanced configuration, and how to handle tricky scenarios that may arise in production.

11.2.2. How replication works

Replica sets rely on two basic mechanisms: an oplog and a heartbeat. The oplog enables the replication of data, and the heartbeat monitors health and triggers failover. You’ll now see how both of these mechanisms work in turn. You should begin to understand and predict replica set behavior, particularly in failure scenarios.

All about the oplog

At the heart of MongoDB’s replication stands the oplog. The oplog is a capped collection that lives in a database called local on every replicating node and records all changes to the data. Every time a client writes to the primary, an entry with enough information to reproduce the write is automatically added to the primary’s oplog. Once the write is replicated to a given secondary, that secondary’s oplog also stores a record of the write. Each oplog entry is identified with a BSON timestamp, and all secondaries use the timestamp to keep track of the latest entry they’ve applied.[5]

5

The BSON timestamp is a unique identifier consisting of the number of seconds since the epoch and an incrementing counter. For more details, see http://en.wikipedia.org/wiki/Unix_time.

To better see how this works, let’s look more closely at a real oplog and at the operations recorded in it. First connect with the shell to the primary node started in the previous section and switch to the local database:

myapp:PRIMARY> use local
switched to db local

The local database stores all the replica set metadata and the oplog. Naturally, this database isn’t replicated itself. Thus it lives up to its name; data in the local database is supposed to be unique to the local node and therefore shouldn’t be replicated.

If you examine the local database, you’ll see a collection called oplog.rs, which is where every replica set stores its oplog. You’ll also see a few system collections. Here’s the complete output:

myapp:PRIMARY> show collections
me
oplog.rs
replset.minvalid
slaves
startup_log
system.indexes
system.replset

replset.minvalid contains information for the initial sync of a given replica set member, and system.replset stores the replica set config document. Not all of your mongod servers will have the replset.minvalid collection. me and slaves are used to implement write concerns, described at the end of this chapter, and system.indexes is the standard index spec container.

First we’ll focus on the oplog. Let’s query for the oplog entry corresponding to the book document you added in the previous section. To do so, enter the following query. The resulting document will have four fields, and we’ll discuss each in turn:

> db.oplog.rs.findOne({op: "i"})
{
  "ts" : Timestamp(1383844267, 1),
  "h" : NumberLong("-305734463742602323"),
  "v" : 2,
  "op" : "i",
  "ns" : "bookstore.books",
  "o" : {
    "_id" : ObjectId("527bc9aac2595f18349e4154"),
    "title" : "Oliver Twist"
  }
}

The first field, ts, stores the entry’s BSON timestamp. The timestamp includes two numbers; the first representing the seconds since epoch and the second representing a counter value—1 in this case. To query with a timestamp, you need to explicitly construct a timestamp object. All the drivers have their own BSON timestamp constructors, and so does JavaScript. Here’s how to use it:

db.oplog.rs.findOne({ts: Timestamp(1383844267, 1)})

Returning to the oplog entry, the op field specifies the opcode. This tells the secondary node which operation the oplog entry represents. Here you see an i, indicating an insert. After op comes ns to signify the relevant namespace (database and collection) and then the lowercase letter o, which for insert operations contains a copy of the inserted document.

As you examine oplog entries, you may notice that operations affecting multiple documents are broken up into their component parts. For multi-updates and mass deletes, a separate entry is created in the oplog for each document affected. For example, suppose you add a few more Dickens books to the collection:

myapp:PRIMARY> use bookstore
myapp:PRIMARY> db.books.insert({title: "A Tale of Two Cities"})
myapp:PRIMARY> db.books.insert({title: "Great Expectations"})

Now with four books in the collection, let’s issue a multi-update to set the author’s name:

myapp:PRIMARY> db.books.update({}, {$set: {author: "Dickens"}}, {multi: true})

How does this appear in the oplog?

myapp:PRIMARY> use local
myapp:PRIMARY> db.oplog.rs.find({op: "u"})
{
  "ts" : Timestamp(1384128758, 1),
  "h" : NumberLong("5431582342821118204"),
  "v" : 2,
  "op" : "u",
  "ns" : "bookstore.books",
  "o2" : {
    "_id" : ObjectId("527bc9aac2595f18349e4154")
  },
  "o" : {
    "$set" : {
      "author" : "Dickens"
    }
  }
}
{
  "ts" : Timestamp(1384128758, 2),
  "h" : NumberLong("3897436474689294423"),
  "v" : 2,
  "op" : "u",
  "ns" : "bookstore.books",

  "o2" : {
    "_id" : ObjectId("528020a9f3f61863aba207e7")
  },
  "o" : {
    "$set" : {
      "author" : "Dickens"
    }
  }
}
{
  "ts" : Timestamp(1384128758, 3),
  "h" : NumberLong("2241781384783113"),
  "v" : 2,
  "op" : "u",
  "ns" : "bookstore.books",
  "o2" : {
    "_id" : ObjectId("528020a9f3f61863aba207e8")
  },
  "o" : {
    "$set" : {
      "author" : "Dickens"
    }
  }
}

As you can see, each updated document gets its own oplog entry. This normalization is done as part of the more general strategy of ensuring that secondaries always end up with the same data as the primary. To guarantee this, every applied operation must be idempotent: it can’t matter how many times a given oplog entry is applied; the result must always be the same. The secondaries must also apply the oplog entries in the same order in which they were written to the oplog. Other multidocument operations, like deletes, will exhibit the same behavior. You can try different operations and see how they ultimately appear in the oplog.
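
You can see the idempotency requirement at work in your own shell. An operator like $inc isn’t idempotent (applying it twice gives a different result), so the oplog records the computed result as a $set instead. Here’s a quick experiment; the counters collection exists only for this illustration:

myapp:PRIMARY> use bookstore
myapp:PRIMARY> db.counters.insert({_id: 1, n: 0})
myapp:PRIMARY> db.counters.update({_id: 1}, {$inc: {n: 1}})
myapp:PRIMARY> use local
myapp:PRIMARY> db.oplog.rs.find({ns: "bookstore.counters", op: "u"})

The resulting entry’s o field contains {"$set": {"n": 1}} rather than the original $inc, so replaying it any number of times produces the same document.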

To get some basic information about the oplog’s current status, you can run the shell’s db.getReplicationInfo() method:

myapp:PRIMARY> db.getReplicationInfo()
{
  "logSizeMB" : 192,
  "usedMB" : 0.01,
  "timeDiff" : 286197,
  "timeDiffHours" : 79.5,
  "tFirst" : "Thu Nov 07 2013 08:42:41 GMT-0800 (PST)",
  "tLast" : "Sun Nov 10 2013 16:12:38 GMT-0800 (PST)",
  "now" : "Sun Nov 10 2013 16:19:49 GMT-0800 (PST)"
}

Here you see the timestamps of the first and last entries in this oplog. You can find these oplog entries manually by using the $natural sort modifier. For example, the following query fetches the latest entry:

db.oplog.rs.find().sort({$natural: -1}).limit(1)

The only important thing left to understand about replication is how the secondaries keep track of their place in the oplog. The answer lies in the fact that secondaries also keep an oplog. This is a significant improvement upon master-slave replication, so it’s worth taking a moment to explore the rationale.

Imagine you issue a write to the primary node of a replica set. What happens next? First, the write is recorded and then added to the primary’s oplog. Meanwhile, all secondaries have their own oplogs that replicate the primary’s oplog. When a given secondary node is ready to update itself, it does three things. First, it looks at the timestamp of the latest entry in its own oplog. Next, it queries the primary’s oplog for all entries greater than that timestamp. Finally, it writes the data and adds each of those entries to its own oplog.[6] This means that in case of failover, any secondary promoted to primary will have an oplog that the other secondaries can replicate from. This feature essentially enables replica set recovery.

6

When journaling is enabled, documents are written to the core data files and to the oplog simultaneously in an atomic transaction.
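
In shell terms, the three-step process just described amounts to something like the following sketch. This is only a conceptual illustration (the real sync logic is internal to mongod), and lastApplied is a variable introduced here for clarity:

// the latest entry applied locally, taken from the secondary's own oplog
var lastApplied = db.oplog.rs.find().sort({$natural: -1}).limit(1).next().ts
// conceptually, the query then issued against the primary's oplog
db.oplog.rs.find({ts: {$gt: lastApplied}}).sort({$natural: 1})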

Secondary nodes use long polling to immediately apply new entries from the primary’s oplog. Long polling means the secondary makes a long-lived request to the primary. When the primary receives a modification, it responds to the waiting request immediately. Thus, secondaries will usually be almost completely up to date. When they do fall behind because of network partitions or maintenance on secondaries, the latest timestamp in each secondary’s oplog can be used to monitor any replication lag.
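
The shell provides a helper that performs this check for you, reporting how far behind each secondary’s oplog is (the exact output format varies between MongoDB versions):

myapp:PRIMARY> rs.printSlaveReplicationInfo()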

Master-slave replication

Master-slave replication is the original replication paradigm in MongoDB. This flavor of replication is easy to configure and has the advantage of supporting any number of slave nodes. But master-slave replication is no longer recommended for production deployments. There are a couple reasons for this. First, failover is completely manual. If the master node fails, then an administrator must shut down a slave and restart it as a master node. Then the application must be reconfigured to point to the new master. Second, recovery is difficult. Because the oplog exists only on the master node, a failure requires that a new oplog be created on the new master. This means that any other existing nodes will need to resync from the new master in the event of a failure.

In short, there are few compelling reasons to use master-slave replication. Replica sets are the way forward, and they’re the flavor of replication you should use. If for some reason you must use master-slave replication, consult the MongoDB manual for more information.

Halted replication

Replication will halt permanently if a secondary can’t find the point it’s synced to in the primary’s oplog. When that happens, you’ll see an exception in the secondary’s log that looks like this:

repl: replication data too stale, halting
Fri Jan 28 14:19:27 [replsecondary] caught SyncException

Recall that the oplog is a capped collection. This means that the collection can only hold so much data. If a secondary is offline for an extended period of time, the oplog may not be large enough to store every change made in that period. Once a given secondary fails to find the point at which it’s synced in the primary’s oplog, there’s no longer any way of ensuring that the secondary is a perfect replica of the primary. Because the only remedy for halted replication is a complete resync of the primary’s data, you’ll want to strive to avoid this state. To do that, you’ll need to monitor secondary delay, and you’ll need to have a large enough oplog for your write volume. You’ll learn more about monitoring in chapter 12. Choosing the right oplog size is what we’ll cover next.

Sizing the replication oplog

The oplog is a capped collection; as such, MongoDB v2.6 doesn’t allow you to resize it once it’s been created. This makes it important to choose an initial oplog size carefully. But in MongoDB v3.0 you can change the size of the oplog. The procedure requires you to stop the mongod instance and start it as a standalone instance, modify the oplog size, and restart the member.

The default oplog sizes vary somewhat. On 32-bit systems, the oplog will default to 50 MB, whereas on 64-bit systems, the oplog will be the larger of 1 GB or 5% of free disk space, unless you’re running on Mac OS X, in which case the oplog will be 192 MB. This smaller size is due to the assumption that OS X machines are development machines. For many deployments, 5% of free disk space will be more than enough. One way to think about an oplog of this size is to recognize that once it overwrites itself 20 times, the disk will likely be full (this is true for insert-only workloads).

That said, the default size won’t be ideal for all applications. If you know that your application will have a high write volume, you should do some empirical testing before deploying. Set up replication and then write to the primary at the rate you’ll have in production. You’ll want to hammer the server in this way for at least an hour. Once done, connect to any replica set member and get the current replication information:

db.getReplicationInfo()

Once you know how much oplog you’re generating per hour, you can then decide how much oplog space to allocate. The goal is to eliminate instances where your secondaries get too far behind the primary to catch up using the oplog. You should probably shoot for being able to withstand at least eight hours of secondary downtime. You want to avoid having to completely resync any node, and increasing the oplog size will buy you time in the event of network failures and the like.
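
As a rough sketch of the arithmetic, assuming the oplog contains mostly the test period’s writes, you can derive a suggested size from the db.getReplicationInfo() fields shown earlier and the eight-hour guideline:

var info = db.getReplicationInfo()
// MB of oplog generated per hour during the load test
var mbPerHour = info.usedMB / info.timeDiffHours
// target: withstand at least 8 hours of secondary downtime
print("suggested minimum oplog size: " + Math.ceil(mbPerHour * 8) + " MB")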

If you want to change the default oplog size, you must do so the first time you start each member node using mongod’s --oplogSize option. The value is in megabytes. Thus you can start mongod with a 1 GB oplog like this:[7]

7

For a tutorial on how to resize the oplog, see http://docs.mongodb.org/manual/tutorial/change-oplog-size/.

mongod --replSet myapp --oplogSize 1024

Heartbeat and failover

The replica set heartbeat facilitates election and failover. By default, each replica set member pings all the other members every two seconds. In this way, the system can ascertain its own health. When you run rs.status(), you see the timestamp of each node’s last heartbeat along with its state of health (1 means healthy and 0 means unresponsive).

As long as every node remains healthy and responsive, the replica set will hum along its merry way. But if any node becomes unresponsive, action may be taken. Every replica set wants to ensure that exactly one primary node exists at all times. But this is possible only when a majority of nodes is visible. For example, look back at the replica set you built in the previous section. If you kill the secondary, then a majority of nodes still exists, so the replica set doesn’t change state but simply waits for the secondary to come back online. If you kill the primary, then a majority still exists but there’s no primary. Therefore, the secondary is automatically promoted to primary. If more than one secondary exists, the most current secondary will be the one elected.

But other possible scenarios exist. Imagine that both the secondary and the arbiter are killed. Now the primary remains but there’s no majority—only one of the three original nodes remains healthy. In this case, you’ll see a message like this in the primary’s log:

[rsMgr] can't see a majority of the set, relinquishing primary
[rsMgr] replSet relinquishing primary state
[rsMgr] replSet SECONDARY
[rsMgr] replSet closing client sockets after relinquishing primary

With no majority, the primary demotes itself to a secondary. This may seem puzzling, but think about what might happen if this node were allowed to remain primary. If the heartbeats fail due to some kind of network partition, the other nodes will still be online. If the arbiter and secondary are still up and able to see each other, then according to the rule of the majority, the remaining secondary will become a primary. If the original primary doesn’t step down, you’re suddenly in an untenable situation: a replica set with two primary nodes. If the application continues to run, it might write to and read from two different primaries, a sure recipe for inconsistency and truly bizarre application behavior. Therefore, when the primary can’t see a majority, it must step down.

Commit and rollback

One final important point to understand about replica sets is the concept of a commit. In essence, you can write to a primary node all day long, but those writes won’t be considered committed until they’ve been replicated to a majority of nodes. What do we mean by committed? The idea can best be explained by example.

Please note that operations on a single document are always atomic with MongoDB databases, but operations that involve multiple documents aren’t atomic as a whole.

Imagine again the replica set you built in the previous section. Suppose you issue a series of writes to the primary that don’t get replicated to the secondary for some reason (connectivity issues, secondary is down for backup, secondary is lagging, and so on). Now suppose that the secondary is suddenly promoted to primary. You write to the new primary, and eventually the old primary comes back online and tries to replicate from the new primary. The problem here is that the old primary has a series of writes that don’t exist in the new primary’s oplog. This situation triggers a rollback.

In a rollback, all writes that were never replicated to a majority are undone. This means that they’re removed from both the secondary’s oplog and the collection where they reside. If a secondary has registered a delete, the node will look for the deleted document in another replica and restore it to itself. The same is true for dropped collections and updated documents.

The reverted writes are stored in the rollback subdirectory of the relevant node’s data path. For each collection with rolled-back writes, a separate BSON file will be created, with a filename that includes the time of the rollback. In the event that you need to restore the reverted documents, you can examine these BSON files using the bsondump utility and manually restore them, possibly using mongorestore.

If you ever have to restore rolled-back data, you’ll realize that this is a situation you want to avoid, and fortunately you can, to some extent. If your application can tolerate the extra write latency, you can use write concerns, described later, to ensure that your data is replicated to a majority of nodes on each write (or perhaps after every several writes). Being smart about write concerns and about monitoring of replication lag in general will help you mitigate the problem of rollback, or even avoid it altogether.
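
Write concerns are covered in detail later in the chapter, but as a preview, here’s what requesting majority acknowledgment looks like from a v2.6 or later shell; the wtimeout value caps how long the insert will wait for replication:

myapp:PRIMARY> db.books.insert({title: "Bleak House"}, {writeConcern: {w: "majority", wtimeout: 5000}})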

In this section you learned perhaps a few more replication internals than expected, but the knowledge should come in handy. Understanding how replication works goes a long way in helping you to diagnose any issues you may have in production.

11.2.3. Administration

For all the automation they provide, replica sets have some potentially complicated configuration options. In what follows, we’ll describe these options in detail. In the interest of keeping things simple, we’ll also suggest which options can be safely ignored.

Configuration details

Here we’ll present the mongod startup options pertaining to replica sets, and we’ll describe the structure of the replica set configuration document.

Replication options

Earlier, you learned how to initiate a replica set using the shell’s rs.initiate() and rs.add() methods. These methods are convenient, but they hide certain replica set configuration options. Let’s look at how to use a configuration document to initiate and update a replica set’s configuration.

A configuration document specifies the configuration of the replica set. To create one, first add a value for _id that matches the name you passed to the --replSet parameter:

> config = {_id: "myapp", members: []}
{ "_id" : "myapp", "members" : [ ] }

The individual members can be defined as part of the configuration document as follows:

config.members.push({_id: 0, host: 'iron.local:40000'})
config.members.push({_id: 1, host: 'iron.local:40001'})
config.members.push({_id: 2, host: 'iron.local:40002', arbiterOnly: true})

As noted earlier, iron is the name of our test machine; substitute your own hostname as necessary. Your configuration document should now look like this:

> config
{
  "_id" : "myapp",
  "members" : [
    {
      "_id" : 0,
      "host" : "iron.local:40000"
    },
    {
      "_id" : 1,
      "host" : "iron.local:40001"
    },
    {
      "_id" : 2,
      "host" : "iron.local:40002",
      "arbiterOnly" : true
    }
  ]
}

You can then pass the document as the first argument to rs.initiate() to initiate the replica set.
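
For example:

> rs.initiate(config)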

Technically, the document consists of an _id containing the name of the replica set, an array specifying between 3 and 50 members, and an optional subdocument for specifying certain global settings. This sample replica set uses the minimum required configuration parameters, plus the optional arbiterOnly setting. Please keep in mind that although a replica set can have up to 50 members, it can only have up to 7 voting members.

The document requires an _id that matches the replica set’s name. The initiation command will verify that each member node has been started with the --replSet option with that name. Each replica set member requires an _id consisting of increasing integers starting from 0. Also, members require a host field with a hostname and optional port.

Here you initiate the replica set using the rs.initiate() method. This is a simple wrapper for the replSetInitiate command. Thus you could have started the replica set like this:

db.runCommand({replSetInitiate: config});

config is a variable holding your configuration document. Once initiated, each set member stores a copy of this configuration document in the local database’s system.replset collection. If you query the collection, you’ll see that the document now has a version number. Whenever you modify the replica set’s configuration, you must also increment this version number. The easiest way to access the current configuration document is to run rs.conf().
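
For example, querying the collection on the newly initiated set shows the set name, members, and version number (output abridged):

myapp:PRIMARY> use local
myapp:PRIMARY> db.system.replset.findOne()
{
    "_id" : "myapp",
    "version" : 1,
    "members" : [ ... ]
}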

To modify a replica set’s configuration, there’s a separate command, replSetReconfig, which takes a new configuration document. Alternatively, you can use rs.reconfig(), which also uses replSetReconfig. The new document can specify the addition or removal of set members along with alterations to both member-specific and global configuration options. The process of modifying a configuration document, incrementing the version number, and passing it as part of the replSetReconfig command can be laborious, so a number of shell helpers exist to ease the way. To see a list of them all, enter rs.help() at the shell.

Bear in mind that whenever a replica set reconfiguration results in the election of a new primary node, all client connections will be closed. This is done to ensure that clients will no longer attempt to send writes to a secondary node unless they’re aware of the reconfiguration.

If you’re interested in configuring a replica set from one of the drivers, you can see how by examining the implementation of rs.add(). Enter rs.add (the method without the parentheses) at the shell prompt to see how the method works.

Configuration document options

Until now, we’ve limited ourselves to the simplest replica set configuration document. But these documents support several options for both replica set members and for the replica set as a whole. We’ll begin with the member options. You’ve seen _id, host, and arbiterOnly. Here are these plus the rest, in all their gritty detail:

  • _id (required) —A unique incrementing integer representing the member’s ID. These _id values begin at 0 and must be incremented by one for each member added.
  • host (required) —A string storing the hostname of this member along with an optional port number. If the port is provided, it should be separated from the hostname by a colon (for example, iron:30000). If no port number is specified, the default port, 27017, will be used. We’ve seen it before, but here’s a simple document with a replica set _id and host:
    {
      "_id" : 0,
      "host" : "iron:40000"
    }
  • arbiterOnly —A Boolean value, true or false, indicating whether this member is an arbiter. Arbiters store configuration data only. They’re lightweight members that participate in primary election but not in the replication itself. Here’s an example of using the arbiterOnly setting:
    {
      "_id" : 0,
      "host" : "iron:40000",
      "arbiterOnly": true
    }
  • priority —A decimal number from 0 to 1000 that helps to determine the relative eligibility that this node will be elected primary. For both replica set initiation and failover, the set will attempt to elect as primary the node with the highest priority, as long as it’s up to date. This might be useful if you have a replica set where some nodes are more powerful than the others; it makes sense to prefer the biggest machine as the primary. There are also cases where you might want a node never to be primary (say, a disaster recovery node residing in a secondary data center). In those cases, set the priority to 0. Nodes with a priority of 0 will be marked as passive in the results to the isMaster() command and will never be elected primary. Here’s an example of setting the member’s priority:
    {
      "_id" : 0,
      "host" : "iron:40000",
      "priority" : 500
    }
  • votes —All replica set members get one vote by default. The votes setting allows you to give more than one vote to an individual member. This option should be used with extreme care, if at all. For one thing, it’s difficult to reason about replica set failover behavior when not all members have the same number of votes. Moreover, the vast majority of production deployments will be perfectly well served with one vote per member. If you do choose to alter the number of votes for a given member, be sure to think through and simulate the various failure scenarios carefully. This member has an increased number of votes:
    {
      "_id" : 0,
      "host" : "iron:40000",
      "votes" : 2
    }
  • hidden —A Boolean value that, when true, will keep this member from showing up in the responses generated by the isMaster command. Because the MongoDB drivers rely on isMaster for knowledge of the replica set topology, hiding a member keeps the drivers from automatically accessing it. This setting can be used in conjunction with buildIndexes and must be used with slaveDelay. This member is configured to be hidden:
    {
      "_id" : 0,
      "host" : "iron:40000",
      "hidden" : true
    }
  • buildIndexes —A Boolean value, defaulting to true, that determines whether this member will build indexes. You’ll want to set this value to false only on members that will never become primary (those with a priority of 0). This option was designed for nodes used solely as backups. If backing up indexes is important, you shouldn’t use this option. Here’s a member configured not to build indexes:
    {
      "_id" : 0,
      "host" : "iron:40000",
      "buildIndexes" : false
    }
  • slaveDelay —The number of seconds that a given secondary should lag behind the primary. This option can be used only with nodes that will never become primary. To specify a slaveDelay greater than 0, be sure to also set a priority of 0. You can use a delayed slave as insurance against certain kinds of user errors. For example, if you have a secondary delayed by 30 minutes and an administrator accidentally drops a database, you have 30 minutes to react to this event before it’s propagated. This member has been configured with a slaveDelay of one hour:
    {
      "_id" : 0,
      "host" : "iron:40000",
      "slaveDelay" : 3600
    }
  • tags —A document containing a set of key-value pairs, usually used to identify this member’s location in a particular datacenter or server rack. Tags are used for specifying granular write concern and read settings, and they’re discussed in section 11.3.4. In the tag document, the values entered must be strings. Here’s a member with two tags:
    {
      "_id" : 0,
      "host" : "iron:40000",
      "tags" : {
        "datacenter" : "NY",
        "rack" : "B"
      }
    }

That sums up the options for individual replica set members. There are also two global replica set configuration parameters scoped under a settings key. In the replica set configuration document, they appear like this:

{
  _id: "myapp",
  members: [ ... ],
  settings: {
    getLastErrorDefaults: {
      w: 1
    },
    getLastErrorModes: {
      multiDC: {
        dc: 2
      }
    }
  }
}

  • getLastErrorDefaults —A document specifying the default arguments to be used when the client calls getLastError with no arguments. This option should be treated with care because it’s also possible to set global defaults for getLastError within the drivers, and you can imagine a situation where application developers call getLastError not realizing that an administrator has specified a default on the server. For more details on getLastError, see its documentation at http://docs.mongodb.org/manual/reference/command/getLastError. Briefly, to specify that all writes are replicated to at least two members with a timeout of 500 ms, you’d specify this value in the config like this:
    settings: {
      getLastErrorDefaults: {
        w: 2,
        wtimeout: 500
      }
    }
  • getLastErrorModes —A document defining extra modes for the getLastError command. This feature is dependent on replica set tagging and is described in detail in section 11.3.4.
Replica set status

You can see the status of a replica set and its members by running the replSetGetStatus command. To invoke this command from the shell, run the rs.status() helper method. The resulting document indicates the members and their respective states, uptime, and oplog times. It’s important to understand replica set member state. You can see a complete list of possible values in table 11.1.

Table 11.1. Replica set states

State  State string  Notes

0 STARTUP Indicates that the replica set is negotiating with other nodes by pinging all set members and sharing config data.
1 PRIMARY This is the primary node. A replica set will always have at most one primary node.
2 SECONDARY This is a secondary, read-only node. This node may become a primary in the event of a failover if and only if its priority is greater than 0 and it’s not marked as hidden.
3 RECOVERING This node is unavailable for reading and writing. You usually see this state after a failover or upon adding a new node. While recovering, a data file sync is often in progress; you can verify this by examining the recovering node’s logs.
4 FATAL A network connection is still established, but the node isn’t responding to pings. This usually indicates a fatal error on the machine hosting the node marked FATAL.
5 STARTUP2 An initial data file sync is in progress.
6 UNKNOWN A network connection has yet to be made.
7 ARBITER This node is an arbiter.
8 DOWN The node was accessible and stable at some point but isn’t currently responding to heartbeat pings.
9 ROLLBACK A rollback is in progress.
10 REMOVED The node was once a member of the replica set but has since been removed.

You can consider a replica set stable and online when all its nodes are in any of states 1, 2, or 7 and when at least one node is running as the primary. You can use the rs.status() or replSetGetStatus command from an external script to monitor overall state, replication lag, and uptime, and this is recommended for production deployments.[8]

8

Note that in addition to running the status command, you can get a useful visual through the web console. Chapter 13 discusses the web console and shows an example of its use with replica sets.
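
Here’s a minimal sketch of such a check, suitable for running in the shell or via mongo --eval. It compares each secondary’s optimeDate with the primary’s, using the same fields you saw in the rs.status() output earlier:

var status = rs.status()
var primary = status.members.filter(function(m) { return m.state === 1 })[0]
status.members.forEach(function(m) {
  if (m.state === 2) {  // SECONDARY
    var lagSecs = (primary.optimeDate - m.optimeDate) / 1000
    print(m.name + " is " + lagSecs + " seconds behind the primary")
  }
})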

Failover and recovery

In the sample replica set you saw a couple examples of failover. Here we summarize the rules of failover and provide some suggestions on handling recovery.

A replica set will come online when all members specified in the configuration can communicate with one another. Each node is given one vote by default, and those votes are used to form a majority and elect a primary. This means that a replica set can be started with as few as two nodes (and votes). But the initial number of votes also decides what constitutes a majority in the event of a failure.

Let’s assume that you’ve configured a replica set of three complete replicas (no arbiters) and thus have the recommended minimum for automated failover. If the primary fails and the remaining secondaries can see each other, a new primary can be elected. As for deciding which one: the secondary with the most up-to-date oplog (and, where priorities differ, the highest priority) will be elected primary.

Failure modes and recovery

Recovery is the process of restoring the replica set to its original state following a failure. There are two overarching failure categories to be handled. The first is called clean failure, where a given node’s data files can still be assumed to be intact. One example of this is a network partition. If a node loses its connections to the rest of the set, you need only wait for connectivity to be restored, and the partitioned node will resume as a set member. A similar situation occurs when a given node’s mongod process is terminated for any reason but can be brought back online cleanly.[9] Again, once the process is restarted, it can rejoin the set.

9

For instance, if MongoDB is shut down cleanly, then you know that the data files are okay. Alternatively, if running with journaling, the MongoDB instance should be recoverable regardless of how it’s killed.

The second type is called categorical failure, where a node’s data files either no longer exist or must be presumed corrupted. Unclean shutdowns of the mongod process without journaling enabled and hard drive crashes are both examples of this kind of failure. The only ways to recover a categorically failed node are to completely replace the data files via a resync or to restore from a recent backup. Let’s look at both strategies in turn.

To completely resync, start a mongod with an empty data directory on the failed node. As long as the host and port haven’t changed, the new mongod will rejoin the replica set and then resync all the existing data. If either the host or port has changed, then after bringing the mongod back online you’ll also have to reconfigure the replica set. As an example, suppose the node at iron:40001 is rendered unrecoverable and you bring up a new node at foobar:40000. You can reconfigure the replica set by grabbing the configuration document, modifying the host for the second node, and then passing that to the rs.reconfig() method:

> config = rs.conf()
{
  "_id" : "myapp",
  "version" : 1,
  "members" : [

    {
      "_id" : 0,
      "host" : "iron:40000"
    },
    {
      "_id" : 1,
      "host" : "iron:40001"
    },
    {
      "_id" : 2,
      "host" : "iron:40002",
      "arbiterOnly" : true
    }
  ]
}
> config.members[1].host = "foobar:40000"
foobar:40000
> rs.reconfig(config)

Now the replica set will identify the new node, and the new node should start to sync from an existing member.

In addition to restoring via a complete resync, you have the option of restoring from a recent backup. You’ll typically perform backups from one of the secondary nodes by making snapshots of the data files and then storing them offline.[10] Recovery via backup is possible only if the oplog within the backup isn’t stale relative to the oplogs of the current replica set members. This means that the latest operation in the backup’s oplog must still exist in the live oplogs. You can use the information provided by db.getReplicationInfo() to see right away whether this is the case; a driver-level sketch of the same check follows the footnote. When you do, don’t forget to take into account the time it will take to restore the backup. If the backup’s latest oplog entry is likely to go stale in the time it takes to copy the backup to a new machine, you’re better off performing a complete resync.

10

Backups are discussed in detail in chapter 13.
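To make the staleness check concrete, here’s a rough sketch using the Ruby driver rather than the db.getReplicationInfo() shell helper. The host name is from this chapter’s examples, and backup_latest_ts stands in for the newest oplog timestamp recorded when the backup was made:

require 'mongo'

# Newest oplog timestamp captured at backup time (example value)
backup_latest_ts = BSON::Timestamp.new(1384232574, 1)

# Read the oldest entry still held in a live member's oplog
live = Mongo::Client.new(['iron:40000'], :connect => :direct).use('local')
oldest_live_ts = live['oplog.rs'].find.sort('$natural' => 1).limit(1).first['ts']

if backup_latest_ts.seconds > oldest_live_ts.seconds
  puts 'Backup still overlaps the live oplog; restoring from it is viable.'
else
  puts 'Backup oplog is stale; perform a complete resync instead.'
end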

But restoring from backup can be faster, in part because the indexes don’t have to be rebuilt from scratch. To restore from a backup, copy the backed-up data files to a mongod data path. The resync should begin automatically, and you can check the logs or run rs.status() to verify this.

Deployment strategies

You now know that a replica set can consist of up to 50 nodes in MongoDB v3.0, and you’ve been presented with a dizzying array of configuration options and considerations regarding failover and recovery. There are a lot of ways you might configure a replica set, but in this section we’ll present a couple that will work for the majority of cases.

The most minimal replica set configuration providing automated failover is the one you built earlier consisting of two replicas and one arbiter. In production, the arbiter can run on an application server while each replica gets its own machine. This configuration is economical and sufficient for many production apps.

But for applications where uptime is critical, you’ll want a replica set consisting of three complete replicas. What does the extra replica buy you? Think of the scenario where a single node fails completely. You still have two first-class nodes available while you restore the third. Even while the third node is recovering (which may take hours), the remaining two form a majority, so the replica set can still fail over automatically to an up-to-date node.

Some applications will require the redundancy afforded by two datacenters, and the three-member replica set can also work in this case. The trick is to use one of the datacenters for disaster recovery only. Figure 11.2 shows an example of this. Here, the primary datacenter houses a replica set primary and secondary, and a backup datacenter keeps the remaining secondary as a passive node (with priority 0).

Figure 11.2. A three-node replica set with members in two datacenters

In this configuration, the replica set primary will always be one of the two nodes living in datacenter A. You can lose any one node or any one datacenter and still keep the application online. Failover will usually be automatic, except in the cases where both of A’s nodes are lost. Because it’s rare to lose two nodes at once, this would likely represent the complete failure or partitioning of datacenter A. To recover quickly, you could shut down the member in datacenter B and restart it without the --replSet flag. Alternatively, you could start two new nodes in datacenter B and then force a replica set reconfiguration. You’re not supposed to reconfigure a replica set when the majority of the set is unreachable, but you can do so in emergencies using the force option. For example, if you’ve defined a new configuration document, config, you can force reconfiguration like this:

> rs.reconfig(config, {force: true})

As with any production system, testing is key. Make sure that you test for all the typical failover and recovery scenarios in a staging environment comparable to what you’ll be running in production. Knowing from experience how your replica set will behave in these failure cases will secure some peace of mind and give you the wherewithal to calmly deal with emergencies as they occur.

11.3. Drivers and replication

If you’re building an application using MongoDB’s replication, you need to know about several application-specific topics. The first is related to connections and failover. Next comes the write concern, which allows you to decide to what degree a given write should be replicated before the application continues. The next topic, read scaling, allows an application to distribute reads across replicas. Finally, we’ll discuss tagging, a way to configure more complex replica set reads and writes.

11.3.1. Connections and failover

The MongoDB drivers present a relatively uniform interface for connecting to replica sets.

Single-node connections

You’ll always have the option of connecting to a single node in a replica set. There’s no difference between connecting to a node designated as a replica set primary and connecting to one of the vanilla stand-alone nodes we’ve used for the examples throughout the book. In both cases, the driver will initiate a TCP socket connection and then run the isMaster command. For a stand-alone node, this command returns a document like the following:

{
  "ismaster" : true,
  "maxBsonObjectSize" : 16777216,
  "maxMessageSizeBytes" : 48000000,
  "localTime" : ISODate("2013-11-12T05:22:54.317Z"),
  "ok" : 1
}

What’s most important to the driver is that the ismaster field be set to true, which indicates that the given node is a stand-alone, a master running master-slave replication, or a replica set primary.[11] In all of these cases, the node can be written to, and the user of the driver can perform any CRUD operation.

11

The isMaster command also returns a value for the maximum BSON object size for this version of the server. The drivers then validate that all BSON objects are within this limit prior to inserting them.
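If you’re curious what the driver sees, you can run the command yourself. Here’s a small sketch with the Ruby driver, using a direct connection to one of this chapter’s example hosts:

require 'mongo'

client = Mongo::Client.new(['iron:40000'], :connect => :direct)
reply  = client.database.command(:ismaster => 1).first

# true on a stand-alone node, a master, or a replica set primary
puts reply['ismaster']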

But when connecting directly to a replica set secondary, you must indicate that you know you’re connecting to such a node (for most drivers, at least). In the Ruby driver, you accomplish this with the :read parameter. To connect directly to the first secondary you created earlier in the chapter, the Ruby code would look like this:

@con = Mongo::Client.new(['iron:40001'], {:read => {:mode => :secondary}})

Without the :read argument, the driver will raise an exception indicating that it couldn’t connect to a primary node (assuming that the mongod running at port 40001 is the secondary). This check is in place to keep you from inadvertently writing to a secondary node. Though such attempts to write will always be rejected by the server, you won’t see any exceptions unless you’re running the operations with safe mode enabled.

The assumption is that you’ll usually want to connect to a primary node; the :read parameter is enforced as a sanity check.

Replica set connections

You can connect to any replica set member individually, but you’ll normally want to connect to the replica set as a whole. This allows the driver to figure out which node is primary and, in the case of failover, reconnect to whichever node becomes the new primary.

Most of the officially supported drivers provide ways of connecting to a replica set. In Ruby, you connect by creating a new instance of Mongo::Client, passing in a list of seed nodes as well as the name of the replica set:

Mongo::Client.new(['iron:40000', 'iron:40001'], :replica_set => 'myapp')

Internally, the driver will attempt to connect to each seed node and then call the isMaster command. Issuing this command to a replica set returns a number of important set details:

> db.isMaster()
{
  "setName" : "myapp",
  "ismaster" : false,
  "secondary" : true,
  "hosts" : [
    "iron:40001",
    "iron:40000"
  ],
  "arbiters" : [
    "iron:40002"
  ],
  "me" : "iron:40000",
  "maxBsonObjectSize" : 16777216,
  "maxMessageSizeBytes" : 48000000,
  "localTime" : ISODate("2013-11-12T05:14:42.009Z"),
  "ok" : 1
}

Once a seed node responds with this information, the driver has everything it needs. Now it can connect to the primary member, again verify that this member is still primary, and then allow the user to read and write through this node. The response object also allows the driver to cache the addresses of the remaining secondary and arbiter nodes. If an operation on the primary fails, then on subsequent requests the driver can attempt to connect to one of the remaining nodes until it can reconnect to a primary.

When connecting to a MongoDB replica set in this way, drivers will automatically discover additional nodes. This means that when you’re connecting to a replica set, you don’t need to explicitly list every member of the set. The response from the isMaster command alerts the driver of the presence of the other members. If none of the replica set members listed in the connection arguments are active, the connection will fail, so it’s wise to list as many as you can. But don’t sweat it if a few nodes are missing from the connection list; they’ll be found. If you have multiple data centers, it’s considered good practice to include members from all data centers.

It’s important to keep in mind that although replica set failover is automatic, the drivers don’t attempt to hide the fact that a failover has occurred. The course of events goes something like this: First, the primary fails or a new election takes place. Subsequent requests will reveal that the socket connection has been broken, and the driver will then raise a connection exception and close any open sockets to the database. It’s now up to the application developer to decide what happens next, and this decision will depend on both the operation being performed and the specific needs of the application.

Keeping in mind that the driver will automatically attempt to reconnect on any subsequent request, let’s imagine a couple of scenarios. First, suppose that you only issue reads to the database. In this case, there’s little harm in retrying a failed read because there’s no possibility of changing database state. But now imagine that you also regularly write to the database. You can write to the database with or without checking for errors. As discussed in section 11.3.2, with a write concern of 1 or more, the driver will check for problems, including the failure of a write to reach the replica set, by calling the getLastError command. This has been the default in most drivers since around MongoDB v2.0, but if you explicitly set the write concern to 0, the driver writes to the TCP socket without checking for errors. If you’re using a relatively recent version of the drivers or the shell, you don’t have to call getLastError() explicitly; writes receive a detailed acknowledgment in any case.
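For idempotent reads, a small retry loop is often all you need, because the driver reconnects on a subsequent request. Here’s a hedged sketch; the query, retry limit, and backoff are illustrative, not prescribed by the driver:

retries = 0
begin
  doc = @collection.find(:user => "pbakkum").first
rescue Mongo::Error::SocketError, Mongo::Error::NoServerAvailable
  retries += 1
  raise if retries >= 3
  sleep(2 ** retries)  # back off while the election completes
  retry
end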

If your application writes with a write concern of 0 and a failover occurs, you’re left in an uncertain state. How many of the recent writes made it to the server? How many were lost in the socket buffer? The indeterminate nature of writing to a TCP socket makes answering these questions practically impossible. How big of a problem this is depends on the application. For logging, non-safe-mode writes are probably acceptable, because losing writes hardly changes the overall logging picture. But for users creating data in the application, non-safe-mode writes can be a disaster.

The important thing to remember is that the write concern is set to 1 by default, meaning the writes are guaranteed to have reached one member of a replica set, and there are risks in setting it to 0. Receiving a response doesn’t eliminate the possibility of a rollback, as we discussed in section 11.2.2. MongoDB gives you some more advanced capabilities for managing this by controlling how writes work with the write concern.

11.3.2. Write concern

It should be clear now that the default write concern of 1 is reasonable for some applications because it’s important to know that writes have arrived error-free at the primary server. But greater levels of assurance are required if you want to eliminate the possibility of a rollback, and the write concern addresses this by allowing developers to specify the extent to which a write should be replicated before getting an acknowledgment and allowing the application to continue. Technically, you control write concerns via two parameters on the getLastError command: w and wtimeout. The first value, w, indicates the total number of servers that the latest write should be replicated to; the second is a timeout that causes the command to return an error if the write can’t be replicated in the specified number of milliseconds.

For example, if you want to make sure that a given write is replicated to at least one server in addition to the primary, you can indicate a w value of 2. If you want the operation to time out if this level of replication isn’t achieved in 500 ms, you include a wtimeout of 500. Note that if you don’t specify a value for wtimeout, and the replication for some reason never occurs, the operation will block indefinitely.

When using a driver, you usually pass the write concern value in with the write, but it depends on the specific driver’s API. In Ruby, you can specify a write concern on a single operation like this:

@collection.with(:write => {:w => 2, :wtimeout => 200}).insert_one(doc)

Many drivers support setting default write concern values for a given connection or database. It can be overwritten for a single operation, as shown earlier, but will become the default for the life of the connection:

Mongo::Client.new(['hostname:27017'], :write => {:w => 2})

Even fancier options exist. For instance, if you’ve enabled journaling, you can also force that the journal be synced to disk before acknowledging a write by adding the j option:

@collection.with(:write => {:w => 2, :j => true}).insert_one(doc)

To find out how to set the write concern in your particular case, check your driver’s documentation.

Write concerns work with both replica sets and master-slave replication. If you examine the local database, you’ll see a couple of collections used to implement write concerns: me on secondary nodes and slaves on the primary node. Whenever a secondary polls a primary, the primary makes a note of the latest oplog entry applied to each secondary in its slaves collection. Thus, the primary knows what each secondary has replicated at all times and can therefore reliably answer the getLastError command’s replication queries.

Keep in mind that using write concerns with values of w greater than 1 will introduce extra latency. Configurable write concerns essentially allow you to make the trade-off between speed and durability. If you’re running with journaling, then a write concern with w equal to 1 should be fine for most applications. On the other hand, for logging or analytics, you might elect to disable journaling and write concerns altogether and rely solely on replication for durability, allowing that you may lose some writes in the event of a failure. Consider these trade-offs carefully and test the different scenarios when designing your application.

11.3.3. Read scaling

Replicated databases are great for read scaling. If a single server can’t handle the application’s read load, you have the option to route queries to more than one replica. Most of the drivers have built-in support for sending queries to secondary nodes through a read preference configuration. With the Ruby driver, this is provided as an option on the Mongo::Client constructor:

Mongo::Client.new(
    ['iron:40000', 'iron:40001'],
    {:read => {:mode => :secondary}})

Note in the connection code that we configure which nodes the new client will read from. When the :read argument is set to {:mode => :secondary}, the connection object will choose a random, nearby secondary to read from. This configuration is called the read preference, and it can be used to direct your driver to read from certain nodes. Most MongoDB drivers have these available read preferences:

  • primary —This is the default setting and indicates that reads will always be from the replica set primary and thus will always be consistent. If the replica set is experiencing problems and there’s no secondary available, an error will be thrown.
  • primaryPreferred —Drivers with this setting will read from the primary unless for some reason it’s unavailable or there’s no primary, in which case reads will go to a secondary. This means that reads aren’t guaranteed to be consistent.
  • secondary —This setting indicates the driver should always read from a secondary. This is useful in cases where you want to be sure that your reads will have no impact on the writes that occur on the primary. If no secondaries are available, the read will throw an exception.
  • secondaryPreferred —This is a more relaxed version of the previous setting. Reads will go to secondaries, unless no secondaries are available, in which case reads will go to the primary.
  • nearest —A driver configured with this setting will attempt to read from the nearest member of the replica set, as measured by network latency. This could be either a primary or a secondary. Thus, reads will go to the member that the driver believes it can communicate with the quickest.

Remember, the primary read preference is the only one where reads are guaranteed to be consistent. Writing is always done first on the primary. Because there may be a lag in updating the secondary, it’s possible that a document that has just been written won’t be found on a read immediately following the write, unless you’re reading the primary.
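Read preferences can also be applied per operation rather than connection-wide. As a sketch (the query is illustrative), you might route a single latency-tolerant report to secondaries while the rest of the application keeps the default primary preference:

@collection.find({}, :read => {:mode => :secondary_preferred}).each do |doc|
  # process each document for the report or export
end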

It turns out that even if you’re not using the nearest setting, if a MongoDB driver has a read preference that allows it to query secondaries, it will still attempt to communicate with a nearby node. It does this according to its member selection strategy. The driver first ranks all the nodes by their network latency. Then it excludes all the nodes for which network latency is at least 15 ms larger than the lowest latency. Finally, it picks one of the remaining nodes at random. Fifteen milliseconds is the default for this value, but some drivers will allow configuration of the acceptable latency window. For the Ruby driver, this configuration might look like this:

Mongo::Client.new(
    ['iron:40000', 'iron:40001'],
    :read => {:mode => :secondary}, :local_threshold => 0.015)

The :local_threshold option specifies the acceptable latency window in seconds as a float; 0.015 corresponds to the 15 ms default.

Note that the nearest read preference uses this strategy to pick a node to read from as well, but it includes the primary in the selection process. Overall, the advantage of this approach is that a driver will be likely to have lower latency on a query than with totally random selection but is still able to distribute reads to multiple nodes if they have similarly low latency.

Many MongoDB users scale with replication in production, but there are three cases where this sort of scaling won’t be sufficient. The first relates to the number of servers needed. As of MongoDB v2.0, replica sets support a maximum of 12 members, 7 of which can vote. As of MongoDB v3.0, replica sets support a maximum of 50 members, 7 of which can vote. If you need even more replicas for scaling, you can use master-slave replication. But if you don’t want to sacrifice automated failover and you need to scale beyond the replica set maximum, you’ll need to migrate to a sharded cluster.

The second case involves applications with a high write load. As mentioned at the beginning of this chapter, secondaries must keep up with this write load. Sending reads to write-laden secondaries may inhibit replication.

A third situation that replica scaling can’t handle is consistent reads. Because replication is asynchronous, replicas aren’t always going to reflect the latest writes to the primary. Therefore, if your application reads arbitrarily from secondaries, the picture presented to end users isn’t always guaranteed to be fully consistent. For some applications, this isn’t an issue, but for others you need consistent reads; in our shopping cart example from chapter 4, there would be serious problems if we weren’t reading the most current data. In fact, atomic operations that read and write must be run on the primary. In these cases, you have two options. The first is to separate the parts of the application that need consistent reads from the parts that don’t. The former can always be read from the primary, and the latter can be distributed to secondaries; a sketch of this split follows the footnote. When this strategy is either too complicated or doesn’t scale, sharding is the way to go.[12]

12

Note that to get consistent reads from a sharded cluster, you must always read from the primary nodes of each shard, and you must issue safe writes.
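Here’s one way that split might look with the Ruby driver (the collection names and user_id are illustrative): cart reads stay on the primary for consistency, while activity-feed reads are allowed to lag slightly.

client  = Mongo::Client.new(['iron:40000', 'iron:40001'], :replica_set => 'myapp')
user_id = BSON::ObjectId.new  # placeholder for a real user's _id

carts    = client[:carts].with(:read => {:mode => :primary})
activity = client[:activity].with(:read => {:mode => :secondary_preferred})

cart   = carts.find(:user_id => user_id).first               # always current
recent = activity.find(:user_id => user_id).limit(10).to_a   # may lag slightly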

11.3.4. Tagging

If you’re using either write concerns or read scaling, you may find yourself wanting more granular control over exactly which secondaries receive writes or reads. For example, suppose you’ve deployed a five-node replica set across two geographically separate datacenters, NY and FR. The primary datacenter, NY, contains three nodes, and the secondary datacenter, FR, contains the remaining two. Let’s say that you want to use a write concern to block until a certain write has been replicated to at least one node in datacenter FR. With what you know about write concerns right now, you’ll see that there’s no good way to do this. You can’t use a w value of a majority of nodes (3) because the most likely scenario is that the three nodes in NY will acknowledge first. You could use a value of 4, but this won’t hold up well if, say, you lose one node from each datacenter.

Replica set tagging solves this problem by allowing you to define special write concern modes that target replica set members with certain tags. To see how this works, you first need to learn how to tag a replica set member. In the config document, each member can have a key called tags pointing to an object containing key-value pairs. Here’s an example:

{
  "_id" : "myapp",
  "version" : 1,
  "members" : [
    {
      "_id" : 0,
      "host" : "ny1.myapp.com:30000",
      "tags": { "dc": "NY", "rackNY": "A" }
    },

    {
      "_id" : 1,
      "host" : "ny2.myapp.com:30000",
      "tags": { "dc": "NY", "rackNY": "A" }
    },
    {
      "_id" : 2,
      "host" : "ny3.myapp.com:30000",
      "tags": { "dc": "NY", "rackNY": "B" }
    },
    {
      "_id" : 3,
      "host" : "fr1.myapp.com:30000",
      "tags": { "dc": "FR", "rackFR": "A" }
    },
    {
      "_id" : 4,
      "host" : "fr2.myapp.com:30000",
      "tags": { "dc": "FR", "rackFR": "B" }
    }
  ],
  settings: {
    getLastErrorModes: {
      multiDC: { dc: 2 },
      multiRack: { rackNY: 2 }
    }
  }
}

This is a tagged configuration document for the hypothetical replica set spanning two datacenters. Note that each member’s tag document has two key-value pairs: the first identifies the datacenter and the second names the local server rack for the given node. Keep in mind that the names used here are completely arbitrary and only meaningful in the context of this application; you can put anything in a tag document (though the value must be a string). What’s most important is how you use it.

That’s where getLastErrorModes comes into play. This allows you to define modes for the getLastError command that implement specific write concern requirements. In the example, we’ve defined two of these. The first, multiDC, is defined as { "dc": 2 }, which indicates that a write should be replicated to nodes tagged with at least two different values for dc. If you examine the tags, you’ll see this will necessarily ensure that the write has been propagated to both datacenters. The second mode specifies that at least two server racks in NY should’ve received the write. Again the tags should make this clear.

In general, a getLastErrorModes entry consists of a document with one or more keys (in this case, dc and rackNY) the values of which are integers. These integers indicate the number of different tagged values for the key that must be satisfied for the getLastError command to complete successfully. Once you define these modes, you can use them as values for w in your application. For example, using the first mode in Ruby looks like this:

@collection.with(:write => {:w => "multiDC"}).insert_one(doc)

In addition to making write concerns more sophisticated, tagging also works with most of the read preferences discussed in section 11.3.3. With reads, tagging works by restricting reads to nodes carrying a specific tag. For example, if using a read preference of secondary, the driver will ignore all the nodes that don’t have the given tag value. Because the primary read preference can only ever read from one node, it’s not compatible with tags, but all the other read preferences are. Here’s an example of this using the Ruby driver:

@collection.find({user: "pbakkum"},
  :read => {
    :mode => :secondary,
    :tag_sets => [{ "dc" => "NY" }]
  })

This configuration reads from a secondary node that has the dc:NY tag.

Tagging is an element of MongoDB that you may never use, but it can be incredibly useful in certain situations. Keep it in mind if you’re managing complex replica set configurations.

11.4. Summary

It should be clear from all that we’ve discussed that replication is essential for most deployments. MongoDB’s replication is supposed to be easy, and setting it up usually is. But when it comes to backing up and failing over, there are bound to be hidden complexities. For these complex cases, let experience, and some help from this chapter, breed familiarity.

To finish up, here are some key things to remember as you move on and manage your own replica sets:

  • We recommend that every production deployment of MongoDB where data protection is critical should use a replica set. Failing that, frequent backups are essential.
  • A replica set should include at least three members, though one of these can be an arbiter.
  • Data isn’t considered committed until it has been written to a majority of replica set members. In a failure scenario, if a majority of members remain they’ll continue to accept writes. Writes that haven’t reached a majority of members in this situation will be placed in the rollback data directory and must be handled manually.
  • If a replica set secondary is down for a period of time, and the changes made to the database don’t fit into MongoDB’s oplog, this node will be unable to catch up and must be resynced from scratch. To avoid this, try to minimize the downtime of your secondaries.
  • The driver’s write concern controls how many nodes must be written to before returning. Increase this value to increase durability. For real durability, we recommend you set it to a majority of members to avoid rollback scenarios, though this approach carries a latency cost.
  • MongoDB gives you fine-grained control over how reads and writes behave in more complex replica sets using read preferences and tagging. Use these options to optimize the performance of your replica set, especially if you have set members in multiple datacenters.

As always, think through your deployments and test thoroughly. Replica sets can be an especially valuable tool when used effectively.
