Chapter 12. Scaling your system with sharding

This chapter covers

  • Sharding motivation and architecture
  • Setting up and loading a sample shard cluster
  • Querying and indexing a sharded cluster
  • Choosing a shard key
  • Deploying sharding in production

With the increasing scale of modern applications, it’s become more and more expensive, and in some cases impossible, to get a single machine powerful enough to handle the load. One solution to the problem is to pool the capacity of a large number of less powerful machines. Sharding in MongoDB is designed to do just that: partition your database into smaller pieces so that no single machine has to store all the data or handle the entire load. On top of that, sharding in MongoDB is transparent to the application, which means the interface for querying a sharded cluster is exactly the same as the interface for querying a replica set or a single mongod server instance.

We’ll begin the chapter with an overview of sharding. We’ll go into detail about what problems it tries to solve and how to know when you need it. Next, we’ll talk about the components that make up a sharded cluster. Then, we’ll cover the two different ways to shard, and scratch the surface of MongoDB’s range-based partitioning.

These three sections will give you a basic working knowledge of sharding, but you won’t fully understand how these ideas all come together until you set up your own sharded cluster. That’s what you’ll do in the fourth section, where you’ll build a sample cluster to host data from a massive Google Docs–like application.

We’ll then discuss some sharding mechanics, describing how queries and indexing work across shards. We’ll look at the ever-important choice of shard key, and we’ll end the chapter with some specific advice on running sharding in production.

Google Docs instead of e-commerce

We’re using a Google Docs–like application here, rather than the e-commerce application we've used in the rest of the book, because the schema is simpler and allows us to focus on the sharding itself.

In an e-commerce application, you may have multiple collections. Some of these collections may be large, such as a collection storing all user comments, whereas some may be smaller, such as a collection storing all user profiles. In a more complex application such as this, you’d only shard the collections that would benefit from the added capacity of sharding while leaving the smaller collections unsharded for simplicity. Because sharded and unsharded collections can exist in the same system, all of this will work together, completely transparently to the application. In fact, if later you find that one of the collections that didn’t need to be sharded is becoming larger, you can enable sharding at any time.

The same principles that we’ll see when looking at our Google Docs–like application apply to any sharded collection. We’ll stick with this example to keep things simple and focus on what’s new in this chapter.

12.1. Sharding overview

Before you build your first sharded cluster, it’s useful to have a general understanding of the concepts behind sharding. In this section, we’ll cover what problems sharding solves, discuss some of the challenges inherent in sharding, and then talk about how to know when sharding is the correct solution in practice.

12.1.1. What is sharding?

Sharding is the process of partitioning a large dataset into smaller, more manageable pieces. Until this point in the book, you’ve used MongoDB as a single server, where each mongod instance contains a complete copy of your application’s data. Even when using replication (as we did in chapter 11), each replica clones every other replica’s data entirely. For the majority of applications, storing the complete data set on each server is perfectly acceptable. But as the size of the data grows, and as an application demands greater read-and-write throughput, commodity servers may not be sufficient. In particular, these servers may not be able to address enough RAM, or they might not have enough CPU cores, to process the workload efficiently. In addition, as the size of the data grows, it may become impractical to store and manage backups for such a large data set on one disk or RAID array.

If you’re to continue to use commodity or virtualized hardware to host the database, the solution to these problems is to distribute the database across more than one server. The method for doing this in MongoDB is called sharding.

Sharding in MongoDB can help your application scale, but remember that it’s a large hammer. It’s a complex system that adds administrative and performance overhead, so make absolutely sure it’s what your application needs. In the next section we’ll cover how you can tell when sharding is your best option.

Sharding: Learn by doing

Sharding is complicated. To get the most out of this chapter, you should run the examples.

You’ll have no trouble running the sample cluster entirely on one machine; once you’ve successfully set up your cluster, start experimenting with it. There’s nothing like having a live, sharded deployment on hand for understanding MongoDB as a distributed system.

12.1.2. When should you shard?

The question of when to shard is straightforward in theory but requires a solid understanding of how your system is being used. In general, there are two main reasons to shard: storage distribution and load distribution. Keep in mind that sharding doesn’t solve all performance issues, and it adds additional complexity and overhead, so it’s important to understand why you’re sharding. In many cases, sharding may not be the optimal solution.

Storage distribution

Understanding the storage requirements of your system is usually not difficult. MongoDB stores all its data in ordinary files in the directory specified by the --dbpath option (covered in more detail in appendix A), so you can use whatever utilities your host operating system provides to monitor MongoDB’s storage usage. In addition, running db.stats() and db.<collection>.stats() in the mongo shell will output storage statistics for the current database and for a specific collection within it, respectively.
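For example, assuming a database named cloud-docs with a spreadsheets collection (the example used later in this chapter), a quick check from the shell might look like this:

> use cloud-docs
> db.stats().dataSize             // total size of the data in the current database, in bytes
> db.spreadsheets.stats().size    // size of the data in a single collection, in bytes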

If you carefully monitor your storage capacity as your application grows, you’ll be able to clearly see when the storage that your application requires exceeds the capacity of any one node. In that case, if adding more capacity isn’t possible, sharding may be your best option.

Load distribution

Understanding the load—the CPU, RAM, and I/O bandwidth used by requests from clients—that your system must support is a bit more nuanced. In chapter 8 we talked about the importance of keeping indexes and the working data set in RAM, and this is the most common reason to shard. If an application’s data set continues to grow unbounded, a moment will come when that data no longer fits in RAM. If you’re running on Amazon’s EC2, you’ll hit that threshold when you’ve exceeded the available RAM on the largest available instance. Alternatively, you may run your own hardware with much more RAM, in which case you’ll probably be able to delay sharding for some time. But no machine has infinite capacity for RAM; therefore, sharding eventually becomes necessary.

To be sure, the relationship between the load your servers can handle and the amount of RAM they have available isn’t always straightforward. For instance, using solid-state drives (an increasingly affordable prospect) or arranging your disks in a striped RAID configuration will increase the number of IOPS (input/output operations per second) that your disks can handle, which may allow you to push the data-to-RAM ratio without negatively affecting performance. It may also be the case that your working set is a fraction of your total data size and that, therefore, you can operate with relatively little RAM. On the flip side, if you have an especially demanding write load, you may want to shard well before data reaches the size of RAM, simply because you need to distribute the load across machines to get the desired write throughput.

Leave some elbow room

Although it may be tempting to wait to shard until all your disks are 100% full and all your machines are overloaded, that’s a bad idea. Sharding itself puts some load on your system because the process of automatic balancing has to move data off overloaded shards. If your system is already so overloaded that this can’t happen, your empty shards will remain empty, your overloaded shards will remain overloaded, and your system will grind to a halt. Chapter 13 gives some practical advice for how to keep track of the important metrics, so you can scale up smoothly as your application grows.

Whatever the case, the decision to shard an existing system will always be based on regular analyses of network usage, disk usage, CPU usage, and the ever-important ratio of working set size, or the amount of data actively being used, to available RAM.

Now that you understand the background and theory behind sharding and know when you need it, let’s look at the components that make up a sharded cluster in MongoDB.

12.2. Understanding components of a sharded cluster

Several components need to work together to make sharding possible. When they’re all functioning together, this is known as a sharded cluster. To understand how MongoDB’s sharding works, you need to know about all the components that make up a sharded cluster and the role of each component in the context of the cluster as a whole.

A sharded cluster consists of shards, mongos routers, and config servers, as shown in figure 12.1.

Figure 12.1. Components in a MongoDB shard cluster

Let’s examine each component in figure 12.1:

  • Shards (upper left) store the application data. In a sharded cluster, only the mongos routers or system administrators should connect directly to the shards. As in an unsharded deployment, each shard can be a single node for development and testing, but it should be a replica set in production.
  • mongos routers (center) cache the cluster metadata and use it to route operations to the correct shard or shards.
  • Config servers (upper right) persistently store metadata about the cluster, including which shard has what subset of the data.

Now, let’s discuss in more detail the role each of these components plays in the cluster as a whole.

12.2.1. Shards: storage of application data

A shard, shown at the upper left of figure 12.1, is either a single mongod server or a replica set that stores a partition of the application data. In fact, the shards are the only places where the application data gets saved in a sharded cluster. For testing, a shard can be a single mongod server but should be deployed as a replica set in production because then it will have its own replication mechanism and can fail over automatically. You can connect to an individual shard as you would to a single node or a replica set, but if you try to run operations on that shard directly, you’ll see only a portion of the cluster’s total data.

12.2.2. Mongos router: router of operations

Because each shard contains only part of the cluster’s data, you need something to route operations to the appropriate shards. That’s where mongos comes in. The mongos process, shown in the center of figure 12.1, is a router that directs all reads, writes, and commands to the appropriate shard. In this way, mongos provides clients with a single point of contact with the cluster, which is what enables a sharded cluster to present the same interface as an unsharded one.

mongos processes are lightweight and nonpersistent.[1] Because of this, they’re often deployed on the same machines as the application servers, ensuring that only one network hop is required for requests to any given shard. In other words, the application connects locally to a mongos, and the mongos manages connections to the individual shards.

1

The mongos server caches a local copy of the config server metadata in memory. This metadata has a version identifier that changes when the metadata changes, so when a mongos with old metadata tries to contact a shard with a newer metadata version, it receives a notification that it must refresh its local copy.

12.2.3. Config servers: storage of metadata

mongos processes are nonpersistent, which means something must durably store the metadata needed to properly manage the cluster. That’s the job of the config servers, shown in the top right of figure 12.1. This metadata includes the global cluster configuration; the locations of each database, collection, and the particular ranges of data therein; and a change log preserving a history of the migrations of data across shards.

The metadata held by the config servers is central to the proper functioning and upkeep of the cluster. For instance, every time a mongos process is started, the mongos fetches a copy of the metadata from the config servers. Without this data, no coherent view of the shard cluster is possible. The importance of this data, then, informs the design and deployment strategy for the config servers.

If you examine figure 12.1, you’ll see there are three config servers, but they’re not deployed as a replica set. They demand something stronger than asynchronous replication; when the mongos process writes to them, it does so using a two-phase commit. This guarantees consistency across config servers. You must run exactly three config servers in any production deployment of sharding, and these servers must reside on separate machines for redundancy.[2]

2

You can also run a single config server, but only as a way of more easily testing sharding. Running with only one config server in production is like taking a transatlantic flight in a single-engine jet: it might get you there, but lose an engine and you’re hosed.

Now you know what a shard cluster consists of, but you’re probably still wondering about the sharding machinery itself. How is data actually distributed? We’ll explain that in the next section, first introducing the two ways to shard in MongoDB, and then covering the core sharding operations.

12.3. Distributing data in a sharded cluster

Before discussing the different ways to shard, let’s discuss how data is grouped and organized in MongoDB. This topic is relevant to a discussion of sharding because it illustrates the different boundaries on which we can partition our data.

To illustrate this, we’ll use the Google Docs–like application we’ll build later in the chapter. Figure 12.2 shows how the data for such an application would be structured in MongoDB.

Figure 12.2. Levels of granularity available in a sharded MongoDB deployment

Looking at the figure from the innermost box moving outward, you can see there are four different levels of granularity in MongoDB: document, chunk, collection, and database.

These four levels of granularity represent the units of data in MongoDB:

  • Document —The smallest unit of data in MongoDB. A document represents a single object in the system and can’t be divided further. You can compare this to a row in a relational database. Note that we consider a document and all its fields to be a single atomic unit. In the innermost box in figure 12.2, you can see a document with a username field with a value of "hawkins".
  • Chunk —A group of documents clustered by values on a field. A chunk is a concept that exists only in sharded setups. This is a logical grouping of documents based on their values for a field or set of fields, known as a shard key. We’ll cover the shard key when we go into more detail about chunks later in this section, and then again in section 12.6. The chunk shown in figure 12.2 contains all the documents that have the field username with values between "bakkum" and "verch".
  • Collection —A named grouping of documents within a database. To allow users to separate a database into logical groupings that make sense for the application, MongoDB provides the concept of a collection. This is nothing more than a named grouping of documents, and it must be explicitly specified by the application to run any queries. In figure 12.2, the collection name is spreadsheets. This collection name essentially identifies a subgroup within the cloud-docs database, which we’ll discuss next.
  • Database —Contains collections of documents. This is the top-level named grouping in the system. Because a database contains collections of documents, a collection must also be specified to perform any operations on the documents themselves. In figure 12.2, the database name is cloud-docs. To run any queries, the collection must also be specified—spreadsheets in our example. The combination of database name and collection name is unique throughout the system, and is commonly referred to as the namespace. It is usually written by concatenating the database name and the collection name, separated by a period character. For the example shown in figure 12.2, that would look like cloud-docs.spreadsheets.

Databases and collections were covered in section 4.3, and are present in unsharded deployments as well, so the only unfamiliar grouping here should be the chunk.

12.3.1. Ways data can be distributed in a sharded cluster

Now you know the different ways in which data is logically grouped in MongoDB. The next questions are, how does this interact with sharding? On which of these groupings can we partition our data? The quick answer to these questions is that data can be distributed in a sharded cluster on two of these four groupings:

  • On the level of an entire database, where each database along with all its collections is put on its own shard.
  • On the level of partitions or chunks of a collection, where the documents within a collection itself are divided up and spread out over multiple shards, based on values of a field or set of fields called the shard key in the documents.

You may wonder why MongoDB partitions data based on chunks rather than on individual documents. A document seems like the most logical grouping because it’s the smallest possible unit. But remember that we not only have to partition the data, we also have to be able to find it again. If we partitioned at the document level—for example, by allowing each spreadsheet in our Google Docs–like application to be moved around independently—the config servers would have to track the location of every single document. In a system with small documents, half of your data could end up being config server metadata that does nothing but record where the actual data is stored.

Granularity jump from database to partition of collection

You may also wonder why there’s a jump in granularity from an entire database to a partition of a collection. Why isn’t there an intermediate step where we can distribute on the level of whole collections, without partitioning the collections themselves?

The real answer to this question is that it’s completely theoretically possible. It just hasn’t been implemented yet. Fortunately, because of the relationship between databases and collections, there’s an easy workaround. If you’re in a situation where you have different collections—say, files.spreadsheets and files.powerpoints—that you want to be put on separate servers, you can store them in separate databases. For example, you could store spreadsheets in files_spreadsheets.spreadsheets and PowerPoint files in files_powerpoints.powerpoints. Because files_spreadsheets and files_powerpoints are two separate databases, they’ll be distributed, and so will the collections.
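In the shell, this workaround amounts to nothing more than a deliberate choice of database names (the names here are hypothetical):

> use files_spreadsheets
> db.spreadsheets.insert({ filename: "q3-budget" })    // stored in the files_spreadsheets database
> use files_powerpoints
> db.powerpoints.insert({ filename: "q3-review" })     // stored in the files_powerpoints database

Each database is assigned its own primary shard when it’s created, so the two collections can end up on different machines.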

In the next two sections, we’ll cover each of the supported distribution methods. First we’ll discuss distributing entire databases, and then we’ll move on to the more common and useful method of distributing chunks within a collection.

12.3.2. Distributing databases to shards

As you create new databases in a sharded cluster, each database is assigned to a different shard. If you do nothing else, a database and all its collections will live forever on the shard where they were created; a database doesn’t even need to have sharding enabled for this placement to happen.

Because the name of a database is specified by the application, you can think of this as a kind of manual partitioning. MongoDB has nothing to do with how well-partitioned your data is. To see why this is manual, consider using this method to shard the spreadsheets collection in our documents example. To shard this two ways using database distribution, you’d have to make two databases—say files1 and files2—and evenly divide the data between the files1.spreadsheets and the files2.spreadsheets collections. It’s completely up to you to decide which spreadsheet goes in which collection and come up with a scheme to query the appropriate database to find them later. This is a difficult problem, which is why we don’t recommend this approach for this type of application.

When is the database distribution method really useful? One example of a real application for database distribution is MongoDB as a service. In one implementation of this model, customers can pay for access to a single MongoDB database. On the backend, each database is created in a sharded cluster. This means that if each client uses roughly the same amount of data, the distribution of the data will be optimal due to the distribution of the databases throughout the cluster.

12.3.3. Sharding within collections

Now, we’ll review the more powerful form of MongoDB sharding: sharding an individual collection. This is what the phrase automatic sharding refers to, because this is the form of sharding in which MongoDB itself makes all the partitioning decisions, without any direct intervention from the application.

To allow for partitioning of an individual collection, MongoDB defines the idea of a chunk, which as you saw earlier is a logical grouping of documents, based on the values of a predetermined field or set of fields called a shard key. It’s the user’s responsibility to choose the shard key, and we’ll cover how to do this in section 12.6.

For example, consider the following document from a spreadsheet management application:

{
  _id: ObjectId("4d6e9b89b600c2c196442c21")
  filename: "spreadsheet-1",
  updated_at: ISODate("2011-03-02T19:22:54.845Z"),
  username: "banks",
  data: "raw document data"
}

If all the documents in our collection have this format, we can, for example, choose a shard key of the _id field and the username field. MongoDB will then use that information in each document to determine what chunk the document belongs to.

How does MongoDB make this determination? At its core, MongoDB’s sharding is range-based; this means that each “chunk” represents a range of shard keys. When MongoDB looks at a document to determine what chunk it belongs to, it first extracts the values for the shard key and then finds the chunk whose shard key range contains the given shard key values.

To see a concrete example of what this looks like, imagine that we chose a shard key of username for this spreadsheets collection, and we have two shards, “A” and “B.” Our chunk distribution may look something like table 12.1.

Table 12.1. Chunks and shards

Start     End       Shard
-∞        Abbot     B
Abbot     Dayton    A
Dayton    Harris    B
Harris    Norris    A
Norris    ∞         B

Looking at the table, it becomes a bit clearer what purpose chunks serve in a sharded cluster. If we gave you a document with a username field of "Babbage", you’d immediately know that it should be on shard A, just by looking at the table above. In fact, if we gave you any document that had a username field, which in this case is our shard key, you’d be able to use table 12.1 to determine which chunk the document belonged to, and from there determine which shard it should be sent to. We’ll look into this process in much more detail in sections 12.5 and 12.6.

12.4. Building a sample shard cluster

The best way to get a handle on sharding is to see how it works in action. Fortunately, it’s possible to set up a sharded cluster on a single machine, and that’s exactly what we’ll do now.[3]

3

The idea is that you can run every mongod and mongos process on a single machine for testing. In section 12.7 we’ll look at production sharding configurations and the minimum number of machines required for a viable deployment.

The full process of setting up a sharded cluster involves three phases:

1.  Starting the mongod and mongos servers—The first step is to spawn all the individual mongod and mongos processes that make up the cluster. In the cluster we’re setting up in this chapter, we’ll spawn nine mongod servers and one mongos server.

2.  Configuring the cluster—The next step is to update the configuration so that the replica sets are initialized and the shards are added to the cluster. After this, the nodes will all be able to communicate with each other.

3.  Sharding collections—The last step is to shard a collection so that it can be spread across multiple shards. The reason this exists as a separate step is because MongoDB can have both sharded and unsharded collections in the same cluster, so you must explicitly tell it which ones you want to shard. In this chapter, we’ll shard our only collection, which is the spreadsheets collection of the cloud-docs database.

We’ll cover each of these steps in detail in the next three sections. We’ll then simulate the behavior of the sample cloud-based spreadsheet application described in the previous sections. Throughout the chapter we’ll examine the global shard configuration, and in the last section, we’ll use this to see how data is partitioned based on the shard key.

12.4.1. Starting the mongod and mongos servers

The first step in setting up a sharded cluster is to start all the required mongod and mongos processes. The shard cluster you’ll build will consist of two shards and three config servers. You’ll also start a single mongos to communicate with the cluster. Figure 12.3 shows a map of all the processes that you’ll launch, with their port numbers in parentheses.

Figure 12.3. A map of processes comprising the sample shard cluster

You’ll run a bunch of commands to bring the cluster online, so if you find yourself unable to see the forest for the trees, refer back to this figure.

Starting the sharding components

Let’s start by creating the data directories for the two replica sets that will serve as our shards:

$ mkdir /data/rs-a-1
$ mkdir /data/rs-a-2
$ mkdir /data/rs-a-3
$ mkdir /data/rs-b-1
$ mkdir /data/rs-b-2
$ mkdir /data/rs-b-3

Next, start each mongod. Because you’re running so many processes, you’ll use the --fork option to run them in the background.[4] The commands for starting the first replica set are as follows:

4

If you’re running Windows, note that fork won’t work for you. Because you’ll have to open a new terminal window for each process, you’re best off omitting the logpath option as well.

$ mongod --shardsvr --replSet shard-a --dbpath /data/rs-a-1 \
  --port 30000 --logpath /data/rs-a-1.log --fork
$ mongod --shardsvr --replSet shard-a --dbpath /data/rs-a-2 \
  --port 30001 --logpath /data/rs-a-2.log --fork
$ mongod --shardsvr --replSet shard-a --dbpath /data/rs-a-3 \
  --port 30002 --logpath /data/rs-a-3.log --fork

Here are the commands for the second replica set:

$ mongod --shardsvr --replSet shard-b --dbpath /data/rs-b-1 \
  --port 30100 --logpath /data/rs-b-1.log --fork
$ mongod --shardsvr --replSet shard-b --dbpath /data/rs-b-2 \
  --port 30101 --logpath /data/rs-b-2.log --fork
$ mongod --shardsvr --replSet shard-b --dbpath /data/rs-b-3 \
  --port 30102 --logpath /data/rs-b-3.log --fork

We won’t cover all the command-line options used here. To see what each of these flags means in more detail, refer to the MongoDB documentation for the mongod program at http://docs.mongodb.org/manual/reference/program/mongod/.

As usual, you now need to initiate these replica sets. Connect to each one individually, run rs.initiate(), and then add the remaining nodes. The first should look like this:

$ mongo localhost:30000
> rs.initiate()

You’ll have to wait a minute or so before the initial node becomes primary. During the process, the prompt will change from shard-a:SECONDARY> to shard-a:PRIMARY>. Running the rs.status() command will also reveal more information about what’s going on behind the scenes. Once the node is primary, you can add the remaining nodes:

> rs.add("localhost:30001")
> rs.addArb("localhost:30002")

Using localhost as the machine name might cause problems in the long run because it only works if you’re going to run all processes on a single machine. If you know your hostname, use it to get out of trouble. On a Mac, your hostname should look something like MacBook-Pro.local. If you don’t know your hostname, make sure that you use localhost everywhere!

Configuring a replica set that you’ll use as a shard is exactly the same as configuring a replica set that you’ll use on its own, so refer back to chapter 10 if any of this replica set setup looks unfamiliar to you.

Initiating the second replica set is similar. Again, wait a minute after running rs.initiate():

$ mongo localhost:30100
> rs.initiate()
> rs.add("localhost:30100")
> rs.addArb("localhost:30101")

Finally, verify that both replica sets are online by running the rs.status() command from the shell on each one. If everything checks out, you’re ready to start the config servers.[5] Now you create each config server’s data directory and then start a mongod for each one using the configsvr option:

5

Again, if running on Windows, omit the --fork and --logpath options, and start each mongod in a new window.

$ mkdir /data/config-1
$ mongod --configsvr --dbpath /data/config-1 --port 27019 \
  --logpath /data/config-1.log --fork --nojournal
$ mkdir /data/config-2
$ mongod --configsvr --dbpath /data/config-2 --port 27020 \
  --logpath /data/config-2.log --fork --nojournal
$ mkdir /data/config-3
$ mongod --configsvr --dbpath /data/config-3 --port 27021 \
  --logpath /data/config-3.log --fork --nojournal

Ensure that each config server is up and running by connecting with the shell, or by tailing the log file (tail -f <log_file_path>) and verifying that each process is listening on the configured port. Looking at the logs for any one config server, you should see something like this:

Wed Mar  2 15:43:28 [initandlisten] waiting for connections on port 27020
Wed Mar  2 15:43:28 [websvr] web admin interface listening on port 28020

If each config server is running, you can go ahead and start the mongos. The mongos must be started with the configdb option, which takes a comma-separated list of config database addresses:[6]

6

Be careful not to put spaces between the config server addresses when specifying them.

$ mongos --configdb localhost:27019,localhost:27020,localhost:27021 \
  --logpath /data/mongos.log --fork --port 40000

Once again, we won’t cover all the command line options we’re using here. If you want more details on what each option does, refer to the docs for the mongos program at http://docs.mongodb.org/manual/reference/program/mongos/.

12.4.2. Configuring the cluster

Now that you’ve started all the mongod and mongos processes that we’ll need for this cluster (see figure 12.3), it’s time to configure the cluster. Start by connecting to the mongos. To simplify the task, you’ll use the sharding helper methods. These are methods run on the global sh object. To see a list of all available helper methods, run sh.help().

You’ll enter a series of configuration commands beginning with the addShard command. The helper for this command is sh.addShard(). This method takes a string consisting of the name of a replica set, followed by the addresses of two or more seed nodes for connecting. Here you specify the two replica sets you created along with the addresses of the two non-arbiter members of each set:

$ mongo localhost:40000
> sh.addShard("shard-a/localhost:30000,localhost:30001")
  { "shardAdded" : "shard-a", "ok" : 1 }
> sh.addShard("shard-b/localhost:30100,localhost:30101")
  { "shardAdded" : "shard-b", "ok" : 1 }

If successful, the command response will include the name of the shard just added. You can examine the config database’s shards collection to see the effect of your work. Instead of using the use command, you’ll use the getSiblingDB() method to switch databases:

> db.getSiblingDB("config").shards.find()
{ "_id" : "shard-a", "host" : "shard-a/localhost:30000,localhost:30001" }
{ "_id" : "shard-b", "host" : "shard-b/localhost:30100,localhost:30101" }

As a shortcut, the listshards command returns the same information:

> use admin
> db.runCommand({listshards: 1})

While we’re on the topic of reporting on sharding configuration, the shell’s sh.status() method nicely summarizes the cluster. Go ahead and try running it now.

12.4.3. Sharding collections

The next configuration step is to enable sharding on a database. This doesn’t do anything on its own, but it’s a prerequisite for sharding any collection within a database. Your application’s database will be called cloud-docs, so you enable sharding like this:

> sh.enableSharding("cloud-docs")

As before, you can check the config data to see the change you just made. The config database holds a collection called databases that contains a list of databases. Each document specifies the database’s primary shard location and whether it’s partitioned (whether sharding is enabled):

> db.getSiblingDB("config").databases.find()
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "cloud-docs", "partitioned" : true, "primary" : "shard-a" }

Now all you need to do is shard the spreadsheets collection. When you shard a collection, you define a shard key. Here you’ll use the compound shard key {username: 1, _id: 1} because it’s good for distributing data and makes it easy to view and comprehend chunk ranges:

> sh.shardCollection("cloud-docs.spreadsheets", {username: 1, _id: 1})

Again, you can verify the configuration by checking the config database for sharded collections:
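A query along the following lines shows the entry for the newly sharded collection (output abridged; exact values such as timestamps will differ):

> db.getSiblingDB("config").collections.find().pretty()
{
  "_id" : "cloud-docs.spreadsheets",
  "lastmod" : ISODate("..."),
  "dropped" : false,
  "key" : { "username" : 1, "_id" : 1 },
  "unique" : false,
  "lastmodEpoch" : ObjectId("...")
}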

Don’t worry too much about understanding all the fields in this document. This is internal metadata that MongoDB uses to track collections, and it isn’t meant to be accessed directly by users.

Sharding an empty collection

This sharded collection definition may remind you of something: it looks a bit like an index definition, especially with its unique key. When you shard an empty collection, MongoDB creates an index corresponding to the shard key on each shard.[7] Verify this for yourself by connecting directly to a shard and running the getIndexes() method. Here you connect to your first shard, and the output contains the shard key index, as expected:

7

If you’re sharding an existing collection, you’ll have to create an index corresponding to the shard key before you run the shardcollection command.
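Connecting to the first shard, that check looks something like this (index versions and field order vary by MongoDB release):

$ mongo localhost:30000
> use cloud-docs
> db.spreadsheets.getIndexes()
[
  {
    "v" : 1,
    "key" : { "_id" : 1 },
    "name" : "_id_",
    "ns" : "cloud-docs.spreadsheets"
  },
  {
    "v" : 1,
    "key" : { "username" : 1, "_id" : 1 },
    "name" : "username_1__id_1",
    "ns" : "cloud-docs.spreadsheets"
  }
]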

Once you’ve sharded the collection, sharding is ready to go. You can now write to the cluster and data will distribute. You’ll see how that works in the next section.

12.4.4. Writing to a sharded cluster

We’ll insert some documents into the sharded cluster so you can observe the formation and movement of chunks, which is the essence of MongoDB’s sharding. The sample documents, each representing a single spreadsheet, will look like this:

{
  _id: ObjectId("4d6f29c0e4ef0123afdacaeb"),
  filename: "sheet-1",
  updated_at: new Date(),
  username: "banks",
  data: "RAW DATA"
}

Note that the data field will contain a 5 KB string to simulate user data.

This book’s source code for this chapter includes a Ruby script you can use to write documents to the cluster. The script takes a number of iterations as its argument, and for each iteration, it inserts one 5 KB document for each of 200 users.

If you have the script on hand, you can run it from the command line with an argument of 1 to insert the initial iteration of 200 documents:

$ ruby load.rb 1
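If you’d rather not use Ruby, a rough mongo-shell equivalent of a single iteration looks something like the following. This is a hypothetical stand-in for the script, not its actual source; run it while connected to the mongos (mongo localhost:40000):

> use cloud-docs
> var data = new Array(5 * 1024).join("x")   // roughly 5 KB of filler text
> for (var n = 0; n < 200; n++) {
    db.spreadsheets.insert({
      filename: "sheet-" + n,
      updated_at: new Date(),
      username: "user-" + n,
      data: data
    });
  }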

Now connect to mongos via the shell. If you query the spreadsheets collection, you’ll see that it contains exactly 200 documents and that they total around 1 MB. You can also query a document, but be sure to exclude the sample data field (you don’t want to print 5 KB of text to the screen):

$ mongo localhost:40000
> use cloud-docs
> db.spreadsheets.count()
200
> db.spreadsheets.stats().size
1019496
> db.spreadsheets.findOne({}, {data: 0})
{
  "_id" : ObjectId("4d6d6b191d41c8547d0024c2"),
  "username" : "Cerny",
  "updated_at" : ISODate("2011-03-01T21:54:33.813Z"),
  "filename" : "sheet-0"
}

Check on the shards

Now you can check out what’s happened sharding-wise. Switch to the config database and check the number of chunks:

> use config
> db.chunks.count()
1

There’s only one chunk so far. Let’s see how it looks:
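Here’s roughly what that single chunk document looks like (identifiers abridged; yours will differ):

> db.chunks.findOne()
{
  "_id" : "cloud-docs.spreadsheets-username_MinKey...",
  "lastmod" : Timestamp(1, 0),
  "lastmodEpoch" : ObjectId("..."),
  "ns" : "cloud-docs.spreadsheets",
  "min" : { "username" : { "$minKey" : 1 }, "_id" : { "$minKey" : 1 } },
  "max" : { "username" : { "$maxKey" : 1 }, "_id" : { "$maxKey" : 1 } },
  "shard" : "shard-a"
}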

Can you figure out what range this chunk represents? If there’s only one chunk, it spans the entire sharded collection. That’s borne out by the min and max fields, which show that the chunk’s range is bounded by $minKey and $maxKey.

minKey and maxKey

$minKey and $maxKey are used in comparison operations as the boundaries of BSON types. BSON is MongoDB’s native data format. $minKey always compares lower than all BSON types, and $maxKey compares greater than all BSON types. Because the value for any given field can contain any BSON type, MongoDB uses these two types to mark the chunk endpoints at the extremities of the sharded collection.

You can see a more interesting chunk range by adding more data to the spreadsheets collection. You’ll use the Ruby script again, but this time you’ll run 100 iterations, which will insert an extra 20,000 documents totaling 100 MB:

$ ruby load.rb 100

Verify that the insert worked:

> db.spreadsheets.count()
20200
> db.spreadsheets.stats().size
103171828

Sample insert speed

Note that it may take several minutes to insert this data into the shard cluster. There are two main reasons for the slowness:

1.  You’re performing a round-trip for each insert, whereas you might be able to perform bulk inserts in a production situation.

2.  Most significantly, you’re running all of the shard’s nodes on a single machine. This places a huge burden on the disk because four of your nodes are being written to simultaneously (two replica set primaries and two replicating secondaries).

Suffice it to say that in a proper production installation, this insert would run much more quickly.

Having inserted this much data, you’ll definitely have more than one chunk. You can check the chunk state quickly by counting the number of documents in the chunks collection:

> use config
> db.chunks.count()
10

You can see more detailed information by running sh.status(). This method prints all of the chunks along with their ranges. For brevity, we’ll only show the first two chunks:
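The output shown here is abridged and approximate—your chunk ranges, ObjectIds, and shard assignments will differ—but it gives the general shape:

> sh.status()
--- Sharding Status ---
  ...
      cloud-docs.spreadsheets
        shard key: { "username" : 1, "_id" : 1 }
        chunks:
          shard-a  5
          shard-b  5
        { "username" : { "$minKey" : 1 }, "_id" : { "$minKey" : 1 } } -->>
          { "username" : "Abdul", "_id" : ObjectId("...") } on : shard-a
        { "username" : "Abdul", "_id" : ObjectId("...") } -->>
          { "username" : "Buettner", "_id" : ObjectId("...") } on : shard-a
        ...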

Seeing data on multiple shards

The picture has definitely changed. As you can see in figure 12.4, you now have 10 chunks. Naturally, each chunk represents a contiguous range of data.

Figure 12.4. The chunk distribution of the spreadsheets collection

You can see in figure 12.4 that shard-a has a chunk that ranges from one of Abdul’s documents to one of Buettner’s documents, just as you saw in our output. This means that all the documents with a shard key that lies between these two values will either be inserted into, or found on, shard-a.[8] You can also see in the figure that shard-b has some chunks too, in particular the chunk ranging from one of Lee’s documents to one of Stewart’s documents, which means any document with a shard key between those two values belongs on shard-b. You could visually scan the sh.status() output to see all the chunks, but there’s a more direct way: running a query on the chunks collection that filters on the name of the shard and counting how many documents would be returned:

8

If you’re following along and running all these examples for yourself, note that your chunk distributions may differ somewhat.

> db.chunks.count({"shard": "shard-a"})
5
> db.chunks.count({"shard": "shard-b"})
5

As long as the cluster’s data size is small, the splitting algorithm dictates that splits happen often. That’s what you see now. This is an optimization that gives you a good distribution of data and chunks early on. From now on, as long as writes remain evenly distributed across the existing chunk ranges, few migrations will occur.

Splits and migrations

Behind the scenes, MongoDB relies on two mechanisms to keep the cluster balanced: splits and migrations.

Splitting is the process of dividing a chunk into two smaller chunks. This happens when a chunk exceeds the maximum chunk size, currently 64 MB by default. Splitting is necessary because chunks that are too large are unwieldy and hard to distribute evenly throughout the cluster.

Migrating is the process of moving chunks between shards. When some shards have significantly more chunks than others, this triggers something called a migration round. During a migration round, chunks are migrated from shards with many chunks to shards with fewer chunks until the cluster is more evenly balanced. As you can imagine, of the two operations, migrating is significantly more expensive than splitting.

In practice, these operations shouldn’t affect you, but it’s useful to be aware that they’re happening in case you run into a performance issue. If your inserts are well-distributed, the data set on all your shards should increase at roughly the same rate, meaning that the number of chunks will also grow at roughly the same rate and expensive migrations will be relatively infrequent.

Now the split threshold will increase. You can see how the splitting slows down, and how chunks start to grow toward their max size, by doing a more massive insert. Try adding another 800 MB to the cluster. Once again, we’ll use the Ruby script, remembering that it inserts about 1 MB on each iteration:

$ ruby load.rb 800

This will take a lot of time to run, so you may want to step away and grab a snack after starting this load process. By the time it’s done, you’ll have increased the total data size by a factor of 8. But if you check the chunking status, you’ll see that there are only around twice as many chunks:

> use config
> db.chunks.count()
21

Given that there are more chunks, the average chunk ranges will be smaller, but each chunk will include more data. For example, the first chunk in the collection spans from Abbott to Bender but it’s already nearly 60 MB in size. Because the max chunk size is currently 64 MB by default, you’d soon see this chunk split if you were to continue inserting data.

Another thing to note is that the distribution still looks pretty even, as it did before:

> db.chunks.count({"shard": "shard-a"})
11
> db.chunks.count({"shard": "shard-b"})
10

Although the number of chunks has increased during the last 800 MB insert round, you can probably assume that no migrations occurred; a likely scenario is that each of the original chunks split in two, with a single extra split somewhere in the mix. You can verify this by querying the config database’s changelog collection:

> db.changelog.count({what: "split"})
20
> db.changelog.find({what: "moveChunk.commit"}).count()
6

This is in line with these assumptions. A total of 20 splits have occurred since the collection was first sharded, yielding 21 chunks, but only 6 migrations have taken place. For an extra-deep look at what’s going on here, you can scan the change log entries. For instance, here’s the entry recording the first chunk move:
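The exact shape of the document varies by version, but a moveChunk.commit entry looks roughly like this (values elided):

> db.changelog.findOne({what: "moveChunk.commit"})
{
  "_id" : "...",
  "server" : "...",
  "clientAddr" : "...",
  "time" : ISODate("..."),
  "what" : "moveChunk.commit",
  "ns" : "cloud-docs.spreadsheets",
  "details" : {
    "min" : { "username" : { "$minKey" : 1 }, "_id" : { "$minKey" : 1 } },
    "max" : { "username" : "...", "_id" : ObjectId("...") },
    "from" : "shard-a",
    "to" : "shard-b"
  }
}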

Here you see the movement of chunks from shard-a to shard-b. In general, the documents you find in the change log are quite readable. As you learn more about sharding and begin prototyping your own shard clusters, the config change log makes an excellent live reference on split and migrate behavior. Refer to it often.

12.5. Querying and indexing a shard cluster

From the application’s perspective, there’s no difference between querying a sharded cluster and querying a single mongod. In both cases, the query interface and the process of iterating over the result set are the same. But behind the scenes, things are different, and it’s worthwhile to understand exactly what’s going on.

12.5.1. Query routing

Imagine you’re querying a sharded cluster. How many shards does mongos need to contact to return a proper query response? If you give it some thought, you’ll see that it depends on whether the shard key is present in the query selector that we pass to find and similar operations. Remember that the config servers (and thus mongos) maintain a mapping of shard key ranges to shards. These mappings are none other than the chunks we examined earlier in the chapter. If a query includes the shard key, then mongos can quickly consult the chunk data to determine exactly which shard contains the query’s result set. This is called a targeted query.

But if the shard key isn’t part of the query, the query planner will have to visit all shards to fulfill the query completely. This is known as a global or scatter/gather query. The diagram in figure 12.5 illustrates both query types.

Figure 12.5. Targeted and global queries against a shard cluster

Figure 12.5 shows a cluster with two shards, two mongos routers, and two application servers. The shard key for this cluster is {username: 1, _id: 1}. We’ll discuss how to choose a good shard key in section 12.6.

To the left of the figure, you can see a targeted query that includes the username field in its query selector. In this case, the mongos router can use the value of the username field to route the query directly to the correct shard.

To the right of the figure, you can see a global or scatter/gather query that doesn’t include any part of the shard key in its query selector. In this case, the mongos router must broadcast the query to both shards.
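Concretely, with the shard key {username: 1, _id: 1} used in this chapter, the two cases look something like the following hypothetical queries:

// Targeted: the selector includes username, a prefix of the shard key,
// so mongos can route the query to the single shard that owns "banks"
> db.spreadsheets.find({username: "banks", filename: "sheet-1"})

// Scatter/gather: no part of the shard key is present,
// so mongos must broadcast the query to every shard
> db.spreadsheets.find({filename: "sheet-1"})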

The effect of query targeting on performance cannot be overstated. If all your queries are global, that means each shard must respond to every single query against your cluster. In contrast, if all your queries are targeted, each shard only needs to handle on average the total number of requests divided by the total number of shards. The implications for scalability are clear.

But targeting isn’t the only thing that affects performance in a sharded cluster. As you’ll see in the next section, everything you’ve learned about the performance of an unsharded deployment still applies to a sharded cluster, but to each shard individually.

12.5.2. Indexing in a sharded cluster

No matter how well-targeted your queries are, they must eventually run on at least one shard. This means that if your shards are slow to respond to queries, your cluster will be slow as well.

As in an unsharded deployment, indexing is an important part of optimizing performance. There are only a few key points to keep in mind about indexing that are specific to a sharded cluster:

  • Each shard maintains its own indexes. When you declare an index on a sharded collection, each shard builds a separate index for its portion of the collection. For example, when you issue the db.spreadsheets.createIndex() command while connected to a mongos router, each shard processes the index creation command individually.
  • It follows that the sharded collections on each shard should have the same indexes. If this ever isn’t the case, you’ll see inconsistent query performance.
  • Sharded collections permit unique indexes on the _id field and on the shard key only. Unique indexes are prohibited elsewhere because enforcing them would require inter-shard communication, which is against the fundamental design of sharding in MongoDB.

Once you understand how queries are routed and how indexing works, you should be in a good position to write smart queries and indexes for your sharded cluster. Most of the advice on indexing and query optimization from chapter 8 will apply.

In the next section, we’ll cover the powerful explain() tool, which you can use to see exactly what path is taken by a query against your cluster.

12.5.3. The explain() tool in a sharded cluster

The explain() tool is your primary way to troubleshoot and optimize queries. It can show you exactly how your query would be executed, including whether it can be targeted and whether it can use an index. The following listing shows an example of what this output might look like.

Listing 12.1. Index and query to return latest documents updated by a user
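A sketch of the index and query behind this listing looks like the following; the username is arbitrary, and the explain() output itself is omitted because it differs between server versions:

> use cloud-docs
> db.spreadsheets.createIndex({username: 1, updated_at: -1})
> db.spreadsheets.find({username: "banks"}).sort({updated_at: -1}).explain()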

You can see from this explain() plan that the query was sent to only one shard, and that when it ran on that shard it used the index we created to satisfy the sort more efficiently. Note that this explain() plan output is from v2.6 and earlier; it changed in v3.0 and later versions. Chapter 8 contains output from the explain() command when used on a MongoDB v3.0 server. Consult the documentation at https://docs.mongodb.org/manual/reference/method/cursor.explain/ for your specific version if you see any fields you don’t understand.

12.5.4. Aggregation in a sharded cluster

It’s worth noting that the aggregation framework also benefits from sharding. The analysis of an aggregation is a bit more complicated than a single query, and may change between versions as new optimizations are introduced. Fortunately, the aggregation framework also has an explain() option that you can use to see details about how your query would perform. As a basic rule of thumb, the number of shards that an aggregation operation needs to contact is dependent on the data that the operation needs as input to complete. For example, if you’re counting every document in your entire database, you’ll need to query all shards, but if you’re only counting a small range of documents, you may not need to query every shard. Consult the current documentation at https://docs.mongodb.org/manual/reference/method/db.collection.aggregate/ for more details.
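For example, an aggregation whose first $match stage includes the shard key can be targeted to a subset of shards, and you can inspect the plan with the explain option (a hypothetical example):

> db.spreadsheets.aggregate(
    [ { $match: { username: "banks" } },
      { $group: { _id: "$username", count: { $sum: 1 } } } ],
    { explain: true }
  )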

12.6. Choosing a shard key

In section 12.3 you saw how the shard key is used to split a collection into logical ranges called chunks, and in section 12.5.1 you saw how the mongos can use this information to figure out where a set of documents might be located.

In this section, we’ll discuss in depth the vitally important process of choosing a shard key. A poorly chosen shard key will prevent your application from taking advantage of many of the benefits provided by sharding. In the pathological case, both insert and query performance will be significantly impaired. Adding to the gravity of the decision is that once you’ve chosen a shard key, you’re stuck with it. Shard keys are immutable.[9]

9

Note that there’s no good way to alter the shard key once you’ve created it. Your best bet is to create a new sharded collection with the proper key, export the data from the old sharded collection, and then restore the data to the new one.

The best way to understand the pitfalls of a bad shard key is to walk through the process of finding a good one step by step and analyze in depth each shard key you consider along the way. That’s exactly what we’ll do now, using the spreadsheet application as an example.

After we find an optimal shard key for our spreadsheet application, at the end of the chapter we’ll consider how our shard key choice would have been different if we’d instead been designing a system to support an email application. This will highlight how much the optimal shard key depends on the specifics of each application.

As we walk through the process of choosing a shard key for our spreadsheet application, you’ll see three main pitfalls:

  • Hotspots —Some shard keys result in situations where all reads or writes are going to a single chunk, on a single shard. This can cause one shard to be completely overwhelmed, while the others are sitting idle.
  • Unsplittable chunks —A shard key that’s too coarse-grained can result in a situation where there are many documents with the same shard key. Because sharding is based on ranges of shard keys, this means that the documents can’t be split into separate chunks, which can limit MongoDB’s ability to evenly distribute data.
  • Poor targeting —Even if writes are distributed perfectly in the cluster, if our shard key doesn’t have some relationship to our queries, we’ll have poor query performance. You saw this in section 12.5.1 when we discussed global and targeted queries.

Now we’ll begin the process of finding the best shard key for our spreadsheet application and see firsthand how each of these situations can come up in practice.

12.6.1. Imbalanced writes (hotspots)

The first shard key you may consider is { "_id" : 1 }, which will shard on the _id field. The _id field may initially seem like a great candidate: it must be present in every document, it has an index by default, you may have it in many of your queries, and it’s automatically generated by MongoDB using the BSON Object ID type, which is essentially a GUID (globally unique identifier).

But there’s one glaring problem with using an Object ID as a shard key: its values are strictly ascending. This means that every new document will have a shard key larger than any other document in the collection. Why is this a problem? If the system is already completely balanced, that means every new document will go to the same chunk, which is the chunk that ranges up to $maxKey. This issue is best understood with an example, as shown in figure 12.6.

Figure 12.6. The ascending shard key causing all writes to go to one chunk.

For simplicity, rather than using a BSON Object ID to illustrate the problem with ascending shard keys, this example uses the document number, which is stored in the field n. As you can see, we’ve already inserted documents 1 to 1000, and we’re about to insert documents 1001, 1002, and 1003. Because MongoDB hadn’t seen anything above 1000 before that point, it had no reason to split the chunk ranging from 1000 to $maxKey. This means that the three new documents all belong in that chunk, and so will all new documents.

To see how this might affect a real application, consider using the ascending BSON Object id field as the shard key for the spreadsheet application. Practically, what this means is that every new spreadsheet that gets created will belong in the same chunk, which means that every new document will need to be written to the single shard responsible for storing that chunk. This is where the performance issue becomes clear. Even if you have a hundred shards, all your writes will be sent to a single one while the rest sit idle. On top of that, MongoDB will do its best to migrate data off the overloaded shard, which will slam it even more.

This effectively nullifies one of sharding’s greatest benefits: the automatic distribution of the insert load across machines.[10] All that said, if you still want to use the _id field as your shard key, you have two options:

10

Note that an ascending shard key shouldn’t affect updates as long as documents are updated randomly.

  • First, you can override the _id field with your own identifier that isn’t ascending. If you choose this approach, however, take care to remember that _id must be unique even in a sharded cluster.
  • Alternatively, you can make your shard key on _id a hashed shard key. This instructs MongoDB to use the output of a hash function on the shard key, rather than using the shard key directly. Because the hash function is designed to have an output that appears randomly distributed, this will ensure that your inserts are evenly distributed throughout the cluster. But this also means that range queries will have to span multiple shards, because even documents with shard keys that are similar may hash to completely different values. A brief sketch of this option follows this list.
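Declaring a hashed shard key looks like this (the collection name here is hypothetical; the spreadsheets collection in this chapter is already sharded on {username: 1, _id: 1}):

> sh.shardCollection("cloud-docs.events", { _id: "hashed" })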
Uniqueness gotchas

MongoDB can only ensure that the shard key is unique. It can’t enforce uniqueness on any other key because one shard can’t check another shard to see if there are duplicates. Surprisingly, this also includes the _id field, which is required to be unique. By default, MongoDB generates the _id field using a BSON Object ID. Great care was taken in the design of this data type to ensure that, statistically speaking, it would be unique across the cluster. Because of this, beware of overriding the _id field yourself. If you don’t properly enforce uniqueness on the client side, you could lose data when two documents with the same _id field are migrated to the same shard.

12.6.2. Unsplittable chunks (coarse granularity)

Now you know that { "_id" : 1 } won’t work as a shard key, but what about { "username" : 1 }? This looks like a good candidate because when we’re querying or updating a spreadsheet, we generally already know what user the spreadsheet belongs to, so we can include the username field in our query and get good targeting. Additionally, sharding on the username field will also lead to relatively well-balanced inserts because the distribution of inserts across users should be relatively even.

There’s just one problem with this field: it’s so coarse that we may end up in a situation where one chunk grows without bound. To see this, imagine that the user “Verch” decides to store 10 GB of spreadsheet data. This would bring the size of the chunk containing the documents with a username of “Verch” well above our 64 MB maximum.

Normally, when a chunk gets too large, MongoDB can split the chunk into smaller pieces and rebalance them across the cluster. However, in this case, there’s no place the chunk can be split, because it already contains only a single shard key value. This causes a number of technical problems, but the end result is that your cluster will become imbalanced, and MongoDB won’t be able to rebalance effectively.

12.6.3. Poor targeting (shard key not present in queries)

After seeing all these problems, you may want to throw up your hands and say “Fine, I’ll just pick a completely random shard key. It’s unique and it’s not ascending.” Although that does solve the problem of write distribution, it isn’t a good shard key if you ever intend to read from your cluster. As you saw in section 12.5.1, queries can only be routed to the correct shard if they include the shard key. If your shard key is completely random, then at the time you’re querying for a document, chances are you’ll have no idea what its shard key is. You probably will know the username you’re looking for, or the ID of the document, which is why those fields were so appealing in the first place. This means that the router will have to resort to a global broadcast, which can hurt performance.

That said, if you’re writing an application where all you need to do is dump a large amount of data and you don’t need selective queries—for example, in an application where you’re collecting sensor data to analyze later in large batches (where you’re processing such a large portion of the data that you’ll need to query all the shards anyway)—a random unique shard key may not be a bad idea. In fact, if you can guarantee uniqueness (by making it a GUID using some library that provides this guarantee) and ensure that it’s not ascending, you can even override the _id field.
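As a rough illustration only (the collection and field names here are hypothetical, not part of the chapter’s sample app), a client-generated random _id might look like the following. A real application should use a driver-side GUID/UUID library that actually guarantees uniqueness; Math.random() is used here purely to show a non-ascending, client-generated value:

// Build a random 32-character hex string to use as a non-ascending _id.
// Math.random() is for illustration only; it does not guarantee uniqueness.
var id = "";
for (var i = 0; i < 32; i++) {
  id += "0123456789abcdef"[Math.floor(Math.random() * 16)];
}
db.sensor_readings.insert({ _id: id, sensor: "s-17", reading: 42.1 });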

12.6.4. Ideal shard keys

So far you’ve seen three things to consider when choosing a shard key: how well reads are targeted, how well writes are distributed, and how effectively MongoDB can split and migrate chunks in your collection. Each key we’ve considered satisfied some of these properties but failed on others. How can you pick a shard key that works for all three? The answer, as you might expect, is to use a compound shard key that provides all these benefits. In this case, that shard key is { username : 1, _id : 1 }. It has good targeting because the username field will often be present in our reads, good write balancing because username values are distributed evenly throughout the alphabet, and it’s fine-grained enough for MongoDB to split chunks because it includes the _id field, which is unique. See figure 12.7 for an example.

Figure 12.7. Splitting a chunk that was previously unsplittable when sharding only on username

Here we can see a split on a chunk that we couldn’t split when we sharded only on the username field. Now, because we include the _id field, we can split all the documents from "Verch" into two separate chunks, which we couldn’t have done if our shard key was { "username" : 1 }.
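For reference, declaring this kind of compound shard key on an as-yet-unsharded collection looks roughly like the following sketch (this mirrors the setup of the sample cluster from section 12.4, not a new set of commands you need to run):

> use cloud-docs
> // Index the shard key, then shard the collection on { username, _id }.
> db.spreadsheets.createIndex({ "username" : 1, "_id" : 1 })
> sh.enableSharding("cloud-docs")
> sh.shardCollection("cloud-docs.spreadsheets", { "username" : 1, "_id" : 1 })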

Indexing matters

This chapter is all about choosing an effective shard key, but it’s important to keep in mind that indexing is still just as important as it was for a single node. As we discussed in section 12.5.2, even if your shard key is perfect and each query is routed to a single shard, if the performance on each shard is poor, your overall performance will still be poor. For this reason, MongoDB requires an index to be created on the shard key before a collection is sharded. There’s no technical reason for this, other than the fact that it’d be a common mistake to leave this out if it wasn’t done automatically. As you design your sharded system, be sure to keep in mind everything you learned about indexing on a single node. It’s just as important in a sharded cluster, and likely more so because you’re operating at a larger scale.

One important factor to consider is index locality, which refers to how close together subsequent inserts fall in the index. Random inserts perform poorly because the entire index must be kept in memory, whereas sequential inserts perform well because only the end of the index needs to be in memory at any given time. This stands in direct contrast to the requirement that shard keys not be ascending, and it illustrates that proper indexing and optimal shard key choice must each be given dedicated attention when designing a sharded system.

Fortunately, the shard key { "username" : 1, "_id" : 1 } that we chose earlier satisfies all the requirements. It’s well-distributed on username, so it’s balanced from the cluster perspective, but it’s ascending on _id, so it’s also ascending for each username, thus providing good index locality.

12.6.5. Inherent design trade-offs (email application)

Sometimes in sharding there are inherent design trade-offs that must be dealt with. Our spreadsheet example was a clean example of sharding because both the writes (document creation) and reads (document viewing) are correlated by username, so the shard key we chose in the previous section provides good performance on both. But what if we have an application where we need to query based on a field that isn’t correlated with our write patterns? For example, what if we’re building an email application where we need to send emails and read inboxes? This may not seem like a problem at first, until you realize that writes are correlated to the sender whereas reads are correlated to the recipient. Because anyone can in theory send an email to anyone else, it’s difficult to predict the relationship between these two fields. In this section, we’ll consider two approaches. First, we’ll take the more straightforward approach and shard based on sender, and then we’ll see how sharding on recipient can change our usage patterns. These two approaches are shown in figure 12.8.

Figure 12.8. Overview of two ways an email application might be sharded

Sharding on sender

Our first tactic may be to shard on sender. This seems to be a sensible choice—each email has exactly one sender. In this case, our write patterns are good: each write will go to one and only one shard. What happens when we need to read a user’s inbox? If we’re sharding based on sender and then piecing together all the emails a user has received, we need to do a full global broadcast query, as shown in figure 12.8, because we have no idea who any given user has received emails from.

Sharding on recipient

Our next tactic may be to shard on recipient. This is a bit more complicated because every email must be written to multiple locations to ensure every recipient gets a copy, as shown in figure 12.8. This unfortunately means that our writes will take longer and put more load on the cluster. But this approach has one nice side effect: the query to read a user’s inbox is well targeted, so reads will return quickly and scale well. A rough sketch of this approach follows.
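Here’s a minimal sketch of what the recipient-sharded approach might look like. The mail database, emails collection, and field names are hypothetical, chosen just to illustrate the fan-out write and the targeted inbox read:

// Shard the hypothetical emails collection on recipient (plus _id, so chunks
// for a very active recipient can still be split).
sh.enableSharding("mail")
sh.shardCollection("mail.emails", { "recipient" : 1, "_id" : 1 })

// Sending one email to three recipients becomes three targeted writes.
var mail = db.getSiblingDB("mail");
["ada@example.com", "carlos@example.com", "dana@example.com"].forEach(function(r) {
  mail.emails.insert({
    recipient: r,
    sender: "verch@example.com",
    subject: "Quarterly numbers",
    sentAt: new Date()
  });
});

// Reading an inbox is a single targeted query.
mail.emails.find({ recipient: "ada@example.com" }).sort({ sentAt: -1 })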

What’s the right approach? That’s a difficult question to answer precisely, but for this application, the second approach is likely better, for two reasons. One is that users may read their inboxes more often than they send emails. The other is that users understand that taking an action such as sending an email requires real work, but they may not realize that reading their inbox does, too. How often have you heard “I can’t even load my inbox!” as if that were somehow the easiest thing for their email application to do? In the end, this is all speculation and, like any real-world application, requires careful measurement and testing to determine what the usage patterns actually are.

What this example illustrates is the fact that your shard key depends on the specifics of your application. There’s no magic bullet shard key that covers all use cases, so remember to think about what your read and write patterns are expected to be.

So far we’ve discussed in great detail the theory behind sharding, as well as how to configure the parameters in a way that will optimize the performance of your system. In the next section, we’ll get to the details of what else you might want to think about when setting up a sharded cluster in the real world.

12.7. Sharding in production

Earlier in this chapter, we created a fully functional sharded cluster, all on one machine. Although this is great for testing, it’d be a terrible way to deploy sharding in production.

Fortunately, that’s exactly what we’ll cover in this next section. We’ll look at the three general categories of production concerns and the order in which they arise. The first, provisioning, is how to best allocate machines and resources for use by MongoDB processes. The next, deployment, consists of things that you need to think about before you start running this cluster in production for the first time, and the last, maintenance, is how to keep your new cluster running and healthy.

12.7.1. Provisioning

The first thing to consider when thinking about how to deploy your system is provisioning, or how to allocate resources and machines to MongoDB.

Deployment topologies

To launch the sample MongoDB shard cluster, you had to start a total of nine processes (three mongods for each of the two shard replica sets, plus three config servers). That’s a potentially frightening number. First-time users might assume that running a two-shard cluster in production would require nine separate machines. Fortunately, fewer are needed. You can see why by looking at the expected resource requirements for each component of the cluster.

Consider first the replica sets. Each replicating member contains a complete copy of the data for its shard and may run as a primary or secondary node. These processes will always require enough disk space to store their copy of the data, and enough RAM to serve that data efficiently. Thus, replicating mongods are the most resource-intensive processes in a shard cluster and must be given their own machines.

What about replica set arbiters? These processes are like any other member of the replica set, except they don’t store any data besides replica set config data, which is minimal. Hence, arbiters incur little overhead and certainly don’t need their own servers.

Next are the config servers. These also store a relatively small amount of data. For instance, the data on the config servers managing the sample cluster totaled only about 30 KB. If you assume that this data grows linearly with shard cluster data size, then a 1 TB shard cluster might swell the config servers’ data size to a mere 30 MB.[11] This means that config servers don’t necessarily need their own machines, either. But given the critical role played by the config servers, some users prefer to place them on a few modestly provisioned machines (or virtual instances).

11

That’s a highly conservative estimate. The real value will likely be far smaller.

From what you already know about replica sets and shard clusters, you can construct a list of minimum deployment requirements:

  • Each member of a replica set, whether it’s a complete replica or an arbiter, needs to live on a machine distinct from the other members of its set.
  • Every replicating replica set member needs its own machine.
  • Replica set arbiters are lightweight enough to share a machine with another process. Refer back to chapter 10 for more on arbiters.
  • Config servers can optionally share a machine. The only hard requirement is that all config servers in the config cluster reside on distinct machines.

Satisfying these rules might feel like tackling a logic problem. Fortunately, we’ll apply them now by looking at two reasonable deployment topologies for the sample two-shard cluster. The first requires only four machines. The process layout is illustrated in figure 12.9.

Figure 12.9. A two-shard cluster deployed across four machines

This configuration satisfies all the deployment rules just mentioned. Predominant on each machine are the replicating nodes of each shard. The remaining processes are arranged so that all three config servers and all members of each replica set live on different machines. As for fault tolerance, this topology will tolerate the failure of any one machine. No matter which machine fails, the cluster will continue to process reads and writes. If the failing machine happens to host one of the config servers, all chunk splits and migrations will be suspended.[12] Fortunately, suspending sharding operations doesn’t prevent the cluster from servicing operations; splitting and migrating can wait until the lost machine is recovered.

12

All three config servers need to be online for any sharding operations to take place.

That’s the minimum recommended setup for a two-shard cluster. But applications demanding the highest availability and the fastest paths to recovery will need something more robust. As discussed in the previous chapter, a replica set consisting of two replicas and one arbiter is vulnerable while recovering. Having three nodes reduces the fragility of the set during recovery and also allows you to keep a node in a secondary datacenter for disaster recovery. Figure 12.10 shows a robust two-shard cluster topology. Each shard consists of a three-node replica set, where each node contains a complete copy of the data. For disaster recovery, one config server and one node from each shard are located in a secondary datacenter; to ensure that those nodes never become primary, they’re given a priority of 0.

Figure 12.10. A two-shard cluster deployed across six machines and two datacenters

With this configuration, each shard is replicated twice, not just once. Additionally, the secondary datacenter has all the data necessary for a user to completely reconstruct the shard cluster in the event of the failure of the first datacenter.

The decision about which sharding topology is best for your application should always be based on serious considerations about how much downtime you can tolerate, as measured by your recovery time objective (RTO). Think about the potential failure scenarios and simulate them. Consider the consequences for your application (and business) if a datacenter should fail.

12.7.2. Deployment

Now that you’ve settled on the topology of your cluster, it’s time to discuss the actual deployment and configuration.

Fortunately, configuration of a production cluster follows exactly the same steps that we took to configure our example cluster in section 12.4. Here, we’ll focus on the additional variables that must be considered in a production deployment.

Adding new shards

Users frequently want to know how many shards to deploy and how large each shard should be. Naturally, each additional shard introduces extra complexity, and each shard also requires replicas. Thus it’s better to have a small number of large shards than a large number of small ones. But the question remains, how large can each shard be in practice?

The answer, of course, depends on the circumstances. In fact, the same concepts we discussed in section 12.1.2 for how to know when to shard in the first place apply here as well. Knowing when a single replica set—or, in this case, a shard—is at capacity is a matter of understanding the requirements of your application. Once you reach a point where your application requirements exceed the capacity of the shards you have or plan to have in your cluster, that’s when you need a new shard. As always, make sure you add enough shards before your cluster grinds to a halt, or MongoDB may not be able to rebalance your data quickly enough.

Sharding an existing collection

You can shard existing collections, but don’t be surprised if it takes some time to distribute the data across shards. Only one balancing round can happen at a time, and the migrations will move only around 100–200 MB of data per minute. Thus, sharding a 50 GB collection will take around eight hours, and this will likely involve some moderate disk activity. In addition, when you initially shard a large collection like this, you may have to split manually to expedite the sharding process because splitting is triggered by inserts.

Given this, it should be clear that sharding a collection at the last minute isn’t a good response to a performance problem. If you plan on sharding a collection at some point in the future, you should do so well in advance of any anticipated performance degradation.

Presplitting chunks for initial load

If you have a large data set that you need to load into a sharded collection, and you know something about the distribution of the data, then you can save a lot of time by presplitting and then premigrating chunks. For example, suppose you wanted to import the spreadsheet data into a fresh MongoDB shard cluster. You can ensure that the data distributes evenly upon import by first splitting and then migrating chunks across shards. You can use the split and moveChunk commands to accomplish this. These are aliased by the sh.splitAt() (or sh.splitFind()) and sh.moveChunk() helpers, respectively.

Here’s an example of a manual chunk split. You issue the split command, specify the collection you want, and then indicate a split point:

> sh.splitAt( "cloud-docs.spreadsheets",
  { "username" : "Chen", "_id" : ObjectId("4d6d59db1d41c8536f001453") })

When run, this command will locate the chunk that logically contains the document where username is Chen and _id is ObjectId("4d6d59db1d41c8536f001453").[13] It then splits the chunk at that point, which results in two chunks. You can continue splitting like this until you have a set of chunks that nicely distribute the data. You’ll want to make sure that you’ve created enough chunks to keep the average chunk size well within the 64 MB split threshold. If you expect to load 1 GB of data, you should plan to create around 20 chunks.

13

Note that such a document need not exist. That should be clear from the fact that you’re splitting chunks on an empty collection.

The second step is to ensure that all shards have roughly the same number of chunks. Because all chunks will initially reside on the same shard, you’ll need to move them. Each chunk can be moved using the moveChunk command. The helper method simplifies this:

> sh.moveChunk("cloud-docs.spreadsheets", {username: "Chen"}, "shardB")

This says that you want to move the chunk that logically would contain the document {username: "Chen"} to shard B.
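Putting the two steps together, a presplitting session might look roughly like the following sketch. The split points and shard names are placeholders; you’d substitute boundaries that reflect the actual distribution of your usernames and the shard names reported by sh.status():

// Split at the lower bound of each chosen username (MinKey marks the start
// of that username's _id range), then spread the chunks round-robin.
var splitPoints = ["Chen", "Gonzalez", "Okafor", "Suzuki"];
splitPoints.forEach(function(name) {
  sh.splitAt("cloud-docs.spreadsheets", { username: name, _id: MinKey });
});

var shards = ["shardA", "shardB"];
splitPoints.forEach(function(name, i) {
  sh.moveChunk("cloud-docs.spreadsheets", { username: name }, shards[i % shards.length]);
});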

12.7.3. Maintenance

We’ll round out this chapter with a few words about sharding maintenance and administration. Note that much of this can be done using MongoDB’s official monitoring and automation tools, which we’ll discuss in chapter 13, but here we explore the fundamentals of what’s happening in the cluster, because that’s still important to know. The MongoDB automation uses a lot of these commands under the hood to implement its functionality.

Monitoring

A shard cluster is a complex piece of machinery, and you should monitor it closely. The serverStatus and currentOp() commands can be run on any mongos, and their output will reflect aggregate statistics across shards. We’ll discuss these commands in more detail in the next chapter.

In addition to aggregating server statistics, you’ll want to keep an eye on the distribution of chunks and on individual chunk sizes. As you saw in the sample cluster, all of this information is stored in the config database. If you ever detect unbalanced chunks or unchecked chunk growth, you can use the split and moveChunk commands to address these issues. Alternatively, you can consult the logs to see whether the balancing operation has halted for some reason.
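For example, a quick way to eyeball the chunk distribution for a collection is to group the config.chunks collection by shard; sh.status() prints a similar summary:

> use config
> // Count how many chunks of the spreadsheets collection live on each shard.
> db.chunks.aggregate([
    { $match: { ns: "cloud-docs.spreadsheets" } },
    { $group: { _id: "$shard", chunkCount: { $sum: 1 } } }
  ])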

Manual partitioning

There are a couple of cases where you may want to manually split and migrate chunks on a live shard cluster. For example, as of MongoDB v2.6, the balancer doesn’t directly take into account the load on any one shard. Obviously, the more a shard is written to, the larger its chunks become, and the more likely they are to eventually migrate. Nevertheless, it’s not hard to imagine situations where you’d be able to alleviate load on a shard by migrating chunks. This is another situation where the moveChunk command can be helpful.

Adding a shard

If you’ve determined that you need more capacity, you can add a new shard to an existing cluster using the same method you used earlier:

sh.addShard("shard-c/rs1.example.net:27017,rs2.example.net:27017")

When adding capacity in this way, be realistic about how long it will take to migrate data to the new shard. As stated earlier, you can expect data to migrate at a rate of 100–200 MB per minute. This means that if you need to add capacity to a sharded cluster, you should do so long before performance starts to degrade. To determine when you need to add a new shard, consider the rate at which your data set is growing. Obviously, you’ll want to keep indexes and working set in RAM. A good rule of thumb is to plan to add a new shard at least several weeks before the indexes and working set on your existing shards reach 90% of RAM.
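One rough way to keep track of this, sketched here rather than any official formula, is to compare each shard’s index size, as reported by db.stats(), against the RAM available on that shard’s machines, leaving headroom for the working set:

> // Run against a primary of one shard (not through mongos) so the numbers
> // reflect that shard alone; compare against the 90% rule of thumb above.
> use cloud-docs
> var stats = db.stats(1024 * 1024 * 1024)   // scale sizes to GB
> stats.indexSize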

If you’re not willing to play it safe, as described here, then you open yourself up to a world of pain. Once your indexes and working set don’t fit in RAM, your application can come to a halt, especially if the application demands high write and read throughput. The problem is that the database will have to page to and from the disk, which will slow reads and writes, backlogging operations that can’t be served into a read/write queue. At that point, adding capacity is difficult because migrating chunks between shards adds read load to existing shards. Obviously, when a database is overloaded, the last thing you want to do is add load.

All of this is to emphasize that you should monitor your cluster and add capacity well before you need to.

Removing a shard

You may, in rare cases, want to remove a shard. You can do so using the removeshard command:

> use admin
> db.runCommand({removeshard: "shard-1/localhost:30100,localhost:30101"})
{
  "msg" : "draining started successfully",
  "state" : "started",
  "shard" : "shard-1-test-rs",
  "ok" : 1 }

The command response indicates that chunks are now being drained from the shard to be relocated to other shards. You can check the status of the draining process by running the command again:

> db.runCommand({removeshard: "shard-1/localhost:30100,localhost:30101"})
{
  "msg" : "draining ongoing",
  "state" : "ongoing",
  "remaining" : {
    "chunks" : 376,
    "dbs" : 3
  },
  "ok" : 1 }

Once the shard is drained, you also need to make sure that no database’s primary shard is the shard you’re going to remove. You can check database shard membership by querying the config.databases collection:

> use config
> db.databases.find()
 { "_id" : "admin", "partitioned" : false, "primary" : "config" }
 { "_id" : "cloud-docs", "partitioned" : true, "primary" : "shardA" }
 { "_id" : "test", "partitioned" : false, "primary" : "shardB" }

Here the cloud-docs database is owned by shardA and the test database is owned by shardB. Because you’re removing shardB, you need to change the test database’s primary shard. You can accomplish that with the moveprimary command:

> db.runCommand({moveprimary: "test", to: "shard-0-test-rs" });

Run this command for each database whose primary is the shard to be removed. Then run the removeshard command again to verify that the shard is completely drained:

> db.runCommand({removeshard: "shard-1/localhost:30100,localhost:30101"})
{ "msg": "remove shard completed successfully",
  "stage": "completed",
  "host": "localhost:30100",
  "ok" : 1
}

Once you see that the removal is completed, it’s safe to take the removed shard offline.

Unsharding a collection

Although you can remove a shard, there’s no official way to unshard a collection. If you do need to unshard a collection, your best option is to dump the collection and then restore the data to a new collection with a different name.[14] You can then drop the sharded collection you dumped. For example, suppose that foo is a sharded collection. First, dump foo by connecting to mongos with mongodump:

14

The utilities you use to dump and restore, mongodump and mongorestore, are covered in the next chapter.

$ mongodump -h localhost --port 40000 -d cloud-docs -c foo
connected to: localhost:40000
DATABASE: cloud-docs   to   dump/cloud-docs
  cloud-docs.foo to dump/cloud-docs/foo.bson
     100 objects

This will dump the collection to a file called foo.bson. You can then restore that file using mongorestore:

$ mongorestore -h localhost --port 40000 -d cloud-docs -c bar dump/cloud-docs/foo.bson
Tue Mar 22 12:06:12 dump/cloud-docs/foo.bson
Tue Mar 22 12:06:12    going into namespace [cloud-docs.bar]
Tue Mar 22 12:06:12    100 objects found

Once you’ve moved the data into the new, unsharded collection, you’re free to drop the old sharded collection, foo. Be extra careful when dropping collections, though: above all, make sure you’re dropping the correct collection!
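The final step, then, would look like a simple drop through mongos, after double-checking the namespace:

> use cloud-docs
> // Confirm this is the old sharded collection before dropping it.
> db.foo.drop()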

Backing up a sharded cluster

As you’ll see in chapter 13, there are a few different options for backing up MongoDB. For the most part, these strategies also apply to backing up each shard in a sharded cluster. But there are two additional steps that must be taken when backing up a sharded cluster, regardless of which method you’re using to back up the shards:

  • Disable chunk migrations —The first thing to keep in mind when backing up a sharded cluster is that chunk migrations may be occurring. This means that unless you back up everything at exactly the same instant in time, which is essentially impossible, you may end up missing some data. We’ll cover exactly how to disable chunk migrations in this section.
  • Config server metadata —When backing up a sharded cluster, the config server metadata must also be backed up. To do this, perform a backup of a single config server node, because all config servers should have the same data. Like the backup of the shards, this should also be done after chunk migrations are disabled to avoid missing data. (A sketch of such a backup follows this list.)
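For the config metadata, a sketch of a dump taken from one config server might look like the following. The host, port, and output path are placeholders for your deployment, and this assumes the balancer has already been stopped as described next:

$ mongodump --host config1.example.net --port 27019 -d config -o /backups/config-metadata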

Fortunately, there’s a built-in mechanism to disable automatic chunk migrations. All migration of data between shards is handled by something called the balancer process. Once you stop this process, you’re guaranteed that no automatic migrations are happening. You can still trigger migrations manually or create new databases, however, which would disrupt a proper backup, so be sure you have no other processes running that do administrative operations.

Stopping and starting the balancer

To disable the balancer, use the sh.stopBalancer() shell helper:

> use config
> sh.stopBalancer()

Note that this may take a long time to complete. This is because this helper only marks the balancer as disabled, and doesn’t abort existing balancing rounds. This means it has to wait for in-progress balancing rounds to complete. Once it returns successfully, you can be sure that no chunk migrations are in progress. To start the balancer again, you can use the sh.startBalancer() helper:

> use config
> sh.startBalancer()
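Before starting the backup itself, you can double-check that the balancer is both disabled and idle using two shell helpers:

> sh.getBalancerState()      // false once the balancer has been disabled
false
> sh.isBalancerRunning()     // false once any in-progress round has finished
false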

You should consult the MongoDB docs for additional balancer configuration, which includes a setting to enable the balancer only in a specified time window.

Failover and recovery

Although we’ve covered general replica set failures, it’s also important to note a sharded cluster’s potential points of failure along with best practices for recovery.

Failure of a shard member

Each shard consists of a replica set. If any member of one of these replica sets fails, a secondary member will be elected primary, and the mongos process will automatically connect to it. Chapter 11 describes the specific steps to take in restoring a failed replica set member. The method you choose depends on how the member has failed, but regardless, the instructions are the same whether or not the replica set is part of a sharded cluster.

Failure of a config server

A sharded cluster requires three config servers for normal operation, but up to two of these can fail. Whenever you have fewer than three config servers, your remaining config servers will become read-only, and all splitting and balancing operations will be suspended. Note that this won’t negatively affect the cluster as a whole. Reads and writes to the cluster will still work, and the balancer will start from where it left off once all three config servers are restored.

To restore a config server, copy the data files from an existing config server to the failed config server’s machine. Then restart the server.[15]

15

As always, before you copy any data files, make sure you either lock the mongod (as described in chapter 11) or shut it down cleanly. Never copy data files while the server is live.
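A restore, then, is essentially a file copy plus a restart. As a sketch (hostnames, paths, and the conventional config server port 27019 are placeholders for your own deployment):

$ # On the surviving config server, after locking or cleanly shutting down mongod:
$ rsync -a /data/configdb/ failed-config.example.net:/data/configdb/
$ # Then, on the recovered machine:
$ mongod --configsvr --dbpath /data/configdb --port 27019 --fork --logpath /var/log/configdb.log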

Failure of a mongos

The failure of a mongos process is nothing to worry about. If you’re hosting mongos on an application server and mongos fails, it’s likely that your application server has failed, too. Recovery in this case simply means restoring the server.

Regardless of how mongos fails, the process has no state of its own. This means that recovering a mongos is a matter of restarting the mongos process and pointing it at the config servers.
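For example, a restart might look roughly like this; the hostnames are placeholders, and the --configdb list must match the one used by the rest of the cluster:

$ mongos --configdb config1.example.net:27019,config2.example.net:27019,config3.example.net:27019 \
    --port 40000 --fork --logpath /var/log/mongos.log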

12.8. Summary

Sharding is an effective strategy for maintaining high read and write performance on large data sets. MongoDB’s sharding works well in numerous production deployments and can work for you, too. Instead of having to worry about implementing your own half-baked, custom sharding solution, you can take advantage of all the effort that’s been put into MongoDB’s sharding mechanism. If you follow the advice in this chapter, paying particular attention to the recommended deployment topologies, the strategies for choosing a shard key, and the importance of keeping data in RAM, then sharding will serve you well.
