Chapter 8. Indexing and query optimization

This chapter covers

  • Basic indexing concepts and theory
  • Practical advice for managing indexes
  • Using compound indexes for more complex queries
  • Optimizing queries
  • All the MongoDB indexing options

Indexes are enormously important. With the right indexes in place, MongoDB can use its hardware efficiently and serve your application’s queries quickly. But the wrong indexes produce the opposite result: slow queries, slow writes, and poorly utilized hardware. It stands to reason that anyone wanting to use MongoDB effectively must understand indexing.

But for many developers, indexes are a topic shrouded in mystery. This need not be the case. Once you’ve finished this chapter, you should have a good mental model for thinking clearly about indexes. To introduce the concepts of indexing, we’ll begin with a modest thought experiment. We’ll then explore some core indexing concepts and provide an overview of the B-tree data structure underlying MongoDB indexes.

Then it’s on to indexing in practice. We’ll discuss unique, sparse, and multikey indexes, and provide a number of pointers on index administration. Next, we’ll delve into query optimization, describing how to use explain() and work harmoniously with the query optimizer.

In versions 2.0, 2.4, 2.6, and 3.0, MongoDB gained more advanced indexing techniques. Most queries only require indexes that match a field’s value or a range of values. But you may also want to run a query for words that are similar to a given word. This requires a text index, which is covered in chapter 9. Or perhaps you’d like to use a spatial index to find documents with latitude and longitude values near a given point. This chapter is intended to give you a good understanding of indexing fundamentals so that you’ll be able to create indexes and use them effectively to optimize your queries.

8.1. Indexing theory

We’ll proceed gradually, beginning with an extended analogy and ending with an exposition of some of MongoDB’s key implementation details. Along the way, we’ll define and provide examples of a number of important terms. If you’re not familiar with compound-key indexes, virtual memory, and index data structures, you should find this section helpful.

8.1.1. A thought experiment

To understand indexing, you need a picture in your head. Imagine a cookbook. And not just any cookbook—a massive cookbook: 5,000 pages long with the most delicious recipes for every occasion, cuisine, and season, with all the good ingredients you might find at home. This is the cookbook to end them all. Let’s call it The Cookbook Omega.

Although this might be the best of all possible cookbooks, there are two tiny problems with The Cookbook Omega. The first is that the recipes are in random order. On page 3,475 you have Australian Braised Duck, and on page 2 you’ll find Zacatecan Tacos.

That would be manageable, were it not for the second problem: The Cookbook Omega has no index.

Here’s the first question to ask yourself: with no index, how do you find the recipe for Rosemary Potatoes in The Cookbook Omega? Your only choice is to scan through every page of the book until you find the recipe. If the recipe is on page 3,973, that’s how many pages you have to look through. In the worst case, where the recipe is on the last page, you have to look at every single page.

That would be madness. The solution is to build an index.

A simple index

There are several ways you can imagine searching for a recipe, but the recipe’s name is probably a good place to start. If you create an alphabetical listing of each recipe name followed by its page number, you’ll have indexed the book by recipe name. A few entries might look like this:

  • Tibetan Yak Soufflé: 45
  • Toasted Sesame Dumplings: 4,011
  • Turkey à la King: 943

As long as you know the name of the recipe (or even the first few letters of that name), you can use this index to quickly find any recipe in the book. If that’s the only way you expect to search for recipes, your work is done.

But this is unrealistic because you can also imagine wanting to find recipes based on, say, the ingredients you have in your pantry. Or perhaps you want to search by cuisine. For those cases, you need more indexes.

Here’s a second question: with only one index on the recipe name, how do you find all the cauliflower recipes? Again, lacking the proper indexes, you’d have to scan the entire book, all 5,000 pages. This is true for any search on ingredients or cuisine.

You need to build another index, this time on ingredients. In this index, you have an alphabetical listing of ingredients, each pointing to all the page numbers of recipes containing that ingredient. The most basic index on ingredients would look like this:

  • Cashews: 3; 20; 42; 88; 103; 1,215...
  • Cauliflower: 2; 47; 88; 89; 90; 275...
  • Currants: 1,001; 1,050; 2,000; 2,133...

Is this the index you thought you were going to get? Is it even helpful?

A compound index

This index is good if all you need is a list of recipes for a given ingredient. But if you want to include any other information about the recipe in your search, you still have some scanning to do—once you know the page numbers where cauliflower is referenced, you then need to go to each of those pages to get the name of the recipe and what type of cuisine it is. This is better than paging through the whole book, but you can do better.

For example, imagine that you randomly discovered a great cauliflower recipe in The Cookbook Omega several months ago but you’ve forgotten its name; you suspect that you’ll recognize it when you see it. As of now, you have two indexes, one on recipe name and the other on ingredients. Can you think of a way to use these two indexes in combination to find your long-lost cauliflower recipe?

In fact, this is impossible. If you start with the index on recipe name, but don’t remember the name of the recipe, then searching this index is little better than paging through the entire book. If you start with the index on ingredients, then you’ll have a list of page numbers to check, but those page numbers can in no way be plugged into the index on recipe name. Therefore, you can only use one index in this case, and it happens that the one on ingredients is more helpful.

One index per query

Users commonly believe that a query on two fields can be resolved using two separate indexes on those fields. An algorithm exists for this: look up the page numbers in each index matching each term, and then scan the intersection of those pages for the individual recipes matching both terms. A number of pages won’t match, but you’ll still narrow down the total number of scanned items. Index intersection was first supported in version 2.6 of MongoDB. Note that whether a compound index or an index intersection is more efficient depends on the particular query and the system. Also keep in mind that the database will use a single index per query when possible; if you’ll be querying on more than one field frequently, make sure a compound index for those fields exists.
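
As a concrete sketch (the recipes collection and its fields here are hypothetical stand-ins for the analogy), two single-key indexes like the following are what MongoDB 2.6+ may intersect, although a compound index is usually the better choice for a query on both fields:

db.recipes.createIndex({ingredients: 1})
db.recipes.createIndex({name: 1})
db.recipes.find({ingredients: 'cauliflower', name: 'Cauliflower Gratin'})

Running explain() on such a query reveals which plan the optimizer actually chose.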

What can you do? Happily, there’s a solution to the long-lost cauliflower recipe, and its answer lies in the use of compound indexes.

The two indexes you’ve created so far are single-key indexes: they both order only one key from each recipe. You’re going to build yet another index for The Cookbook Omega, but this time, instead of using one key per index, you’ll use two. Indexes that use more than one key like this are called compound indexes.

This compound index uses both ingredients and recipe name, in that order. You’ll notate the index like this: ingredient-name. Part of this index would look like what you see in figure 8.1.

Figure 8.1. A compound index inside a cookbook

The value of this index for a human is obvious. You can now search by ingredient and probably find the recipe you want, even if you remember only the initial part of the name. For a machine, it’s still valuable for this use case and will keep the database from having to scan every recipe name listed for that ingredient. This compound index would be especially useful if, as with The Cookbook Omega, there were several hundred (or thousand) cauliflower recipes. Can you see why?

One thing to notice: with compound indexes, order matters. Imagine the reverse compound index on name-ingredient. Would this index be interchangeable with the compound index we just explored?

Definitely not. With the new index, once you have the recipe name, your search is already limited to a single recipe; a single page in your cookbook. If this index were used on a search for the recipe Cashew Marinade and the ingredient Bananas, the index could confirm that no such recipe exists. But this use case is the opposite one: you know the ingredient, but not the recipe name.

The cookbook now has three indexes: one on recipe name, one on ingredient, and one on ingredient-name. This means that you can safely eliminate the single-key index on ingredient. Why? Because a search on a single ingredient can use the index on ingredient-name. That is, if you know the ingredient, you can traverse this compound index to get a list of all page numbers containing said ingredient. Look again at the sample entries for this index to see why this is so.

Indexing rules

The goal of this section was to present an extended metaphor to provide you with a better mental model of indexes. From this metaphor, you can derive a few simple concepts:

1.  Indexes significantly reduce the amount of work required to fetch documents. Without the proper indexes, the only way to satisfy a query is to scan all documents linearly until the query conditions are met. This frequently means scanning entire collections.

2.  Only one single-key index will be used to resolve a query.[1] For queries containing multiple keys (say, ingredient and recipe name), a compound index containing those keys will best resolve the query.

1

One exception is queries using the $or operator (and, since MongoDB 2.6, index intersection). As a general rule, though, resolving a query with multiple single-key indexes isn’t possible, or even desirable, in MongoDB.

3.  An index on ingredient can and should be eliminated if you have a second index on ingredient-name. More generally, if you have a compound index on a-b, then a second index on a alone will be redundant, but not one on b.

4.  The order of keys in a compound index matters.

Bear in mind that this cookbook analogy can be taken only so far. It’s a model for understanding indexes, but it doesn’t fully correspond to the way MongoDB’s indexes work. In the next section, we’ll elaborate on the rules of thumb just presented, and we’ll explore indexing in MongoDB in detail.

8.1.2. Core indexing concepts

The preceding thought experiment hinted at a number of core indexing concepts. Here and throughout the rest of the chapter, we’ll unpack those ideas.

Single-key indexes

With a single-key index, each entry in the index corresponds to a single value from each of the documents indexed. The default index on _id is a good example of a single-key index. Because this field is indexed, each document’s _id also lives in an index for fast retrieval by that field.

Compound-key indexes

Although, starting with MongoDB 2.6, more than one index can be used for a query (via index intersection), it’s usually best if a query uses only a single index. But you often need to query on more than one attribute, and you want such a query to be as efficient as possible. For example, imagine that you’ve built two indexes on the products collection from this book’s e-commerce example: one index on manufacturer and another on price. In this case, you’ve created two entirely distinct data structures that, when traversed, are ordered like the lists you see in figure 8.2.

Figure 8.2. Single-key index traversal

Now, imagine your query looks like this:

db.products.find({
    'details.manufacturer': 'Acme',
    'pricing.sale': {
      $lt: 7500
    }
  })

This query says to find all Acme products costing less than $75.00. If you issue this query with single-key indexes on manufacturer and price, only one of them will be used. The query optimizer will pick the more efficient of the two, but neither will give you an ideal result. To satisfy the query using these indexes, you’d have to traverse each structure separately and then grab the list of disk locations that match and calculate their intersection.

A compound index is a single index where each entry is composed of more than one key. If you were to build a compound-key index on manufacturer and price, the ordered representation would look like figure 8.3.

Figure 8.3. Compound-key index traversal

To fulfill your query, the query optimizer only needs to find the first entry in the index where manufacturer is Acme and price is 7500. From there, the results can be retrieved with a simple scan of the successive index entries, stopping when the value of the manufacturer no longer equals Acme.
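
For reference, the compound index shown in figure 8.3 could be declared like this, using the same field names as the query:

db.products.createIndex({'details.manufacturer': 1, 'pricing.sale': 1})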

There are two things you should notice about the way this index and query work together. The first is that the order of the index’s keys matters. If you’d declared a compound index where price was the first key and manufacturer the second, then your query would’ve been far less efficient. Can you see why? Take a look at the structure of the entries in such an index in figure 8.4.

Figure 8.4. A compound-key index with the keys reversed

Keys must be compared in the order in which they appear. Unfortunately, this index doesn’t provide an easy way to jump to all the Acme products, so the only way to fulfill your query would be to look at every product whose price is less than $75.00 and then select only those products made by Acme. To put this in perspective, imagine that your collection had a million products, all priced under $100.00 and evenly distributed by price. Under these circumstances, fulfilling your query would require that you scan 750,000 index entries. By contrast, using the original compound index, where manufacturer precedes price, the number of entries scanned would be the same as the number of entries returned. This is because once you’ve arrived at the entry for (Acme - 7500), it’s a simple, in-order scan to serve the query.

The order of keys in a compound index matters. If that seems clear, the second thing you should understand is why we’ve chosen the first ordering over the second. This may be obvious from the diagrams, but there’s another way to look at the problem. Look again at the query: the two query terms specify different kinds of matches. On manufacturer, you want to match the term exactly. But on price, you want to match a range of values, beginning with 7500. As a general rule, a query where one term demands an exact match and another specifies a range requires a compound index where the range key comes second. We’ll revisit this idea in the section on query optimization.

Index efficiency

Although indexes are essential for good query performance, each new index imposes a small maintenance cost. Whenever you add a document to a collection, each index on that collection must be modified to include the new document. If a particular collection has 10 indexes, that makes 10 separate structures to modify on each insert, in addition to writing the document itself. This holds for any write operation, whether you’re removing a document, relocating a document because the allocated space isn’t enough, or updating a given document’s indexed keys.

For read-intensive applications, the cost of indexes is almost always justified. Just realize that indexes do impose a cost and that they therefore must be chosen with care. This means ensuring that all your indexes are used and that none are redundant. You can do this in part by profiling your application’s queries; we’ll describe this process later in the chapter.

But there’s a second consideration. Even with all the right indexes in place, it’s still possible that those indexes won’t result in faster queries. This occurs when indexes and a working data set don’t fit in RAM.

You may recall from chapter 1 that MongoDB tells the operating system to map all data files to memory using the mmap() system call when the MMAPv1 default storage engine is used. As you’ll learn in chapter 10, the WiredTiger storage engine works differently. From this point on, the data files, which include all documents, collections, and their indexes, are swapped in and out of RAM by the operating system in 4 KB chunks called pages.[2] Whenever data from a given page is requested, the operating system must ensure that the page is available in RAM. If it’s not, a kind of exception known as a page fault is raised, and this tells the memory manager to load the page from disk into RAM.

2

The 4 KB page size is standard but not universal.

With sufficient RAM, all the data files in use will eventually be loaded into memory. Whenever that memory is altered, as in the case of a write, those changes will be flushed to disk asynchronously by the OS. The write, however, will be fast because it occurs directly in RAM; thus the number of disk accesses is reduced to a minimum. But if the working data set can’t fit into RAM, page faults will start to creep up. This means that the operating system will be going to disk frequently, greatly slowing read and write operations. In the worst case, as data size becomes much larger than available RAM, a situation can occur where for any read or write, data must be paged to and from disk. This is known as thrashing, and it causes performance to take a severe dive.

Fortunately, this situation is relatively easy to avoid. At a minimum, you need to make sure that your indexes will fit in RAM. This is one reason why it’s important to avoid creating any unneeded indexes. With extra indexes in place, more RAM will be required to maintain those indexes. Along the same lines, each index should have only the keys it needs. A triple-key compound index might be necessary at times, but be aware that it’ll use more space than a simple single-key index. One example of where it might be valuable to create an index with more than one or two fields is if you can create a covering index for a frequent query. A covering index is one where the entire query can be satisfied from reading only the index, making queries very fast. Covering indexes are discussed more at the end of section 8.3.
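
Here’s a brief, hypothetical sketch of a covering index (the field names are assumed for illustration): if a frequent query filters on username and returns only username and email, an index containing both fields can satisfy it without touching the documents themselves:

db.users.createIndex({username: 1, email: 1})
db.users.find({username: 'kbanker'}, {_id: 0, username: 1, email: 1})

Note that _id must be excluded from the projection; otherwise the document itself would still have to be fetched to supply it.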

Bear in mind that indexes are stored separately in RAM from the data they index and aren’t clustered. In a clustered index, the order of the index corresponds directly to the order of the underlying data; if you index recipes by name in a clustered index, then all of the recipes starting with A will be stored together, followed by B, C, and so on. This isn’t the case in MongoDB. Every recipe name is essentially duplicated in the index, and the order of those names has no bearing on the order of the data. This is important when you scan through a collection sorted with an index because it means that every document fetched could be anywhere in the data set. There’s no guaranteed locality with the previously fetched data.

Ideally, indexes and a working data set fit in RAM. But estimating how much RAM this requires for any given deployment isn’t always easy. You can always discover total index size by looking at the results of the stats command. The working set is the subset of total data commonly queried and updated, which is different for every application. Suppose you have a million users for which you have data. If only half of them are active (thus half the user documents are queried), then your working set for the user collection is half the total data size. If these documents are evenly distributed throughout the entire data set, though, it’s likely that untouched user documents are also being loaded into memory, which imposes a cost.
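
For example, the following shell calls report index sizes in bytes (the products collection is used only as an illustration):

db.products.stats().totalIndexSize
db.stats().indexSize

The first returns the total size of all indexes on a single collection; the second returns the total for the current database. Comparing these numbers to available RAM is a quick first check.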

In chapter 10, we’ll revisit the concept of the working set, and we’ll look at specific ways to diagnose hardware-related performance issues. For now, be aware of the potential costs of adding new indexes, and keep an eye on the ratio of index and working set size to RAM. Doing so will help you to maintain good performance as your data grows.

8.1.3. B-trees

As mentioned, MongoDB represents most indexes internally as B-trees. B-trees are ubiquitous, having remained in popular use for database records and indexes since at least the late 1970s.[3] If you’ve used other database systems, you may already be familiar with the various aspects of using B-trees. This is good because it means you can effectively transfer most of your knowledge of indexing. If you don’t know much about B-trees, that’s okay, too; this section will present the concepts most relevant to your work with MongoDB.

3

The MMAPv1 storage engine uses B-trees for its indexes only; collections are stored as doubly linked lists. As you’ll see in chapter 10, the WiredTiger storage engine works a little differently despite the fact that it also uses B-trees. But MMAPv1 remains the default MongoDB storage engine.

B-trees have two overarching traits that make them ideal for database indexes. First, they facilitate a variety of queries, including exact matches, range conditions, sorting, prefix matching, and index-only queries. Second, they’re able to remain balanced in spite of the addition and removal of keys.

We’ll look at a simple representation of a B-tree and then discuss some principles that you’ll want to keep in mind. Imagine that you have a collection of users and that you’ve created a compound index on last name and age. An abstract representation of the resulting B-tree might look something like figure 8.5.

Figure 8.5. Sample B-tree structure

A B-tree, as you might guess, is a tree-like data structure. Each node in the tree can contain multiple keys. You can see in the example that the root node contains two keys, each of which is in the form of a BSON object representing an indexed value from the users collection. In reading the contents of the root node, you can see the keys for two documents, indicating last names Edwards and Perry, with ages of 21 and 18, respectively. Each of these keys includes two pointers: one to the data file it belongs to and another to the child node. Additionally, the node itself points to another node with values less than the node’s smallest value.

In MongoDB’s B-tree implementation, a new node is allocated 8,192 bytes, which means that in practice, each node may contain hundreds of keys. This depends on the average index key size; in this case, that average key size might be around 30 bytes. The maximum key size since MongoDB v2.0 is 1024 bytes. Add to this a per-key overhead of 18 bytes and a per-node overhead of 40 bytes, and this results in about 170 keys per node.[4] One thing to notice is that each node has some empty space (not to scale).

4

(8192 – 40) / (30 + 18) = 169.8

This is relevant because users frequently want to know why index sizes are what they are. You now know that each node is 8 KB, and you can estimate how many keys will fit into each node. To calculate this, keep in mind that B-tree nodes are usually intentionally kept around 60% full by default.

Given this information, you should now see why indexes aren’t free, in terms of space or time required to update them. Use this information to help decide when to create indexes on your collections and when to avoid them.

8.2. Indexing in practice

With most of the theory behind us, we’ll now look at some refinements on our concept of indexing in MongoDB. We’ll then proceed to some of the details of index administration.

8.2.1. Index types

MongoDB uses B-trees for indexes and allows you to apply several characteristics to these indexes. This section should give you an overview of your options when creating indexes.

Unique indexes

Often you want to ensure that a field in your document, such as _id or username, is unique to that document. Unique indexes are a way to enforce this characteristic, and in fact are used by MongoDB to ensure that _id is a unique primary key.

To create a unique index, specify the unique option:

db.users.createIndex({username: 1}, {unique: true})

Unique indexes enforce uniqueness across all their entries. If you try to insert a document into this book’s sample application’s users collection with an already-indexed username value, the insert will fail with the following exception:

E11000 duplicate key error index:
  gardening.users.$username_1  dup key: { : "kbanker" }

If using a driver, this exception will be caught only if you perform the insert using your driver’s safe mode, which is the default. You may have also encountered this error if you attempted to insert two documents with the same _id—every MongoDB collection has a unique index on this field because it’s the primary key.

If you need a unique index on a collection, it’s usually best to create the index before inserting any data. If you create the index in advance, you guarantee the uniqueness constraint from the start. When creating a unique index on a collection that already contains data, you run the risk of failure because it’s possible that duplicate keys may already exist in the collection. When duplicate keys exist, the index creation fails.

If you do find yourself needing to create a unique index on an established collection, you have a couple of options. The first is to repeatedly attempt to create the unique index and use the failure messages to manually remove the documents with duplicate keys. But if the data isn’t so important, you can also instruct the database to drop documents with duplicate keys automatically using the dropDups option. For example, if your users collection already contains data, and if you don’t care that documents with duplicate keys are removed, you can issue the index creation command like this:

db.users.createIndex({username: 1}, {unique: true, dropDups: true})

Be careful using dropDups

Note that the choice of duplicate key documents to be preserved is arbitrary, so use this feature with extreme care. Typically you’ll want to decide which duplicates to drop instead of having MongoDB choose for you.

The dropDups option was removed in MongoDB 3.0, and there’s no direct replacement for it. You can either create a new collection, create the unique index on this new collection, and copy all the documents from the old collection to the new one (while making sure you ignore duplicate key errors during the process), or deal with the duplicate documents manually.

Sparse indexes

Indexes are dense by default. This means that for every document in an indexed collection, a corresponding entry exists in the index, even if the document lacks the indexed key. For example, recall the products collection from your e-commerce data model, and imagine that you’ve built an index on the product attribute category_ids. Now suppose that a few products haven’t been assigned to any categories. For each of these category-less products, there will still exist a null entry in the category_ids index. You can query for those null values like this:

db.products.find({category_ids: null})

Here, when searching for all products lacking a category, the query optimizer will still be able to use the index on category_ids to locate the corresponding products.

But in two cases a dense index is undesirable. The first is when you want a unique index on a field that doesn’t appear in every document in the collection. For instance, you definitely want a unique index on every product’s sku field. But suppose that, for some reason, products are entered into the system before a SKU is assigned. If you have a unique index on sku and attempt to insert more than one product without a SKU, the first insert will succeed, but all subsequent inserts will fail because there will already be an entry in the index where sku is null. This is a case where a dense index doesn’t serve your purpose. Instead you want a unique and sparse index.

In a sparse index, only those documents having some value for the indexed key will appear. If you want to create a sparse index, all you have to do is specify {sparse: true}. For example, you can create a unique sparse index on sku like this:

db.products.createIndex({sku: 1}, {unique: true, sparse: true})

There’s another case where a sparse index is desirable: when a large number of documents in a collection don’t contain the indexed key. For example, suppose you allowed anonymous reviews on your e-commerce site. In this case, half the reviews might lack a user_id field, and if that field were indexed, half the entries in that index would be null. This would be inefficient for two reasons. First, it would increase the size of the index. Second, it’d require updates to the index when adding and removing documents with null user_id fields.

If you rarely (or never) expect queries on anonymous reviews, you might elect to build a sparse index on user_id. Again, setting the sparse option is simple:

db.reviews.createIndex({user_id: 1}, {sparse: true, unique: false})

Now only those reviews linked to a user via the user_id field will be indexed.

Multikey indexes

In earlier chapters you saw several examples of indexing fields whose values are arrays.[5] This is made possible by what’s known as a multikey index, which allows multiple entries in the index to reference the same document. This makes sense if we look at a simple example. Suppose you have a product document with a few tags like this:

5

Think of category IDs, for instance.

{
  name: "Wheelbarrow",
  tags: ["tools", "gardening", "soil"]
}

If you create an index on tags, then each value in this document’s tags array will appear in the index. This means that a query on any one of these array values can use the index to locate the document. This is the idea behind a multikey index: multiple index entries, or keys, end up referencing the same document.
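
For example, a multikey index on tags and a query that uses it might look like this (assuming the document above lives in the products collection):

db.products.createIndex({tags: 1})
db.products.find({tags: 'gardening'})

The second statement returns the wheelbarrow document because "gardening" appears as one of the values in its tags array.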

Multikey indexes are always enabled in MongoDB, with a few exceptions, such as with hashed indexes. Whenever an indexed field contains an array, each array value will be given its own entry in the index.

The intelligent use of multikey indexes is essential to proper MongoDB schema design. This should be evident from the examples presented in chapters 4 through 6; several more examples are provided in the design patterns section of appendix B. But creating, updating, or deleting multikey indexes is more expensive than creating, updating, or deleting single-key indexes.

Hashed indexes

In the previous examples of B-tree indexes, we showed how MongoDB builds the index tree out of the values being indexed. Thus, in an index of recipes, the “Apple Pie” entry is near the “Artichoke Ravioli” entry. This may seem obvious and natural, but MongoDB also supports hashed indexes where the entries are first passed through a hash function.[6] This means the hashed values will determine the ordering, so these recipes will likely not be near each other in the index.

6

Recall that a hash function takes some input and maps it to an output value of fixed length. For a given input, the hash output will always be consistent. A good hash function in this context will evenly distribute the output values so that they appear to be random.

Indexes of this kind can be created in MongoDB by passing 'hashed' as the index sorting direction. For example:

db.recipes.createIndex({recipe_name: 'hashed'})

Because the indexed value is a hash of the original, these indexes carry some restrictions:

  • Equality queries will work much the same, but range queries aren’t supported.
  • Multikey hashed indexes aren’t allowed.
  • Floating-point values are cast to an integer before being hashed; thus, 4.2 and 4.3 will have the same value in a hashed index.

Given these restrictions and peculiarities, you may wonder why anyone would use a hashed index. The answer lies in the fact that the entries in a hashed index are evenly distributed. In other words, when you have a non-uniform distribution of key data, a hashed index will create uniformity if you can live with its restrictions. Recall that “Apple Pie” and “Artichoke Ravioli” are no longer next to each other in the hashed index; the data locality of the index has changed. This is useful in sharded collections where the shard index determines which shard each document will be assigned to. If your shard index is based on an increasing value, such as MongoDB OIDs,[7] then new documents created will only be inserted into a single shard—unless the index is hashed.

7

MongoDB object IDs (OIDs) are the default ids used for MongoDB documents. We discussed them in more detail in chapter 3, section 3.2.1.

Let’s dig into that statement. Unless explicitly set, a MongoDB document will use an OID as its primary key. Here are a few sequentially generated OIDs:

5247ae72defd45a1daba9da9
5247ae73defd45a1daba9daa
5247ae73defd45a1daba9dab

Notice how similar the values are; the most significant bits are based on the time when they were generated. When new documents are inserted with these IDs, their index entries are likely to be near each other. If the index using these IDs is being used to decide which shard (and thus machine) a document should reside on, these documents are also likely to be inserted onto the same machine. This can be detrimental if a collection is receiving a heavy write load, because only a single machine is being used. Hashed indexes solve this issue by distributing these documents evenly in a namespace, and thus across shards and machines. To fully understand this example, wait until you read chapter 12.

If you’re not familiar with sharding, it’s described in much greater detail in chapter 12, and this example will make more sense once we describe sharding in detail. For now, the important thing to remember is that hashed indexes change the locality of index entries, which can be useful in sharded collections.
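
For illustration only (sharding is covered later, and the cluster and namespace here are hypothetical), a hashed index is commonly used as a shard key like this:

sh.shardCollection('mydb.users', {_id: 'hashed'})

With a hashed shard key, documents whose _id values were generated close together in time still end up spread across shards rather than piling onto one.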

Geospatial indexes

Another useful query capability is to find documents “close” to a given location, based on latitude and longitude values stored in each document. If you store a directory of restaurants in a MongoDB collection, for example, users are probably most eager to find restaurants located near their home. One answer to this is to run a query to find every restaurant within a 10-mile radius. Executing this query requires an index that can efficiently calculate geographic distances, including the curvature of the earth. Geospatial indexes can handle this and other types of queries.
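
A minimal sketch of the idea (the collection and field names are assumed, and documents would need to store GeoJSON points in the location field) might look like this:

db.restaurants.createIndex({location: '2dsphere'})
db.restaurants.find({
    location: {
      $near: {
        $geometry: {type: 'Point', coordinates: [-73.98, 40.75]},
        $maxDistance: 16000
      }
    }
  })

Here $maxDistance is expressed in meters, so 16000 is roughly a 10-mile radius.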

8.2.2. Index administration

We’ve discussed simple index administration, such as creating indexes, in this and in previous chapters. When you use indexes in real-world applications, however, it’s useful to understand this topic in greater depth. Here we’ll see index creation and deletion in detail and address questions surrounding compaction and backups.

Creating and deleting indexes

By now you’ve created quite a few indexes, so this should be easy. Simply call createIndex() either in the shell or with your language of choice. Please note that in MongoDB v3.0, ensureIndex(), which was previously used for creating indexes, has been replaced by the createIndex() command and shouldn’t be used anymore. What you may not know is that this method works by creating a document defining the new index and putting it into the special system.indexes collection.

Though it’s usually easier to use a helper method to create an index, you can also insert an index specification manually (this is what the helper methods do). You need to be sure you’ve specified the minimum set of keys: ns, key, and name. ns is the namespace, key is the field or combination of fields to index, and name is a name used to refer to the index. Any additional options, like sparse, can also be specified here. For example, let’s create a sparse index on the users collection:

use green
spec = {ns: "green.users", key: {'addresses.zip': 1}, name: 'zip'}
db.system.indexes.insert(spec, true)

If no errors are returned on insert, the index now exists, and you can query the system.indexes collection to prove it:

db.system.indexes.find().pretty()
{
  "ns" : "green.users",
  "key" : {
    "addresses.zip" : 1
  },
  "name" : "zip",
  "v" : 1
}

The v field was added in MongoDB v2.0 to store the version of the index. This version field allows for future changes in the internal index format but should be of little concern to application developers.

To delete an index, you might think that all you need to do is remove the index document from system.indexes, but this operation is prohibited. Instead, you must delete indexes using the database command deleteIndexes. As with index creation, there are helpers for deleting indexes, but if you want to run the command itself, you can do that, too. The command takes as its argument a document containing the collection name and either the name of the index to drop or * to drop all indexes. To manually drop the index you created, issue the command like this:

use green
db.runCommand({deleteIndexes: "users", index: "zip"})

In most cases, you’ll use the shell’s helpers to create and drop indexes:

use green
db.users.createIndex({zip: 1})

You can then check the index specifications with the getIndexes() method:

> db.users.getIndexes()
[
  {
    "v" : 1,
    "key" : {
      "_id" : 1
    },
    "ns" : "green.users",
    "name" : "_id_"
  },
  {
    "v" : 1,
    "key" : {
      "zip" : 1
    },
    "ns" : "green.users",
    "name" : "zip_1"
  }
]

Finally, you can drop the index using the dropIndex() method. Note that you must supply the index’s name as specified in the spec:

use green
db.users.dropIndex("zip_1")

You can also supply your own name while creating an index using the name parameter.

Those are the basics of creating and deleting indexes. For what to expect when an index is created, read on.

Building indexes

Most of the time, you’ll want to declare your indexes before putting your application into production. This allows indexes to be built incrementally, as the data is inserted. But there are two cases where you might choose to build an index after the fact. The first case occurs when you need to import a lot of data before switching into production. For instance, you might be migrating an application to MongoDB and need to seed the database with user information from a data warehouse. You could create the indexes on your user data in advance, but doing so after you’ve imported the data will ensure an ideally balanced and compacted index from the start. It’ll also minimize the net time to build the index.

The second (and more obvious) case for creating indexes on existing data sets is when you have to optimize for new queries. This occurs when you add or change functionality in your application, and it happens more than you might think. Suppose you allow users to log in using their username, so you index that field. Then you modify your application to also allow your users to log in using their email; now you probably need a second index on the email field. Watch out for cases like these because they require rethinking your indexing.
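
In that scenario, declaring the extra index is a one-liner (the field name is assumed for illustration); the cost lies in building it on an already-large collection, as described next:

db.users.createIndex({email: 1})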

Regardless of why you’re creating new indexes, the process isn’t always pleasing. For large data sets, building an index can take hours, even days. But you can monitor the progress of an index build from the MongoDB logs. Let’s take an example from a data set that we’ll use in the next section. First, you declare an index to be built:

db.values.createIndex({open: 1, close: 1})

Be careful declaring indexes

Because it’s so easy to declare indexes, it’s also easy to inadvertently trigger an index build. If the data set is large enough, the build will take a long time. And in a production situation, this can be a nightmare because there’s no easy way to kill an index build. If this ever happens, you may have to fail over to a backup or secondary. The most prudent advice is to treat an index build as a kind of database migration.

The index builds in two steps. In the first step, the values to be indexed are sorted. A sorted data set makes for a much more efficient insertion into the B-tree. If you look at the MongoDB server log, you’ll see the progress printed for long index builds. Note that the progress of the sort is indicated by the ratio of the number of documents sorted to the total number of documents:

[conn1] building new index on { open: 1.0, close: 1.0 } for stocks.values
    1000000/4308303  23%
    2000000/4308303  46%
    3000000/4308303  69%
    4000000/4308303  92%
    Tue Jan  4 09:59:13 [conn1]    external sort used : 5 files  in 55 secs

For step two, the sorted values are inserted into the index. Progress is indicated in the same way, and when complete, the time it took to complete the index build is indicated as the insert time into system.indexes:

1200300/4308303  27%
    2227900/4308303  51%
    2837100/4308303  65%
    3278100/4308303  76%
    3783300/4308303  87%
    4075500/4308303  94%
Tue Jan  4 10:00:16 [conn1] done building bottom layer, going to commit
Tue Jan  4 10:00:16 [conn1] done for 4308303 records 118.942secs
Tue Jan  4 10:00:16 [conn1] insert stocks.system.indexes 118942ms

In addition to examining the MongoDB log, you can check the index build progress by running the shell’s currentOp() method. This command’s output varies from version to version, but it will probably look something like the next listing.[8]

8

Note that if you’ve started the index build from the MongoDB shell, you’ll have to open a new instance of the shell to run currentOp concurrently. For more about db.currentOp(), see chapter 10.

Listing 8.1. Checking the index build progress with the shell currentOp() method

The msg field describes the build’s progress. Note also the locks element, which indicates that the index build takes a write lock on the stocks database. This means that no other client can read or write from the database at this time. If you’re running in production, this is obviously a bad thing, and it’s the reason why long index builds can be so vexing. Let’s explore two possible solutions to this problem.

Background indexing

If you’re running in production and can’t afford to halt access to the database, you can specify that an index be built in the background. Although the index build will still take a write lock, the job will yield to allow other readers and writers to access the database. If your application typically exerts a heavy load on MongoDB, a background index build will degrade performance, but this may be acceptable under certain circumstances. For example, if you know that the index can be built within a time window where application traffic is at a minimum, background indexing in this case might be a good choice.

To build an index in the background, specify {background: true} when you declare the index. The previous index can be built in the background like this:

db.values.createIndex({open: 1, close: 1}, {background: true})

Offline indexing

Building an index in the background may still put an unacceptable amount of load on a production server. If this is the case, you may need to index the data offline. This will usually involve taking a replica node offline, building the index on that node by itself, and then allowing the node to catch up with the master replica. Once it’s caught up, you can promote the node to primary and then take another secondary offline and build its version of the index. This tactic presumes that your replication oplog is large enough to prevent the offline node from becoming stale during the index build. Chapter 11 covers replication in detail and should help you plan for a migration such as this.

Backups

Because indexes are hard to build, you may want to back them up. Unfortunately, not all backup methods include indexes. For instance, you might be tempted to use mongodump and mongorestore, but these utilities preserve collections and index declarations only. This means that when you run mongorestore, all the indexes declared for any collections you’ve backed up will be re-created. As always, if your data set is large, the time it takes to build these indexes may be unacceptable.

Consequently, if you want your backups to include indexes, you’ll want to opt for backing up the MongoDB data files themselves. More details about this, as well as general instructions for backups, can be found in chapter 13.

Defragmenting

If your application heavily updates existing data or performs a lot of large deletions, you may end up with a highly fragmented index. B-trees will coalesce on their own somewhat, but this isn’t always sufficient to offset a high delete volume. The primary symptom of a fragmented index is an index size much larger than you’d expect for the given data size. This fragmented state can result in indexes using more RAM than necessary. In these cases, you may want to consider rebuilding one or more indexes. You can do this by dropping and re-creating individual indexes or by running the reIndex command, which will rebuild all indexes for a given collection:

db.values.reIndex();

Be careful about reindexing: the command will take out a write lock for the duration of the rebuild, temporarily rendering your MongoDB instance unusable. Reindexing is best done offline, as described earlier for building indexes on a secondary. Note that the compact command, discussed in chapter 10, will also rebuild indexes for the collection on which it’s run.
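
If you’d rather rebuild a single index than reindex the whole collection, you can drop and re-create it explicitly; this sketch assumes the default generated name for the index built earlier:

db.values.dropIndex('open_1_close_1')
db.values.createIndex({open: 1, close: 1}, {background: true})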

We’ve discussed how to create and manage your indexes, but despite this knowledge, you may still find yourself in a situation where your queries aren’t fast enough. This can occur as you add data, traffic, or new queries. Let’s learn how to identify these queries that could be faster and improve the situation.

8.3. Query optimization

Query optimization is the process of identifying slow queries, discovering why they’re slow, and then taking steps to speed them up. In this section, we’ll look at each step of the query optimization process in turn so that by the time you finish reading, you’ll have a framework for addressing problematic queries on any MongoDB installation.

Before diving in, we must warn you that the techniques presented here can’t be used to solve every query performance problem. The causes of slow queries vary too much. Poor application design, inappropriate data models, and insufficient physical hardware are all common culprits, and their remedies require a significant time investment. Here we’ll look at ways to optimize queries by restructuring the queries themselves and by building the most useful indexes. We’ll also describe other avenues for investigation when these techniques fail to deliver.

8.3.1. Identifying slow queries

If your MongoDB-based application feels sluggish, it’s time to start profiling your queries. Any disciplined approach to application design should include a query audit, and MongoDB makes this easy. Though the requirements vary per application, it’s safe to assume that for most apps, queries shouldn’t take much longer than 100 ms. The MongoDB logger has this assumption ingrained because it prints a warning whenever any operation, including a query, takes longer than 100 ms. The logs, therefore, are the first place you should look for slow queries.

It’s unlikely that any of the data sets we’ve worked with up until now have been large enough to generate queries lasting longer than 100 ms. For the following examples, we’ll use a data set consisting of daily NASDAQ summaries. If you want to follow along, you’ll need to have this data locally. To import it, first download the archive using http://mng.bz/ii49. Then unzip the file to a temporary folder. You’ll see the following output:

$ unzip stocks.zip
Archive:  stocks.zip
   creating: dump/stocks/
  inflating: dump/stocks/system.indexes.bson
  inflating: dump/stocks/values.bson

Finally, after starting the mongod process if necessary, restore the dump like this:

$ mongorestore -d stocks dump/stocks

This process may take a few minutes. You may also receive some warning messages at the beginning and end of the process. Don’t worry about those.

The stocks data set is large and easy to work with. For a certain subset of the NASDAQ stock exchange’s symbols, there’s a document for each day’s high, low, close, and volume for a 25-year period beginning in 1983. Given the number and size of the documents in this collection, it’s easy to generate one of the log warnings. Try querying for the first occurrence of Google’s stock price:

use stocks
db.values.find({"stock_symbol": "GOOG"}).sort({date: -1}).limit(1)

You’ll notice that this takes some time to run, and if you check the MongoDB log, you’ll see the expected slow query warning. Here’s a sample of the output to expect from MongoDB v2.6:

Mon Sep 30 21:48:58.066 [conn20] query stocks.values query: { query: {
     stock_symbol: "GOOG" }, orderby: { date: -1.0 } }
          ntoreturn:1 ntoskip:0 nscanned:4308303 scanAndOrder:1 keyUpdates:0
     numYields: 3 locks(micros) r:4399440
          nreturned:1 reslen:194 4011ms

A similar log message from MongoDB v3.0 using another collection has the following format:

2015-09-11T21:17:15.414+0300 I COMMAND  [conn99] command green.$cmd command:
insert { insert: "system.indexes", documents: [ { _id:
ObjectId('55f31aab9a50479be0a7dcd7'), ns: "green.users", key: {
addresses.zip: 1.0 }, name: "zip" } ], ordered: false } keyUpdates:0
writeConflicts:0 numYields:0 reslen:40 locks:{ Global: { acquireCount: { r:
1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 9 } }, Database: {
acquireCount: { W: 1 } }, Collection: { acquireCount: { W: 1 } }, Metadata: {
acquireCount: { W: 5 } } } 102ms

There’s a lot of information here, and we’ll go over the meaning of all of it when we discuss explain(). For now, if you read the message carefully, you should be able to extract the most important parts: that it’s a query on stocks.values; that the query selector consists of a match on stock_symbol and that a sort is being performed; and, maybe most significantly, that the query takes a whopping four seconds (4011 ms). The exact time may vary quite a bit depending on the speed of your computer.

Warnings like this must be addressed. They’re so critical that it’s worth your while to occasionally search for them in your MongoDB logs. This can be accomplished easily with grep:

grep -E '[0-9]+ms' mongod.log

If 100 ms is too high a threshold, you can lower it with the --slowms server option when you start MongoDB. If you define slow as taking longer than 50 ms, then start mongod with --slowms 50.

Of course, using grep to search through logs isn’t very systematic. You can use the MongoDB logs to check for slow queries, but the procedure is rather coarse and should be reserved as a kind of sanity check in a staging or production environment. To identify slow queries before they become a problem, you want a precision tool. MongoDB’s built-in query profiler is exactly that.

Using the profiler

For identifying slow queries, you can’t beat the built-in profiler. Profiling is disabled by default, so let’s get started by enabling it. From the MongoDB shell, enter the following:

use stocks
db.setProfilingLevel(2)

First you select the database you want to profile; profiling is always scoped to a particular database. Then you set the profiling level to 2. This is the most verbose level; it directs the profiler to log every read and write. A couple of other options are available. To log only slow (100 ms) operations, set the profiling level to 1. To disable the query profiler altogether, set it to 0. And to log only operations taking longer than a certain threshold in milliseconds, pass the number of milliseconds as the second argument like this:

use stocks
db.setProfilingLevel(1, 50)

Once you’ve enabled the profiler, it’s time to issue some queries. Let’s run another query on the stocks database. Try finding the highest closing price in the data set:

db.values.find({}).sort({close: -1}).limit(1)

Profiling results

The profiling results are stored in a special capped collection called system.profile which is located in the database where you executed the setProfilingLevel command. Recall that capped collections are fixed in size and that data is written to them in a circular way so that once the collection reaches its max size, new documents overwrite the oldest documents. The system.profile collection is allocated 128 KB, thus ensuring that the profile data never consumes much in the way of resources.

You can query system.profile as you would any capped collection. For instance, you can find all queries that took longer than 150 ms like this:

db.system.profile.find({millis: {$gt: 150}})

Because capped collections maintain natural insertion order, you can use the $natural operator to sort so that the most recent results are displayed first:

db.system.profile.find().sort({$natural: -1}).limit(5).pretty()

Returning to the query you just issued, you should see an entry in the result set that looks something like this:

Another expensive query: this one took about 11 seconds! In addition to the time it took to complete, you get all the same information about the query that you saw in the MongoDB log’s slow query warning, which is enough to start the deeper investigation that we’ll cover in the next section.

But before moving on, a few more words about profiling strategy are in order:

  • A good way to use the profiler is to start it with a coarse setting and work downward. First ensure that no queries take longer than 100 ms, then move down to 75 ms, and so on.
  • While the profiler is enabled, you’ll want to put your application through its paces; this means testing all of your application’s reads and writes.
  • To be thorough, those reads and writes should be executed under real conditions, where the data sizes, query load, and hardware are representative of the application’s production environment.

The query profiler is useful, but to get the most out of it, you need to be methodical. Better to be surprised with a few slow queries in development than in production, where the remedies are much more costly.

For reference, the following output presents a profiling entry from MongoDB v3.0.6:

> db.system.profile.find().limit(1).pretty()
{
    "op" : "query",
    "ns" : "products.system.profile",
    "query" : {

    },
    "ntoreturn" : 0,
    "ntoskip" : 0,
    "nscanned" : 0,
    "nscannedObjects" : 0,
    "keyUpdates" : 0,
    "writeConflicts" : 0,
    "numYield" : 0,

    "locks" : {
        "Global" : {
            "acquireCount" : {
                "r" : NumberLong(2)
            }
        },
        "MMAPV1Journal" : {
            "acquireCount" : {
                "r" : NumberLong(1)
            }
        },
        "Database" : {
            "acquireCount" : {
                "r" : NumberLong(1)
            }
        },
        "Collection" : {
            "acquireCount" : {
                "R" : NumberLong(1)
            }
        }
    },
    "nreturned" : 0,
    "responseLength" : 20,
    "millis" : 1,
    "execStats" : {
        "stage" : "COLLSCAN",
        "filter" : {
            "$and" : [ ]
        },
        "nReturned" : 0,
        "executionTimeMillisEstimate" : 0,
        "works" : 2,
        "advanced" : 0,
        "needTime" : 1,
        "needFetch" : 0,
        "saveState" : 0,
        "restoreState" : 0,
        "isEOF" : 1,
        "invalidates" : 0,
        "direction" : "forward",
        "docsExamined" : 0
    },
    "ts" : ISODate("2015-09-11T18:52:08.847Z"),
    "client" : "127.0.0.1",
    "allUsers" : [ ],
    "user" : ""
}

8.3.2. Examining slow queries

Finding slow queries is easy with MongoDB’s profiler. Discovering why these queries are slow is trickier and may require some detective work. As mentioned, the causes of slow queries are manifold. If you’re lucky, resolving a slow query may be as easy as adding an index. In more difficult cases, you might have to rearrange indexes, restructure the data model, or upgrade hardware. But you should always look at the simplest case first, and that’s what we’re going to do here.

In the simplest case, a lack of indexes, inappropriate indexes, or less-than-ideal queries will be the root of the problem. You can find out for sure by running an explain on the offending queries. Let’s explore how to do that now.

Using and understanding explain()

MongoDB’s explain command provides detailed information about a given query’s path. Let’s dive right in and see what information can be gleaned from running an explain on the last query you ran in the previous section. To run explain from the shell, you only need to attach the explain() method call:
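
For example, attaching explain() to the last query from the previous section looks like this:

db.values.find({}).sort({close: -1}).limit(1).explain()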

The millis field indicates that this query takes about 11 seconds,[9] and there’s an obvious reason for this. Look at the nscanned value: this shows that the query engine had to scan 4,308,303 documents to fulfill the query. Now, quickly run a count on the values collection:

9

If this doesn’t seem like much, consider the case where a user is waiting on a web page to load because of a database query in the background. In that context, 11 seconds is an eternity.

db.values.count()
4308303

The number of documents scanned is the same as the total number of documents in the collection, so you’ve performed a complete collection scan. If your query were expected to return every document in the collection, this wouldn’t be a bad thing. But you’re returning one document, as indicated by the explain value n, so this is problematic. Furthermore, a full collection scan will only get more expensive if more documents are added. Generally speaking, you want the values of n and nscanned to be as close together as possible. When doing a collection scan, this is almost never the case. The cursor field tells you that you’ve been using a BasicCursor, which only confirms that you’re scanning the collection itself and not an index. If you had used an index, the value would’ve been BTreeCursor.

A second datum here further explains the slowness of the query: the scanAndOrder field. This indicator appears when the query optimizer can’t use an index to return a sorted result set. Therefore, in this case, not only does the query engine have to scan the collection, it also has to sort the result set manually.

The previous output from the explain() command is from an older MongoDB version. Here’s a sample output from the explain() command using a MongoDB v3.0.6 server:

> db.inventory.find({}).sort({"quantity": -1}).limit(1).explain("executionStats")
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "tutorial.inventory",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "$and" : [ ]
        },
        "winningPlan" : {
            "stage" : "SORT",
            "sortPattern" : {
                "quantity" : -1
            },
            "limitAmount" : 1,
            "inputStage" : {
                "stage" : "COLLSCAN",
                "filter" : {
                    "$and" : [ ]
                },
                "direction" : "forward"
            }
        },
        "rejectedPlans" : [ ]
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 1,
        "executionTimeMillis" : 0,
        "totalKeysExamined" : 0,
        "totalDocsExamined" : 11,
        "executionStages" : {
            "stage" : "SORT",
            "nReturned" : 1,
            "executionTimeMillisEstimate" : 0,
            "works" : 16,
            "advanced" : 1,
            "needTime" : 13,
            "needFetch" : 0,

            "saveState" : 0,
            "restoreState" : 0,
            "isEOF" : 1,
            "invalidates" : 0,
            "sortPattern" : {
                "quantity" : -1
            },
            "memUsage" : 72,
            "memLimit" : 33554432,
            "limitAmount" : 1,
            "inputStage" : {
                "stage" : "COLLSCAN",
                "filter" : {
                    "$and" : [ ]
                },
                "nReturned" : 11,
                "executionTimeMillisEstimate" : 0,
                "works" : 13,
                "advanced" : 11,
                "needTime" : 1,
                "needFetch" : 0,
                "saveState" : 0,
                "restoreState" : 0,
                "isEOF" : 1,
                "invalidates" : 0,
                "direction" : "forward",
                "docsExamined" : 11
            }
        }
    },
    "serverInfo" : {
        "host" : "rMacBook.local",
        "port" : 27017,
        "version" : "3.0.6",
        "gitVersion" : "nogitversion"
    },
    "ok" : 1
}

The explain() command displays more information when used with the executionStats option.

Add an index and retry

The poor performance is unacceptable, but fortunately the fix is simple. All you need to do is build an index on the close field. Go ahead and do that now:

db.values.createIndex({close: 1})

Note that building the index may take a few minutes. Once built, try the query again:

db.values.find({}).sort({close: -1}).limit(1).explain()
{
  "cursor" : "BtreeCursor close_1 reverse",
  "isMultiKey" : false,

  "n" : 1,
  "nscannedObjects" : 1,
  "nscanned" : 1,
  "nscannedObjectsAllPlans" : 1,
  "nscannedAllPlans" : 1,
  "scanAndOrder" : false,
  "indexOnly" : false,
  "nYields" : 0,
  "nChunkSkips" : 0,
  "millis" : 0,
  "indexBounds" : {
    "name" : [
      [
        {
          "$maxElement" : 1
        },
        {
          "$minElement" : 1
        }
      ]
    ]
  },
  "server" : "localhost:27017"
}

What a difference! The query now takes less than a millisecond to process. You can see from the cursor field that you’re using a BtreeCursor on the index named close_1 and that you’re iterating over the index in reverse order. In the indexBounds field, you see the special values $maxElement and $minElement. These indicate that the query spans the entire index. In this case, the query optimizer walks the rightmost edge of the B-tree until it reaches the maximum key and then works its way backward. Because you’ve specified a limit of 1, the query is complete once the maxElement is found. And of course, the index keeps the entries in order, so there’s no longer a need for the manual sort indicated by scanAndOrder.

Similarly, the MongoDB v3.0.6 output shows the improvement in the execution time and the number of documents that were examined:

> db.inventory.find({}).sort({"quantity": -1}).limit(1).explain("executionStats")
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "tutorial.inventory",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "$and" : [ ]
        },
        "winningPlan" : {
            "stage" : "LIMIT",
            "limitAmount" : 0,
            "inputStage" : {
                "stage" : "FETCH",

                "inputStage" : {
                    "stage" : "IXSCAN",
                    "keyPattern" : {
                        "quantity" : 1
                    },
                    "indexName" : "quantity_1",
                    "isMultiKey" : false,
                    "direction" : "backward",
                    "indexBounds" : {
                        "quantity" : [
                            "[MaxKey, MinKey]"
                        ]
                    }
                }
            }
        },
        "rejectedPlans" : [ ]
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 1,
        "executionTimeMillis" : 0,
        "totalKeysExamined" : 1,
        "totalDocsExamined" : 1,
        "executionStages" : {
            "stage" : "LIMIT",
            "nReturned" : 1,
            "executionTimeMillisEstimate" : 0,
            "works" : 2,
            "advanced" : 1,
            "needTime" : 0,
            "needFetch" : 0,
            "saveState" : 0,
            "restoreState" : 0,
            "isEOF" : 1,
            "invalidates" : 0,
            "limitAmount" : 0,
            "inputStage" : {
                "stage" : "FETCH",
                "nReturned" : 1,
                "executionTimeMillisEstimate" : 0,
                "works" : 1,
                "advanced" : 1,
                "needTime" : 0,
                "needFetch" : 0,
                "saveState" : 0,
                "restoreState" : 0,
                "isEOF" : 0,
                "invalidates" : 0,
                "docsExamined" : 1,
                "alreadyHasObj" : 0,
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "nReturned" : 1,
                    "executionTimeMillisEstimate" : 0,

                    "works" : 1,
                    "advanced" : 1,
                    "needTime" : 0,
                    "needFetch" : 0,
                    "saveState" : 0,
                    "restoreState" : 0,
                    "isEOF" : 0,
                    "invalidates" : 0,
                    "keyPattern" : {
                        "quantity" : 1
                    },
                    "indexName" : "quantity_1",
                    "isMultiKey" : false,
                    "direction" : "backward",
                    "indexBounds" : {
                        "quantity" : [
                            "[MaxKey, MinKey]"
                        ]
                    },
                    "keysExamined" : 1,
                    "dupsTested" : 0,
                    "dupsDropped" : 0,
                    "seenInvalidated" : 0,
                    "matchTested" : 0
                }
            }
        }
    },
    "serverInfo" : {
        "host" : "rMacBook.local",
        "port" : 27017,
        "version" : "3.0.6",
        "gitVersion" : "nogitversion"
    },
    "ok" : 1
}

The output from both the 2.x and 3.0 versions of MongoDB is shown so that you have a reference for whichever version your server is running.

Using an indexed key

You’ll see slightly different output if you use the indexed key in your query selector. Take a look at the explain plan for a query selecting closing values greater than 500:

> db.values.find({close: {$gt: 500}}).explain()
{
  "cursor" : "BtreeCursor close_1",
  "isMultiKey" : false,
  "n" : 309,
  "nscannedObjects" : 309,
  "nscanned" : 309,
  "nscannedObjectsAllPlans" : 309,
  "nscannedAllPlans" : 309,
  "scanAndOrder" : false,

  "indexOnly" : false,
  "nYields" : 0,
  "nChunkSkips" : 0,
  "millis" : 1,
  "indexBounds" : {
    "close" : [
      [
        500,
        1.7976931348623157e+308
      ]
    ]
  },
  "server" : "localhost:27017"
}

You’re still scanning the same number of documents that you’re returning (n and nscanned are the same), which is ideal. But note the difference in the way the index boundaries are specified. Instead of the $maxElement and $minElement keys, the boundaries are actual values. The lower bound is 500 and the upper bound is effectively infinite. These boundaries share the data type of the value you’re querying on; because you’re querying on a number, the index bounds are numeric. If you were to query on a string range instead, the boundaries would be strings.[10]

10

If this isn’t making any sense, recall that a given index can contain keys of multiple data types. Thus, query results will always be limited by the data type used in the query.
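
For instance, with an index on stock_symbol in place (you’ll build one later in this chapter), a range query on that field would produce string boundaries rather than numbers. A sketch:

db.values.find({stock_symbol: {$gte: "A", $lt: "B"}}).explain()
// the indexBounds entry for stock_symbol will contain string values such as "A" and "B"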

As before, here’s the output from a similar query run against MongoDB v3.0:

> db.inventory.find({"quantity":{$gt: 150}}).limit(1).explain("executionStats")
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "tutorial.inventory",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "quantity" : {
                "$gt" : 150
            }
        },
        "winningPlan" : {
            "stage" : "LIMIT",
            "limitAmount" : 0,
            "inputStage" : {
                "stage" : "FETCH",
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "keyPattern" : {
                        "quantity" : 1
                    },
                    "indexName" : "quantity_1",
                    "isMultiKey" : false,
                    "direction" : "forward",

                    "indexBounds" : {
                        "quantity" : [
                            "(150.0, inf.0]"
                        ]
                    }
                }
            }
        },
        "rejectedPlans" : [ ]
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 1,
        "executionTimeMillis" : 0,
        "totalKeysExamined" : 1,
        "totalDocsExamined" : 1,
        "executionStages" : {
            "stage" : "LIMIT",
            "nReturned" : 1,
            "executionTimeMillisEstimate" : 0,
            "works" : 2,
            "advanced" : 1,
            "needTime" : 0,
            "needFetch" : 0,
            "saveState" : 0,
            "restoreState" : 0,
            "isEOF" : 1,
            "invalidates" : 0,
            "limitAmount" : 0,
            "inputStage" : {
                "stage" : "FETCH",
                "nReturned" : 1,
                "executionTimeMillisEstimate" : 0,
                "works" : 1,
                "advanced" : 1,
                "needTime" : 0,
                "needFetch" : 0,
                "saveState" : 0,
                "restoreState" : 0,
                "isEOF" : 0,
                "invalidates" : 0,
                "docsExamined" : 1,
                "alreadyHasObj" : 0,
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "nReturned" : 1,
                    "executionTimeMillisEstimate" : 0,
                    "works" : 1,
                    "advanced" : 1,
                    "needTime" : 0,
                    "needFetch" : 0,
                    "saveState" : 0,
                    "restoreState" : 0,
                    "isEOF" : 0,
                    "invalidates" : 0,

                    "keyPattern" : {
                        "quantity" : 1
                    },
                    "indexName" : "quantity_1",
                    "isMultiKey" : false,
                    "direction" : "forward",
                    "indexBounds" : {
                        "quantity" : [
                            "(150.0, inf.0]"
                        ]
                    },
                    "keysExamined" : 1,
                    "dupsTested" : 0,
                    "dupsDropped" : 0,
                    "seenInvalidated" : 0,
                    "matchTested" : 0
                }
            }
        }
    },
    "serverInfo" : {
        "host" : "rMacBook.local",
        "port" : 27017,
        "version" : "3.0.6",
        "gitVersion" : "nogitversion"
    },
    "ok" : 1
}

Before continuing, try running explain() on a few queries of your own against whichever MongoDB versions you have available, and pay attention to the difference between n and nscanned, as well as the difference between totalDocsExamined and nReturned. Optimizing a query in MongoDB v2.x usually means making nscanned as small as possible, but every result must be scanned, so nscanned will never be lower than n, the number of results a query returns. In MongoDB v3.0, nReturned indicates the number of documents a query matches and returns, totalDocsExamined indicates the number of documents MongoDB scanned, and totalKeysExamined shows the number of index entries MongoDB scanned.
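
One handy way to compare those numbers in the v3.0 shell (a small sketch) is to capture the explain document in a variable and read the fields directly:

var stats = db.values.find({close: {$gt: 500}}).explain("executionStats").executionStats
stats.nReturned           // documents matched and returned
stats.totalDocsExamined   // documents MongoDB had to scan
stats.totalKeysExamined   // index entries MongoDB had to scan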

MongoDB’s query optimizer

The query optimizer is the piece of software that determines which index, if any, will most efficiently serve a given query. To select an ideal index for your queries, the query optimizer uses a fairly simple set of rules:

1.  Avoid scanAndOrder. If the query includes a sort, attempt to sort using an index.

2.  Satisfy all fields with useful indexing constraints—attempt to use indexes for the fields in the query selector.

3.  If the query implies a range or includes a sort, choose an index whose last key used can satisfy the range or sort.

If all of these conditions can be met for any one index, that index will be considered optimal and will be used. If more than one index qualifies as optimal, one of the optimal indexes will be chosen arbitrarily. There’s a lesson here: if you can build optimal indexes for your queries, you make the query optimizer’s job a lot easier. Strive for that if you can.

Let’s look at a query that satisfies an index (and the query optimizer) perfectly. Go back to the stock symbol data set. Now imagine you want to issue the following query, which fetches all of Google’s closing values greater than 200:

db.values.find({stock_symbol: "GOOG", close: {$gt: 200}})

If you use .explain() with this query, you’ll see that n is 730 but nscanned is 5299. The previously created index on close helps, but the optimal index for this query includes both keys and places the close key last to allow for the range query:

db.values.createIndex({stock_symbol: 1, close: 1})

You’ll see that if you run the query, both keys are used, and the index bounds are as expected:

db.values.find({stock_symbol: "GOOG", close: {$gt: 200}}).explain()
{
  "cursor" : "BtreeCursor stock_symbol_1_close_1",
  "isMultiKey" : false,
  "n" : 730,
  "nscannedObjects" : 730,
  "nscanned" : 730,
  "nscannedObjectsAllPlans" : 730,
  "nscannedAllPlans" : 730,
  "scanAndOrder" : false,
  "indexOnly" : false,
  "nYields" : 0,
  "nChunkSkips" : 0,
  "millis" : 2,
  "indexBounds" : {
    "stock_symbol" : [
      [
        "GOOG",
        "GOOG"
      ]
    ],
    "close" : [
      [
        200,
        1.7976931348623157e+308
      ]
    ]
  },
  "server" : "localhost:27017"
}

This is the optimal explain output for this query: the values of n and nscanned are the same. But now consider the case where no one index perfectly serves the query. For example, imagine that you don’t have an index on {stock_symbol: 1, close: 1} but instead, you have a separate index on each of those fields. Using the shorthand getIndexKeys() to list indexes, you’d see this:

db.values.getIndexKeys()
[
  {
    "_id" : 1
  },
  {
    "close" : 1
  },
  {
    "stock_symbol" : 1
  }
]

Because your query includes both the stock_symbol and close keys, there’s no obvious index to use. This is where the query optimizer comes in, and the heuristic is more straightforward than you might imagine. It’s based purely on the value of nscanned. In other words, the optimizer chooses the index that requires scanning the fewest index entries. When the query is first run, the optimizer creates a query plan for each index that might efficiently satisfy the query. The optimizer then runs each plan in parallel.[11] Usually, the plan that finishes with the lowest value for nscanned is declared the winner; but on rare occasions, the optimizer may select the full collection scan as the winning plan for a given query. The optimizer then halts any long-running plans and saves the winner for future use.

11

Technically, the plans are interleaved.

The following output is from MongoDB v3.0 using a much smaller collection:

> db.inventory.find({"quantity": 500, "type":"toys"}).limit(1).explain("executionStats")
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "tutorial.inventory",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "$and" : [
                {
                    "quantity" : {
                        "$eq" : 500
                    }
                },
                {
                    "type" : {
                        "$eq" : "toys"
                    }
                }
            ]
        },
        "winningPlan" : {
            "stage" : "LIMIT",
            "limitAmount" : 0,
            "inputStage" : {
                "stage" : "KEEP_MUTATIONS",
                "inputStage" : {
                    "stage" : "FETCH",
                    "filter" : {

                        "type" : {
                            "$eq" : "toys"
                        }
                    },
                    "inputStage" : {
                        "stage" : "IXSCAN",
                        "keyPattern" : {
                            "quantity" : 1
                        },
                        "indexName" : "quantity_1",
                        "isMultiKey" : false,
                        "direction" : "forward",
                        "indexBounds" : {
                            "quantity" : [
                                "[500.0, 500.0]"
                            ]
                        }
                    }
                }
            }
        },
        "rejectedPlans" : [ ]
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 1,
        "executionTimeMillis" : 1,
        "totalKeysExamined" : 2,
        "totalDocsExamined" : 2,
        "executionStages" : {
            "stage" : "LIMIT",
            "nReturned" : 1,
            "executionTimeMillisEstimate" : 0,
            "works" : 3,
            "advanced" : 1,
            "needTime" : 1,
            "needFetch" : 0,
            "saveState" : 0,
            "restoreState" : 0,
            "isEOF" : 1,

            "invalidates" : 0,
            "limitAmount" : 0,
            "inputStage" : {
                "stage" : "KEEP_MUTATIONS",
                "nReturned" : 1,
                "executionTimeMillisEstimate" : 0,
                "works" : 2,
                "advanced" : 1,
                "needTime" : 1,
                "needFetch" : 0,
                "saveState" : 0,
                "restoreState" : 0,
                "isEOF" : 0,
                "invalidates" : 0,
                "inputStage" : {
                    "stage" : "FETCH",
                    "filter" : {
                        "type" : {
                            "$eq" : "toys"
                        }
                    },
                    "nReturned" : 1,
                    "executionTimeMillisEstimate" : 0,
                    "works" : 2,
                    "advanced" : 1,
                    "needTime" : 1,
                    "needFetch" : 0,
                    "saveState" : 0,
                    "restoreState" : 0,
                    "isEOF" : 1,
                    "invalidates" : 0,
                    "docsExamined" : 2,
                    "alreadyHasObj" : 0,
                    "inputStage" : {
                        "stage" : "IXSCAN",
                        "nReturned" : 2,
                        "executionTimeMillisEstimate" : 0,
                        "works" : 2,
                        "advanced" : 2,
                        "needTime" : 0,
                        "needFetch" : 0,
                        "saveState" : 0,
                        "restoreState" : 0,
                        "isEOF" : 1,
                        "invalidates" : 0,
                        "keyPattern" : {
                            "quantity" : 1
                        },
                        "indexName" : "quantity_1",
                        "isMultiKey" : false,
                        "direction" : "forward",
                        "indexBounds" : {
                            "quantity" : [
                                "[500.0, 500.0]"
                            ]
                        },

                        "keysExamined" : 2,
                        "dupsTested" : 0,
                        "dupsDropped" : 0,
                        "seenInvalidated" : 0,
                        "matchTested" : 0
                    }
                }
            }
        }
    },
    "serverInfo" : {
        "host" : "rMacBook.local",
        "port" : 27017,
        "version" : "3.0.6",
        "gitVersion" : "nogitversion"
    },
    "ok" : 1
}

The preceding query examined two documents to return one. Now create another index, this time combining the two fields:

> db.inventory.createIndex( { quantity: 1, type: 1 } )
{
    "createdCollectionAutomatically" : false,
    "numIndexesBefore" : 2,
    "numIndexesAfter" : 3,
    "ok" : 1
}

Now you’re going to rerun the previous query:

> db.inventory.find({"quantity": 500, "type":"toys"}).limit(1).explain("executionStats")
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "tutorial.inventory",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "$and" : [
                {
                    "quantity" : {
                        "$eq" : 500
                    }
                },
                {
                    "type" : {
                        "$eq" : "toys"
                    }
                }
            ]
        },

        "winningPlan" : {
            "stage" : "LIMIT",
            "limitAmount" : 0,
            "inputStage" : {
                "stage" : "FETCH",
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "keyPattern" : {
                        "quantity" : 1,
                        "type" : 1
                    },
                    "indexName" : "quantity_1_type_1",
                    "isMultiKey" : false,
                    "direction" : "forward",
                    "indexBounds" : {
                        "quantity" : [
                            "[500.0, 500.0]"
                        ],
                        "type" : [
                            "["toys", "toys"]"
                        ]
                    }
                }
            }
        },
        "rejectedPlans" : [
            {
                "stage" : "LIMIT",
                "limitAmount" : 1,
                "inputStage" : {
                    "stage" : "KEEP_MUTATIONS",
                    "inputStage" : {
                        "stage" : "FETCH",
                        "filter" : {
                            "type" : {
                                "$eq" : "toys"
                            }
                        },
                        "inputStage" : {
                            "stage" : "IXSCAN",
                            "keyPattern" : {
                                "quantity" : 1
                            },
                            "indexName" : "quantity_1",
                            "isMultiKey" : false,
                            "direction" : "forward",
                            "indexBounds" : {
                                "quantity" : [
                                    "[500.0, 500.0]"
                                ]
                            }
                        }
                    }
                }
            }

        ]
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 1,
        "executionTimeMillis" : 1,
        "totalKeysExamined" : 1,
        "totalDocsExamined" : 1,
        "executionStages" : {
            "stage" : "LIMIT",
            "nReturned" : 1,
            "executionTimeMillisEstimate" : 0,
            "works" : 2,
            "advanced" : 1,
            "needTime" : 0,
            "needFetch" : 0,
            "saveState" : 0,
            "restoreState" : 0,
            "isEOF" : 1,
            "invalidates" : 0,
            "limitAmount" : 0,
            "inputStage" : {
                "stage" : "FETCH",
                "nReturned" : 1,
                "executionTimeMillisEstimate" : 0,
                "works" : 1,
                "advanced" : 1,
                "needTime" : 0,
                "needFetch" : 0,
                "saveState" : 0,
                "restoreState" : 0,
                "isEOF" : 1,
                "invalidates" : 0,
                "docsExamined" : 1,
                "alreadyHasObj" : 0,
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "nReturned" : 1,
                    "executionTimeMillisEstimate" : 0,
                    "works" : 1,
                    "advanced" : 1,
                    "needTime" : 0,
                    "needFetch" : 0,
                    "saveState" : 0,
                    "restoreState" : 0,
                    "isEOF" : 1,
                    "invalidates" : 0,
                    "keyPattern" : {
                        "quantity" : 1,
                        "type" : 1
                    },
                    "indexName" : "quantity_1_type_1",
                    "isMultiKey" : false,
                    "direction" : "forward",
                    "indexBounds" : {

                        "quantity" : [
                            "[500.0, 500.0]"
                        ],
                        "type" : [
                            "["toys", "toys"]"
                        ]
                    },
                    "keysExamined" : 1,
                    "dupsTested" : 0,
                    "dupsDropped" : 0,
                    "seenInvalidated" : 0,
                    "matchTested" : 0
                }
            }
        }
    },
    "serverInfo" : {
        "host" : "rMacBook.local",
        "port" : 27017,
        "version" : "3.0.6",
        "gitVersion" : "nogitversion"
    },
    "ok" : 1
}

This time only one index key and one document were examined to return the one matching document, so the new compound index is a clear improvement.

Showing the query plans and hint( )

You can see this process in action by issuing your query and running explain(). First, drop the compound index on {stock_symbol: 1, close: 1} and build separate indexes on each of these keys:

db.values.dropIndex("stock_symbol_1_close_1")
db.values.createIndex({stock_symbol: 1})
db.values.createIndex({close: 1})

Then pass true to the explain() method, which will include the list of plans the query optimizer attempts. You can see the output in listing 8.2. When using MongoDB v3.0 the possible modes are queryPlanner, executionStats, and allPlansExecution. For backwards compatibility with earlier versions of cursor.explain(), MongoDB v3.0 interprets true as allPlansExecution and false as queryPlanner.

Listing 8.2. Viewing query plans with explain(true)
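
As a sketch, this is how you’d request that output on each version (the query is the Google query used throughout this section):

db.values.find({stock_symbol: "GOOG", close: {$gt: 200}}).explain(true)                 // MongoDB 2.x
db.values.find({stock_symbol: "GOOG", close: {$gt: 200}}).explain("allPlansExecution")  // MongoDB 3.0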

You’ll see right away that the query plan chooses the index on {stock_symbol: 1} to fulfill the query. Lower down, the allPlans key points to a list that includes two additional query plans: one for the index on {close: 1}, and the other a collection scan with a BasicCursor. MongoDB v3.0 calls this list rejectedPlans.

It’s understandable why the optimizer rejects the collection scan, but it might be less clear why the index on {close: 1} doesn’t satisfy. You can use hint() to find out. hint() forces the query optimizer to use a particular index:

query = {stock_symbol: "GOOG", close: {$gt: 200}}
db.values.find(query).hint({close: 1}).explain()
{
  "cursor" : "BtreeCursor close_1",
  "isMultiKey" : false,
  "n" : 730,
  "nscannedObjects" : 5299,
  "nscanned" : 5299,
  "nscannedObjectsAllPlans" : 5299,
  "nscannedAllPlans" : 5299,
  "scanAndOrder" : false,
  "indexOnly" : false,
  "nYields" : 0,
  "nChunkSkips" : 0,
  "millis" : 22,
  "indexBounds" : {
    "close" : [
      [
        200,
        1.7976931348623157e+308
      ]
    ]
  },
  "server" : "localhost:27017"
}

Look at the value for nscanned: 5,299. This is much greater than the 894 entries scanned by the winning plan on the {stock_symbol: 1} index, and the time it takes to complete the query bears this out.

Running the same query using MongoDB v3.0 and interpreting its output is left as an exercise for the reader.

Query plan cache

All that’s left to understand is how the query optimizer caches and expires its choice of query plan. After all, you wouldn’t want the optimizer running all those plans in parallel on each query.

When a successful plan is discovered, the query pattern, the value for nscanned, and the index spec are recorded. For the query we’ve been working with, the recorded structure looks something like this:

{
  pattern: {
    stock_symbol: 'equality',
    close: 'bound'
  },
  index: {
    stock_symbol: 1
  },
  nscanned: 894
}

The query pattern records the kind of match for each key. Here, you’re requesting an exact match on stock_symbol (equality), and a range match on close (bound).[12] Whenever a new query matches this pattern, the index will be used.

12

In case you’re interested, three kinds of range matches are stored: upper, lower, and upper-and-lower. The query pattern also includes any sort specification.

But this shouldn’t hold forever, and it doesn’t. The optimizer automatically expires a plan after any of the following events:

  • 100 writes are made to the collection.
  • Indexes are added or removed from the collection.
  • A query using a cached query plan does a lot more work than expected. Here, what qualifies as “a lot more work” is a value for nscanned exceeding the cached nscanned value by at least a factor of 10.

In the last of these cases, the optimizer will immediately begin interleaving other query plans in case a different index proves more efficient. As you spend time optimizing queries, you’ll likely notice several patterns of queries and indexes that work well together. In the next section, we codify some of these patterns.

If you’re running MongoDB v3.0 you can find more information about query plan cache methods at http://docs.mongodb.org/manual/reference/method/js-plan-cache/.
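
For example, a few of those shell helpers (sketched here against the values collection) let you list cached query shapes, inspect the plans cached for a particular query, and clear the cache entirely:

db.values.getPlanCache().listQueryShapes()
db.values.getPlanCache().getPlansByQuery({stock_symbol: "GOOG", close: {$gt: 200}})
db.values.getPlanCache().clear()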

8.3.3. Query patterns

Here we present several common query patterns and the indexes to use with them. This section’s goal is to help you plan out how to pair your application’s queries with MongoDB’s indexes.

Single-key indexes

To review single-key indexes, recall the index you created for the stock values collection on closing numbers in section 8.3.2, {close: 1}. This index can be used in the following scenarios.

Exact matches

The index is used for exact matches, whether 0, 1, or many results are returned. For example, this query uses an exact match to return all entries with a closing value of 100:

db.values.find({close: 100})
Sorting

A sort on the indexed field. For example:

db.values.find({}).sort({close: 1})

In the case of a sort with no query selector, you’ll probably want to tack on a limit unless you actually plan to iterate over the entire collection.
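
For example, this still walks the index in order but returns only the ten lowest closing values:

db.values.find({}).sort({close: 1}).limit(10)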

Range queries

A range query with or without a sort on the same field. For example, all closing values greater than or equal to 100:

db.values.find({close: {$gte: 100}})

If you add a sort clause on the same key, the optimizer will still be able to use the same index:

db.values.find({close: {$gte: 100}}).sort({close: 1})
Compound-key indexes

Compound-key indexes are a little more complicated, but their uses are analogous to those of single-key indexes. The main thing to remember is that a compound-key index can efficiently serve only a single range or sort per query. Let’s imagine a triple-compound key index, again for stock values, on {close: 1, open: 1, date: 1}. Let’s look at some possible scenarios.

Exact matches

An exact match on the first key, the first and second keys, or the first, second, and third keys, in that order:

db.values.find({close: 1})
db.values.find({close: 1, open: 1})
db.values.find({close: 1, open: 1, date: "1985-01-08"})
Range matches

An exact match on any set of leftmost keys (including none), followed by either a range or a sort using the next key to the right. Thus, all the following queries are ideal for the triple-key index:

db.values.find({}).sort({close: 1})
db.values.find({close: {$gt: 1}})
db.values.find({close: 100}).sort({open: 1})
db.values.find({close: 100, open: {$gt: 1}})
db.values.find({close: 1, open: 1.01, date: {$gt: "2005-01-01"}})
db.values.find({close: 1, open: 1.01}).sort({date: 1})
Covering indexes

If you’ve never heard of covering indexes, realize from the start that the term is something of a misnomer. A covering index isn’t, as the name would suggest, a kind of index but rather a special use of an index. In particular, an index can be said to cover a query if all the data required by the query resides in the index itself. Covered index queries are also known as index-only queries because these queries are served without having to reference the indexed documents themselves. This can result in increased query performance.

Using a covering index in MongoDB is easy. Simply select a set of fields that reside in a single index and exclude the _id field (this field likely isn’t part of the index you’re using). Here’s an example that uses the triple-compound index you created in the previous section:

db.values.find({close: 1}, {open: 1, close: 1, date: 1, _id: 0})

In earlier versions of MongoDB, cursor.explain() returned the indexOnly field to indicate whether the index covered a query and no actual collection data was used to serve the query. In MongoDB v3.0, when an index covers a query, the explain result has an IXSCAN stage that isn’t a descendant of a FETCH stage, and in the executionStats, the value of totalDocsExamined is 0.
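
A quick way to confirm coverage in the v3.0 shell, assuming the triple-compound index {close: 1, open: 1, date: 1} actually exists on the collection, is to check totalDocsExamined on the covered query:

var exp = db.values.find({close: 1}, {open: 1, close: 1, date: 1, _id: 0}).explain("executionStats")
exp.executionStats.totalDocsExamined   // 0 when the index fully covers the query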

8.4. Summary

This chapter is hefty, as indexing is an admittedly rich subject. If some of the ideas are unclear, that’s okay. You should at least come away with a few techniques for examining indexes and avoiding slow queries, and you should know enough to keep learning.

Query optimization is always application-specific, but our hope is that the ideas and techniques provided here will help you tune your queries for the better. Empirical approaches are always useful. Make a habit of profiling and explaining your queries. In the process, you’ll continue learning about the hidden corners of the query optimizer, and you’ll ensure efficient queries for your application.

As you write your own applications, here are some things to remember:

  • Indexes are incredibly useful but carry a cost—they make writes slower.
  • MongoDB generally uses only one index in a query, so queries on multiple fields require compound indexes to be efficient.
  • Order matters when you declare compound indexes.
  • You should plan for, and avoid, expensive queries. Use MongoDB’s explain command, its expensive query logs, and its profiler to discover queries that should be optimized.
  • MongoDB gives you several commands for building indexes, but these always include a cost and may interfere with your application. This means you should optimize your queries and create indexes early, before you have much traffic or data.
  • Optimize queries by reducing the number of documents scanned. The explain command is immensely useful for discovering what a query is doing; use it as a guide for optimization.

With the complexity involved in indexing and query optimization, plain old experimentation may be your best teacher from here on out.
