Creating an index and viewing plans of queries

In this recipe, we will look at querying data, analyzing its performance by explaining the query plan, and then optimizing it by creating indexes.

Getting ready

For the creation of indexes, we need to have a server up and running. A simple single node is what we will need. Refer to the Single node installation of MongoDB recipe in Chapter 1, Installing and Starting the MongoDB Server, to learn how to start the server. The data with which we will be operating needs to be imported into the database. The steps to import the data are given in the Creating test data recipe. Once we have this prerequisite, we are good to go.

How to do it…

We will try to write a query that finds all the zip codes in a given state. To do this, perform the following steps:

  1. Execute the following query to view the plan of a query:
    > db.postalCodes.find({state:'Maharashtra'}).explain()
    

    Take a note of the cursor, n, nscannedObjects, and millis fields in the result of the explain plan operation.

  2. Let's execute the same query again; this time, however, we will limit the results to only 100 results:
    > db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()
    

    Again, take a note of the cursor, n, nscannedObjects, and millis fields in the result.

  3. We will now create an index on the state and pincode fields as follows:
    > db.postalCodes.ensureIndex({state:1, pincode:1})
    
  4. Execute the following query:
    > db.postalCodes.find({state:'Maharashtra'}).explain()
    

    Again, take a note of the cursor, n, nscannedObjects, millis, and indexOnly fields in the result

  5. Since we want only the pin codes, we will modify the query as follows and view its plan:
    > db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()
    

    Take a note of the cursor, n, nscannedObjects, nscanned, millis, and indexOnly fields in the result.

How it works…

There is a lot to explain here. We will first discuss what we just did and how to analyze the stats. Next, we will discuss some points to be kept in mind for index creation and some gotchas.

Analyzing the plan

Let's look at the first step and analyze the output we executed:

> db.postalCodes.find({state:'Maharashtra'}).explain()

The output on my machine is as follows (I am skipping the irrelevant fields for now):

{
  "cursor" : "BasicCursor",

  "n" : 6446,
  "nscannedObjects" : 39732,
  "nscanned" : 39732,

  "millis" : 55,

}

The value of the cursor field in the result is BasicCursor, which means a full collection scan (all the documents are scanned one after another) took place to search for the matching documents. The value of n is 6446, which is the number of results that matched the query. The nscanned and nscannedObjects fields both have a value of 39,732, which is the number of documents scanned to retrieve the results. This is also the total number of documents in the collection, and all of them were scanned for the result. Finally, millis is the number of milliseconds taken to retrieve the result.
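To make these numbers concrete, here is a minimal sketch in plain JavaScript (ordinary Node.js, not the mongo shell) of what a BasicCursor does. The fullScan function and the tiny three-document collection are inventions for illustration only: every document is touched, so nscanned equals the collection size, while n counts only the matches.

```javascript
// Hypothetical in-memory stand-in for the postalCodes collection.
const postalCodes = [
  { state: 'Maharashtra', pincode: 400001 },
  { state: 'Gujarat',     pincode: 380001 },
  { state: 'Maharashtra', pincode: 411001 }
];

// A full collection scan (what a BasicCursor does): visit every
// document and keep only those that satisfy the predicate.
function fullScan(collection, predicate) {
  let nscanned = 0;
  const results = [];
  for (const doc of collection) {
    nscanned++;                       // every document is touched
    if (predicate(doc)) results.push(doc);
  }
  return { n: results.length, nscanned };
}

const plan = fullScan(postalCodes, doc => doc.state === 'Maharashtra');
// plan.nscanned is 3 (the whole collection), plan.n is 2 (the matches)
```

In the real explain output, the same relationship holds at scale: nscanned was 39,732 (everything) while n was only 6,446.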

Improving the query execution time

So far, the query doesn't look too good in terms of performance, and there is great scope for improvement. To demonstrate how the limit applied to the query affects the query plan, we can find the query plan again without the index but with the limit clause:

> db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()

{
  "cursor" : "BasicCursor",…
  "n" : 100,
  "nscannedObjects" : 19951,
  "nscanned" : 19951,

  "millis" : 30,

}

The query plan this time around is interesting. Though we still haven't created an index, we see an improvement in both the execution time and the number of objects scanned to retrieve the results. This is because Mongo stops scanning the remaining documents once the number of documents specified in the limit function is reached. We can thus conclude that it is recommended to use the limit function whenever the maximum number of documents needed is known upfront, as this might give better query performance. The word "might" is important: in the absence of an index, the collection may still be scanned completely if fewer documents match than the limit asks for.
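The early-termination behavior can be sketched with the same hypothetical plain-JavaScript model as before (not the mongo shell; scanWithLimit and the sample data are made up for illustration): the scan breaks out as soon as the limit is satisfied, but still walks the whole collection when it is not.

```javascript
// Hypothetical sample collection.
const sample = [
  { state: 'Maharashtra', pincode: 400001 },
  { state: 'Maharashtra', pincode: 400002 },
  { state: 'Gujarat',     pincode: 380001 },
  { state: 'Maharashtra', pincode: 411001 }
];

// Full scan that stops early once `limit` matches have been found.
function scanWithLimit(collection, predicate, limit) {
  let nscanned = 0;
  const results = [];
  for (const doc of collection) {
    nscanned++;
    if (predicate(doc)) {
      results.push(doc);
      if (results.length === limit) break;  // limit reached: stop scanning
    }
  }
  return { n: results.length, nscanned };
}

const limited = scanWithLimit(sample, d => d.state === 'Maharashtra', 2);
// limit of 2 is met after only the first 2 documents: nscanned is 2

const unlimited = scanWithLimit(sample, d => d.state === 'Maharashtra', 100);
// limit is never met, so all 4 documents are scanned anyway: nscanned is 4
```

The second call is the "might" from the paragraph above: with no index and too few matches, limit buys nothing.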

Improvement using indexes

Moving on, we will create a compound index on state and pincode. The order of the index is ascending in this case (as the value is 1). The order is not significant unless we plan to sort on multiple keys; it then decides whether the result can be sorted using the index alone or whether Mongo needs to sort it in memory before returning the results. As far as the plan of the query is concerned, we can see that there is a significant improvement:

{
  "cursor" : "BtreeCursor state_1_pincode_1",
  

  "n" : 6446,
  "nscannedObjects" : 6446,
  "nscanned" : 6446,

  "indexOnly" : false,

  "millis" : 16,

}

The cursor field now has the value BtreeCursor state_1_pincode_1, which shows that the index is indeed used now. As expected, the number of results stays the same at 6446. The number of entries scanned in the index and documents scanned in the collection have both come down to the same number of documents as in the result. This is because the index gave us the starting document from which to scan, and only the required number of documents was then scanned. This is like using a book's index to find a word rather than scanning the entire book for it. The time, millis, has come down too, as expected.
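The "starting document" idea can be sketched as follows, again in hypothetical plain JavaScript (not the mongo shell, and a real B-tree seek is logarithmic rather than a linear findIndex): because the index is kept sorted on (state, pincode), all entries for one state are contiguous, so the scan seeks to the first match and walks forward only while entries still match.

```javascript
// A sorted index on (state, pincode); ptr points back at the document.
const index = [
  { state: 'Gujarat',     pincode: 380001, ptr: 2 },
  { state: 'Maharashtra', pincode: 400001, ptr: 0 },
  { state: 'Maharashtra', pincode: 411001, ptr: 3 }
];

function indexScan(idx, state) {
  // Seek to the first matching entry (a real B-tree does this in O(log n)),
  // then scan forward only while the entries still match.
  const start = idx.findIndex(e => e.state === state);
  let nscanned = 0;
  const ptrs = [];
  if (start === -1) return { nscanned, ptrs };
  for (let i = start; i < idx.length && idx[i].state === state; i++) {
    nscanned++;
    ptrs.push(idx[i].ptr);
  }
  return { nscanned, ptrs };
}

const scan = indexScan(index, 'Maharashtra');
// Only the 2 matching index entries are scanned, never the Gujarat one
```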

Improvement using covered indexes

This leaves us with one field, indexOnly, and we will see what this means. To know what this value is, we need to look briefly at how indexes operate.

Indexes store a subset of fields of the original document in the collection. The fields present in the index are the same as those on which the index is created. The fields, however, are kept sorted in the index in an order specified during the creation of the index. Apart from the fields, there is an additional value stored in the index; this acts as a pointer to the original document in the collection. Thus, whenever the user executes a query, if the query contains fields on which an index is present, the index is consulted to get a set of matches. The pointer stored with the index entries that match the query is then used to make another I/O operation to fetch the complete document from the collection; this document is then returned to the user.

The value of indexOnly, which is false, indicates that the data requested by the user in the query is not entirely present in the index; an additional I/O operation, following the pointer from the index, is needed to retrieve the entire document from the collection. Had all the requested values been present in the index itself, this additional operation would not be necessary, and the data would be returned from the index alone. Such an index is called a covered index, and the value of indexOnly, in this case, will be true.

In our case, we just need the pin codes, so why not use projection in our queries to retrieve just what we need? This also makes the query covered: the index entry, which contains just the state name and pin code, can serve the required data completely without retrieving the original document from the collection. The plan of the query in this case is interesting too. Execute the following query and view its plan:

db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()
{
  "cursor" : "BtreeCursor state_1_pincode_1",

  "n" : 6446,
  "nscannedObjects" : 0,
  "nscanned" : 6446,

  "indexOnly" : true,

  "millis" : 15,

}

The values of the nscannedObjects and indexOnly fields are worth observing. As expected, since the only data we requested in the projection of the find query is the pin code, which can be served from the index alone, the value of indexOnly is true. In this case, we scanned 6,446 entries in the index; thus, the nscanned value is 6446. We, however, didn't reach out to any document in the collection on disk, as this query was covered by the index alone and no additional I/O was needed to retrieve the entire document. Hence, the value of nscannedObjects is 0.
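The difference between the covered and uncovered cases can be sketched with one more hypothetical plain-JavaScript model (not the mongo shell; the find function, the city field, and the fetch-counting are all inventions for illustration): when every projected field already lives in the index entry, no document is ever fetched, so nscannedObjects stays at 0.

```javascript
const collection = [
  { state: 'Maharashtra', pincode: 400001, city: 'Mumbai' },
  { state: 'Gujarat',     pincode: 380001, city: 'Ahmedabad' },
  { state: 'Maharashtra', pincode: 411001, city: 'Pune' }
];

// Index on (state, pincode); ptr points back at the full document.
const index = collection
  .map((d, i) => ({ state: d.state, pincode: d.pincode, ptr: i }))
  .sort((a, b) => a.state.localeCompare(b.state) || a.pincode - b.pincode);

function find(state, projection) {
  const entries = index.filter(e => e.state === state);
  // Covered: every projected field already lives in the index entry.
  const covered = projection.every(f => f === 'state' || f === 'pincode');
  let nscannedObjects = 0;
  const results = entries.map(e => {
    // Uncovered queries must follow ptr and fetch the document (extra I/O).
    const src = covered ? e : (nscannedObjects++, collection[e.ptr]);
    const out = {};
    for (const f of projection) out[f] = src[f];
    return out;
  });
  return { n: results.length, nscanned: entries.length,
           nscannedObjects, indexOnly: covered, results };
}

const coveredPlan   = find('Maharashtra', ['pincode']);          // served by index
const uncoveredPlan = find('Maharashtra', ['pincode', 'city']);  // needs documents
```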

As this collection in our case is small, we do not see a significant difference in the execution time of the query. This will be more evident on larger collections. Making use of indexes is great and gives good performance. Making use of covered indexes gives even better performance.

Note

Another thing to remember is that, wherever possible, try to use projection to retrieve only the fields we need. The _id field is retrieved by default unless we set _id:0 in the projection; if _id is not part of the index, retrieving it forces a fetch of the document, and the query is no longer covered. Executing a covered query is the most efficient way to query a collection.

Some gotchas of index creation

We will now see some pitfalls in index creation and some facts about the array field, which is used in the index.

Some of the operators that do not use the index efficiently are the $where, $nin, and $exists operators. Whenever these operators are used in the query, one should bear in mind a possible performance bottleneck as the data size increases. Similarly, the $in operator should be preferred over the $or operator, as both can more or less be used to achieve the same result. As an exercise, try to find the pin codes in the states of Maharashtra and Gujarat from the postalCodes collection. Write two queries: one using the $or operator and the other using the $in operator. Explain the plan for both of these queries.
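Without giving away the explain plans from the exercise, the equivalence of the two predicates themselves can be illustrated with a plain-JavaScript sketch (again not the mongo shell; the three-document array is hypothetical). An $in-style match tests one field against a set of values, while the $or form spells out the same disjunction clause by clause:

```javascript
const docs = [
  { state: 'Maharashtra', pincode: 400001 },
  { state: 'Gujarat',     pincode: 380001 },
  { state: 'Karnataka',   pincode: 560001 }
];

// $in-style predicate: one field tested against a set of values.
const viaIn = docs.filter(d =>
  ['Maharashtra', 'Gujarat'].includes(d.state));

// $or-style predicate: an explicit disjunction of single-field clauses.
const viaOr = docs.filter(d =>
  d.state === 'Maharashtra' || d.state === 'Gujarat');

// Both select exactly the same two documents.
```

Both filters select the same documents; the exercise is to compare how the server plans each form when a real index on state is available.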

What happens when an array field is used in the index? Mongo creates an index entry for each element present in the array field of a document. So, if there are 10 elements in an array in a document, there will be 10 index entries, one for each element in the array. However, there is a constraint while creating indexes that contain array fields: when creating an index using multiple fields, no more than one field can be of the array type. This is done to prevent a possible explosion in the number of index entries on adding even a single element to an indexed array. If we think about it carefully, for each element in the array, an index entry is created. If multiple array fields were allowed to be part of an index, the number of entries in the index would be the product of the lengths of these array fields. For example, had an index on two array fields, each of length 10, been allowed, a single document would add 100 entries to the index.
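The entry counts above can be sketched in plain JavaScript (hypothetical helpers, not the mongo shell or MongoDB's actual index machinery): one entry per array element for a single indexed array, and a cross product if two indexed arrays were hypothetically combined.

```javascript
// One index entry per array element, each pointing back at the document.
function multikeyEntries(doc, field) {
  return doc[field].map(value => ({ key: value, ptr: doc._id }));
}

const doc = { _id: 1, tags: ['travel', 'india', 'postal'] };
const entries = multikeyEntries(doc, 'tags');
// 3 array elements produce 3 index entries

// Hypothetical: if two indexed array fields were allowed, the entries
// would be their cross product, one entry per pair of elements.
function crossProduct(a, b) {
  const out = [];
  for (const x of a) for (const y of b) out.push([x, y]);
  return out;
}

const hypothetical = crossProduct(new Array(10).fill(0), new Array(10).fill(0));
// 10 x 10 elements would mean 100 entries for a single document
```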

This should be good enough for now to scratch the surface of plain vanilla indexes. We will see more options and types in some of the upcoming recipes.
