Monitoring instances and setting up alerts on MMS

The previous couple of recipes showed us how to set up an MMS account, set up an agent, add hosts, and manage user access to the MMS console. The core objective of MMS, however, is monitoring the host instances, which we have not discussed yet. In this recipe, we will perform some operations on the host that we added to MMS in the first recipe and monitor it from the MMS console.

Getting ready

Follow the recipe Signing up for MMS and setting up an MMS monitoring agent; that is pretty much all that is needed for this recipe. You may choose to have a standalone instance or a replica set; either way is fine. Also, open a mongo shell and connect to the primary instance from it (if it is a replica set).

How to do it…

  1. Start by logging into the MMS console and clicking on Deployment on the left. Then, click on the Deployment link in the submenu again, as shown in the following screenshot:

    Click on one of the hostnames to see a large variety of graphs showing various statistics. In this recipe, we will analyze the majority of these.

  2. Open the bundle downloaded for the book. In Chapter 4, Administration, we used a JavaScript file named KeepServerBusy.js to keep the server busy with some operations. We will be using the same script this time around.
  3. In the operating system shell, execute the following with the .js file in the current directory. In my case, the shell connects to port 27000 for the primary:
    $ mongo KeepServerBusy.js --port 27000 --quiet
    
  4. Once it's started, keep it running and give it about 5 to 10 minutes before you start monitoring the graphs on the MMS console.

How it works…

In Chapter 4, Administration, we saw the recipe The mongostat and mongotop utilities, which demonstrated how these utilities can be used to get the current operations and resource utilization. That is a fairly basic and helpful way to monitor a particular instance. MMS, however, gives us one place to monitor the MongoDB instances with easy-to-understand graphs. MMS also gives us historical stats, which mongostat and mongotop cannot.

Before we go ahead with the analysis of the metrics, I would like to mention that in the case of MMS monitoring, the data is neither queried nor sent out over the public network. It is only the statistics that are sent over a secure channel by the agent. The source code for the agent is open source and is available for examination if needed. The mongod servers need not be accessible from the public network, as the cloud-based MMS service never communicates with the server instances directly. It is the MMS agent that communicates with the MMS service. Typically, one agent is enough to monitor several servers unless you plan to segregate them into different groups. Also, it is recommended to run the agent on a dedicated machine/virtual machine and not share it with any of the mongod or mongos instances, unless it is a less crucial test instance group you are monitoring.

Let's see some of these statistics on the console; we start with the memory-related ones. The following graph shows the resident, mapped, and virtual memory.


As we can see, the resident memory for the data set is 82 MB, which is quite low; it is the actual physical memory used up by the mongod process. This current value is significantly below the free memory available, and generally it will increase over a period of time until it has used up a large chunk of the total available physical memory. This is automatically taken care of by the mongod server process, and we can't force it to use up more memory even though it is available on the machine it is running on.

The mapped memory, on the other hand, is about the total size of the database and is mapped by MongoDB. This size can be (and usually is) much higher than the available physical memory, which enables the mongod process to address the entire dataset as if it were present in memory, even when it isn't. MongoDB offloads the responsibility of mapping and loading data to and from the disk to the underlying operating system. Whenever a memory location is accessed and it is not available in the RAM (that is, the resident memory), the operating system fetches the page into memory, evicting some pages to make space for the new page if necessary.

What exactly is a memory-mapped file? Let's try to see with a super scaled-down version. Suppose we have a file of 1 KB (1024 bytes) and the RAM is only 512 bytes; then obviously we cannot have the whole file in memory. However, you can ask the operating system to map this file to the available RAM in pages. Suppose each page is 128 bytes; then the total file is 8 pages (128 * 8 = 1024). But the OS can only load four pages, and we assume it has loaded the first four pages (up to 512 bytes) into memory. When we access byte number 200, it is okay and found in memory, as it is present on page 2. But what if we access byte 800, which is logically on page 7 and is not loaded in memory? What the OS does is take one page out of memory and load page 7, which contains byte number 800. MongoDB as an application gets the impression that everything is loaded in memory and is accessed by the byte index, but actually it isn't, and the OS transparently does the work for us. Since the page accessed was not present in memory and we had to go to the disk to load it, this is called a page fault.
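To make the arithmetic concrete, here is a tiny illustrative snippet (not part of the book's bundle) that computes which page a byte offset falls on for the 1 KB file and 128-byte pages described above; it can be run as-is in the mongo shell:

  // Illustration only: which page does a given byte offset fall on?
  // Uses the example above: a 1 KB file split into 128-byte pages.
  var pageSize = 128;
  var pageFor = function(offset) {
    return Math.floor(offset / pageSize) + 1;  // pages numbered from 1
  };
  print(pageFor(200));  // 2 -> already loaded in our example, no page fault
  print(pageFor(800));  // 7 -> not loaded, so accessing it causes a page fault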

Getting back to the stats shown in the graph, the virtual memory contains all the memory usage, including the mapped memory plus any additional memory used, such as the memory for the thread stack associated with each connection. If journaling is enabled, this size will definitely be more than twice that of the mapped memory, as journaling too will have a separate memory mapping for the data. Thus, we have two addresses mapping the same memory location. This doesn't mean that the page will be loaded twice; it just means that two different memory locations can be used to address the same physical memory. Very high virtual memory might need some investigation. There is no predetermined definition of a too high or too low value; generally, these values are monitored for your system under normal circumstances, when you are happy with the performance of your system. These benchmark values should then be compared with the figures seen when system performance goes down, and appropriate action can then be taken.
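If you want to cross-check these figures from the mongo shell rather than the MMS graphs, the mem section of the serverStatus command reports the same counters (the exact values will of course be specific to your instance):

  // Memory counters (in MB) as reported by the server itself
  var mem = db.serverStatus().mem;
  print('resident (MB): ' + mem.resident);
  print('mapped (MB)  : ' + mem.mapped);
  print('virtual (MB) : ' + mem.virtual);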

As we saw earlier, page faults are caused when an accessed memory location is not present in the resident memory, causing the OS to load the page from the disk into memory. This IO activity will definitely cause performance to go down, and too many page faults can bring down database performance dramatically. The following screenshot shows quite a few page faults happening per minute. However, if the disk used is an SSD instead of spinning disks, the hit in terms of seek time from the drive might not be significantly high.


A large number of page faults usually occurs when there isn't enough physical memory to accommodate the data set and the OS needs to get the data from the disk into memory. Note that the stat shown in the preceding screenshot was taken on a Windows platform and might seem high for a very trivial operation. This value is the sum of hard and soft page faults and doesn't really give a true figure of how good (or bad) the system is. These figures would be different on a Unix-based OS. There is a JIRA ticket (https://jira.mongodb.org/browse/SERVER-5799) open as of the writing of this book that reports this problem.
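The counter behind this graph can also be sampled from the shell; the extra_info section of serverStatus carries a cumulative page_faults figure (how it is computed differs between platforms, as noted above):

  // Cumulative page faults since the server started
  print('page faults: ' + db.serverStatus().extra_info.page_faults);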

One thing you might need to remember is that, in production systems, MongoDB doesn't work well with a NUMA architecture, and you might see a lot of page faults happening even if the available memory seems to be high enough. Refer to http://docs.mongodb.org/manual/administration/production-notes/ for more details.

There is an additional graph that gives some details about non-mapped memory. As we saw earlier in this section, there are three types of memory: mapped, resident, and virtual. Mapped memory is always less than virtual memory. Virtual memory will be more than twice the mapped memory if journaling is enabled. If we look at the image given earlier in this section, we see that the mapped memory is 192 MB whereas the virtual memory is 532 MB. Since journaling is enabled, the virtual memory is more than twice the mapped memory. When journaling is enabled, the same page of data is mapped twice in memory. Note that the page is physically loaded only once; it is just that the same location can be accessed using two different addresses. Let's find the difference between the virtual memory, which is 532 MB, and twice the mapped memory, which is 384 MB (2 * 192 = 384). The difference between these figures is 148 MB (532 - 384).

What we see here is the portion of virtual memory that is not mapped memory. This value is the same as what we just calculated.
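The same figure can be approximated from the shell by repeating the calculation above on the serverStatus counters; this is only a rough sketch of the arithmetic, not an exact reproduction of the MMS graph:

  // Non-mapped memory ~= virtual - 2 * mapped (the factor of 2 accounts
  // for the second mapping of the data files when journaling is enabled)
  var mem = db.serverStatus().mem;
  print('non-mapped memory (MB): ' + (mem.virtual - 2 * mem.mapped));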


As mentioned earlier, there is no defined high or low value for non-mapped memory; however, when the value reaches the order of GBs, we might have to investigate. Possibly the number of open connections is high, and we need to check whether client applications are leaking connections by not closing them after use. There is a graph that gives us the number of connections open, and it looks as follows:

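Besides the graph, the current connection count can be read straight from the shell; the connections section of serverStatus gives the open and still-available counts:

  // Open connections and how many more the server can accept
  var conn = db.serverStatus().connections;
  print('current connections  : ' + conn.current);
  print('available connections: ' + conn.available);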

Once we know the number of connections and find it too high compared to the expected count, we will need to find the clients that have opened the connections to that instance. We can execute the following JavaScript code from the shell to get those details. Unfortunately, at the time of writing this book, MMS doesn't have a feature to list out the client connection details.

testMon:PRIMARY> var currentOps = db.currentOp(true).inprog;
currentOps.forEach(function(c) {
  if (c.hasOwnProperty('client')) {
    print('Client: ' + c.client + ', connection id is: ' + c.desc);
  }
  // Get other details as needed
});

The db.currentOp method, when invoked with true, also includes idle and system operations in the result. We then iterate through all the results and print out the client host and the connection details. A typical document in the result of currentOp looks like the following. You can choose to tweak the preceding code to include more details as per your needs:

  {
    "opid" : 62052485,
    "active" : false,
    "op" : "query",
    "ns" : "",
    "query" : {
      "replSetGetStatus" : 1,
      "forShell" : 1
    },
    "client" : "127.0.0.1:64460",
    "desc" : "conn3651",
    "connectionId" : 3651,
    "waitingForLock" : false,
    "numYields" : 0,
    "lockStats" : {
      "timeLockedMicros" : {

      },
      "timeAcquiringMicros" : {

      }
    }
  }

In Chapter 4, Administration, we saw the recipe The mongostat and mongotop utilities, which was used to get some details on the percentage of time a database was locked and the number of update, insert, delete, and getmore operations executed per second. You may refer to that recipe and try it out. We had used the same JavaScript file that we are using now to keep the server busy.

In the MMS console, we have graphs giving these details, as follows:


The first one, opcounters, shows the number of operations executed at a particular point in time. This should be similar to what we saw using the mongostat utility. The one on the right shows the percentage of time a database was locked. The drop-down menu lists the database names; we can select the database we want to see the stats for. Again, this statistic can be seen using the mongostat utility. The only difference is that with the command-line utility, we see the stats as of the current time, whereas here we also see the historical stats.
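The counters that back these two graphs can be sampled from the shell as well. The opcounters section of serverStatus holds the operation counts since server start (MMS plots their per-interval deltas), and on the 2.x servers used in this recipe, the locks section reports lock times per database; the database name test below is just a placeholder:

  // Operation counters since the server started
  printjson(db.serverStatus().opcounters);

  // Per-database lock times (MongoDB 2.x); replace 'test' with your database
  printjson(db.serverStatus().locks['test']);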

In MongoDB, indexes are stored in BTrees, and the next graph shows the number of times the BTree index was accessed, hit, and missed. At a minimum, the RAM should be enough to accommodate the indexes for optimum performance, so in this metric, the misses should be 0 or very low. A high number of misses results in a page fault for the index and possibly additional page faults for the corresponding data if the query is not covered, that is, if all its data cannot be sourced from the index, which is a double blow for performance. One good practice while querying is to use projections and fetch only the necessary fields from the document. This is helpful whenever all the selected fields are present in an index, in which case the query becomes covered and all the necessary data is sourced from the index only. To learn more about covered indexes, refer to the recipe Creating index and viewing plans of queries in Chapter 2, Command-line Operations and Indexes.
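On the MongoDB 2.x servers this recipe targets, the same access/hit/miss counters are exposed in the indexCounters section of serverStatus (this section was removed in later releases). The second half of the snippet is a hypothetical covered query: with an index on postalCode and a projection that keeps only the indexed field, all the data can be sourced from the index:

  // BTree index access, hit, and miss counters (MongoDB 2.x only)
  printjson(db.serverStatus().indexCounters);

  // Hypothetical covered query: index on postalCode plus a projection that
  // excludes _id and fetches only the indexed field
  db.people.ensureIndex({postalCode: 1});
  db.people.find({postalCode: '560001'}, {postalCode: 1, _id: 0});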


For busy applications with very high volumes, where multiple write and read operations contend for the lock, the operations queue up. As of Version 2.4 of MongoDB, locks are at the database level. Thus, even if the writes are happening on another collection, read operations on any collection in that database will block. This queuing of operations affects the performance of the system and is a good indicator that the data might need to be sharded to scale the system.

Tip

Remember, no value is defined as high or low; it is the acceptable value on an application-to-application basis.
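The queue lengths plotted in this graph can also be inspected from the shell; the globalLock.currentQueue section of serverStatus shows how many readers and writers are currently waiting for a lock:

  // Operations currently queued up waiting for a lock
  var queue = db.serverStatus().globalLock.currentQueue;
  print('queued readers: ' + queue.readers);
  print('queued writers: ' + queue.writers);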


MongoDB writes the data to the journal immediately and flushes the data files to disk periodically. The following metric gives us the flush time per minute at a given point in time. If the flush takes up a significant percentage of the time per minute, we can safely say that the write operations are forming a bottleneck for performance.
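On the MMAPv1-era servers this recipe uses, the raw numbers behind this graph live in the backgroundFlushing section of serverStatus:

  // How often data files are flushed to disk and how long the flushes take
  var flush = db.serverStatus().backgroundFlushing;
  print('flushes so far    : ' + flush.flushes);
  print('average flush (ms): ' + flush.average_ms);
  print('last flush (ms)   : ' + flush.last_ms);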


There's more…

In this recipe, we have seen how to monitor MongoDB instances/clusters. However, we still haven't seen how to set up alerts to get notified when certain threshold values are crossed. In the next recipe, we will see how to achieve this with a sample alert that is sent out over e-mail when the page faults exceed a predetermined value.
