Monitoring your cluster's state and health

Monitoring is essential when it comes to handling your cluster and ensuring it is in a healthy state. It allows administrators and developers to detect possible problems and prevent them before they occur, or to act as soon as they start showing. In the worst case, monitoring allows us to do a post mortem analysis of what happened to the application; in this case, our Elasticsearch cluster and each of the nodes.

Elasticsearch provides very detailed information that allows us to check and monitor our nodes or the cluster as a whole. This includes statistics and information about the servers, nodes, indices, and shards. Of course, we are also able to get information about the entire cluster state. Before we get into the details of these APIs, please remember that they are complex and we are only describing the basics. We will try to show you where to start so you'll know what to look for when you need very detailed information.

Cluster health API

One of the most basic APIs is the cluster health API, which allows us to get information about the entire cluster state with a single HTTP command. For example, let's run the following command:

curl -XGET 'localhost:9200/_cluster/health?pretty'

A sample response returned by Elasticsearch for the preceding command looks as follows:

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 11,
  "active_shards" : 11,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 11,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0
}

The most important information is about the status of the cluster. In our example, we see that the cluster is in yellow status. This means that all the primary shards have been allocated properly, but the replicas were not (because there is only a single node in the cluster, but that doesn't matter for now).

Of course, apart from the cluster name and status, we can see whether the request timed out, how many nodes there are, how many data nodes, primary shards, initializing shards, unassigned ones, and so on.

Let's stop here and talk about the cluster and when the cluster, as a whole, is fully operational. A cluster is fully operational when Elasticsearch is able to allocate all the shards and replicas according to the configuration. This is when the cluster is in the green state. The yellow state means that we are ready to handle requests because the primary shards are allocated, but some (or all) replicas are not. The last state, the red one, means that at least one primary shard was not allocated and because of this, the cluster is not ready yet. That means that queries may return errors or incomplete results.
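
The three states map naturally onto a small check in a monitoring script. The following is a minimal Python sketch, not something Elasticsearch provides; the function name describe_health and the message wording are assumptions made for illustration, and the dictionary is assumed to be the parsed JSON of a cluster health response such as the one shown earlier.

```python
def describe_health(health):
    """Turn a parsed cluster health response into a short message."""
    status = health["status"]
    if status == "green":
        return "green: all primary shards and replicas are allocated"
    if status == "yellow":
        return ("yellow: primaries are allocated, %d shard copies are unassigned"
                % health["unassigned_shards"])
    if status == "red":
        return "red: at least one primary shard is not allocated"
    raise ValueError("unexpected status: %r" % status)


# Values copied from the sample response shown earlier.
sample = {"status": "yellow", "active_primary_shards": 11,
          "active_shards": 11, "unassigned_shards": 11}
print(describe_health(sample))
```

A real script would first fetch and parse the response; that step is left out here so the sketch stays independent of any HTTP client.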

The preceding command can also be executed to check the health state of certain indices. For example, if we would like to check the health of the library and map indices, we would run the following command:

curl -XGET 'localhost:9200/_cluster/health/library,map/?pretty'

Controlling information details

Elasticsearch allows us to specify a special level parameter, which can take the value of cluster (default), indices, or shards. This allows us to control the details of information returned by the health API. We've already seen the default behavior. When setting the level parameter to indices, apart from the cluster information, we will also get per index health. Setting the mentioned parameter to shards tells Elasticsearch to return per shard information in addition to what we've seen in the example.
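
As a rough illustration of what the indices level adds, the sketch below picks out the indices whose own status is not green. It assumes the per index health is returned under an indices key, with one object per index name that carries its own status field; the sample data is hypothetical and heavily trimmed, and the function name indices_not_green is made up for this example.

```python
def indices_not_green(health):
    """Return {index_name: status} for every index whose status is not green."""
    # The "indices" key is only expected when level=indices or level=shards is used.
    per_index = health.get("indices", {})
    return {name: info["status"]
            for name, info in per_index.items()
            if info["status"] != "green"}


# Hypothetical, trimmed response for a request sent with level=indices.
sample = {
    "status": "yellow",
    "indices": {
        "library": {"status": "yellow"},
        "map": {"status": "green"},
    },
}
print(indices_not_green(sample))
```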

Additional parameters

In addition to the level parameter, we have a few additional parameters that can control the behavior of the health API.

The first of the mentioned parameters is timeout, which allows us to control the maximum time the command will wait when one of the following parameters is used: wait_for_status, wait_for_nodes, wait_for_relocating_shards, and wait_for_active_shards. By default, it is set to 30s, which means that the health command will wait for at most 30 seconds and then return the response.

The wait_for_status parameter allows us to tell Elasticsearch which health status the cluster should reach before the command returns. It can take the values of green, yellow, and red. For example, when set to green, the health API call will not return the results until the green status or the timeout is reached.

The wait_for_nodes parameter allows us to set the required number of nodes available to return the health command response (or until a defined timeout is reached). It can be set to an integer number like 3 or to a simple equation like >=3 (means, greater than or equal to three nodes) or <=3 (means less than or equal to three nodes).

The wait_for_active_shards parameter means that Elasticsearch will wait for a specified number of active shards to be present before returning the response.

The last parameter is wait_for_relocating_shards, which is not specified by default. It allows us to tell Elasticsearch how many relocating shards it should wait for (or until the timeout is reached). Setting this parameter to 0 means that Elasticsearch should wait until there are no relocating shards left.

An example usage of the health command with some of the mentioned parameters is as follows:

curl -XGET 'localhost:9200/_cluster/health?wait_for_status=green&wait_for_nodes=>=3&timeout=100s'
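
When such a URL is built in a script rather than typed by hand, characters such as > and = in the wait_for_nodes value are usually percent-encoded. The following Python sketch shows one way to do that with the standard library; the helper name health_url and the default base address are assumptions made for illustration.

```python
from urllib.parse import urlencode


def health_url(base="http://localhost:9200", **params):
    """Build a cluster health URL; urlencode escapes characters such as '>' and '='."""
    url = base + "/_cluster/health"
    if params:
        url += "?" + urlencode(params)
    return url


url = health_url(wait_for_status="green", wait_for_nodes=">=3", timeout="100s")
print(url)
```

The parameter names are the ones described above; only the helper itself is invented.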

Indices stats API

An Elasticsearch index is the place where our data lives and it is a crucial part of most deployments. With the use of the indices stats API, available using the _stats endpoint, we can get a lot of information about the indices living inside our cluster. Of course, as with most of the APIs in Elasticsearch, we can send a command to get the information about all the indices (using the pure _stats endpoint), about one particular index (for example library/_stats), or several indices at the same time (for example library,map/_stats). For example, to check the statistics for the map and library indices we've used in the book, we could run the following command:

curl -XGET 'localhost:9200/library,map/_stats?pretty'

The response to the preceding command has more than 700 lines, so we only describe its structure and omit the response itself. Apart from the information about the response status and the response time, we can see three objects named primaries, total (both in the _all object), and indices. The indices object contains information about the library and map indices. The primaries object contains information about the primary shards only, and the total object contains information about all the shards, including replicas. All these objects can contain objects describing a particular statistic such as the following: docs, store, indexing, get, search, merges, refresh, flush, warmer, query_cache, fielddata, percolate, completion, segments, translog, suggest, request_cache, and recovery.
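
To make the structure a little more concrete, here is a small Python sketch that walks a parsed stats response and collects document counts from the _all and indices objects. The sample dictionary is hypothetical and trimmed to the docs statistic only, and the function name docs_summary is an assumption made for this example.

```python
def docs_summary(stats):
    """Collect document counts from the _all and indices objects of a stats response."""
    summary = {
        "primaries": stats["_all"]["primaries"]["docs"]["count"],
        "total": stats["_all"]["total"]["docs"]["count"],
    }
    # Each index has its own primaries and total objects.
    for name, index_stats in stats["indices"].items():
        summary[name] = index_stats["primaries"]["docs"]["count"]
    return summary


# Hypothetical, trimmed response for library,map/_stats on a single-node cluster.
sample = {
    "_all": {
        "primaries": {"docs": {"count": 10, "deleted": 0}},
        "total": {"docs": {"count": 10, "deleted": 0}},
    },
    "indices": {
        "library": {"primaries": {"docs": {"count": 4, "deleted": 0}},
                    "total": {"docs": {"count": 4, "deleted": 0}}},
        "map": {"primaries": {"docs": {"count": 6, "deleted": 0}},
                "total": {"docs": {"count": 6, "deleted": 0}}},
    },
}
print(docs_summary(sample))
```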

We can limit the amount of information that we get from the indices stats API by providing the type of data we are interested in using the names of the statistics mentioned previously. For example, if we want to get information about indexing and searching, we can run the following command:

curl -XGET 'localhost:9200/library,map/_stats/indexing,search?pretty'

Let's discuss the information stored in those objects.

Docs

The docs section of the response shows information about indexed documents. For example, it could look as follows:

"docs" : {
 "count" : 4,
 "deleted" : 0
}

The main information is the count, indicating the number of documents in the described index. When we delete documents from the index, Elasticsearch doesn't remove these documents immediately and only marks them as deleted. Documents are physically deleted during the segment merge process. The number of documents marked as deleted is presented by the deleted attribute and should be 0 right after the merge.
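
One simple use of these two numbers is to estimate how much of an index consists of documents that are marked as deleted but not yet merged away. The Python sketch below is only an illustration; the function name deleted_ratio is an assumption, and it treats count as the number of live documents.

```python
def deleted_ratio(docs):
    """Fraction of documents in the index that are marked as deleted."""
    live = docs["count"]
    deleted = docs["deleted"]
    total = live + deleted
    # An empty index has no deleted documents by definition.
    return deleted / total if total else 0.0


# First call uses the values from the docs section shown above.
print(deleted_ratio({"count": 4, "deleted": 0}))
# Second call uses hypothetical values: one of four documents is marked as deleted.
print(deleted_ratio({"count": 3, "deleted": 1}))
```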

Store

The next statistic, the store one, provides information regarding storage. For example, such a section could look as follows:

"store" : {
 "size_in_bytes" : 6003,
 "throttle_time_in_millis" : 0
}

The main information is about the index (or indices) size. We can also look at throttling statistics. This information is useful when the system has I/O performance problems and limits have been configured on internal operations such as segment merging.

Indexing, get, and search

The indexing, get, and search sections of the response provide information about data manipulation: indexing (including delete operations), real-time get, and searching. Let's look at the following example returned by Elasticsearch:

"indexing" : {
 "index_total" : 0,
 "index_time_in_millis" : 0,
 "index_current" : 0,
 "delete_total" : 0,
 "delete_time_in_millis" : 0,
 "delete_current" : 0,
 "noop_update_total" : 0,
 "is_throttled" : false,
 "throttle_time_in_millis" : 0
},
"get" : {
 "total" : 0,
 "time_in_millis" : 0,
 "exists_total" : 0,
 "exists_time_in_millis" : 0,
 "missing_total" : 0,
 "missing_time_in_millis" : 0,
 "current" : 0
},
"search" : {
 "open_contexts" : 0,
 "query_total" : 0,
 "query_time_in_millis" : 0,
 "query_current" : 0,
 "fetch_total" : 0,
 "fetch_time_in_millis" : 0,
 "fetch_current" : 0,
 "scroll_total" : 0,
 "scroll_time_in_millis" : 0,
 "scroll_current" : 0
}

As you can see, all of these statistics have similar structures. We can read the total time spent in various request types (in milliseconds) and the number of requests (which, together with the total time, allows us to calculate the average time of a single query). In the case of get requests, a valuable piece of information is how many fetches were unsuccessful (missing documents); the indexing section has information about throttling, and the search section includes information regarding scrolling.
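
The averages mentioned above are plain divisions of a total time by a request count, with a guard for counters that are still zero. The following Python sketch shows the arithmetic; the helper names ratio and request_summary and the sample counters are assumptions made for illustration.

```python
def ratio(numerator, denominator):
    """Divide safely; return None when nothing has been counted yet."""
    return numerator / denominator if denominator else None


def request_summary(stats):
    """Derive averages from the search and get sections of a stats object."""
    search, get = stats["search"], stats["get"]
    return {
        "avg_query_ms": ratio(search["query_time_in_millis"], search["query_total"]),
        "avg_fetch_ms": ratio(search["fetch_time_in_millis"], search["fetch_total"]),
        "get_miss_ratio": ratio(get["missing_total"], get["total"]),
    }


# Hypothetical counters; the example response above contains only zeros.
sample = {
    "search": {"query_total": 200, "query_time_in_millis": 500,
               "fetch_total": 100, "fetch_time_in_millis": 50},
    "get": {"total": 10, "missing_total": 2},
}
print(request_summary(sample))
```

Because the counters are cumulative, a monitoring tool would typically compute these values from the difference between two consecutive readings rather than from a single response.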

Additional information

In addition to the previously described sections, Elasticsearch provides the following information:

  • merges: This section contains information about Lucene segment merges
  • refresh: This section contains information about the refresh operation
  • flush: This section contains information about flushes
  • warmer: This section contains information about warmers and for how long they were executed
  • query_cache: This section contains query cache statistics
  • fielddata: This section contains field data cache statistics
  • percolate: This section contains information about the percolator usage
  • completion: This section contains information about the completion suggester
  • segments: This section contains information about Lucene segments
  • translog: This section contains information about the transaction logs count and size
  • suggest: This section contains suggesters-related statistics
  • request_cache: This section contains shard request cache statistics
  • recovery: This section contains shard recovery information

Nodes info API

The nodes info API provides us with information about the nodes in the cluster. To get information from this API, we need to send the request to the _nodes REST endpoint. The simplest command to retrieve node-related information from Elasticsearch would be as follows:

curl -XGET 'localhost:9200/_nodes?pretty'

This API can be used to fetch information about particular nodes or a single node using the following:

  • Node name: If we would like to get information about the node named Pulse, we could run a command to the following REST endpoint: _nodes/Pulse
  • Node identifier: If we would like to get information about the node with an identifier equal to ny4hftjNQtuKMyEvpUdQWg, we could run a command to the following REST endpoint: _nodes/ny4hftjNQtuKMyEvpUdQWg
  • IP address: We can use IP addresses to get information about the nodes. For example, if we would like to get information about the node with an IP address equal to 192.168.1.103, we could run a command to the following REST endpoint: _nodes/192.168.1.103
  • Parameters from the Elasticsearch configuration: If we would like to get information about all the nodes with the node.rack property set to 2, we could run a command to the following REST endpoint: /_nodes/rack:2

This API also allows us to get information about several nodes at once using these:

  • Patterns, for example: _nodes/192.168.1.* or _nodes/P*
  • Nodes enumeration, for example: _nodes/Pulse,Slab
  • Both patterns and enumerations, for example: /_nodes/P*,S*

Returned information

By default, the nodes API will return extensive information about each node along with the name, identifier, and addresses. This extensive information includes the following:

  • settings: The Elasticsearch configuration
  • os: Information about the server such as processor, RAM, and swap space
  • process: Process identifier and refresh interval
  • jvm: Information about Java Virtual Machine such as memory limits, memory pools, and garbage collectors
  • thread_pool: The configuration of thread pools for various operations
  • transport: Listening addresses for the transport protocol
  • http: Information about listening addresses for an HTTP-based API
  • plugins: Information about the plugins installed by the user
  • modules: Information about the built-in plugins

An example usage of this API can be illustrated by the following command:

curl 'localhost:9200/_nodes/Pulse/os,jvm,plugins'

The preceding command will return the basic information about the node named Pulse and, in addition to this, it will include the operating system information, Java Virtual Machine information, and plugins-related information.
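
The node selectors and the information sections combine into a single path, which is easy to assemble in a script. The sketch below is a minimal Python illustration; the helper name nodes_info_path is hypothetical, and it assumes that _all can be used as the node selector when sections are requested for every node.

```python
def nodes_info_path(selectors=(), sections=()):
    """Build a nodes info path from node selectors and information sections."""
    path = "/_nodes"
    if selectors:
        path += "/" + ",".join(selectors)
    elif sections:
        # Assumption: "_all" selects every node when only sections are given.
        path += "/_all"
    if sections:
        path += "/" + ",".join(sections)
    return path


print(nodes_info_path(["Pulse"], ["os", "jvm", "plugins"]))
print(nodes_info_path(["P*", "S*"]))
```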

Nodes stats API

The nodes stats API is similar to the nodes info API described in the preceding section. The main difference is that the previous API provides information about the environment in which the node is running, while the one we are currently discussing tells us what happened with the cluster during its operation. To use the nodes stats API, you need to send a command to the /_nodes/stats REST endpoint. However, similar to the nodes info API, we can also retrieve information about specific nodes (for example: _nodes/Pulse/stats).

The simplest command to retrieve node statistics from Elasticsearch would be as follows:

curl -XGET 'localhost:9200/_nodes/stats?pretty'

By default, Elasticsearch returns all the available statistics but we can limit the ones we are interested in. The available options are as follows:

  • indices: Information about the indices including size, document count, indexing related statistics, search and get time, caches, segment merges, and so on
  • os: Operating system related information such as free disk space, memory, swap usage, and so on
  • process: Memory, CPU, and file descriptor usage related to the Elasticsearch process
  • jvm: Java virtual machine memory and garbage collector statistics
  • transport: Information about data sent and received by the transport module
  • http: Information about HTTP connections
  • fs: Information about available disk space and I/O operations statistics
  • thread_pool: Information about the state of the threads assigned to various operations
  • breaker: Information about circuit breakers
  • script: Scripting engine related information

An example usage of this API can be illustrated by the following command:

curl 'localhost:9200/_nodes/Pulse/stats/os,jvm,breaker'
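
A typical use of the jvm statistics is to watch heap usage per node. The Python sketch below computes it from a parsed nodes stats response; it assumes that each entry under nodes carries a name and a jvm.mem object with heap_used_in_bytes and heap_max_in_bytes fields. The node identifier and the numbers in the sample are hypothetical, as is the function name heap_usage.

```python
def heap_usage(nodes_stats):
    """Return {node_name: heap used, as a percentage of the maximum heap}."""
    usage = {}
    for node in nodes_stats["nodes"].values():
        mem = node["jvm"]["mem"]
        percent = 100.0 * mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
        usage[node["name"]] = round(percent, 1)
    return usage


# Hypothetical, trimmed response: 256 MB used out of a 1 GB heap.
sample = {
    "nodes": {
        "example-node-id": {
            "name": "Pulse",
            "jvm": {"mem": {"heap_used_in_bytes": 268435456,
                            "heap_max_in_bytes": 1073741824}},
        }
    }
}
print(heap_usage(sample))
```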

Cluster state API

Another API provided by Elasticsearch is the cluster state API. As its name suggests, it allows us to get information about the entire cluster (we can also limit the returned information to a local node by adding the local=true parameter to the request). The basic command used to get all the information returned by this API looks as follows:

curl -XGET 'localhost:9200/_cluster/state?pretty'

We can also limit the provided information to the given metrics in comma-separated form, specified after the _cluster/state part of the REST call. For example:

curl -XGET 'localhost:9200/_cluster/state/version,nodes?pretty'

We can also limit the information to the given metrics and indices. For example, if we would like to get the metadata for the library index, we could run the following command:

curl -XGET 'localhost:9200/_cluster/state/metadata/library?pretty'

The following metrics are allowed to be used:

  • version: This returns information about the cluster state version.
  • master_node: This returns information about the elected master node.
  • nodes: This returns nodes information.
  • routing_table: This returns routing related information.
  • metadata: This returns metadata related information. When retrieving the metadata metric, we can also include an additional parameter such as index_templates=true, which will result in including the defined index templates.
  • blocks: This returns the blocks part of the response.
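
As an example of combining two of these metrics, the following Python sketch resolves the name of the elected master from a response that contains both master_node and nodes. It assumes that master_node holds a node identifier and that nodes is keyed by the same identifiers; the sample reuses the node identifier and name that appear in this chapter's recovery example, and the function name master_node_name is hypothetical.

```python
def master_node_name(state):
    """Look up the name of the elected master node in a cluster state response."""
    master_id = state["master_node"]
    return state["nodes"][master_id]["name"]


# Trimmed sample of what _cluster/state/master_node,nodes might return.
sample = {
    "cluster_name": "elasticsearch",
    "master_node": "DboTibRlT1KJSQYnDPxwZQ",
    "nodes": {
        "DboTibRlT1KJSQYnDPxwZQ": {
            "name": "Plague",
            "transport_address": "127.0.0.1:9300",
        }
    },
}
print(master_node_name(sample))
```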

Cluster stats API

The cluster stats API allows us to get statistics about the indices and nodes from a cluster-wide perspective. To use this API, we need to send a GET request to the /_cluster/stats REST endpoint, for example:

curl -XGET 'localhost:9200/_cluster/stats?pretty'

The response size depends on the number of shards, indices, and nodes in the cluster. It will include basic indices information such as shards, their state, recovery information, caches information, and node related information.

Pending tasks API

The pending tasks API is one of the APIs that help us see what Elasticsearch is doing; it allows us to check which tasks are waiting to be executed. To retrieve this information, we need to send a request to the /_cluster/pending_tasks REST endpoint. In the response, we will see an array of tasks with information about them, such as task priority and time in queue.
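
A quick way to use this information is to look for the task that has been waiting the longest. The Python sketch below assumes that the response contains a tasks array whose entries carry priority, source, and time_in_queue_millis fields; the sample tasks are hypothetical, and the function name longest_waiting is made up for this example.

```python
def longest_waiting(pending):
    """Return the pending task with the largest time in queue, or None if there are none."""
    tasks = pending.get("tasks", [])
    if not tasks:
        return None
    return max(tasks, key=lambda task: task["time_in_queue_millis"])


# Hypothetical response with two queued tasks.
sample = {
    "tasks": [
        {"insert_order": 101, "priority": "URGENT",
         "source": "create-index [library]", "time_in_queue_millis": 86},
        {"insert_order": 46, "priority": "HIGH",
         "source": "shard-started", "time_in_queue_millis": 842},
    ]
}
task = longest_waiting(sample)
print(task["source"], task["time_in_queue_millis"])
```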

Indices recovery API

The recovery API gives us insight into the recovery status of the shards that make up the indices in our cluster (learn more about recovery in The gateway and recovery modules section of Chapter 9, Elasticsearch Cluster in Detail).

The simplest command that would return the information about the recovery of all the shards in the cluster would look as follows:

curl -XGET 'http://localhost:9200/_recovery?pretty'

We can also get information about recovery for particular indices, for example, the library index:

curl -XGET 'http://localhost:9200/library/_recovery?pretty'

The response returned by Elasticsearch is divided by indices and shards. A response for a single shard could look as follows:

{
 "id" : 2,
 "type" : "STORE",
 "stage" : "DONE",
 "primary" : true,
 "start_time_in_millis" : 1446132761730,
 "stop_time_in_millis" : 1446132761734,
 "total_time_in_millis" : 4,
 "source" : {
  "id" : "DboTibRlT1KJSQYnDPxwZQ",
  "host" : "127.0.0.1",
  "transport_address" : "127.0.0.1:9300",
  "ip" : "127.0.0.1",
  "name" : "Plague"
 },
 "target" : {
  "id" : "DboTibRlT1KJSQYnDPxwZQ",
  "host" : "127.0.0.1",
  "transport_address" : "127.0.0.1:9300",
  "ip" : "127.0.0.1",
  "name" : "Plague"
 },
 "index" : {
  "size" : {
   "total_in_bytes" : 156,
   "reused_in_bytes" : 156,
   "recovered_in_bytes" : 0,
   "percent" : "100.0%"
  },
  "files" : {
   "total" : 1,
   "reused" : 1,
   "recovered" : 0,
   "percent" : "100.0%"
  },
  "total_time_in_millis" : 0,
  "source_throttle_time_in_millis" : 0,
  "target_throttle_time_in_millis" : 0
 },
 "translog" : {
  "recovered" : 0,
  "total" : -1,
  "percent" : "-1.0%",
  "total_on_start" : -1,
  "total_time_in_millis" : 3
 },
 "verify_index" : {
  "check_index_time_in_millis" : 0,
  "total_time_in_millis" : 0
 }
}

In the preceding response, we can see information about the shard identifier, the stage of recovery, information whether the shard is a primary or a replica, the timestamps of the start and end of recovery, and the total time the recovery process took. We can see the source node, target node, and information about the shard's physical statistics, such as size, number of files, transaction log-related statistics, and index verification time.

It is worth knowing the recovery types and stages. When it comes to the types of recovery (the type attribute in the response), we can expect the following values: STORE, SNAPSHOT, REPLICA, and RELOCATING. When it comes to the stage of recovery (the stage attribute in the response), we can expect values such as INIT (recovery has not started), INDEX (Elasticsearch copies metadata information and data from source to destination), START (Elasticsearch is opening the shard for use), FINALIZE (final stage, which cleans up garbage), and DONE (recovery has ended).

We can limit the response returned by the indices recovery API to only the shards that are currently in active recovery by including the active_only=true parameter in the request. Finally, we can request more detailed information by adding the detailed=true parameter in the API call.
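
The stage attribute is the natural thing to filter on when processing a recovery response in a script. The following Python sketch lists every shard whose stage is not DONE; it assumes the response is keyed by index name, with a shards array under each index. The first shard in the sample reuses values from the response shown above, the second one is hypothetical, and the function name unfinished_recoveries is an assumption.

```python
def unfinished_recoveries(recovery):
    """Return (index, shard id, stage) for every shard that has not reached DONE."""
    result = []
    for index_name, index_info in recovery.items():
        for shard in index_info["shards"]:
            if shard["stage"] != "DONE":
                result.append((index_name, shard["id"], shard["stage"]))
    return result


sample = {
    "library": {
        "shards": [
            {"id": 2, "type": "STORE", "stage": "DONE", "primary": True},
            # Hypothetical replica that is still copying data.
            {"id": 0, "type": "REPLICA", "stage": "INDEX", "primary": False},
        ]
    }
}
print(unfinished_recoveries(sample))
```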

Indices shard stores API

The indices shard stores API gives us information about the store for the shards of our indices. We use this API by sending a request to the /_shard_stores REST endpoint, optionally providing comma-separated index names.

For example, to get information about all the indices, we would run the following command:

curl -XGET 'http://localhost:9200/_shard_stores?pretty'

We can also get information about particular indices, such as the library and map ones:

curl -XGET 'http://localhost:9200/library,map/_shard_stores?pretty'

The response returned by Elasticsearch contains information about the store for each shard. For example, this is what Elasticsearch returned for one of the shards of the library index:

"0" : {
 "stores" : [ {
  "DboTibRlT1KJSQYnDPxwZQ" : {
   "name" : "Plague",
   "transport_address" : "127.0.0.1:9300",
   "attributes" : { }
  },
  "version" : 6,
  "allocation" : "primary"
 } ]
}

We can see information about the node in the stores array. Each entry contains node related information (the node where the shard is physically located), the version of the store copy, and the allocation, which can take the values of primary (for primary shards), replica (for replicas), and unused (for unassigned shards).
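
Because the node entry is keyed by the node identifier rather than by a fixed field name, reading it in code takes a small extra step. The Python sketch below shows one way to do it, using the shard entry from the response above; treating any nested object with a name field as the node description is an assumption made for this example, as is the function name store_copies.

```python
def store_copies(shard):
    """Return (node name, allocation) for every store copy of a shard."""
    copies = []
    for store in shard["stores"]:
        # The node description sits under a key equal to the node identifier,
        # so look for the nested object that has a "name" field.
        names = [value["name"] for value in store.values()
                 if isinstance(value, dict) and "name" in value]
        copies.append((names[0] if names else None, store["allocation"]))
    return copies


# The entry for shard "0" of the library index, as shown above.
sample = {
    "stores": [
        {
            "DboTibRlT1KJSQYnDPxwZQ": {
                "name": "Plague",
                "transport_address": "127.0.0.1:9300",
                "attributes": {},
            },
            "version": 6,
            "allocation": "primary",
        }
    ]
}
print(store_copies(sample))
```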

Indices segments API

The last API we want to mention is the Lucene segments API, which is available through the /_segments endpoint. We can run it for the entire cluster, for example like this:

curl -XGET 'localhost:9200/_segments?pretty'

We can also run the command for individual indices. For example, if we would like to get segments related information for the map and library indices, we would use the following command:

curl -XGET 'localhost:9200/library,map/_segments?pretty'

This API provides information about shards, their placements, and information about segments connected with the physical index managed by the Apache Lucene library.
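
As a rough sketch of how such a response could be processed, the Python code below counts the segments of each index. The response layout used here (an indices object, a shards object keyed by shard number under each index, a list of shard copies per shard, and a segments object per copy) is an assumption and should be checked against the output of your own cluster; the sample data and the function name segment_counts are hypothetical.

```python
def segment_counts(segments_response):
    """Return {index_name: number of segments across all listed shard copies}."""
    counts = {}
    for index_name, index_info in segments_response["indices"].items():
        total = 0
        for copies in index_info["shards"].values():
            for copy in copies:
                total += len(copy["segments"])
        counts[index_name] = total
    return counts


# Hypothetical, trimmed response for a single-shard library index.
sample = {
    "indices": {
        "library": {
            "shards": {
                "0": [
                    {"segments": {"_0": {"num_docs": 3, "deleted_docs": 0},
                                  "_1": {"num_docs": 1, "deleted_docs": 0}}}
                ]
            }
        }
    }
}
print(segment_counts(sample))
```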
