Monitoring is essential for managing your cluster and keeping it in a healthy state. It allows administrators and developers to detect possible problems and prevent them before they occur, or to act as soon as they start showing. In the worst case, monitoring allows us to do a post mortem analysis of what happened to the application, in this case our Elasticsearch cluster and each of its nodes.
Elasticsearch provides very detailed information that allows us to check and monitor our nodes or the cluster as a whole. This includes statistics and information about the servers, nodes, indices, and shards. Of course, we are also able to get information about the entire cluster state. Before we get into the details about the mentioned API, please remember that the API is complex and we are only describing the basics. We will try to show you where to start so you'll be able to know what to look for when you need very detailed information.
One of the most basic APIs is the cluster health API, which allows us to get information about the entire cluster state with a single HTTP command. For example, let's run the following command:
curl -XGET 'localhost:9200/_cluster/health?pretty'
A sample response returned by Elasticsearch for the preceding command looks as follows:
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 11,
  "active_shards" : 11,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 11,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0
}
The most important information is about the status of the cluster. In our example, we see that the cluster is in yellow status. This means that all the primary shards have been allocated properly, but the replicas were not (because of a single node in the cluster, but that doesn't matter for now).
Apart from the cluster name and status, we can see whether the request timed out, how many nodes there are, how many of them are data nodes, and how many shards are active primaries, initializing, unassigned, and so on.
Let's stop here and talk about when the cluster, as a whole, is fully operational. The cluster is fully operational when Elasticsearch is able to allocate all the primary shards and replicas according to the configuration. This is when the cluster is in the green state. The yellow state means that we are ready to handle requests because the primary shards are allocated, but some (or all) replicas are not. The last state, the red one, means that at least one primary shard was not allocated and, because of this, the cluster is not ready yet. In that state, queries may return errors or incomplete results.
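The relation between the three states and the shard counters can be sketched in a few lines of Python. This is only an illustration built on the sample response shown earlier; the percentage formula assumes that every shard is either active, initializing, or unassigned, which matches the sample but is our assumption rather than something stated by Elasticsearch.

```python
# Sample response from the text, trimmed to the fields used here.
health = {
    "status": "yellow",
    "active_shards": 11,
    "initializing_shards": 0,
    "unassigned_shards": 11,
}

# Meaning of each status as described in the text.
STATUS_MEANING = {
    "green": "all primary shards and replicas are allocated",
    "yellow": "all primary shards are allocated, but some replicas are not",
    "red": "at least one primary shard is not allocated",
}


def active_shards_percent(h):
    """Share of active shards among all shards (assumed formula)."""
    total = h["active_shards"] + h["initializing_shards"] + h["unassigned_shards"]
    return 100.0 * h["active_shards"] / total if total else 100.0


print(health["status"], "-", STATUS_MEANING[health["status"]])
print(active_shards_percent(health))
```

For the sample, 11 active shards out of 22 give 50 percent, which agrees with the active_shards_percent_as_number value in the response.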
The preceding command can also be executed to check the health state of certain indices. For example, if we would like to check the health of the library and map indices, we would run the following command:
curl -XGET 'localhost:9200/_cluster/health/library,map/?pretty'
Elasticsearch allows us to specify a special level parameter, which can take the value of cluster (default), indices, or shards. This allows us to control the details of information returned by the health API. We've already seen the default behavior. When setting the level parameter to indices, apart from the cluster information, we will also get per index health. Setting the mentioned parameter to shards tells Elasticsearch to return per shard information in addition to what we've seen in the example.
In addition to the level parameter, we have a few additional parameters that can control the behavior of the health API.
The first of the mentioned parameters is timeout. It controls the maximum time the command will wait when one of the following parameters is used: wait_for_status, wait_for_nodes, wait_for_relocating_shards, and wait_for_active_shards. By default, it is set to 30s, which means that the health command will wait for at most 30 seconds before returning the response.
The wait_for_status parameter tells Elasticsearch which health status the cluster should reach before the command returns. It can take the values of green, yellow, and red. For example, when set to green, the health API call will not return until the green status or the timeout is reached.
The wait_for_nodes parameter allows us to set the number of nodes that must be available before the health command returns its response (or until a defined timeout is reached). It can be set to an integer number like 3 or to a simple expression like >=3 (meaning greater than or equal to three nodes) or <=3 (meaning less than or equal to three nodes).
The wait_for_active_shards parameter means that Elasticsearch will wait for a specified number of active shards to be present before returning the response.
The last parameter is wait_for_relocating_shards, which is not specified by default. It tells Elasticsearch how many relocating shards it should wait for (or until the timeout is reached). Setting this parameter to 0 means that Elasticsearch should wait until all relocations have finished.
An example usage of the health command with some of the mentioned parameters is as follows:
curl -XGET 'localhost:9200/_cluster/health?wait_for_status=green&wait_for_nodes=>=3&timeout=100s'
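When the health call is issued from a program rather than from the command line, the same request can be assembled with the Python standard library. The following is a minimal sketch; the health_url helper and its arguments are our own names, not part of any Elasticsearch client.

```python
from urllib.parse import urlencode


def health_url(host="localhost:9200", indices=(), **params):
    """Build a cluster health URL; `params` become query-string parameters."""
    path = "/_cluster/health"
    if indices:
        path += "/" + ",".join(indices)
    # urlencode percent-encodes characters such as '>' and '='.
    query = urlencode(params)
    return "http://" + host + path + ("?" + query if query else "")


url = health_url(wait_for_status="green", wait_for_nodes=">=3", timeout="100s")
print(url)
```

The resulting URL carries the >=3 expression in its percent-encoded form and can be passed to urllib.request.urlopen when a node is listening on the given address.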
An Elasticsearch index is the place where our data lives and it is a crucial part of most deployments. With the indices stats API, available at the _stats endpoint, we can get a lot of information about the indices living inside our cluster. As with most of the APIs in Elasticsearch, we can send a command to get the information about all the indices (using the pure _stats endpoint), about one particular index (for example, library/_stats), or about several indices at the same time (for example, library,map/_stats). For example, to check the statistics for the map and library indices we've used in the book, we could run the following command:
curl -XGET 'localhost:9200/library,map/_stats?pretty'
The response to the preceding command has more than 700 lines, so we only describe its structure and omit the response itself. Apart from the information about the response status and the response time, we can see three objects named primaries, total (in the _all object), and indices. The indices object contains information about the library and map indices. The primaries object contains information about the primary shards only, and the total object contains information about all the shards including replicas. All these objects can contain objects describing a particular statistic, such as the following: docs, store, indexing, get, search, merges, refresh, flush, warmer, query_cache, fielddata, percolate, completion, segments, translog, suggest, request_cache, and recovery.
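To make the structure described above more concrete, here is a small Python sketch that walks a trimmed response. The dictionary is hypothetical: its numbers are made up, and we assume that each entry under indices repeats the primaries and total objects found under _all.

```python
# Hypothetical, heavily trimmed response; the numbers are made up.
stats = {
    "_all": {
        "primaries": {"docs": {"count": 7, "deleted": 0}},
        "total": {"docs": {"count": 14, "deleted": 0}},
    },
    "indices": {
        "library": {
            "primaries": {"docs": {"count": 4, "deleted": 0}},
            "total": {"docs": {"count": 8, "deleted": 0}},
        },
        "map": {
            "primaries": {"docs": {"count": 3, "deleted": 0}},
            "total": {"docs": {"count": 6, "deleted": 0}},
        },
    },
}


def doc_count(stats, index=None, scope="primaries"):
    """Return docs.count for one index, or for all indices when index is None."""
    root = stats["_all"] if index is None else stats["indices"][index]
    return root[scope]["docs"]["count"]


print(doc_count(stats))                      # all indices, primary shards only
print(doc_count(stats, "library", "total"))  # one index, replicas included
```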
We can limit the amount of information that we get from the indices stats API by providing the type of data we are interested in using the names of the statistics mentioned previously. For example, if we want to get information about indexing and searching, we can run the following command:
curl -XGET 'localhost:9200/library,map/_stats/indexing,search?pretty'
Let's discuss the information stored in those objects.
The docs section of the response shows information about indexed documents. For example, it could look as follows:
"docs" : { "count" : 4, "deleted" : 0 }
The main information is the count, indicating the number of documents in the described index. When we delete documents from the index, Elasticsearch doesn't remove these documents immediately and only marks them as deleted. Documents are physically deleted during the segment merge process. The number of documents marked as deleted is presented by the deleted attribute and should be 0 right after the merge.
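One simple figure that can be derived from this section is the share of documents that are marked as deleted but still occupy space in the segments. The sketch below assumes that count holds only live documents, so the total is the sum of both attributes; the deleted_ratio name is ours.

```python
def deleted_ratio(docs):
    """Fraction of documents marked as deleted but not yet merged away."""
    total = docs["count"] + docs["deleted"]
    return docs["deleted"] / total if total else 0.0


# The docs section from the text.
print(deleted_ratio({"count": 4, "deleted": 0}))
```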
The next statistic, the store one, provides information regarding storage. For example, such a section could look as follows:
"store" : { "size_in_bytes" : 6003, "throttle_time_in_millis" : 0 }
The main information is about the index (or indices) size. We can also look at throttling statistics. This information is useful when the system has problems with I/O performance and limits have been configured on internal operations such as segment merging.
The indexing, get, and search sections of the response provide information about data manipulation: indexing together with delete operations, real-time get, and searching. Let's look at the following example returned by Elasticsearch:
"indexing" : { "index_total" : 0, "index_time_in_millis" : 0, "index_current" : 0, "delete_total" : 0, "delete_time_in_millis" : 0, "delete_current" : 0, "noop_update_total" : 0, "is_throttled" : false, "throttle_time_in_millis" : 0 }, "get" : { "total" : 0, "time_in_millis" : 0, "exists_total" : 0, "exists_time_in_millis" : 0, "missing_total" : 0, "missing_time_in_millis" : 0, "current" : 0 }, "search" : { "open_contexts" : 0, "query_total" : 0, "query_time_in_millis" : 0, "query_current" : 0, "fetch_total" : 0, "fetch_time_in_millis" : 0, "fetch_current" : 0, "scroll_total" : 0, "scroll_time_in_millis" : 0, "scroll_current" : 0 }
As you can see, all of these statistics have similar structures. We can read the total time spent in various request types (in milliseconds) and the number of requests, which together allow us to calculate the average time of a single request. In the case of get requests, a valuable piece of information is how many fetches were unsuccessful (missing documents); the indexing section has information about throttling, and the search section includes information regarding scrolling.
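The average mentioned above is a plain division of the total time by the number of requests. A minimal Python sketch follows; the average_millis helper is our own, and the second set of numbers is invented because the sample response contains only zeros.

```python
def average_millis(section, total_key, time_key):
    """Average time per request in milliseconds, or None when there were no requests."""
    count = section[total_key]
    return section[time_key] / count if count else None


# Values from the sample response: no queries were run, so there is no average.
idle = {"query_total": 0, "query_time_in_millis": 0}
print(average_millis(idle, "query_total", "query_time_in_millis"))

# Hypothetical values for a node that has served some queries.
busy = {"query_total": 200, "query_time_in_millis": 500}
print(average_millis(busy, "query_total", "query_time_in_millis"))
```

Because the counters are cumulative, an average computed this way covers the whole period for which the statistics were collected; comparing two readings taken some time apart gives the average for that interval only.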
In addition to the previously described sections, Elasticsearch provides the following information:
merges: This section contains information about Lucene segment merges
refresh: This section contains information about the refresh operation
flush: This section contains information about flushes
warmer: This section contains information about warmers and for how long they were executed
query_cache: This section contains query cache statistics
fielddata: This section contains field data cache statistics
percolate: This section contains information about the percolator usage
completion: This section contains information about the completion suggester
segments: This section contains information about Lucene segments
translog: This section contains information about the transaction logs count and size
suggest: This section contains suggester-related statistics
request_cache: This section contains shard request cache statistics
recovery: This section contains shard recovery information
The nodes info API provides us with information about the nodes in the cluster. To get information from this API, we need to send the request to the _nodes REST endpoint. The simplest command to retrieve node-related information from Elasticsearch would be as follows:
curl -XGET 'localhost:9200/_nodes?pretty'
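This API can be used to fetch information about particular nodes or a single node using the following:
Node name: To get information about the node named Pulse, we could run a command to the following REST endpoint: _nodes/Pulse
Node identifier: To get information about the node with the identifier ny4hftjNQtuKMyEvpUdQWg, we could run a command to the following REST endpoint: _nodes/ny4hftjNQtuKMyEvpUdQWg
IP address: To get information about the node with the IP address 192.168.1.103, we could run a command to the following REST endpoint: _nodes/192.168.1.103
Configuration parameter: To get information about all the nodes with the node.rack property set to 2, we could run a command to the following REST endpoint: /_nodes/rack:2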
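This API also allows us to get information about several nodes at once using these:
Patterns, for example: _nodes/192.168.1.* or _nodes/P*
An enumeration of nodes, for example: _nodes/Pulse,Slab
Both patterns and enumerations, for example: /_nodes/P*,S*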
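To get a feeling for which nodes a given specification would select, we can approximate the name matching locally. The following Python sketch uses the standard fnmatch module, whose wildcard rules are an assumption here and may differ in detail from what Elasticsearch does; it also handles node names only, not identifiers, addresses, or attributes. The Mantis node name is made up for the example.

```python
from fnmatch import fnmatchcase


def select_nodes(names, spec):
    """Return the node names matched by a comma-separated list of names and patterns."""
    patterns = spec.split(",")
    return [name for name in names if any(fnmatchcase(name, p) for p in patterns)]


names = ["Pulse", "Slab", "Plague", "Mantis"]
print(select_nodes(names, "P*,S*"))
print(select_nodes(names, "Pulse,Slab"))
```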
By default, the nodes API will return extensive information about each node along with the name, identifier, and addresses. This extensive information includes the following:
settings: The Elasticsearch configuration
os: Information about the server such as processor, RAM, and swap space
process: Process identifier and refresh interval
jvm: Information about the Java Virtual Machine such as memory limits, memory pools, and garbage collectors
thread_pool: The configuration of thread pools for various operations
transport: Listening addresses for the transport protocol
http: Information about listening addresses for an HTTP-based API
plugins: Information about the plugins installed by the user
modules: Information about the built-in plugins
An example usage of this API can be illustrated by the following command:
curl 'localhost:9200/_nodes/Pulse/os,jvm,plugins'
The preceding command will return the basic information about the node named Pulse and, in addition to this, it will include the operating system information, Java Virtual Machine information, and plugin-related information.
The nodes stats API is similar to the nodes info API described in the preceding section. The main difference is that the previous API provided information about the environment in which the node is running, while the one we are currently discussing tells us what has happened in the cluster while it was working. To use the nodes stats API, you need to send a command to the /_nodes/stats REST endpoint. As with the nodes info API, we can also retrieve information about specific nodes (for example, _nodes/Pulse/stats).
The simplest command to retrieve node statistics from Elasticsearch would be as follows:
curl -XGET 'localhost:9200/_nodes/stats?pretty'
By default, Elasticsearch returns all the available statistics but we can limit the ones we are interested in. The available options are as follows:
indices: Information about the indices including size, document count, indexing-related statistics, search and get time, caches, segment merges, and so on
os: Operating system related information such as free disk space, memory, swap usage, and so on
process: Memory, CPU, and file handler usage related to the Elasticsearch process
jvm: Java Virtual Machine memory and garbage collector statistics
transport: Information about data sent and received by the transport module
http: Information about HTTP connections
fs: Information about available disk space and I/O operation statistics
thread_pool: Information about the state of the threads assigned to various operations
breaker: Information about circuit breakers
script: Scripting engine related information
An example usage of this API can be illustrated by the following command:
curl 'localhost:9200/_nodes/Pulse/stats/os,jvm,breaker'
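Once such a response is parsed, the statistics can be turned into simple indicators. The Python sketch below computes heap usage per node. Since the response is not shown in this section, the layout used here (a nodes object keyed by node identifier, with jvm.mem.heap_used_in_bytes and jvm.mem.heap_max_in_bytes inside) is an assumption, and the numbers are made up.

```python
# Hypothetical, trimmed response; field names are assumed and values are made up.
node_stats = {
    "nodes": {
        "ny4hftjNQtuKMyEvpUdQWg": {
            "name": "Pulse",
            "jvm": {"mem": {"heap_used_in_bytes": 300, "heap_max_in_bytes": 1000}},
        }
    }
}


def heap_usage(node_stats):
    """Map each node name to its heap usage in percent."""
    usage = {}
    for node in node_stats["nodes"].values():
        mem = node["jvm"]["mem"]
        usage[node["name"]] = 100.0 * mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
    return usage


print(heap_usage(node_stats))
```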
Another API provided by Elasticsearch is the cluster state API. As its name suggests, it allows us to get information about the entire cluster (we can also limit the returned information to a local node by adding the local=true parameter to the request). The basic command used to get all the information returned by this API looks as follows:
curl -XGET 'localhost:9200/_cluster/state?pretty'
We can also limit the provided information to the given metrics in comma-separated form, specified after the _cluster/state part of the REST call. For example:
curl -XGET 'localhost:9200/_cluster/state/version,nodes?pretty'
We can also limit the information to the given metrics and indices. For example, if we would like to get the metadata for the library index, we could run the following command:
curl -XGET 'localhost:9200/_cluster/state/metadata/library?pretty'
The following metrics are allowed to be used:
version: This returns information about the cluster state version.
master_node: This returns information about the elected master node.
nodes: This returns nodes information.
routing_table: This returns routing-related information.
metadata: This returns metadata-related information. When retrieving the metadata metric, we can also include an additional parameter such as index_templates=true, which will result in including the defined index templates.
blocks: This returns the blocks part of the response.
The cluster stats API allows us to get statistics about the indices and nodes from the cluster-wide perspective. To use this API, we need to send a GET request to the /_cluster/stats REST endpoint, for example:
curl -XGET 'localhost:9200/_cluster/stats?pretty'
The response size depends on the number of shards, indices, and nodes in the cluster. It will include basic indices information such as shards, their state, recovery information, caches information, and node related information.
The pending tasks API is one of the APIs that help us see what Elasticsearch is doing; it allows us to check which tasks are waiting to be executed. To retrieve this information, we need to send a request to the /_cluster/pending_tasks REST endpoint. In the response, we will see an array of tasks with information about them, such as task priority and time in queue.
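A typical use of this response is to find the task that has been waiting the longest. The Python sketch below does that for a hypothetical response; the field names (tasks, insert_order, priority, source, time_in_queue_millis) and all values are assumptions, because the response itself is not shown in this section.

```python
# Hypothetical response; field names are assumed and values are made up.
pending = {
    "tasks": [
        {"insert_order": 101, "priority": "URGENT",
         "source": "create-index [library]", "time_in_queue_millis": 86},
        {"insert_order": 102, "priority": "HIGH",
         "source": "shard-started", "time_in_queue_millis": 842},
    ]
}


def longest_waiting(pending):
    """Return the task with the highest time in queue, or None for an empty queue."""
    tasks = pending["tasks"]
    return max(tasks, key=lambda task: task["time_in_queue_millis"]) if tasks else None


task = longest_waiting(pending)
print(task["priority"], task["time_in_queue_millis"])
```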
The recovery API gives us insight about the recovery status of the shards that are building indices in our cluster (learn more about recovery in The gateway and recovery modules section of Chapter 9, Elasticsearch Cluster in Detail).
The simplest command that would return the information about the recovery of all the shards in the cluster would look as follows:
curl -XGET 'http://localhost:9200/_recovery?pretty'
We can also get information about recovery for particular indices, such as the library index:
curl -XGET 'http://localhost:9200/library/_recovery?pretty'
The response returned by Elasticsearch is divided by indices and shards. A response for a single shard could look as follows:
{
  "id" : 2,
  "type" : "STORE",
  "stage" : "DONE",
  "primary" : true,
  "start_time_in_millis" : 1446132761730,
  "stop_time_in_millis" : 1446132761734,
  "total_time_in_millis" : 4,
  "source" : {
    "id" : "DboTibRlT1KJSQYnDPxwZQ",
    "host" : "127.0.0.1",
    "transport_address" : "127.0.0.1:9300",
    "ip" : "127.0.0.1",
    "name" : "Plague"
  },
  "target" : {
    "id" : "DboTibRlT1KJSQYnDPxwZQ",
    "host" : "127.0.0.1",
    "transport_address" : "127.0.0.1:9300",
    "ip" : "127.0.0.1",
    "name" : "Plague"
  },
  "index" : {
    "size" : {
      "total_in_bytes" : 156,
      "reused_in_bytes" : 156,
      "recovered_in_bytes" : 0,
      "percent" : "100.0%"
    },
    "files" : {
      "total" : 1,
      "reused" : 1,
      "recovered" : 0,
      "percent" : "100.0%"
    },
    "total_time_in_millis" : 0,
    "source_throttle_time_in_millis" : 0,
    "target_throttle_time_in_millis" : 0
  },
  "translog" : {
    "recovered" : 0,
    "total" : -1,
    "percent" : "-1.0%",
    "total_on_start" : -1,
    "total_time_in_millis" : 3
  },
  "verify_index" : {
    "check_index_time_in_millis" : 0,
    "total_time_in_millis" : 0
  }
}
In the preceding response, we can see information about the shard identifier, the stage of recovery, information whether the shard is a primary or a replica, the timestamps of the start and end of recovery, and the total time the recovery process took. We can see the source node, target node, and information about the shard's physical statistics, such as size, number of files, transaction log-related statistics, and index verification time.
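The timing and size fields of this response are easy to cross-check. The following Python sketch uses the values from the response above to compute how long the recovery took and how much of the shard was reused from files already present on the node; the helper names are ours.

```python
# Values copied from the sample response in the text.
shard = {
    "start_time_in_millis": 1446132761730,
    "stop_time_in_millis": 1446132761734,
    "total_time_in_millis": 4,
    "index": {
        "size": {"total_in_bytes": 156, "reused_in_bytes": 156, "recovered_in_bytes": 0}
    },
}


def recovery_duration_millis(s):
    """Time between the start and the end of the recovery."""
    return s["stop_time_in_millis"] - s["start_time_in_millis"]


def reused_percent(s):
    """Share of the shard's bytes that did not have to be copied."""
    size = s["index"]["size"]
    if not size["total_in_bytes"]:
        return 0.0
    return 100.0 * size["reused_in_bytes"] / size["total_in_bytes"]


print(recovery_duration_millis(shard))
print(reused_percent(shard))
```

The computed duration equals the total_time_in_millis value reported by Elasticsearch, and all 156 bytes were reused, which is what we would expect from a STORE recovery on a single node.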
It is worth knowing the types and stages of recovery. When it comes to the types of recovery (the type attribute in the response), we can expect the following values: STORE, SNAPSHOT, REPLICA, and RELOCATING. When it comes to the stage of recovery (the stage attribute in the response), we can expect values such as INIT (recovery has not started), INDEX (Elasticsearch copies metadata information and data from source to destination), START (Elasticsearch is opening the shard for use), FINALIZE (final stage, which cleans up garbage), and DONE (recovery has ended).
We can limit the response returned by the indices recovery API to only the shards that are currently in active recovery by including the active_only=true parameter in the request. Finally, we can request more detailed information by adding the detailed=true parameter in the API call.
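The stage values can also be used on the client side, for example to list the shards that are still recovering in a response we already have. The Python sketch below is only an illustration: the nesting of the response (index name, then a shards list) is assumed, the sample entries are made up, and treating every stage other than DONE as active is our simplification of what active_only=true does on the server.

```python
# Stage descriptions as given in the text.
STAGES = {
    "INIT": "recovery has not started",
    "INDEX": "metadata and data are copied from source to destination",
    "START": "the shard is being opened for use",
    "FINALIZE": "final cleanup",
    "DONE": "recovery has ended",
}

# Hypothetical, trimmed response; the nesting is assumed and the entries are made up.
recovery = {
    "library": {
        "shards": [
            {"id": 0, "type": "STORE", "stage": "DONE"},
            {"id": 1, "type": "REPLICA", "stage": "INDEX"},
        ]
    }
}


def active_recoveries(response):
    """Return (index, shard id, stage) for every shard that has not reached DONE."""
    return [
        (index, shard["id"], shard["stage"])
        for index, data in response.items()
        for shard in data["shards"]
        if shard["stage"] != "DONE"
    ]


for index, shard_id, stage in active_recoveries(recovery):
    print(index, shard_id, stage, "-", STAGES[stage])
```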
The indices shard stores API gives us information about the store for the shards of our indices. We use this API by running a simple command to the /_shard_stores REST endpoint, optionally providing comma-separated index names.
For example, to get information about all the indices, we would run the following command:
curl -XGET 'http://localhost:9200/_shard_stores?pretty'
We can also get information about particular indices, such as the library and map ones:
curl -XGET 'http://localhost:9200/library,map/_shard_stores?pretty'
The response returned by Elasticsearch contains information about the store for each shard. For example, this is what Elasticsearch returned for one of the shards of the library index:
"0" : { "stores" : [ { "DboTibRlT1KJSQYnDPxwZQ" : { "name" : "Plague", "transport_address" : "127.0.0.1:9300", "attributes" : { } }, "version" : 6, "allocation" : "primary" } ] }
We can see information about the node in the stores array. Each entry contains node-related information (the node where the shard is physically located), the version of the store copy, and the allocation, which can take the values of primary (for primary shards), replica (for replicas), and unused (for unassigned shards).
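Note that in the sample entry the node identifier is used as a key next to version and allocation, so reading it requires a small trick. The Python sketch below extracts the interesting values from the entry shown earlier; it assumes that the node identifier is the only key other than version and allocation, which holds for this sample, and the describe_store name is ours.

```python
# One entry of the stores array, copied from the sample response in the text.
store = {
    "DboTibRlT1KJSQYnDPxwZQ": {
        "name": "Plague",
        "transport_address": "127.0.0.1:9300",
        "attributes": {},
    },
    "version": 6,
    "allocation": "primary",
}


def describe_store(store):
    """Return (node id, node name, version, allocation) for one store entry."""
    # Assumption: the node identifier is the only key besides these two.
    node_id = next(key for key in store if key not in ("version", "allocation"))
    return node_id, store[node_id]["name"], store["version"], store["allocation"]


print(describe_store(store))
```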
The last API we want to mention is the Lucene segments API, which is available at the /_segments endpoint. We can run it for the entire cluster, for example like this:
curl -XGET 'localhost:9200/_segments?pretty'
We can also run the command for individual indices. For example, if we would like to get segment-related information for the map and library indices, we would use the following command:
curl -XGET 'localhost:9200/library,map/_segments?pretty'
This API provides information about shards, their placements, and information about segments connected with the physical index managed by the Apache Lucene library.