Very hot threads

When your cluster works slower than usual and uses large amounts of CPU power, you need to find out why before you can make it work properly again. This is where the Hot Threads API can give you the information necessary to find the root cause of the problem. A hot thread in this case is a Java thread that uses a lot of CPU and executes for long periods of time. Such a thread doesn't mean that there is something wrong with Elasticsearch itself; it points to a possible hotspot and shows which part of your deployment you need to look at more closely, such as query execution or Lucene segment merging. The Hot Threads API returns information about which parts of the Elasticsearch code are hot spots from the CPU side or where Elasticsearch is stuck for some reason.

When using the Hot Threads API, you can examine all nodes, a selected few of them, or a particular node using the /_nodes/hot_threads or /_nodes/{node or nodes}/hot_threads endpoints. For example, to look at hot threads on all the nodes, we would run the following command:

curl 'localhost:9200/_nodes/hot_threads'
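
To narrow the call to particular nodes, the node names go into the path. The following is a minimal sketch that only builds and prints the URLs, so it can be tried without a running cluster. The helper name hot_threads_url and the node names node1 and node2 are made up for illustration, and joining several names with commas is assumed to be the accepted way of listing nodes.

```shell
# Hypothetical helper: print the Hot Threads URL for all nodes (no
# arguments) or for the listed nodes (joined with commas).
ES_HOST="localhost:9200"   # assumed address, same as in the example above

hot_threads_url() {
  if [ "$#" -eq 0 ]; then
    printf '%s/_nodes/hot_threads\n' "$ES_HOST"
  else
    # "$*" joins the arguments with the first character of IFS
    nodes=$(IFS=,; printf '%s' "$*")
    printf '%s/_nodes/%s/hot_threads\n' "$ES_HOST" "$nodes"
  fi
}

hot_threads_url                # all nodes
hot_threads_url node1          # a particular node
hot_threads_url node1 node2    # a selected few

# To send the request to a running cluster:
#   curl "$(hot_threads_url node1 node2)"
```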

The API supports the following parameters:

  • threads (the default: 3): This is the number of threads that should be analyzed. Elasticsearch takes the specified number of the hottest threads by looking at the information determined by the type parameter.
  • interval (the default: 500ms): Elasticsearch checks threads twice to calculate the percentage of time spent in a particular thread on an operation defined by the type parameter. We can use the interval parameter to define the time between these checks.
  • type (the default: cpu): This is the type of thread state to be examined. The API can check the CPU time taken by the given thread (cpu), the time in the blocked state (block), or the time in the waiting (wait) state. If you would like to know more about the thread states, refer to http://docs.oracle.com/javase/7/docs/api/java/lang/Thread.State.html.
  • snapshots (the default: 10): This is the number of stack trace snapshots to take. A stack trace is the nested sequence of method calls at a certain point in time.

Using the Hot Threads API is very simple; for example, to look at hot threads on all the nodes that are in the waiting state with check intervals of one second, we would use the following command:

curl 'localhost:9200/_nodes/hot_threads?type=wait&interval=1s'
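
The threads and snapshots parameters can be added to the query string in the same way. The sketch below spells out all four parameters and falls back to the defaults listed earlier when an argument is omitted. The helper name hot_threads_query is hypothetical, and the function only prints the URL rather than calling the cluster.

```shell
# Hypothetical helper: print a Hot Threads URL with all four parameters.
# Arguments (all optional, in order): type, interval, threads, snapshots.
hot_threads_query() {
  ht_type="${1:-cpu}"
  ht_interval="${2:-500ms}"
  ht_threads="${3:-3}"
  ht_snapshots="${4:-10}"
  printf 'localhost:9200/_nodes/hot_threads?type=%s&interval=%s&threads=%s&snapshots=%s\n' \
    "$ht_type" "$ht_interval" "$ht_threads" "$ht_snapshots"
}

hot_threads_query                  # the defaults, made explicit
hot_threads_query block 1s 5 20    # five most blocked threads, 20 snapshots

# To send the request to a running cluster:
#   curl "$(hot_threads_query block 1s 5 20)"
```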

Usage clarification for the Hot Threads API

Unlike other Elasticsearch APIs, which return JSON, the Hot Threads API returns formatted text that contains several sections. Before we discuss the response structure itself, we would like to tell you a bit about the logic that generates this response. Elasticsearch takes all the running threads and collects various information about each of them: the CPU time spent in the thread, the number of times the thread was blocked or in the waiting state, how long it was blocked or waiting, and so on. It then waits for a particular amount of time (specified by the interval parameter) and collects the same information again. After this is done, the threads are sorted by the time each of them spent in the state specified by the type parameter. The sort is done in descending order, so the threads with the longest time are at the top of the list.

After this, the first N threads (where N is the number of threads specified by the threads parameter) are analyzed. Every few milliseconds, Elasticsearch takes a snapshot of the stack traces of the threads selected in the previous step (the number of snapshots is specified by the snapshots parameter). The last thing that needs to be done is the grouping of stack traces in order to visualize changes in the thread state and return the response to the caller.
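
The sampling and sorting step can be illustrated with a few lines of shell. This is only a sketch of the arithmetic, not what Elasticsearch runs internally: the thread names and the CPU times are invented, and the helper name top_threads is hypothetical. Each input line holds a thread name followed by its cumulative CPU time (in milliseconds) at the first and at the second check.

```shell
# Invented sample data: <thread-name> <first-check-ms> <second-check-ms>
samples='search[T#1] 1200 1450
merge[T#1] 800 1210
index[T#3] 300 310
refresh[T#1] 90 95'

# Hypothetical helper: compute the time used between the two checks,
# sort in descending order, and keep the first N lines (N plays the
# role of the threads parameter).
top_threads() {
  awk '{ print $3 - $2, $1 }' | sort -rn | head -n "$1"
}

printf '%s\n' "$samples" | top_threads 2
```

With this data, the merge thread comes first (410 ms used between the checks), followed by the search thread (250 ms); the other two threads are cut off by the limit of two.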

The Hot Threads API response

Now, let's go through the sections of the response returned by the Hot Threads API. For example, the following screenshot is a fragment of the Hot Threads API response generated for Elasticsearch that was just started:

[Screenshot: The Hot Threads API response]

Now, let's discuss the sections of the response. To do that, we will use a slightly different response compared to the one shown previously. We do this to better visualize what is happening inside Elasticsearch. However, please remember that the general structure of the response will not change.

The first section of the Hot Threads API response shows us which node the thread is located on. For example, the first line of the response can look as follows:

::: [N'Gabthoth][aBb5552UQvyFCk1PNCaJnA][Banshee-3.local][inet[/10.0.1.3:9300]]

Thanks to this line, we can see which node the Hot Threads API returns information about, which is very handy when the Hot Threads API call goes to many nodes.
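
When a response covers many nodes, these lines can be used to list the nodes it contains. The following sketch assumes that every node section starts with ::: followed by the node name in square brackets, as in the line shown previously. The helper name node_names is hypothetical, and the second node in the sample input is invented.

```shell
# Hypothetical helper: print the node name from every line that starts
# a node section (":::" followed by the name in square brackets).
node_names() {
  sed -n 's/^::: \[\([^]]*\)].*/\1/p'
}

# Sample input; the second node line is made up for illustration.
node_names <<'EOF'
::: [N'Gabthoth][aBb5552UQvyFCk1PNCaJnA][Banshee-3.local][inet[/10.0.1.3:9300]]
   0.5% (2.7ms out of 500ms) cpu usage by thread 'elasticsearch[N'Gabthoth][search][T#10]'
::: [ExampleNode][hypotheticalNodeId000000][example.local][inet[/10.0.1.4:9300]]
EOF
```

Against a running cluster, the same helper could be fed from the API directly, for example with curl -s 'localhost:9200/_nodes/hot_threads' | node_names.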

The next lines of the Hot Threads API response can be divided into several sections, each starting with a line similar to the following one:

0.5% (2.7ms out of 500ms) cpu usage by thread 'elasticsearch[N'Gabthoth][search][T#10]'

In our case, we see a thread from the search group, which took 0.5 percent of the CPU time during the measurement (2.7ms out of the 500ms interval). The cpu usage part of the preceding line indicates that we are using type equal to cpu (other values you can expect here are block usage for threads in the blocked state and wait usage for threads in the waiting state). The thread name is very important here, because by looking at it, we can see which Elasticsearch functionality is the hot one. In our example, we see that this thread is all about searching (the search value). Other example values that you can expect to see are recovery_stream (for recovery module events), cache (for caching events), merge (for segment merging threads), index (for data indexing threads), and so on.
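
To get a quick overview of a long response, the thread lines can be reduced to the three values discussed above. The following sketch assumes the thread name has the shape shown in the example, elasticsearch[node][name][T#number]; lines that do not match this shape are skipped. The helper name thread_summary is hypothetical.

```shell
# Hypothetical helper: for each thread line, print the functionality
# name, the type, and the percentage, in that order.
thread_summary() {
  sed -n 's/^ *\([0-9.]*\)% .*) \([a-z]*\) usage by thread .*]\[\([^]]*\)]\[T#[0-9]*].*/\3 \2 \1/p'
}

# The thread line from the example above
thread_summary <<'EOF'
0.5% (2.7ms out of 500ms) cpu usage by thread 'elasticsearch[N'Gabthoth][search][T#10]'
EOF
```

For the example line, this prints search cpu 0.5.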

The next part of the Hot Threads API response is the section starting with the following information:

10/10 snapshots sharing following 10 elements

This information will be followed by a stack trace. In our case, 10/10 means that all 10 of the snapshots taken share the same stack trace. In general, this means that all the examination time was spent in the same part of the Elasticsearch code.
