Knowing about garbage collection

You know that Elasticsearch is a Java application and, because of that, it runs in the Java Virtual Machine. Each Java application is compiled into so-called bytecode, which can be executed by the JVM. In the most general way of thinking, you can imagine that the JVM is just executing other programs and controlling their behavior. However, this is not what you will care about unless you develop plugins for Elasticsearch, which we will discuss in Chapter 9, Developing Elasticsearch Plugins. What you will care about is the garbage collector, the piece of the JVM that is responsible for memory management. When objects are dereferenced, they can be removed from the memory by the garbage collector; when the memory is running low, the garbage collector starts working and tries to remove objects that are no longer referenced. In this section, we will see how to configure the garbage collector, how to avoid memory swapping, how to log the garbage collector behavior, how to diagnose problems, and how to use some Java tools that will show us how it all works.

Note

You can learn more about the architecture of the JVM in many places on the Web, for example, on Wikipedia: http://en.wikipedia.org/wiki/Java_virtual_machine.

Java memory

When we specify the amount of memory using the Xms and Xmx parameters (or the ES_MIN_MEM and ES_MAX_MEM properties), we specify the minimum and maximum size of the JVM heap space. It is basically a reserved space of physical memory that can be used by the Java program, which, in our case, is Elasticsearch. A Java process will never use more heap memory than what we've specified with the Xmx parameter (or the ES_MAX_MEM property). When a new object is created in a Java application, it is placed in the heap memory. After it is no longer used, the garbage collector will try to remove that object from the heap to free the memory space, so that the JVM can reuse it in the future. You can imagine that if you don't have enough heap memory for your application to create new objects on the heap, bad things will happen: the JVM will throw an OutOfMemoryError, which is a sign that something is wrong with the memory. Either we don't have enough of it, or we have a memory leak and we don't release objects that are no longer used.
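
For example, a minimal sketch of starting Elasticsearch with fixed heap boundaries from a Linux shell, assuming the standard startup script shipped with the distribution:

# Set the minimum and maximum heap size; equal values prevent heap resizing
export ES_MIN_MEM=2g
export ES_MAX_MEM=2g
# Start Elasticsearch; the startup script picks up the preceding variables
./bin/elasticsearch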

Note

When running Elasticsearch on powerful machines that have a lot of free RAM, we may ask ourselves whether it is better to run a single large instance of Elasticsearch with plenty of RAM given to the JVM or a few instances with a smaller heap size. Before we answer this question, we need to remember that the more heap memory is given to the JVM, the harder the work for the garbage collector gets. In addition to this, when setting the heap size to more than about 31 GB, we don't benefit from compressed ordinary object pointers (compressed oops), and the JVM will need to use 64-bit pointers for the data, which means that we will use more memory to address the same amount of data. Given these facts, it is usually better to go for multiple smaller instances of Elasticsearch instead of one big instance.
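
If you want to check whether a given heap size still allows the JVM to use compressed pointers, you can query the JVM flags directly; a sketch assuming an Oracle JDK 7 on the command line (the exact cutoff point can vary between systems):

java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
java -Xmx32g -XX:+PrintFlagsFinal -version | grep UseCompressedOops

On most systems, the first command will report UseCompressedOops as true and the second one as false.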

The JVM memory (in Java 7) is divided into the following regions:

  • eden space: This is the part of the heap memory where the JVM initially allocates most of the object types.
  • survivor space: This is the part of the heap memory that stores objects that survived the garbage collection of the eden space heap. The survivor space is divided into survivor space 0 and survivor space 1.
  • tenured generation: This is the part of the heap memory that holds objects that were living for some time in the survivor space heap part.
  • permanent generation: This is the non-heap memory that stores all the data for the virtual machine itself, such as classes and methods for objects.
  • code cache: This is the non-heap memory that is present in the HotSpot JVM that is used for the compilation and storage of native code.

The preceding classification can be simplified: the eden space and the survivor spaces are together called the young generation heap space, and the tenured generation is often called the old generation.
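
For illustration, the sizes of these regions can be controlled with HotSpot JVM flags; the following is a sketch with arbitrary example values, not recommendations for production:

# 4 GB total heap with a 1 GB young generation
-Xms4g -Xmx4g -Xmn1g
# Keep the eden space eight times larger than a single survivor space
-XX:SurvivorRatio=8
# Cap the permanent generation at 256 MB (Java 7 and earlier only)
-XX:MaxPermSize=256m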

The life cycle of Java objects and garbage collections

In order to see how the garbage collector works, let's go through the life cycle of a sample Java object.

When a new object is created in a Java application, it is placed in the young generation heap space, inside the eden space part. Then, when the next young generation garbage collection is run and the object survives that collection (basically, if it was not a one-time use object and the application still needs it), it will be moved to the survivor part of the young generation heap space (first to survivor 0 and then, after another young generation garbage collection, to survivor 1).

After living for some time in the survivor 1 space, the object is moved to the tenured generation heap space, so it will now be a part of the old generation. From now on, the young generation garbage collector won't be able to move that object in the heap space. The object will live in the old generation until our application decides that it is not needed anymore. In such a case, when the next full garbage collection comes in, the object will be removed from the heap space, making room for new objects.
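
If you would like to observe this promotion process on a running JVM, HotSpot can print the age distribution of objects in the survivor spaces on every young collection; a sketch of the relevant flags (added, for example, to JAVA_OPTS) could look like this:

# Print object ages in the survivor spaces at each young collection
-XX:+PrintGCDetails -XX:+PrintTenuringDistribution
# Promote objects only after they survive 15 young collections (the maximum)
-XX:MaxTenuringThreshold=15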

Note

There is one thing to remember: what you should usually aim for is a larger number of smaller garbage collections rather than one long one. This is because you want your application to run at the same, constant performance level and the garbage collector work to be transparent for Elasticsearch. When a big garbage collection happens, it can be a stop-the-world garbage collection event, during which Elasticsearch will be frozen for a short period of time, which will make your queries very slow and will stop your indexing process for some time.

Based on the preceding information, we can say (and it is actually true) that, at least until now, Java has used generational garbage collection: the more garbage collections our object survives, the further it gets promoted. Because of this, we can say that there are two types of garbage collectors working side by side: the young generation garbage collector (also called minor) and the old generation garbage collector (also called major).

Note

With update 9 of Java 7, Oracle introduced a new garbage collector called G1. It is promised to be almost totally unaffected by stop-the-world events and should work faster compared to other garbage collectors. To read more about G1, please refer to http://www.oracle.com/technetwork/tutorials/tutorials-1876574.html. Although Elasticsearch's creators advise against using G1, numerous companies use it with success, and it has allowed them to overcome problems with stop-the-world events when using Elasticsearch with large volumes of data and heavy queries.
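
If you decide to experiment with G1 despite that advice, enabling it comes down to a single JVM flag; a sketch using the JAVA_OPTS approach described later in this section:

# Switch the JVM to the G1 garbage collector
export JAVA_OPTS="-XX:+UseG1GC"

G1 also accepts a pause-time goal through the -XX:MaxGCPauseMillis option, which tells the collector how long stop-the-world pauses are allowed to be; note that it is a hint, not a guarantee.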

Dealing with garbage collection problems

When dealing with garbage collection problems, the first thing you need to identify is the source of the problem. It is not straightforward work and usually requires some effort from the system administrator or the people responsible for handling the cluster. In this section, we will show you two methods of observing and identifying problems with the garbage collector; the first is to turn on logging for the garbage collector in Elasticsearch, and the second is to use the jstat command, which is present in most Java distributions.

In addition to the presented methods, please note that there are tools out there that can help you diagnose issues related to memory and the garbage collector. These tools are usually provided in the form of monitoring software solutions such as Sematext Group SPM (http://sematext.com/spm/index.html) or NewRelic (http://newrelic.com/). Such solutions provide sophisticated information related not only to garbage collection, but also to memory usage as a whole.

An example dashboard from the mentioned SPM application showing the garbage collector work looks as follows:

[Figure: SPM dashboard showing the garbage collector work]

Turning on logging of garbage collection work

Elasticsearch allows us to observe periods when the garbage collector is working too long. In the default elasticsearch.yml configuration file, you can see the following entries, which are commented out by default:

monitor.jvm.gc.young.warn: 1000ms
monitor.jvm.gc.young.info: 700ms
monitor.jvm.gc.young.debug: 400ms
monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.debug: 2s

As you can see, the configuration specifies three log levels and a threshold for each of them. For example, for the info logging level, if the young generation collection takes 700 milliseconds or more, Elasticsearch will write the information to the logs. In the case of the old generation, the information will be logged if the collection takes more than five seconds.
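
In addition to these Elasticsearch-level thresholds, you can ask the JVM itself to log every single collection; a sketch of the standard HotSpot logging flags (appended, for example, to JAVA_OPTS; the log file path is just an example):

# Log every garbage collection with details and timestamps
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
# Write the garbage collector log to a dedicated file
-Xloggc:/var/log/elasticsearch/gc.log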

Note

Please note that in older Elasticsearch versions (before 1.0), the prefix to log information related to young generation garbage collection was monitor.jvm.gc.ParNew.*, while the prefix to log old garbage collection information was monitor.jvm.gc.ConcurrentMarkSweep.*.

What you'll see in the logs is something like this:

[2014-11-09 15:22:52,355][WARN ][monitor.jvm              ] [Lizard] [gc][old][964][1] duration [14.8s], collections [1]/[15.8s], total [14.8s]/[14.8s], memory [8.6gb]->[3.4gb]/[11.9gb], all_pools {[Code Cache] [8.3mb]->[8.3mb]/[48mb]}{[young] [13.3mb]->[3.2mb]/[266.2mb]}{[survivor] [29.5mb]->[0b]/[33.2mb]}{[old] [8.5gb]->[3.4gb]/[11.6gb]}

As you can see, the preceding line from the log file describes the old garbage collector work. We can see that the total collection took 14.8 seconds. Before the garbage collection operation, 8.6 GB of heap memory was used (out of 11.9 GB). After the garbage collection work, the amount of heap memory used was reduced to 3.4 GB. After this, you can see more detailed statistics about which parts of the heap were taken into consideration by the garbage collector: the code cache, the young generation space, the survivor space, and the old generation heap space.

When turning on the logging of the garbage collector work at a certain threshold, we can see when things don't run the way we would like by just looking at the logs. However, if you would like to see more, Java comes with a tool for that: jstat.

Using jstat

Running the jstat command to look at how our garbage collector works is as simple as running the following command:

jstat -gcutil 123456 2000 1000

The -gcutil switch tells the command to monitor the garbage collector work, 123456 is the identifier of the virtual machine on which Elasticsearch is running, 2000 is the interval in milliseconds between samples, and 1000 is the number of samples to be taken. So, in our case, the preceding command will run for a little more than 33 minutes (2000 * 1000 / 1000 / 60).

In most cases, the virtual machine identifier will be similar to, or even the same as, your process ID, but not always. In order to check which Java processes are running and what their virtual machine identifiers are, you can just run the jps command, which is provided with most JDK distributions. A sample invocation looks like this:

jps

The result would be as follows:

16232 Jps
11684 ElasticSearch

In the result of the jps command, we see that each line contains the JVM identifier, followed by the process name. If you want to learn more about the jps command, please refer to the Java documentation at http://docs.oracle.com/javase/7/docs/technotes/tools/share/jps.html.

Note

Please remember to run the jstat command from the same account that Elasticsearch is running under or, if that is not possible, to run jstat with administrator privileges (for example, using the sudo command on Linux systems). It is crucial to have access rights to the process running Elasticsearch, or the jstat command won't be able to connect to that process.
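
For example, assuming the virtual machine identifier 11684 from the earlier jps output and an elasticsearch system account, the invocation on a Linux system could look like this:

sudo -u elasticsearch jstat -gcutil 11684 2000 1000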

Now, let's look at a sample output of the jstat command:

S0     S1    E      O     P      YGC   YGCT    FGC   FGCT     GCT
12.44  0.00  27.20  9.49  96.70  78    0.176   5     0.495    0.672
12.44  0.00  62.16  9.49  96.70  78    0.176   5     0.495    0.672
12.44  0.00  83.97  9.49  96.70  78    0.176   5     0.495    0.672
0.00   7.74  0.00   9.51  96.70  79    0.177   5     0.495    0.673
0.00   7.74  23.37  9.51  96.70  79    0.177   5     0.495    0.673
0.00   7.74  43.82  9.51  96.70  79    0.177   5     0.495    0.673
0.00   7.74  58.11  9.51  96.71  79    0.177   5     0.495    0.673

The preceding example comes from the Java documentation, and we decided to use it because it nicely shows what jstat is all about. Let's start with what each of the columns means:

  • S0: This is the survivor space 0 utilization as a percentage of the space's capacity
  • S1: This is the survivor space 1 utilization as a percentage of the space's capacity
  • E: This is the eden space utilization as a percentage of the space's capacity
  • O: This is the old space utilization as a percentage of the space's capacity
  • P: This is the permanent space utilization as a percentage of the space's capacity
  • YGC: This is the number of young garbage collection events
  • YGCT: This is the total time of young garbage collections in seconds
  • FGC: This is the number of full garbage collections
  • FGCT: This is the total time of full garbage collections in seconds
  • GCT: This is the total garbage collection time in seconds

Now, let's get back to our example. As you can see, there was a young garbage collection event after sample three and before sample four. We can see that the collection took 0.001 of a second (the 0.177 YGCT in the fourth sample minus the 0.176 YGCT in the third sample). We also know that the collection promoted objects from the eden space (which is at 0 percent in the fourth sample and was at 83.97 percent in the third sample) to the old generation heap space (whose utilization increased from 9.49 percent in the third sample to 9.51 percent in the fourth sample). This example shows you how you can analyze the output of jstat. Of course, it can be time consuming and requires some knowledge about how the garbage collector works and what is stored in the heap. However, sometimes it is the only way to see why Elasticsearch is stuck at certain moments.

Remember that if you ever see Elasticsearch not working correctly, with the S0, S1, or E columns at 100 percent and the garbage collector working but unable to reclaim these heap spaces, then either your young generation heap space is too small and you should increase it (of course, if you have sufficient physical memory available), or you have run into some memory problems, such as a memory leak, where some resource is not releasing unused memory. On the other hand, when your old generation space is at 100 percent and the garbage collector is struggling to release it (running frequent garbage collections) but can't, it probably means that you just don't have enough heap space for your Elasticsearch node to operate properly. In such cases, what you can do without changing your index architecture is increase the heap space that is available to the JVM that is running Elasticsearch (for more information about JVM parameters, refer to http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html).
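
For example, a sketch of raising the heap to 8 GB with the ES_HEAP_SIZE variable, which sets both the minimum and the maximum heap size at once, before starting a node with the standard startup script:

# Give the Elasticsearch JVM an 8 GB heap
export ES_HEAP_SIZE=8g
./bin/elasticsearch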

Creating memory dumps

One additional thing that we didn't mention until now is the ability to dump the heap memory to a file. Java allows us to get a snapshot of the memory for a given point in time, and we can use that snapshot to analyze what is stored in the memory and find problems. In order to dump the Java process memory, one can use the jmap (http://docs.oracle.com/javase/7/docs/technotes/tools/share/jmap.html) command, for example, like this:

jmap -dump:file=heap.dump 123456

The 123456, in our case, is the identifier of the Java process we want to get the memory dump for, and -dump:file=heap.dump specifies that we want the dump to be stored in a file named heap.dump. Such a dump can be further analyzed by specialized software, such as jhat (http://docs.oracle.com/javase/7/docs/technotes/tools/share/jhat.html), but the usage of such programs is beyond the scope of this book.
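
Still, for completeness, a minimal sketch of browsing such a dump with jhat (the port number is just an example):

# Analyze the dump and serve the results over HTTP
jhat -port 7000 heap.dump

After jhat finishes parsing the dump, you can point your web browser at http://localhost:7000 to browse the heap contents.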

More information on the garbage collector work

Tuning garbage collection is not a simple process. The default options set for us in the Elasticsearch deployment are usually sufficient for most cases, and the only thing you'll need to do is adjust the amount of memory for your nodes. The topic of tuning the garbage collector work is beyond the scope of this book; it is very broad and is called black magic by some developers. However, if you would like to read more about the garbage collector, what the options are, and how they affect your application, we can suggest a great article that can be found at http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html. Although the article concentrates on Java 6, most of the options, if not all, can be successfully used with deployments running on Java 7.

Adjusting the garbage collector work in Elasticsearch

We now know how the garbage collector works and how to diagnose problems with it, so it would be nice to know how we can change the Elasticsearch startup parameters to adjust how the garbage collector works. How to do that depends on how you run Elasticsearch. We will look at the two most common methods: the standard startup script provided with the Elasticsearch distribution package and the service wrapper.

Using the standard startup script

When using the standard startup script, in order to add additional JVM parameters, we should include them in the JAVA_OPTS environment variable. For example, if we would like to include -XX:+UseParNewGC -XX:+UseConcMarkSweepGC in our Elasticsearch startup parameters on Linux-like systems, we would do the following:

export JAVA_OPTS="-XX:+UseParNewGC -XX:+UseConcMarkSweepGC"

In order to check whether the variable was properly set, we can just run another command:

echo $JAVA_OPTS

The preceding command should result in the following output in our case:

-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
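
To be sure that the options actually reached a running Elasticsearch JVM, and not only the shell, you can also list the arguments of the running Java processes; a sketch using the jps command we saw earlier:

# Show the main class and the JVM arguments of each running Java process
jps -lvm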

Service wrapper

Elasticsearch allows the user to install it as a service using the Java service wrapper (https://github.com/elasticsearch/elasticsearch-servicewrapper). If you are using the service wrapper, setting up JVM parameters is different from the method shown previously. What we need to do is modify the elasticsearch.conf file, which will probably be located in /opt/elasticsearch/bin/service/ (if your Elasticsearch was installed in /opt/elasticsearch). In the mentioned file, you will see properties such as the following:

set.default.ES_HEAP_SIZE=1024

You will see properties such as these as well:

wrapper.java.additional.1=-Delasticsearch-service
wrapper.java.additional.2=-Des.path.home=%ES_HOME%
wrapper.java.additional.3=-Xss256k
wrapper.java.additional.4=-XX:+UseParNewGC
wrapper.java.additional.5=-XX:+UseConcMarkSweepGC
wrapper.java.additional.6=-XX:CMSInitiatingOccupancyFraction=75
wrapper.java.additional.7=-XX:+UseCMSInitiatingOccupancyOnly
wrapper.java.additional.8=-XX:+HeapDumpOnOutOfMemoryError
wrapper.java.additional.9=-Djava.awt.headless=true

The first property is responsible for setting the heap memory size for Elasticsearch, while the rest are additional JVM parameters. If you would like to add another parameter, you can just add another wrapper.java.additional property, followed by a dot and the next available number, for example:

wrapper.java.additional.10=-server

Note

One thing to remember is that tuning the garbage collector work is not something that you do once and then forget. It requires experimentation, as it is very dependent on your data, your queries, and the combination of both. Don't be afraid to make changes when something is wrong, but also observe them and look at how Elasticsearch works after making the changes.

Avoiding swapping on Unix-like systems

Although this is not strictly about garbage collection or heap memory usage, we think that it is crucial to see how to disable swapping. Swapping is the process of writing memory pages to the disk (the swap partition on Unix-based systems) when the amount of physical memory is not sufficient, or when the operating system decides that, for some reason, it is better to have some part of the RAM written to the disk. If the swapped memory pages are needed again, the operating system will load them from the swap partition and allow processes to use them. As you can imagine, such a process takes time and resources.

When using Elasticsearch, we want to avoid its process memory being swapped. You can imagine that having parts of the memory used by Elasticsearch written to the disk and then read from it again can hurt the performance of both searching and indexing. Because of this, Elasticsearch allows us to turn off swapping for its process. In order to do that, one should set bootstrap.mlockall to true in the elasticsearch.yml file.

However, the preceding setting is only the beginning. You also need to ensure that the JVM won't resize the heap, by setting the Xmx and Xms parameters to the same values (you can do that by specifying the same values for the ES_MIN_MEM and ES_MAX_MEM environment variables for Elasticsearch). Also, remember that you need to have enough physical memory to handle the settings you've set.
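
Putting that together, a sketch of the relevant settings could look like this (the heap size value is just an example):

# elasticsearch.yml
bootstrap.mlockall: true

# environment, set before starting Elasticsearch
export ES_HEAP_SIZE=4g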

Now, if we run Elasticsearch, we may run into the following message in the logs:

[2013-06-11 19:19:00,858][WARN ][common.jna               ]  Unknown mlockall error 0

This means that our memory locking is not working. So now, let's modify two files on our Linux operating system (this will require administration rights). We assume that the user who will run Elasticsearch is elasticsearch.

First, we modify /etc/security/limits.conf and add the following entries:

elasticsearch - nofile 64000 
elasticsearch - memlock unlimited

The second thing is to modify the /etc/pam.d/common-session file and add the following:

session required pam_limits.so

After logging in to the elasticsearch user account again, you should be able to start Elasticsearch without seeing the mlockall error message.
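
To verify that the memory locking actually works, you can look at the process section of the nodes info API; a sketch assuming a node running on localhost with the default HTTP port:

curl 'localhost:9200/_nodes/process?pretty'

In the response, the process section of each node should contain "mlockall" : true.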
