Strangely enough, even though CPU might be considered the most important computing resource, RAM allocation for clustered services is even more important, because RAM overuse can (and will) cause Out of Memory (OOM) process and task failures for anything running on the same host. Given the prevalence of memory leaks in software, this is usually a matter of "when" rather than "if", so setting limits on RAM allocation is generally very desirable, and in some orchestration configurations it is even mandatory. A service suffering from this issue is usually indicated by a SIGKILL, a "Process killed" message, or exit code 137 (128 plus SIGKILL's signal number, 9).
By limiting the available RAM, only the offending task's processes will be targeted by the OOM killer instead of a random process on the host. This makes identifying the faulty code much easier and faster: you will see a large number of failures from that one service while your other services stay operational, increasing the stability of the cluster.
To use the RAM-limiting cgroup configuration, run the container with a combination of the following flags:
- -m / --memory: A hard limit on the maximum amount of memory that a container can use. Allocations of new memory over this limit will fail, and the kernel will terminate a process in your container, usually the main one running the service.
- --memory-swap: The total amount of memory, including swap, that the container can use. It must be used together with the previous flag and must be larger than it. By default, a container can use up to twice its memory limit in combined RAM and swap. Setting this to -1 allows the container to use as much swap as the host has.
- --memory-swappiness: How eager the system will be to move pages from physical memory to on-disk swap space. The value is between 0 and 100, where 0 means that pages will try to stay in resident RAM as much as possible and 100 means they will be swapped out aggressively. On most Linux machines this value defaults to 60 (inherited from the host's vm.swappiness setting), but since swap space access is very slow compared to RAM, my recommendation is to set this number as close to 0 as you can afford.
- --memory-reservation: A soft limit on the RAM usage of a service, generally used only by the orchestration engine to estimate a service's expected RAM usage so that it can schedule tasks for maximum usage density. This flag provides no guarantee that it will keep the service's RAM usage below this level.
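As a quick sketch, the flags above might be combined as follows; the specific values and the ubuntu image here are illustrative assumptions, not recommendations:

```shell
# Illustrative only: cap RAM at 256 MB, total RAM+swap at 512 MB,
# keep pages resident in RAM when possible, and advertise a 128 MB
# soft reservation to the scheduler.
docker run --rm \
  -m 256m \
  --memory-swap 512m \
  --memory-swappiness 10 \
  --memory-reservation 128m \
  ubuntu /usr/bin/free -h
```

Note that only -m is doing hard enforcement here; the reservation is purely advisory.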
There are a few more flags that can be used for memory limiting, but even the preceding list covers more than you will probably ever need to worry about. For most deployments, big and small, you will probably only need to use -m and set a low value of --memory-swappiness, the latter usually done on the host itself through a sysctl.d boot setting so that all services will utilize it.
$ echo "vm.swappiness = 10" | sudo tee -a /etc/sysctl.d/60-swappiness.conf
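To apply the new value without waiting for a reboot, you can reload the sysctl configuration and read the effective value back (this assumes a Linux host and requires root for the reload):

```shell
# Reload all sysctl.d files so the new swappiness value applies now
# rather than at the next boot, then read the effective value back.
sudo sysctl --system >/dev/null
cat /proc/sys/vm/swappiness
```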
To see this in action, we will first run one of the most resource-intensive frameworks (JBoss) with a limit of 30 MB of RAM and see what happens:
$ docker run -it \
    --rm \
    -m 30m \
    jboss/wildfly
Unable to find image 'jboss/wildfly:latest' locally
latest: Pulling from jboss/wildfly
<snip>
Status: Downloaded newer image for jboss/wildfly:latest
=========================================================================
JBoss Bootstrap Environment
JBOSS_HOME: /opt/jboss/wildfly
JAVA: /usr/lib/jvm/java/bin/java
JAVA_OPTS: -server -Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.byteman -Djava.awt.headless=true
=========================================================================
*** JBossAS process (57) received KILL signal ***
As expected, the container used up too much RAM and was promptly killed by the kernel. Now, what if we try the same thing but give it 400 MB of RAM?
$ docker run -it \
    --rm \
    -m 400m \
    jboss/wildfly
=========================================================================
JBoss Bootstrap Environment
JBOSS_HOME: /opt/jboss/wildfly
JAVA: /usr/lib/jvm/java/bin/java
JAVA_OPTS: -server -Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.byteman -Djava.awt.headless=true
=========================================================================
14:05:23,476 INFO [org.jboss.modules] (main) JBoss Modules version 1.5.2.Final
<snip>
14:05:25,568 INFO [org.jboss.ws.common.management] (MSC service thread 1-6) JBWS022052: Starting JBossWS 5.1.5.Final (Apache CXF 3.1.6)
14:05:25,667 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0060: Http management interface listening on http://127.0.0.1:9990/management
14:05:25,667 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://127.0.0.1:9990
14:05:25,668 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: WildFly Full 10.1.0.Final (WildFly Core 2.2.0.Final) started in 2532ms - Started 331 of 577 services (393 services are lazy, passive or on-demand)
Our container can now start without any issues!
If you have worked a lot with applications in bare-metal environments, you might be asking yourself why the JBoss JVM didn't know ahead of time that it wouldn't be able to run within such a constrained environment and fail even sooner. The answer lies in a really unfortunate quirk of cgroups (though it might be considered a feature, depending on your point of view): the host's resources are presented unaltered to the container even though the container itself is constrained. You can see this pretty easily if you run a memory-limited container and print out the available RAM:
$ # Let's see what a low allocation shows
$ docker run -it --rm -m 30m ubuntu /usr/bin/free -h
              total        used        free      shared  buff/cache   available
Mem:           7.6G        1.4G        4.4G         54M        1.8G        5.9G
Swap:            0B          0B          0B
$ # What about a high one?
$ docker run -it --rm -m 900m ubuntu /usr/bin/free -h
              total        used        free      shared  buff/cache   available
Mem:           7.6G        1.4G        4.4G         54M        1.8G        5.9G
Swap:            0B          0B          0B
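The limit is not actually hidden; it just isn't reflected in tools such as free that read host-wide statistics. The kernel does expose the enforced limit through the cgroup filesystem inside the container, and a quick check looks something like the following sketch (the path differs between cgroup v1 and v2, so it tries both):

```shell
# Print the memory limit the kernel actually enforces on the container.
# cgroup v2 exposes it as memory.max; cgroup v1 as memory.limit_in_bytes.
docker run --rm -m 30m ubuntu sh -c \
  'cat /sys/fs/cgroup/memory.max 2>/dev/null || \
   cat /sys/fs/cgroup/memory/memory.limit_in_bytes'
```

Applications that want to behave well under cgroup limits have to read these files themselves rather than trusting the usual system-wide memory calls.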
As you can imagine, this causes all kinds of cascading issues for applications launched in a cgroup-limited container such as this. The primary one is that the application does not know a limit exists at all, so it will simply try to do its job assuming it has full access to the available RAM. Once the application reaches the predefined limit, its process will usually be killed and the container will die. This is a huge problem for apps and runtimes that can react to high memory pressure: they might be able to get by with less RAM in the container, but because they cannot detect that they are running constrained, they tend to gobble up memory at a much higher rate than they should.
Sadly, things are even worse on this front for containers. You must give the service not only a RAM limit big enough to start, but also enough to handle any dynamically allocated memory during the full duration of the service. If you do not, the same situation will occur, but at a much less predictable time. For example, if you ran an NGINX container with only a 4 MB RAM limit, it would start just fine, but after a few connections to it, the memory allocation would cross the threshold and the container would die. The orchestration service may then restart the task, and unless you have a logging mechanism or your orchestration provides good tooling for it, you will just end up with a service that shows a running state but is, in actuality, unable to process any requests.
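One way to tell such an OOM kill apart from other failures is to inspect the dead container's state; my_nginx below is a hypothetical container name standing in for whichever container died:

```shell
# Docker sets OOMKilled to true when the kernel killed the container
# for exceeding its memory limit; exit code 137 means 128 + SIGKILL (9).
docker inspect --format \
  'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' my_nginx
```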
If that wasn't enough, you also really should not arbitrarily assign high limits, because one of the purposes of containers is to maximize service density for a given hardware configuration. By setting limits that the running service is statistically almost guaranteed never to reach, you are effectively wasting those resources, since other services cannot use them. In the long run, this increases both the cost of your infrastructure and the resources needed to maintain it, so there is a strong incentive to keep each service limited to the minimum amount it can run with safely, instead of using really high limits.
So, considering everything we must keep in mind, tweaking these limits is closer to an art form than anything else: it is almost a variation of the famous bin-packing problem (https://en.wikipedia.org/wiki/Bin_packing_problem) with a statistical component layered on top, because you need to weigh optimum service availability against the resources wasted by loose limits.
Let's say we have a service with the following distribution:
- Three physical hosts with 2 GB RAM each (yes, this is really low, but it is to demonstrate the issues on smaller scales)
- Service 1 (database) that has a memory limit of 1.5 GB, two tasks, and has a 1 percent chance of running over the hard limit
- Service 2 (application) that has a memory limit of 0.5 GB, three tasks, and has a 5 percent chance of running over the hard limit
- Service 3 (data processing service) that has a memory limit of 0.5 GB, three tasks, and has a 5 percent chance of running over the hard limit
A scheduler may allocate the services in this manner:
overcapacity = avg(service_sizes) * avg(service_counts) * avg(max_rolling_service_restarts)
We will discuss this a bit more later in the text.
What if we take our last example and decide that we should just run with 1 percent OOM failure rates across the board, increasing the memory limit of Service 2 and Service 3 from 0.5 GB to 0.75 GB? In doing so, we ignore the possibility that higher failure rates on the data processing and application tasks might be acceptable to the end users (or even unnoticeable, if you are using messaging queues).
The new service spread would now look like this:
Our new configuration has a massive amount of pretty obvious issues:
- 25 percent reduction in service density. This number should be as high as possible to get all the benefits of using microservices.
- 25 percent reduction in hardware utilization. Effectively, 1/4 of the available hardware resources are being wasted in this setup.
- Node count has increased by 66 percent. Most cloud providers charge by the number of machines you have running, assuming they are of the same type. By making this change, you have effectively raised your cloud costs by 66 percent and may need correspondingly more ops support to keep your cluster working.
Even though this example has been intentionally rigged so that a small tweak causes the biggest impact, it should be obvious that slight changes to these limits can have massive repercussions across your whole infrastructure. While in real-world scenarios this impact will be smaller, because host machines will be larger than in the example and thus better able to stack smaller (relative to total capacity) services into the available space, do not underestimate the cascading effects of increasing service resource allocations.
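The node-count arithmetic in this example can be sanity-checked with a tiny first-fit packing sketch. Sizes are expressed in units of 0.25 GB, so a 2 GB host has a capacity of 8, a database task is 6 units, and the smaller tasks are 2 units (0.5 GB) before the change and 3 units (0.75 GB) after. A real scheduler is far more sophisticated than first-fit, so treat this purely as an illustration of the math:

```shell
#!/bin/sh
# First-fit bin packing: place each task on the first host with enough
# remaining capacity, opening a new host when none fits.
pack() {  # args: host capacity, then task sizes; prints host count
  cap=$1; shift
  hosts=""                     # space-separated remaining capacities
  for task in "$@"; do
    placed=0; new=""
    for free in $hosts; do
      if [ "$placed" -eq 0 ] && [ "$free" -ge "$task" ]; then
        new="$new $((free - task))"; placed=1
      else
        new="$new $free"
      fi
    done
    [ "$placed" -eq 0 ] && new="$new $((cap - task))"
    hosts=$new
  done
  set -- $hosts; echo $#
}

# Original limits: 2 x 1.5 GB (db) + 6 x 0.5 GB (app + processing)
pack 8 6 6 2 2 2 2 2 2    # prints 3 (hosts packed exactly)
# After raising the six small limits to 0.75 GB
pack 8 6 6 3 3 3 3 3 3    # prints 5 (a 66 percent node increase)
```

The jump from 3 to 5 hosts happens because 1.5 GB + 0.75 GB no longer fits on a 2 GB host, so each database task now occupies a host alone.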