Performance Monitoring and Optimization
An installed Linux server comes with default performance settings, which means that it will perform well for an average workload. Unfortunately, many servers go beyond an average workload, which means that optimization may be needed. In this chapter, you'll read how to monitor and optimize performance. The first part of this chapter is about performance monitoring. In the second part, you'll learn how to optimize performance.
The following topics are covered in this chapter:
Performance Monitoring
Before you can actually optimize anything, you have to know what’s going on. In this first section of the chapter, you’ll learn how to analyze performance. We’ll start with one of the most common but also one of the most informative tools: top.
Interpreting What’s Going On: top
Before starting to look at details, you should have a general overview of the current state of your server. The top utility is an excellent tool to help you with that. Let’s start by having a look at a server that is used as a virtualization server, hosting multiple virtual machines (see Listing 15-1).
Listing 15-1. Using top on a Busy Server
top - 10:47:49 up 1 day, 16:56, 3 users, load average: 0.08, 0.06, 0.10
Tasks: 409 total, 1 running, 408 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.6 us, 0.4 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16196548 total, 13197772 used, 2998776 free, 4692 buffers
KiB Swap: 4194300 total, 0 used, 4194300 free. 4679428 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1489 root 20 0 1074368 23568 11836 S 3.3 0.1 51:18.32 libvirtd
12730 root 20 0 6018668 2.058g 56760 S 2.7 13.3 52:07.62 virt-manager
19586 qemu 20 0 1320328 532616 8028 S 2.0 3.3 23:08.54 qemu-kvm
13719 qemu 20 0 1211512 508476 8028 S 1.7 3.1 23:42.33 qemu-kvm
18450 qemu 20 0 1336528 526252 8016 S 1.7 3.2 23:39.71 qemu-kvm
18513 qemu 20 0 1274928 463408 8036 S 1.7 2.9 23:28.97 qemu-kvm
18540 qemu 20 0 1274932 467276 8020 S 1.7 2.9 23:32.23 qemu-kvm
19542 qemu 20 0 1320840 514224 8032 S 1.7 3.2 23:03.55 qemu-kvm
19631 qemu 20 0 1315620 501828 8012 S 1.7 3.1 23:10.92 qemu-kvm
24773 qemu 20 0 1342848 547784 8016 S 1.7 3.4 23:38.80 qemu-kvm
3572 root 20 0 950484 148812 42644 S 1.3 0.9 39:24.33 firefox
16388 qemu 20 0 1275076 465400 7996 S 1.3 2.9 22:51.46 qemu-kvm
18919 qemu 20 0 1318728 510000 8020 S 1.3 3.1 23:46.81 qemu-kvm
28791 root 20 0 123792 1876 1152 R 0.3 0.0 0:00.03 top
1 root 20 0 53500 7644 3788 S 0.0 0.0 0:07.07 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.13 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:03.27 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
7 root rt 0 0 0 0 S 0.0 0.0 0:00.19 migration/0
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/0
10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/1
11 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/2
CPU Monitoring with top
When analyzing performance, you start at the first line of the top output. The load average parameters at the end of that line are of special interest. There are three of them, indicating the load average for the last minute, the last five minutes, and the last fifteen minutes. The load average gives the average number of processes in the run queue: everything that is actually being handled, or waiting to be handled. Because a CPU core can ultimately handle only one process at any moment, a load average of 1.00 on a single-CPU system would be the ideal load, indicating that the CPU is completely busy.
Looking at load average in this way is a little bit too simple, though. Some processes don’t demand that much from the CPU; other processes do. So, in some cases, performance can be good on a 1-CPU system that gives a load average of 8.00, while on other occasions, performance might be suffering, if load average is only at 1.00. Load average is a good start, but it’s not good enough just by itself.
Consider, for example, a task that is running completely on the CPU. You can force such a task by entering the following code line:
while true; do true; done
This task will completely claim one CPU core, thus causing a workload of 1.00. Because this task doesn't do any input/output (I/O), however, it has no waiting times. For a task like this, 1.00 is therefore considered a heavy workload, because if another task is started, processes will have to be queued owing to a lack of available resources.
Let’s now consider a task that is I/O intensive, such as a task in which your complete hard drive is copied to the null device (dd if=/dev/sda of=/dev/null). This task will also easily cause a workload that is 1.00 or higher, but because there is a lot of waiting for I/O involved in a task like that, it’s not as bad as the while true task. That is because while waiting for I/O, the CPU can do something else. So don’t be too quick in drawing conclusions from the load line.
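To judge whether a given load average is high, it helps to relate it to the number of CPU cores in the system. A minimal sketch, assuming the coreutils nproc command is available:

```shell
# number of CPU cores available to the system
cores=$(nproc)
# the first field of /proc/loadavg is the 1-minute load average
load1=$(awk '{print $1}' /proc/loadavg)
# print the load per core; values near 1.00 mean all cores are busy
awk -v l="$load1" -v c="$cores" 'BEGIN { printf "1-minute load per core: %.2f\n", l / c }'
```

On a four-core server, a load average of 4.00 thus corresponds to a per-core load of 1.00.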
When you see that your server's CPUs are very busy, you should analyze further. First, relate the load average to the number of CPUs in your server. By default, top provides a summary for all CPUs in your server. Press 1 on the keyboard to show a line for each CPU core instead. Because most modern servers are multi-core, you should apply this option, as it also gives you information about the multiprocessing environment. In Listing 15-2, you can see an example in which usage statistics are provided on a four-core server:
Listing 15-2. Monitoring Performance on a Four-Core Server
top - 11:06:29 up 1 day, 17:15, 3 users, load average: 6.80, 4.20, 1.95
Tasks: 424 total, 3 running, 421 sleeping, 0 stopped, 0 zombie
%Cpu0 : 84.9 us, 11.7 sy, 0.0 ni, 2.0 id, 0.7 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu1 : 86.6 us, 9.4 sy, 0.0 ni, 3.0 id, 0.3 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu2 : 86.6 us, 9.7 sy, 0.0 ni, 2.7 id, 0.7 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu3 : 88.0 us, 9.0 sy, 0.0 ni, 2.7 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16196548 total, 16021536 used, 175012 free, 3956 buffers
KiB Swap: 4194300 total, 10072 used, 4184228 free. 3700732 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29694 qemu 20 0 1424580 658276 8068 S 72.6 4.1 3:30.70 qemu-kvm
29934 qemu 20 0 1221208 614936 8064 S 69.7 3.8 1:08.35 qemu-kvm
29863 qemu 20 0 1386616 637948 8052 S 56.7 3.9 1:54.51 qemu-kvm
29627 qemu 20 0 1417552 643716 8064 S 56.1 4.0 4:37.15 qemu-kvm
29785 qemu 20 0 1425656 657500 8064 S 54.7 4.1 2:39.03 qemu-kvm
12730 root 20 0 7276512 2.566g 70496 R 26.5 16.6 54:20.94 virt-manager
3225 root 20 0 1950632 215728 35300 S 25.2 1.3 14:52.82 gnome-shell
1489 root 20 0 1074368 23600 11836 S 6.6 0.1 52:09.98 libvirtd
1144 root 20 0 226540 51348 35704 S 6.3 0.3 4:19.12 Xorg
18540 qemu 20 0 1274932 467276 8020 R 6.0 2.9 23:47.89 qemu-kvm
18450 qemu 20 0 1336528 526252 8016 S 2.3 3.2 23:55.18 qemu-kvm
18919 qemu 20 0 1318728 510000 8020 S 1.0 3.1 24:02.42 qemu-kvm
19631 qemu 20 0 1315620 501828 8012 S 1.0 3.1 23:26.65 qemu-kvm
24773 qemu 20 0 1334652 538816 8016 S 1.0 3.3 23:54.71 qemu-kvm
3572 root 20 0 950484 172500 42636 S 0.7 1.1 39:36.99 firefox
28791 root 20 0 123792 1876 1152 R 0.7 0.0 0:03.25 top
339 root 0 -20 0 0 0 S 0.3 0.0 0:04.00 kworker/1:1H
428 root 20 0 0 0 0 S 0.3 0.0 0:39.65 xfsaild/dm-1
921 root 20 0 19112 1164 948 S 0.3 0.0 0:09.19 irqbalance
26424 root 20 0 0 0 0 S 0.3 0.0 0:00.13 kworker/u8:2
When considering exactly what your server is doing, the CPU lines are an important indicator. There, you can monitor CPU performance, divided into different categories. These are summarized in the following list:
us: CPU time spent in user space, running normal processes
sy: CPU time spent in kernel space, handling system calls
ni: CPU time spent on processes running with an adjusted nice value
id: time the CPU has been idle
wa: time the CPU has been waiting for I/O to complete
hi: time spent handling hardware interrupts
si: time spent handling software interrupts
st: time stolen from this virtual machine by the hypervisor
Memory Monitoring with top
The second set of information to get from top concerns the lines about memory and swap usage. The memory lines contain five parameters (of which the last is in the swap line). These are
total: the total amount of physical memory installed in your server
used: the amount of memory that is currently in use
free: the amount of memory that is not in use
buffers: memory that is used for buffering I/O
cached: memory that is used for caching recently used files (this parameter is at the end of the swap line)
Understanding Swap
When considering memory usage, you should also consider the amount of swap that is being allocated. Swap is RAM that is emulated on disk. That may sound like a bad idea that really slows down server performance, but it doesn’t have to be.
To understand swap usage, you should understand the different kinds of memory that are in use on a Linux server. Linux distinguishes between active and inactive memory, and between file and anon memory. You can get these parameters from the /proc/meminfo file (see Listing 15-3).
Listing 15-3. Getting Detailed Memory Information from /proc/meminfo
[root@lab ~]# cat /proc/meminfo
MemTotal: 16196548 kB
MemFree: 1730808 kB
MemAvailable: 5248720 kB
Buffers: 3956 kB
Cached: 4045672 kB
SwapCached: 0 kB
Active: 10900288 kB
Inactive: 3019436 kB
Active(anon): 9725132 kB
Inactive(anon): 627268 kB
Active(file): 1175156 kB
Inactive(file): 2392168 kB
Unevictable: 25100 kB
Mlocked: 25100 kB
SwapTotal: 4194300 kB
SwapFree: 4194300 kB
Anon (anonymous) memory refers to memory that is allocated by programs. File memory refers to memory that is used for cache and buffers. On any Linux system, both kinds of memory can be flagged as active or inactive. Inactive file memory typically exists on a server that doesn't currently need the RAM for anything else. If memory pressure arises, the kernel can clear this memory immediately to make more RAM available. Inactive anon memory is memory that has been allocated but hasn't been used recently. Because of that, it can be moved to a slower kind of memory. That is exactly what swap is used for.
If swap contains only inactive anon memory, swap helps optimize the memory performance of a system. By moving these inactive memory pages out, more memory becomes available for caching, which is good for the overall performance of a server. Hence, if a Linux server shows some activity in swap, that is not a bad sign at all.
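You can check this yourself by comparing the amount of swap in use with the amount of inactive anon memory. A minimal sketch, reading both values from /proc/meminfo:

```shell
# swap in use = SwapTotal - SwapFree (both reported in kB)
swap_used=$(awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2} END {print t - f}' /proc/meminfo)
# inactive anonymous memory, also in kB
inactive_anon=$(awk '/^Inactive\(anon\):/ {print $2}' /proc/meminfo)
echo "swap used: ${swap_used} kB, inactive anon: ${inactive_anon} kB"
```

As long as the first number stays below the second, swap is only holding pages that weren't being used actively anyway.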
EXERCISE 15-1. MONITORING BUFFER AND CACHE MEMORY
In this exercise, you’ll monitor how buffer and cache memory are used. To start with a clean image, you’ll first restart your server, so that no old data is in buffers or cache. Next, you’ll run some commands that will cause the buffer and cache memory to be filled. At the end, you’ll clear the total amount of buffer and cache memory by using /proc/sys/vm/drop_caches.
cd /etc
for I in *
do
cat $I
done
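To observe the effect, you can read the buffer and cache figures directly from /proc/meminfo before and after running the loop, and finally clear them through /proc/sys/vm/drop_caches (writing to that file requires root privileges). A minimal sketch:

```shell
# show current buffer and cache usage (in kB)
grep -E '^(Buffers|Cached):' /proc/meminfo
# as root, write 3 to drop the page cache as well as dentries and inodes
# (1 = page cache only, 2 = dentries and inodes only, 3 = both)
# echo 3 > /proc/sys/vm/drop_caches
```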
Process Monitoring with top
The lower part of top is reserved for information about the most active processes. In this part, you'll see a few parameters related to these processes. By default, the following parameters are shown:
PID: the process ID
USER: the effective user who started the process
PR: the priority of the process
NI: the nice value of the process
VIRT: the amount of virtual memory claimed by the process
RES: the amount of resident memory the process is using
SHR: the amount of memory the process shares with other processes
S: the status of the process
%CPU: the percentage of CPU time the process is using
%MEM: the percentage of memory the process is using
TIME+: the total CPU time the process has used since it started
COMMAND: the command that started the process
Understanding Linux Memory Allocation
When analyzing Linux memory usage, you should know how Linux uses virtual and resident memory. Virtual memory on Linux is to be taken literally: it is a nonexistent amount of memory that the Linux kernel can refer to. When looking at the contents of the /proc/meminfo file, you can see that the amount of virtual memory is set to approximately 35TB:
VmallocTotal: 34359738367 kB
VmallocUsed: 486380 kB
VmallocChunk: 34359160008 kB
Virtual memory is used by the Linux kernel to allow programs to make a memory reservation. After making this reservation, no other application can reserve the same memory. Making the reservation is a matter of setting pointers and nothing else; it doesn't mean that the reserved memory is actually going to be used. When a program has to use the memory it has reserved, it issues a malloc() call, and at that moment the memory is actually allocated. From then on, we're talking about resident memory.
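The difference between reserved (virtual) and allocated (resident) memory is easy to observe with ps. For the current shell, for example:

```shell
# VSZ = virtual size (reserved), RSS = resident size (actually in use), both in KiB
ps -o pid,vsz,rss,comm -p $$
```

The VSZ value is typically far larger than RSS, because much of what a process reserves is never touched.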
The fact that Linux uses virtual memory when reserving memory may cause trouble later on. A program that has reserved memory (even if it is only virtual memory) expects that it can also use that memory. But that is not guaranteed, as the total amount of virtual memory is, in general, much larger than the amount of physical RAM plus swap that is available. This is known as memory over-commit or over-allocation, and in some cases it causes trouble. If a process has reserved virtual memory that cannot be mapped to physical memory, you may encounter an OOM (out of memory) situation, in which processes get killed. In the "Optimizing Performance" section, later in this chapter, you'll learn about some parameters that help you prevent such situations.
Analyzing CPU Performance
The top utility offers a good starting point for performance tuning. However, if you really need to dig deep into a performance problem, top does not offer sufficient information, and more advanced tools will be required. In this section, you’ll learn what you can do to find out more about CPU performance-related problems.
Most people tend to start analyzing a performance problem at the CPU, since they think CPU performance is the most important on a server. In most situations, this is not true. Assuming that you have a recent CPU, and not an old 486-based CPU, you will not often see a performance problem that really is related to the CPU. In most cases, a problem that appears to be CPU-related is likely caused by something else. For example, your CPU may just be waiting for data to be written to disk. Before getting into details, let’s have a look at a brief exercise that teaches how CPU performance can be monitored.
EXERCISE 15-2. ANALYZING CPU PERFORMANCE
In this exercise, you’ll run two different commands that will both analyze CPU performance. You’ll notice a difference in the behavior of both commands.
[root@hnl ~]# cat wait
#!/bin/bash
COUNTER=0
while true
do
dd if=/dev/urandom of=/root/file.$COUNTER bs=1M count=1
COUNTER=$(( COUNTER + 1 ))
[ "$COUNTER" = 1000 ] && exit
done
Understanding CPU Performance
To monitor what is happening on your CPU, you should know how the Linux kernel works with the CPU. A key component is the run queue. Before being served by the CPU, every process enters the run queue. There’s a run queue for every CPU core in the system. Once a process is in the run queue, it can be runnable or blocked. A runnable process is a process that is competing for CPU time; a blocked process is just waiting.
The Linux scheduler decides which runnable process to run next, based on the current priority of the process. A blocked process doesn't compete for CPU time. The load average line in top gives a summary of the workload that results from all runnable and blocked processes combined. If you want to know how many processes are currently in either the runnable or the blocked state, use vmstat. The columns r and b show the number of runnable and blocked processes. In Listing 15-4, you can see what this looks like on a system where vmstat has polled the system five times, with a two-second interval.
Listing 15-4. Use vmstat to See How Many Processes Are in Runnable or Blocked State
[root@lab ~]# vmstat 2 5
procs -----------memory----------- --swap-- ----io---- --system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 1412260 3956 3571532 0 0 39 62 0 71 3 1 97 0 0
0 0 0 1412252 3956 3571564 0 0 0 0 1217 3478 2 1 97 0 0
0 0 0 1412376 3956 3571564 0 0 0 0 1183 3448 2 1 97 0 0
0 0 0 1412220 3956 3571564 0 0 0 0 1189 3388 2 1 97 0 0
0 0 0 1412252 3956 3571564 0 0 0 0 1217 3425 2 1 97 0 0
Context Switches and Interrupts
A modern Linux system is always a multitasking system. This is true for every processor architecture, because the Linux kernel constantly switches between different processes. To perform such a switch, the CPU needs to save all the context information for the old process and retrieve the context information for the new process. These context switches therefore come at a significant performance price.
Ideally, you limit the number of context switches as much as possible. You can do this by using a multi-core CPU architecture, a server with multiple CPUs, or a combination of both. If you do, however, you have to make sure that processes are pinned to a dedicated CPU core, to prevent unnecessary context switches.
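On most distributions, pinning a process to a core can be done with the taskset command from the util-linux package (assuming it is installed). For example:

```shell
# show the current CPU affinity mask of this shell
taskset -p $$
# start a command that is only allowed to run on CPU core 0
taskset -c 0 sleep 1
```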
Processes serviced by the kernel scheduler are not the only cause of context switching, however. Another important cause of context switches is hardware interrupts: a piece of hardware demanding processor time. To see what your hardware has been doing, you can look at the contents of the /proc/interrupts file (see Listing 15-5).
Listing 15-5. The /proc/interrupts File Shows You Exactly How Many of Each Interrupt Has Been Handled
[root@lab proc]# cat interrupts
CPU0 CPU1 CPU2 CPU3
0: 54 0 0 0 IR-IO-APIC-edge timer
8: 0 0 0 1 IR-IO-APIC-edge rtc0
9: 0 0 0 0 IR-IO-APIC-fasteoi acpi
23: 0 0 36 1 IR-IO-APIC-fasteoi ehci_hcd:usb1
56: 0 0 0 0 DMAR_MSI-edge dmar0
57: 0 0 0 0 DMAR_MSI-edge dmar1
58: 68468 113385 59982 38591 IR-PCI-MSI-edge xhci_hcd
59: 17 9185792 29 6 IR-PCI-MSI-edge eno1
60: 660908 640712 274180 280446 IR-PCI-MSI-edge ahci
61: 379094 149796 827403 152584 IR-PCI-MSI-edge i915
62: 13 0 0 0 IR-PCI-MSI-edge mei_me
63: 263 1 6 1 IR-PCI-MSI-edge snd_hda_intel
64: 1770 506 106 516 IR-PCI-MSI-edge snd_hda_intel
NMI: 967 983 762 745 Non-maskable interrupts
LOC: 32241233 32493830 20152850 20140483 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 967 983 762 745 Performance monitoring interrupts
IWI: 122505 122449 110316 112272 IRQ work interrupts
RTR: 0 0 0 0 APIC ICR read retries
RES: 2486212 2351025 1841935 1821599 Rescheduling interrupts
CAL: 483791 496810 318516 290537 Function call interrupts
TLB: 231573 234010 173163 171368 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 512 512 512 512 Machine check polls
As mentioned, in a multi-core environment, context switches can cause a performance overhead. You can see whether they occur often by using the top utility. It can show the CPU that was last used by each process, but you have to switch this option on. To do that, from the top utility, first press f and then type j (on some distributions, you'll have to scroll instead, to select the appropriate option). This switches on the last used CPU (SMP) option for an SMP environment. In Listing 15-6, you can see the interface from which you can do this. Note that to make this setting permanent, you can use the W command from top. This writes all modifications to the top program to the ~/.toprc file, so that they are loaded again the next time top starts.
Listing 15-6. After Pressing the F Key, You Can Switch Different Options On or Off in top
Fields Management for window 1:Def, whose current sort field is %CPU
Navigate with Up/Dn, Right selects for move then <Enter> or Left commits,
'd' or <Space> toggles display, 's' sets sort. Use 'q' or <Esc> to end!
* PID = Process Id TIME = CPU Time
* USER = Effective User Name SWAP = Swapped Size (KiB)
* PR = Priority CODE = Code Size (KiB)
* NI = Nice Value DATA = Data+Stack (KiB)
* VIRT = Virtual Image (KiB) nMaj = Major Page Faults
* RES = Resident Size (KiB) nMin = Minor Page Faults
* SHR = Shared Memory (KiB) nDRT = Dirty Pages Count
* S = Process Status WCHAN = Sleeping in Function
* %CPU = CPU Usage Flags = Task Flags <sched.h>
* %MEM = Memory Usage (RES) CGROUPS = Control Groups
* TIME+ = CPU Time, hundredths SUPGIDS = Supp Groups IDs
* COMMAND = Command Name/Line SUPGRPS = Supp Groups Names
PPID = Parent Process pid TGID = Thread Group Id
UID = Effective User Id ENVIRON = Environment vars
RUID = Real User Id vMj = Major Faults delta
RUSER = Real User Name vMn = Minor Faults delta
SUID = Saved User Id USED = Res+Swap Size (KiB)
SUSER = Saved User Name nsIPC = IPC namespace Inode
GID = Group Id nsMNT = MNT namespace Inode
GROUP = Group Name nsNET = NET namespace Inode
PGRP = Process Group Id nsPID = PID namespace Inode
TTY = Controlling Tty nsUSER = USER namespace Inode
TPGID = Tty Process Grp Id nsUTS = UTS namespace Inode
SID = Session Id
nTH = Number of Threads
P = Last Used Cpu (SMP)
After switching the last used CPU option on, you will see the column P in top that displays the number of the CPU that was last used by a process.
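Outside of top, the same information is available through ps: the psr output field shows the processor a task last ran on. For example:

```shell
# psr = the CPU core this process last ran on
ps -o pid,psr,comm -p $$
```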
Using vmstat
To monitor CPU utilization, top offers a very good starting point. If that doesn't offer you enough, you may prefer the vmstat utility. With vmstat, you can get a nice, detailed view of what is happening on your server. Of special interest is the CPU section, which contains the five most important parameters on CPU usage:
us: CPU time spent running user (non-kernel) code
sy: CPU time spent running kernel (system) code
id: time the CPU has been idle
wa: time the CPU has been waiting for I/O to complete
st: time stolen from this virtual machine by the hypervisor
When working with vmstat, you should know that there are two ways to use it. Probably the most useful is the so-called sample mode. In this mode, a sample is taken every n seconds: specify the number of seconds for the sample interval as an argument when starting vmstat. Running performance monitoring utilities this way is always a good idea, because it shows you how values develop over a given period of time. You may find it useful, as well, to run vmstat for a limited time only, by passing the number of samples as a second argument.
Another useful way to run vmstat is with the option -s. In this mode, vmstat shows statistics collected since the system booted. As you can see in Listing 15-7, apart from the CPU-related items, vmstat shows information about processes, memory, swap, I/O, and system activity as well. These options are covered later in this chapter.
Listing 15-7. Using vmstat -s
[root@lab ~]# vmstat -s
16196548 K total memory
14783440 K used memory
11201308 K active memory
3031324 K inactive memory
1413108 K free memory
3956 K buffer memory
3571580 K swap cache
4194300 K total swap
0 K used swap
4194300 K free swap
1562406 non-nice user cpu ticks
1411 nice user cpu ticks
294539 system cpu ticks
57856573 idle cpu ticks
22608 IO-wait cpu ticks
12 IRQ cpu ticks
5622 softirq cpu ticks
0 stolen cpu ticks
23019937 pages paged in
37008693 pages paged out
842 pages swapped in
3393 pages swapped out
129706133 interrupts
344528651 CPU context switches
1408204254 boot time
132661 forks
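The interrupt and context switch totals that vmstat -s reports come from /proc/stat, which you can also read directly:

```shell
# cumulative counts since boot: ctxt = context switches, intr = interrupts
# (for intr, the second field is the total; the rest are per-interrupt counts)
grep -E '^(ctxt|intr) ' /proc/stat | awk '{print $1, $2}'
```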
Analyzing Memory Usage
Memory is probably the most important component of your server, from a performance perspective. The CPU can only work smoothly if processes are ready in memory and can be served from there. If not, the server has to fetch its data from the I/O channel, which is about 1,000 times slower to access than memory. From the processor's point of view, even system RAM is relatively slow. Therefore, modern server processors have large amounts of cache, which is faster still.
You have read how to interpret basic memory statistics, as provided by top earlier in this chapter; therefore, I will not cover them again. In this section, you can read about some more advanced memory-related information.
Page Size
A basic concept in memory handling is the memory page size. On an x86_64 system, 4KB pages are typically used. This means that everything that happens, happens in chunks of 4KB. That's fine if you have a server handling large numbers of small files. If, however, your server handles huge files, it is highly inefficient to use only these small 4KB pages. For that purpose, your server can use huge pages, with a default size of 2MB per page. Later in this chapter, you'll learn how to configure huge pages.
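You can verify the page size on your own system with getconf, and check the huge page configuration in /proc/meminfo:

```shell
# default memory page size in bytes (4096 on typical x86_64 systems)
getconf PAGE_SIZE
# huge page size and counts, if huge pages are configured
awk '/^Hugepagesize|^HugePages_(Total|Free)/ {print}' /proc/meminfo
```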
A server can run out of memory. In that event, it starts swapping. Swap memory is emulated RAM on the server's hard drive. Because swap involves the hard disk, you should avoid it if possible; access times to a hard drive are about 1,000 times slower than access times to RAM. To monitor current swap use, you can use free -m, which shows the amount of swap currently in use. See Listing 15-8 for an example.
Listing 15-8. free -m Provides Information About Swap Usage
[root@lab ~]# free -m
total used free shared buffers cached
Mem: 15816 14438 1378 475 3 3487
-/+buffers/cache: 10946 4870
Swap: 4095 0 4095
As you can see in the preceding listing, on the server where this sample comes from, nothing is wrong; there is no swap usage at all, and that is good.
If, on the other hand, you see that your server is swapping, the next thing you must know is how actively it is swapping. To provide information about this, the vmstat utility provides useful information. This utility provides swap information in the si (swap in) and so (swap out) columns.
If swap space is used, you should also have a look at the /proc/meminfo file, to relate the use of swap to the amount of inactive anon memory. If the amount of swap in use is smaller than the amount of inactive anon memory you observe in /proc/meminfo, there's no problem: only pages that weren't being used actively have been moved out, and performance doesn't suffer. If, however, the amount of swap in use is larger than the amount of inactive anon memory, you're probably in trouble, because active memory is being swapped. That generates a lot of I/O traffic, which will slow down your system, and if it happens, you should install more RAM.
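Besides sampling si and so with vmstat, you can read the cumulative swap counters straight from /proc/vmstat:

```shell
# pages swapped in and out since boot; if these counters keep growing,
# the system is actively swapping
grep -E '^pswp(in|out) ' /proc/vmstat
```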
Kernel Memory
When analyzing memory usage, you should also take into account the memory that is used by the kernel itself. This is called slab memory. You can see the amount of slab memory currently in use in the /proc/meminfo file. Normally, the amount of kernel memory in use is relatively small. To get more information about it, you can use the slabtop utility.
This utility provides information about the different parts (referred to as objects) of the kernel and what exactly they are doing. For normal performance analysis purposes, the SIZE and NAME columns are the most interesting ones. The other columns are of interest mainly to programmers and kernel developers and, therefore, are not described in this chapter. In Listing 15-9, you can see an example of information provided by slabtop.
Listing 15-9. The slabtop Utility Provides Information About Kernel Memory Usage
Active / Total Objects (% used) : 1859018 / 2294038 (81.0%)
Active / Total Slabs (% used) : 56547 / 56547 (100.0%)
Active / Total Caches (% used) : 75 / 109 (68.8%)
Active / Total Size (% used) : 275964.30K / 327113.79K (84.4%)
Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.69K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
1202526 786196 65% 0.10K 30834 39 123336K buffer_head
166912 166697 99% 0.03K 1304 128 5216K kmalloc-32
134232 134106 99% 0.19K 6392 21 25568K dentry
122196 121732 99% 0.08K 2396 51 9584K selinux_inode_security
115940 115940 100% 0.02K 682 170 2728K fsnotify_event_holder
99456 98536 99% 0.06K 1554 64 6216K kmalloc-64
79360 79360 100% 0.01K 155 512 620K kmalloc-8
70296 70296 100% 0.64K 2929 24 46864K proc_inode_cache
64512 63218 97% 0.02K 252 256 1008K kmalloc-16
38248 26376 68% 0.57K 1366 28 21856K radix_tree_node
29232 29232 100% 1.00K 1827 16 29232K xfs_inode
28332 28332 100% 0.11K 787 36 3148K sysfs_dir_cache
28242 27919 98% 0.21K 1569 18 6276K vm_area_struct
18117 17926 98% 0.58K 671 27 10736K inode_cache
14992 14150 94% 0.25K 937 16 3748K kmalloc-256
10752 10752 100% 0.06K 168 64 672K anon_vma
9376 8206 87% 0.12K 293 32 1172K kmalloc-128
8058 8058 100% 0.04K 79 102 316K Acpi-Namespace
7308 7027 96% 0.09K 174 42 696K kmalloc-96
4788 4788 100% 0.38K 228 21 1824K blkdev_requests
4704 4704 100% 0.07K 84 56 336K Acpi-ParseExt
The most interesting information a system administrator gets from slabtop is the amount of memory a particular slab (a part of the kernel) is using. If, for instance, you've recently performed some tasks on the file system, you may find that the inode_cache is relatively large. If that lasts only a short period of time, it's no problem: the Linux kernel starts routines when they are needed and shuts them down quickly when they're no longer needed. If, however, you see that one continuously running routine uses large amounts of memory, that might be an indication that you have some optimization to do.
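If you just need the totals rather than the per-object breakdown that slabtop provides, /proc/meminfo has them as well:

```shell
# total slab memory, split into the part the kernel can reclaim
# under memory pressure and the part it cannot
grep -E '^(Slab|SReclaimable|SUnreclaim):' /proc/meminfo
```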
EXERCISE 15-3. ANALYZING KERNEL MEMORY
In this exercise, you’ll cause a little bit of stress on your server, and you’re going to use slabtop to find out which parts of the kernel are getting busy. As the Linux kernel is sophisticated and uses its resources as efficiently as possible, you won’t see huge changes, but some subtle changes can be detected anyway.
Using ps for Analyzing Memory
When tuning memory utilization, there is one more utility that you should never forget, and that is ps. The advantage of ps is that it gives memory usage information on all processes on your server, and it is easy to grep its output for information about particular processes. To monitor memory usage, the ps aux command is very useful. It provides memory information in the VSZ and RSS columns. The VSZ (Virtual Size) parameter shows the virtual memory that is used; this relates to the total amount of memory that is claimed by a process. The RSS (Resident Size) parameter refers to the amount of memory that is really in use. Listing 15-10 gives an example of some lines of ps aux output.
Listing 15-10. ps aux Gives Memory Usage Information for Particular Processes
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 53500 7664 ? Ss Aug16 0:07 /usr/lib/systemd/systemd --switched-root --system --deserialize 23
root 2 0.0 0.0 0 0 ? S Aug16 0:00 [kthreadd]
...
qemu 31274 2.0 2.5 1286920 407748 ? Sl 11:16 4:56 /usr/libexec/qemu-kvm -name vm
root 31276 0.0 0.0 0 0 ? S 11:16 0:00 [vhost-31274]
root 31280 0.0 0.0 0 0 ? S 11:16 0:00 [kvm-pit/31274]
qemu 31301 2.0 2.5 1287656 412868 ? Sl 11:16 4:58 /usr/libexec/qemu-kvm -name vm
root 31303 0.0 0.0 0 0 ? S 11:16 0:00 [vhost-31301]
root 31307 0.0 0.0 0 0 ? S 11:16 0:00 [kvm-pit/31301]
root 31314 0.0 0.0 0 0 ? S 11:16 0:00 [kworker/u8:2]
qemu 31322 2.1 2.5 1284036 413216 ? Sl 11:16 5:01 /usr/libexec/qemu-kvm -name vm
root 31324 0.0 0.0 0 0 ? S 11:16 0:00 [vhost-31322]
root 31328 0.0 0.0 0 0 ? S 11:16 0:00 [kvm-pit/31322]
qemu 31347 2.1 2.5 1284528 408636 ? Sl 11:16 5:01 /usr/libexec/qemu-kvm -name vm
root 31350 0.0 0.0 0 0 ? S 11:16 0:00 [vhost-31347]
root 31354 0.0 0.0 0 0 ? S 11:16 0:00 [kvm-pit/31347]
When looking at the output of ps aux, you may notice that there are two different kinds of processes. The name of some are between square brackets; the names of others are not. If the name of a process is between square brackets, the process is part of the kernel. All other processes are “normal” processes.
If you need more information about a process and what exactly it is doing, there are two ways to get that information. First, you can check the /proc directory for the particular process, for example, /proc/5658 gives information for the process with PID 5658. In this directory, you’ll find the maps file that gives some more insight into how memory is mapped for this process. As you can see in Listing 15-11, this information is rather detailed. It includes the exact memory addresses this process is using and even tells you about subroutines and libraries that are related to this process.
Listing 15-11. The /proc/PID/maps File Gives Detailed Information on Memory Utilization of Particular Processes
00400000-004dd000 r-xp 00000000 fd:01 134326347 /usr/bin/bash
006dc000-006dd000 r--p 000dc000 fd:01 134326347 /usr/bin/bash
006dd000-006e6000 rw-p 000dd000 fd:01 134326347 /usr/bin/bash
006e6000-006ec000 rw-p 00000000 00:00 0
014d0000-015d6000 rw-p 00000000 00:00 0 [heap]
7fcae4779000-7fcaeaca0000 r--p 00000000 fd:01 201334187 /usr/lib/locale/locale-archive
7fcaeaca0000-7fcaeacab000 r-xp 00000000 fd:01 201334158 /usr/lib64/libnss_files-2.17.so
7fcaeacab000-7fcaeaeaa000 ---p 0000b000 fd:01 201334158 /usr/lib64/libnss_files-2.17.so
7fcaeaeaa000-7fcaeaeab000 r--p 0000a000 fd:01 201334158 /usr/lib64/libnss_files-2.17.so
7fcaeaeab000-7fcaeaeac000 rw-p 0000b000 fd:01 201334158 /usr/lib64/libnss_files-2.17.so
7fcaeaeac000-7fcaeb062000 r-xp 00000000 fd:01 201334140 /usr/lib64/libc-2.17.so
7fcaeb062000-7fcaeb262000 ---p 001b6000 fd:01 201334140 /usr/lib64/libc-2.17.so
7fcaeb262000-7fcaeb266000 r--p 001b6000 fd:01 201334140 /usr/lib64/libc-2.17.so
7fcaeb266000-7fcaeb268000 rw-p 001ba000 fd:01 201334140 /usr/lib64/libc-2.17.so
7fcaeb268000-7fcaeb26d000 rw-p 00000000 00:00 0
The pmap command also shows what a process is doing. It gets its information from the /proc/PID/maps file. One of the advantages of the pmap command is that it gives detailed information about the order in which a process does its work. You can see calls to external libraries, as well as the additional memory allocation (malloc) requests that the program makes, reflected in the lines that have [anon] at the end.
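As a quick illustration, the same information can be pulled from /proc directly. The sketch below counts file-backed (or named) mappings versus anonymous mappings for the current shell; pmap -x with the same PID would show the same data with per-mapping sizes.

```shell
# Count file-backed vs. anonymous mappings for the current shell.
# In /proc/PID/maps, mapped lines have 6 fields when a pathname (or a
# label such as [heap] or [stack]) is present, and 5 fields when the
# mapping is anonymous.
pid=$$
named=$(awk 'NF >= 6' /proc/$pid/maps | wc -l)
anon=$(awk 'NF < 6' /proc/$pid/maps | wc -l)
echo "PID $pid: $named named mappings, $anon anonymous mappings"
```

This is only a rough classification, but it is a handy first look before digging into the full maps output.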
Monitoring Storage Performance
One of the hardest things to do properly is the monitoring of storage utilization. The reason is that the storage channel typically is at the end of the chain. Other elements in your server can have a positive as well as a negative influence on storage performance. For example, if your server is low on memory, that will be reflected in storage performance, because if you don’t have enough memory, there can’t be a lot of cache and buffers, and thus, your server has more work to do on the storage channel.
Likewise, a slow CPU can have a negative impact on storage performance, because the queue of runnable processes can’t be cleared fast enough. Therefore, before jumping to the conclusion that you have bad performance on the storage channel, you should really try to take other factors into consideration as well.
It is generally hard to optimize storage performance on a server. The best behavior really depends on the kind of workload your server typically has. For instance, a server that has a lot of reads has other needs than a server that does mainly write. A server that is doing writes most of the time can benefit from a storage channel with many disks, because more controllers can work on clearing the write buffer cache from memory. If, however, your server is mainly reading data, the effect of having many disks is just the opposite. Because of the large amount of disks, seek times will increase, and therefore, performance will be negatively affected.
There are a few common indicators of storage performance problems, such as a consistently high iowait percentage in top, or long device service times and high utilization figures in iostat. If you see one of these on your server, go and analyze what is happening.
Understanding How Disks Work
Before trying to understand storage performance, there is another factor that you should consider, and that is the way that storage activity typically takes place. First, a storage device, in general, handles large sequential transfers better than small random transfers. This is because you can configure read ahead, which means that the storage controller already fetches the next block it probably has to go to. If your server mostly handles small files, read ahead buffers will have no effect at all; worse, they will only slow it down.
In addition, you should be aware that in modern environments, three different types of storage devices are used. If storage is handled by a Storage Area Network (SAN), it’s often not possible to do much about storage optimization. If local storage is used, it makes a big difference if that is SSD-based storage or storage that uses rotating platters.
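Whether a local disk uses rotating platters can be checked from sysfs. The sketch below loops over whatever block devices are present; note that many virtual disks also report 0 here, so treat the answer as a hint rather than proof of an SSD.

```shell
# 1 = rotating platters, 0 = non-rotating (SSD, but also many virtual disks)
for f in /sys/block/*/queue/rotational; do
    [ -e "$f" ] || continue            # skip if no block devices are present
    dev=$(echo "$f" | cut -d/ -f4)     # device name is the 4th path component
    echo "$dev: $(cat "$f")"
done
```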
From the tools perspective, there are three tools that really count when doing disk performance analysis. The first tool to start your disk performance analysis is vmstat. This tool has a couple of options that help you see what is happening on a particular disk device, such as -d, which gives you statistics for individual disks, or -p, which gives partition performance statistics. As you have already seen, you can use vmstat with an interval parameter and a count parameter as well. In Listing 15-12, you can see the result of the command vmstat -d, which gives detailed information on storage utilization for all disk devices on your server.
Listing 15-12. To Understand Storage Usage, Start with vmstat
[root@lab ~]# vmstat -d
disk- ------------reads------------ ------------writes----------- -----IO------
total merged sectors ms total merged sectors ms cur sec
sda 932899 1821123 46129712 596065 938744 2512536 74210979 3953625 0 731
dm-0 1882 0 15056 537 3397 0 27160 86223 0 0
dm-1 17287 0 1226434 17917 62316 0 17270450 2186073 0 93
sdb 216 116 1686 182 0 0 0 0 0 0
dm-2 51387 0 2378598 16168 58063 0 3224216 130009 0 35
dm-3 51441 0 2402329 25443 55309 0 3250147 140122 0 40
In the output of this command, you can see detailed statistics about the reads and writes that have occurred on a disk. The following parameters are displayed when using vmstat -d:
- reads total: the total number of read requests completed successfully
- reads merged: the number of adjacent read requests that were merged into a single request
- reads sectors: the total number of sectors read
- reads ms: the total time spent reading, in milliseconds
- writes total, merged, sectors, and ms: the same counters, for write requests
- IO cur: the number of I/O operations currently in progress
- IO sec: the number of seconds spent doing I/O
Another way of monitoring disk performance with vmstat is by running it in sample mode. For example, the command vmstat 2 10 will run ten samples with a two-second interval. Listing 15-13 shows the result of this command.
Listing 15-13. In Sample Mode, You Can Get a Real-Time Impression of Disk Utilization
[root@lab ~]# vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 1319012 3956 3574176 0 0 36 58 26 8 3 1 97 0 0
0 0 0 1318532 3956 3574176 0 0 0 2 1212 3476 2 1 97 0 0
0 0 0 1318540 3956 3574176 0 0 0 0 1189 3469 2 1 97 0 0
0 0 0 1318788 3956 3574176 0 0 0 0 1250 3826 3 1 97 0 0
0 0 0 1317852 3956 3574176 0 0 0 0 1245 3816 3 1 97 0 0
0 0 0 1318044 3956 3574176 0 0 0 0 1208 3675 2 0 97 0 0
1 0 0 1318044 3956 3574176 0 0 0 0 1193 3384 2 1 97 0 0
0 0 0 1318044 3956 3574176 0 0 0 0 1212 3419 2 0 97 0 0
0 0 0 1318044 3956 3574176 0 0 0 0 1229 3506 2 1 97 0 0
3 0 0 1318028 3956 3574176 0 0 0 0 1227 3738 2 1 97 0 0
The columns that count in the preceding sample listing are the io: bi and io: bo columns, because they show the number of blocks that came in from the storage channel (bi) and the number of blocks that were written to the storage channel (bo).
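If you want a single figure rather than a screen full of samples, the bi and bo columns can be aggregated with a short pipeline. This is a sketch assuming the standard vmstat column layout shown above, where bi and bo are fields 9 and 10.

```shell
# Average blocks in/out per second over three one-second vmstat samples.
# NR > 2 skips the two header lines.
if command -v vmstat >/dev/null; then
    vmstat 1 3 | awk 'NR > 2 { bi += $9; bo += $10; n++ }
        END { if (n) printf "avg bi=%.0f avg bo=%.0f blocks/s\n", bi/n, bo/n }'
fi
```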
Another tool to monitor performance on the storage channel is iostat. It is not installed by default; use zypper in sysstat if you don't have it. It provides an overview, per device, of the number of reads and writes. In the example in Listing 15-14, you can see the following device parameters being displayed:
- tps: the number of transfers (I/O requests) issued to the device per second
- kB_read/s and kB_wrtn/s: the number of kilobytes read from and written to the device per second
- kB_read and kB_wrtn: the total number of kilobytes read and written since boot
Listing 15-14. The iostat Utility Provides Information About the Number of Blocks That Were Read and Written per Second
[root@hnl ~]# iostat
Linux 3.10.0-123.el7.x86_64 (lab.sandervanvugt.nl) 08/18/2014 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
2.63 0.00 0.53 0.04 0.00 96.80
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 11.28 138.98 223.59 23064928 37106736
dm-0 0.03 0.05 0.08 7528 13580
dm-1 0.48 3.70 52.04 613289 8636472
sdb 0.00 0.01 0.00 843 0
dm-2 0.66 7.17 9.71 1189299 1612108
dm-3 0.64 7.24 9.79 1201164 1625073
dm-4 0.65 7.24 9.62 1201986 1596805
dm-5 0.65 7.38 9.62 1225284 1596418
dm-6 0.65 7.38 9.57 1224767 1588105
dm-7 0.65 7.31 9.53 1213582 1582201
If, when used in this way, iostat doesn’t give you enough detail, you can use the -x option as well. This option gives much more information and, therefore, doesn’t fit on the screen nicely, in most cases. In Listing 15-15, you can see an example.
Listing 15-15. iostat -x Gives You Much More Information About What Is Happening on the Storage Channel
[root@hnl ~]# iostat -x
Linux 3.10.0-123.el7.x86_64 (lab.sandervanvugt.nl) 08/18/2014 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
2.63 0.00 0.53 0.04 0.00 96.80
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 10.97 15.13 5.62 5.66 138.94 223.52 64.29 0.03 2.43 0.64 4.21 0.39 0.44
dm-0 0.00 0.00 0.01 0.02 0.05 0.08 8.00 0.00 16.43 0.29 25.38 0.15 0.00
dm-1 0.00 0.00 0.10 0.38 3.69 52.02 231.77 0.01 27.61 1.04 34.96 1.18 0.06
sdb 0.00 0.00 0.00 0.00 0.01 0.00 7.81 0.00 0.84 0.84 0.00 0.82 0.00
When using the -x option, iostat gives you the following information:
- rrqm/s and wrqm/s: the number of read and write requests merged per second
- r/s and w/s: the number of read and write requests issued to the device per second
- rkB/s and wkB/s: the number of kilobytes read from and written to the device per second
- avgrq-sz: the average size (in sectors) of the requests issued to the device
- avgqu-sz: the average queue length of requests issued to the device
- await: the average time (in milliseconds) for I/O requests to be served, including the time spent waiting in the queue
- r_await and w_await: the same figure, split out for read and write requests
- svctm: the average service time; note that this field is unreliable and deprecated in recent versions of iostat
- %util: the percentage of elapsed time during which I/O requests were issued to the device; values approaching 100% indicate device saturation
Finding Most Busy Processes with iotop
The most useful tool for analyzing I/O performance per process is iotop. This tool also is not installed by default. Use zypper in iotop to install it. Running iotop is as easy as running top. Just start the utility, and you will see which process is causing you an I/O headache. The busiest process is listed on top, and you can also see details about the reads and writes that this process performs (see Listing 15-16).
Within iotop, you’ll see two different kinds of processes. There are processes whose name is written between square brackets. These are kernel processes that aren’t loaded as a separate binary but are a part of the kernel itself. All other processes listed are normal binaries.
Listing 15-16. Analyzing I/O Performance with iotop
[root@hnl ~]# iotop
Total DISK READ : 0.00 B/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
24960 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.01 % [kworker/1:2]
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % systemd --switche~ --deserialize 23
2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd]
3 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]
16388 be/4 qemu 0.00 B/s 0.00 B/s 0.00 % 0.00 % qemu-kvm -name vm~us=pci.0,addr=0x7
5 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/0:0H]
16390 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [vhost-16388]
7 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0]
8 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_bh]
9 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcuob/0]
10 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcuob/1]
11 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcuob/2]
12 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcuob/3]
Normally, you would start analyzing I/O performance because of an abnormality in the regular I/O load. For example, you may find a high wa indicator in top. In Exercise 15-4, you’ll explore an I/O problem using this approach.
EXERCISE 15-4. EXPLORING I/O PERFORMANCE
In this exercise, you’ll start a couple of I/O-intensive tasks. You’ll first see abnormal behavior occurring in top, after which you’ll use iotop to explore what is going on.
#!/bin/bash
# Generate continuous I/O load by copying a directory tree around
while true
do
cp -R /etc /blah.tmp
rm -rf /blah.tmp
sync
done
Understanding Network Performance
On a typical server, network performance is as important as disk, memory, and CPU performance. After all, the data has to be delivered over the network to the end user. The problem, however, is that things aren't always as they seem. In some cases, a network problem can be caused by a memory shortage on the server. If, for example, packets get dropped on the network, the reason may very well be that your server just doesn't have enough buffers reserved for receiving packets, which may be because your server is low on memory. Again, everything is related, and it's your task to find the real cause of the trouble.
When considering network performance, you should always ask yourself what exactly you want to know. As you are aware, several layers of communication are used on the network. If you want to analyze a problem with your Samba server, that requires a completely different approach from analyzing a problem with dropped packets. A good network performance analysis always works bottom-up. That means that you first have to check what is happening at the physical layer of the OSI model and then work up through the Ethernet, IP, TCP/UDP, and application protocol layers.
When analyzing network performance, you should always start by checking the status of the network interface itself. Don’t use ifconfig; it really is a deprecated utility. Use ip -s link instead (see Listing 15-17).
Listing 15-17. Use ip -s link to See What Is Happening on Your Network Board
[root@vm8 ~]# ip -s link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
RX: bytes packets errors dropped overrun mcast
0 0 0 0 0 0
TX: bytes packets errors dropped carrier collsns
0 0 0 0 0 0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
link/ether 52:54:00:30:3f:94 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped overrun mcast
2824323 53309 0 0 0 0
TX: bytes packets errors dropped carrier collsns
8706 60 0 0 0 0
The most important information given by ip -s link is the number of packets that have been transmitted and received.
It's not so much the number of packets that is of interest here but, mainly, the number of erroneous packets. In fact, all of the error counters should be 0 at all times. If you see anything else, you should check what is going on. The following error indicators are displayed:
- errors: packets that were received or transmitted with errors, such as checksum errors
- dropped: packets that were dropped, for example, because the receive buffers were full
- overrun: receiver overruns, which occur when the network interface cannot keep up with the rate of incoming packets
- carrier: carrier losses on transmit, often an indication of a cabling or duplex problem
- collsns: collisions, which should not occur at all on a modern switched network
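These counters can also be read directly from sysfs, which is convenient when you want to script a quick health check across all interfaces. A minimal sketch, using the standard Linux /sys layout:

```shell
# Error and drop counters per interface, straight from sysfs
# (the same data that ip -s link reports)
for dev in /sys/class/net/*; do
    [ -d "$dev/statistics" ] || continue
    printf '%-10s rx_errors=%s rx_dropped=%s tx_errors=%s\n' \
        "$(basename "$dev")" \
        "$(cat "$dev/statistics/rx_errors")" \
        "$(cat "$dev/statistics/rx_dropped")" \
        "$(cat "$dev/statistics/tx_errors")"
done
```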
If you see a problem when using ip -s link, the next step should be to check your network board settings. Use ethtool to find out the settings you’re currently using and make sure they match the settings of other network components, such as switches. (Note that this command does not work on many KVM virtual machines.) Listing 15-18 shows what you can expect.
Listing 15-18. Use ethtool to Check Settings of Your Network Board
[root@lab ~]# ethtool eno1
Settings for eno1:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 2
Transceiver: internal
Auto-negotiation: on
MDI-X: off (auto)
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes
Typically, only a few parameters from the ethtool output are of interest, and those are the Speed and Duplex settings. They show you how your network board is talking to other nodes. If you see, for example, that your server is set to full duplex, whereas all other nodes in your network use half duplex, you've found your problem and know what you need to fix. Duplex misconfigurations are becoming more and more uncommon, however. A more common error is that the supported link speed cannot be reached. If a network card supports gigabit but reaches only 100Mbit/s, that is often due to a misconfiguration or hardware problem in one of the network devices involved.
Another good tool with which to monitor what is happening on the network is IPTraf-ng (start it by typing iptraf-ng). This useful tool, however, is not included in the default installation or SLES repositories. You can download the RPM from the Internet, after which it can be installed manually. This is a real-time monitoring tool that shows what is happening on the network from a text-user interface. After starting, it will show you a menu from which you can choose what you want to see. Different useful filtering options are offered. (See Figure 15-1.)
Figure 15-1. IPTraf allows you to analyze network traffic from a menu
Before starting IPTraf, use the configure option. From there, you can specify exactly what you want to see and how you want it to be displayed. For instance, a useful setting to change is the additional port range. By default, IPTraf shows activity on privileged TCP/UDP ports only. If you have a specific application that you want to monitor that doesn’t use one of these privileged ports, select Additional ports from the configuration interface and specify additional ports that you want to monitor. (See Figure 15-2.)
Figure 15-2. Use the filter options to select what you want to see
After telling IPTraf how to do its work, use the IP traffic monitor option to start the tool. Next, you can select the interface on which you want to listen, or just hit Enter to listen on all interfaces. This will start the IPTraf interface, which displays everything that is going on at your server and also exactly on what port it is happening. In Figure 15-3, you can see that the server that is monitored currently has two sessions enabled, and also you can see which are the IP addresses and ports involved in that session.
Figure 15-3. IPtraf gives a quick overview of the kind of traffic sent on an interface
If it's not so much the performance of the network board that you are interested in but more what is happening at the service level, netstat is a good basic network performance tool. It uses different parameters to show you what ports are open and on what ports your server sees activity. My personal favorite way of using netstat is by issuing the netstat -tulpn command. This gives an overview of all listening ports on the server, together with the process that owns each port. (To see established connections, including the remote nodes involved, drop the l option, as in netstat -tupn.) See Listing 15-19 for an overview.
Listing 15-19. With netstat, You Can See What Ports Are Listening on Your Server and Who Is Connected
[root@lab ~]# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:5913 0.0.0.0:* LISTEN 31322/qemu-kvm
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 1980/master
tcp 0 0 127.0.0.1:5914 0.0.0.0:* LISTEN 31347/qemu-kvm
tcp 0 0 127.0.0.1:6010 0.0.0.0:* LISTEN 28676/sshd: sander@
tcp 0 0 0.0.0.0:48702 0.0.0.0:* LISTEN 1542/rpc.statd
tcp 0 0 0.0.0.0:2022 0.0.0.0:* LISTEN 1509/sshd
tcp 0 0 127.0.0.1:5900 0.0.0.0:* LISTEN 13719/qemu-kvm
tcp 0 0 127.0.0.1:5901 0.0.0.0:* LISTEN 16388/qemu-kvm
tcp 0 0 127.0.0.1:5902 0.0.0.0:* LISTEN 18513/qemu-kvm
tcp 0 0 127.0.0.1:5903 0.0.0.0:* LISTEN 18540/qemu-kvm
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1498/rpcbind
tcp 0 0 127.0.0.1:5904 0.0.0.0:* LISTEN 18450/qemu-kvm
tcp 0 0 127.0.0.1:5905 0.0.0.0:* LISTEN 18919/qemu-kvm
tcp 0 0 127.0.0.1:5906 0.0.0.0:* LISTEN 19542/qemu-kvm
tcp 0 0 127.0.0.1:5907 0.0.0.0:* LISTEN 19586/qemu-kvm
tcp 0 0 127.0.0.1:5908 0.0.0.0:* LISTEN 19631/qemu-kvm
tcp 0 0 127.0.0.1:5909 0.0.0.0:* LISTEN 24773/qemu-kvm
tcp 0 0 192.168.122.1:53 0.0.0.0:* LISTEN 2939/dnsmasq
tcp 0 0 127.0.0.1:5910 0.0.0.0:* LISTEN 31234/qemu-kvm
tcp 0 0 127.0.0.1:5911 0.0.0.0:* LISTEN 31274/qemu-kvm
tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN 3228/cupsd
tcp 0 0 127.0.0.1:5912 0.0.0.0:* LISTEN 31301/qemu-kvm
tcp6 0 0 ::1:25 :::* LISTEN 1980/master
tcp6 0 0 ::1:6010 :::* LISTEN 28676/sshd: sander@
tcp6 0 0 :::2022 :::* LISTEN 1509/sshd
tcp6 0 0 :::111 :::* LISTEN 1498/rpcbind
tcp6 0 0 :::58226 :::* LISTEN 1542/rpc.statd
tcp6 0 0 :::21 :::* LISTEN 25370/vsftpd
tcp6 0 0 fe80::fc54:ff:fe88:e:53 :::* LISTEN 2939/dnsmasq
tcp6 0 0 ::1:631 :::* LISTEN 3228/cupsd
udp 0 0 192.168.122.1:53 0.0.0.0:* 2939/dnsmasq
udp 0 0 0.0.0.0:67 0.0.0.0:* 2939/dnsmasq
udp 0 0 0.0.0.0:111 0.0.0.0:* 1498/rpcbind
udp 0 0 0.0.0.0:123 0.0.0.0:* 926/chronyd
udp 0 0 127.0.0.1:323 0.0.0.0:* 926/chronyd
udp 0 0 0.0.0.0:816 0.0.0.0:* 1498/rpcbind
udp 0 0 127.0.0.1:870 0.0.0.0:* 1542/rpc.statd
udp 0 0 0.0.0.0:35523 0.0.0.0:* 891/avahi-daemon: r
udp 0 0 0.0.0.0:52582 0.0.0.0:* 1542/rpc.statd
udp 0 0 0.0.0.0:5353 0.0.0.0:* 891/avahi-daemon: r
When using netstat, many options are available. Following is an overview of the most interesting ones:
- -t: show TCP sockets
- -u: show UDP sockets
- -l: show only sockets in the listening state
- -p: show the PID and name of the program that owns each socket
- -n: show numeric addresses and port numbers instead of resolving names
There are many other tools to monitor the network as well; most of them fall beyond the scope of this chapter, because they are rather protocol- or service-specific and won't help you as much in finding performance problems on the network. There is, however, one very simple performance-testing method that I always use when analyzing a performance problem, which I will discuss at the end of this section.
In many cases, to judge network performance, you’re only interested in knowing how fast data can be copied to and from your server. After all, that’s the only parameter that you can change. To measure that, you can use a simple test. I like to create a big file (1GB, for example) and copy that over the network. To measure time, I use the time command, which gives a clear impression of how long it really took to copy the file. For example, time scp server:/bigfile /localdir will end with a summary of the total time it took to copy the file over. This is an excellent test, especially when you start optimizing performance, as it will show you immediately whether or not you’ve reached your goals.
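The test described above can be sketched in a few lines. The "server" host in the scp comment is a placeholder; the sketch itself times a local copy so that it is self-contained, and it uses a small 64MB file to keep the demo quick (use count=1024 for a real 1GB test).

```shell
# Time how long it takes to copy a test file around
src=/tmp/bigfile.$$
dst=/tmp/bigfile.copy.$$
dd if=/dev/zero of="$src" bs=1M count=64 status=none
start=$(date +%s%N)                 # nanoseconds since the epoch (GNU date)
cp "$src" "$dst"                    # over the network: scp server:/bigfile /localdir
end=$(date +%s%N)
echo "copy took $(( (end - start) / 1000000 )) ms"
rm -f "$src" "$dst"
```

Repeat the measurement before and after each tuning change; the difference in elapsed time tells you immediately whether the change helped.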
Optimizing Performance
Now that you know what to look for in your server’s performance, it’s time to start optimizing. Optimizing performance is a complicated job, and you shouldn’t have the impression that after reading the tips in this chapter you know everything about server performance optimization. Nevertheless, it’s good to know about at least some of the basic approaches to make your server perform better.
You can look at performance optimization in two different ways. For some people, it involves just changing some parameters and seeing what happens. That is not the best approach. A much better approach is when you first start with performance monitoring. This will give you some clear ideas on what exactly is happening with performance on your server. Before optimizing anything, you should know what exactly to optimize. For example, if the network performs poorly, you should know if that is because of problems on the network, or just because you don’t have enough memory allocated for the network. So make sure you know what to optimize. You’ve just read in the previous sections how you can do this.
Once you know what to optimize, it comes down to doing it. In many situations, optimizing performance means writing a parameter to the /proc file system. This file system is created by the kernel when your server comes up and contains the settings your kernel is working with. Under /proc/sys, you'll find many system parameters that can be changed. The easy way to do this is by echoing the new value to the configuration file. For example, the /proc/sys/vm/swappiness file contains a value that indicates how willing your server is to swap. The range of this value is 0 to 100: a low value means that your server will avoid swapping as long as possible; a high value means that your server is more willing to swap. The default value in this file is 60. If you think your server is too eager to swap, you could change it, using the following:
echo "30" > /proc/sys/vm/swappiness
This method works well, but there is a problem: as soon as the server restarts, you will lose this value. The better solution, therefore, is to store it in a configuration file and make sure that the configuration file is read when your server comes up again. A configuration file exists for this purpose, and its name is /etc/sysctl.conf. When booting, your server starts the sysctl service, which reads this configuration file and applies all settings in it.
In /etc/sysctl.conf, you refer to files that exist in the /proc/sys hierarchy. So the name of the file you are referring to is relative to this directory. Also, instead of using a slash as the separator between directory, subdirectories, and files, it is common to use a dot (even if the slash is accepted as well). That means that to apply the change to the swappiness parameter as explained above, you would include the following line in /etc/sysctl.conf:
vm.swappiness=30
This setting would be applied the next time that your server reboots. Instead of just writing it to the configuration file, you can apply it to the current sysctl settings as well. To do that, use the sysctl command. The following command can be used to apply this setting immediately:
sysctl -w vm.swappiness=30
Using sysctl -w has exactly the same effect as the echo "30" > /proc/sys/vm/swappiness command: the setting is applied immediately, but it is not written to the sysctl.conf file, so it will be lost after a reboot. The most practical way of applying these settings is to write them to /etc/sysctl.conf first and then activate them using sysctl -p /etc/sysctl.conf. Once activated in this way, you can also get an overview of all current sysctl settings, using sysctl -a. In Listing 15-20, you can see a part of the output of this command.
Listing 15-20. sysctl -a Shows All Current sysctl Settings
vm.min_free_kbytes = 67584
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.scan_unevictable_pages = 0
vm.stat_interval = 1
vm.swappiness = 60
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0
The output of sysctl -a is overwhelming, as all the kernel tunables are shown, and there are hundreds of them. I recommend that you use it in combination with grep, to find the information you need. For example, sysctl -a | grep huge would only show you lines that have the text huge in their output.
Using a Simple Performance Optimization Test
Although sysctl and its configuration file sysctl.conf are very useful tools to change performance-related settings, you shouldn’t use them immediately. Before writing a parameter to the system, make sure this really is the parameter you need. The big question, though, is how to know that for sure. There’s only one answer to that: testing. Before starting any test, be aware that tests always have their limitations. The test proposed here is far from perfect, and you shouldn’t use this test alone to draw definitive conclusions about the performance optimization of your server. Nevertheless, it gives a good impression especially of the write performance on your server.
The test consists of creating a 1GB file, using the following:
dd if=/dev/zero of=/root/1GBfile bs=1M count=1024
By copying this file around and measuring the time it takes to copy it, you can get a decent idea of the effect of some of the parameters. Many tasks you perform on your Linux server are I/O-related, so this simple test can give you an impression of whether or not there is any improvement. To measure the time it takes to copy this file, use the time command, followed by cp, as in time cp /root/1GBfile /tmp. In Listing 15-21, you can see what this looks like when doing it on your server.
Listing 15-21. Timing How Long It Takes to Copy a Large File Around, to Get an Idea of the Current Performance of Your Server
[root@hnl ~]# dd if=/dev/zero of=/1Gfile bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 16.0352 s, 67.0 MB/s
[root@hnl ~]# time cp /1Gfile /tmp
real 0m20.469s
user 0m0.005s
sys 0m7.568s
The time command gives you three different indicators: the real time, the user time, and the sys (system) time it took to complete the command. Real time is the wall-clock time from initiation to completion of the command. User time is the time the process spent executing in user space, and sys time is the time spent in kernel space on its behalf. When doing a test such as this, it is important to interpret it in the right way. Consider, for example, Listing 15-22, in which the same command is repeated a few seconds later.
Listing 15-22. The Same Test, Ten Seconds Later
[root@hnl ~]# time cp /1Gfile /tmp
real 0m33.511s
user 0m0.003s
sys 0m7.436s
As you can see, the command now performs slower than the first time it was used. This is only in real time, however, and not in sys time. Is this the result of a performance parameter that I've changed in between? No, but let's have a look at the result of free -m, as in Listing 15-23.
Listing 15-23. Take Other Factors into Consideration
root@hnl:~# free -m
total used free shared buffers cached
Mem: 3987 2246 1741 0 17 2108
-/+ buffers/cache: 119 3867
Swap: 2047 0 2047
Any idea what has happened here? The entire 1GB file was put in cache when the command was first executed. As you can see, free -m shows almost 2GB of data in cache, which wasn’t there before and that has an influence on the time it takes to copy a large file around.
So what lesson is there to learn? Performance optimization is complex. You have to take into account multiple factors that all influence the performance of your server. Only when this is done the right way will you truly see how your server performs and whether or not you have succeeded in improving its performance. If you don't look carefully, you may miss things and think you have improved performance, while in reality you have made it worse. So, it is important to develop reliable procedures for performance testing and to stick to them.
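One way to make consecutive runs of the copy test comparable is to ask the kernel to drop its caches between runs. A hedged sketch follows; writing to drop_caches requires root, and the fallback keeps it safe to run unprivileged.

```shell
# Flush dirty pages, then drop page cache, dentries, and inodes (value 3),
# so that the next copy test starts with a cold cache
sync
echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || echo "need root to drop caches"
```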
CPU Tuning
In this section, you’ll learn what you can do to optimize the performance of your server’s CPU. First you’ll learn about some aspects of the working of the CPU that are important when trying to optimize performance parameters for the CPU, then you’ll read about some common techniques to optimize CPU utilization.
Understanding CPU Performance
To be able to tune the CPU, you should know what is important with regard to this part of your system. To understand the CPU, you should know about the thread scheduler. This part of the kernel makes sure that all process threads get an equal amount of CPU cycles. Because most processes will do some I/O as well, it’s not really bad that the scheduler puts process threads on hold for a given moment. While not being served by the CPU, the process thread can handle its I/O. The scheduler operates by using fairness, meaning that all threads are moving forward in an even manner. By using fairness, the scheduler makes sure there is not too much latency.
The scheduling process is pretty simple in a single CPU / core environment. If, however, multiple cores are used, it becomes more complicated. To work in a multi-CPU or multi-core environment, your server will use a specialized symmetric multiprocessing (SMP) kernel. If needed, this kernel is installed automatically. In an SMP environment, the scheduler should make sure that some kind of load balancing is used. This means that process threads are spread over the available CPU cores. Some programs are written to be used in an SMP environment and are able to use multiple CPUs by themselves. Most programs can’t do that and for this depend on the capabilities of the kernel.
A specific worry in a multi-CPU environment is that the scheduler should prevent processes and threads from being moved to other CPU cores. Moving a process means that the information the process has written in the CPU cache has to be moved as well, and that is a relatively expensive process.
You may think that a server will always benefit from installing multiple CPU cores, but that is not true. When working on multiple cores, the chance increases that processes are moved between cores, taking their cached information with them, and that slows down performance in a multiprocessing environment. When using multi-core systems, you should tune your system to minimize such migrations.
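You can check how many logical CPUs the scheduler has to balance across with standard tools; nothing here is specific to any particular distribution.

```shell
# Show the CPU topology the scheduler load-balances across
nproc                                               # usable logical CPUs
# Sockets, cores per socket, and threads per core
# (lscpu may be absent on minimal systems, hence the fallback)
lscpu 2>/dev/null | grep -E '^(Socket|Core|Thread|CPU\(s\))' || true
```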
Optimizing CPU Performance
CPU performance optimization really is just about two things: priority and optimization of the SMP environment. Every process gets a static priority from the scheduler. The scheduler can differentiate between real time (RT) processes and normal processes, but if a process falls in one of these categories, it will be equal to all other processes in the same category. Be aware, however, that some real-time processes (most are part of the Linux kernel) will run with the highest priority, whereas the rest of available CPU cycles must be divided among the other processes. In that procedure, it’s all about fairness: the longer a process is waiting, the higher its priority will be. You have already learned how to use the nice command to tune process priority.
If you are working in an SMP environment, a good utility to improve performance is the taskset command. You can use taskset to set CPU affinity for a process to one or more CPUs. The result is that your process is less likely to be moved to another CPU. The taskset command uses a hexadecimal bitmask to specify which CPU to use. In this bitmap, the value 0x1 refers to CPU0, 0x2 refers to CPU1, 0x4 to CPU2, 0x8 to CPU3, and so on. Note that these numbers do combine, so use 0x3 to refer to CPUs 0 and 1.
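The bitmask arithmetic is easy to get wrong, so it can help to compute the mask rather than write it by hand. A small sketch:

```shell
# Build the affinity bitmask for CPUs 2 and 3: bit n set means CPU n allowed
mask=0
for cpu in 2 3; do
    mask=$(( mask | (1 << cpu) ))
done
printf 'mask for CPUs 2 and 3: 0x%x\n' "$mask"   # prints: mask for CPUs 2 and 3: 0xc
```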
So, if you have a command that you would like to bind to CPUs 2 and 3, you would combine the masks 0x4 and 0x8 and use the following command:
taskset 0xc somecommand
You can also use taskset on running processes, by using the -p option. With this option, you refer to the PID of a process, for instance,
taskset -p 0x3 7034
would set the affinity of the process using PID 7034 to CPUs 0 and 1.
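The affinity bitmask is simply the bitwise OR of 1 &lt;&lt; n for every CPU number n you want to allow. A small helper function (hypothetical, for illustration only) makes the arithmetic explicit:

```shell
# cpu_mask: print the hexadecimal affinity bitmask for a list of CPU numbers.
# Each CPU n contributes bit (1 << n); the final mask is the OR of all bits.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '0x%x\n' "$mask"
}

cpu_mask 0 1    # CPUs 0 and 1 -> 0x3
cpu_mask 2 3    # CPUs 2 and 3 -> 0xc
```

The output can be passed straight to taskset, for example, taskset $(cpu_mask 2 3) somecommand.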
You can specify CPU affinity for IRQs as well. To do this, you can use the same bitmask that you use with taskset. Every interrupt has a subdirectory in /proc/irq/, and in that subdirectory, there is a file with the name smp_affinity. So, if, for example, your IRQ 5 is producing a very high workload (check /proc/interrupts to see if this is the case), and, therefore, you want that IRQ to work on CPU1, use the following command:
echo 2 > /proc/irq/5/smp_affinity
Another approach to optimize CPU performance is by using cgroups. Cgroups provide a new way to optimize all aspects of performance, including CPU, memory, I/O, and more. At the end of this chapter, you’ll read about using cgroups.
Apart from the generic settings discussed here, there are some more specific ways of optimizing CPU performance. Most of them relate to the working of the scheduler. You can find these settings in /proc/sys/kernel. All files with a name that begins with sched relate to CPU optimization. One example of these is the sched_latency_ns, which defines the latency of the scheduler in nanoseconds. You could consider decreasing the latency that you find here, to get better CPU performance. However, you should realize that optimizing the CPU brings benefits only in very specific environments. For most environments, it doesn’t make that much sense, and you can get much better results by improving performance of important system parts, such as memory and disk.
Tuning Memory
System memory is a very important part of a computer. It functions as a buffer between CPU and I/O, and by tuning memory, you can really get the best out of it. Linux works with the concept of virtual memory, which is the total of all memory available on a server. You can tune the working of virtual memory by writing to the /proc/sys/vm directory. This directory contains lots of parameters that help you to tune the way your server's memory is used. As always when tuning the performance of a server, there are no solutions that work in all cases. Use the parameters in /proc/sys/vm with caution, and change them one by one. Only by tuning each parameter individually will you be able to determine whether it gave the desired result.
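Changing one parameter at a time is easier when the previous value is restored automatically after each experiment. The following sketch shows the idea; the helper name and the temporary-file demonstration are my own, and for real tuning you would point it at a file under /proc/sys/vm as root:

```shell
# with_tunable: set a tunable file to a new value, run a command, and then
# restore the previous value, so a failed experiment leaves nothing changed.
with_tunable() {
    param=$1; new=$2; shift 2
    old=$(cat "$param")
    echo "$new" > "$param"
    "$@"                       # run your benchmark or test command here
    echo "$old" > "$param"     # always restore the original value
}

# Safe demonstration on a temporary file standing in for a /proc tunable:
demo=$(mktemp)
echo 60 > "$demo"                      # pretend this is vm.swappiness
with_tunable "$demo" 40 cat "$demo"    # prints 40 while the command runs
cat "$demo"                            # prints 60 again afterward
```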
Understanding Memory Performance
In a Linux system, the virtual memory is used for many purposes. First, there are processes that claim their amount of memory. When tuning for processes, it helps to know how these processes allocate memory, for instance, a database server that allocates large amounts of system memory when starting up has different needs than a mail server that works with small files only. Also, each process has its own memory space, which may not be addressed by other processes. The kernel takes care that this never occurs.
When a process is created, using the fork() system call (which basically creates a child process from the parent), the kernel creates a virtual address space for the process. The dynamic linker then maps the shared libraries the process needs into that address space. The virtual address space that is used by a process is made up of pages. On current 64-bit servers, the default page size is 4KB, and although some architectures support other base page sizes, this default is rarely changed. For applications that require lots of memory, you can optimize memory usage by configuring huge pages.
Another important aspect of memory usage is caching. In your system, there is a read cache and a write cache, and it may not surprise you that a server that handles read requests most of the time is tuned in another way than a server that handles write requests.
Configuring Huge Pages
If your server is a heavily used application server, it may benefit from using large pages, also referred to as huge pages. A huge page, by default, is a 2MB page, and it may be useful to improve performance in high-performance computing and with memory-intensive applications. By default, no huge pages are allocated, as they would be a waste on a server that doesn’t need them—memory that is used for huge pages cannot be used for anything else. Typically, you set huge pages from the Grub boot loader when starting your server. In Exercise 15-5, you’ll learn how to set huge pages.
EXERCISE 15-5. CONFIGURING HUGE PAGES
In this exercise, you'll configure huge pages. You'll set them as a kernel argument, and then you'll verify their availability. Note that in this procedure, you'll specify the number of huge pages as a boot argument to the kernel. You can also set it from the /proc file system, as explained later.
Be careful, however, when allocating huge pages. All memory pages that are allocated as huge pages are no longer available for other purposes, and if your server needs a large read or write cache, you will suffer immediately from allocating too many huge pages. If you find that this is the case, you can change the number of huge pages currently in use by writing to the /proc/sys/vm/nr_hugepages parameter. Your server will pick up the new number of huge pages immediately.
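As an illustration of the boot-argument approach, a huge page pool could be reserved from the GRUB configuration like this (the value 128 is only an example; size the pool to your application's needs):

```
# /etc/default/grub -- reserve 128 huge pages at boot (example value)
GRUB_CMDLINE_LINUX="... hugepages=128"
```

After rebooting, check the HugePages_Total line in /proc/meminfo to verify the allocation, or adjust the pool at run time through /proc/sys/vm/nr_hugepages, as described above.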
Optimizing Write Cache
The next couple of parameters all relate to the buffer cache. As discussed earlier, your server maintains a write cache. By putting data in that write cache, the server can delay writing data. This is useful for more than one reason. Imagine that just after committing the write request to the server, another write request is made. It will be easier for the server to handle that write request, if the data is not yet written to disk but still in memory. You may also want to tune the write cache to balance between the amount of memory reserved for reading and the amount that is reserved for writing data.
The first relevant parameter is /proc/sys/vm/dirty_ratio. This parameter defines the maximum percentage of memory that can be used for the write cache. When the percentage of dirty pages rises above this value, your server starts writing data from the write cache to disk as soon as possible. The default (10 percent on older kernels, 20 percent on more recent ones) works fine for an average server, but in some situations, you may want to increase or decrease the amount of memory used here.
Related to dirty_ratio are the dirty_expire_centisecs and dirty_writeback_centisecs parameters, also in /proc/sys/vm. These parameters determine when data in the write cache expires and has to be written to disk, even if the write cache hasn't yet reached the threshold defined in dirty_ratio. By using these parameters, you reduce the chance of losing data when a power outage occurs on your server. Conversely, if you want to use power more efficiently, you can give both of these parameters the value 0, which effectively disables them and keeps data in the write cache as long as possible. This is useful for laptop computers, because the hard disk has to spin up to write these data, and that takes a lot of power.
The last parameter that is related to writing data is nr_pdflush_threads. This parameter helps determine the number of threads the kernel launches for writing data from the buffer cache. Understanding it is easy: more threads means faster writeback. So, if you have the impression that the buffer cache on your server is not cleared fast enough, increase the number of pdflush threads, for example, by echoing a 4 to the file /proc/sys/vm/nr_pdflush_threads:
echo 4 > /proc/sys/vm/nr_pdflush_threads
When using this option, do respect its limitations. By default, the minimum number of pdflush threads is set to 2, and there is a maximum of 8, so that the kernel keeps a dynamic range within which to decide what exactly it has to do.
Overcommitting Memory
Next, there is the issue of overcommitting memory. By default, every process tends to claim more memory than it really needs. This is good, because it makes the process faster: if the process already has some spare memory available, it can access it much faster when it needs it, because it doesn't have to ask the kernel for more. To tune the behavior of overcommitting memory, you can write to the /proc/sys/vm/overcommit_memory parameter. This parameter can take three values. The default value is 0, which means that the kernel checks if it still has memory available before granting it. If that doesn't give you the performance you need, you can consider changing it to 1, which means that the system assumes there is enough memory in all cases. This is good for the performance of memory-intensive tasks but may result in processes getting killed automatically. You can also use the value 2, which means that the kernel fails the memory request if there is not enough memory available.
How much memory can be committed in that last mode is specified by the /proc/sys/vm/overcommit_ratio parameter, which by default is set to 50. With this value, the kernel allows allocations up to the size of swap plus 50 percent of RAM. So, on a 4GB system that has 2GB of swap, the total amount of allocatable memory would be set to 2GB + 2GB = 4GB when using the value 50 in overcommit_ratio.
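The commit limit that applies in mode 2 follows the formula swap + RAM x overcommit_ratio / 100. A small helper (hypothetical, for illustration) makes the arithmetic explicit:

```shell
# commit_limit_kb: maximum committable memory (in KB) in overcommit mode 2.
# Formula: swap + RAM * overcommit_ratio / 100.
commit_limit_kb() {
    ram_kb=$1; swap_kb=$2; ratio=$3
    echo $(( swap_kb + ram_kb * ratio / 100 ))
}

# 4GB of RAM, 2GB of swap, ratio 50: 2GB + 2GB = 4GB
commit_limit_kb 4194304 2097152 50    # -> 4194304
```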
Another useful parameter is /proc/sys/vm/swappiness. This indicates how eager the kernel is to start swapping out memory pages. A high value means that your server will swap very fast; a low value means that the server will wait longer before starting to swap. The default value of 60 does well in most situations. If you still think your server starts swapping too fast, set it to a somewhat lower value, like 40.
Optimizing Inter Process Communication
The last relevant parameters that relate to memory are the parameters that relate to shared memory. Shared memory is a method that the Linux kernel or Linux applications can use to make communication between processes (also known as Inter Process Communication or IPC) as fast as possible. In database environments, it often makes sense to optimize shared memory. The cool thing about shared memory is that the kernel is not involved in the communication between the processes using it. Data doesn’t even have to be copied, because the memory areas can be addressed directly. To get an idea of shared memory–related settings your server is currently using, use the ipcs -lm command, as shown in Listing 15-24.
Listing 15-24. Use the ipcs -lm Command to Get an Idea of Shared Memory Settings
[root@lab ~]# ipcs -lm
------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 4194303
max total shared memory (kbytes) = 1073741824
min seg size (bytes) = 1
When your applications are written to use shared memory, you can benefit from tuning some of its parameters. If, however, your applications don't know how to handle it, it doesn't make a difference if you change the shared memory-related parameters. To find out if shared memory is used on your server, and, if so, to what extent, use the ipcs -m command. In Listing 15-25, you can see an example of its output.
Listing 15-25. Use ipcs -m to Find Out If Your Server Is Using Shared Memory Segments
[root@lab ~]# ipcs -m
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x00000000 65536 root 600 4194304 2 dest
0x00000000 163841 root 600 4194304 2 dest
0x00000000 557058 root 600 4194304 2 dest
0x00000000 294915 root 600 393216 2 dest
0x00000000 458756 root 600 2097152 2 dest
0x00000000 425989 root 600 1048576 2 dest
0x00000000 5865478 root 777 3145728 1
0x00000000 622599 root 600 16777216 2 dest
0x00000000 1048584 root 600 33554432 2 dest
0x00000000 6029321 root 777 3145728 1
0x00000000 6127626 root 777 3145728 1
0x00000000 6193163 root 777 3145728 1
0x00000000 6258700 root 777 3145728 1
The first /proc parameter that is related to shared memory is shmmax. This defines the maximum size in bytes of a single shared-memory segment that a Linux process can allocate. You can see the current setting in the configuration file /proc/sys/kernel/shmmax, as follows:
root@hnl:~# cat /proc/sys/kernel/shmmax
33554432
This sample was taken from a system that has 4GB of RAM. The value shown, 33554432 bytes, amounts to only 32MB, which is the conservative kernel default. If you run software that allocates large shared memory segments, such as a database, you will probably want to increase it. It doesn't make sense, though, to set it to all available RAM, because RAM has to be used for other purposes as well.
The second parameter that is related to shared memory is shmmni, which is not, as you might think, the minimum size of shared memory segments, but the maximum number of shared memory segments that your kernel can allocate. You can get the default value from /proc/sys/kernel/shmmni; it should be set to 4096. If you have an application that relies heavily on the use of shared memory, you may benefit from increasing this parameter, for example:
sysctl -w kernel.shmmni=8192
The last parameter related to shared memory is shmall. It is set in /proc/sys/kernel/shmall and defines the total number of shared memory pages that can be used system-wide. Normally, the value should be set to the value of shmmax, divided by the page size your server is using. On most systems, the page size is 4096 bytes, but to be sure, you can use the getconf command to determine the current page size:
[root@hnl ~]# getconf PAGE_SIZE
4096
If the shmall parameter doesn’t contain a value that is big enough for your application, change it, as needed. For instance, use the following command:
sysctl -w kernel.shmall=2097152
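The computation the text describes, shmmax divided by the page size, can be sketched as follows (the helper name is my own):

```shell
# shmall_pages: convert a shmmax value in bytes to a shmall value in pages.
shmall_pages() {
    echo $(( $1 / $2 ))    # shmmax in bytes / page size in bytes
}

shmall_pages 33554432 4096    # 32MB shmmax with 4KB pages -> 8192

# On a live system:
#   shmall_pages "$(cat /proc/sys/kernel/shmmax)" "$(getconf PAGE_SIZE)"
```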
Tuning Storage Performance
The third element in the chain of Linux performance is the storage channel. Performance optimization on this channel can be divided into two areas: journal optimization and I/O buffer performance. Apart from that, there are some other file system parameters that can be tuned to optimize performance.
Understanding Storage Performance
To determine what happens with I/O on your server, Linux uses the I/O scheduler. This kernel component sits between the block layer, which communicates directly with the file systems, and the device drivers. The block layer generates I/O requests for the file systems and passes those requests to the I/O scheduler. The scheduler, in turn, transforms the requests and passes them to the low-level drivers, which then forward them to the actual storage devices. Optimizing storage performance begins with optimizing the I/O scheduler.
Optimizing the I/O Scheduler
Working with an I/O scheduler makes your computer more flexible. The I/O scheduler can prioritize I/O requests and reduce times for searching data on the hard disk. Also, the I/O scheduler makes sure that a request is handled before it times out. An important goal of the I/O scheduler is to make hard disk seek times more efficient. The scheduler does this by collecting requests before really committing them to disk. Because of this approach, the scheduler can do its work more efficiently. For example, it may choose to order requests before committing them to disk, which makes hard disk seeks more efficient.
When optimizing the performance of the I/O scheduler, there is a dilemma: you can optimize read performance or write performance but not both at the same time. Optimizing read performance means that write performance will be not as good, whereas optimizing write performance means you have to pay a price in read performance. So before starting to optimize the I/O scheduler, you should really analyze what type of workload is generated by your server.
There are four different ways in which the I/O scheduler can do its work: noop performs no reordering at all, which works well for intelligent storage such as SSDs and SAN devices; anticipatory pauses briefly after a read, in the expectation that an adjacent read will follow; deadline guarantees a maximum latency per request; and cfq (Completely Fair Queuing), the default, divides the available I/O bandwidth evenly among processes.
Note The results of switching between I/O schedulers heavily depend on the nature of the workload of the specific server. The preceding summary is only a guideline, and before changing the I/O scheduler, you should test intensively to find out if it really leads to the desired results.
There are two ways to change the current I/O scheduler. You can echo a new value to the /sys/block/<YOURDEVICE>/queue/scheduler file. Alternatively, you can set it as a boot parameter, using elevator=yourscheduler at the GRUB prompt or in the GRUB configuration. The choices are noop, anticipatory, deadline, and cfq.
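When you read the scheduler file, the active scheduler is shown between square brackets, for example, noop anticipatory [deadline] cfq. A small helper (hypothetical, for use in scripts) extracts the active one:

```shell
# active_scheduler: print the scheduler marked with [brackets] in the
# contents of /sys/block/<device>/queue/scheduler.
active_scheduler() {
    echo "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

active_scheduler "noop anticipatory [deadline] cfq"    # -> deadline

# On a live system (root required for the echo):
#   cat /sys/block/sda/queue/scheduler
#   echo cfq > /sys/block/sda/queue/scheduler
```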
Optimizing Storage for Reads
Another way to optimize the way your server works is by tuning read requests. This is something that you can do on a per-disk basis. First, there is read-ahead, which can be tuned in /sys/block/<YOURDEVICE>/queue/read_ahead_kb. On a default Linux installation, this parameter is set to 128KB. If you have fast disks, you can optimize your read performance by using a higher value; 512 is a good starting point, but always make sure to test before making a new setting final. Also, you can tune the number of outstanding read requests by using /sys/block/<YOURDEVICE>/queue/nr_requests. The default value for this parameter is also set to 128, but a higher value may optimize your server in a significant way. Try 512, or even 1024, to get the best read performance, but do always verify that it doesn't introduce too much latency while writing files.
Note Optimizing read performance works well, but be aware that while making read performance better, you’ll also introduce latency on writes. In general, there is nothing against that, but if your server loses power, all data that is still in memory buffers and hasn’t been written yet will get lost.
EXERCISE 15-6. CHANGING SCHEDULER PARAMETERS
In this exercise, you’ll change scheduler parameters and try to see a difference. Note that, normally, complex workloads will show differences better, so don’t be surprised if, with the simple tests proposed in this exercise, you don’t detect much of a difference.
cd /etc
for i in *
do
    [ -f "$i" ] && cat "$i" > /dev/null
done
Changing Journal Options
By default, all modern file systems on Linux use journaling. With some specific workloads, the default journaling mode will cause a lot of overhead. You can determine if this is the case for your server by using iotop. If you see that kjournald (or jbd2, on ext4) is high in the list, you have a journaling issue that you must optimize.
There are three different journaling options, which you can set by using the data= mount option: data=journal, where both data and metadata are journaled; data=ordered (the default), where data is written to disk before the related metadata is committed to the journal; and data=writeback, where only metadata is journaled, and data may be written to disk after the metadata.
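To make a journaling mode persistent, you would set it in /etc/fstab. The following line is a sketch (the device and mount point are hypothetical):

```
# /etc/fstab -- mount /var with metadata-only journaling (ext4 example)
/dev/sda3   /var   ext4   defaults,data=writeback   1 2
```

Keep in mind that data=writeback trades crash safety of file contents for speed; after a crash, recently written files may contain stale data.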
Network Tuning
Among the most difficult items to tune is network performance, because multiple layers of communication are involved, and each is handled separately on Linux. First, there are buffers on the network card itself that deal with physical packets. Next, there is the TCP/IP protocol stack, and then there is the application stack. All work together, and tuning one layer has consequences for the others. While tuning the network, always work upward in the protocol stack. That is, start by tuning the packets themselves, then tune the TCP/IP stack, and after that, have a look at the service stacks that are in use on your server.
Tuning Network-Related Kernel Parameters
While it initializes, the kernel sets some parameters automatically, based on the amount of memory that is available on your server. So, the good news is that in many situations, there is no work to be done. Some parameters, by default, are not set in the most optimal way, so, in some cases, there is some performance to gain there.
For every network connection, the kernel allocates a socket. The socket is the end-to-end line of communication. Each socket has a receive buffer and a send buffer, also known as the read (receive) and write (send) buffers. These buffers are very important. If they are full, no more data can be processed, so data will be dropped. This will have important consequences for the performance of your server, because if data is dropped, it has to be sent and processed again.
The default buffer sizes for all sockets come from two /proc tunables:
/proc/sys/net/core/wmem_default
/proc/sys/net/core/rmem_default
All kernel-based sockets take their default buffer sizes from these values. If, however, a socket is TCP-based, the settings in here are overridden by TCP-specific parameters, in particular the tcp_rmem and tcp_wmem parameters. In the next section, you can get more details on how to optimize those.
The values of the wmem_default and rmem_default are set automatically when your server boots. If you have dropped packets on the network interface, you may benefit from increasing them. For some workloads, the values that are used by default are rather low. To set them, tune the following parameters in /etc/sysctl.conf.
net.core.wmem_default
net.core.rmem_default
Especially if you have dropped packets, try doubling them, to find out if the dropped packets go away by doing so.
Related to the default read and write buffer size is the maximum read and write buffer size: rmem_max and wmem_max. These are also calculated automatically when your server comes up but, for many situations, are far too low. For example, on a server that has 4GB of RAM, the sizes of these are set to 128KB only! You may benefit from changing their values to something that is much larger, like 8MB.
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608
When increasing the read and write buffer size, you also have to increase the maximum amount of incoming packets that can be queued. This is set in netdev_max_backlog. The default value is set to 1000, which is not enough for very busy servers. Try increasing it to a much higher value, like 8000, especially if you have long latency times on your network or if there are lots of dropped packets.
sysctl -w net.core.netdev_max_backlog=8000
Apart from the maximum number of incoming packets that your server can queue, there is also a maximum number of pending connections that can be queued for accept. You can set it through the somaxconn file in /proc.
sysctl -w net.core.somaxconn=512
By tuning this parameter, you reduce the number of new connections that get dropped under load.
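To make the buffer and queue settings from this section survive a reboot, put them in /etc/sysctl.conf rather than setting them with sysctl -w only. A fragment could look like this (the values repeat the examples above and should be adjusted to your workload):

```
# /etc/sysctl.conf -- persistent core network tuning (example values)
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.netdev_max_backlog = 8000
net.core.somaxconn = 512
```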
Optimizing TCP/IP
Up to now, you have tuned kernel buffers for network sockets only. These are generic parameters. If you are working with TCP, some specific tunables are available as well. Some TCP tunables, by default, have a value that is too low; many are self-tunable and adjust their values automatically, if that is needed. Chances are that you can gain a lot by increasing them. All relevant options are in /proc/sys/net/ipv4.
To start with, there is a read buffer size and a write buffer size that you can set for TCP. They are written to tcp_rmem and tcp_wmem. Here also, the kernel tries to allocate the best possible values when it boots, but in some cases, it doesn't work out that well. If that happens, you can change the minimum size, the default size, and the maximum size of these buffers. Note that each of these two parameters contains three values at the same time, for minimum, default, and maximum size. In general, there is no need to tune the minimum size. It can be interesting, though, to tune the default size. This is the buffer size that will be available when your server boots. Tuning the maximum size is also important, as it defines the upper threshold above which packets will get dropped. In Listing 15-26, you can see the default settings for these parameters on a server that has 4GB of RAM.
Listing 15-26. Default Settings for TCP Read and Write Buffers
[root@hnl ~]# cat /proc/sys/net/ipv4/tcp_rmem
4096 87380 3985408
[root@hnl ~]# cat /proc/sys/net/ipv4/tcp_wmem
4096 16384 3985408
In this example, the maximum size is quite good; almost 4MB are available as the maximum size for read as well as write buffers. The default write buffer size is limited. Imagine that you want to tune these parameters in a way that the default write buffer size is as big as the default read buffer size, and the maximum for both parameters is set to 8MB. You could do that by using the following two commands:
sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
sysctl -w net.ipv4.tcp_wmem="4096 87380 8388608"
Before tuning options like these, you should always check the availability of memory on your server. All memory that is allocated for TCP read and write buffers can’t be used for other purposes anymore, so you may cause problems in other areas while tuning these. It’s an important rule in tuning that you should always make sure the parameters are well-balanced.
Another useful set of parameters is related to the acknowledged nature of TCP. Let’s have a look at an example to understand how this works. Imagine that the sender in a TCP connection sends a series of packets, numbered 1,2,3,4,5,6,7,8,9,10. Now imagine that the receiver receives all of them, with the exception of packet 5. In the default setting, the receiver would acknowledge receiving up to packet 4, in which case, the sender would send packets 5,6,7,8,9,10 again. This is a waste of bandwidth, because packets 6,7,8,9,10 have been received correctly already.
To handle this acknowledgment traffic in a more efficient way, the setting /proc/sys/net/ipv4/tcp_sack is enabled (it has the value of 1). That means that in cases such as the above, only the missing packets have to be sent again, not the complete packet stream. For your network bandwidth, that is good, as only those packets that really need to be retransmitted are retransmitted. So, if your bandwidth is low, you should always leave it on. If, however, you are on a fast network, there is a downside. When using this parameter, packets may come in out of order, which means that you need larger TCP receive buffers to keep all the packets until they can be put back in the right order. That means that using this parameter requires more memory to be reserved, and from that perspective, on fast network connections, you are better off switching it off. To do that, use the following:
sysctl -w net.ipv4.tcp_sack=0
When disabling TCP selective acknowledgments, as described previously, you should also disable two related parameters: tcp_dsack and tcp_fack. These parameters enable selective acknowledgments for specific packet types. To disable them, use the following two commands:
sysctl -w net.ipv4.tcp_dsack=0
sysctl -w net.ipv4.tcp_fack=0
If you prefer to work with selective acknowledgments, you can also tune the amount of memory that is reserved to buffer incoming packets that have to be put back in the right order. Two parameters relate to this: ipfrag_low_thresh and ipfrag_high_thresh. When the amount of memory specified in ipfrag_high_thresh is reached, new packets to be reassembled are dropped until the server gets back below ipfrag_low_thresh. Make sure that both of these are set high enough at all times, if your server uses selective acknowledgments. The following values are reasonable for most servers:
sysctl -w net.ipv4.ipfrag_low_thresh=393216
sysctl -w net.ipv4.ipfrag_high_thresh=524288
Next, there is the length of the TCP SYN queue that is created for each port. The idea is that all incoming connections are queued until they can be serviced. As you can probably guess, when the queue is full, connections get dropped. The situation is that the tcp_max_syn_backlog parameter that manages these per-port queues has a default value that is too low, as only 1024 half-open connections can be queued per port. For good performance, better allow 8192 queued connections per port, using the following:
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
Also, there are some options that relate to the time an established connection is maintained. The idea is that every connection that your server has to keep alive uses resources. If your server is a very busy server, at a given moment, it will be out of resources and tell new incoming clients that no resources are available. Because, for a client, in most cases, it is easy enough to reestablish a connection, you probably want to tune your server in such a way that it detects failing connections as soon as possible.
The first parameter that relates to maintaining connections is tcp_synack_retries. This parameter defines the number of times the kernel retransmits the SYN/ACK response to an incoming new connection request. The default value is 5. Given the current quality of network connections, 3 is probably enough, and it is better for busy servers, because it makes a connection available sooner. So use the following to change it:
sysctl -w net.ipv4.tcp_synack_retries=3
Next, there is the tcp_retries2 option. This relates to the number of times the server tries to resend data to a remote host that has an established session. Because it is inconvenient for a client computer if a connection is dropped, the default value of 15 is a lot higher than the default value for tcp_synack_retries. However, retrying 15 times means that during all that time, your server can't use its resources for something else. Therefore, it is better to decrease this parameter to a more reasonable value of 5, as in the following:
sysctl -w net.ipv4.tcp_retries2=5
The parameters just mentioned relate to sessions that appear to be gone. Another area in which you can do some optimization is in the maintenance of inactive sessions. By default, a TCP session can remain idle forever. You probably don’t want that, so use the tcp_keepalive_time option to determine how long an established inactive session will be maintained. By default, this will be 7200 seconds (2 hours). If your server tends to run out of resources because too many requests are coming in, limit it to a considerably shorter period of time.
sysctl -w net.ipv4.tcp_keepalive_time=900
Related to the keepalive_time is the number of packets that your server sends before deciding a connection is dead. You can manage this by using the tcp_keepalive_probes parameter. By default, nine packets are sent before a connection is considered dead. Change it to three, if you want to terminate dead connections faster.
sysctl -w net.ipv4.tcp_keepalive_probes=3
Related to the number of keepalive probes is the interval at which these probes are sent. By default, that happens every 75 seconds. So even with 3 probes, it still takes more than 3 minutes before your server can see that a connection has really failed. To shorten this period, give the tcp_keepalive_intvl parameter the value of 15.
sysctl -w net.ipv4.tcp_keepalive_intvl=15
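Taken together, these three parameters determine how long it takes before a dead connection is cleaned up: tcp_keepalive_time + tcp_keepalive_probes x tcp_keepalive_intvl. A quick calculation (the helper is for illustration only) shows the effect of the tuning above:

```shell
# keepalive_timeout: seconds before an idle, dead TCP connection is dropped.
# Formula: tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl.
keepalive_timeout() {
    echo $(( $1 + $2 * $3 ))
}

keepalive_timeout 7200 9 75    # kernel defaults: 7875 seconds (over 2 hours)
keepalive_timeout 900 3 15     # tuned values:    945 seconds (under 16 minutes)
```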
To complete the story about maintaining connections, we need two more parameters. By default, the kernel waits a little before reusing a socket that is in the TIME_WAIT state. If you run a busy server, performance will benefit from switching this behavior off. To do this, use the following two commands:
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_tw_recycle=1
Be aware that tcp_tw_recycle is known to break connections from clients behind NAT and has been removed from recent kernels, so on current systems, use tcp_tw_reuse only.
Generic Network Performance Optimization Tips
Until now, we have discussed kernel parameters only. There are also some more generic hints for optimizing performance on the network. You probably already have applied all of them, but just to be sure, let’s repeat some of the most important tips.
Optimizing Linux Performance Using Cgroups
Among the latest features for performance optimization that Linux has to offer, is cgroups (short for control groups), a technique that allows you to create groups of resources and allocate them to specific services. By using this solution, you can make sure that a fixed percentage of resources on your server is always available for those services that need it.
To start using cgroups, you first have to make sure the libcgroup tools are installed; on SUSE, for example, use zypper install libcgroup-tools. Once the installation is confirmed, you have to start the cgconfig and cgred services and enable them at boot, using systemctl enable cgconfig and systemctl enable cgred. Next, make sure to start these services. This will create a directory /cgroup with a couple of subdirectories in it. These subdirectories are referred to as controllers. The controllers refer to the system resources that you can limit using cgroups. Some of the most interesting include the following:
There are some other controllers as well, but they are not as useful as the blkio, cpu, and memory controllers. Now let's assume that you're running an Oracle database on your server, and you want to make sure that it runs in a cgroup in which it has access to at least 75 percent of available memory and CPU cycles. The first step is to create a cgroup that defines access to cpu and memory resources. The following command creates this cgroup with the name oracle: cgcreate -g cpu,memory:oracle. After defining the cgroup this way, you'll see that in the /cgroup/cpu and /cgroup/memory directories, a subdirectory with the name oracle has been created. In this subdirectory, different parameters are available to specify the resources that you want to make available to the cgroup (see Listing 15-27).
Listing 15-27. In the Subdirectory of Your Cgroup, You’ll Find All Tunables
[root@hnl ~]# cd /cgroup/cpu/oracle/
[root@hnl oracle]# ls
cgroup.procs cpu.rt_period_us cpu.stat
cpu.cfs_period_us cpu.rt_runtime_us notify_on_release
cpu.cfs_quota_us cpu.shares tasks
To specify the amount of CPU resources available to the newly created cgroup, you use the cpu.shares parameter. This is a relative parameter that only makes sense if all processes run in cgroups, and it defines the number of shares available to this cgroup. That means that you'd assign the value 80 to the cgroup oracle and the value 20 to a cgroup other that contains all other processes. Thus the oracle cgroup receives 80 percent of available CPU resources under contention. To set the parameter, you can use the cgset command: cgset -r cpu.shares=80 oracle.
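Because cpu.shares is relative, the percentage a cgroup actually receives depends on the shares of all competing cgroups, not on the absolute number. A small sketch makes the arithmetic explicit (the cgroup names are just illustrative):

```python
# cpu.shares is relative: under CPU contention, each cgroup receives
# (its shares) / (sum of shares of all competing cgroups) of the CPU.
def effective_cpu_percent(shares):
    total = sum(shares.values())
    return {name: 100.0 * s / total for name, s in shares.items()}

# The 80/20 split from the text really does yield 80 percent for oracle:
print(effective_cpu_percent({"oracle": 80, "other": 20}))
# {'oracle': 80.0, 'other': 20.0}

# But add a third cgroup with 100 shares, and oracle's slice shrinks:
print(effective_cpu_percent({"oracle": 80, "other": 20, "backup": 100}))
# {'oracle': 40.0, 'other': 10.0, 'backup': 50.0}
```

This is why the text stresses that the parameter only makes sense if everything runs in cgroups: any process outside your accounting changes the effective split.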
After setting the number of CPU shares for this cgroup, you can put processes in it. The best way to do this is to start the process you want to place in the cgroup as an argument to the cgexec command. In this example, that means you'd run cgexec -g cpu:/oracle /path/to/oracle. The oracle process itself, and all of its child processes, will then be visible in the /cgroup/cpu/oracle/tasks file, and you have assigned Oracle to its specific cgroup.
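Putting the steps together, the manual workflow from cgroup creation to process placement could look like the following sketch. It requires root privileges and the libcgroup tools, and /path/to/oracle is a placeholder from the text, not a real path:

```shell
# Sketch of the manual cgroup workflow (run as root)
cgcreate -g cpu,memory:oracle         # create the cgroup for cpu and memory
cgset -r cpu.shares=80 oracle         # give it 80 CPU shares
cgexec -g cpu:/oracle /path/to/oracle # start the process inside the cgroup
cat /cgroup/cpu/oracle/tasks          # lists the PIDs now in the cgroup
```

Remember that everything done this way is gone after a reboot, which is exactly the problem the cgconfig and cgred services solve.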
In this example, you’ve seen how to manually create cgroups, make resources available to the cgroup, and put a process in it. The disadvantage of this approach is that after a system restart, all settings will be lost. To make the cgroups permanent, you have to use the cgconfig service and the cgred service. The cgconfig service reads its configuration file /etc/cgconfig.conf, in which the cgroups are defined, including the definition of the resources you want to assign to that cgroup. Listing 15-28 shows what it would look like for the oracle example:
Listing 15-28. Sample cgconfig.conf file
group oracle {
cpu {
cpu.shares=80
}
memory {
}
}
Next, you have to create the file /etc/cgrules.conf, which specifies the processes that are to be put in a specific cgroup automatically. Each line contains a user (optionally followed by a process name), the controllers involved, and the destination cgroup. This file is read when the cgred service starts. For the oracle example, it would have the following contents:
*:oracle cpu,memory /oracle
If you have ensured that both the cgconfig and cgred services are started at boot, your services will automatically be placed in the appropriate cgroup.
Summary
In this chapter, you've learned how to monitor and optimize performance on your server. You've read that for both the monitoring part and the optimization part, you always have to look at four different categories: CPU, memory, I/O, and network. For each of these, several tools are available.
Often, performance optimization is done by tuning parameters in the /proc file system. In addition, there are different options, which can be very diverse, depending on the optimization you're trying to achieve. An important new instrument for optimizing performance is control groups (cgroups), which allow you to limit resources for services on your server in a very specific way.