USE Method: Linux
This appendix contains a checklist for Linux derived from the USE method [Gregg 13d]. This is a method for checking system health, and identifying common resource bottlenecks and errors, introduced in Chapter 2, Methodologies, Section 2.5.9, The USE Method. Later chapters (5, 6, 7, 9, 10) described it in specific contexts and introduced tools to support its use.
Performance tools are often enhanced, and new ones are developed, so you should treat this as a starting point that will need updates. New observability frameworks and tools can also be developed to specifically make following the USE method easier.
| Component | Type | Metric |
|---|---|---|
| CPU | Utilization | Per CPU: `mpstat -P ALL 1`, sum of CPU-consuming columns (%usr, %nice, %sys, %irq, %soft, %steal, %guest), or inverse of %idle; system-wide: `vmstat 1`, "us" + "sy"; per process: `top`, "%CPU"; `pidstat 1`, "%CPU"; per kernel thread: `top`/`htop` ("K" to toggle kernel threads) |
| CPU | Saturation | System-wide: `vmstat 1`, "r" > CPU count[1]; `sar -q`, "runq-sz" > CPU count; per process: /proc/PID/schedstat 2nd field (sched_info.run_delay); getdelays.c[2]; dynamic tracing[3] |
| CPU | Errors | Machine Check Exceptions (MCEs) seen in `dmesg` or rasdaemon(8); `perf(1)` if processor-specific error events (PMCs) are available[4] |
| Memory capacity | Utilization | System-wide: `free -m`, "Mem:" (main memory), "Swap:" (virtual memory); `vmstat 1`, "free" (main memory), "swpd" (virtual memory); `sar -r`, "%memused"; per process: `top`/`htop`, "RES" (resident main memory), "VIRT" (virtual memory) |
| Memory capacity | Saturation | System-wide: `vmstat 1`, "si"/"so" (swapping); `sar -B`, "pgscank" + "pgscand" (scanning); `sar -W`; per process: getdelays.c, "SWAP"[2]; dynamic tracing of minor faults[5]; OOM killer: `dmesg \| grep killed` |
| Memory capacity | Errors | `dmesg` for physical failures; dynamic tracing, e.g., of failed malloc()s |
| Network interfaces | Utilization | `ip -s link`, RX/TX throughput / max bandwidth; `sar -n DEV`, "rxkB/s" and "txkB/s" / max; `nicstat` |
| Network interfaces | Saturation | `nstat` or `ip -s link`, "dropped", "overruns"[6]; `sar -n EDEV`, "*drop/s", "*fifo/s"[6]; dynamic tracing of other queueing in the network stack |
| Network interfaces | Errors | `ip -s link`, "errors", "dropped"[6]; `sar -n EDEV`, all; dynamic tracing of driver function returns |
| Storage device I/O | Utilization | System-wide: `iostat -xz 1`, "%util"; `sar -d`, "%util"; per process: `iotop` |
| Storage device I/O | Saturation | `iostat -xnz 1`, "avgqu-sz" > 1, or high "await"; `sar -d`, same; per process: getdelays.c, "IO"[2]; dynamic tracing |
| Storage device I/O | Errors | /sys/devices/ . . . /ioerr_cnt; `smartctl`; dynamic/static tracing of I/O subsystem response codes[7] |
| Storage capacity | Utilization | Swap: `swapon -s`; `free`; /proc/meminfo "SwapFree"/"SwapTotal"; file systems: `df -h` |
| Storage capacity | Saturation | Not sure this one makes sense—once it's full, ENOSPC (although when close to full, performance may be degraded depending on the file system free block algorithm) |
| Storage capacity | File systems: errors | `strace` for ENOSPC; dynamic tracing for ENOSPC; /var/log/messages errs, depending on file system; application log errors |
| Storage controller | Utilization | `iostat -sxz 1`, sum devices and compare to known IOPS/throughput limits per card |
| Storage controller | Saturation | See storage device saturation, . . . |
| Storage controller | Errors | See storage device errors, . . . |
| Network controller | Utilization | Infer from `ip -s link` (or `sar -n DEV`) and the known controller maximum throughput for its interfaces |
| Network controller | Saturation | See network interfaces, saturation, . . . |
| Network controller | Errors | See network interfaces, errors, . . . |
| CPU interconnect | Utilization | `perf stat` with PMCs for CPU interconnect ports, throughput / max |
| CPU interconnect | Saturation | `perf stat` with PMCs for stall cycles |
| CPU interconnect | Errors | `perf stat` with PMCs for whatever error events are available |
| Memory interconnect | Utilization | `perf stat` with PMCs for memory buses, throughput / max; or a high CPI (cycles per instruction) as an indirect indicator |
| Memory interconnect | Saturation | `perf stat` with PMCs for stall cycles |
| Memory interconnect | Errors | `perf stat` with PMCs for whatever error events are available |
| I/O interconnect | Utilization | `perf stat` with PMCs for throughput / max, if available; inference via known throughput from `iostat`/`ip`/. . . |
| I/O interconnect | Saturation | `perf stat` with PMCs for stall cycles |
| I/O interconnect | Errors | `perf stat` with PMCs for whatever error events are available |
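The system-wide CPU utilization metrics above (`vmstat 1`, `mpstat`) are ultimately derived from /proc/stat. As a minimal illustration of that check, the following shell sketch computes the percentage of non-idle time from two samples of /proc/stat; the one-second interval is an arbitrary choice:

```shell
# Sketch: system-wide CPU utilization from /proc/stat deltas -- the same
# counters that vmstat(1) and mpstat(1) read. Values are in clock ticks.
# The aggregate "cpu " line is: user nice system idle iowait irq softirq steal
read_cpu() { awk '/^cpu /{print $2+$3+$4+$5+$6+$7+$8+$9, $5+$6}' /proc/stat; }

set -- $(read_cpu); total1=$1; idle1=$2   # first sample: total, idle+iowait
sleep 1
set -- $(read_cpu); total2=$1; idle2=$2   # second sample

total=$((total2 - total1))
idle=$((idle2 - idle1))
# Guard against a zero delta (possible on a fully idle tickless system)
if [ "$total" -gt 0 ]; then
    echo "CPU busy: $(( (total - idle) * 100 / total ))%"
fi
```

This treats iowait as idle, matching the convention of the "id" + "wa" columns in vmstat(1).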
General notes:

- `uptime` "load average" (or /proc/loadavg) was not included for the CPU metrics, since Linux load averages include tasks in the uninterruptible I/O state.
- perf(1) is a powerful observability toolkit that reads PMCs and can also use dynamic and static instrumentation. See Chapter 13, perf.
- PMCs: performance monitoring counters. See Chapter 6, CPUs, and their usage with perf(1).
- I/O interconnect: this includes the CPU-to-I/O controller buses, the I/O controller(s), and device buses (e.g., PCIe).
- Dynamic instrumentation allows custom metrics to be developed. See Chapter 4, Observability Tools, and the examples in later chapters. Dynamic tracing tools for Linux include perf(1) (Chapter 13), Ftrace (Chapter 14), and BCC and bpftrace (Chapter 15).
- For any environment that imposes resource controls (e.g., cloud computing), check USE for each resource control. These may be encountered, and limit usage, before the hardware resource is fully utilized.
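As one example of checking a resource control rather than the hardware, this sketch reads the cgroup v2 CPU throttling counters from cpu.stat. The mount point /sys/fs/cgroup is the common default but is an assumption; adjust the path for your distribution or container runtime:

```shell
# Sketch: check a cgroup v2 CPU resource control for throttling.
# Path is an assumption: adjust for your environment (cgroup v1 keeps
# similar fields under the cpu controller hierarchy).
cg=/sys/fs/cgroup
if [ -r "$cg/cpu.stat" ]; then
    # nr_throttled counts periods in which the cgroup's CPU quota was
    # exhausted: saturation caused by the limit, not the hardware.
    throttled=$(awk '/^nr_throttled/ {print $2}' "$cg/cpu.stat")
    echo "nr_throttled: ${throttled:-0}"
else
    echo "no cgroup v2 cpu.stat at $cg (cgroup v1 host?)"
fi
```

A nonzero and growing nr_throttled means workloads were paused by the quota, which can look like CPU saturation to the application even while system-wide utilization appears low.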
1. The "r" column reports those threads that are waiting and threads that are running on-CPU. See the vmstat(1) description in Chapter 6, CPUs.
2. Uses delay accounting; see Chapter 4, Observability Tools.
3. There is also the sched:sched_process_wait tracepoint for perf(1); be careful about overheads when tracing, as scheduler events are frequent.
4. There aren't many error-related events in the recent Intel and AMD processor manuals.
5. This can be used to show what is consuming memory and leading to saturation, by seeing what is causing minor faults. This should be available in htop(1) as MINFLT.
6. Dropped packets are included as both saturation and error indicators, since they can occur due to both types of events.
7. This includes tracing functions from different layers of the I/O subsystem: block device, SCSI, SATA, IDE... Some static probes are available (perf(1) scsi and block tracepoint events); otherwise, use dynamic tracing.
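The per-process CPU saturation metric mentioned above (the 2nd field of /proc/PID/schedstat) can be read directly without any tools. This sketch uses the current shell as an example target, and assumes a kernel built with scheduler statistics (CONFIG_SCHED_INFO, standard in most distributions):

```shell
# Sketch: per-process run-queue delay from /proc/PID/schedstat.
# Field 1 is on-CPU time and field 2 is sched_info.run_delay (time
# spent runnable but waiting for a CPU), both in nanoseconds.
pid=$$                                   # example target: this shell
if [ -r "/proc/$pid/schedstat" ]; then
    run_delay_ns=$(awk '{print $2}' "/proc/$pid/schedstat")
    echo "PID $pid has waited ${run_delay_ns} ns on run queues"
fi
```

These are lifetime counters: to quantify current saturation, sample twice and compare the delta over the interval.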
| Component | Type | Metric |
|---|---|---|
| Kernel mutex | Utilization | With CONFIG_LOCK_STAT=y, /proc/lock_stat, "holdtime-total" / "acquisitions" (also see "holdtime-min", "holdtime-max")[8]; dynamic tracing of lock functions or instructions (maybe high overhead) |
| Kernel mutex | Saturation | With CONFIG_LOCK_STAT=y, /proc/lock_stat, "waittime-total" / "contentions" (also see "waittime-min", "waittime-max")[8]; dynamic tracing of lock functions[9]; spinning shows up with profiling: `perf record -a -g -F 99 ...` |
| Kernel mutex | Errors | Dynamic instrumentation (e.g., recursive mutex enter); other errors can cause kernel lockup/panic, debug with kdump/`crash` |
| User mutex | Utilization | `valgrind --tool=drd --exclusive-threshold=...` (held time); dynamic instrumentation of lock-to-unlock time[9] |
| User mutex | Saturation | `valgrind --tool=drd` to infer contention from held time; dynamic instrumentation of synchronization functions for wait time[9]; profiling (perf(1)) user stacks for spins |
| User mutex | Errors | `valgrind --tool=drd` various errors; dynamic instrumentation of pthread_mutex_lock() for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, . . . |
| Task capacity | Utilization | `top`/`htop`, "Tasks" (current); sysctl kernel.threads-max, /proc/sys/kernel/threads-max (max) |
| Task capacity | Saturation | Threads blocking on memory allocation; at this point the page scanner should be running (`sar -B`, "pgscan*"), or dynamic tracing of the memory allocators |
| Task capacity | Errors | "can't fork()" errors; user-level threads: pthread_create() failures with EAGAIN, EINVAL, . . . ; kernel: dynamic tracing of kernel_thread() ENOMEM |
| File descriptors | Utilization | System-wide: `sar -v`, "file-nr" vs /proc/sys/fs/file-max; or /proc/sys/fs/file-nr; per process: `echo /proc/PID/fd/* \| wc -w` vs `ulimit -n` |
| File descriptors | Saturation | This one may not make sense |
| File descriptors | Errors | `strace` errno == EMFILE on syscalls returning file descriptors (e.g., open(2), accept(2), . . .) |
8. Kernel lock analysis used to be via lockmeter, which had an interface called lockstat.
9. Since these functions can be very frequent, beware of the performance overhead of tracing every call: an application could slow by 2x or more.
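The file descriptor checks in the table above can be illustrated directly from /proc. This sketch reads /proc/sys/fs/file-nr for the system-wide count and counts /proc/PID/fd entries for one process, using the shell itself as the example target:

```shell
# Sketch: file descriptor utilization, system-wide and per process.
# /proc/sys/fs/file-nr holds three fields: allocated, unused, maximum.
set -- $(cat /proc/sys/fs/file-nr)
allocated=$1; fd_max=$3
echo "system file handles: $allocated of $fd_max allocated"

# Per process: count /proc/PID/fd entries against the soft limit.
pid=$$                                   # example target: this shell
nfd=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
echo "PID $pid: $nfd of $(ulimit -n) file descriptors in use"
```

When either count approaches its maximum, expect the EMFILE (per-process) or ENFILE (system-wide) errors listed in the table.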
[Gregg 13d] Gregg, B., “USE Method: Linux Performance Checklist,” http://www.brendangregg.com/USEmethod/use-linux.html, first published 2013.