CHAPTER 6

image

Analyzing Operating System Performance

Solving database performance issues sometimes requires the use of operating system (OS) utilities. These tools often provide information that can help isolate database performance problems. Consider the following situations:

  • You’re running multiple databases and multiple applications on one server and want to use OS utilities to identify which process is consuming the most OS resources. This approach is invaluable when a process (that may or may not be related to a database) is consuming resources to the point of causing one or more of the databases on the box to perform poorly.
  • You need to verify if the database server is adequately sized for current application workload in terms of CPU, memory, storage I/O, and network bandwidth.
  • An analysis is needed to determine at what point the server will not be able to handle larger (future) workloads.
  • You’re leasing resources within a cloud and want to verify that you’re consuming the resources you’re paying for.
  • You’ve used database tools to identify system bottlenecks and want to double-check the analysis via OS tools.

In these scenarios, to effectively analyze, tune, and troubleshoot, you’ll need to employ OS tools to identify resource-intensive processes. Furthermore, if you have multiple databases and applications running on one server, when troubleshooting performance issues, it’s often more efficient to first determine which database and process is consuming the most resources. Operating system utilities help pinpoint whether the bottleneck is CPU, memory, storage I/O, or a network issue. Once you’ve identified the OS identifier, you can then query the database to show any corresponding database processes and SQL statements.

Take a look at Figure 6-1. This flowchart details the decision-making process and the relevant OS tools that a DBA steps through when diagnosing sluggish server performance. For example, when you’re dealing with performance problems, one common first task is to log on to the box and quickly check for disk space issues using OS utilities like df and du. A full mount point is a common cause of database unresponsiveness.

9781430261872_Fig06-01.jpg

Figure 6-1. Troubleshooting poor performance

After inspecting disk space issues, the next task is to use an OS utility such as vmstat, top, or ps to determine what type of bottleneck you have (CPU, memory, I/O, or network). After determining the type of bottleneck, the next step is to determine if a database process is causing the problem.

The ps command is useful for displaying the process name and ID of the resource-consuming session. When you have multiple databases running on one box, you can determine which database is associated with the process from the process name. Once you have the process ID and associated database, you can then log on to the database and run SQL queries to determine if the process is associated with a SQL query. If the problem is SQL-related, then you can identify further details regarding the SQL query and where it might be tuned.

Figure 6-1 encapsulates the difficulty of troubleshooting performance problems. Correctly pinpointing the cause of performance issues and recommending an efficient solution is often easier said than done. When trying to resolve issues, some paths result in relatively efficient and inexpensive solutions, such as terminating a runaway OS process, creating an index, or generating fresh statistics. Other decisions may lead you to conclude that you need to add expensive hardware or redesign the system. Your performance tuning conclusions can have long-lasting financial impact on your company and thus influence your ability to retain a job. Obviously you want to focus on the cause of a performance problem and not just address the symptom. If you can consistently identify the root cause of the performance issue and recommend an effective and inexpensive solution, this will greatly increase your value to your employer.

The focus of this chapter is to provide detailed examples that show how to use Linux/Unix OS utilities to identify server performance issues. These utilities are invaluable for providing extra information used to diagnose performance issues outside of tools available within the database. Operating system utilities act as an extra set of eyes to help zero in on the cause of poor database performance.

image Tip  If you work in a Windows environment, consider downloading the Cygwin (www.cygwin.com). This provides you a collection of Linux/Unix tools in your Windows environment.

6-1. Detecting Disk Space Issues

Problem

Users are reporting that they can’t connect to a database. You log on to the database server, attempt to connect to SQL*Plus, and receive this error:

ORA-09817: Write to audit file failed.
Linux Error: 28: No space left on device
Additional information: 12

You want to quickly determine if a mount point is full and where the largest files are within this mount point.

Solution

In a Linux/Unix environment, use the df command to identify disk space issues. This example uses the h (human readable) parameter to format the output so that space is reported in megabytes or gigabytes:

$ df –h

Here is some sample output:

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      222G  210G     0 100% /
/dev/sda1              99M   23M   71M  25% /boot

The prior output indicates that the root (/) file system is full on this server. In this situation, once a full mount point is identified, then use the find command to locate the largest files contained in a directory structure. This example navigates to the ORACLE_HOME directory and then combines the find, ls, sort, and head commands to identify the largest files beneath that directory:

$ cd $ORACLE_HOME
$ find . -ls | sort -nrk7 | head -10

If you have a full mount point, also consider looking for the following types of files that can be moved or removed:

  • Deleting database trace files (these files often have the extension of .trm or .trc)
  • Removing large Oracle Net log files
  • Moving, compressing, or deleting old archive redo log files
  • Removing old installation files or binaries
  • If you have datafiles with ample free space, consider resizing them to smaller sizes

image Tip  You can view the names and values of directories where Oracle writes trace files with this query: select value from v$diag_info where name=‘Diag Trace’;

Another way to identify where the disk space is being used is to find the largest space-consuming directories beneath a given directory. This Linux example combines the du, sort, and head commands to show the 10 largest directories beneath the current working directory:

$ du -S . | sort -nr | head -10

The prior command is particularly useful for identifying a directory that might not necessarily have large files in it but lots of small files consuming space (like trace files).

image Note  On Solaris Unix systems, the prior command will need to use du with the –o option.

How It Works

When you have a database that is hung because there is little or no free disk space, you should quickly find files that can be safely removed without compromising database availability. On Linux/Unix servers, the df, find, and du commands are particularly useful.

When working with production database servers, it’s highly desirable to proactively monitor disk space so that you’re warned about a mount point becoming full. Listed next is a simple shell script that monitors disk space for a given set of mount points:

#!/bin/bash
mntlist="/orahome /oraredo1 /oraarch1 /ora01 /oradump01 /"
for ml in $mntlist
do
echo $ml
usedSpc=$(df -h $ml | awk '{print $5}' | grep -v capacity | cut -d "%" -f1 -)
BOX=$(uname -a | awk '{print $2}')
#
case $usedSpc in
[0-9])
arcStat="relax, lots of disk space: $usedSpc"
;;
[1-7][0-9])
arcStat="disk space okay: $usedSpc"
;;
[8][0-9])
arcStat="space getting low: $usedSpc"
;;
[9][0-9])
arcStat="warning, running out of space: $usedSpc"
echo $arcStat $ml | mailx -s "space on: $BOX"[email protected]
;;
[1][0][0])
arcStat="update resume, no space left: $usedSpc"
echo $arcStat $ml | mailx -s "space on: $BOX"[email protected]
;;
*)
arcStat="huh?: $usedSpc"
esac
#
BOX=$(uname -a | awk '{print $2}')
echo $arcStat
#
done
#
exit 0

You’ll have to modify the script to match your environment. For example, the second line of the script specifies the mount points on the box being monitored:

mntlist="/orahome /oraredo1 /oraarch1 /ora01 /oradump01 /"

These mount points should match the mount points listed in the output of the df –h command. For a Solaris box that this script runs on, here’s the output of df:

Filesystem             size   used  avail capacity  Mounted on
/                       35G   5.9G    30G    17%    /
/ora01                 230G   185G    45G    81%    /ora01
/oraarch1              100G    12G    88G    13%    /oraarch1
/oradump01             300G    56G   244G    19%    /oradump01
/orahome                20G    15G   5.4G    73%    /orahome
/oraredo1               30G   4.9G    25G    17%    /oraredo1

Also, depending on what version of Linux/Unix you’re using, you’ll have to modify this line as well:

usedSpc=$(df -h $ml | awk '{print $5}' | grep -v capacity | cut -d "%" -f1 -)

The prior code runs several Linux/Unix commands and places the output in the usedSpc variable.  The code depends on the output of the df command, which can vary somewhat depending on the OS vendor and version. For example, on some Linux systems, the output of df reports on Use% instead of capacity, so in this scenario, the usedSpc variable is populated as shown:

usedSpc=$(df -h $ml | grep % | grep -v Use | awk '{print $4}' | cut -d "%" -f1 -)

The command first runs df –h, which is piped to the grep command to find any string with the “%” character in it, and then uses –v to eliminate any strings that contain “Use” from the output. Then the awk command takes the output and prints out the fourth column. This is finally piped to the cut command, which removes the “%” character from the output.

On a Linux/Unix system, a shell script such as the prior one can easily be run from a scheduling utility such as cron. For example, if the shell script is named filesp.bsh, here is a sample cron entry:

#-----------------------------------------------------------------
# Filesystem check
7 * * * * /orahome/oracle/bin/filesp.bsh 1>/orahome/oracle/bin/log/filesp.log 2>&1
#-----------------------------------------------------------------

The prior entry instructs the system to run the filesp.bsh shell script at 7 minutes after the hour for every hour of the day.

6-2. Identifying System Bottlenecks

Problem

You want to determine if a server performance issue is specifically related to CPU, memory, I/O, or network.

Solution

Use vmstat to determine where the system is resource-constrained. For example, the following command reports on system resource usage every 5 seconds on a Linux system:

$ vmstat 5

Here is some sample output:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r  b   swpd   free   buff  cache    si   so    bi   bo   in   cs us sy id wa st
2  0 228816 2036164  78604 3163452   0    0     1   16    0    0 29  0 70  0  0
2  0 228816 2035792  78612 3163456   0    0     0   59  398  528 50  1 49  0  0
2  0 228816 2035172  78620 3163448   0    0     0   39  437  561 50  1 49  0  0

To exit out of vmstat in this mode, press Ctrl+C. You can also have vmstat report for a specific number of runs. For example, this instructs vmstat to run every 6 seconds for a total of 10 reports:

$ vmstat 6 10
  • Here are some general heuristics you can use when interpreting the output of vmstat. If b (processes sleeping) is consistently greater than 0, then you may be using more CPU than available. See Recipe 6-5 for details on identifying Oracle processes and SQL statements consuming CPU resources.
  • If so (memory swapped out to disk) and si (memory swapped in from disk) are consistently greater than 0, you may have a memory bottleneck. Paging and swapping occur when there isn’t enough physical memory to accommodate the memory needs of the processes on the server. When paging and swapping take place, performance usually suffers because the process of copying memory contents to and from disk is an inherently slow activity. See Recipe 6-5 for details on identifying Oracle processes and SQL statements consuming the most memory.
  • If the wa (time waiting for I/O) column is high, this is usually an indication that the storage subsystem is overloaded. See Recipe 6-6 for identifying the sources of I/O contention.

image Note  The vmstat tool is available on most Linux/Unix platforms. This tool does vary somewhat by platform; use the man vmstat command to list descriptions of the output and options available for your environment.

How It Works

The vmstat (virtual memory statistics) tool helps quickly identify bottlenecks on your server. This provides a system wide view of the server health. Use the output of vmstat to help determine if the performance bottleneck is related to CPU, memory, or disk I/O.

Most OS monitoring tools focus on one aspect of system performance. The vmstat tool differs from other monitoring tools in that it displays statistics for all of the major components of the OS—namely:

  • Processing jobs
  • Memory
  • Swap
  • I/O
  • System
  • CPU

The processing jobs (procs) section has two columns: r and b. The r column displays the number of processes waiting for access to a processor. The b column displays the number of processes in a sleep state. These values are usually zero.

The memory area displays information about memory usage. This output is approximately the same as obtained by running the free command with the -m parameter:

$ free –m

The swpd column indicates how much virtual memory has been swapped out to a file or disk. The free column indicates the amount of unused memory.

The swap section displays the amount of memory swapped to and from disk. The si column displays the amount of memory per second swapped in from disk and so column displays the amount of memory per second swapped out to disk.

The IO section displays the number of blocks read and written per second. The bo column is blocks received from the block device and the bi column is the numbers sent to the block device.

The system area shows the number of system operations per second. The in column is the number system interrupts and the cs column is the number context switches.

The CPU section shows the percentage of system usage of the CPUs. These columns should sum to approximately 100 and reflect the percentage of available CPU time. The us column signifies the amount of time spent on non-kernel processes, whereas the si column shows the amount of time spent on kernel operations. The id column shows the amount of idle time. The wa column shows the amount of time waiting for IO operations to complete.

image Tip  Use the watch -d vmstat command to highlight differences in the report output as it refreshes.

OS WATCHER

Oracle provides a collection of OS scripts that gather and store metrics for CPU, memory, disk I/O, and network usage. The OS Watcher tool suite automates the gathering of statistics using tools such as vmstat, top, iostat, mpstat, netstat, and so on.

You can obtain OS Watcher from the My Oracle Support web site (support.oracle.com). The OS Watcher User Guide can be found under document ID 301137.1. This tool is supported on most Linux/Unix systems, and there is also a version for the Windows platform.

6-3. Determining Top System-Resource-Consuming Processes

Problem

You have a server that hosts multiple databases. Users are reporting sluggishness with an application that uses one of the databases. You want to identify which processes are consuming the most resources on the server and then determine if the top consuming process is associated with a database.

Solution

The top command shows a real-time display of the highest CPU and memory resource-consuming processes on a server. Here’s the simplest way to run top:

$ top

Listed next is a fragment of the output:

top - 10:57:47 up 65 days,  1:32,  2 users,  load average: 0.22, 0.08, 0.05
Tasks: 165 total,   2 running, 163 sleeping,   0 stopped,   0 zombie
Cpu(s): 46.4%us,  2.3%sy,  0.0%ni, 47.9%id,  3.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2049404k total,  2033308k used,    16096k free,    95108k buffers
Swap:  4095992k total,    55744k used,  4040248k free,  1362984k cached
 
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
29603 oracle    20   0  966m 219m  69m R 34.4 11.0   0:10.16 oracle_29603_o1
   54 root      20   0     0    0    0 S  1.3  0.0   6:37.51 kswapd0
23255 oracle    -2   0  805m  15m  15m S  1.0  0.8  23:53.83 ora_vktm_o12c
 2516 root      20   0     0    0    0 S  0.3  0.0  14:26.21 kondemand/0
23293 oracle    20   0  805m  26m  24m S  0.3  1.3   1:17.38 ora_ckpt_o12c
23317 oracle    20   0  817m 126m 119m S  0.3  6.3   3:14.04 ora_mmon_o12c

Type q or press Ctrl+C to exit top. In the prior output, the first section of the output displays general system information such as how long the server has been running, number of users, CPU information, and so on.

image Tip  The w command will display the same information contained in the first line of the output of the top command. Additionally, w will display who is logged onto the box.

The second section shows which processes are consuming the most CPU resources (listed top to bottom). In the prior output, the process ID of 29603 is consuming a large amount of CPU and memory. To determine which database this process is associated with, use the ps command:

$ ps 29603

Here is the associated output:

  PID TTY      STAT   TIME COMMAND
29603 ?        Rs     1:08 oracleO12C (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

In the prior output, the command column displays the value of oracleO12C. This indicates that this is an Oracle process associated with the O12C database. If the process continues to consume resources, you can next determine if there is a SQL statement associated with the process (see Recipe 6-8) or terminate the process (see Recipe 6-9).

image Note  If you’re on a Windows box, you can press the ctrl+alt+delete keys at the same time to start the Task Manager utility. This utility allows you to view which applications and processes are consuming the most CPU and memory.

How It Works

The top utility is often the first investigative tool employed by DBAs and system administrators to identify resource-intensive processes on a server. If a process is continuously consuming excessive system resources, then you should further determine if the process is associated with a database and a specific SQL statement.

By default, top will repetitively refresh (every few seconds) information regarding the most CPU-intensive processes. While top is running, you can interactively change its output. For example, if you type >, this will move the column that top is sorting one position to the right. Table 6-1 lists the most useful hot key features to alter the top display to the desired format.

Table 6-1. Commands to Interactively Change the top Output

Command

Function

Spacebar

Immediately refreshes the output

< or >

Moves the sort column one to the left or to the right; by default, top sorts on the CPU column

d

Changes the refresh time

R

Reverses the sort order

z

Toggles the color output

h

Displays help menu

F or O

Chooses a sort column

Table 6-2 describes several of the columns displayed by top. Use these descriptions to help interpret the output.

Table 6-2. Column Descriptions of the top Output

Column

Description

PID

Unique process identifier

USER

OS username running the process

PR

Priority of the process

NI

Nice value of process; negative value means high priority; positive value means low priority

VIRT

Total virtual memory used by process

RES

Non-swapped physical memory used

SHR

Shared memory used by process

S

Process status

%CPU

Processes percent of CPU consumption since last screen refresh

%MEM

Percent of physical memory the process is consuming

TIME+

Total CPU time, showing hundredths of seconds

COMMAND

Command line used to start a process

6-4. Detecting CPU Bottlenecks

Problem

You want to determine if there is an excessive load on the CPUs on your server.

image Note  There are two main utilities for identifying system wide CPU bottlenecks: vmstat and mpstat. The vmstat utility was described in Recipe 6-2. This recipe focuses on using mpstat.

Solution

The mpstat utility reports on system wide CPU statistics. You can specify an interval and count when running this utility. The following instructs mpstat to run 10 times while refreshing the output every 2 seconds:

$ mpstat 2 10

Here is a snippet of the output:

10:17:44 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
10:17:46 AM  all    0.50    0.00    0.50    0.75    0.00    0.00    0.00   98.25   1135.18
10:17:48 AM  all    0.25    0.00    0.74    0.00    0.00    0.00    0.00   99.01   1085.00

Here are some general guidelines for interpreting the output of the previous report:

  • If %idle is high, then your CPUs are most likely not overburdened
  • If the %iowat output is a non-zero number, then you may have some disk I/O contention
  • If you identify that the CPUs are overloaded, see Recipe 6-5 for techniques to pinpoint sessions consuming the most CPU resources

image Tip  On Solaris systems, see MOS document ID 1278725.1 for details on using dtrace to interpret mpstat output.

How It Works

The default output of mpstat will only show one line of aggregated statistics for all CPUs on the server. You can also view CPU snapshots that report statistics accumulated between intervals. The following example uses the -P option to report only on processor 0; it displays output every 2 seconds for a total of 20 different reports:

$ mpstat –P 0 2 20

Here is a snippet of the output:

12:20:35 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:20:37 PM    0    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50      2.51
12:20:39 PM    0    0.00    0.00    1.00    0.00    0.00    0.00    0.00   99.00      2.00

You can also use the -P ALL option to print on separate lines each CPU’s statistics:

$ mpstat -P ALL

Here is the corresponding output:

12:22:45 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:22:45 PM  all    1.62    0.01    1.05    2.51    0.00    0.00    0.00   94.81    807.99
12:22:45 PM    0    1.22    0.01    1.03    0.64    0.00    0.00    0.00   97.11      1.56
12:22:45 PM    1    2.02    0.01    1.06    4.34    0.01    0.01    0.00   92.56     12.06

image Tip  To view the configuration of CPUs on a Linux server view the /proc/cpuinfo file. Similarly you can view the /proc/meminfo file to view memory configuration. On Solaris systems, run the prtdiag utility to view configured memory (and CPUs).

6-5. Identifying Processes Consuming CPU and Memory

Problem

You want to view which processes are consuming the most CPU and memory resources on your system.

image Note  There are two main utilities for identifying specific processes consuming the most CPU and memory resources: top and ps. The top utility was described in Recipe 6-3. This recipe focuses on using ps.

Solution

The ps (process status) command is handy for quickly identifying top resource-consuming processes. For example, this command displays the top 10 CPU-consuming resources on the box:

$ ps -e -o pcpu,pid,user,tty,args | sort -n -k 1 -r | head

Here is a partial listing of the output:

65.5  5017 oracle   ?        oracleO12C (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
 0.8  5014 oracle   pts/1    sqlplus
 0.8 23255 oracle   ?        ora_vktm_O12C

In the prior output, the process named oracleO12C is consuming an inordinate amount of CPU resources on the server. The process name identifies this as an Oracle process associated with the O12C database.

Similarly, you can also display the top memory-consuming processes:

$ ps -e -o pmem,pid,user,tty,args | sort -n -k 1 -r | head

Here is a snippet of the output:

11.2  5017 oracle   ?        oracleO12C (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
 7.1 23317 oracle   ?        ora_mmon_O12C
 6.3 23285 oracle   ?        ora_dbw0_O12C

How It Works

The Linux/Unix ps command displays information about currently active processes on the server. The pcpu switch instructs the process status to report the CPU usage of each process. Similarly the pmem switch instructs ps to report on process memory usage. This gives you a quick and easy way to determine which processes are consuming the most CPU or memory resources.

When using multiple commands on one line (such as ps, sort, and head), it’s often desirable to associate the combination of commands with a shortcut (alias). Here’s an example of creating aliases:

$ alias topc='ps -e -o pcpu,pid,user,tty,args | sort -n -k 1 -r | head'
$ alias topm='ps -e -o pmem,pid,user,tty,args | sort -n -k 1 -r | head'

Now instead of typing in the long line of commands, you can use the alias—for example:

$ topc

Also consider establishing the aliases in a startup file (like .bashrc or .profile) so that the commands are automatically defined when you log on to the database server.

On Solaris systems, the prstat utility can also be used to identify which processes are consuming the most CPU and memory resources. For example, you can instruct the prstat to report system statistics every 5 seconds:

$ prstat 5

Here is some sample output:

   PID USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU PROCESS/NLWP
 24448 oracle     12G 8216M sleep    59    0   0:22:40 0.4% oracle/51
 24439 oracle     12G 8214M sleep    59    0   0:08:48 0.3% oracle/14
 24387 oracle     12G 4025M sleep    59    0   0:09:45 0.1% oracle/258

Type q or press Ctrl+C to exit prstat. After identifying a top resource-consuming process, you can determine which database the process is associated with by using the ps command. This example reports on process information associated with the PID of 24448:

$ ps -ef | grep 24448
  oracle 24448     1   1 16:33:42 ?          22:48 ora_pr00_DWREP

In this example, the name of the process is ora_pr00_DWREP and the associated database is DWREP.

image Tip  Use the ipcs or the sysresv commands to view memory segments allocated by Oracle (and associated semaphores).

6-6. Determining I/O Bottlenecks

Problem

You are experiencing performance problems and want to determine if the issues are related to slow disk I/O.

Solution

Use the iostat command with the -x (extended) option combined with the -d (device) option to generate I/O statistics. This next example displays extended device statistics every 10 seconds:

$ iostat –xd 10

You need a fairly wide screen to view this output; here’s a partial listing:

Device: rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda       0.00     2.00  0.00  0.80     0.00    22.40    28.00     0.01   15.88  15.87   1.27

This periodic extended output allows you to view in real time which devices are experiencing spikes in read and write activity. To exit from the previous iostat command, press Ctrl+C. The options and output may vary depending on your OS. For example, on some Linux/Unix distributions, the iostat output may report the disk utilization as %b (percent busy).

When trying to determine whether device I/O is the bottleneck, here are some general guidelines when examining the iostat output:

  • Watch for devices with high reads (r/s) or writes (w/s) per second. To determine if I/O is abnormally high, you’ll have to compare the rates with maximum I/O rate of your storage subsystem.
  • If any device is near 100% utilization, that’s a sign that I/O is a bottleneck.

DETERMINING I/O RATES

To get an estimate of the I/O rate of your storage system, use the dd command in combination with the time command. For example:

$ time dd if=/u01/dbfile/O12C/system01.dbf of=test.out

Here is some sample output:

1024016+0 records in
1024016+0 records out
524296192 bytes (524 MB) copied, 27.3087 seconds, 19.2 MB/s

real    0m27.349s
user    0m0.385s
sys     0m9.846s

You can divide the data file size by the total time the command too to run, this will give you the disk I/O rate (which is also displayed in the output). This is a useful metric in determining the speed at which the OS is able to read and write a file.

If you want an estimate of just the read rate, then send then use /dev/null for the output file (nothing gets written):

$ time dd if=/u01/dbfile/O12C/system01.dbf of=/dev/null

In this way you can estimate the I/O writes of your storage subsystem.

How It Works

The iostat command can help you determine whether disk I/O is potentially a source of performance problems. Table 6-3 describes the columns displayed in the iostat output.

Table 6-3. Column Descriptions of iostat Disk I/O Output

Column

Description

Device

Device or partition name

tps

I/O transfers per second to the device

Blk_read/s

Blocks per second read from the device

Blk_wrtn/s

Blocks written per second to the device

Blk_read

Number of blocks read

Blk_wrtn

Number of blocks written

rrqm/s

Number of read requests merged per second that were queued to device

wrqm/s

Number of write requests merged per second that were queued to device

r/s

Read requests per second

w/s

Write requests per second

rsec/s

Sectors read per second

wsec/s

Sectors written per second

rkB/s

Kilobytes read per second

wkB/s

Kilobytes written per second

avgrq-sz

Average size of requests in sectors

avgqu-sz

Average queue length of requests

await

Average time in milliseconds for I/O requests sent to the device to be served

svctm

Average service time in milliseconds

%util

Percentage of CPU time during which I/O requests were issued to the device. Near 100% indicates device saturation

You can also instruct iostat to display reports at a specified interval. The first report displayed will report averages since the last server reboot; each subsequent report shows statistics since the previously generated snapshot. The following example displays a device statistic report every 3 seconds:

$ iostat -d 3

You can also specify a finite number of reports that you want generated. This is useful for gathering metrics to be analyzed over a period of time. This example instructs iostat to report every 2 seconds for a total of 15 reports:

$ iostat 2 15

Also consider running iostat in combination with the watch command. The watch command is particularly useful when you want to highlight differences in the output with each screen refresh. For example:

$ watch -d iostat

When working with locally attached disks, the output of the iostat command will clearly show where the I/O is occurring. However, it is not that clear-cut in environments that use external arrays for storage. What you are presented with at the file system layer is some sort of a virtual disk that might also have been configured by a volume manager. In virtualized storage environments, you’ll have to work with your system administrator or storage administrator to determine exactly which disks are experiencing high I/O activity.

Once you have determined that you have a disk I/O contention issue, then you can use utilities such as AWR (if licensed), Statspack (no license required), or the V$ views to determine if your database is I/O stressed. For example, the AWR report contains an I/O statistics section with the following subsections:

  • IOStat by Function summary
  • IOStat by Filetype summary
  • IOStat by Function/Filetype summary
  • Tablespace IO Stats
  • File IO Stats

If you want to display current database sessions that are waiting for I/O resources, you can query the data dictionary as follows:

SELECT a.username, a.sql_id, b.object_name, b.object_type, a.event
FROM v$session a
    ,dba_objects b
    ,v$event_name c
WHERE b.object_id = a.row_wait_obj#
AND   a.event = c.name
AND   c.wait_class = 'User I/O';

6-7. Detecting Network-Intensive Processes

Problem

You’re investigating performance issues on a database server. As part of your investigation, you want to determine if there are network bottlenecks on the system.

Solution

Use the netstat (network statistics) command to display network traffic. Perhaps the most useful way to view netstat output is with the -ptc options. These options display the process ID and TCP connections, and they continuously update the output:

$ netstat -ptc

Press Ctrl+C to exit the previous command. Here’s a partial listing of the output:

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address             State
PID/Program name
tcp        0      0 speed3:ssh    dhcp-brk-bl8-214-2w-e:59032 ESTABLISHED -
tcp        0      0 speed3:6204   speed3.us.oracle.c:ncube-lm ESTABLISHED -
tcp        0      0 speed3:ssh    dhcp-brk-bl8-214-2w-e:63682 ESTABLISHED -

If the Send-Q (bytes not acknowledged by remote host) column has an unusually high value for a process, this may indicate an overloaded network. The useful aspect about the previous output is that you can determine the OS process ID (PID) associated with a network connection. If you suspect the connection in question is an oracle session, you can use the techniques described in the “Solution” section of Recipe 6-8 to map an OS PID to an Oracle process or SQL statement.

How It Works

When experiencing performance issues, usually the network bandwidth is not the cause. Most likely you’ll determine that bad performance is related to a poorly constructed SQL statement, inadequate disk I/O, badly designed processes, or not enough CPU or memory resources. However, as a DBA, you need to be aware of all sources of performance bottlenecks and how to diagnose them. In today’s highly interconnected world, you must possess network troubleshooting and monitoring skills. The netstat utility is a good starting place for monitoring server network connections.

TROUBLESHOOTING DATABASE NETWORK CONNECTIVITY

Use these steps as guidelines when diagnosing Oracle database network connectivity issues:

Step 1. Use the OS ping utility to determine whether the remote box is accessible—for example:

$ ping dwdb
dwdb is alive

If ping doesn’t work, work with your system or network administrator to ensure you have server-to-server connectivity in place.

Step 2. Use telnet to see if you can connect to the remote server and port (that the listener is listening on)—for example:

$ telnet dwdb 1521
Trying 127.0.0.1...
Connected to dwdb.
Escape character is '^]'.

The prior output indicates that connectivity to a server and port is okay. If the prior command hangs, then contact your SA or network administrator for further assistance.

Step 3. Use tnsping to determine whether Oracle Net is working. This utility will verify that an Oracle Net connection can be made to a database via the network—for example:

$ tnsping dwrep
..........
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)
(HOST = dwdb.us.farm.com)(PORT = 1521))
(CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = DWREP)))
OK (500 msec)

If tnsping can’t contact the remote database, verify that the remote listener and database are both up and running. On the remote box, use the lsnrctl status command to verify that the listener is up. Verify that the remote database is available by establishing a local connection as a non-SYS account (SYS can often connect to a troubled database when other schemas will not work).

Step 4. Verify that the TNS information is correct. If the remote listener and database are working, then ensure that the mechanism for determining TNS information (like the tnsnames.ora file) contains the correct information.

Sometimes the client machine will have multiple TNS_ADMIN locations and tnsnames.ora files. One way to verify whether a particular tnsnames.ora file is being used is to rename it and see whether you get a different error when attempting to connect to the remote database.

Step 5. If you’re still having issues, examine the client sqlnet.log file and the remote server listener.log file. Sometimes these log files will show additional information that will pinpoint the issue.

6-8. Mapping a Resource-Intensive Process to a Database Process

Problem

The system is performing poorly and you want to identify which OS intensive process on the box is consuming the most resources. Furthermore, you want to map an OS process back to a database process. If the database process is a SQL process, you want to display the details of the SQL statement (such as the user, explain plan, and so on).

Solution

In Linux/Unix environments, if you can identify the resource intensive OS process, then you can easily check to see if that process is associated with a database process. This procedure consists of the following:

  1. Run an OS command to identify resource-intensive processes and associated IDs.
  2. Identify the database associated with the process.
  3. Extract details about the process from the database data dictionary views.
  4. If it’s a SQL statement, get those details.
  5. Generate an execution plan for the SQL statement.

For example, suppose you identify the top CPU-consuming queries with the ps command:

$ ps -e -o pcpu,pid,user,tty,args|grep -i oracle|sort -n -k 1 -r|head

Here is some sample output:

45.7 23651 oracle   ?        oracleO12C (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
 0.8 23255 oracle   ?        ora_vktm_O12C
 0.2 23317 oracle   ?        ora_mmon_O12C

The prior output identifies one OS process consuming an excessive amount of CPU. The process ID is 23651 and name is oracleO12C. From the process name, it’s an Oracle process associated with the O12C database.

You can determine what type of Oracle process this is by querying the data dictionary:

SELECT
  'USERNAME    : ' || s.username     || CHR(10) ||
  'SCHEMA      : ' || s.schemaname   || CHR(10) ||
  'OSUSER      : ' || s.osuser       || CHR(10) ||
  'MODUEL      : ' || s.program      || CHR(10) ||
  'ACTION      : ' || s.schemaname   || CHR(10) ||
  'CLIENT_INFO : ' || s.osuser       || CHR(10) ||
  'PROGRAM     : ' || s.program      || CHR(10) ||
  'SPID        : ' || p.spid         || CHR(10) ||
  'SID         : ' || s.sid          || CHR(10) ||
  'SERIAL#     : ' || s.serial#      || CHR(10) ||
  'KILL STRING : ' || '''' || s.sid || ',' || s.serial# || ''''  || CHR(10) ||
  'MACHINE     : ' || s.machine      || CHR(10) ||
  'TYPE        : ' || s.type         || CHR(10) ||
  'TERMINAL    : ' || s.terminal     || CHR(10) ||
  'SQL ID      : ' || q.sql_id       || CHR(10) ||
  'CHILD_NUM   : ' || q.child_number || CHR(10) ||
  'SQL TEXT    : ' || q.sql_text
FROM v$session s
    ,v$process p
    ,v$sql     q
WHERE s.paddr  = p.addr
AND   p.spid   = '&PID_FROM_OS'
AND   s.sql_id = q.sql_id(+)
AND   s.status = 'ACTIVE';

The prior script prompts you for the OS process ID. Here is the output for this example:

USERNAME    : MV_MAINT
SCHEMA      : MV_MAINT
OSUSER      : oracle
MODUEL      : sqlplus@speed2 (TNS V1-V3)
ACTION      : MV_MAINT
CLIENT_INFO : oracle
PROGRAM     : sqlplus@speed2 (TNS V1-V3)
SPID        : 28191
SID         : 10
SERIAL#     : 2039
KILL STRING : '10,2039'
MACHINE     : speed2
TYPE        : USER
TERMINAL    : pts/1
SQL ID      : 3k8vqz6yycr3g
CHILD_NUM   : 0
SQL TEXT    : select a.table_name from dba_tables a, dba_indexes b, dba_objects

The output indicates that this is a SQL*Plus process with a database SID of 10 and SERIAL# of 2039. You’ll need this information if you decide to terminate the process with the ALTER SYSTEM KILL SESSION statement (see Recipe 6-9 for details).

In this example, since the process is running a SQL statement, further details about the query can be extracted by generating an execution plan:

SQL> SELECT * FROM table(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id',&child_num));

You’ll be prompted for the sql_id and child_num when you run the prior statement. Here is a partial listing of the output:

SQL_ID  3k8vqz6yycr3g, child number 0
-------------------------------------
select a.table_name from dba_tables a, dba_indexes b, dba_objects c
...

This output will help you determine the efficiency of the SQL statement and provide insight on how to tune it. Refer to Chapter 9 for details on how to manually tune a query and Chapter 11 for automated SQL tuning.

Keep in mind that when selecting from V$SESSION, the records displayed within that view show the current SQL statement that the session is executing. It’s entirely possible that a given session can run several resource consuming queries in succession. Therefore it’s conceivable you might identify a query for a session (Step 3) that wasn’t the SQL executing when you first identified the OS process (Step 1). That said, the prior solution works well for long running queries consuming large amounts of system resources.

How It Works

The process described in the “Solution” section of this recipe allows you to quickly identify resource-intensive processes, then map the OS process to a database process, and subsequently map the database process to a SQL statement. Once you know which SQL statement is consuming resources, then you can generate an execution plan to further attempt to determine any possible inefficiencies.

When generating the execution plan for a SQL statement, you need to provide DBMS_XPLAN both the SQL_ID and CHILD_NUMBER to correctly identify an execution plan. The SQL_ID uniquely identifies the SQL statement. If the SQL statement has executed more than once, it is possible for the optimizer to generate and use one or more execution plans. If a given SQL statement has more than one execution plan associated with it, then the combination of SQL_ID and CHILD_NUMBER uniquely identifies an execution plan for a given SQL statement.

Sometimes the resource-consuming process will not be associated with a database. In these scenarios, you’ll have to work with your SA to determine what the process is and if it can be tuned or terminated. Also, you may encounter resource-intensive processes that are database-specific but not associated with a SQL statement. For example, you might have a long-running RMAN backup process, Data Pump, or PL/SQL jobs running. In these cases, work with your DBA to identify whether these types of processes can be tuned, scheduled to run at another time of day, or killed.

ORADEBUG

You can use Oracle’s oradebugutility to display top consuming SQL statements if you know the OS ID. For example, suppose that you have used a utility such as top or ps to identify a high CPU-consuming OS process, and from the name of the process you determine it’s a database process. Now log in to SQL*Plus and use oradebug to display any SQL associated with the process. In this example, the OS process ID is 23651:

SQL> oradebug setospid 23651;
Oracle pid: 32, Unix process pid: 23651, image: oracle@speed2 (TNS V1-V3)

Now show the SQL associated with this process (if any):

SQL> oradebug current_sql;

If there is a SQL statement associated with the process, it will be displayed—for example:

select a.table_name from dba_tables a, dba_indexes b, dba_objects c

The oradebug utility can be used in a variety of methods to help troubleshoot performance issues. Use oradebug help to display all options available.

6-9. Terminating a Resource-Intensive Process

Problem

You have identified a database process that is consuming inordinate amounts of system resources (see Recipe 6-8) and determined that it’s a runaway SQL statement that needs to be killed.

Solution

There are three basic ways to terminate a SQL process:

  • If you have access to the terminal where the SQL statement is running, you can press Ctrl+C and attempt to halt the process.
  • Determine the session ID and serial number, and then use the SQL ALTER SYSTEM KILL SESSION statement.
  • Determine the OS process ID, and use the kill utility to stop the process.

If you happen to have access to the terminal from which the resource-consuming SQL statement is running, you can attempt to press Ctrl+C to terminate the process. Often you don’t have access to the terminal and will have to use a SQL statement or an OS command to terminate the process.

Using SQL to Kill a Session

If it’s an Oracle process and you have identified the SID and SERIAL# (see Recipe 6-8), you can terminate a process from within SQL*Plus. Here is the general syntax:

alter system kill session 'integer1, integer2 [,integer3]' [immediate];

In the prior syntax statement, integer1 is the value of the SID column and integer2 is the value from the SERIAL# column (of V$SESSION). In a RAC environment, you can optionally specify the value of the instance ID for integer3. The instance ID can be retrieved from the GV$SESSION view.

Here’s an example that terminates a process with a SID of 1177 and a SERIAL# of 38583:

SQL> alter system kill session '1177,38583';

If successful, you should see this output:

System altered.

When you kill a session, this will mark the session as terminated, roll back active transactions (within the session), and release any locks (held by the session). The session will stay in a terminated state until any dependent transactions are rolled back. If it takes a minute or more to roll back the transaction, Oracle reports the session as “marked to be terminated” and returns control to the SQL prompt. If you specify IMMEDIATE (optional), Oracle will roll back any active transactions and immediately return control back to you.

image Caution  Killing a session that is executing a select statement is fairly harmless. However, if you terminate a session that is performing a large insert/update/delete then you may see a great deal of database activity (including I/O in the online redo logs) associated with Oracle rolling back the terminated transaction.

Using the OS Kill Command

If you have access to the database server and access to an OS account that has privileges to terminate an Oracle process (such as the oracle OS account), you can also terminate a process directly with the kill command.

image Caution  Ensure that you don’t kill the wrong Oracle process. If you accidentally kill a required Oracle background process, this will cause your instance to abort.

For example, suppose you run the ps command and have done the associated work to determine that you have a SQL statement that has been running for hours and needs to be terminated. The kill command directly terminates the OS process. In this example, the process ID of 6254 is terminated:

$ kill -9 6254

image Caution  In some rare situations killing an Oracle process associated with a SQL transaction can have an adverse impact on the stability of the instance. For example, killing a process participating in a distributed transaction may cause the instance to crash. To determine if this is an issue for the version of the database you’re using see MOS bug IDs 8686128 and 12961905. In older versions of Oracle various other bugs associated with killing a process have been identified and fixed and are documented in MOS bug IDs 5929055 and 6865378.

How It Works

Sometimes you’ll find yourself in a situation where you need to kill hung SQL processes, database jobs, or SQL statements that are consuming inordinate amounts of resources. For example, you may have a test server where a job has been running for several hours, is consuming much of the server resources, and needs to be stopped so that other jobs can continue to process in a timely manner.

Manually killing a SQL statement will cause the transaction to be rolled back. Therefore take care when doing this. Ensure that you are killing the correct process. If you erroneously terminate a critical process, this obviously will have an adverse impact on the application and associated data.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset