Analyzing Operating System Performance
Solving database performance issues sometimes requires the use of operating system (OS) utilities. These tools often provide information that can help isolate database performance problems. Consider the following situations:
In these scenarios, to effectively analyze, tune, and troubleshoot, you’ll need to employ OS tools to identify resource-intensive processes. Furthermore, if you have multiple databases and applications running on one server, when troubleshooting performance issues, it’s often more efficient to first determine which database and process is consuming the most resources. Operating system utilities help pinpoint whether the bottleneck is CPU, memory, storage I/O, or a network issue. Once you’ve identified the OS identifier, you can then query the database to show any corresponding database processes and SQL statements.
Take a look at Figure 6-1. This flowchart details the decision-making process and the relevant OS tools that a DBA steps through when diagnosing sluggish server performance. For example, when you’re dealing with performance problems, one common first task is to log on to the box and quickly check for disk space issues using OS utilities like df and du. A full mount point is a common cause of database unresponsiveness.
Figure 6-1. Troubleshooting poor performance
After inspecting disk space issues, the next task is to use an OS utility such as vmstat, top, or ps to determine what type of bottleneck you have (CPU, memory, I/O, or network). After determining the type of bottleneck, the next step is to determine if a database process is causing the problem.
The ps command is useful for displaying the process name and ID of the resource-consuming session. When you have multiple databases running on one box, you can determine which database is associated with the process from the process name. Once you have the process ID and associated database, you can then log on to the database and run SQL queries to determine if the process is associated with a SQL query. If the problem is SQL-related, then you can identify further details regarding the SQL query and where it might be tuned.
Figure 6-1 encapsulates the difficulty of troubleshooting performance problems. Correctly pinpointing the cause of performance issues and recommending an efficient solution is often easier said than done. When trying to resolve issues, some paths result in relatively efficient and inexpensive solutions, such as terminating a runaway OS process, creating an index, or generating fresh statistics. Other decisions may lead you to conclude that you need to add expensive hardware or redesign the system. Your performance tuning conclusions can have long-lasting financial impact on your company and thus influence your ability to retain a job. Obviously you want to focus on the cause of a performance problem and not just address the symptom. If you can consistently identify the root cause of the performance issue and recommend an effective and inexpensive solution, this will greatly increase your value to your employer.
The focus of this chapter is to provide detailed examples that show how to use Linux/Unix OS utilities to identify server performance issues. These utilities are invaluable for providing extra information used to diagnose performance issues outside of tools available within the database. Operating system utilities act as an extra set of eyes to help zero in on the cause of poor database performance.
Tip If you work in a Windows environment, consider downloading the Cygwin (www.cygwin.com). This provides you a collection of Linux/Unix tools in your Windows environment.
6-1. Detecting Disk Space Issues
Users are reporting that they can’t connect to a database. You log on to the database server, attempt to connect to SQL*Plus, and receive this error:
ORA-09817: Write to audit file failed.
Linux Error: 28: No space left on device
Additional information: 12
You want to quickly determine if a mount point is full and where the largest files are within this mount point.
Solution
In a Linux/Unix environment, use the df command to identify disk space issues. This example uses the h (human readable) parameter to format the output so that space is reported in megabytes or gigabytes:
$ df –h
Here is some sample output:
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
222G 210G 0 100% /
/dev/sda1 99M 23M 71M 25% /boot
The prior output indicates that the root (/) file system is full on this server. In this situation, once a full mount point is identified, then use the find command to locate the largest files contained in a directory structure. This example navigates to the ORACLE_HOME directory and then combines the find, ls, sort, and head commands to identify the largest files beneath that directory:
$ cd $ORACLE_HOME
$ find . -ls | sort -nrk7 | head -10
If you have a full mount point, also consider looking for the following types of files that can be moved or removed:
Tip You can view the names and values of directories where Oracle writes trace files with this query: select value from v$diag_info where name=‘Diag Trace’;
Another way to identify where the disk space is being used is to find the largest space-consuming directories beneath a given directory. This Linux example combines the du, sort, and head commands to show the 10 largest directories beneath the current working directory:
$ du -S . | sort -nr | head -10
The prior command is particularly useful for identifying a directory that might not necessarily have large files in it but lots of small files consuming space (like trace files).
Note On Solaris Unix systems, the prior command will need to use du with the –o option.
How It Works
When you have a database that is hung because there is little or no free disk space, you should quickly find files that can be safely removed without compromising database availability. On Linux/Unix servers, the df, find, and du commands are particularly useful.
When working with production database servers, it’s highly desirable to proactively monitor disk space so that you’re warned about a mount point becoming full. Listed next is a simple shell script that monitors disk space for a given set of mount points:
#!/bin/bash
mntlist="/orahome /oraredo1 /oraarch1 /ora01 /oradump01 /"
for ml in $mntlist
do
echo $ml
usedSpc=$(df -h $ml | awk '{print $5}' | grep -v capacity | cut -d "%" -f1 -)
BOX=$(uname -a | awk '{print $2}')
#
case $usedSpc in
[0-9])
arcStat="relax, lots of disk space: $usedSpc"
;;
[1-7][0-9])
arcStat="disk space okay: $usedSpc"
;;
[8][0-9])
arcStat="space getting low: $usedSpc"
;;
[9][0-9])
arcStat="warning, running out of space: $usedSpc"
echo $arcStat $ml | mailx -s "space on: $BOX"[email protected]
;;
[1][0][0])
arcStat="update resume, no space left: $usedSpc"
echo $arcStat $ml | mailx -s "space on: $BOX"[email protected]
;;
*)
arcStat="huh?: $usedSpc"
esac
#
BOX=$(uname -a | awk '{print $2}')
echo $arcStat
#
done
#
exit 0
You’ll have to modify the script to match your environment. For example, the second line of the script specifies the mount points on the box being monitored:
mntlist="/orahome /oraredo1 /oraarch1 /ora01 /oradump01 /"
These mount points should match the mount points listed in the output of the df –h command. For a Solaris box that this script runs on, here’s the output of df:
Filesystem size used avail capacity Mounted on
/ 35G 5.9G 30G 17% /
/ora01 230G 185G 45G 81% /ora01
/oraarch1 100G 12G 88G 13% /oraarch1
/oradump01 300G 56G 244G 19% /oradump01
/orahome 20G 15G 5.4G 73% /orahome
/oraredo1 30G 4.9G 25G 17% /oraredo1
Also, depending on what version of Linux/Unix you’re using, you’ll have to modify this line as well:
usedSpc=$(df -h $ml | awk '{print $5}' | grep -v capacity | cut -d "%" -f1 -)
The prior code runs several Linux/Unix commands and places the output in the usedSpc variable. The code depends on the output of the df command, which can vary somewhat depending on the OS vendor and version. For example, on some Linux systems, the output of df reports on Use% instead of capacity, so in this scenario, the usedSpc variable is populated as shown:
usedSpc=$(df -h $ml | grep % | grep -v Use | awk '{print $4}' | cut -d "%" -f1 -)
The command first runs df –h, which is piped to the grep command to find any string with the “%” character in it, and then uses –v to eliminate any strings that contain “Use” from the output. Then the awk command takes the output and prints out the fourth column. This is finally piped to the cut command, which removes the “%” character from the output.
On a Linux/Unix system, a shell script such as the prior one can easily be run from a scheduling utility such as cron. For example, if the shell script is named filesp.bsh, here is a sample cron entry:
#-----------------------------------------------------------------
# Filesystem check
7 * * * * /orahome/oracle/bin/filesp.bsh 1>/orahome/oracle/bin/log/filesp.log 2>&1
#-----------------------------------------------------------------
The prior entry instructs the system to run the filesp.bsh shell script at 7 minutes after the hour for every hour of the day.
6-2. Identifying System Bottlenecks
You want to determine if a server performance issue is specifically related to CPU, memory, I/O, or network.
Solution
Use vmstat to determine where the system is resource-constrained. For example, the following command reports on system resource usage every 5 seconds on a Linux system:
$ vmstat 5
Here is some sample output:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 228816 2036164 78604 3163452 0 0 1 16 0 0 29 0 70 0 0
2 0 228816 2035792 78612 3163456 0 0 0 59 398 528 50 1 49 0 0
2 0 228816 2035172 78620 3163448 0 0 0 39 437 561 50 1 49 0 0
To exit out of vmstat in this mode, press Ctrl+C. You can also have vmstat report for a specific number of runs. For example, this instructs vmstat to run every 6 seconds for a total of 10 reports:
$ vmstat 6 10
Note The vmstat tool is available on most Linux/Unix platforms. This tool does vary somewhat by platform; use the man vmstat command to list descriptions of the output and options available for your environment.
The vmstat (virtual memory statistics) tool helps quickly identify bottlenecks on your server. This provides a system wide view of the server health. Use the output of vmstat to help determine if the performance bottleneck is related to CPU, memory, or disk I/O.
Most OS monitoring tools focus on one aspect of system performance. The vmstat tool differs from other monitoring tools in that it displays statistics for all of the major components of the OS—namely:
The processing jobs (procs) section has two columns: r and b. The r column displays the number of processes waiting for access to a processor. The b column displays the number of processes in a sleep state. These values are usually zero.
The memory area displays information about memory usage. This output is approximately the same as obtained by running the free command with the -m parameter:
$ free –m
The swpd column indicates how much virtual memory has been swapped out to a file or disk. The free column indicates the amount of unused memory.
The swap section displays the amount of memory swapped to and from disk. The si column displays the amount of memory per second swapped in from disk and so column displays the amount of memory per second swapped out to disk.
The IO section displays the number of blocks read and written per second. The bo column is blocks received from the block device and the bi column is the numbers sent to the block device.
The system area shows the number of system operations per second. The in column is the number system interrupts and the cs column is the number context switches.
The CPU section shows the percentage of system usage of the CPUs. These columns should sum to approximately 100 and reflect the percentage of available CPU time. The us column signifies the amount of time spent on non-kernel processes, whereas the si column shows the amount of time spent on kernel operations. The id column shows the amount of idle time. The wa column shows the amount of time waiting for IO operations to complete.
Tip Use the watch -d vmstat command to highlight differences in the report output as it refreshes.
Oracle provides a collection of OS scripts that gather and store metrics for CPU, memory, disk I/O, and network usage. The OS Watcher tool suite automates the gathering of statistics using tools such as vmstat, top, iostat, mpstat, netstat, and so on.
You can obtain OS Watcher from the My Oracle Support web site (support.oracle.com). The OS Watcher User Guide can be found under document ID 301137.1. This tool is supported on most Linux/Unix systems, and there is also a version for the Windows platform.
6-3. Determining Top System-Resource-Consuming Processes
You have a server that hosts multiple databases. Users are reporting sluggishness with an application that uses one of the databases. You want to identify which processes are consuming the most resources on the server and then determine if the top consuming process is associated with a database.
Solution
The top command shows a real-time display of the highest CPU and memory resource-consuming processes on a server. Here’s the simplest way to run top:
$ top
Listed next is a fragment of the output:
top - 10:57:47 up 65 days, 1:32, 2 users, load average: 0.22, 0.08, 0.05
Tasks: 165 total, 2 running, 163 sleeping, 0 stopped, 0 zombie
Cpu(s): 46.4%us, 2.3%sy, 0.0%ni, 47.9%id, 3.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2049404k total, 2033308k used, 16096k free, 95108k buffers
Swap: 4095992k total, 55744k used, 4040248k free, 1362984k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29603 oracle 20 0 966m 219m 69m R 34.4 11.0 0:10.16 oracle_29603_o1
54 root 20 0 0 0 0 S 1.3 0.0 6:37.51 kswapd0
23255 oracle -2 0 805m 15m 15m S 1.0 0.8 23:53.83 ora_vktm_o12c
2516 root 20 0 0 0 0 S 0.3 0.0 14:26.21 kondemand/0
23293 oracle 20 0 805m 26m 24m S 0.3 1.3 1:17.38 ora_ckpt_o12c
23317 oracle 20 0 817m 126m 119m S 0.3 6.3 3:14.04 ora_mmon_o12c
Type q or press Ctrl+C to exit top. In the prior output, the first section of the output displays general system information such as how long the server has been running, number of users, CPU information, and so on.
Tip The w command will display the same information contained in the first line of the output of the top command. Additionally, w will display who is logged onto the box.
The second section shows which processes are consuming the most CPU resources (listed top to bottom). In the prior output, the process ID of 29603 is consuming a large amount of CPU and memory. To determine which database this process is associated with, use the ps command:
$ ps 29603
Here is the associated output:
PID TTY STAT TIME COMMAND
29603 ? Rs 1:08 oracleO12C (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
In the prior output, the command column displays the value of oracleO12C. This indicates that this is an Oracle process associated with the O12C database. If the process continues to consume resources, you can next determine if there is a SQL statement associated with the process (see Recipe 6-8) or terminate the process (see Recipe 6-9).
How It Works
The top utility is often the first investigative tool employed by DBAs and system administrators to identify resource-intensive processes on a server. If a process is continuously consuming excessive system resources, then you should further determine if the process is associated with a database and a specific SQL statement.
By default, top will repetitively refresh (every few seconds) information regarding the most CPU-intensive processes. While top is running, you can interactively change its output. For example, if you type >, this will move the column that top is sorting one position to the right. Table 6-1 lists the most useful hot key features to alter the top display to the desired format.
Table 6-1. Commands to Interactively Change the top Output
Command |
Function |
---|---|
Spacebar |
Immediately refreshes the output |
< or > |
Moves the sort column one to the left or to the right; by default, top sorts on the CPU column |
d |
Changes the refresh time |
R |
Reverses the sort order |
z |
Toggles the color output |
h |
Displays help menu |
F or O |
Chooses a sort column |
Table 6-2 describes several of the columns displayed by top. Use these descriptions to help interpret the output.
Table 6-2. Column Descriptions of the top Output
Column |
Description |
---|---|
PID |
Unique process identifier |
USER |
OS username running the process |
PR |
Priority of the process |
NI |
Nice value of process; negative value means high priority; positive value means low priority |
VIRT |
Total virtual memory used by process |
RES |
Non-swapped physical memory used |
SHR |
Shared memory used by process |
S |
Process status |
%CPU |
Processes percent of CPU consumption since last screen refresh |
%MEM |
Percent of physical memory the process is consuming |
TIME+ |
Total CPU time, showing hundredths of seconds |
COMMAND |
Command line used to start a process |
6-4. Detecting CPU Bottlenecks
Problem
You want to determine if there is an excessive load on the CPUs on your server.
Note There are two main utilities for identifying system wide CPU bottlenecks: vmstat and mpstat. The vmstat utility was described in Recipe 6-2. This recipe focuses on using mpstat.
Solution
The mpstat utility reports on system wide CPU statistics. You can specify an interval and count when running this utility. The following instructs mpstat to run 10 times while refreshing the output every 2 seconds:
$ mpstat 2 10
Here is a snippet of the output:
10:17:44 AM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
10:17:46 AM all 0.50 0.00 0.50 0.75 0.00 0.00 0.00 98.25 1135.18
10:17:48 AM all 0.25 0.00 0.74 0.00 0.00 0.00 0.00 99.01 1085.00
Here are some general guidelines for interpreting the output of the previous report:
Tip On Solaris systems, see MOS document ID 1278725.1 for details on using dtrace to interpret mpstat output.
How It Works
The default output of mpstat will only show one line of aggregated statistics for all CPUs on the server. You can also view CPU snapshots that report statistics accumulated between intervals. The following example uses the -P option to report only on processor 0; it displays output every 2 seconds for a total of 20 different reports:
$ mpstat –P 0 2 20
Here is a snippet of the output:
12:20:35 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
12:20:37 PM 0 0.00 0.00 0.50 0.00 0.00 0.00 0.00 99.50 2.51
12:20:39 PM 0 0.00 0.00 1.00 0.00 0.00 0.00 0.00 99.00 2.00
You can also use the -P ALL option to print on separate lines each CPU’s statistics:
$ mpstat -P ALL
Here is the corresponding output:
12:22:45 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
12:22:45 PM all 1.62 0.01 1.05 2.51 0.00 0.00 0.00 94.81 807.99
12:22:45 PM 0 1.22 0.01 1.03 0.64 0.00 0.00 0.00 97.11 1.56
12:22:45 PM 1 2.02 0.01 1.06 4.34 0.01 0.01 0.00 92.56 12.06
Tip To view the configuration of CPUs on a Linux server view the /proc/cpuinfo file. Similarly you can view the /proc/meminfo file to view memory configuration. On Solaris systems, run the prtdiag utility to view configured memory (and CPUs).
6-5. Identifying Processes Consuming CPU and Memory
Problem
You want to view which processes are consuming the most CPU and memory resources on your system.
Note There are two main utilities for identifying specific processes consuming the most CPU and memory resources: top and ps. The top utility was described in Recipe 6-3. This recipe focuses on using ps.
Solution
The ps (process status) command is handy for quickly identifying top resource-consuming processes. For example, this command displays the top 10 CPU-consuming resources on the box:
$ ps -e -o pcpu,pid,user,tty,args | sort -n -k 1 -r | head
Here is a partial listing of the output:
65.5 5017 oracle ? oracleO12C (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
0.8 5014 oracle pts/1 sqlplus
0.8 23255 oracle ? ora_vktm_O12C
In the prior output, the process named oracleO12C is consuming an inordinate amount of CPU resources on the server. The process name identifies this as an Oracle process associated with the O12C database.
Similarly, you can also display the top memory-consuming processes:
$ ps -e -o pmem,pid,user,tty,args | sort -n -k 1 -r | head
Here is a snippet of the output:
11.2 5017 oracle ? oracleO12C (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
7.1 23317 oracle ? ora_mmon_O12C
6.3 23285 oracle ? ora_dbw0_O12C
How It Works
The Linux/Unix ps command displays information about currently active processes on the server. The pcpu switch instructs the process status to report the CPU usage of each process. Similarly the pmem switch instructs ps to report on process memory usage. This gives you a quick and easy way to determine which processes are consuming the most CPU or memory resources.
When using multiple commands on one line (such as ps, sort, and head), it’s often desirable to associate the combination of commands with a shortcut (alias). Here’s an example of creating aliases:
$ alias topc='ps -e -o pcpu,pid,user,tty,args | sort -n -k 1 -r | head'
$ alias topm='ps -e -o pmem,pid,user,tty,args | sort -n -k 1 -r | head'
Now instead of typing in the long line of commands, you can use the alias—for example:
$ topc
Also consider establishing the aliases in a startup file (like .bashrc or .profile) so that the commands are automatically defined when you log on to the database server.
On Solaris systems, the prstat utility can also be used to identify which processes are consuming the most CPU and memory resources. For example, you can instruct the prstat to report system statistics every 5 seconds:
$ prstat 5
Here is some sample output:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
24448 oracle 12G 8216M sleep 59 0 0:22:40 0.4% oracle/51
24439 oracle 12G 8214M sleep 59 0 0:08:48 0.3% oracle/14
24387 oracle 12G 4025M sleep 59 0 0:09:45 0.1% oracle/258
Type q or press Ctrl+C to exit prstat. After identifying a top resource-consuming process, you can determine which database the process is associated with by using the ps command. This example reports on process information associated with the PID of 24448:
$ ps -ef | grep 24448
oracle 24448 1 1 16:33:42 ? 22:48 ora_pr00_DWREP
In this example, the name of the process is ora_pr00_DWREP and the associated database is DWREP.
Tip Use the ipcs or the sysresv commands to view memory segments allocated by Oracle (and associated semaphores).
6-6. Determining I/O Bottlenecks
You are experiencing performance problems and want to determine if the issues are related to slow disk I/O.
Solution
Use the iostat command with the -x (extended) option combined with the -d (device) option to generate I/O statistics. This next example displays extended device statistics every 10 seconds:
$ iostat –xd 10
You need a fairly wide screen to view this output; here’s a partial listing:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 2.00 0.00 0.80 0.00 22.40 28.00 0.01 15.88 15.87 1.27
This periodic extended output allows you to view in real time which devices are experiencing spikes in read and write activity. To exit from the previous iostat command, press Ctrl+C. The options and output may vary depending on your OS. For example, on some Linux/Unix distributions, the iostat output may report the disk utilization as %b (percent busy).
When trying to determine whether device I/O is the bottleneck, here are some general guidelines when examining the iostat output:
To get an estimate of the I/O rate of your storage system, use the dd command in combination with the time command. For example:
$ time dd if=/u01/dbfile/O12C/system01.dbf of=test.out
Here is some sample output:
1024016+0 records in
1024016+0 records out
524296192 bytes (524 MB) copied, 27.3087 seconds, 19.2 MB/s
real 0m27.349s
user 0m0.385s
sys 0m9.846s
You can divide the data file size by the total time the command too to run, this will give you the disk I/O rate (which is also displayed in the output). This is a useful metric in determining the speed at which the OS is able to read and write a file.
If you want an estimate of just the read rate, then send then use /dev/null for the output file (nothing gets written):
$ time dd if=/u01/dbfile/O12C/system01.dbf of=/dev/null
In this way you can estimate the I/O writes of your storage subsystem.
How It Works
The iostat command can help you determine whether disk I/O is potentially a source of performance problems. Table 6-3 describes the columns displayed in the iostat output.
Table 6-3. Column Descriptions of iostat Disk I/O Output
Column |
Description |
---|---|
Device |
Device or partition name |
tps |
I/O transfers per second to the device |
Blk_read/s |
Blocks per second read from the device |
Blk_wrtn/s |
Blocks written per second to the device |
Blk_read |
Number of blocks read |
Blk_wrtn |
Number of blocks written |
rrqm/s |
Number of read requests merged per second that were queued to device |
wrqm/s |
Number of write requests merged per second that were queued to device |
r/s |
Read requests per second |
w/s |
Write requests per second |
rsec/s |
Sectors read per second |
wsec/s |
Sectors written per second |
rkB/s |
Kilobytes read per second |
wkB/s |
Kilobytes written per second |
avgrq-sz |
Average size of requests in sectors |
avgqu-sz |
Average queue length of requests |
await |
Average time in milliseconds for I/O requests sent to the device to be served |
svctm |
Average service time in milliseconds |
%util |
Percentage of CPU time during which I/O requests were issued to the device. Near 100% indicates device saturation |
You can also instruct iostat to display reports at a specified interval. The first report displayed will report averages since the last server reboot; each subsequent report shows statistics since the previously generated snapshot. The following example displays a device statistic report every 3 seconds:
$ iostat -d 3
You can also specify a finite number of reports that you want generated. This is useful for gathering metrics to be analyzed over a period of time. This example instructs iostat to report every 2 seconds for a total of 15 reports:
$ iostat 2 15
Also consider running iostat in combination with the watch command. The watch command is particularly useful when you want to highlight differences in the output with each screen refresh. For example:
$ watch -d iostat
When working with locally attached disks, the output of the iostat command will clearly show where the I/O is occurring. However, it is not that clear-cut in environments that use external arrays for storage. What you are presented with at the file system layer is some sort of a virtual disk that might also have been configured by a volume manager. In virtualized storage environments, you’ll have to work with your system administrator or storage administrator to determine exactly which disks are experiencing high I/O activity.
Once you have determined that you have a disk I/O contention issue, then you can use utilities such as AWR (if licensed), Statspack (no license required), or the V$ views to determine if your database is I/O stressed. For example, the AWR report contains an I/O statistics section with the following subsections:
If you want to display current database sessions that are waiting for I/O resources, you can query the data dictionary as follows:
SELECT a.username, a.sql_id, b.object_name, b.object_type, a.event
FROM v$session a
,dba_objects b
,v$event_name c
WHERE b.object_id = a.row_wait_obj#
AND a.event = c.name
AND c.wait_class = 'User I/O';
6-7. Detecting Network-Intensive Processes
You’re investigating performance issues on a database server. As part of your investigation, you want to determine if there are network bottlenecks on the system.
Solution
Use the netstat (network statistics) command to display network traffic. Perhaps the most useful way to view netstat output is with the -ptc options. These options display the process ID and TCP connections, and they continuously update the output:
$ netstat -ptc
Press Ctrl+C to exit the previous command. Here’s a partial listing of the output:
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
PID/Program name
tcp 0 0 speed3:ssh dhcp-brk-bl8-214-2w-e:59032 ESTABLISHED -
tcp 0 0 speed3:6204 speed3.us.oracle.c:ncube-lm ESTABLISHED -
tcp 0 0 speed3:ssh dhcp-brk-bl8-214-2w-e:63682 ESTABLISHED -
If the Send-Q (bytes not acknowledged by remote host) column has an unusually high value for a process, this may indicate an overloaded network. The useful aspect about the previous output is that you can determine the OS process ID (PID) associated with a network connection. If you suspect the connection in question is an oracle session, you can use the techniques described in the “Solution” section of Recipe 6-8 to map an OS PID to an Oracle process or SQL statement.
How It Works
When experiencing performance issues, usually the network bandwidth is not the cause. Most likely you’ll determine that bad performance is related to a poorly constructed SQL statement, inadequate disk I/O, badly designed processes, or not enough CPU or memory resources. However, as a DBA, you need to be aware of all sources of performance bottlenecks and how to diagnose them. In today’s highly interconnected world, you must possess network troubleshooting and monitoring skills. The netstat utility is a good starting place for monitoring server network connections.
TROUBLESHOOTING DATABASE NETWORK CONNECTIVITY
Use these steps as guidelines when diagnosing Oracle database network connectivity issues:
Step 1. Use the OS ping utility to determine whether the remote box is accessible—for example:
$ ping dwdb
dwdb is alive
If ping doesn’t work, work with your system or network administrator to ensure you have server-to-server connectivity in place.
Step 2. Use telnet to see if you can connect to the remote server and port (that the listener is listening on)—for example:
$ telnet dwdb 1521
Trying 127.0.0.1...
Connected to dwdb.
Escape character is '^]'.
The prior output indicates that connectivity to a server and port is okay. If the prior command hangs, then contact your SA or network administrator for further assistance.
Step 3. Use tnsping to determine whether Oracle Net is working. This utility will verify that an Oracle Net connection can be made to a database via the network—for example:
$ tnsping dwrep
..........
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)
(HOST = dwdb.us.farm.com)(PORT = 1521))
(CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = DWREP)))
OK (500 msec)
If tnsping can’t contact the remote database, verify that the remote listener and database are both up and running. On the remote box, use the lsnrctl status command to verify that the listener is up. Verify that the remote database is available by establishing a local connection as a non-SYS account (SYS can often connect to a troubled database when other schemas will not work).
Step 4. Verify that the TNS information is correct. If the remote listener and database are working, then ensure that the mechanism for determining TNS information (like the tnsnames.ora file) contains the correct information.
Sometimes the client machine will have multiple TNS_ADMIN locations and tnsnames.ora files. One way to verify whether a particular tnsnames.ora file is being used is to rename it and see whether you get a different error when attempting to connect to the remote database.
Step 5. If you’re still having issues, examine the client sqlnet.log file and the remote server listener.log file. Sometimes these log files will show additional information that will pinpoint the issue.
6-8. Mapping a Resource-Intensive Process to a Database Process
Problem
The system is performing poorly and you want to identify which OS intensive process on the box is consuming the most resources. Furthermore, you want to map an OS process back to a database process. If the database process is a SQL process, you want to display the details of the SQL statement (such as the user, explain plan, and so on).
Solution
In Linux/Unix environments, if you can identify the resource intensive OS process, then you can easily check to see if that process is associated with a database process. This procedure consists of the following:
For example, suppose you identify the top CPU-consuming queries with the ps command:
$ ps -e -o pcpu,pid,user,tty,args|grep -i oracle|sort -n -k 1 -r|head
Here is some sample output:
45.7 23651 oracle ? oracleO12C (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
0.8 23255 oracle ? ora_vktm_O12C
0.2 23317 oracle ? ora_mmon_O12C
The prior output identifies one OS process consuming an excessive amount of CPU. The process ID is 23651 and name is oracleO12C. From the process name, it’s an Oracle process associated with the O12C database.
You can determine what type of Oracle process this is by querying the data dictionary:
SELECT
'USERNAME : ' || s.username || CHR(10) ||
'SCHEMA : ' || s.schemaname || CHR(10) ||
'OSUSER : ' || s.osuser || CHR(10) ||
'MODUEL : ' || s.program || CHR(10) ||
'ACTION : ' || s.schemaname || CHR(10) ||
'CLIENT_INFO : ' || s.osuser || CHR(10) ||
'PROGRAM : ' || s.program || CHR(10) ||
'SPID : ' || p.spid || CHR(10) ||
'SID : ' || s.sid || CHR(10) ||
'SERIAL# : ' || s.serial# || CHR(10) ||
'KILL STRING : ' || '''' || s.sid || ',' || s.serial# || '''' || CHR(10) ||
'MACHINE : ' || s.machine || CHR(10) ||
'TYPE : ' || s.type || CHR(10) ||
'TERMINAL : ' || s.terminal || CHR(10) ||
'SQL ID : ' || q.sql_id || CHR(10) ||
'CHILD_NUM : ' || q.child_number || CHR(10) ||
'SQL TEXT : ' || q.sql_text
FROM v$session s
,v$process p
,v$sql q
WHERE s.paddr = p.addr
AND p.spid = '&PID_FROM_OS'
AND s.sql_id = q.sql_id(+)
AND s.status = 'ACTIVE';
The prior script prompts you for the OS process ID. Here is the output for this example:
USERNAME : MV_MAINT
SCHEMA : MV_MAINT
OSUSER : oracle
MODUEL : sqlplus@speed2 (TNS V1-V3)
ACTION : MV_MAINT
CLIENT_INFO : oracle
PROGRAM : sqlplus@speed2 (TNS V1-V3)
SPID : 28191
SID : 10
SERIAL# : 2039
KILL STRING : '10,2039'
MACHINE : speed2
TYPE : USER
TERMINAL : pts/1
SQL ID : 3k8vqz6yycr3g
CHILD_NUM : 0
SQL TEXT : select a.table_name from dba_tables a, dba_indexes b, dba_objects
The output indicates that this is a SQL*Plus process with a database SID of 10 and SERIAL# of 2039. You’ll need this information if you decide to terminate the process with the ALTER SYSTEM KILL SESSION statement (see Recipe 6-9 for details).
In this example, since the process is running a SQL statement, further details about the query can be extracted by generating an execution plan:
SQL> SELECT * FROM table(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id',&child_num));
You’ll be prompted for the sql_id and child_num when you run the prior statement. Here is a partial listing of the output:
SQL_ID 3k8vqz6yycr3g, child number 0
-------------------------------------
select a.table_name from dba_tables a, dba_indexes b, dba_objects c
...
This output will help you determine the efficiency of the SQL statement and provide insight on how to tune it. Refer to Chapter 9 for details on how to manually tune a query and Chapter 11 for automated SQL tuning.
Keep in mind that when selecting from V$SESSION, the records displayed within that view show the current SQL statement that the session is executing. It’s entirely possible that a given session can run several resource consuming queries in succession. Therefore it’s conceivable you might identify a query for a session (Step 3) that wasn’t the SQL executing when you first identified the OS process (Step 1). That said, the prior solution works well for long running queries consuming large amounts of system resources.
How It Works
The process described in the “Solution” section of this recipe allows you to quickly identify resource-intensive processes, then map the OS process to a database process, and subsequently map the database process to a SQL statement. Once you know which SQL statement is consuming resources, then you can generate an execution plan to further attempt to determine any possible inefficiencies.
When generating the execution plan for a SQL statement, you need to provide DBMS_XPLAN both the SQL_ID and CHILD_NUMBER to correctly identify an execution plan. The SQL_ID uniquely identifies the SQL statement. If the SQL statement has executed more than once, it is possible for the optimizer to generate and use one or more execution plans. If a given SQL statement has more than one execution plan associated with it, then the combination of SQL_ID and CHILD_NUMBER uniquely identifies an execution plan for a given SQL statement.
Sometimes the resource-consuming process will not be associated with a database. In these scenarios, you’ll have to work with your SA to determine what the process is and if it can be tuned or terminated. Also, you may encounter resource-intensive processes that are database-specific but not associated with a SQL statement. For example, you might have a long-running RMAN backup process, Data Pump, or PL/SQL jobs running. In these cases, work with your DBA to identify whether these types of processes can be tuned, scheduled to run at another time of day, or killed.
ORADEBUG
You can use Oracle’s oradebugutility to display top consuming SQL statements if you know the OS ID. For example, suppose that you have used a utility such as top or ps to identify a high CPU-consuming OS process, and from the name of the process you determine it’s a database process. Now log in to SQL*Plus and use oradebug to display any SQL associated with the process. In this example, the OS process ID is 23651:
SQL> oradebug setospid 23651;
Oracle pid: 32, Unix process pid: 23651, image: oracle@speed2 (TNS V1-V3)
Now show the SQL associated with this process (if any):
SQL> oradebug current_sql;
If there is a SQL statement associated with the process, it will be displayed—for example:
select a.table_name from dba_tables a, dba_indexes b, dba_objects c
The oradebug utility can be used in a variety of methods to help troubleshoot performance issues. Use oradebug help to display all options available.
6-9. Terminating a Resource-Intensive Process
You have identified a database process that is consuming inordinate amounts of system resources (see Recipe 6-8) and determined that it’s a runaway SQL statement that needs to be killed.
Solution
There are three basic ways to terminate a SQL process:
If you happen to have access to the terminal from which the resource-consuming SQL statement is running, you can attempt to press Ctrl+C to terminate the process. Often you don’t have access to the terminal and will have to use a SQL statement or an OS command to terminate the process.
If it’s an Oracle process and you have identified the SID and SERIAL# (see Recipe 6-8), you can terminate a process from within SQL*Plus. Here is the general syntax:
alter system kill session 'integer1, integer2 [,integer3]' [immediate];
In the prior syntax statement, integer1 is the value of the SID column and integer2 is the value from the SERIAL# column (of V$SESSION). In a RAC environment, you can optionally specify the value of the instance ID for integer3. The instance ID can be retrieved from the GV$SESSION view.
Here’s an example that terminates a process with a SID of 1177 and a SERIAL# of 38583:
SQL> alter system kill session '1177,38583';
If successful, you should see this output:
System altered.
When you kill a session, this will mark the session as terminated, roll back active transactions (within the session), and release any locks (held by the session). The session will stay in a terminated state until any dependent transactions are rolled back. If it takes a minute or more to roll back the transaction, Oracle reports the session as “marked to be terminated” and returns control to the SQL prompt. If you specify IMMEDIATE (optional), Oracle will roll back any active transactions and immediately return control back to you.
Caution Killing a session that is executing a select statement is fairly harmless. However, if you terminate a session that is performing a large insert/update/delete then you may see a great deal of database activity (including I/O in the online redo logs) associated with Oracle rolling back the terminated transaction.
If you have access to the database server and access to an OS account that has privileges to terminate an Oracle process (such as the oracle OS account), you can also terminate a process directly with the kill command.
Caution Ensure that you don’t kill the wrong Oracle process. If you accidentally kill a required Oracle background process, this will cause your instance to abort.
For example, suppose you run the ps command and have done the associated work to determine that you have a SQL statement that has been running for hours and needs to be terminated. The kill command directly terminates the OS process. In this example, the process ID of 6254 is terminated:
$ kill -9 6254
Caution In some rare situations killing an Oracle process associated with a SQL transaction can have an adverse impact on the stability of the instance. For example, killing a process participating in a distributed transaction may cause the instance to crash. To determine if this is an issue for the version of the database you’re using see MOS bug IDs 8686128 and 12961905. In older versions of Oracle various other bugs associated with killing a process have been identified and fixed and are documented in MOS bug IDs 5929055 and 6865378.
Sometimes you’ll find yourself in a situation where you need to kill hung SQL processes, database jobs, or SQL statements that are consuming inordinate amounts of resources. For example, you may have a test server where a job has been running for several hours, is consuming much of the server resources, and needs to be stopped so that other jobs can continue to process in a timely manner.
Manually killing a SQL statement will cause the transaction to be rolled back. Therefore take care when doing this. Ensure that you are killing the correct process. If you erroneously terminate a critical process, this obviously will have an adverse impact on the application and associated data.