CHAPTER 8

image

Analyzing Server Performance

The separation of tasks between a SA and a DBA is often blurred. This blurring of roles can be especially true in small shops in which you wear multiple hats. Even in large organizations with established roles and responsibilities, you’ll still experience an occasional “all-hands-on-deck” fire drill in which you’re expected to troubleshoot server issues. In these scenarios, you must be familiar with the OS commands used to extract information from the server. An expert DBA does not diagnose database problems in a vacuum; you have to be server-savvy.

Whenever there are application-performance issues or availability problems, the first question asked from the DBA’s perspective is usually this one: “What’s wrong with the database?” Regardless of the source of the problem, the burden is often on the DBA to verify whether the database is behaving well. This process sometimes includes identifying server bottlenecks. The database and server have an interdependent relationship. DBAs need to be well-versed in techniques to monitor server activity.

When you have a server that houses dozens of databases, and you experience performance issues, you have to determine whether the bottleneck is related to CPU, memory, I/O, or the network. Furthermore, you have to pinpoint which processes on the box are consuming the most resources. In these scenarios, it is more productive to diagnose issues with OS tools to lead you to the process that is consuming the most server resources. After you identify the process, you can determine whether it is associated with a database and then further identify the type of process (SQL*Plus, RMAN, Data Pump, and so on).

This chapter covers techniques used to analyze the server’s CPU, memory, I/O, and network performance. Take some time to become familiar with the relevant commands covered in each section. Being able to quickly survey system activity will vastly broaden your DBA skill set.

Table 8-1 summarizes the OS utilities commonly used by DBAs and SAs. This table lists the recipe in which the tool is discussed and what aspect of performance the tool covers (note that some tools are covered in other chapters). Being familiar with these OS commands and how to interpret the output will allow you to work as a team with SAs, storage administrators, network engineers, and developers when diagnosing server performance issues.

Table 8-1. Performance and Monitoring Utilities

Table8-1a
Table8-1b

When diagnosing server issues, start with utilities such as vmstat, sar, top, and ps. These tools give you a quick overview of overall system performance. Then you can use other tools to drill down into CPU, memory, I/O, and network details.

8-1. Identifying System Bottlenecks

Problem

Application users are reporting that the database seems slow. You want to determine whether there are any system resource bottlenecks on the database server.

Solution

The vmstat (virtual memory statistics) tool is intended to help you quickly identify bottlenecks on your server. The vmstat command displays real-time performance information about processes, memory, paging, disk I/O, and CPU usage. This example shows using vmstat on a Linux server (the output is slightly different on Solaris) to display the default output with no options specified:

$ vmstat
procs -----------memory------------ --swap-- ----io---- --system-- -------cpu-------
r  b  swpd    free     buff   cache  si   so  bi    bo    in   cs   us sy  id  wa st
0  0     0 1185336 10615632  991736   0    0   5     9     9   12    0  0 100   0  0

Here are some general heuristics you can use when interpreting the output of vmstat:

  • If b (processes sleeping) is consistently greater than 0, you may not have enough CPU processing power.
  • If so (memory swapped out to disk) and si (memory swapped in from disk) are consistently greater than 0, you may have a memory bottleneck. On some systems, these columns may be labeled pi and po.
  • If the wa (time waiting for I/O) column is high, it is usually an indication that the storage subsystem is overloaded.

By default, only one line of server statistics is displayed when running vmstat (without supplying any options). This one line of output displays average statistics calculated from the last time the system was rebooted. Although it is fine for a quick snapshot, use vmstat with this syntax if you want to gather metrics over a period of time:

$ vmstat <interval in seconds> <number of intervals>

In this mode, vmstat reports statistics sampling from one interval to the next. For example, if you want to report system statistics every 2 seconds for 10 intervals, issue this command:

$ vmstat 2 10

The default unit of measure for the memory columns of vmstat is in kilobytes. If you want to view memory statistics in megabytes, use the -S m (statistics in megabytes) option:

$ vmstat -S m

How It Works

If your database server seems sluggish, analyze the vmstat output to determine where the resources are being consumed. Table 8-2 details the meanings of the columns displayed in the default output of vmstat.

Table 8-2. Column Descriptions of vmstat Output

Column

Description

R

Number of processes waiting for runtime

b

Number of processes in uninterruptible sleep

swpd

Total virtual memory (swap) in use (KB)

free

Total idle memory (KB)

buff

Total memory used as buffers (KB)

cache

Total memory used as cache (KB)

si

Memory swapped in from disk (KB/s)

so

Memory swapped out to disk (KB/s)

bi

Blocks read in (blocks/s) from block device

bo

Blocks written out (blocks/s) per second to block device

in

Interrupts per second

cs

Context switches per second

us

User-level code time as a percentage of total CPU time

sy

System-level code time as a percentage of total CPU time

id

Idle time as a percentage of total CPU time

wa

Time waiting for I/O completion

st

Time stolen from virtual machine

You can also send the vmstat output to a file, which is useful for analyzing historical performance over a period of time. This example samples statistics every 5 seconds for a total of 60 reports and then records the output in a file:

$ vmstat 5 60 > vmout.perf

Another useful way to use vmstat is with the watch tool. The watch command is used to execute another program on a periodic basis. This example uses watch to run the vmstat command every 5 seconds and to highlight any differences between each snapshot onscreen:

$ watch -n 5 -d vmstat

When running vmstat in watch -d (differences) mode, you see changes onscreen as they alter from snapshot to snapshot. To exit from watch, press Ctrl+C.

You can obtain a better idea of how vmstat operates by viewing the output while simulating some server activity. First, inspect the output of vmstat while there is little system activity:

$ vmstat 2 10
procs -----------memory------------ ---swap-- ----io----  --system-- -----cpu-------
 r  b   swpd    free   buff   cache  si   so   bi    bo     in   cs  us sy  id wa st
 0  0  44588 3170020 184644 5222292   0    0    2    68      1    0   5  1  94  0  0
 0  0  44588 3169392 184644 5222336   0    0    0     8    497  864   2  0  98  0  0
 0  0  44588 3169400 184644 5222396   0    0    0   220    448  746   0  0 100  0  0

Now simulate server activity by creating a large randomly populated 1GB file with the dd command:

$ dd if=/dev/urandom of=GIGtestfile bs=1M count=1024 oflag=direct &

This should have the effect of increasing the CPU usage as the random numbers are generated. Additionally, there should be I/O activity generated as the file is populated. While the prior command is running in the background (via the & operator), you can run vmstat:

$ vmstat 2 10

Here are a few lines of the output indicating that the CPU is less idle (more busy) and the I/O blocks written out (bo column in the output) parameter has increased:

procs -----------memory------------ --swap--  ----io----  --system-- -----cpu------
 r  b   swpd    free   buff   cache  si   so   bi    bo     in   cs  us sy id wa st
 1  0  44588 3128704 184664 5256092   0    0    0   912    743  920   1 25 74  0  0
 1  0  44588 3108500 184664 5274432   0    0    0   400    664  934   1 25 74  0  0
 5  0  44588 3070472 184664 5311608   0    0    0   366    647  882   1 25 73  0  0

OSWATCHER

Oracle provides a collection of scripts that gather and store metrics for CPU, memory, disk, and network usage. The OSWatcher tool suite automates the gathering of statistics using tools such as top, vmstat, iostat, mpstat, netstat, and traceroute.

You can obtain OSWatcher from the Oracle MOS web site. Search for document ID 301137.1 or for the document titled “OSWatcher User Guides.” Navigate to the Contents page and search for the Download link.

This utility also has an optional graphical component for visually displaying performance metrics. The OSWatcher utility is currently supported on the following platforms: Linux, Solaris, AIX, and HP-UX. For Windows, Oracle recommends using the Cluster Health Monitor (see document ID 736752.1).

8-2. Analyzing Current and Past System Performance

Problem

Users are reporting that the database application seems sluggish every morning at 10:00 a.m. To troubleshoot the issue, you want to view the current CPU, memory, and I/O load on the server; and display resource activity at 10:00 a.m. for previous days in the week.

Solution

The sar (system activity reporter) utility is unique in that it allows you to view current CPU, memory, and I/O resource usage; as well as server activity for a point in time in the past (e.g., an hour ago, yesterday, and so on).

Displaying CPU Use

To show real-time CPU statistics, use the -u (CPU utilization) option and specify a snapshot interval (in seconds) and the number of reports. The following displays current processor activity with a snapshot interval of 2 seconds for a total of 10 reports:

$ sar -u 2 10

Here is some sample output:

12:50:42          CPU     %user     %nice   %system   %iowait    %steal     %idle
12:50:44          all      0.00      0.00      0.00      0.00      0.00    100.00
12:50:46          all      2.49      0.00      0.37      0.12      0.00     97.01
12:50:48          all      0.37      0.00      0.37      0.00      0.12     99.13

The most important column in the output is %idle. A low %idle could be an indication that the CPUs are underpowered or indicative of a high application load. The %iowait column displays the time waiting for I/O. It follows that a high %iowait time indicates that the I/O subsystem is a potential bottleneck.

If you have multiple CPUs, you can view the output per CPU with the -P ALL options. You should now see one line per CPU in the output:

$ sar -u -P ALL

To report on the current day’s CPU activity, use the -u option without an interval:

$ sar -u

To view a previous day’s statistics, use sar with the -f (file) option. On Linux systems, the files that sar uses to report on statistics for different days of the month are logged in the /var/log/sa directory (on Solaris systems, look in the /var/adm/sa directory). These files have the naming convention of saNN, where NN is the two-digit day of the month. So if today is the ninth day of the month, and you want to report on CPU activity for the eighth day, use the following:

$ sar -u -f /var/log/sa/sa08

Keep in mind that sar keeps only the last 7 days of history files by default. See the “How It Works” section of this recipe if you need to increase the retention period.

The output of using the -f option can be quite long. You can report on a time range via the -s (start time) and -e (end time) options. For example, to report on memory load starting at 10:00 a.m. and ending at 11:00 a.m., do so as follows:

$ sar -r -f /var/log/sa/sa08 -s 10:00:00 -e 11:00:00

Some older versions of sar don’t provide an ending time option. You can creatively use commands such as grep to filter the output for the desired times; for example:

$ sar -r -f /var/log/sa/sa08 | grep ^10 | grep AM

In this manner, you can narrow down the range of activity you want to display.

Displaying Memory Activity

Use sar with the -r (report memory) option to report on memory statistics. To show real-time memory statistics, specify a snapshot interval (in seconds) and the number of reports. The following displays current memory activity with a snapshot interval of 2 seconds for a total of 10 reports:

$ sar -r 2 10

Here is a small snippet of output:

08:44:45 AM kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused
kbswpcad
08:44:47 AM   2616532  13005612     83.25    200668   5705848  10243940     44500      0.43
836
...

The output shows the total memory free and used, and the amount of swap space used. A high degree of swapping indicates that you may need more memory. When run in this mode, the output can be wide and lengthy; it doesn’t quite fit within the limits of this physical page.

To report on real–time swapping statistics, specify the -W option. This example generates current swapping statistics snapshots every 3 seconds for a total of 10 reports:

$ sar -W 3 10

Here is some sample output indicating that little or no swapping is occurring:

08:54:01 AM  pswpin/s pswpout/s
08:54:04 AM      0.00      0.00
08:54:07 AM      0.00      0.00

To report on the current day’s memory activity, use the -r option without an interval:

$ sar -r

To view historical memory statistics, use sar with the -f (file) option. For example, to have sar display memory paging statistics for the first day of the month, run it with the -B (report paging statistics) and -f (file) options, as follows:

$ sar -B -f /var/log/sa/sa01

Here is a partial listing of the report:

11:10:01 AM  pgpgin/s pgpgout/s   fault/s  majflt/s
11:20:01 AM      0.02     16.17     18.37      0.00
11:30:01 AM      3.49     21.68     74.15      0.04
11:40:01 AM   4182.58    439.44    320.94      0.68
11:50:02 AM   4960.03    1027.79  4384.73      0.51

The previous output shows that there was a substantial increase in paging in from disk (pgpgin/s), pages paged out to disk (pgpgout/s), and page faults per second (fault/s) at approximately 11:40 a.m.

Displaying I/O Load

Use sar with the -b (report I/O) option to report on I/O statistics. To show real-time I/O statistics, specify a snapshot interval (in seconds) and the number of reports. On Linux systems, the following displays current I/O activity with a snapshot interval of 2 seconds for a total of 10 reports:

$ sar -b 2 10

Here’s a partial snippet of the output (this output may vary depending on your version of Linux and the sar command):

09:01:19 AM       tps      rtps      wtps   bread/s   bwrtn/s
09:01:21 AM     19.39      0.00     19.39      0.00    448.98
09:01:23 AM     13.93      0.00     13.93      0.00    366.17
...

The tps column shows the I/O transfers per second to the device. The rtps indicates read requests per second, and the wtps shows write requests per second.

Image Note  On Solaris systems, use sar with the -d (disk) option to report on disk activity. The -d option may be available on Linux, depending on the version.

To report on the current day’s I/O activity, specify the -b option with no time interval:

$ sar -b

To report on I/O statistics for a previous day in the month, use -b with the -f option. For example, to have sar display disk statistics for the tenth day of the month, run it as follows:

$ sar -b -f /var/log/sa/sa10

How It Works

The sar utility is used to generate current load metrics as well as report on system resource usage for a point in time in the past. If you have the requirement of troubleshooting performance issues that have occurred in the past, sar is the utility to use.

You can think of sar as an AWR or Statspack for the OS. The AWR or Statspack tools allow you to view database activity for a range of time in the past. The sar utility is similar in that it allows you to report on historical server activity for CPUs, memory, and I/O. The AWR or Statspack reports depend on scheduled database jobs that periodically populate permanent database tables with information containing historical database activity. Whereas the sar utility uses cron jobs to periodically populate OS files that can be used for historical reporting regarding server activity.

With that in mind, let’s look more closely at the sar cron jobs, the resource usage history files, and how to manually create a sar file.

Understanding sar cron jobs

This sar utility is configured when you install the sysstat system package. You can check for its existence as follows:

$ sar -V
sysstat version ...

If sysstat isn’t installed, and if you have root access, you can install it with the yum utility (or whatever utility is your standard for installing packages) as follows:

# yum install sysstat

When you install the sysstat package, sar will be installed along with two cron jobs. These cron jobs will be instantiated to create files used by the sar utility to report on historical server statistics. On Linux systems, you can view these cron jobs by looking in the /etc/cron.d/sysstat file; for example, here’s a sample cron entry:

# run system activity accounting tool every 10 minutes
*/10 * * * * root /usr/lib64/sa/sa1 1 1
# generate a daily summary of process accounting at 23:53
53 23 * * * root /usr/lib64/sa/sa2 -A

Once the cron jobs have been running for a few days, to report on a previous day’s CPU statistics, use the -f option to specify the file that corresponds to the day of interest.

On Solaris systems, you can view the sar–related cron jobs by viewing the /var/spool/cron/crontabs/sys file. Here’s a sample entry:

0 * * * 0-6 /usr/lib/sa/sa1
20,40 8-17 * * 1-5 /usr/lib/sa/sa1
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A

Changing sar File Retention Period

On Linux systems, the files that sar uses to report statistics for different days of the month are located in the /var/log/sa directory. On Solaris systems, the sar files are located in the /var/adm/sa directory.

These files have the naming convention of saNN, where NN is the two-digit day of the month. Typically only the last week or so of files will be retained. A quick listing of the /var/adm/sa directory helps to clarify this:

$ cd /var/log/sa
$ ls -la sa[0-9]*

Here is some sample output:

-rw-r--r-- 1 root root 332016 Jun  1 16:50 sa01
-rw-r--r-- 1 root root 332016 Jun  2 16:50 sa02
-rw-r--r-- 1 root root 332016 Jun  3 16:50 sa03
-rw-r--r-- 1 root root 332016 Jun  4 16:50 sa04
-rw-r--r-- 1 root root 332016 Jun  5 16:50 sa05
-rw-r--r-- 1 root root 332016 Jun  6 16:50 sa06
-rw-r--r-- 1 root root 255984 Jun  7 11:20 sa07
-rw-r--r-- 1 root root 332016 May 30 16:50 sa30
-rw-r--r-- 1 root root 332016 May 31 16:50 sa31

From the prior output, the last 9 days of files are retained. These files are created and populated by the sar cron jobs. The sar utility uses the information contained therein to report on historical performance metrics.

On Linux systems, you can control the number of files retained by modifying the HISTORY parameter in the /etc/sysconfig/sysstat file. The location of the sysstat file varies by OS.

On Solaris systems, the old sar files are removed by the /usr/lib/sa/sa2 utility. The last line of the sa2 script finds the oldest sar files and removes them; for example:

/usr/bin/find /var/adm/sa ( -name ’sar*’ -o -name ’sa*’ ) -mtime +7 -exec /usr/bin/rm {} ;

As root, you can manually adjust the -mtime parameter as required. Usually the default of 7 days is sufficient.

Creating a sar File

While reporting on real-time statistics, use the -o (out) option to send output to a file:

$ sar -b 2 10 -o saroutJun1.perf

This code creates a binary output file that can later be used to analyze disk I/O metrics. At some later point, you can use sar with the -f option to report on the contents of that file; for example:

$ sar -b -f saroutJun1.perf

This code provides a way of indefinitely saving the sar metrics for a given point in time. During a quiet period, you might want to do this to establish a baseline of metrics or during a heavy load to capture specific metrics that you want to analyze at some later point (and not worry about the file being automatically deleted).

8-3. Identifying CPU-Intensive Processes

Problem

You want to identify which Oracle session is consuming the most CPU on the database server. If it is an Oracle session running a SQL query, you want to display the associated SQL.

Solution

There are two tools that are useful for quickly identifying top CPU-consuming processes: top and ps. First, let’s discuss top.

Using top

The top utility is one of the first tools a DBA or SA will use to view server resource usage. This utility provides a dynamic report that refreshes every few seconds and displays the top resource-consuming processes; for example:

$ top

Here is a partial listing of the output:

top - 15:31:27 up 7 days,  5:17,  2 users,  load average: 0.17, 0.35, 0.37
Tasks: 222 total,   2 running, 220 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.2%us, 16.3%sy,  0.0%ni, 75.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:  15622144k total, 15524472k used,    97672k free,  9085000k buffers
Swap: 10288440k total,        0k used, 10288440k free,  3160848k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9156 oracle    25   0 2595m 366m 184m R 99.9  2.4  13:51.84 oracle_9156_o12
    1 root      15   0 10368  676  572 S  0.0  0.0   0:00.17 init
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:00.38 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.05 ksoftirqd/0
...

The first five lines show system load, summary of processes running, CPU load, memory load, and swap statistics. The lines after that show individual processes and corresponding resource consumption. For this particular example, the Oracle process 9156 is consuming a great deal of CPU. The output will refresh every few seconds. To exit top, press Ctrl+C.

Using ps

Now compare the use of top with the ps command. The ps command (in combination with the pcpu option) is used to identify the PIDs of sessions consuming the most CPU on the server; for example:

$ ps -e -o pcpu,pid,user,tty,args | sort -n -k 1 -r | head

Here is a partial listing of the output:

31.0  9156 oracle   ?        oracleO1212 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
1.0   9155 oracle   pts/2    sqlplus   as sysdba
...

The first column is the percentage of CPU being consumed. The second column shows the PID (process 9156 is consuming the most CPU on this server). The third column shows that the oracle user is running this process and we can derive the database name from the fifth column to be O1212.

We recommend that you create an alias for the ps command; for example:

$ alias topcpu=’ps -e -o pcpu,pid,user,tty,args | sort -n -k 1 -r | head’

This example allows you to quickly run the command without having to remember the syntax:

$ topcpu

Retrieving Oracle Process Information

The main takeaway from the prior sections using top and ps is that process 9156 is consuming a great deal of CPU on the server and that it is an Oracle process associated with the O1212 database. Now you can use the PID from the output as an input to the following query to show information about the Oracle session responsible for the high consumption of CPU resources:

SET LINES 200 PAGES 0 HEAD OFF LONG 100000
COL dummy_value NOPRINT
--
SELECT ’dummy_value’ dummy_value,
  ’USERNAME    : ’ || s.username     || CHR(10) ||
  ’SCHEMA      : ’ || s.schemaname   || CHR(10) ||
  ’OSUSER      : ’ || s.osuser       || CHR(10) ||
  ’MODULE      : ’ || s.program      || CHR(10) ||
  ’ACTION      : ’ || s.schemaname   || CHR(10) ||
  ’CLIENT INFO : ’ || s.osuser       || CHR(10) ||
  ’PROGRAM     : ’ || s.program      || CHR(10) ||
  ’SPID        : ’ || p.spid         || CHR(10) ||
  ’SID         : ’ || s.sid          || CHR(10) ||
  ’SERIAL#     : ’ || s.serial#      || CHR(10) ||
  ’KILL STRING : ’ || ’’’’ || s.sid || ’,’ || s.serial# || ’’’’  || CHR(10) ||
  ’MACHINE     : ’ || s.machine      || CHR(10) ||
  ’TYPE        : ’ || s.type         || CHR(10) ||
  ’TERMINAL    : ’ || s.terminal     || CHR(10) ||
  ’CPU         : ’ || q.cpu_time/1000000     || CHR(10) ||
  ’ELAPSED_TIME: ’ || q.elapsed_time/1000000 || CHR(10) ||
  ’BUFFER_GETS : ’ || q.buffer_gets  || CHR(10) ||
  ’SQL_ID      : ’ || q.sql_id       || CHR(10) ||
  ’CHILD_NUM   : ’ || q.child_number || CHR(10) ||
  ’START_TIME  : ’ || TO_CHAR(s.sql_exec_start,’dd-mon-yy hh24:mi’) || CHR(10) ||
  ’STATUS      : ’ || s.status       || CHR(10) ||
  ’SQL_TEXT    : ’ || q.sql_fulltext
FROM            v$session s
JOIN            v$process p ON (s.paddr  = p.addr)
LEFT OUTER JOIN v$sql     q ON (s.sql_id = q.sql_id)
WHERE s.username IS NOT NULL -- eliminates background procs
AND NVL(q.sql_text,’x’) NOT LIKE ’%dummy_value%’ -- eliminates this query from output
AND   p.spid           = ’&PID_FROM_OS’
ORDER BY q.cpu_time;

For this example, when you run the prior query and supply to it the PID of 9156, you get the following output:

USERNAME    : SYS
SCHEMA      : SYS
OSUSER      : oracle
MODULE      : sqlplus@dtc07dsg (TNS V1-V3)
ACTION      : SYS
CLIENT INFO : oracle
PROGRAM     : sqlplus@dtc07dsg (TNS V1-V3)
SPID        : 9156
SID         : 91
SERIAL#     : 60916
KILL STRING : ’91,60916’
MACHINE     : dtc07dsg
TYPE        : USER
TERMINAL    : pts/2
CPU         : 275.473216
ELAPSED_TIME: 279.805467
BUFFER_GETS : 64650
SQL_ID      : 1z4xyfmw1rpqy
CHILD_NUM   : 0
START_TIME  : 02-may-15 15:09
STATUS      : ACTIVE
SQL_TEXT    : select a.table_name from dba_tables a, dba_indexes, dba_extents, ...

From the prior output, you see that a SQL*Plus session is consuming a great deal of CPU time. You can also determine when the query started, the username, the kill string, and the SQL identifier from the output.

Once you identify information regarding the process, you can drill down further to display the execution plan and the resources the process is waiting for. For example, you can view the SQL execution plan with the SQL ID and child number:

SQL> SELECT * FROM table(DBMS_XPLAN.DISPLAY_CURSOR(’&sql_id’,&child_num));

Here’s another useful query that uses the SID and serial number to display the state of a session and whether it is working or waiting for a resource:

SELECT sid,
DECODE(state, ’WAITING’,’Waiting’, ’Working’) state,
DECODE(state, ’WAITING’, ’So far ’||seconds_in_wait,
’Last waited ’|| wait_time/100)|| ’ seconds for ’||event
FROM v$session
WHERE sid = ’&&session_id’
AND serial# = ’&&serial’;

This information is very useful when diagnosing performance issues related to Oracle processes running on the server.

How It Works

When you run multiple databases on one server and are experiencing server-performance issues, it can be difficult to identify which database and session are consuming the most system resources. In these situations, use the top utility or the ps command to identify the highest-consuming processes. The top utility provides a dynamic interface that periodically refreshes to give you the current snapshot of resource consumption, whereas the ps command provides a quick one time snapshot of top resource usage.

You may be wondering whether it is better to use top or ps to identify resource-consuming processes. If your server has top installed, top will probably be your first choice; its ease of use and interactive output is hard to beat. Having said that, the ps command is universally available, so you may have to use ps if the top command is not available. Also the ps command may show you a more descriptive program name associated with the process.

Once you have a process identified, if it’s an Oracle process use the SQL query in the “Solution” section to further identify the type of Oracle process. You then have the option of trying to tune the operation (whether it be SQL, RMAN, and so on), or you might want to terminate the process (see recipe 3-2 for details on how to kill a process and/or stop a SQL session).

Explaining top

The top command deserves a little more explanation than what was shown in the “Solution” section of this recipe. By default, top will repetitively refresh (every 3 seconds) information regarding the most CPU-intensive processes. While top is running, you can interactively change its output. For example, if you type >, the column that top is sorting moves one position to the right.

Table 8-3 lists some key features that you can use to alter the top display to the desired format.

Table 8-3. Commands to Interactively Change the top Output

Command

Function

Spacebar

Immediately refreshes the output.

< or >

Moves the sort column one position to the left or to the right. By default, top sorts on the CPU column.

d

Changes the refresh time.

R

Reverses the sort order.

z

Toggles the color output.

h

Displays the help menu.

F or O

Chooses a sort column.

Type q or press Ctrl+C to exit top. Table 8-4 describes several of the columns displayed in the default output of top.

Table 8-4. Column Descriptions of the top Output

Column

Description

PID

Unique process identifier.

USER

OS username running the process.

PR

Priority of the process.

NI

Nice value or process. Negative value means high priority. Positive value means low priority.

VIRT

Total virtual memory used by the process.

RES

Nonswapped physical memory used.

SHR

Shared memory used by the process.

S

Process status.

%CPU

Processes percent of CPU consumption since last screen refresh.

%MEM

Percent of physical memory the process is consuming.

TIME

Total CPU time used by the process.

TIME+

Total CPU time, showing hundredths of seconds.

COMMAND

Command line used to start a process.

You can also run top using the -b (batch mode) option and send the output to a file for later analysis:

$ top -b > tophat.out

While in batch mode, the top command will run until you kill it (by pressing Ctrl+C) or until it reaches a specified number of iterations. You could run the previous top command in batch mode with a combination of nohup and & to keep it running, regardless of whether you were logged on to the system. The danger there is that you might forget about it and eventually create a very large output file (and an angry SA).

If you have a particular process that you’re interested in monitoring, use the -p option to monitor a PID or the -U option to monitor a specific username. You can also specify a delay and number of iterations by using the -d and -n options. The following example monitors the oracle user with a delay of 5 seconds for 25 iterations:

$ top -U oracle -d 5 -n 25

Image Tip  Use the man top or top --help commands to list all the options available with your OS version.

Solaris prstat

Note that on Solaris systems, the prstat utility can also be used to identify which processes are consuming the most CPU resources. For example, you can instruct the prstat to report system statistics every 5 seconds:

$ prstat 5

Here is some sample output:

   PID USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU PROCESS/NLWP
  7601 oracle   8430M 4081M sleep   101    -   3:51:47 2.0% oracle/1
  1614 oracle   6382M 4093M sleep   101    -  48:07:20 0.0% oracle/1
 20071 oracle   8431M 7746M sleep    54    0   0:00:00 0.0% oracle/1

Type q or press Ctrl+C to exit prstat. After identifying a top resource-consuming process, you can determine which database the process is associated with by using the ps command. This example reports on process information associated with the PID of 7601:

$ ps -ef | grep 7601 | grep -v grep
  oracle  7601     1   0   Apr 10 ?         231:47 ora_vktm_DWREP

USING THE /PROC/<PID> FILES TO MONITOR PROCESS ACTIVITY

For every Linux process that is running, a directory is created in the /proc virtual filesystem. For example, if you want to view details regarding the operating PID of 9156, you can navigate to the virtual /proc/9156 directory and do a long listing. There you will see several informational files and directories related to this running process:

$ cd /proc/9156
$ ls -l

Here is a partial listing of the output:

-r-------- 1 oracle dba 0 May  2 15:42 auxv
-r--r--r-- 1 oracle dba 0 May  2 15:02 cmdline
-rw-r--r-- 1 oracle dba 0 May  2 15:42 coredump_filter
-r--r--r-- 1 oracle dba 0 May  2 15:42 cpuset
lrwxrwxrwx 1 oracle dba 0 May  2 15:42 cwd -> /u01/app/oracle/product/12.1.0.2/db_1/dbs
-r-------- 1 oracle dba 0 May  2 15:29 environ
lrwxrwxrwx 1 oracle dba 0 May  2 15:42 exe -> /u01/app/oracle/product/12.1.0.2/db_1/bin/oracle

The output indicates this is an oracle process, and now you can analyze it further by looking at the memory usage maps file or the status file. Because these files don’t exist on disk, use a utility such as cat to display their contents:

$ cat /proc/<PID>/maps
$ cat /proc/<PID>/status

8-4. Identifying CPU Bottlenecks

Problem

You want to monitor the system load on your CPUs.

Solution

As a DBA, you also need to periodically examine the load on CPUs to determine system bottlenecks. The mpstat (multiple processor statistics) utility displays statistics for processors on the server:

$ mpstat

Here’s a snippet of the output:

Linux 2.6.18-308.4.1.0.1.el5xen (rmougserv)      05/02/2015
04:53:12 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:53:12 PM  all    0.10    0.00    0.10    0.10    0.00    0.00    0.01   99.70    179.01

The default output of mpstat shows only one line of aggregated statistics for all CPUs on the server. You can also view snapshots that report statistics accumulated between intervals. The following example uses the -P option to report only on processor 0; it displays output every 2 seconds for a total of 10 different reports:

$ mpstat -P 0 2 10

Here are a few lines of the output:

04:54:22 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:54:24 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     93.94
04:54:26 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     92.61
04:54:28 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    135.68

The amount of idle time is the most important statistic. If the CPUs have a low idle percentage, it is indicative of a high load. See Table 8-5 for an interpretation of the mpstat output.

Table 8-5. Column Definitions for mpstat Processor Statistics

Column

Description

CPU

Processor number. Starts at 0. The all row reports average statistics for all processors.

%user

Percentage of CPU utilization while executing at user level.

%nice

Percentage of CPU utilization while executing at user level with nice priority.

%sys

Percentage of CPU utilization while executing at kernel level.

%iowait

Percentage of time CPUs were idle during an outstanding disk I/O operation.

%irq

Percentage of time spent by CPUs servicing interrupts.

%soft

Percentage of time spent by CPUs to service software interrupts.

%steal

Percentage of time CPUs waiting while the hypervisor servicing another virtual processor.

%idle

Percentage of time that CPUs were idle without outstanding disk I/O operations.

intr/s

Total number of interrupts received per second by CPUs.

On Solaris systems, the output of mpstat is slightly different; for example:

$ mpstat -P 0 2 20
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    8   0  358   795  266  645    5   68  195    3   497    3   3   0  94
   1    8   0  235   477  133  677    5   68  177    3   515    4   3   0  94
   2    7   0  328   609  257  705    5   70  197    3   511    3   3   0  94

The amount of idle time is the most telling statistic. A low idle time is indicative of high processer load.

How It Works

The mpstat utility is useful for specifically analyzing the CPU load. Here are some general guidelines for interpreting its output:

  • If %idle is high, your CPUs are most likely not overburdened.
  • If the %iowait output is a nonzero number, you may have some disk I/O contention.

On multiprocessor servers, you can use the -P ALL options of the mpstat command to print each CPU’s statistics on separate lines:

$ mpstat -P ALL

Here’s a partial listing of the output:

05:07:53 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
05:07:53 PM  all    0.11    0.00    0.12    0.10    0.00    0.00    0.01   99.67    179.26
05:07:53 PM    0    0.18    0.01    0.21    0.36    0.00    0.00    0.01   99.23    117.26
05:07:53 PM    1    0.11    0.00    0.12    0.01    0.00    0.00    0.00   99.76     18.92
05:07:53 PM    2    0.06    0.00    0.04    0.01    0.00    0.00    0.00   99.89     21.05
05:07:53 PM    3    0.09    0.00    0.10    0.01    0.00    0.00    0.00   99.80     22.04

The prior output shows that this server has four CPUs (indicated by a line for CPUs 0, 1, 2, and 3). The %idle column is in the 99% range, indicating that there is little load on the CPUs on this box.

You can also save the output of mpstat to a file. This example saves to a file all CPU activity reported every 10 seconds for 100 times:

$ mpstat -P ALL 10 100 > mpperf.perf

This code allows you to save performance statistics so that you can analyze and contrast performance for different time periods.

8-5. Identifying Memory-Intensive Processes

Problem

You want to identify which Oracle session is consuming the most memory on the database server. If it is an Oracle session running a SQL query, you want to display the associated SQL.

Solution

You can use either the top utility or the ps command to display top memory using processes. Let’s look at top first.

Using top

The easiest way to run top is as follows:

$ top

By default, top displays the output sorted by CPU usage. To shift the output to sort by memory usage, use the > key to shift the reporting output one column to the right (%MEM). Here is some sample output:

Tasks: 223 total,   1 running, 222 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  15622144k total, 15369420k used,   252724k free,  8759664k buffers
Swap: 10288440k total,        0k used, 10288440k free,  3505156k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3217 cloudera  19   0 2770m 1.0g  17m S  0.0  6.9   0:52.84 java
  637 oracle    15   0 2409m 404m 394m S  0.0  2.7   0:00.65 ora_dbw0_o1212
 4204 emcadm    16   0  671m 249m  12m S  0.0  1.6   1:02.31 java
 2525 oracle    25   0 2462m 193m 186m S  0.0  1.3   0:15.40 oracle_2525_o12

Using ps

You can also use the ps command with the pmem option to identify the top memory-consuming processes and their associated PIDs. We recommend that you search specifically for Oracle processes because the ps output can be quite lengthy; for example:

$ ps -e -o pmem,pid,user,tty,args | grep -i oracle | sort -n -k 1 -r | head

Here is some sample output:

1.2   625 oracle   ?        ora_mman_O1212
0.7   655 oracle   ?        ora_mmon_O1212
0.4   691 oracle   ?        ora_cjq0_O1212

From the second column in the previous output, the process with the ID of 625 is consuming 1.2 percent of the memory. Now that you have the PID, you can use the query from the “Solution” section of recipe 8-2 to further identify the type of Oracle process (e.g., RMAN, SQL*Plus, Data Pump, and so forth).

How It Works

If you’re experiencing performance issues, determining which processes are consuming the most memory will give you another piece of information to troubleshoot the issue. The top and ps commands are quite useful in this regard. The top command provides an interactive way to dynamically view memory usage patterns, whereas the ps command is more useful as a one-line command to get a snapshot of memory activity. We recommend that you create an alias for the ps command; for example:

$ alias topmem=’ps -e -o pmem,pid,user,tty,args | grep -i oracle | sort -n -k 1 -r | head’

This code will allow you to quickly run the command without having to remember the syntax:

$ topmem

You should use the memory usage information along with the CPU usage (see recipe 8-2) to determine which processes are consuming the most resources and then further investigate those processes.

8-6. Identifying Memory Bottlenecks

Problem

You want to view the current usage of memory on your database server.

Solution

Paging and swapping activity is an indicator of the efficiency of memory usage on your sever. In general, high amounts of paging and swapping indicate an inadequate amount of memory. Numerous utilities are available to monitor paging and swapping. For example, you can use vmstat (virtual memory statistics) to monitor the current memory usage. In this line of code, vmstat reports are generated every 2 seconds for a total of 3 reports:

$ vmstat 2 3

Here is some sample output:

procs  -----------memory----------- --swap-- ----io---- --system-- ------cpu------
 r  b   swpd   free    buff   cache  si   so   bi    bo   in   cs  us sy  id wa st
 0  0      0 358448 8753168 3539596   0    0    4    10    8    1   1  1  98  0  0
 0  0      0 358424 8753168 3539596   0    0    0    16  416  637   0  0 100  0  0
 0  0      0 358424 8753168 3539596   0    0    0   378  484  631   0  0 100  0  0

If you have a fairly recent version of Linux, you can also use the -a option, which displays active and inactive memory. Here is an example of running vmstat with the -a option:

$ vmstat -a 2 3

Here’s what the output looks like with the additional columns:

procs -----------memory----------- ---swap-- -----io--- --system-- ------cpu-------
 r  b  swpd   free   inact  active  si   so   bi    bo    in   cs   us sy  id wa st
 0  0     0 358812 9180892 4841900   0    0    4    10     8    1    1  1  98  0  0
 0  0     0 358812 9180892 4841912   0    0    0    62   427  637    0  0 100  0  0
 0  0     0 358812 9180892 4841912   0    0    0     0   421  629    0  0 100  0  0

If your server shows high amounts of memory swapped in from disk (si column) or the amount of memory swapped out to disk (so column), you may have a memory bottleneck.

The output of vmstat on Solaris reports the same type of information as Linux systems; for example:

$ vmstat 2 3
 kthr      memory            page            disk          faults      cpu
 r b w     swap    free  re  mf pi po fr de sr vc  vc  vc --   in   sy    cs us sy  id
 0 0 0 18541480 2994160  20 119  0  0  0  0  0  2 184 146  0 9203 8216 10930  3  3  94
 0 0 0 18463736 3091248 138 335  0  0  0  0  0  2   6   5  0 5188 3617  5081  0  0  99
 0 0 0 18464824 3090824   0   3  0  0  0  0  0  0   2  38  0 4792 2256  4600  0  0 100

In this output, the page section reports on swapping, and I/O is reported under disk.

How It Works

One of the main indicators of memory health is the amount of paging and swapping that is occurring. If you read five different Linux performance–tuning white papers, you’ll get five slightly different definitions of paging and swapping. We do not split hairs about the exact definitions of those terms; our statement is that in general, paging and swapping are the movement of the contents of memory to and from disk.

Paging and swapping occur when there isn’t enough physical memory to accommodate all the memory needs of the processes on the server. When paging and swapping take place, performance usually suffers because the process of copying memory contents to and from disk is an inherently slow activity. A tool such as vmstat can help you identify excessive swapping.

Using free

You can also use the free command to display current memory used, both physical and virtual (swap):

$ free
             total       used       free     shared    buffers     cached
Mem:      15622144   15269820     352324          0    8753168    3539596
-/+ buffers/cache:    2977056   12645088
Swap:     10288440          0   10288440

From the previous output, you see that this system has 15GB of RAM, almost all of it being used. It has about 10GB of swap space, almost none of which is used. Don’t be too alarmed if your Linux system is using most of its physical memory; that’s typical on many Linux servers.

Image Note  See Chapter 9 for details on using ipcs to view the memory and semaphores used by your database.

You can use the -s option to have the free command report output on a repeating interval. This example uses free to display memory usage in two-second snapshots and sends the output to a file:

$ free -s 2 > freemem.perf

Press Ctrl+C to exit from free when using the -s option. By default, the free output reports memory usage in kilobytes. Use -m to print in megabytes or -g to display the output of free in gigabytes.

Using watch

An effective way to use free is in combination with the watch command. The watch command is used to execute another program on a periodic basis. This example uses watch to run the free utility every 3 seconds via the -n (interval) option. The -d (differences) option is used to have the output highlighted onscreen when there is a change in value from snapshot to snapshot:

$ watch -n 3 -d free
Every 3.0s: free                                 Sun May  3 17:42:33 2015
             total       used       free     shared    buffers     cached
Mem:      15622144   15264004     358140          0    8753168    3539700
-/+ buffers/cache:    2971136   12651008
Swap:     10288440          0   10288440

You should be able to visually see any changes in memory activity onscreen when running in this mode. To exit from watch, press Ctrl+C.

You can also view the current characteristics of memory by viewing the /proc/meminfo file. You can use the file to display the current physical memory and swap space being used. This example uses the cat utility to display the current memory usage:

$ watch -d cat /proc/meminfo

By default, the watch command will refresh the screen every 2 seconds. You should visually see differences highlighted from interval to interval:

Every 2.0s: cat /proc/meminfo               Sun May  3 17:44:38 2015

MemTotal:     15622144 kB
MemFree:        358044 kB
Buffers:       8753168 kB
Cached:        3539912 kB
SwapCached:          0 kB
Active:        4842728 kB
SwapTotal:    10288440 kB
SwapFree:     10243872 kB
...

If you see an unusual amount of swap space being used (low SwapFree), it is an indication that your server needs more memory. To exit from watch, press Ctrl+C.

8-7. Identifying I/O-Intensive Processes

Problem

You want to determine which processes are generating the most I/O.

Solution

Use the iotop utility to display the top I/O-generating processes. On most systems, running iotop requires root access (signified in this example by the # character command prompt). Here’s the simplest way to invoke it:

# iotop

Here is some sample output:

  TID  PRIO   USER      DISK READ   DISK WRITE  SWAPIN  IO>     COMMAND
21553  be/4 oracle      0.00 B/s    0.00 B/s    -5.02 % 99.99 % [oracle_21553_o1]
24423  be/4 oracle      0.00 B/s    0.00 B/s     0.00 %  1.99 % ora_p003_O1212
24348  be/4 oracle      0.00 B/s    0.00 B/s     0.00 %  1.36 % ora_lgwr_O1212
24331  be/4 oracle      0.00 B/s    0.00 B/s     0.00 %  0.72 % ora_gen0_O1212
24401  be/4 oracle      0.00 B/s    0.00 B/s     0.00 %  0.64 % ora_arc3_O1212
24427  be/4 oracle      0.00 B/s    0.00 B/s     1.99 %  0.56 % ora_p005_O1212

This output displays the overall disk read and disk write rates, along with the PID, user, and command. If it’s an Oracle process, you can use the SQL query from the “Solution” section of recipe 8-2 to retrieve further details regarding the process from the data dictionary.

How It Works

The iotop utility is a top-like utility that displays processes and associated I/O. There are many different modes you can use. For example, if you want to display I/O associated with one user, you can do so as follows:

# iotop -user oracle

If you want to change the refresh rate (in seconds), you can specify the interval with the -d switch. The following instructs iotop to refresh the screen every 5 seconds:

# iotop -d 5

The iotop is an effective utility that displays top I/O-consuming processes on a server. When you’re experiencing disk I/O bottlenecks, this tool provides a quick method for identifying processes that may need further investigation.

8-8. Identifying I/O Bottlenecks

Problem

You want to determine whether your disk storage is a bottleneck.

Solution

The iostat command can help you determine whether disk I/O is potentially a source of performance problems. Using the -x (extended) option with the -d (device) option is a useful way to generate I/O statistics. This example uses the -x and -d options to display extended device statistics every 10 seconds:

$ iostat -xd 10

You need a really wide screen to view this output; here’s a partial listing:

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz
avgqu-sz   await  svctm  %util
sda          0.01   3.31  0.11  0.31    5.32   28.97     2.66    14.49    83.13
0.06  138.44   1.89   0.08

Image Note  On Solaris systems, the iostat output may report the disk utilization as %b (percent busy). Also on Solaris systems, use iostat -Mnxz <n> to display output using megabytes, list descriptive names, show extended statistics, and remove lines with all zeros.

This periodic extended output allows you to view which devices are experiencing spikes in read and write activity in real time. To exit from the previous iostat command, press Ctrl+C.

When trying to determine whether device I/O is the bottleneck, here are some general guidelines when examining the iostat output:

  • Look for devices with abnormally high blocks read or written per second.
  • If any device is near 100% utilization, it is a strong indicator that I/O is a bottleneck.

Once you have determined that you have a disk I/O–contention issue, you can use utilities such as AWR (if licensed), Statspack (no license required) or the V$ views to determine whether your database is I/O stressed. For example, the AWR report contains an I/O statistics section with the following subsections:

  • IOStat by Function Summary
  • IOStat by Filetype Summary
  • IOStat by Function/Filetype Summary
  • Tablespace IO Stats
  • File IO Stats

If you want to display current database sessions that are waiting for I/O resources, you can query the data dictionary as follows:

SELECT a.username, a.sql_id, b.object_name, b.object_type, a.event
FROM v$session a
    ,dba_objects b
    ,v$event_name c
WHERE b.object_id = a.row_wait_obj#
AND   a.event = c.name
AND   c.wait_class = ’User I/O’;

How It Works

The iostat command can help you determine whether disk I/O is potentially a source of performance problems. If you execute iostat without any options, you’ll get a default report that displays averages since the system was last started:

$ iostat
avg-cpu:  %user   %nice    %sys %iowait   %idle
          18.91    0.04    1.20    0.15   79.70

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               7.14       398.01       409.52  164484368  169239542
sda1              0.00         0.00         0.00       1538        166
sda2              54.15      396.92       407.74  164032098  168505032
sda3              0.35         1.04         1.77     429820     733168

Notice that there are two sections in the iostat output. The first section is the CPU Utilization Report. The second section relates to disk I/O and is referred to as the Device Utilization Report.

Table 8-6 describes the columns used for disk I/O. (Use the -d option of iostat to display only device statistics.)

Table 8-6. Column Descriptions of iostat Disk I/O Output

Column

Description

Device

Device or partition name

tps

I/O transfers per second to the device

Blk_read/s

Blocks per second read from the device

Blk_wrtn/s

Blocks written per second to the device

Blk_read

Number of blocks read

Blk_wrtn

Number of blocks written

rrqm/s

Number of read requests merged per second that were queued to device

wrqm/s

Number of write requests merged per second that were queued to device

r/s

Read requests per second

w/s

Write requests per second

rsec/s

Sectors read per second

wsec/s

Sectors written per second

rkB/s

Kilobytes read per second

wkB/s

Kilobytes written per second

avgrq-sz

Average size of requests in sectors

avgqu-sz

Average queue length of requests

await

Average time in milliseconds for I/O requests sent to the device to be served

svctm

Average service time in milliseconds

%util

Percentage of CPU time during which I/O requests were issued to the device; near 100% indicates device saturation

You can also instruct iostat to display reports at a specified interval. The first report displayed reports averages since the last server reboot; each subsequent report shows statistics since the previously generated snapshot. The following example displays a device statistic report every 3 seconds:

$ iostat -d 3

To exit from the previous iostat command, press Ctrl+C. You can also specify a finite number of reports that you want generated, which is useful for gathering metrics to be analyzed over a period of time. This example instructs iostat to report every 2 seconds for a total of 15 reports:

$ iostat 2 15

When you work with locally attached disks, the output of the iostat command clearly shows you where the I/O is occurring. However, it is not that clear-cut in environments that use external arrays for storage. What you are presented with at the filesystem layer is some sort of a virtual disk that might also have been configured by a volume manager. Virtual disks are often referred to as volumes or logical units (LUNs).

A LUN is a logical disk that physically comprises one or more physical disks. The LUN represents the virtualization layer between the physical disks and the applications running on the database server. Figure 8-1 illustrates at a high level the abstraction involved with virtual disks.

9781484212554_Fig08-01.jpg

Figure 8-1. Abstraction layers between database application and physical disks

When you work with virtual disks, the output from iostat reports on read/write activity at the virtual disk level, not the underlying physical disks. In these situations, there may be many layers of abstraction between the database application and physical disks, which can make it difficult to isolate the exact source of an I/O bottleneck. We recommend that you work closely with your storage administrator to determine whether a particular set of LUNs and underlying physical disks are a source of poor I/O performance.

DETERMINING I/O RATES

To get an estimate of the I/O rate of your storage system, use the dd command in combination with the time command. On Linux systems, use the ofile and ifile parameters to simulate direct I/O; otherwise, you’re reading and writing to memory and will receive misleading results. For example, use the following to estimate writes:

$ time dd if=/dev/zero of=testfile bs=8k count=1000 oflag=direct

The following estimates reads:

$ time dd of=/dev/null if=testfile bs=8k count=1000 iflag=direct

These metrics are helpful for determining the speed at which the OS can write and read a file. Keep in mind that it will provide estimates for sequential writes/reads. You would have to consider something more sophisticated to simulate database I/O activity (random writes/reads and multiple processes). Having said that, this approach is still a good starting point for diagnosing I/O issues.

On Solaris systems, the dd command doesn’t have the oflag and iflag options. If the storage is ZFS, use the zfs utility to set caching to metatdata via zfs set primarycache=metadata <filesystem>. You will most likely have to contact your storage administrator to run the prior command with the appropriate filesystem.

8-9. Monitoring Network Traffic

Problem

You suspect that the network might be a bottleneck. You want to view network statistics.

Solution

Use the netstat (network statistics) command to display network traffic. Perhaps the most useful way to view netstat output is with the -ptc options. These options display the PID and TCP connections, and they continuously update the output:

$ netstat -ptc

Press Ctrl+C to exit the previous command. Here’s a partial listing of the output:

(Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.) Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address  Foreign Address  State       PID/Program name
tcp        0      0 rmug.com:62386 rmug.com:1521    ESTABLISHED 22864/ora_pmon_RMDB
tcp        0      0 rmug.com:53930 rmug.com:1521    ESTABLISHED 6091/sqlplus
tcp        0      0 rmug.com:1521  rmug.com:53930   ESTABLISHED 6093/oracleRMDB1
tcp        0      0 rmug.com:1521  rmug.com:62386   ESTABLISHED 10718/tnslsnr

If the Send-Q (bytes not acknowledged by remote host) column has an unusually high value for a process, it may indicate an overloaded network. The useful aspect about the previous output is that you can determine the OS PID associated with a network connection. If you suspect that the connection in question is an oracle session, you can use the techniques described in the “Solution” section of recipe 8-2 to map an OS PID to an oracle process or SQL statement.

Image Note  On Linux systems, the /proc/net directory stores information about current network settings and activity.

How It Works

When performance issues occur, the network is usually not the cause. Most likely you’ll determine that bad performance is related to a poorly constructed SQL statement, inadequate disk I/O, or not enough CPU or memory resources. However, as a DBA, you need to be aware of all sources of performance bottlenecks and how to diagnose them. In today’s highly interconnected world, you must possess network troubleshooting and monitoring skills. The netstat utility is a good starting place for monitoring server network connections.

On Linux systems, you can also use the sar command with the -n option to report on network statistics. The -n option takes one of the following as an argument: DEV (network devices), EDEV (error count), SOCK (sockets), or FULL (all). The following command displays the current day’s network device statistics:

$ sar -n DEV

Here’s a limited listing of the output:

12:00:01 AM  IFACE  rxpck/s txpck/s  rxbyt/s  txbyt/s  rxcmp/s  txcmp/s  rxmcst/s
12:10:01 AM     lo     0.00    0.00     0.00     0.00     0.00     0.00      0.00
12:10:01 AM   eth0     0.34    0.11    39.17    10.22     0.00     0.00      0.04
12:10:01 AM   eth1     0.00    0.00     0.00     0.00     0.00     0.00      0.00
12:10:01 AM   sit0     0.00    0.00     0.00     0.00     0.00     0.00      0.00

The previous output shows the number of packets transmitted and received per second, as well as the bytes and compressed packets (snapshots taken in 10-minute intervals).

If you experience performance issues when connecting to remote servers, sometimes it is useful to see the route that a packet takes to get from one server to another. To use traceroute, log on to the server and then trace the route to the remote server:

$ traceroute <remote_server_name>

This information is meaningful to a network engineer and can troubleshoot network performance issues.

8-10. Troubleshooting Database Connectivity

Problem

You’re attempting to connect to a database and are receiving the following error:

ERROR:
ORA-12154: TNS:could not resolve the connect identifier specified

You want to determine the root cause of this issue.

Solution

To diagnose database connectivity issues, first use the OS ping utility to determine whether the remote box is accessible; for example:

$ ping dwdb
dwdb is alive

If ping doesn’t work, work with your SA or network administrator to ensure that you have server-to-server connectivity in place.

Next, use telnet to see whether you can connect to the remote server and port (that the listener is listening on); for example:

$ telnet dwdb 1521
Trying 127.0.0.1...
Connected to dwdb.
Escape character is ’^]’.

This output indicates that connectivity to a server and port is okay. If it doesn’t work (you don’t see “Connected to” in the output), contact your SA or network administrator for further assistance.

Now use tnsping to determine whether Oracle Net is working. This utility will verify that the Oracle Net listener is running on the remote server; for example:

$ tnsping dwrep
..........
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)
(HOST = dwdb.us.farm.com)(PORT = 1521))
(CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = DWREP)))
OK (500 msec)

What’s sometimes confusing is that the prior output indicates only that the listener is up and receiving requests; it doesn’t necessarily indicate that there actually is a service (DWREP in this example) registered with the listener on the remote host. To further verify that the actual service is registered with the listener, you have to log on to the remote server and run the lsnrctl utility to verify the services registered with the listener:

$ lsnrctl services <listener_name>

If you don’t provide a listener name, the default name of LISTENER is assumed. You can also use the lsnrctl status <listener_name> command to verify that the listener is up and which services are registered with the listener. If you’re not sure what the listener name is, use the ps command to identify it:

$ ps -ef | grep tns

Here is some sample output that indicates that the default listener name (LISTENER) is used on this server:

oracle   27480     1  0 10:09 ?        00:00:00
/u01/app/oracle/product/12.1.0.2/db_1/bin/tnslsnr LISTENER -inherit

Also verify that the remote database is available by establishing a local connection as a non-SYS account (SYS can often connect to a troubled database when other schemas can’t); for example:

$ sqlplus system/manager

If you can’t connect via this command, verify that a critical mount point hasn’t filled up via the df command:

$ df -h

If Oracle can’t write to files in the ORACLE_HOME/dbs directory, or if archiving is enabled and the archive location is full, the database will hang and not accept new connections.

If everything looks good to this point, verify that the TNS information is correct. If the remote listener and database are working, ensure that the mechanism for determining TNS information (such as the tnsnames.ora file) contains the correct information. Sometimes the client machine will have multiple TNS_ADMIN locations and tnsnames.ora files. One way to verify whether a particular tnsnames.ora file is being used is to rename it and see whether you get a different error when you attempt to connect to the remote database.

How It Works

Diagnosing database connectivity issues can be quite frustrating. If you’ve followed all the suggestions in the “Solution” section of this recipe and are still having issues, examine the client sqlnet.log file and/or the server listener.log file. Sometimes these log files show additional information that will pinpoint the issue. The locations of these files can vary by the Oracle release and how a DBA has implemented Oracle Net. Usually the Oracle Net files are in the following directory:

$ORACLE_BASE/diag/tnslsnr/instance_name/listener/trace

If you can’t find the files, use the find command; for example:

$ cd $ORACLE_BASE
$ find . -name listener.log

Here is some sample output:

./diag/tnslsnr/dwrep/listener/trace/listener.log

If all else fails, you can enable higher levels of output to the Oracle Net logging files. See the Oracle Database Net Services Administrator’s Guide for details (freely available for download on the docs.oracle.com web site).

8-11. Tracing a Process

Problem

You’re troubleshooting database network connectivity issues and want to trace the tnsping process to determine which tnsnames.ora file is being used. In other words, you want to determine the directory locations and the order in which tnsping looks for tnsnames.ora files.

Solution

This solution has two sections: one for using strace on Linux and one for using truss on Solaris.

Using strace

To reiterate, the task at hand is to determine what order and in which directories the tnsping utility is looking for tnsnames.ora files. You know that tnsping looks in multiple locations, but aren’t sure of the order. The strace tool can provide this information. First, run strace to trace tnsping without any options:

$ strace tnsping bogus_service

This code generates a great deal of output; here’s a small snippet:

access("/u01/app/oracle/product/12.1.0.2/db_1/network/admin/tnsnames.ora", F_OK)
= -1 ENOENT (No such file or directory)
mmap(NULL, 143360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
= 0x2ac9aa003000
munmap(0x2ac9aa003000, 143360)          = 0
stat("/u01/app/oracle/product/12.1.0.2/db_1/ldap/admin/ldap.ora", 0x7fffba717a30) = -1
ENOENT (No such file or directory)
brk(0x126ff000)                         = 0x126ff000
...

Most of the output isn’t useful. To extract the relevant lines, pipe the output to grep and cut the field (delimited by double quotes) of interest:

$ strace tnsping bogus_sid 2>&1 >/dev/null|grep ’.ora’|cut -f2 -d"|cut -f1 -d"|uniq

Here is the useful output showing the directory locations and order in which tnsping is looking for tnsnames.ora files:

/u01/app/oracle/product/12.1.0.2/db_1/network/admin/sqlnet.ora
/home/oracle/.tnsnames.ora
/u01/app/oracle/product/12.1.0.2/db_1/network/admin/tnsnames.ora
/var/yp/binding/us.oracle.com.2
/u01/app/oracle/product/12.1.0.2/db_1/ldap/admin/fips.ora
/u01/app/oracle/product/12.1.0.2/db_1/ldap/admin/ldap.ora
/u01/app/oracle/product/12.1.0.2/db_1/network/admin/ldap.ora

This output shows the order, location, and file names that the tnsping process searches for when attempting to connect to a service registered with a listener.

Using truss

On a Solaris system, you can use truss to trace a process. First, use truss to trace tnsping without any options:

$ truss tnsping bogus_service

Here’s a small section of a large amount of output:

stat("/orahome/app/oracle/product/12.1.0.2/db_1/network/admin/ldap.ora", 0xFFFFFFFF7FFF5EB0) Err#2 ENOENT
stat("/etc/resolv.conf", 0xFFFFFFFF7FFF6540)    = 0
open("/etc/resolv.conf", O_RDONLY)              = 7

Most of that output isn’t useful. To extract the relevant lines, pipe the output to grep and cut the relevant field (delimited by double quote marks):

$ truss tnsping bogus_service 2>&1 | grep ’.ora’ | cut -f2 -d" | cut -f1 -d" | uniq

Here’s the information showing the order and location of the files that tnsping is accessing:

/orahome/app/oracle/product/12.1.0.2/db_1/network/admin/sqlnet.ora
/orahome/oracle/.tnsnames.ora
/orahome/app/oracle/product/12.1.0.2/db_1/network/admin/tnsnames.ora
/orahome/app/oracle/product/12.1.0.2/db_1/ldap/admin/fips.ora
/orahome/app/oracle/product/12.1.0.2/db_1/ldap/admin/ldap.ora
/orahome/app/oracle/product/12.1.0.2/db_1/network/admin/ldap.ora

So by tracing the process, you can glean more insight on its inner workings.

How It Works

Tracing utilities gives you a window into internal system calls (and parameters) initiated by a process. On rare occasions, you may need to trace a process. In our experience, DBAs rarely use tracing facilities. However as the “Solution” section demonstrates, tracing a process is an effective troubleshooting technique in certain situations. The strace (Linux) and truss (Solaris) utilities provide the capability to trace a process.

The “Solution” section showed examples of initiating a trace on a process as the command is interactively executed. It is also possible to trace a process that is already running. Suppose that you want to trace an Oracle background process. First determine its PID:

$ ps -ef | grep smon
oracle   24354     1  0 09:46 ?        00:00:00 ora_smon_O1212

On Linux, run strace and provide the PID to it:

$ strace -p 24354

To send the output to a file, use the -o option:

$ strace -o smon.txt  -p 24354

On Solaris, run truss with the appropriate PID. First, use ps to determine the PID of the process:

$ ps -ef | grep smon
  oracle 18618     1   0   Jun 12 ?           1:04 ora_smon_TRG

Now use truss to trace the process:

$ truss -p 18618

Here the output is sent to a file while tracing:

$ truss -o smon.txt -p 18618

Note that the dtrace utility is available for Linux and Solaris. This utility is an entire framework that has its own scripting language that provides insight into all aspects of the OS. Using dtrace can be quite involved but critical if you require extensive tracing capabilities. The strace and truss tools are better suited to trace a single process.

8-12. Listing Files Opened by Processes

Problem

You’ve logged on to a database server and want to help diagnose a problem. After logging on to the server, you realize there’s no mechanism to automatically set the oracle OS variables, so the standard variables (ORACLE_HOME, ORACLE_SID, and so on) haven’t been set for your user. You want to determine the location of ORACLE_HOME.

Solution

Identify the Oracle instances running on the server. Use the ps and grep commands to isolate any Oracle SMON background processes:

$ ps -ef | grep smon

The output indicates that the PID of 25128 is associated with the SMON background process:

oraus    25128     1  0 Mar18 ?        00:02:23 ora_smon_semgc1

Now use the lsof (list open files) command to identify files opened by that process. The following line of code searches for the string "oracle" in the output:

$ lsof -p 25128 | grep ’bin/oracle’ | awk ’{ print $9 }’

Here is some sample output:

/orahome/app/oracle/product/12.1.0.1/db_1/bin/oracle

In this way, you can quickly determine the location of ORACLE_HOME, even when the standard variables haven’t been set for your current user logon.

How It Works

The lsof command is ideal for identifying processes and associated open files. This command is quite flexible and has many options. For example, if you want to view processes that have opened a particular file, you can do so by passing the file name to the command; for example:

$ lsof /u01/dbfile/O1212/system01.dbf

Here are a few lines of the output:

COMMAND     PID   USER   FD   TYPE DEVICE  SIZE/OFF     NODE NAME
ora_dbrm_ 16201 oracle  256u   REG  202,2 524296192 14065848 /u01/dbfile/O1212/system01.dbf
ora_dbw0_ 16207 oracle  258uW  REG  202,2 524296192 14065848 /u01/dbfile/O1212/system01.dbf
ora_lgwr_ 16209 oracle  260u   REG  202,2 524296192 14065848 /u01/dbfile/O1212/system01.dbf
...

If you want to view processes that have opened files under a specific directory, you can do so with the +D option:

$ lsof +D /u01/dbfile/O1212

To report on files opened by a particular user, use the -u option:

$ lsof -u oracle

On Solaris systems, the lsof command may not be available. Instead, use the pfiles command to list open files. For example, if you want to identify ORACLE_HOME in a Solaris environment, first identify the SMON processes running on the server:

$ ps -ef | grep smon
oracle 22184     1   0   Mar 13 ?           5:15 ora_smon_EMREP

To identify files opened by the prior process, use the pfiles command, pass it the PID, and then search for any helpful keywords such as the string "dbs":

$ pfiles 22184 | grep dbs

Here is some output:

/orahome/app/oracle/product/12.1.0.2/db_1/dbs/hc_EMREP.dat

In this way, you can determine the location of files used by Oracle processes.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset