Oracle Database is the most complete, and at the same time the most complex, database in today’s marketplace. Every time Oracle launches a new release, a host of new features becomes available, and anyone who works as an Oracle database administrator (DBA) knows how difficult it is to stay current on all feature sets. It is entirely possible that an Oracle DBA may be required to manage multiple databases running on completely different releases of Oracle. To add to this complexity, each database may have several distinct features enabled, including various forms of table and index partitioning, Oracle Advanced Compression, replication between databases using either Oracle Streams or Oracle GoldenGate, Oracle Real Application Clusters (RAC), and many others. A RAC database is one of the most difficult environments to administer because its architecture, designed primarily to offer high availability for database applications, differs markedly from that of a single-instance database. This chapter therefore focuses on the global cache (gc) buffer busy wait event, one of the most commonly encountered wait events in a RAC database.
As its name implies, any buffer busy wait event in an Oracle database simply means that at least one session is waiting for a buffer in the instance’s database buffer cache to become available. In Oracle Database releases prior to 10.1, there was just one such event, buffer busy waits; starting with that release, a new event, gc buffer busy, was added to the mix for RAC databases to help monitor buffer busy waits related to cache fusion.
Starting with Oracle Database 12c Release 1, the Oracle Database 12c Reference Guide (Oracle, 2016) now describes four wait events to consider when discussing this type of wait event:
Buffer busy: A session cannot pin the buffer in the buffer cache because another session has pinned the buffer.
Read by other session: A session cannot pin the buffer in the buffer cache because another session is reading the buffer from disk.
GC buffer busy acquire: A session cannot pin the buffer in the buffer cache because another session is reading the buffer from the cache of another database instance in the RAC cluster database.
GC buffer busy release: A session cannot pin the buffer in the buffer cache because another session on another database instance in the RAC cluster database is bringing that buffer from a different instance’s buffer cache into its own cache so it can be pinned.
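You can confirm which of these events exist in your release, and the wait class to which each belongs, with a quick query against V$EVENT_NAME. This is a minimal sketch; on releases prior to 12c, some of these event names simply return no rows:

```sql
-- List the buffer-busy-related wait events and their wait classes.
-- On pre-12c releases, some of these names may not be present.
SELECT name, wait_class
FROM   v$event_name
WHERE  name IN ('buffer busy waits',
                'read by other session',
                'gc buffer busy acquire',
                'gc buffer busy release')
ORDER BY name;
```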
This chapter focuses on some important topics related to the gc buffer busy wait event:
How to leverage the ORAchk utility to help keep a RAC database healthy
How to use Automatic Workload Repository, Automatic Database Diagnostic Monitor, and Active Session History to find important information regarding gc buffer busy wait events
How to investigate whether a RAC database is experiencing an unnecessarily high frequency of the gc buffer busy wait event, and how to handle and repair this situation
The original RACcheck tool was renamed ORAchk in early 2015, and for an excellent reason: it not only works for RAC databases but also supports many additional products and features, such as Oracle GoldenGate, Oracle E-Business Suite, Oracle Sun systems, and Oracle Enterprise Manager Cloud Control. ORAchk has thus been transformed into a serious audit tool for all Oracle database environments.
The following examples show how simple it is to deploy and use the ORAchk tool within a real-world Exadata X4-2 full rack environment. A full rack contains eight database nodes and 14 Exadata storage cells, but in this example, the full rack has been subdivided between two RAC databases, one using six nodes and the other using the remaining two nodes. Both databases leverage the Automatic Storage Management (ASM) disk groups resident across all 14 storage cells.
Complete details about how to download, install, and leverage the ORAchk utility can be found in MOS Note 1268927.2, “ORAchk—Health checks for the Oracle stack.”
1. Download the file described in the MOS note previously mentioned, transfer the file to one of the Exadata database nodes, and execute the following commands while logged in as the root user or a user with root permissions. (ORAchk must be installed by a user with root privileges.)
[root@ex01dbadm01 tmp]# cd
[root@ex01dbadm01 ~]# mkdir ORAchk
[root@ex01dbadm01 ~]# cd ORAchk/
[root@ex01dbadm01 ORAchk]# unzip /tmp/orachk.zip
Archive: /tmp/orachk.zip
inflating: UserGuide.txt
inflating: collections.dat
inflating: rules.dat
...
inflating: .cgrep/lcgrep6
inflating: .cgrep/profile_only.dat
inflating: .cgrep/auto_upgrade_check.pl
inflating: .cgrep/diff_collections.pl
inflating: CollectionManager_App.sql
inflating: orachk
2. After unzipping all ORAchk files, review the appropriate readme file to see whether there are any new instructions pertaining to installing and configuring ORAchk.
3. Run the ORAchk utility and answer all questions, as shown. Note that the warning about the binary being older than 120 days can be ignored if the file has just been downloaded:
[root@ex01dbadm01 ORAchk]# ./orachk
This version of orachk was released on 09-Oct-2014 and its older than 120 days. No new version of orachk is available in RAT_UPGRADE_LOC. It is highly recommended that you download the latest version of orachk from my oracle support to ensure the highest level of accuracy of the data contained within the report.
Do you want to continue running this version? [y/n][y]
CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to /u01/app/11.2.0.4/grid?[y/n][y]
...
4. You can deploy the ORAchk utility so that it will automatically verify whether a newer version is available, as well as send an email to the appropriate account after each regularly scheduled execution, as follows:
[root@ex01dbadm01 ORAchk]#
./orachk -set "AUTORUN_SCHEDULE=3 1 * *;[email protected]"
Created AUTORUN_SCHEDULE for ID[orachk.default]
Created NOTIFICATION_EMAIL for ID[orachk.default]
[root@ex01dbadm01 ORAchk]#
The AUTORUN_SCHEDULE value can be interpreted as follows:
AUTORUN_SCHEDULE * * * * :- Automatic run at a specific time in daemon mode.
                 | | | |
                 | | | +----- day of week (0 - 6, Sunday to Saturday)
                 | | +------- month (1 - 12)
                 | +--------- day of month (1 - 31)
                 +----------- hour (0 - 23)
Thus "AUTORUN_SCHEDULE=3 1 * *" schedules an automatic run at 3:00 a.m. on the first day of every month.
5. To verify whether ORAchk has been customized in any way, run the following command:
[root@ex01dbadm01 ORAchk]# ./orachk -get all
ID: orachk.default
----------------------------------
AUTORUN_SCHEDULE = 3 1 * *
NOTIFICATION_EMAIL = [email protected]
[root@ex01dbadm01 ORAchk]#
6. Finally, you can configure ORAchk to start automatically after any server reboot by running the following command:
./orachk -initsetup
After the ORAchk utility finishes its execution, it generates a report in HTML format that lists all information regarding the environment in which the tool was executed. Figure 3.1 shows the main page of this output, which lists any findings with a status of FAILED or WARNING that should be investigated.
Reviewing the details for one of the steps marked as FAILED is as simple as clicking on the related link, as shown in Figure 3.2.
As this ORAchk detailed report section shows, there are issues that the Oracle DBA can research to isolate other potential causes behind a gc buffer busy wait event. In this case, there is an issue with the configuration of the Exadata’s InfiniBand network components that may be contributing to poor application performance.
The three key tools that every Oracle DBA who has worked on a database since Oracle 10g should know about—Automatic Database Diagnostic Monitor (ADDM), Automatic Workload Repository (AWR) reports, and Active Session History (ASH) reports—are also extremely useful for detecting performance issues related to the gc buffer busy wait event. The next sections explain exactly how to locate crucial information about the statements, user sessions, and database objects that are causing an Oracle database to perform poorly because of high occurrences of the gc buffer busy wait event.
ADDM is one of the fastest methods that an Oracle DBA can leverage to find specific recommendations about a database application workload that has been executed over a specific period of time, including which SQL statements are encountering a performance bottleneck. ADDM can draw on information retained within the AWR; it can also be executed in real time, in which case the most recent set of ASH data will be used for its analysis.
Listing 3.1 shows an excerpt from an ADDM report that was run against the database during the time that excessive gc buffer busy wait events were encountered.
...
RECOMMENDATION 1: Schema, 84.4% benefit (17609 seconds)
ACTION: Consider partitioning the INDEX "MID_B2W_ADMIN.STM_LOG_DATA_IDX"
with object id 131712 in a manner that will evenly distribute
concurrent DML across multiple partitions.
RELEVANT OBJECT: database object with id 131712
RATIONALE: The INSERT statement with SQL_ID "fv4un8f4w6zg8" was
significantly affected by "buffer busy" waits.
RELEVANT OBJECT: SQL statement with SQL_ID fv4un8f4w6zg8
insert into STM_LOG (NM_LOGIN, DS_ROLES, DS_OPERATION, DT_CRIACAO,
CD_MARCA, CD_LOG) values (:1, :2, :3, :4, :5, :6)
...
Notice that ADDM was intelligent enough to isolate the specific performance issue to a particular SQL statement and connect that statement to the gc buffer busy wait event. In fact, ADDM even offered a suggestion to partition the affected database object to potentially alleviate the issue. However, we can confirm this suggestion through additional means—AWR and ASH reports—as the next sections demonstrate.
Creating an AWR report is extremely simple; it can be done by issuing just one command from within SQL*Plus and then responding to the prompts that define which time period(s) the AWR report should span:
$> sqlplus / as sysdba
SQL> ?/rdbms/admin/awrrpt
...
Typical AWR report output usually contains an incredible amount of information about an Oracle database’s application workload behavior. When a database instance is suffering from a gc buffer busy wait event during the time period chosen for the AWR report, however, that event will usually surface as one of the Top 5 Timed Events, as shown in Figure 3.3.
As Figure 3.3 shows, the gc buffer busy event is the third-most prominent wait event; this situation is far from optimal because it means the database instance is waiting excessively for cache fusion to complete. The excessive wait time may indicate that the root cause is a serious issue with the performance of the private interconnect network itself.
Another helpful source of information is the Segments by Global Cache Buffer Busy report, shown in Figure 3.4.
In this example, two segments, STM_LOG_DATA_IDX and PRC_ITEM, are experiencing the largest number of gc buffer busy waits. These segments should therefore be investigated to determine why they account for almost 77 percent of all waits in this category.
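The same segment-level picture that AWR summarizes can also be pulled directly from the dynamic performance views at any time. The following query is a sketch; the statistic name 'gc buffer busy' is the one tracked in GV$SEGMENT_STATISTICS:

```sql
-- Segments ranked by gc buffer busy waits, across all RAC instances
SELECT inst_id, owner, object_name, object_type,
       value AS gc_buffer_busy_waits
FROM   gv$segment_statistics
WHERE  statistic_name = 'gc buffer busy'
AND    value > 0
ORDER BY value DESC;
```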
Figure 3.5 shows the Global Cache and Enqueue Services – Workload Characteristics report. It indicates that this database instance is experiencing significant interconnect problems because, on average, it is taking almost 0.5 seconds to receive a single buffer across the private interconnect.
This metric is calculated using the following formula:
gc cr block receive time =
      time for the foreground process to send the message to the remote LMS process
    + time taken by LMS to build the block   (statistic: gc cr block build time)
    + LMS wait for LGWR latency              (statistic: gc cr block flush time)
    + LMS send time                          (statistic: gc cr block send time)
    + wire latency
The AWR report sections therefore offer concrete evidence that something is seriously wrong with this database’s environment. We will present some queries later in this chapter to show how to locate the SQL_ID, statement, block, and other information about the database server process that is causing this problem. It’s important to remember that the root cause of this problem must be verified first at the hardware level—a malfunctioning network interface card (NIC), a failing network switch, misconfigured networking parameters, among many other possibilities—before pointing a finger at any other layers of the application and system as potential root causes of the problem.
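As a first sanity check at that level, verify which network interface each instance is actually using for the private interconnect; if an instance has silently fallen back to the public network, gc receive times will suffer badly. A minimal sketch:

```sql
-- Confirm the interface registered for the private interconnect on each instance
SELECT inst_id, name, ip_address, is_public, source
FROM   gv$cluster_interconnects;
```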
Generating an ASH report can help you locate the specific SQL statements that are experiencing performance problems related to the gc buffer busy wait event. For example, the Top User Events report section shown in Listing 3.2 from the generated ASH report illustrates that this wait event is definitely an issue within the selected 15-minute reporting period.
Top User Events DB/Inst: BWMDPR/BWMDPR1 (Feb 25 17:40 to 17:55)
Avg Active
Event Event Class % Activity Sessions
----------------------------------- --------------- ---------- ----------
CPU + Wait for CPU CPU 51.49 2.49
db file sequential read User I/O 13.17 0.64
gc buffer busy Cluster 5.63 0.27
direct path read User I/O 3.61 0.17
db file scattered read User I/O 3.17 0.15
-------------------------------------------------------------
Another part of the same ASH report also proves that the database is encountering gc buffer busy waits. Listing 3.3 shows the Top Blocking Sessions section that identifies which sessions are blocking other sessions and the event that these sessions were most commonly waiting for.
Top Blocking Sessions DB/Inst: BWMDPR/BWMDPR1 (Feb 25 17:40 to 17:55)
-> Blocking session activity percentages are calculated with respect to
waits on enqueues, latches and "buffer busy" only
-> '% Activity' represents the load on the database caused by
a particular blocking session
-> '# Samples Active' shows the number of ASH samples in which the
blocking session was found active.
-> 'XIDs' shows the number of distinct transaction IDs sampled in ASH
when the blocking session was found active.
Blocking Sid % Activity Event Caused % Event
--------------- ---------- ------------------------------ ----------
User Program # Samples Active XIDs
-------------------- ------------------------------ ------------------ --------
5074,38287 1.15 gc buffer busy 0.80
MID102_B2W_WL_APP 165/901 [ 18%] 0
4375,33093 1.13 read by other session 0.85
This report identifies that session ID 5074 with serial number 38287 is one of the sessions waiting on the gc buffer busy wait event; it is also blocking other sessions, forcing them to wait until that task finishes and the resource becomes available again.
Finally, Listing 3.4 displays which object is responsible for generating the gc buffer busy wait event.
Top DB Objects DB/Inst: BWMDPR/BWMDPR1 (Feb 25 17:40 to 17:55)
-> With respect to Application, Cluster, User I/O and buffer busy waits only.
Object ID % Activity Event % Event
--------------- ---------- ------------------------------ ----------
Object Name (Type) Tablespace
----------------------------------------------------- -------------------------
131712 4.37 gc buffer busy 2.62
MID_B2W_ADMIN.STM_LOG_DATA_IDX (INDEX) TS_MID_B2W_ADMIN_INDEX_M_
gc current block busy 1.36
While ADDM, AWR, and ASH reports are valuable for both historical and real-time analysis, it’s also possible to isolate this information in real time when gc buffer busy wait events are occurring without leveraging these tools, as the final sections of this chapter demonstrate.
Almost every Oracle DBA has experienced a situation when having just the right query at her fingertips at just the right time has saved her IT organization from utter disaster. The next sections present several queries that can quickly help isolate and analyze the root causes of—as well as the objects most affected by—the gc buffer busy wait event.
Various dynamic views offer a look deep within an Oracle database’s wait events from the perspective of ASH. The query in Listing 3.5 shows how to identify the top wait events in the database over the past hour, ranked by the total amount of time waited per event. In this particular scenario, it is apparent that the top wait event is gc buffer busy.
SELECT * FROM (
SELECT
h.event "Wait Event",
SUM(h.wait_time + h.time_waited)/1000000 "Total Wait Time"
FROM v$active_session_history h,
v$event_name e
WHERE h.sample_time < (SELECT MAX(sample_time)
FROM v$active_session_history)
AND h.sample_time > (SELECT MAX(sample_time) - 1/24
FROM v$active_session_history)
AND h.event_id = e.event_id
AND e.wait_class <> 'Idle'
GROUP BY h.event
ORDER BY 2 DESC)
WHERE ROWNUM <10;
Wait Event Total Wait Time
------------------------------- ---------------
gc buffer busy 40.23931
enq: TX - row lock contention 32.385347
enq: TX - index contention 28.62571
gc current block busy 25.963209
db file sequential read 14.387571
LNS wait on SENDREQ 13.18233
gc cr multi block request 12.478076
reliable message 5.038086
cr request retry 4.887495
Now that we have identified that the gc buffer busy wait event is definitely the key to this application workload’s poor performance, the next step is to use this information to locate the particular statement, user session, database object, and even individual database block if necessary that is suffering from this contention. The query in Listing 3.6 shows how to obtain all necessary information, including the affected object ID, object name, object type, its datafile, and—most important—the SQL_ID of the session that is suffering gc buffer busy waits.
COL object_name FORMAT A30
COL program FORMAT A30
COL event FORMAT A30
SELECT DISTINCT
current_obj#,
o.object_name,
o.owner,
o.object_type,
current_file#,
session_state,
sql_id,
event
FROM v$active_session_history a,
dba_objects o
WHERE a.current_obj# = o.object_id
AND a.event LIKE '%gc buffer busy%';
CURR_ OBJECT_ OBJ_ CURR_
OBJ# NAME OWNER TYPE FILE# SESSION SQL_ID EVENT
------ ------------ --------- ------ ------ -------- -------------- --------------
104830 STM_DETALHE_ MID_B2W INDEX 445 WAITING 39j77bgam9506 gc buffer busy
LOG_VALOR _ADMIN
79829 STM_ITEM MID_B2W TABLE 140 WAITING 0ba26mnwcw7x2 gc buffer busy
_ADMIN
55681 STM_ITEM MID_B2W TABLE 141 WAITING 9apqp7fwnqsy7 gc buffer busy
_GROUP _ADMIN
131712 STM_LOG_ MID_B2W INDEX 445 WAITING fv4un8f4w6zg8 gc buffer busy
DATA_IDX _ADMIN
It is now obvious which objects are affected by gc buffer busy waits. After running the following query to identify the matching SQL text for SQL_ID 39j77bgam9506, it is evident that the statement is actually performing an INSERT:
SQL> select sql_text from v$sqltext where sql_id='39j77bgam9506' order by piece;
SQL_TEXT
----------------------------------------------------------------
insert into STM_DETALHE_LOG (CD_LOG, DS_ATRIBUTO, DS_VALOR, TP_D
ETALHE_LOG, CD_DETALHE_LOG) values (:1, :2, :3, :4, :5)
It’s therefore likely that this operation will benefit from a change to the index structure, such as re-creating the affected index as a hash-partitioned index rather than a standard nonpartitioned index. For instance, if this index were re-created with 64 partitions, the concurrency demands placed on any single index structure would drop significantly because the INSERTs would be spread across 64 separate index partitions rather than concentrated in just one index structure as before.
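As an illustration only, the DDL for such a change might look like the following. The index name is taken from the earlier ASH output, but the key column and partitioning details are assumptions; always extract the real definition with DBMS_METADATA.GET_DDL before attempting a change like this:

```sql
-- Hypothetical re-creation of the index as a 64-way hash-partitioned index.
-- The column (cd_log) is assumed from the INSERT statement shown earlier.
DROP INDEX mid_b2w_admin.stm_detalhe_log_valor;

CREATE INDEX mid_b2w_admin.stm_detalhe_log_valor
    ON mid_b2w_admin.stm_detalhe_log (cd_log)
    GLOBAL PARTITION BY HASH (cd_log)
    PARTITIONS 64
    ONLINE;   -- keeps the table available to DML during the build
```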
This next real-world example demonstrates several queries that can be executed within SQL*Plus to quickly isolate exactly what is causing a performance bottleneck within a database, including the capture of which database user account, session, and statement is causing the problem.
1. Verify the time interval to be analyzed in pursuit of the database bottleneck. The following query confirms that the timeframe between the chosen AWR snapshot IDs is indeed the period needed for the analysis about to be performed:
col min for a30
col max for a30
SQL> SELECT
MIN(begin_interval_time) min,
MAX(end_interval_time) max
FROM dba_hist_snapshot
WHERE snap_id BETWEEN 54657 AND 54658;
MIN MAX
------------------------------ ------------------------------
28-FEB-15 09.00.15.104 AM 28-FEB-15 11.00.04.693 AM
2. Determine which class of wait events is, as a whole, exhibiting the most waits for the instance during that AWR snapshot timeframe:
SQL> SELECT
wait_class_id,
wait_class,
COUNT(*) cnt
FROM dba_hist_active_sess_history
WHERE snap_id BETWEEN 54657 AND 54659
GROUP BY wait_class_id, wait_class
ORDER BY 3;
WAIT_CLASS_ID WAIT_CLASS CNT
------------- ---------------------------------------- ---------
4166625743 Administrative 23
3290255840 Configuration 23
3386400367 Commit 111
4217450380 Application 147
2000153315 Network 233
4108307767 System I/O 236
3875070507 Other 544
1893977003 Cluster 633
1740759767 User I/O 1019
3871361733 Concurrency 3402
11 rows selected.
From this query’s output, it’s obvious that the Concurrency wait class is encountering the highest waits for this time period.
3. Armed with this information, use the following query to find the event within that wait class that accounts for the most waits; gc buffer busy is the worst offender:
SELECT
event_id,
event,
COUNT(*) cnt
FROM dba_hist_active_sess_history
WHERE snap_id BETWEEN 54657 AND 54659
AND wait_class_id = 3871361733
GROUP BY event_id, event
ORDER BY 3;
EVENT_ID EVENT CNT
---------- --------------------------------------------- --------
...
2277737081 gc current grant busy 147
3046984244 gc cr block 3-way 194
737661873 gc cr block 2-way 318
111015833 gc current block 2-way 451
2701629120 gc current block busy 473
1478861578 gc buffer busy 1333
24 rows selected.
This query’s output corroborates the fact that the gc buffer busy wait event is encountering the highest waits for this time period.
4. Run this next query to isolate the SQL_ID of statements that were run within sessions waiting for the gc buffer busy wait event:
SELECT
sql_id,
COUNT(*) cnt
FROM dba_hist_active_sess_history
WHERE snap_id BETWEEN 54657 AND 54659
AND event_id = 1478861578
GROUP BY sql_id
HAVING COUNT(*)>1
ORDER BY 2;
SQL_ID CNT
----------------------- ---
...
0s34c5d0n7577 48
5bwdfzr1s4cx0 70
gppjjfhgxnbxx 94
fv4un8f4w6zg8 690
52 rows selected.
5. Locate the SQL text that corresponds to the identified SQL_ID via this simple query:
SQL> SELECT
DISTINCT sql_text
FROM gv$sqltext
WHERE sql_id = 'fv4un8f4w6zg8';
SQL_TEXT
----------------------------------------------------------------
insert into STM_LOG (NM_LOGIN, DS_ROLES, DS_OPERATION, DT_CRIACA
O, CD_MARCA, CD_LOG) values (:1, :2, :3, :4, :5, :6)
Like the example obtained from running an ASH report in prior sections, this report points to a potential issue: an INSERT statement is executing during the same timeframe in which the same indexes are being accessed by too many user sessions simultaneously. This is the most likely culprit for generating high wait counts for the Concurrency wait class and, more specifically, the gc buffer busy wait event.
Since every Oracle database is certainly unique, it is neither possible nor particularly desirable to implement the same repairs for the gc buffer busy wait events in all cases. However, there are several well-known avenues for potential investigation and repair of the scenarios we have presented so far:
If the object experiencing the gc buffer busy waits is an index, verify the possibility of re-creating the index using a HASH partitioning methodology and/or consider re-creating the index as a reverse key index.
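For the reverse key option, an existing index can be rebuilt in place. A sketch, assuming the index name from the earlier ADDM report; note the trade-off in the comments:

```sql
-- Rebuild the existing index as a reverse key index (hypothetical index name).
-- Reverse key indexes eliminate right-hand leaf block contention for
-- monotonically increasing keys, but they disable index range scans on the key.
ALTER INDEX mid_b2w_admin.stm_log_data_idx REBUILD REVERSE ONLINE;
```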
If the database was just migrated from a single instance environment to a RAC environment, consider creating a new database service that will isolate the application to using only one node of the RAC database, thus isolating the application workload so that it accesses just one of the RAC database instances. This approach should significantly reduce global cache waits and is probably the simplest workaround to repair this issue; however, this approach also effectively limits the possibility of load-balancing the affected application workloads between multiple nodes. If the application workload is part of Oracle’s E-Business Suite (EBS), it is possible to create separate database services within each EBS module (AP, OM, IVN, and so forth).
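For example, on Oracle 11.2 a singleton service pinning the workload to the first instance, with the second instance available only for failover, could be created with srvctl. The service name here is illustrative; the database name follows the earlier report examples:

```
$ srvctl add service -d BWMDPR -s STM_APP_SVC -r BWMDPR1 -a BWMDPR2
$ srvctl start service -d BWMDPR -s STM_APP_SVC
$ srvctl status service -d BWMDPR -s STM_APP_SVC
```

Clients that connect through the STM_APP_SVC service will then run only on instance BWMDPR1, so the blocks they touch are mastered and cached on a single node.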
A smaller database block size may help to improve or even eliminate block contention and the gc buffer busy wait event because fewer blocks will need to be handled when locating those needed to complete a transaction successfully. Of course, using an absurdly small block size just to overcome this problem may be equally counterproductive. Remember that a 2 KB block size has already yielded approximately 10 percent of its space to its block header and significantly reduces the amount of data that can be stored within a single block for the segments that are suffering from wait events; it could also contribute to significant row chaining and/or row migration as well as slower performance for full table scans for larger tables. And don’t forget that the only way to use a different block size than the database’s default block size is to create a tablespace with that smaller block size and then move the appropriate objects to that tablespace—definitely not a trivial exercise.
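Setting this up requires a buffer cache for the nondefault block size plus a new tablespace, followed by moving the affected segments. A sketch for a 4 KB block size; cache size, tablespace name, and segment names are illustrative, and the DATAFILE clause without a filename assumes Oracle Managed Files or ASM:

```sql
-- A cache for the nondefault block size must exist before the tablespace:
ALTER SYSTEM SET db_4k_cache_size = 256M SCOPE=BOTH SID='*';

CREATE TABLESPACE ts_small_blk
    DATAFILE SIZE 10G AUTOEXTEND ON
    BLOCKSIZE 4K;

-- Move the affected segments; indexes must be rebuilt, not moved:
ALTER TABLE mid_b2w_admin.stm_log MOVE TABLESPACE ts_small_blk;
ALTER INDEX mid_b2w_admin.stm_log_data_idx REBUILD TABLESPACE ts_small_blk;
```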
If Automatic Segment Space Management (ASSM) is not in use, be sure to check the configuration of the segment’s free lists for recommended proper settings; better yet, be sure to slate these objects to an early migration to a tablespace that already employs ASSM as soon as possible.
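To check whether a segment’s tablespace already uses ASSM, and to review the freelist settings of a segment in a manually managed tablespace, a sketch (the owner and table name are examples from the earlier reports):

```sql
-- Which tablespaces use ASSM ('AUTO') versus manual segment space management?
SELECT tablespace_name, segment_space_management
FROM   dba_tablespaces;

-- Freelist settings for a specific segment in a MANUAL tablespace
SELECT owner, table_name, freelists, freelist_groups
FROM   dba_tables
WHERE  owner = 'MID_B2W_ADMIN'
AND    table_name = 'STM_LOG';
```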
This chapter illustrated several different methods to detect and repair problems related to the gc buffer busy wait event. It demonstrated how to use the main reports that Oracle Database provides automatically through Oracle tools such as ADDM, AWR, and ASH to locate the bottleneck, as well as some valuable SQL*Plus queries that can quickly help analyze a database’s performance related to this wait event. Some final points to remember:
Every Oracle database is different; even if you isolate a problem and it appears to be the exact same issue you just encountered on another database, be sure to check all dependent objects before making any change. A recommended change may have no impact at all, positive or negative, but tools such as Oracle Database Replay and Oracle SQL Performance Analyzer can help ascertain whether there are any unexpected side effects, however minimal.
In a reactive situation regarding gc buffer busy waits, it is important to capture all necessary data—ADDM, AWR, and ASH reports—and retain it for as long as possible (perhaps via snapshots in Oracle 11.2.0).
Don’t forget to proceed methodically when diagnosing and alleviating the true root cause(s) of gc buffer busy waits. While it may be tempting to seize upon a single potential cause, there may actually be multiple root causes—for example, a malfunctioning NIC in concert with poorly written application code combined with an inappropriate block size for the application’s data.