
8. Case Studies


In the previous chapter, I provided a framework for making the many decisions necessary when implementing the DOPA process. The discussion focused on how you would choose to run the analysis and which parameters and views you would select. In this chapter, I'll lead you through several real examples.

The case studies presented in this chapter show the DOPA process in action. They detail the analysis of historic AWR metrics and the identification of root cause(s) of real reported Oracle database performance problems. These represent just a sampling of the kinds of problems encountered by database tuning professionals, but taken together they demonstrate the dynamic nature of the DOPA process and its usefulness across problem types.

The DOPA process can provide valuable information and direct the tuning of multiple types of problems. The following is a short list of the types of problems for which a dynamic analysis proves very useful:
  • Identifying source of performance bottlenecks:
    • IO

    • CPU

    • Memory

    • Network

  • Identifying internal Oracle features involved in the performance issue:
    • Subsystems

    • Configuration

  • Determining trends for thousands of metrics

  • Identifying application issues:
    • Parsing

    • Workload

    • SQL tuning

  • Failure detection:
    • ORA-600 errors

    • TEMP

    • PGA aggregate limit exceeded

  • Analysis of metric normal ranges:
    • Threshold setting for monitoring

In any organization with mission-critical or otherwise important systems, it is essential to be able to quickly identify root causes of issues and to react promptly by applying the required corrections or tuning. As with any data analysis effort, context is key; without context for the issue you are analyzing, the results will have little meaning. In the context of database performance tuning (whether the issue is well defined or not so well defined), the unusual metric values flagged by the DOPA process provide excellent input into the process of weaving the story behind the performance issue at hand.

With the approximately 11,000 metrics instrumented thus far in the DOPA process, there are about 2 million data points in a week's worth of AWR data (rolled up in 1-hour intervals). Since we're interested in performance bottlenecks, I designed the DOPA process to show only the metrics that are likely to be contributing to the problem. There will likely be many metrics that are unfamiliar to the new user of the DOPA process. While the metric names are usually informative as to what they are measuring, one might not be able to find much Oracle documentation on a given metric. Google is your friend to the extent that the metric is well documented, but not all of them are. Therefore, the DOPA process user will need to put on their expert hat, leveraging their knowledge of computer science and Oracle internals. The point is that you will need to become familiar with a larger set of metrics than you are likely accustomed to. The DOPA process does not predetermine any metrics [although a future version will rank some well-known predictive metrics], and it is not designed to tell you which specific metrics are involved when a very specific performance issue is reported. Thus, it is the task of the DOPA process user to exercise good judgment and decide what metrics or problem areas to focus on for further analysis. The implementation of taxonomies is very helpful in this regard since they bundle the metrics into higher-level categories for easier interpretation.

The analysis of the performance tuning expert is necessary to synthesize the results of the model produced by running the DOPA process for a given time interval. For example, a metric can be flagged by the DOPA process as out of normal range, but this condition could be a knock-on effect of another problem: memory issues or poorly tuned SQL can surface in the metrics as IO problems (i.e., unusually high IO can be a consequence of the real problem, such as low SGA memory or poorly tuned SQL). The user of the DOPA process will likely encounter this kind of behavior, as I did, time and time again. The model produced by the DOPA process is a very effective tool, but it must be used skillfully by the tuning professional, who must take the objective information it affords and convey the metrics-based "story" that best expresses the performance issue at hand.

For the following case studies, I had access to production Oracle instances that contained one week's worth of historic performance metrics in the DBA_HIST views (in some cases, up to one month's worth of data). Each case represents a real production performance incident. Each was solved in the "heat of battle" using the DOPA process.

Case Studies

In each of the case studies, I have answered the following questions to provide you with a framework for understanding the problem and a guide to the thought process I used in resolving the issue with the DOPA predictive model:
  • What was the user’s complaint? How did the problem manifest? Error messages?

  • What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?

  • What action did I take to resolve the issue?

  • Why is this case particularly interesting?

  • What impact did the metrics analysis bring?

Case #1: Basic Example Showing How DOPA Process Identifies a SQL Problem

What was the user’s complaint? How did the problem manifest? Error messages?

The application had poor performance during a particular time interval. Users asked us to check to see if there were backups or other processes occurring on the database that might be impacting their application.

What type of analysis did I run with the DOPA process? Metrics flagged, interesting patterns?

I ran my code using a Metrics Aggregate View for the problem interval since the users provided a well-defined interval. Flagged metrics identified high values for several TEMP tablespace-related metrics as shown in Figure 8-1.
Figure 8-1

Flagged metrics reveal high values for several TEMP tablespace-related metrics

Since TEMP usage will come from SQL and sometimes SQL can be tuned to reduce TEMP consumption, I looked at the ASH during the time interval during which the problem was occurring to find the top TEMP-producing SQL. I ran that SQL through the SQL Tuning Advisor and examined the SQL Stats and ASH for the SQL [it took 8 minutes to run, returning 300,000 rows (1000 at a time) over the wide area network; session elapsed time was 30 minutes]. SQL Tuning Advisor suggested a parallel SQL Profile at ~60% improvement. In this case the root cause was not really TEMP consumption, but rather the fact that the query was running over the WAN (app server in the UK with the database in the United States) with a low array fetch size.
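As an illustration, a query of the following general shape can be used to find the top TEMP-consuming SQL in ASH. This is a minimal sketch rather than the exact query I used; the bind variables :begin_time and :end_time stand in for the problem interval (TEMP_SPACE_ALLOCATED is available in DBA_HIST_ACTIVE_SESS_HISTORY from 11.2 onward, and the FETCH FIRST clause requires 12c).

SELECT sql_id,
       ROUND(MAX(temp_space_allocated) / 1024 / 1024) AS max_temp_mb,
       COUNT(*)                                       AS ash_samples
FROM   dba_hist_active_sess_history
WHERE  sample_time BETWEEN :begin_time AND :end_time
AND    temp_space_allocated > 0
GROUP  BY sql_id
ORDER  BY max_temp_mb DESC
FETCH FIRST 10 ROWS ONLY;  -- on 11g, wrap in an inline view and filter ROWNUM <= 10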

What action did I take to resolve the issue?

My resolution included the following actions:
  1. Increasing the array fetch size

  2. Relocating the app server to the same local area network as the database

  3. Tuning using a parallel SQL Profile

Why is this case particularly interesting?

This example shows how a problem area (high TEMP-consuming SQL) was quickly identified by the DOPA process. It also shows the importance of rightly interpreting that information. Clearly, the tuning expert must be able to make the jump from raw fact (high TEMP consumption) to a likely cause(s).

What impact did the metrics analysis bring?

The DOPA process quickly identified one and only one problem area. Tuning attention was then directed to that problematic area. A deep-dive analysis at that point efficiently and effectively identified the root cause.

Case #2: Basic Example of How the DOPA Process Identified a Configuration Problem

What was the user's complaint? How did the problem manifest? Error messages?

Client reported performance degradation during a massive data load. There was a three-hour window during which this program ran.

What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?

I ran the DOPA analysis using a Category Count View and the Oracle taxonomy for the problem interval.

The Category Count View, shown in Figure 8-2, shows REDO and IO metrics were the taxonomic areas most often flagged as abnormal. This would likely indicate a configuration problem as the cause of performance degradation.
Figure 8-2

Category Count View showing high numbers of flags in REDO and DIRECT I/O probably indicative of a configuration problem

To get more detail, I ran the DOPA model again, this time using the Metrics Aggregate View. This view is shown in Figure 8-3. The Metrics Aggregate View shows the individual metrics with the most flagged values. In this instance, seven of the top nine metrics are REDO-related metrics, confirming that a REDO configuration problem is most likely the root problem.
Figure 8-3

Metrics Aggregate View showing likely configuration issue

I used the objects that generated the most block changes (source: DBA_HIST_SEG_STAT and DBA_HIST_SEG_STAT_OBJ) as a "surrogate metric" to discover the SQL causing the massive REDO. Figure 8-4 shows the results of this search: it reveals that the problem was a single-row INSERT executed as an array insert of about 78 million rows (20,000 records at a time), which took about 7.5 seconds per execution; at roughly 3,900 executions (78 million / 20,000), this added up to approximately 8 hours of processing time.
Figure 8-4

Example where the surrogate metric of block changes is used to get the likely SQL causing the massive REDO
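For reference, a query along the following lines can surface the segments with the most block changes between two AWR snapshots. This is a hedged sketch, not the exact query I ran; the bind variables :begin_snap and :end_snap are placeholders for the snapshot range of the problem interval.

SELECT o.owner,
       o.object_name,
       o.object_type,
       SUM(s.db_block_changes_delta) AS block_changes
FROM   dba_hist_seg_stat     s
JOIN   dba_hist_seg_stat_obj o
       ON  o.dbid     = s.dbid
       AND o.ts#      = s.ts#
       AND o.obj#     = s.obj#
       AND o.dataobj# = s.dataobj#
WHERE  s.snap_id BETWEEN :begin_snap AND :end_snap
GROUP  BY o.owner, o.object_name, o.object_type
ORDER  BY block_changes DESC
FETCH FIRST 10 ROWS ONLY;

Once the top segments are known, ASH or DBA_HIST_SQLSTAT can be searched for the SQL statements touching those objects.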

The next step I took was to look at the wait events for this SQL; I did this using a custom query against the ASH. I could see from this query that log file switching and IO delays were contributing to the issue. This analysis confirmed that the REDO configuration was a major part of the problem.
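The custom ASH query was essentially a wait-event profile for the SQL_ID in question over the problem window; a minimal sketch (bind variables as placeholders) looks something like this:

SELECT NVL(event, 'ON CPU') AS event,
       COUNT(*)             AS ash_samples
FROM   dba_hist_active_sess_history
WHERE  sql_id = :sql_id
AND    sample_time BETWEEN :begin_time AND :end_time
GROUP  BY NVL(event, 'ON CPU')
ORDER  BY ash_samples DESC;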

The actual configuration problem shows up in the Log Switch Frequency Map shown in Figure 8-5, which reveals that log file switching was excessively frequent during the problem interval.
Figure 8-5

Log Switch Frequency Map confirming a configuration problem
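A Log Switch Frequency Map of this kind can be produced with a simple aggregation over the log history, and the current log sizes can be confirmed from V$LOG. The following is a rough sketch of both checks, not necessarily the exact report format shown in the figures:

-- Log switches per hour
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*)                               AS log_switches
FROM   v$log_history
GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY hour;

-- Current REDO log group sizes (compare with Figure 8-6)
SELECT group#, thread#, bytes / 1024 / 1024 AS size_mb, members, status
FROM   v$log
ORDER  BY group#;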

After confirming the size of the REDO log files, as in Figure 8-6, I concluded that the solution was to increase the REDO log file sizes to at least 2 GB each.
Figure 8-6

Confirming the size of the REDO Log files

Advice from ADDM was consistent with my conclusion, since it recommended an increase in the SGA. Because IO waits were also at play, this action was planned for the earliest convenient opportunity (a database restart is required). The ADDM report excerpt is in Figure 8-7.
Figure 8-7

ADDM report excerpt

Further iterations of the DOPA process showed high values for index leaf node splits.

What action did I take to resolve the issue?

The actions taken to address the performance problem are as follows:
  • Increase REDO log file sizes.

  • Increase the size of the SGA.

  • To address the problem of high values for index leaf node splits, I suggested to the client that they disable the indexes during the load (see the sketch following this list).
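As a hedged illustration of that last suggestion (the index name is hypothetical, and this applies to non-unique indexes only), the index can be marked unusable before the load and rebuilt afterward; the load will skip the unusable index as long as skip_unusable_indexes is TRUE, which is the default.

-- Hypothetical index name; non-unique indexes only
ALTER INDEX app_owner.load_table_ix1 UNUSABLE;

-- ... run the data load ...

ALTER INDEX app_owner.load_table_ix1 REBUILD PARALLEL 4 NOLOGGING;
ALTER INDEX app_owner.load_table_ix1 NOPARALLEL;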

Why is this case particularly interesting?

Leaf node splits is a metric implemented in AWR and evaluated as part of the DOPA process, but I have never seen it in any other small model approach. It was noteworthy that the DOPA process identified it as a problem in this analysis. It is quite likely that this is a problem that occurs, but is overlooked by other tools that fail to consider this metric.

REDO and IO issues as well as the index issues were immediately flagged in this analysis and led to a much richer root-cause analysis. As noted in the preceding text, inclusion of all metrics prevented important observations from being missed. Once the problem area was discovered using the DOPA process, a more in-depth analysis could be performed using other tools. For example, a little bit of further digging showed that the resource manager was kicking in as highlighted in Figure 8-8.
Figure 8-8

Resource manager metrics from dba_hist_osstat

The high values for the dba_hist_osstat metric "RSRC_MGR_CPU_WAIT_TIME" pointed me to check whether the resource manager was kicking in during the maintenance window activities. It turned out that it was indeed occurring during the problem interval; a sketch of this check is shown below.
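A query of roughly the following shape can show how resource manager CPU waits grew across snapshots. This is a minimal sketch: the statistic is cumulative, so deltas are taken between snapshots, and values reset after an instance restart.

SELECT sn.begin_interval_time,
       os.value
       - LAG(os.value) OVER (PARTITION BY os.instance_number
                             ORDER BY os.snap_id) AS rsrc_mgr_cpu_wait_delta
FROM   dba_hist_osstat   os
JOIN   dba_hist_snapshot sn
       ON  sn.snap_id         = os.snap_id
       AND sn.dbid            = os.dbid
       AND sn.instance_number = os.instance_number
WHERE  os.stat_name = 'RSRC_MGR_CPU_WAIT_TIME'
ORDER  BY os.instance_number, os.snap_id;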
To address this, the solutions include the following:
  1. Move the maintenance window to a period of less activity.

  2. Dial back the resource management plan.

  3. Possibly add more CPU to the instance/machine.

What impact did the metrics analysis bring?

Using the DOPA process as a first step in the analysis enabled me to positively identify the area of concern based on the flagged metrics. I was able to then use small model tools to zoom in on the problem area with great precision. Having done a comprehensive analysis first, I could be confident that important observations were not being overlooked and that all problem areas were addressed.

Case #3: CPU Problem Caused by SQL Execution

What was the user’s complaint? How did the problem manifest? Error messages?

The client complained that a specific query configured in a monitoring tool was throwing alerts—“ORA-01013: user requested cancel of current operation.” They reported that normal query execution time was usually much less than 60 seconds and that the application would error out if >60 seconds.

Client received the alerts on June 14–16 and also 18–20. The request read, “Please check and let us know is there any Performance issue at the database end.”

Another analyst had previously looked at the problem and recommended that the client modify the query with a parallel hint. This change was made by the client, but did not resolve the problem. In fact, it magnified the problem.

What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?

I began the analysis by running the DOPA code using the Metrics Category Count View to gain a high-level picture of the situation. As shown in Figure 8-9, an overwhelming number of the flagged metrics were IO- and CPU-related metrics.
Figure 8-9

Metrics Category Count View showing a predominance of IO- and CPU-related metrics flagged

A second iteration of the DOPA process, using the Metrics Aggregate View, gave a more detailed view. In the Metrics Aggregate View, shown in Figure 8-10, I could see a number of specific CPU-related metrics that were high. These are indicated by arrows.
Figure 8-10

Metrics Aggregate View

I was able to do a deeper-dive analysis by looking at a Metrics Time-Series View. Using the Metrics Time-Series View, I focused in on dba_hist_sysmetric_summary: “Host CPU Utilization (%)” (shown in Figure 8-11) and could see that the CPU had been pegged for a long time. When this happens, all waits are inflated because of the wait for CPU (see “Database Wait Time Ratio” graph in Figure 8-12).
Figure 8-11

Host CPU Utilization graph

Figure 8-12

Database Wait Time Ratio graph
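The time-series data behind graphs like these comes from dba_hist_sysmetric_summary; a minimal sketch of such a query (bind variables as placeholders for the window of interest) is:

SELECT begin_time,
       ROUND(average, 1) AS avg_cpu_pct,
       ROUND(maxval, 1)  AS max_cpu_pct
FROM   dba_hist_sysmetric_summary
WHERE  metric_name = 'Host CPU Utilization (%)'
AND    begin_time BETWEEN :begin_time AND :end_time
ORDER  BY begin_time;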

Additional information gathered revealed the following:
  • cpu_count = 40

  • SBR latency metrics look good.

  • Top SQL from ADDM: 6da7h2v0b1v8z

  • Findings from ADDM: 98% for Java execution; Statistics Finding: Table "CODS_USER_SS_PROD"."SS_CXSMILES_EXACT_SEARCH" was not analyzed.

Based on all the information, I determined that CPU saturation was the root cause of the fluctuations seen by the monitoring query and that the problem would likely persist until the CPU situation was brought under control. I was confident that it wasn't a capacity issue and that performance would improve once the SQL was tuned.

What action did I take to resolve the issue?
  1. Implement a custom SQL Profile fixing the execution plan to the faster PHV (3051534597).

  2. Monitor CPU situation and revisit if no improvement.

Why is this case particularly interesting?

The first analyst that looked at this problem recommended that the client modify the query with a parallel hint. This change was made by the client, but did not resolve the problem. In fact, it magnified the problem. In making this change, the analyst was treating the symptom and not the problem.

The DOPA process, a comprehensive metrics-based analysis, identified not only the CPU issue but also the impact of the CPU issue on various wait times. This is a great example of how the expertise of the analyst is critical for the interpretation of the raw data. In this case, the long wait times were a result of slow processing due to high CPU consumption, not the problem to be fixed. Throwing hardware at the problem might help, but it is unlikely to solve the problem.

What impact did the metrics analysis bring?

The comprehensive analysis enabled accurate identification of the problem and the information necessary to discern between cause and effect.

In follow-up evaluation after the fix, Figure 8-13 shows that the CPU situation has returned to normal. The long-duration high CPU condition of the server remained under control after implementing the custom manual SQL Profile.
Figure 8-13

CPU graph showing Average Host CPU Utilization (%) returned to normal after the fix

Case #4: Example Where the DOPA Process Was Used to Detail Problem Areas

What was the user’s complaint? How did the problem manifest? Error messages?

Performance was reported to be slow on a particular database instance, and the client reported application-level "server busy" errors (no "ORA" errors were reported).

This is a database application that is hosted in the UK but has US users as well. The problem got worse as US users began to connect to the application and gradually eased when UK users logged off. The time period during which this happened most was every day between 10:00 and 15:00, for different users at different sites.

What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?

Since the problem occurred only during the week, I took my normal ranges from the weekend, using the same time interval: 10:00–15:00.

I went first to the Metrics Aggregate View, looking only at dba_hist_sysmetric_summary metrics (since this source is loaded with transaction-related metrics). Figure 8-14 shows the model produced.
Figure 8-14

Case Study - UKPRD835 - Metrics Aggregate View


From the Metrics Aggregate View, I identified three metrics of interest with high flag counts and high flag ratios:
  • Current Open Cursors Count

  • Current Logons Count

  • Session Count

These are all related to connection and workload issues.

Next, I focused in on the Session Count metric individually. The Metrics Time-Series view of all values of this metric, both flagged and unflagged, shown in Figure 8-15, allows us to easily understand the periodicity of the metric trend. From this we can see high session counts every day during UK and US work hours exactly matching the problem description.
Figure 8-15

Case Study - UKPRD835 - Metrics Time-Series View – Session Count

Further analysis revealed (among other things)
  • Logon storms of upward of 20,000 connections per hour (a query sketch for checking this follows the list)

  • Lots of small SQL executions doing full table scans

  • Excessive parsing (shows up in the Metrics Aggregate View, Total Parse Count Per Sec)
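The logon-storm observation can be confirmed from the "logons cumulative" statistic in AWR; the following is a hedged sketch that takes deltas between snapshots (with hourly snapshots, this is effectively logons per hour):

SELECT sn.begin_interval_time,
       st.value
       - LAG(st.value) OVER (PARTITION BY st.instance_number
                             ORDER BY st.snap_id) AS logons_in_interval
FROM   dba_hist_sysstat  st
JOIN   dba_hist_snapshot sn
       ON  sn.snap_id         = st.snap_id
       AND sn.dbid            = st.dbid
       AND sn.instance_number = st.instance_number
WHERE  st.stat_name = 'logons cumulative'
ORDER  BY st.instance_number, st.snap_id;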

The conclusion of my analysis was that the performance problem was a result of the application behavior; it was creating a new database session for every window and running one or more small queries to populate data for the window [this behavior was confirmed by both the client and the vendor]. This is a common design problem with applications; it manifests as logon storms. The application had not failed earlier primarily because it was operating on a 40-CPU machine and had ample capacity to deal with the overabundance of connections required; eventually, they ran into the practical limits of the machine. Since it is older hardware and not possible to add any more CPUs, the only fix possible is at the application level. This would involve adding a connection pool. This must be implemented by the vendor since it is a vendor product.

What action did I take to resolve the issue?

I advised the client to work with the vendor to accomplish the necessary change in the application.

Why is this case particularly interesting? What impact did the metrics analysis bring?

The metrics give the clues to the problem, but it is still necessary for the analyst to tie this to the workings of the application in order to fully understand the problem. In this case, the metrics exactly matched the client’s complaint, and this gave them leverage in working with the vendor. The DOPA process provided objective data that the problem was occurring as a result of the application and not the database and therefore required a fix at the application level.

Case #5: Performance Investigation for Ill-Described Problem (a Narrative)

What was the user’s complaint? How did the problem manifest? Error messages?

An application support client asked for an AWR report for a time range 06:00–09:00 on a particular day; it was supplied to them by a level 2 application DBA analyst. I knew that this particular application is important to the business and that the app team usually worked through the vendor for support, but my obsessive tuning disorder got the best of me and I began a performance investigation even though they didn’t ask.

What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?

I looked first at the AWR report since that is the data they had requested; it showed Average Active Sessions (AAS) in the hundreds (i.e., extremely high considering this database is hosted on a four-CPU machine). Up until that point, I had not seen an AAS that high so my curiosity was piqued and I decided to do a deeper dive.

The ADDM report showed CPU pegged and suggested adding CPUs.

The two tools I just mentioned are part of what I have called the "small model" arsenal of tools and were useful to a point, but I needed a more detailed picture of the database to pinpoint the problem and to determine whether multiple areas were involved. For this, I turned to the DOPA process.

Using the Metrics Time-Series View (shown in Figure 8-16), I could see that CPU consumption was at 100% from 02:00 through 11:00 [I learned later that the operations DBAs restarted the database after 11:00, which cleared the problem]. Although the clients had indicated a problem window of 06:00–09:00, this view showed that the problem was of longer duration. Figure 8-16 shows a graph of the Metrics Time-Series View for host CPU consumption.
Figure 8-16

A graph of the Metrics Time-Series View for host CPU consumption

Since the ADDM report also highlighted a particular SQL statement as high CPU consuming, I used dba_hist_sqlstat to better understand how this SQL statement had been operating historically. I discovered that there had been ~ 800 million executions of this query on the day of the high CPU condition. The SQL had execution times that were subsecond, but with the enormous number of executions, this added up to >24-hour clock time. I also found that this SQL usually had around 10 million executions on a normal, nonproblematic day with the same time per execution [and stable execution plan] taking only about 20 minutes or so of clock time in a day. Thus the problem was a result of an increased workload and not poor performance of the SQL itself.
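A query of roughly this shape against dba_hist_sqlstat gives the execution history (executions, average elapsed time, and plan stability) day by day; this is a minimal sketch with :sql_id as a placeholder for the statement of interest:

SELECT TRUNC(sn.begin_interval_time) AS day,
       s.plan_hash_value,
       SUM(s.executions_delta)       AS executions,
       ROUND(SUM(s.elapsed_time_delta)
             / NULLIF(SUM(s.executions_delta), 0) / 1e6, 3) AS avg_elapsed_sec,
       ROUND(SUM(s.elapsed_time_delta) / 1e6 / 3600, 1)     AS total_elapsed_hrs
FROM   dba_hist_sqlstat  s
JOIN   dba_hist_snapshot sn
       ON  sn.snap_id         = s.snap_id
       AND sn.dbid            = s.dbid
       AND sn.instance_number = s.instance_number
WHERE  s.sql_id = :sql_id
GROUP  BY TRUNC(sn.begin_interval_time), s.plan_hash_value
ORDER  BY day, s.plan_hash_value;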

Another iteration of the DOPA process using the Metrics Aggregate View revealed high values for the metric, dba_hist_sysmetric_summary: “Temp Space Used.” This is an important finding as it is actionable for further analysis which will likely contribute toward a resolution.

What I did next was to analyze the SQL in the active session history during the problem interval, sorting the highest TEMP-consuming SQL to the top. I ran those SQL statements through the SQL Tuning Advisor to see what actionable recommendations it would produce.

Another metric identified as high by the Metrics Aggregate View was dba_hist_sysstat: "concurrency wait time" [this metric can also be found on a SQL-by-SQL basis in dba_hist_sqlstat (ccwait_time_total/delta), v$sql, and v$sqlarea; according to the Oracle documentation, this is "The total wait time (in microseconds) for waits that belong to the Concurrency wait class"]. Since wait times are typically elevated during times of high CPU consumption, I was not particularly concerned about pursuing further investigation of this metric.

What action did I take to resolve the issue?

My findings were shared with the application team. I explained that the workload had increased significantly as a result of the SQL statement running orders of magnitude more than usual, but that the execution time for each execution was no slower than usual. Thus, they needed to go back and figure out why the workload change had occurred.

Why is this case particularly interesting?

The DOPA process, as well as other small model tools, showed that the problem was related to high CPU consumption. The DOPA process enabled the deeper-dive analysis necessary to fully discover that the high CPU was a result of the workload increase and not the root cause of the problem. It also gave insight into metrics (Temp Space Used) that would have been missed using small model tools alone, and this insight provided an actionable finding with immediate performance improvement.

What impact did the metrics analysis bring?

Often, time and effort are spent chasing symptoms and not the root cause. In this example, if the analysis had ended with the ADDM report and the advice to add additional CPU had been implemented, the root cause would not have been discovered and resources would have been wasted.

Case #6: Example of Where Someone Might Think It Was a Top SQL, but It's Not (or, False-Negative SQL Analysis)

What was user complaint? How did the problem manifest? Error messages?

Application support client reported a problem that occurred in a narrow one-hour interval. The problem manifested as a SQL tuning problem. Using the traditional approach, the DBA performance analyst (DBA) assigned to this incident examined the top SQL that was present during that interval. In this case all the SQL executing in that hour had very fast execution times (AWR report, ADDM report, custom analysis of DBA_HIST_SQLSTAT). The DBA reported back to the client that the top SQL analysis for a one-hour interval did not show significant app SQL level issues (i.e., no resolution/solution provided). Further, there was no actionable advice from the SQL Tuning Advisor for the application SQL during the interval. I was asked to provide a second set of expert eyes.

What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?

I began my analysis using the DOPA process and the Metrics Aggregate View. The DOPA process was used to reexamine the time interval. I discovered many additional flags on REDO-related metrics. The dynamically flagged metrics provided the information needed to direct further investigation. This further investigation led to the observation of blocked sessions and significant waits on REDO log file writing and switching.
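As a rough illustration, the follow-up ASH investigation can be sketched with a query like the following, which surfaces REDO-related waits and blocked sessions in the problem hour (bind variables as placeholders; not the exact query I ran):

SELECT NVL(event, 'ON CPU')             AS event,
       COUNT(*)                         AS ash_samples,
       COUNT(DISTINCT blocking_session) AS distinct_blockers
FROM   dba_hist_active_sess_history
WHERE  sample_time BETWEEN :begin_time AND :end_time
AND    (event LIKE 'log file%' OR blocking_session IS NOT NULL)
GROUP  BY NVL(event, 'ON CPU')
ORDER  BY ash_samples DESC;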

What action did I take to resolve the issue?

The solution in this case was to increase the REDO log file sizes.

Why is this case particularly interesting?

The DOPA process led right to the underlying problem which was not detected using the traditional SQL tuning approach.

What impact did the metrics analysis bring?

An essential observation was missed using the small model tools but was detected by the DOPA process. This led to a proper diagnosis of the problem and actionable solution.

Case #7: Alert Log Example (Error Alert Failure Example) Narrative

What was the user’s complaint? How did the problem manifest? Error messages?

Users reported an ORA-600 error* while running a select statement. ORA-600 errors were also found in the alert log.

What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?

I ran the DOPA process using a Metrics Aggregate View. The dynamic code flagged as outside of normal the following metrics from dba_hist_system_event:
  • Other: ADR block file write time_waited_micro

  • Other: ADR block file write total_waits

  • Other: ADR file lock total_timeouts

  • Other: process diagnostic dump time_waited_micro

  • Other: process diagnostic dump total_waits

These are all related to how Oracle internally handles the writing out of the ORA-600 error. They relate to ADR, the Automatic Diagnostic Repository, which is a hierarchical file-based repository for handling diagnostic information.

What action did I take to resolve the issue?

Since ORA-600 errors are handled by the database operations group, no tuning was performed, nor were there any other actionable findings.

Why is this case particularly interesting?

This case is interesting because the DOPA process was able to quickly identify the problem area. I wasn't able to determine the actual cause of the ORA-600 error, but I was able to determine that it was not a performance issue and was instead an issue with an internal process. This enabled me to refer it to the appropriate group for resolution without wasting valuable time on tuning, which would likely have proven ineffective in this situation.

What impact did the metrics analysis bring?

Because DOPA is able to analyze all of the available metrics, it is able to recognize problems not only in the areas familiar to most tuners but also in the inner workings of the machine. By quickly identifying the area, tuners do not have to waste valuable time trying to chase down problems in unfamiliar territory that can best be handled by others.

*The actual error message is:

ORA-00600: internal error code, arguments: [kggsmGetString:1], [0x7F54ADD62910], [13], [1], [11], [], [], [], [], [], [], []

Case #8: DOPA Analysis of Oracle Memory Error—Narrative

What was the user’s complaint? How did the problem manifest? Error messages?

Users reported an ORA-04031 error related to memory while running a select query (ORA-04031: unable to allocate bytes of shared memory). Usually this is something we would turn over to operations because we don’t have the remit or authority to add memory to the database. I decided to run the DOPA process because I was curious to see how much information the DOPA process would provide for this type of problem.

What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?

I used the DOPA process and the Metrics Aggregate View to analyze the metrics for the problem time interval to see which metrics in the infrastructure taxonomy and memory category were outside normal range. The following metrics were flagged:

dba_hist_osstat
  • FREE_MEMORY_BYTES

  • INACTIVE_MEMORY_BYTES

dba_hist_sys_time_model
  • Failed parse (out of shared memory) elapsed time

dba_hist_sysstat
  • Session PGA memory

  • Session PGA memory max

  • Session UGA memory

  • Session UGA memory max

  • Sorts (memory)

dba_hist_system_event
  • Concurrency: latch: In memory undo latch time_waited_micro

  • Concurrency: latch: In memory undo latch total_waits

  • Queueing: Streams AQ: enqueue blocked on low memory time_waited_micro

  • Queueing: Streams AQ: enqueue blocked on low memory total_waits

Taken together, these metrics indicate memory exhaustion at both the operating system level and within Oracle's internal memory structures (parsing memory, PGA/UGA memory, latching, and Streams AQ).

What action did I take to resolve the issue?

No action was taken since the memory errors are handled by our operations team.

Why is this case particularly interesting?

This example shows that the DOPA process is able to provide additional information related to and necessary to address this kind of Oracle error. In my organization, tuning professionals are not tasked with maintaining the actual performance of the machine, but for organizations where DBAs must provide all levels of service, this will be very beneficial.

What impact did the metrics analysis bring?

The DOPA process can clearly be used to get additional details on this kind of Oracle error. Furthermore, in this case the database experienced a very specific memory error, and the DOPA process was able to show that this event directly coincided with capacity exhaustion as shown by the unusually high value for the flagged memory metrics at both the operating system level (dba_hist_osstat) and Oracle internal resource level as well (parsing memory, PGA/UGA memory, latches and Streams).

Summary

The case studies we’ve just looked at are a sampling of the types of problems for which the DOPA process can provide insightful analysis. I am persuaded that it can be used effectively for any kind of problem. Because it is metrics-based, it is an objective analysis, and because it uses all of the available metrics, it is exhaustive. When run in an iterative fashion, as I have shown, it can be used in a dynamic way to evaluate at the appropriate level to obtain information at both a macro and micro level. These are just some of the reasons the DOPA process is superior to small model approaches. That does not mean that small model analysis tools are obsolete. As you have seen in these examples, small model tools are very useful when used in conjunction with the DOPA process to focus in on problem areas. However, no other tool is as comprehensive, efficient, and objective as the DOPA process. I am confident that the dynamic method of Oracle performance analytics provided by DOPA will benefit every tuning expert who determines to learn how to use it.

The last two chapters cover monitoring (Chapter 9) and further enhancements (Chapter 10).
