In the previous chapter, I provided a framework for making the many decisions necessary when implementing the DOPA process. The discussion focused on how to run the analysis and which parameters and views to choose. In this chapter, I’ll lead you through several real examples.
The case studies presented in this chapter show the DOPA process in action. They detail the analysis of historic AWR metrics and the identification of the root cause(s) of real reported Oracle database performance problems. These represent just a sampling of the kinds of problems encountered by database tuning professionals, but taken together they demonstrate the dynamic nature of the DOPA process and its usefulness across problem types, including:
- Identifying the source of performance bottlenecks:
  - IO
  - CPU
  - Memory
  - Network
- Identifying internal Oracle features involved in the performance issue:
  - Subsystems
  - Configuration
- Determining trends for thousands of metrics
- Identifying application issues:
  - Parsing
  - Workload
  - SQL tuning
- Failure detection:
  - ORA-600 errors
  - TEMP
  - PGA aggregate limit exceeded
- Analysis of metric normal ranges:
  - Threshold setting for monitoring
In any organization with mission-critical systems, it is essential to be able to quickly identify the root causes of issues and to react promptly by applying the required corrections or tuning. As with any data analysis effort, context is key; without context for the issue you are analyzing, the results will have little meaning. In the context of database performance tuning (whether the issue is well defined or not so well defined), the unusual metric values flagged by the DOPA process provide excellent input into the process of weaving the story behind the performance issue at hand.
With the approximately 11,000 metrics instrumented thus far in the DOPA process, there are about 2 million data points in a week’s worth of AWR data (rolled up in 1-hour intervals). Since we’re interested in performance bottlenecks, I designed the DOPA process to show only the metrics that are likely to be contributing to the problem. Likely, there will be many metrics that are unfamiliar to the new user of the DOPA process. While the metric names are usually informative as to what they are measuring, one might not be able to find much Oracle documentation on a given metric. Google is your friend to the extent that the metric is well documented, but not all of them are. Therefore, the DOPA process user will need to put on their expert hat, leveraging their knowledge of computer science and Oracle internals. The point is that you will need to become familiar with a larger set of metrics than you are likely accustomed to. The DOPA process does not predetermine any metrics [although a future version will rank some well-known predictive metrics], and it is not designed to point you to the specific metrics involved when a very specific performance issue is reported. Thus, it is the task of the DOPA process user to exercise good judgment and decide which metrics or problem areas to focus on for further analysis. The implementation of taxonomies is very helpful in this regard since they bundle the metrics into higher-level categories for easier interpretation.
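The flagging idea at the heart of the process can be sketched in a few lines: compute a normal range for each metric from its history and flag the metrics whose current value falls outside it. The sketch below is a simplified illustration only, not the actual DOPA implementation; the metric names and values are hypothetical, and the mean ± 2-standard-deviation range is one common choice of threshold.

```python
from statistics import mean, stdev

def flag_outliers(history, current, n_sigmas=2.0):
    """Flag metrics whose current value falls outside the normal range
    (mean +/- n_sigmas * standard deviation) of their history."""
    flagged = {}
    for metric, values in history.items():
        mu, sigma = mean(values), stdev(values)
        lo, hi = mu - n_sigmas * sigma, mu + n_sigmas * sigma
        if not lo <= current[metric] <= hi:
            flagged[metric] = (current[metric], lo, hi)
    return flagged

# Hypothetical hourly baseline values for two metrics
history = {
    "Temp Space Used":        [100, 110, 95, 105, 98, 102],
    "Host CPU Utilization %": [20, 25, 22, 18, 24, 21],
}
# Values observed in the problem interval
current = {"Temp Space Used": 900, "Host CPU Utilization %": 23}
print(flag_outliers(history, current))  # only the TEMP metric is flagged
```

Applied across thousands of instrumented metrics, only the handful outside their normal ranges surface for the analyst's attention.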
The analysis of the performance tuning expert is necessary to synthesize the results of the model produced by running the DOPA process for a given time interval. A metric can be flagged by the DOPA process as out of normal range, for example, yet this condition could be a knock-on effect of another problem: memory issues or poorly tuned SQL can surface in the metrics as IO problems (i.e., unusually high IO can be a consequence of the real problem, low SGA memory or poorly tuned SQL). The user of the DOPA process will likely encounter this kind of behavior, as I did, time and time again. The model produced by the DOPA process is a very effective tool, but it must be used skillfully by the tuning professional, who must take the objective information it affords and convey the metrics-based “story” that best expresses the performance issue at hand.
For the following case studies, I had access to production Oracle instances that contained one week’s worth of historic performance metrics in the DBA_HIST views (in some cases, up to one month’s worth of data). Each case represents a real production performance incident. Each was solved in the “heat of battle” using the DOPA process.
Case Studies
Each case study follows the same format, answering the following questions:
- What was the user’s complaint? How did the problem manifest? Error messages?
- What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?
- What action did I take to resolve the issue?
- Why is this case particularly interesting?
- What impact did the metrics analysis bring?
Case #1: Basic Example Showing How DOPA Process Identifies a SQL Problem
What was the user’s complaint? How did the problem manifest? Error messages?
The application had poor performance during a particular time interval. Users asked us to check to see if there were backups or other processes occurring on the database that might be impacting their application.
What type of analysis did I run with the DOPA process? Metrics flagged, interesting patterns?
Since TEMP usage comes from SQL, and SQL can often be tuned to reduce TEMP consumption, I examined the ASH for the problem interval to find the top TEMP-producing SQL. I ran that SQL through the SQL Tuning Advisor and examined the SQL Stats and ASH for it [it took 8 minutes to run, returning 300,000 rows (1,000 at a time) over the wide area network; session elapsed time was 30 minutes]. The SQL Tuning Advisor suggested a parallel SQL Profile at ~60% improvement. In this case the root cause was not really TEMP consumption, but rather the fact that the query was running over the WAN (app server in the UK with the database in the United States) with a low array fetch size.
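The array fetch size matters because each fetch of a batch of rows costs a network round trip; over a WAN, round-trip latency dominates. The back-of-the-envelope sketch below illustrates the effect; the 80 ms transatlantic round-trip latency is an assumed figure for illustration, not a measurement from this case, and real elapsed time also includes transfer and server-side work.

```python
def fetch_time_seconds(rows, array_size, rtt_ms):
    """Approximate time spent on fetch round trips alone:
    one round trip per array-sized batch of rows."""
    round_trips = -(-rows // array_size)  # ceiling division
    return round_trips * rtt_ms / 1000.0

rows = 300_000
rtt_ms = 80  # assumed WAN round-trip latency (illustrative)

# 1,000 rows per fetch -> 300 round trips
print(fetch_time_seconds(rows, 1_000, rtt_ms))   # 24.0 seconds
# 10,000 rows per fetch -> 30 round trips
print(fetch_time_seconds(rows, 10_000, rtt_ms))  # 2.4 seconds
```

Increasing the array fetch size reduces round trips proportionally, which is why it was the first remediation listed.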
What action did I take to resolve the issue?
1) Increasing the array fetch size
2) Relocating the app server to the same local area network as the database
3) Tuning using the parallel SQL Profile
Why is this case particularly interesting?
This example shows how a problem area (high TEMP-consuming SQL) was quickly identified by the DOPA process. It also shows the importance of correctly interpreting that information. Clearly, the tuning expert must be able to make the jump from raw fact (high TEMP consumption) to the likely cause(s).
What impact did the metrics analysis bring?
The DOPA process quickly identified one and only one problem area. Tuning attention was then directed to that problematic area. A deep-dive analysis at that point efficiently and effectively identified the root cause.
Case #2: Basic Example of How the DOPA Process Identified a Configuration Problem
What was the user’s complaint? How did the problem manifest? Error messages?
Client reported performance degradation during a massive data load. There was a three-hour window during which this program ran.
What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?
I ran the DOPA analysis using the Category Count View and the Oracle taxonomy for the problem interval; the flagged metrics pointed to REDO-related activity.
The next step I took was to look at the wait events for this SQL; I did this using a custom query against the ASH. I could see from this query that log file switching and IO delays were contributing to the issue. This analysis confirmed that the REDO configuration was a major part of the problem.
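A custom ASH query of this kind amounts to counting samples per wait event: each ASH sample represents roughly one second of active session time, so sample counts approximate seconds spent waiting. The sketch below illustrates the idea over hypothetical exported ASH rows; the actual analysis would run as a query against DBA_HIST_ACTIVE_SESS_HISTORY.

```python
from collections import Counter

# Hypothetical rows exported from ASH:
# (sample_time, session_id, event). A None event means "ON CPU".
ash_rows = [
    ("10:00:01", 101, "log file switch completion"),
    ("10:00:01", 102, "log file parallel write"),
    ("10:00:02", 101, "log file switch completion"),
    ("10:00:02", 103, None),
    ("10:00:03", 101, "db file sequential read"),
]

# Each sample approximates one second of active session time,
# so counting samples per event estimates seconds spent in each wait.
est_seconds = Counter(row[2] or "ON CPU" for row in ash_rows)
for event, secs in est_seconds.most_common():
    print(f"{event}: ~{secs}s")
```

Sorting the counts descending surfaces the dominant waits, which in this case were the log file switching and IO events.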
Further iterations of the DOPA process showed high values for index leaf node splits.
What action did I take to resolve the issue?
- Increase the REDO log file sizes.
- Increase the size of the SGA.
- To address the problem of high values for index leaf node splits, I suggested to the client that they disable the indexes during the load.
Why is this case particularly interesting?
Leaf node splits is a metric implemented in AWR and evaluated as part of the DOPA process, but I have never seen it used in any other small model approach. It was noteworthy that the DOPA process identified it as a problem in this analysis. It is quite likely that this is a problem that occurs often but is overlooked by other tools that fail to consider this metric.
Additional recommendations from this analysis were:
1) The maintenance window should be moved to a period of less activity.
2) The resource management plan should be dialed back.
3) Possibly the instance/machine needs more CPU.
What impact did the metrics analysis bring?
Using the DOPA process as a first step in the analysis enabled me to positively identify the area of concern based on the flagged metrics. I was then able to use small model tools to zoom in on the problem area with great precision. Having done a comprehensive analysis first, I could be confident that important observations were not being overlooked and that all problem areas were addressed.
Case #3: CPU Problem Caused by SQL Execution
What was the user’s complaint? How did the problem manifest? Error messages?
The client complained that a specific query configured in a monitoring tool was throwing alerts—“ORA-01013: user requested cancel of current operation.” They reported that normal query execution time was usually much less than 60 seconds and that the application would error out if >60 seconds.
Client received the alerts on June 14–16 and also 18–20. The request read, “Please check and let us know is there any Performance issue at the database end.”
Another analyst had previously looked at the problem and recommended that the client modify the query with a parallel hint. This change was made by the client, but did not resolve the problem. In fact, it magnified the problem.
What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?
Based on all the information, I determined that CPU was the root cause of the fluctuations in the monitored query and that the problem would likely persist until the CPU situation was brought under control. I was confident that it wasn’t a capacity issue and that performance would improve once the SQL was tuned.
What action did I take to resolve the issue?
1. Implement a custom SQL Profile fixing the execution plan to the faster PHV (3051534597).
2. Monitor the CPU situation and revisit if there is no improvement.
Why is this case particularly interesting?
As noted earlier, the first analyst who looked at this problem recommended that the client modify the query with a parallel hint; the change did not resolve the problem and in fact magnified it. In making this change, the analyst was treating the symptom and not the problem.
The DOPA process, a comprehensive metrics-based analysis, identified not only the CPU issue but also the impact of the CPU issue on various wait times. This is a great example of how the expertise of the analyst is critical for the interpretation of the raw data. In this case, the long wait times were a result of slow processing due to high CPU consumption, not the problem to be fixed. Throwing hardware at the problem might help, but it is unlikely to solve the problem.
What impact did the metrics analysis bring?
The comprehensive analysis enabled accurate identification of the problem and the information necessary to discern between cause and effect.
Case #4: Example Where the DOPA Process Was Used to Detail Problem Areas
What was the user’s complaint? How did the problem manifest? Error messages?
Performance was reported to be slow on a particular database instance, and the client reported application-level “server busy” errors (no “ORA” errors reported).
This is a database application that is hosted in the UK but has US users as well. The problem got worse as US users began to connect to the application and gradually eased when UK users logged off. The time period during which this happened most was every day between 10:00 and 15:00, for different users at different sites.
What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?
Since the problem occurred only during the week, I took my normal ranges from the weekend, using the same time interval: 10:00–15:00.
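Choosing the baseline amounts to partitioning the snapshot timestamps: weekend snapshots in the 10:00–15:00 window supply the normal ranges, while weekday snapshots in the same window are the problem-interval samples. A minimal sketch of that selection logic (the dates are hypothetical examples, not the actual incident dates):

```python
from datetime import datetime

def in_baseline(ts):
    """Weekend snapshots in the same 10:00-15:00 window serve as the
    normal-range baseline (Monday = 0, so weekday() >= 5 is Sat/Sun)."""
    return ts.weekday() >= 5 and 10 <= ts.hour < 15

snapshots = [
    datetime(2018, 6, 16, 11, 0),  # Saturday 11:00 -> baseline
    datetime(2018, 6, 17, 14, 0),  # Sunday 14:00  -> baseline
    datetime(2018, 6, 18, 11, 0),  # Monday 11:00  -> problem interval
    datetime(2018, 6, 18, 16, 0),  # Monday 16:00  -> outside the window
]
baseline = [ts for ts in snapshots if in_baseline(ts)]
problem = [ts for ts in snapshots
           if ts.weekday() < 5 and 10 <= ts.hour < 15]
```

Comparing like-for-like time windows keeps time-of-day workload patterns from polluting the normal ranges.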
The Metrics Aggregate View for this case study (instance UKPRD835) flagged:
- Current Open Cursors Count
- Current Logons Count
- Session Count
These are all related to connection and workload issues. Further investigation revealed:
- Logon storms of upward of 20,000 connections per hour
- Lots of small SQL executions doing full table scans
- Excessive parsing (shows up in the Metrics Aggregate View as Total Parse Count Per Sec)
The conclusion of my analysis was that the performance problem was a result of the application behavior; it was creating a new database session for every window and running one or more small queries to populate data for the window [this behavior was confirmed by both the client and the vendor]. This is a common design problem with applications—it manifests as logon storms. The application had not failed earlier primarily because it was operating on a 40-CPU machine and had ample capacity to deal with the overabundance of connections required; eventually they ran into the practical limits of the machine. Since it is older hardware to which no more CPUs can be added, the only fix possible is at the application level: adding a connection pool. This must be implemented by the vendor since it is a vendor product.
What action did I take to resolve the issue?
Advise the client to work with the vendor to accomplish the necessary change in the application.
Why is this case particularly interesting? What impact did the metrics analysis bring?
The metrics give the clues to the problem, but it is still necessary for the analyst to tie this to the workings of the application in order to fully understand the problem. In this case, the metrics exactly matched the client’s complaint, and this gave them leverage in working with the vendor. The DOPA process provided objective data that the problem was occurring as a result of the application and not the database and therefore required a fix at the application level.
Case #5: Performance Investigation for Ill-Described Problem (a Narrative)
What was the user’s complaint? How did the problem manifest? Error messages?
An application support client asked for an AWR report for a time range 06:00–09:00 on a particular day; it was supplied to them by a level 2 application DBA analyst. I knew that this particular application is important to the business and that the app team usually worked through the vendor for support, but my obsessive tuning disorder got the best of me and I began a performance investigation even though they didn’t ask.
What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?
I looked first at the AWR report since that is the data they had requested; it showed Average Active Sessions (AAS) in the hundreds (i.e., extremely high considering this database is hosted on a four-CPU machine). Up until that point, I had not seen an AAS that high so my curiosity was piqued and I decided to do a deeper dive.
The ADDM report showed CPU pegged and suggested adding CPUs.
The two tools I just mentioned are part of what I have called the “small model” arsenal of tools and were useful up to a point, but I needed a more detailed picture of the database to find the problem and to determine whether multiple areas were involved. For this, I turned to the DOPA process.
Since the ADDM report also highlighted a particular SQL statement as high CPU consuming, I used dba_hist_sqlstat to better understand how this SQL statement had been operating historically. I discovered that there had been ~800 million executions of this query on the day of the high CPU condition. The SQL had subsecond execution times, but with the enormous number of executions, this added up to more than 24 hours of clock time. I also found that this SQL usually had around 10 million executions on a normal, nonproblematic day, with the same time per execution [and a stable execution plan] taking only about 20 minutes or so of clock time in a day. Thus the problem was a result of an increased workload and not poor performance of the SQL itself.
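The cause-vs-effect distinction here is a simple comparison against baseline: did executions explode while time-per-execution held steady (a workload change), or did time-per-execution degrade (a SQL performance change)? The sketch below encodes that logic using the figures from this case; the 10x and 2x thresholds are illustrative choices, not part of any Oracle tool.

```python
def diagnose(base_execs, base_total_s, prob_execs, prob_total_s):
    """Distinguish a workload increase from a per-execution slowdown
    by comparing execution counts and time-per-execution to baseline.
    The 10x / 2x thresholds are arbitrary illustrative cutoffs."""
    base_per = base_total_s / base_execs
    prob_per = prob_total_s / prob_execs
    if prob_execs > 10 * base_execs and prob_per <= 2 * base_per:
        return "workload increase"
    if prob_per > 2 * base_per:
        return "per-execution slowdown"
    return "no clear change"

# Figures from the case: ~10M executions (~20 min total) on a normal
# day vs ~800M executions on the problem day, same time per execution.
base_execs, base_total = 10_000_000, 20 * 60
prob_execs = 800_000_000
prob_total = prob_execs * (base_total / base_execs)  # ~96,000 s, >24 h
print(diagnose(base_execs, base_total, prob_execs, prob_total))
```

The same per-execution time multiplied by 80x the executions yields over a day of DB time in a single day, which is exactly what made the AAS so high.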
Another iteration of the DOPA process using the Metrics Aggregate View revealed high values for the metric, dba_hist_sysmetric_summary: “Temp Space Used.” This is an important finding as it is actionable for further analysis which will likely contribute toward a resolution.
What I did next was to analyze the SQL in the active session history during the problem interval, sorting the highest TEMP-consuming SQL to the top. I ran those SQL through the SQL Tuning Advisor to see what kind of actionable recommendations it would produce.
Another metric identified as high by the Metrics Aggregate View was dba_hist_sysstat: “concurrency wait time” [this metric can also be found on a SQL-by-SQL basis in dba_hist_sqlstat (ccwait_time_total/delta), v$sql, and v$sqlarea; per the Oracle documentation, this is “The total wait time (in microseconds) for waits that belong to the Concurrency wait class”]. Since wait times are typically elevated during times of high CPU consumption, I was not particularly concerned about pursuing further investigation of this metric.
What action did I take to resolve the issue?
My findings were shared with the application team. I explained that the workload had increased significantly as a result of the SQL statement running orders of magnitude more than usual, but that the execution time for each execution was no slower than usual. Thus, they needed to go back and figure out why the workload change had occurred.
Why is this case particularly interesting?
The DOPA process, as well as other small model tools, showed that the problem was related to high CPU consumption. The DOPA process enabled the deeper-dive analysis necessary to fully discover that the high CPU was a result of the workload increase and not the root cause of the problem. It also gave insight into metrics (Temp Space Used) that would have been missed using small model tools alone, and this insight provided an actionable finding with immediate performance improvement.
What impact did the metrics analysis bring?
Often, time and effort are spent chasing symptoms rather than the root cause. In this example, if the analysis had ended with the ADDM report and the advice to add additional CPU had been implemented, the root cause would not have been discovered, and resources would have been wasted.
Case #6: Example Where Someone Might Think It Was a Top SQL Problem, but It’s Not (False-Negative SQL Analysis)
What was the user’s complaint? How did the problem manifest? Error messages?
The application support client reported a problem that occurred in a narrow one-hour interval. The problem was initially treated as a SQL tuning problem. Using the traditional approach, the DBA performance analyst assigned to this incident examined the top SQL present during that interval. In this case, all the SQL executing in that hour had very fast execution times (AWR report, ADDM report, custom analysis of DBA_HIST_SQLSTAT). The DBA reported back to the client that the top SQL analysis for the one-hour interval did not show significant application SQL-level issues (i.e., no resolution/solution was provided). Further, there was no actionable advice from the SQL Tuning Advisor for the application SQL during the interval. I was asked to provide a second set of expert eyes.
What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?
I began my analysis using the DOPA process and the Metrics Aggregate View. The DOPA process was used to reexamine the time interval. I discovered many additional flags on REDO-related metrics. The dynamically flagged metrics provided the information needed to direct further investigation. This further investigation led to the observation of blocked sessions and significant waits on REDO log file writing and switching.
What action did I take to resolve the issue?
The solution in this case was to increase the REDO log file sizes.
Why is this case particularly interesting?
The DOPA process led right to the underlying problem which was not detected using the traditional SQL tuning approach.
What impact did the metrics analysis bring?
An essential observation was missed using the small model tools but was detected by the DOPA process. This led to a proper diagnosis of the problem and actionable solution.
Case #7: Alert Log Example (Error Alert Failure Example) Narrative
What was the user’s complaint? How did the problem manifest? Error messages?
Users reported an ORA-600 error* while running a select statement. ORA-600 errors were also found in the alert log.
What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?
The DOPA process flagged the following metrics:
- Other: ADR block file write time_waited_micro
- Other: ADR block file write total_waits
- Other: ADR file lock total_timeouts
- Other: process diagnostic dump time_waited_micro
- Other: process diagnostic dump total_waits
These are all related to how Oracle internally handles the writing out of the ORA-600 error. They relate to ADR, the Automatic Diagnostic Repository, which is a hierarchical file-based repository for handling diagnostic information.
What action did I take to resolve the issue?
Since ORA-600 errors are handled by the database operations group, no tuning was performed, nor were there any other actionable findings.
Why is this case particularly interesting?
This case is interesting because the DOPA process was able to quickly identify the problem area. I wasn’t able to determine the actual cause of the ORA 600 error, but I was able to determine that it was not a performance issue and was instead an issue with an internal process. This enabled me to refer it to the appropriate group for resolution without wasting valuable time on tuning, which would likely have proven ineffective in this situation.
What impact did the metrics analysis bring?
Because DOPA is able to analyze all of the available metrics, it is able to recognize problems not only with the areas familiar to most tuners but the inner workings of the machine as well. By quickly identifying the area, tuners do not have to waste valuable time trying to chase down problems in unfamiliar territory that can best be handled by others.
*The actual error message is:
ORA-00600: internal error code, arguments: [kggsmGetString:1], [0x7F54ADD62910], [13], [1], [11], [], [], [], [], [], [], []
Case #8: DOPA Analysis of Oracle Memory Error—Narrative
What was the user’s complaint? How did the problem manifest? Error messages?
Users reported an ORA-04031 error related to memory while running a select query (ORA-04031: unable to allocate bytes of shared memory). Usually this is something we would turn over to operations because we don’t have the remit or authority to add memory to the database. I decided to run the DOPA process because I was curious to see how much information the DOPA process would provide for this type of problem.
What type of analysis did I run with the DOPA code? Metrics flagged, interesting patterns?
I used the DOPA process and the Metrics Aggregate View to analyze the metrics for the problem time interval to see which metrics in the infrastructure taxonomy and memory category were outside normal range. The following metrics were flagged:
- dba_hist_osstat:
  - FREE_MEMORY_BYTES
  - INACTIVE_MEMORY_BYTES
- dba_hist_sys_time_model:
  - Failed parse (out of shared memory) elapsed time
- dba_hist_sysstat:
  - Session PGA memory
  - Session PGA memory max
  - Session UGA memory
  - Session UGA memory max
  - Sorts (memory)
- dba_hist_system_event:
  - Concurrency: latch: In memory undo latch time_waited_micro
  - Concurrency: latch: In memory undo latch total_waits
  - Queueing: Streams AQ: enqueue blocked on low memory time_waited_micro
  - Queueing: Streams AQ: enqueue blocked on low memory total_waits
These metrics indicate memory capacity exhaustion at both the operating system level and the Oracle internal resource level (parsing, PGA/UGA memory, latches, and Streams).
What action did I take to resolve the issue?
No action was taken since the memory errors are handled by our operations team.
Why is this case particularly interesting?
This example shows that the DOPA process is able to provide additional information related to and necessary to address this kind of Oracle error. In my organization, tuning professionals are not tasked with maintaining the actual performance of the machine, but for organizations where DBAs must provide all levels of service, this will be very beneficial.
What impact did the metrics analysis bring?
The DOPA process can clearly be used to get additional details on this kind of Oracle error. Furthermore, in this case the database experienced a very specific memory error, and the DOPA process was able to show that this event directly coincided with capacity exhaustion as shown by the unusually high value for the flagged memory metrics at both the operating system level (dba_hist_osstat) and Oracle internal resource level as well (parsing memory, PGA/UGA memory, latches and Streams).
Summary
The case studies we’ve just looked at are a sampling of the types of problems for which the DOPA process can provide insightful analysis. I am persuaded that it can be used effectively for any kind of problem. Because it is metrics-based, it is an objective analysis, and because it uses all of the available metrics, it is exhaustive. When run in an iterative fashion, as I have shown, it can be used dynamically to obtain information at both the macro and micro levels. These are just some of the reasons the DOPA process is superior to small model approaches. That does not mean that small model analysis tools are obsolete. As you have seen in these examples, small model tools are very useful when used in conjunction with the DOPA process to focus on problem areas. However, no other tool is as comprehensive, efficient, and objective as the DOPA process. I am confident that the dynamic method of Oracle performance analytics provided by DOPA will benefit every tuning expert who determines to learn how to use it.
The last two chapters cover monitoring (Chapter 9) and further enhancements (Chapter 10).