This chapter addresses the statistical manipulation of the data that has been collected and formatted. I’ll start by reviewing some basic statistical concepts, including normal ranges, outliers, and variance, and then discuss how these concepts are applied as part of the DOPA process to establish normal ranges for the host of metrics that don’t have well-known norms. Don’t worry, you don’t have to break out the statistics books; I will show you how to use the embedded Oracle functions to easily accomplish this statistical analysis.
So far, you have learned how to gather the data, unpivot/normalize the values into key-value pairs, and union them together to create a single virtual data set of metrics and their values. The next step in the DOPA process is to perform some basic statistical operations. Once performed, these statistical calculations will allow the program to identify which part(s) of the database are experiencing performance problems by flagging those that are “abnormal” based on the statistical analysis.
Host CPU Utilization (%): optimal response time no greater than 65%
Average Active Sessions: less than or equal to CPU count
Average Synchronous Single-Block Read Latency: no greater than 20 milliseconds
Run Queue Per Sec: I am concerned with values above 5
Database Wait Time Ratio: no greater than 80%
DB time
User I/O wait time
Session Limit %
While it’s true that some metrics have a known expected value, in my experience those are the exception rather than the rule (given the thousands of metrics available in the AWR). The vast majority of metric values contained in Oracle’s repository have no known absolute normal value. In some cases that is because the DBA is unfamiliar with the metric; more often it is because the metric measures something that varies widely from machine to machine, and even from time to time on the same machine, depending on the workload, the number and types of applications running, the number of users, and so on. What is “normal” for one machine at a given time may therefore change as usage changes.
Response Time Per Txn
User Transaction Per Sec
Redo Generated Per Sec
Network Traffic Volume Per Sec
Physical Read Total Bytes Per Sec
Physical Write Total IO Requests Per Sec
Current Open Cursors Count
redo log space requests
hard parse elapsed time
parse time elapsed
Given that it is humanly impossible to know or calculate manually the normal ranges for the many thousands of metrics available to us in the AWR, I have devised a method which uses simple statistical functions built into Oracle to compute the normal ranges for each metric. Again, the reason for this is so that the program can identify metrics as problematic when they exceed the normal value established by that statistical analysis.
Most computer professionals will have had at least some basic statistics as part of their schooling and will therefore be somewhat familiar with the statistical concepts of norms and standard deviations, but I’m going to do a quick review before proceeding just for clarity. I’ll define a few terms and then explain how they are used in the DOPA process.
Statistics Review
Calculating the Sample Variance Using the Oracle Function
68% of values will fall within one standard deviation of the mean.
95% within two standard deviations.
99.7% within three standard deviations.
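These 68-95-99.7 figures can be checked directly against the cumulative distribution function of a standard normal distribution. The quick check below uses Python’s statistics module purely as a convenient stand-in; in the DOPA process itself, all of this arithmetic happens inside the database using Oracle’s built-in functions.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
nd = NormalDist(mu=0.0, sigma=1.0)

def within(k: float) -> float:
    """Fraction of values lying within k standard deviations of the mean."""
    return nd.cdf(k) - nd.cdf(-k)

print(round(within(1), 4))  # 0.6827
print(round(within(2), 4))  # 0.9545
print(round(within(3), 4))  # 0.9973
```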
Removing outliers is important because they can significantly skew the calculation of “normal” for a given set of data. We want to identify them and remove them prior to calculating normal ranges for this reason.
A common method for identifying outliers uses a concept called the interquartile range, or IQR. This method divides the data into quartiles by calculating the median of the numbers in the lower half of the data set (called Q1) and the median of the numbers in the upper half of the data set (called Q3). The data points contained between Q1 and Q3 represent the middle 50% of the data, and the difference Q3 − Q1 is the interquartile range. A number is considered to be an outlier if it is more than 1.5 times the IQR below Q1 or more than 1.5 times the IQR above Q3.
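The IQR fence described above can be sketched in a few lines. Python is used here only as an illustration (the quartile convention, medians of the lower and upper halves, follows the text; other quartile conventions give slightly different fences); the DOPA process itself computes this with Oracle SQL.

```python
from statistics import median

def iqr_bounds(values):
    """Tukey fences: (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
    Q1/Q3 are medians of the lower/upper halves, as described in the text."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    q1 = median(s[:half])
    q3 = median(s[half + (n % 2):])  # skip the middle element when n is odd
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def remove_outliers(values):
    lo, hi = iqr_bounds(values)
    return [v for v in values if lo <= v <= hi]

latencies = [4, 5, 5, 6, 6, 7, 7, 8, 9, 1230]  # one obvious glitch
print(iqr_bounds(latencies))      # (0.5, 12.5)
print(remove_outliers(latencies)) # the 1230 glitch is dropped
```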
Statistics in the DOPA Process
Having laid a foundation of basic statistical analysis, I will now proceed to show you how I use statistics in the DOPA process to establish “normal values” for a given metric.
When a problem is reported, I collect the metric data for the week prior. One week has been my default time frame, because on most of the databases on which I work, this is the AWR retention setting. For establishing normal ranges, I want to consider metrics that reflect normal operation of the database, so I use metric values from a time interval when it was performing without problems. For example, if the problem occurred today, I would grab the data from the previous week and include all the metrics except those collected over the last 24 hours when the problem occurred. The AWR retention setting can be altered and it is probably useful to do so; I have been attempting to increase retention to 30 days on our databases so that a longer pre-problem comparative block of data is available for reference and establishing normal values.
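The windowing logic just described, take the week before the problem and drop the last 24 hours when the problem was occurring, can be sketched as a simple timestamp filter. This is an illustrative Python sketch over (timestamp, value) pairs; the function name and parameters are my own, not part of DOPA, which does this filtering in the SQL against the AWR.

```python
from datetime import datetime, timedelta

def baseline_window(samples, problem_time, lookback_days=7, exclude_hours=24):
    """Keep samples from the week before the problem, excluding the
    final 24 hours during which the problem occurred.
    `samples` is a list of (timestamp, value) pairs."""
    start = problem_time - timedelta(days=lookback_days)
    cutoff = problem_time - timedelta(hours=exclude_hours)
    return [(t, v) for t, v in samples if start <= t < cutoff]

now = datetime(2024, 1, 8, 12, 0)
# A week of hourly samples leading up to the reported problem.
samples = [(now - timedelta(hours=h), h) for h in range(7 * 24)]
kept = baseline_window(samples, now)
print(len(kept))  # 143: the hourly samples strictly inside the 6-day window
```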
In this graph, you will observe a single peak where the value is much higher for a single time interval than all the values for the other time intervals. The value was 1230 during that hour. I would normally expect single-block read latency to be less than 10 milliseconds. The abnormally high value occurred only once and there were no user complaints related to it, so I assumed it was a glitch in the data collection. If the 1230 value is included in the calculation of the mean, it significantly skews the results: including it yields a normal range with an upper bound of 182 milliseconds. This upper bound is represented graphically by the gray horizontal line at 182 in Figure 4-11. Given that we expect single-block latency to be under 10 milliseconds, using this inflated upper bound would cause the program to fail to identify many genuinely abnormal values.
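The skewing effect of a single glitch like this is easy to reproduce. The sample values below are illustrative, not the actual data behind Figure 4-11, and Python stands in for Oracle’s AVG and STDDEV functions; the point is simply that one 1230 ms sample drags the mean-plus-two-standard-deviations bound far above anything realistic.

```python
from statistics import mean, stdev

# Illustrative week of single-block read latencies (ms):
# mostly healthy values, plus one 1230 ms collection glitch.
healthy = [4, 5, 5, 6, 6, 7, 7, 8, 9] * 18   # 162 hourly samples
with_glitch = healthy + [1230]

def upper_bound(values):
    """Upper bound of 'normal': mean + 2 standard deviations."""
    return mean(values) + 2 * stdev(values)

print(round(upper_bound(healthy), 1))      # under 10 ms, as expected
print(round(upper_bound(with_glitch), 1))  # wildly inflated, well over 100 ms
```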
The interquartile range method for eliminating outliers is easy to implement using SQL and Oracle functions, and it has worked well for me thus far. I am not a statistician, so I am open to exploring alternate methods for removing outliers. The important point is that outliers should be removed prior to performing the next series of operations to ensure that real problems are not missed due to skewed results.
After I have removed outliers, I proceed to calculate the normal range for the remaining values. This is easily accomplished using two Oracle functions, AVG and STDDEV. AVG returns the arithmetic mean of the entire data set, and STDDEV computes the standard deviation based on that mean and the existing data points.
I use the standard deviation to set the upper and lower bounds of the “normal range” of values for each metric. The upper bound of “normal” for each metric is established as 2 standard deviations above the mean and the lower bound is 2 standard deviations below the mean. This is fairly standard and is referred to as the range rule of thumb.
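Putting the last two steps together, the normal range is simply mean ± 2 standard deviations over the outlier-free data. Here is a minimal sketch using Python’s statistics module in place of Oracle’s AVG and STDDEV (both compute the sample standard deviation); the function name and sample values are mine.

```python
from statistics import mean, stdev

def normal_range(values, k=2.0):
    """Normal range as mean ± k standard deviations (k = 2, per the
    range rule of thumb used in the text)."""
    m = mean(values)
    sd = stdev(values)  # sample standard deviation, like Oracle's STDDEV
    return m - k * sd, m + k * sd

# Outlier-free metric values for one metric over the baseline window.
vals = [10, 12, 11, 13, 12, 11, 12, 13, 11, 12]
lo, hi = normal_range(vals)
print(round(lo, 2), round(hi, 2))  # lower ≈ 9.8, upper ≈ 13.6
```

A new observation is then flagged as abnormal if it falls outside (lo, hi).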
In addition to calculating the standard deviation, I also calculate the variance. As mentioned in the earlier review of statistics, both standard deviation and variance indicate the amount of variability in the data, the main difference being in the units themselves. I have found it most useful to look at standard deviation because it is in the same units as the metrics themselves. However, if you are just trying to get an idea of which variables are the most widely ranging, looking at the variance might be more insightful. Because variance is a squared value, it grows more quickly and hence stands out more.
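The relationship is simply that variance is the square of the standard deviation, which is why it exaggerates differences in spread. A one-line check (Python again standing in for Oracle’s STDDEV and VARIANCE functions):

```python
from statistics import stdev, variance

vals = [10, 12, 11, 13, 12, 11, 12, 13, 11, 12]
sd = stdev(vals)       # same units as the metric (e.g., ms)
var = variance(vals)   # squared units; grows faster, easier to spot
print(abs(var - sd * sd) < 1e-9)  # True: variance is the square of stdev
```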
Summary
The statistical analysis detailed in this chapter enables us to state with a good deal of certainty what is normal for any given metric. To summarize: the outliers have been removed, and the mean has been established for the remaining data points. Using the calculated mean, the standard deviation and variance have been established. All of this calculation is accomplished using Oracle’s embedded functions. This process of statistical manipulation is absolutely essential for the next step in the DOPA process, feature selection or flagging, which enables us to identify the metrics that are abnormal and, therefore, highly predictive of problems.
Next, we move on to feature selection.