© Roger Cornejo 2018
Roger Cornejo, Dynamic Oracle Performance Analytics, https://doi.org/10.1007/978-1-4842-4137-0_4

4. Statistical Analysis

Roger Cornejo, Durham, NC, USA

This chapter addresses the statistical manipulation of the data that has been collected and formatted. I’ll start by reviewing some basic statistical concepts, including normal ranges, outliers, and variance, and then discuss how these concepts are applied as part of the DOPA process to establish normal ranges for the host of metrics that don’t have well-known norms. Don’t worry, you don’t have to break out the statistics books; I will show you how to use the embedded Oracle functions to easily accomplish this statistical analysis.

So far, you have learned how to gather the data, unpivot/normalize the values into key-value pairs, and union them together to create a single virtual data set of metrics and their values. The next step in the DOPA process is to perform some basic statistical operations. Once performed, these statistical calculations will allow the program to identify which part(s) of the database are experiencing performance problems by flagging those that are “abnormal” based on the statistical analysis.

For some metrics collected by Oracle, there are absolute values which can be identified as “normal.” When these metrics exceed the “normal” values, a problem is clearly indicated. For example, a DBA will most likely be familiar with the following metrics and have some idea about the expected values for them (I’ve included some of the numbers that I consider normal):
  • Host CPU Utilization (%): for optimal response time, no greater than 65%

  • Average Active Sessions: less than or equal to CPU count

  • Average Synchronous Single-Block Read Latency: no greater than 20 milliseconds

  • Run Queue Per Sec: I am concerned with values above 5

  • Database Wait Time Ratio: no greater than 80%

  • DB time

  • User I/O wait time

  • Session Limit %
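
For metrics like these, a simple fixed threshold is enough to flag problem intervals. The following is a minimal sketch, assuming dba_hist_sysmetric_summary as the source and using the 65% CPU guideline noted in the list above (the metric name and threshold are just the assumptions for this example):

-- flag intervals where host CPU utilization exceeded the 65% guideline
select begin_time, end_time, average as cpu_utilization_pct
  from dba_hist_sysmetric_summary
 where metric_name = 'Host CPU Utilization (%)'
   and average > 65
 order by begin_time;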

While it’s true that some metrics have a known expected value, in my experience those are the exception rather than the rule, given the thousands of metrics available in the AWR. The vast majority of metric values in Oracle’s repository have no known absolute normal value. That may be because the DBA is unfamiliar with the metric, or because it measures something that varies widely from machine to machine and from time to time, depending on the workload, the number and types of applications running, the number of users, and so on. What is “normal” for one machine at a given time may therefore change as its use changes.

Following are some examples of metrics for which a DBA will probably not know “normal” values:
  • Response Time Per Txn

  • User Transaction Per Sec

  • Redo Generated Per Sec

  • Network Traffic Volume Per Sec

  • Physical Read Total Bytes Per Sec

  • Physical Write Total IO Requests Per Sec

  • Current Open Cursors Count

  • redo log space requests

  • hard parse elapsed time

  • parse time elapsed

Given that it is humanly impossible to know or calculate manually the normal ranges for the many thousands of metrics available to us in the AWR, I have devised a method which uses simple statistical functions built into Oracle to compute the normal ranges for each metric. Again, the reason for this is so that the program can identify metrics as problematic when they exceed the normal value established by that statistical analysis.

Most computer professionals will have had at least some basic statistics as part of their schooling and will therefore be somewhat familiar with the statistical concepts of norms and standard deviations, but I’m going to do a quick review before proceeding just for clarity. I’ll define a few terms and then explain how they are used in the DOPA process.

Statistics Review

Mean is the average of a set of numbers calculated by taking the sum of the numbers and then dividing by the quantity of numbers considered. Figure 4-1 shows the mathematical computation of the mean for five numbers, and immediately following it in Figure 4-2, you will see how to accomplish this computation using Oracle’s embedded function.
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig1_HTML.jpg
Figure 4-1

This is the mathematical computation of mean for the five numbers: 7, 2, 10, 4, and 12

Given a table of values (7, 2, 10, 4, and 12), you can use Oracle’s embedded functions to calculate mean (average) as follows:
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig2_HTML.jpg
Figure 4-2

This is example code to accomplish the computation of the mean using Oracle’s embedded function
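
For reference, here is a minimal sketch of that calculation. The five values are generated inline with sys.odcinumberlist rather than stored in a table, purely as a convenience for the example:

-- average of the five sample values; returns 7
with sample_data as
  (select column_value as val
     from table(sys.odcinumberlist(7, 2, 10, 4, 12)))
select avg(val) as mean_value
  from sample_data;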

Variance is a statistical term that expresses how widely dispersed the data points in a particular set are. In a data set with high variance, the data points will be spread out more than in a data set with low variance. The mathematical formula for calculating sample variance used by Oracle is shown in Figure 4-3, and sample code for calculating the same using Oracle’s embedded functions is given in Figure 4-4.
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig3_HTML.png
Figure 4-3

This is the mathematical formula for calculating sample variance

Calculating the Sample Variance Using the Oracle Function

Given the same table of values as in the mean example in the preceding section, an example of the variance calculation in Oracle follows:
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig4_HTML.jpg
Figure 4-4

This is sample code to calculate variance using Oracle’s embedded functions
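
A minimal sketch of the same calculation, again generating the sample values inline; Oracle’s VARIANCE function computes the sample variance (dividing by n - 1):

-- sample variance of the five values; returns 17
with sample_data as
  (select column_value as val
     from table(sys.odcinumberlist(7, 2, 10, 4, 12)))
select variance(val) as sample_variance
  from sample_data;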

Standard deviation is another way of expressing the degree of dispersion of data points in a particular set. The formula for its calculation is shown in Figure 4-5. As you can see from the formula, it is simply the square root of variance. Figure 4-6 shows how to accomplish this calculation using Oracle’s embedded functions. As with variance, when standard deviation is low, it indicates that values are grouped close to the mean, and when standard deviation is high, it indicates that the values are much more widely dispersed. Either calculation can be used to help improve understanding of a particular data set. The main difference between the two is that while variance is expressed as squared units, standard deviation is expressed in the same values as the mean of the data set. For this reason, I have chosen to use standard deviation in my calculations.
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig5_HTML.jpg
Figure 4-5

Mathematical calculation of standard deviation

Given the same table of values as in the mean example in the preceding section, an example of the standard deviation calculation in Oracle follows:
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig6_HTML.jpg
Figure 4-6

How to calculate standard deviation using Oracle’s embedded functions
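
Again, a minimal sketch using the same inline sample values; STDDEV returns the square root of the sample variance:

-- sample standard deviation of the five values; returns sqrt(17), roughly 4.12
with sample_data as
  (select column_value as val
     from table(sys.odcinumberlist(7, 2, 10, 4, 12)))
select stddev(val) as sample_stddev
  from sample_data;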

Without going too deep into the theory, statisticians rely on the central limit theorem, which lets us assume that the sample mean of a set of random variables tends toward a normal distribution and thus lets us compare means of different samples. Furthermore, the three-sigma rule (empirical rule) states that in a normal distribution, most of the data will fall within three standard deviations of the mean and, more specifically, that it will fall into a pattern as follows:
  • 68% will fall within one standard deviation of the mean.

  • 95% within two standard deviations.

  • 99.7% within three standard deviations.

Normal distributions are often represented graphically and can be recognized as the familiar bell curve represented in Figure 4-7.
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig7_HTML.jpg
Figure 4-7

This is a figure of a bell curve showing a normal distribution

One last concept that is important to understand is outliers. Outliers are data points that are very far from most of the other data points in a set of data. Figure 4-8 shows a set of values for Average Synchronous Single-Block Read Latency on an hourly basis. It is clear from the graph that the value for the 10:00 hour is an outlier, since it is so far beyond the other values.
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig8_HTML.jpg
Figure 4-8

Example of an outlier in the data that one would want to remove before calculating the normal range

Removing outliers is important because they can significantly skew the calculation of “normal” for a given set of data. We want to identify them and remove them prior to calculating normal ranges for this reason.

A common method for identifying outliers uses a concept called the interquartile range, or IQR. This method divides the data into quartiles by calculating the median of the numbers in the lower half of the data set (called Q1) and the median of the numbers in the upper half of the data set (called Q3); the IQR is the difference between Q3 and Q1. The data points contained between Q1 and Q3 represent the middle 50% of the data. A number is considered to be an outlier if it is more than 1.5 times the IQR below Q1 or more than 1.5 times the IQR above Q3.
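
As an illustration only (not the exact DOPA code, which appears later in Figure 4-9), Q1 and Q3 can be computed with Oracle's PERCENTILE_CONT function and turned into outlier fences. The metric name and the use of the average column of dba_hist_sysmetric_summary are assumptions made for this sketch:

with vals as
  (select average as val
     from dba_hist_sysmetric_summary
    where metric_name = 'Average Synchronous Single-Block Read Latency'),
quartiles as
  (select percentile_cont(0.25) within group (order by val) as q1,
          percentile_cont(0.75) within group (order by val) as q3
     from vals)
select v.val,
       case
         when v.val < q.q1 - 1.5 * (q.q3 - q.q1) then 'lower outlier'
         when v.val > q.q3 + 1.5 * (q.q3 - q.q1) then 'upper outlier'
       end as outlier
  from vals v
 cross join quartiles q;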

Statistics in the DOPA Process

Having laid a foundation of basic statistical analysis, I will now proceed to show you how I use statistics in the DOPA process to establish “normal values” for a given metric.

When a problem is reported, I collect the metric data for the week prior. One week has been my default time frame, because on most of the databases on which I work, this is the AWR retention setting. For establishing normal ranges, I want to consider metrics that reflect normal operation of the database, so I use metric values from a time interval when it was performing without problems. For example, if the problem occurred today, I would grab the data from the previous week and include all the metrics except those collected over the last 24 hours when the problem occurred. The AWR retention setting can be altered and it is probably useful to do so; I have been attempting to increase retention to 30 days on our databases so that a longer pre-problem comparative block of data is available for reference and establishing normal values.
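
A minimal sketch of how that pre-problem snapshot window could be pulled from the AWR follows; the seven-day window and one-day exclusion simply mirror the defaults described above:

-- snapshots from the prior week, excluding the most recent 24 hours
select snap_id, begin_interval_time, end_interval_time
  from dba_hist_snapshot
 where begin_interval_time >= sysdate - 8
   and end_interval_time   <= sysdate - 1
 order by snap_id;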

Once I have my metric data for what I believe is a pre-problem time interval long enough to be representative of normal function, I begin my statistical manipulation by identifying outliers. I use the IQR method that identifies numbers more than 1.5 interquartile ranges (IQRs) above Q3 and more than 1.5 IQRs below Q1 as outliers. I have instrumented the multiplication factor as a variable since there may be instances when I want to exclude or include more numbers at the extreme, but I use 1.5 as my default. An example of code used to calculate the IQR for use in identifying outliers is shown in Figure 4-9.
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig9_HTML.png
Figure 4-9

This figure shows code used for removing outliers

When the code in Figure 4-9 is executed for a particular database and time interval, the result is a table. The bind variable for metric_name can be left blank to default to “Enqueue Requests Per Txn”; otherwise, provide another metric_name from dba_hist_sysmetric_summary. The result table is shown in Figure 4-10. In this table, you can see the metric name and its value for each interval. Q1 and Q3 are the same for each row and have been used to establish the upper and lower bounds for determining outliers. If the metric value for a given interval is determined to be an outlier, this is indicated in the “outlier” column, along with whether it is an upper or lower outlier.
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig10_HTML.png
Figure 4-10

This table represents the result for the code shown in Figure 4-9 executed for a particular database for a particular time interval and shows Q1 and Q3 and whether the value was determined to be an outlier

The following example demonstrates why removing outliers is important. Figure 4-11 shows the data points for single-block read latency values over an eight-day time period.
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig11_HTML.jpg
Figure 4-11

Single high value for single-block read latency over approximately an eight-day collection interval

In this graph, you will observe a single peak where the value for one time interval is much higher than the values for all the other intervals. The value was 1230 during that hour. I would normally expect single-block read latency to be less than 10 milliseconds. The abnormally high value occurred only once and there were no user complaints related to it, so I assumed it was a glitch in the data collection. If the 1230 value is included in the calculation of the mean, it significantly skews the results. Including the high value yields a normal range with an upper bound of 182 milliseconds. This upper bound is represented graphically by the gray horizontal line at 182 in Figure 4-11. Remembering that we should expect single-block latency to be under 10 milliseconds, using this upper bound would cause the program to miss many abnormal values.

When the extremely high value is excluded as an outlier, the calculated upper bound drops to 8 milliseconds, which is a much more accurate upper bound for this metric. Figure 4-12 shows the values for the upper bound before and after the outlier is removed.
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig12_HTML.png
Figure 4-12

Calculation of the upper bound when including the unusually high value and also when it is excluded as an outlier

The graph for the data set when the outlier is removed is shown in Figure 4-13. Now that the outlier has been removed, the upper bound is determined to be 8, again represented by a horizontal gray line. Note that with this new upper bound, there are several metric values that rise above that line and will therefore be “flagged” as unusually high.
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig13_HTML.jpg
Figure 4-13

This graph shows the same data set as in the preceding figure after the outlier is removed. Note the new upper bound at 8.

The interquartile range method for eliminating outliers is easy to implement using SQL and Oracle functions, and it has worked well for me thus far. I am not a statistician, so I am open to exploring alternate methods for removing outliers. The important point is that outliers should be removed prior to performing the next series of operations to ensure that real problems are not missed due to skewed results.

After I have removed outliers, I proceed to calculate the normal range for the remaining values. This is easily accomplished using two of Oracle’s embedded functions, AVG and STDDEV. AVG determines the arithmetic mean for the entire data set, and STDDEV computes the standard deviation based on that mean and the existing data points.

I use the standard deviation to set the upper and lower bounds of the “normal range” of values for each metric. The upper bound of “normal” for each metric is established as 2 standard deviations above the mean and the lower bound is 2 standard deviations below the mean. This is fairly standard and is referred to as the range rule of thumb.
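
Putting those two functions together, a minimal sketch of the normal-range calculation per metric might look like the following; it assumes dba_hist_sysmetric_summary as the source and that outliers have already been filtered out:

-- mean, standard deviation, and a two-sigma normal range for each metric
select metric_name,
       avg(average)                        as mean_value,
       stddev(average)                     as std_dev,
       avg(average) - 2 * stddev(average)  as lower_bound,
       avg(average) + 2 * stddev(average)  as upper_bound
  from dba_hist_sysmetric_summary
 group by metric_name;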

Once a normal range is established, individual values can be compared to the norm to determine if they are “abnormal.” Examples of how normal range is used abound. Blood chemistry labs are one familiar usage. Based on samples from a very large patient population, a range of normal values has been established for various blood chemistry metrics, among them red blood cell (RBC) count. If you have labs drawn, you will usually see your individual test result listed alongside the range of normal. Figure 4-14 shows an example of test results for an individual’s RBC. In this example the test results would be considered abnormal since the test result falls below the low end of the normal range.
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig14_HTML.png
Figure 4-14

A common use of normal range is red blood cell count. This figure shows an example of the results of an individual test next to a normal range of values for that test.

In addition to calculating the standard deviation, I also calculate the variance. As mentioned in the earlier review of statistics, both standard deviation and variance indicate the amount of variability in the data, the main difference being the units. I have found it most useful to look at standard deviation because it is in the same units as the metrics themselves. However, if you are just trying to get an idea of which variables range most widely, looking at the variance might be more insightful. Because variance is a squared value, it grows more quickly and hence stands out more.

Figure 4-15 shows an example of several CPU-related metrics and the results of the statistical analysis that has been performed on those metrics in order to prepare the data for the next step in the DOPA process.
../images/469031_1_En_4_Chapter/469031_1_En_4_Fig15_HTML.png
Figure 4-15

This table displays several metrics with the results of the statistical analysis performed. The average, variance, standard deviation, and upper and lower bounds of the normal range are all given in columnar format.

Summary

The statistical analysis detailed in this chapter enables us to state with a good deal of certainty what is normal for any given metric. To summarize, the outliers have been removed, and the mean has been established for the remaining data points. Using the calculated mean, the standard deviation and variance have been established. All of this calculation is accomplished using Oracle’s embedded functions. This process of statistical manipulation is absolutely essential for the next step in the DOPA process, feature selection or flagging, which enables us to identify the metrics that are abnormal and, therefore, highly predictive of problems.

Next, we move on to feature selection.
