© Roger Cornejo 2018
Roger Cornejo, Dynamic Oracle Performance Analytics, https://doi.org/10.1007/978-1-4842-4137-0_5

5. Feature Selection

Roger Cornejo1 
(1)
Durham, NC, USA
 

In the last chapter, we discussed the statistical calculations necessary for establishing values that could be considered normal for a given metric. Once a normal is established for each metric, it is possible to identify metrics outside of normal and highlight them. These “abnormal” metrics are the ones that will be most useful for predicting the cause of performance issues in the database. In this chapter, I’ll discuss the process of drawing out the Oracle performance metrics that are abnormal, and therefore important for our analysis, and of discerning which ones can be safely ignored, using a process that eliminates personal bias.

Sidebar Note:

For a simple example of feature selection in, say, the field of clinical blood chemistry analysis, consider that there could be hundreds of metrics for a single blood sample; feature selection as I’m using it here would provide a report in which those metrics that fall outside of the normal range are highlighted. Flagging the out-of-normal blood chemistry results does not direct the physician to the clinical interpretation of, say, a low value for “red blood cell count,” or that metric combined with low values for “hemoglobin”; it simply saves the domain expert time in zeroing in on the metrics that are likely to be important to a diagnosis.

In the field of analytics, each measured attribute of the data is called a “feature,” and “feature selection” refers to the process of identifying which features are most useful for creating a predictive model. Feature selection seeks to reduce the number of variables considered by excluding those that do not improve the accuracy of the model. Obviously, doing this requires a method for determining which variables are predictive.

For our use in performance tuning, a feature is a metric, and “feature selection” refers to the process of discovering which metrics are most likely causally related to a reported performance issue. Feature selection will highlight the metrics that are unusual (either lower than normal or, more commonly, higher) and therefore most likely to inform the tuning expert about the area of tuning that will yield the most significant improvement in performance. The more accurate the feature selection, the more accurate the resulting predictive model. It is important to remember that the goal here is to produce a predictive model. The tuner will still be required to use his knowledge of the database and his skill to complete the analysis and perform the tuning, but by using the feature selection methods implemented as part of the DOPA process, he will be able to eliminate many variables not likely to yield significant performance improvement, based on a scientific process rather than on personal bias or his familiarity/comfort with a given metric.

There are many methods for accomplishing feature selection. For example, machine learning methods use training sets of known problems to discover patterns that can then be used to identify the cause of other performance issues where the problem is not known. Other methods involve complex algorithms. For the DOPA process, I have chosen a simple statistical method of feature selection in which each metric value is compared to the computed normal range. Although simple, it is effective in identifying abnormal metrics, easy to implement, and able to produce reliable results when used in the DOPA process.

I call the feature selection method I use for identifying metrics outside of normal range “flagging.” As noted in Chapter 4, I begin by collecting data from the previous week in order to establish normal ranges for all the metrics during a nonproblematic interval. After establishing the normal values, my code compares each metric at every time instance to the computed normal values that have been established. The program assigns a 0 to every metric that is within the normal range, and it assigns a 1 to every metric that is outside the normal range. This process creates a bitmap of the data set. The metrics outside of normal and identified with a 1 are “flagged” in this way by my process. When a report is generated, only the metrics “flagged” with a 1 are reported and the normal metrics are eliminated. The flagging just described is a very basic method of feature selection.
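To make the flagging step concrete, the following is a minimal SQL sketch of the idea. It assumes a hypothetical working table, metric_normals, that already holds each metric observation alongside its computed normal-range bounds; the table and column names are illustrative placeholders, not the actual DOPA code.

-- Sketch only: assign the 0/1 flag by comparing each observation to its normal range
SELECT metric_name,
       snap_time,
       value,
       lower_bound,
       upper_bound,
       CASE
         WHEN value < lower_bound OR value > upper_bound THEN 1   -- outside normal range: flag it
         ELSE 0                                                   -- within normal range
       END AS flag
  FROM metric_normals;

Running something along these lines over the analysis period yields the 0/1 bitmap just described.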

As I’ve stated, the statistical method for feature selection used in the DOPA process yields a bitmap of the data set of metric values. The purpose of the flagging process is to reduce the number of data points considered, so in reporting the outcome, we choose to ignore all the metrics with a 0 and report only the metrics flagged with a value of 1. This is what I did initially when I was developing this method. Table 5-1 shows an example of what the results look like for a particular problem reported this way. The program sorts the metrics by time and reports only the metrics that are flagged with a 1 (you will see a 1 in the FLAG column). The metrics that are furthest from their normal range are reported first for each time interval. The table shown is only a portion of the total results table (what you are viewing is a portion of the metrics from the first time interval). Note: When no feature selection is applied, you would simply see all the metrics, normal (flag value 0) and abnormal (flag value 1), for the time period.

Table 5-1. Output Example When Feature Selection Is Applied (Note That the FLAG Column Value Is 1)


In addition to the actual value of the metric for that time interval (Value), you will also see in Table 5-1 the statistical results for AVG All (the mean of all values for that metric), the Variance and StdDev, and the LOWER_BOUND and UPPER_BOUND of the normal range for each metric. What I want to draw your attention to at this time is the FLAG column, which shows that all of these metrics have a value of 1.
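A report in the shape of Table 5-1 could then be produced by keeping only the flagged rows and ordering them by time interval and distance from the normal range. The sketch below assumes a hypothetical flagged_metrics table holding the flagging output together with the statistics just described (avg_all, variance_all, and stddev_all stand in for the AVG All, Variance, and StdDev columns); it is illustrative only, not the book’s actual code.

-- Sketch only: report abnormal metrics, furthest from normal first within each interval
SELECT snap_time,
       metric_name,
       value,
       avg_all,
       variance_all,
       stddev_all,
       lower_bound,
       upper_bound,
       flag
  FROM flagged_metrics
 WHERE flag = 1                                                     -- abnormal metrics only
 ORDER BY snap_time,
          GREATEST(value - upper_bound, lower_bound - value) DESC;  -- distance outside the range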

The flagging process and removal of “normal” values from the data set have reduced the number of data points in the results immensely. While the initial data set of all metrics contained approximately two million rows of data points, the flagged results data set has just a few hundred. As in the clinical blood chemistry feature selection example given at the beginning of this chapter, a domain expert is still needed to interpret the results. In the case of Oracle performance analysis, the DBA/tuner is the domain expert and needs to interpret the results. In this example, I’ve discussed feature selection in the absence of a specific problem case, but in Chapter 8 on case studies, you can see feature selection being applied in the context of real production performance problems.

I found this type of results report (as in Table 5-1) very usable for solving problems when the problem was narrowly defined and occurring across a small time interval (one- to three-hour intervals). However, when I attempted to use it for more poorly defined problems that required me to look at longer time intervals, such as a full 24 hours, I was once again challenged to interpret a very large table of numbers. This prompted me to come up with some refinements that made it easier to interpret the results.

I began my refining process by implementing code to aggregate the flags for all the time intervals in the analysis period and to report the metrics in descending order of flag counts—that is, the metric with the most flagged values reported first and the metric with the fewest flagged intervals reported last. I included a column in my results table to show how many “flagged” values were identified during the analysis period for that metric. Because I was now dealing with an aggregated set of numbers, I also calculated and included columns to display the average of all the flagged values and the average of all values within the analysis period. When these extra steps were implemented, the results table became a more functional tool.
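A rough sketch of that aggregation, again using the hypothetical flagged_metrics table from the earlier sketches rather than the actual DOPA code, might look like this:

-- Sketch only: aggregate the per-interval flags over the whole analysis period
SELECT metric_name,
       SUM(flag)                              AS flag_count,         -- number of intervals outside normal
       AVG(value)                             AS avg_all,            -- average over all intervals
       AVG(CASE WHEN flag = 1 THEN value END) AS avg_flagged_values  -- average of the flagged values only
  FROM flagged_metrics
 GROUP BY metric_name
 ORDER BY flag_count DESC;                                           -- most frequently flagged metrics first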

Table 5-2 shows the results of flagging using the enhanced code just described. In this table, each metric is listed only once for the entire analysis period (instead of once for each time interval). The results table is for the same data as that shown in Table 5-1, but in this new format, Active Parallel Sessions is listed at the top of the table because it had the greatest number of flagged values in the analysis interval, which was a 24-hour period (25 time intervals). FLAG_COUNT shows that there were 16 values outside of the normal range for that metric. The average of all 25 intervals (AVG All) was 0, and the average of the 16 flagged values (AVG Flagged Values) was 4.

Table 5-2. Results When Flagged Values Are Treated as an Aggregate


This results table is a big improvement over the previous one because it allows a much easier interpretation of the data. Comparing the two tables, you will see that in Table 5-1, Direct Reads is at the top of the list of metrics for the first hour; this observation might lead you to believe that this is where you should begin tuning efforts. However, when you look at Table 5-2 with the aggregated flagged metrics, you will see that Direct Reads doesn’t even appear near the top of the table because it was flagged in only a few intervals (fewer than five, since this is where the table cuts off for this example). Instead, what stands out from the second table is that Active Parallel Sessions and PQ Slave Session Count were both flagged in 16 of the 25 time intervals. This observation is more likely to lead you in the direction of productive tuning efforts.

One further refinement I made to improve my ability to easily interpret the results of the flagging process was to create a flag ratio. My purpose was to quantify how far “out of whack” a particular value was. The calculation I use compares the amount by which the value lies beyond either the upper or lower bound to the standard deviation. Essentially, I find the delta between the average of the flagged values and the upper limit of the normal range and then divide it by the standard deviation.
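Expressed as a sketch, and again using hypothetical names (metric_flag_summary standing in for the aggregated per-metric results, with its averages, bounds, and standard deviation), the flag ratio could be computed roughly as:

-- Sketch only: how far beyond the upper bound the flagged values sit, in standard deviations
SELECT metric_name,
       (avg_flagged_values - upper_bound) / NULLIF(stddev_all, 0) AS flag_ratio
  FROM metric_flag_summary
 ORDER BY flag_ratio DESC;   -- the most "out of whack" metrics first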

When reporting the results of running the code, I can arrange the flag ratio in descending order. As in Table 5-3, this will show me clearly the metrics that are most “out of whack” with the highest flag ratios at the top.

Table 5-3. Ordering the Model Output by Flag Ratio


Each time I run the DOPA process, I am essentially creating a predictive model for determining where in the database a problem is occurring. Any change in the metrics used, the time interval chosen, the sensitivity of flagging, etc., will result in a different and unique feature selection. This is why I call my process a dynamic process. It is adaptable and easily accommodates manipulation of the attributes being considered.

Feature selection is a very potent tool for solving problems involving extremely large data sets.

In a paper titled “An Introduction to Variable and Feature Selection,” the threefold objective of this tool is clearly expounded:

The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. (Guyon and Elisseeff, “An Introduction to Variable and Feature Selection,” http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf)

The first two goals are clearly achieved with the flagging method of feature selection used in the DOPA process. We’ve established that the AWR repository has far too many metrics for a person to evaluate them all, so reducing the number of data points is essential for enabling the performance tuning analyst to make a more accurate diagnosis of the performance problem. Using the DOPA process, he/she does not have to sort through every metric but can consider only the important ones; the tuner decides which ones are important not by relying on personal experience or bias to select which metrics to focus on, but by using the flagging method, a highly accurate, scientific method for predicting the problem source. The feature selection used for creating the predictive model conserves both time and effort.

The third objective mentioned is the goal of gaining a better understanding of the underlying process. I believe I have only begun to realize this objective, and I can see where further application of the DOPA process may yield even greater impact. This is an idea I will revisit in a later chapter.

Summary

In this chapter, I have shown how I use the normal ranges established with the statistical methods described earlier and then apply a method of feature selection that compares each metric to the established norms, “flagging” those outside of normal for inclusion in the predictive model. This flagging draws out the metrics most useful for directing tuning efforts, creating what is referred to as a predictive model.

The predictive model developed by the DOPA process thus far has a high degree of accuracy for finding the root cause of a problem. I refer to the DOPA process as dynamic because it is easily adaptable; each time it is run with slight changes in any number of variables (e.g., the time interval, the database, or the metric sources used), it will produce a slightly different predictive model. Feature selection used in this way produces a highly accurate model, but I learned from my son that implementing a taxonomy could augment the usefulness of this analysis even further. Therefore, I have added this additional analytic component; it will be the topic of the next chapter.

When I first started using the feature selection capability of the DOPA process, I had to prove to myself that it could be useful. After using it for some time in the actual heat of battle, solving performance problems, I am convinced that the DOPA process adds value in many scenarios, especially those where traditional methods have failed to solve the problem. In Chapter 8, many actual case studies are detailed. Further, the SQL code provided with this book can be used to immediately start building models and seeing results in your own Oracle databases. But before we get to the case studies, I want to cover taxonomies (Chapter 6) and some general guidelines on building the models and reporting (Chapter 7).
