Automated detection of example outliers

Four outlier detection techniques are described in the following sections:

  • Detect Outlier (Distances)
  • Detect Outlier (Densities)
  • Detect Outlier (LOF)
  • Detect Outlier (COF)

None of these algorithms will automatically find the correct outliers in the data being explored. Given their parameters, they flag candidate outlier points so that a person can get involved and make the final determination. This is an important point that needs to be built into any data exploration process.

Detect Outlier (Distances)

The simplest operator is Detect Outlier (Distances). Each example is considered in turn and the distance to its kth nearest neighbor is determined (k is provided as a parameter). Distance in this context is usually the Euclidean distance, although other distance measures can be chosen. The top n examples (n is provided as a parameter) that are at the largest distance from their kth nearest neighbor are marked as outliers.

The algorithm is conceptually simple and requires no underlying knowledge of the distribution of the data. However, it requires a value for k to be chosen and will always flag outliers even if the data does not actually contain any.
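The idea can be sketched in a few lines of Python. This is a hypothetical illustration of the technique, not RapidMiner's implementation; the function and variable names are invented for this sketch.

```python
import math

def knn_distance_outliers(points, k, n):
    """Flag the n points whose distance to their kth nearest
    neighbour is largest (the idea behind Detect Outlier (Distances))."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    kth = []
    for i, p in enumerate(points):
        dists = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        kth.append((dists[k - 1], i))  # distance to the kth nearest neighbour

    # the top n examples by kth-neighbour distance become outliers
    flagged = {i for _, i in sorted(kth, reverse=True)[:n]}
    return [i in flagged for i in range(len(points))]

# four points in a tight cluster plus one stray point
data = [(0.10, 0.10), (0.11, 0.12), (0.09, 0.13), (0.12, 0.09), (0.90, 0.90)]
flags = knn_distance_outliers(data, k=1, n=1)
```

Note that even if the stray point were moved into the cluster, the sketch would still flag exactly one point, which mirrors the operator's behavior of always finding outliers.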

An empirical approach must be adopted for choosing k. Choosing 1 as the value of k uses only the nearest point and does not take into account the presence of other nearby points that could indicate membership of a cluster. This tends to mark points at the edge of less dense clusters as outliers. Choosing a value of k that is too large will start to include legitimate points within denser clusters as outliers and miss the genuine outliers. This is illustrated in the following three screenshots, which show some artificial data arranged in clusters with random noise overlaid. This artificial data has 1,000 examples with two attributes, and is generated with the standard Generate Data operator using the target function Gaussian mixture clusters. In addition, 20 random points were added with values in the 0 to 1 range. The graphics also show the result of outlier detection, with the shape of each point indicating whether the detection correctly matched the known outliers. Refer to outlierDistancePlotter.xml for the process that can recreate these.

The first of these, shown in the next screenshot, highlights 20 outliers detected with k set to 1.

[Screenshot: Detect Outlier (Distances) result with k set to 1]

In the previous screenshot, triangles represent valid points incorrectly marked as outliers, diamonds represent genuine outliers classed as valid points, and squares represent correctly identified outliers. The small points are valid points correctly identified as valid.

The screenshot shows four points (the triangles) incorrectly marked as outliers when in fact they are legitimately part of a cluster. In addition, four genuine outliers (the diamonds) are incorrectly counted as valid points.

Changing k to a large value changes how points are classified. For example, the following screenshot shows what happens when k is set to 137.

[Screenshot: Detect Outlier (Distances) result with k set to 137]

As explained before, the shapes of the points indicate what the outlier detection has produced. This time a large number of points in a cluster are being marked as outliers when they should not be. In addition, more genuine outliers are being missed.

Intuitively, there is a value of k somewhere between these two extremes that is ideal, and only empirical observation in the context of your data will find it. The following screenshot shows the case with k set to 2.

[Screenshot: Detect Outlier (Distances) result with k set to 2]

Now the number of incorrectly identified outliers and missed genuine outliers has been reduced to two each. In real life, of course, you do not have the luxury of knowing which points really are outliers, and it is a case of trial and error to understand why some points are being shown as outliers.

In passing, these graphics were produced using the advanced plotting capabilities of RapidMiner Studio. The process uses the Generate Attributes operator on the result of the outlier detection to create new attributes based on whether the outlier was accurate or not. These new attributes were used to set the shape and colors of individual points. The comments in the process file give instructions on how to recreate these diagrams.

A final point: since this operator always finds outliers, it may be best suited to initial exploration. Once in production, a different approach should be used, and it is certainly not advisable to rely on automated detection followed by implied deletion.

Detect Outlier (Densities)

The Detect Outlier (Densities) operator is also relatively simple to use. It requires two parameters, a distance and a proportion, which makes it slightly harder to set up. Unlike the distances operator, it does not necessarily find outliers and so may be better suited for use in a production environment. For a given pair of distance and proportion parameters, the operator marks a point as an outlier if more than the given proportion of the other points lie further than the given distance away from it. Looked at another way, a multidimensional sphere with a radius equal to the distance parameter is drawn around each point, and the fraction of points falling outside the sphere is calculated. If this fraction exceeds the proportion parameter, the point is an outlier. This corresponds to a density measurement; points in less dense neighborhoods are outliers.
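The sphere-counting idea above can be sketched as follows. Again, this is an invented illustration of the technique rather than RapidMiner's implementation, and the names are hypothetical.

```python
import math

def density_outliers(points, distance, proportion):
    """Flag a point when more than `proportion` of the other points
    lie further than `distance` away from it (a density measure)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    flags = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        outside = sum(1 for q in others if dist(p, q) > distance)
        flags.append(outside / len(others) > proportion)
    return flags

# the stray point has every other point outside its sphere of radius 0.2,
# so only it exceeds the 0.9 proportion threshold
data = [(0.10, 0.10), (0.11, 0.12), (0.09, 0.13), (0.12, 0.09), (0.90, 0.90)]
flags = density_outliers(data, distance=0.2, proportion=0.9)
```

Unlike the distance-based sketch, this one can return no outliers at all: if every point had enough neighbors within the radius, every flag would be false.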

Generally the proportion parameter should be set to a number approaching 1 (note that the parameter is in the 0 to 1 range, representing 0 percent to 100 percent); the closer the parameter is to 1, the fewer the number of outliers. The distance parameter must be set somewhere between the maximum and minimum distance in the data; the larger this value, the fewer the number of outliers. It always requires an empirical investigation to determine the optimum values, and the determination will usually include an iterative investigation to determine why there are outliers and what rule is needed to deal with them.

The main weakness of the density approach is that the algorithm cannot cope with regions of different densities. This leads us to the Local Outlier Factor (LOF) operator.

Detect Outlier (LOF)

The Detect Outlier (LOF) operator, based on the algorithm proposed by Breunig et al., is able to cater for different densities and can therefore find outliers in regions of varying density within a dataset. The algorithm takes two simple parameters corresponding to a minimum and a maximum number of k nearest neighbors. A local density is estimated from these neighbors, which allows regions of lower density to be identified; points in low-density regions are outliers. The obvious question is what values to choose. As usual, it depends, and empirical investigation is the only way. Having said that, the values of k do not seem to have much impact on whether an example is indicated as an outlier. There is, however, a drawback: the operator does not return a Boolean flag to indicate whether an example is an outlier; instead, a numerical value is returned. The higher this value, the more likely it is that the example is an outlier. This means a relative determination must be made to decide whether an example really is an outlier. For example, taking the data from the previous section on the Detect Outlier (Distances) operator, performing an LOF outlier determination for multiple values of the upper and lower bounds for k, and then aggregating the results produces the following graph. The process to do this, called bestLOFFinal.xml, is available with the files that accompany this book. The instructions for configuring the advanced plotter to display this graph are included with the process.

[Screenshot: average outlier factor produced by Detect Outlier (LOF) for multiple k bounds]

The graph in the previous screenshot differs from the earlier ones because att1 is not plotted. Instead, the y axis shows the average outlier factor produced by the outlier detection, averaged over multiple combinations of the lower and upper k bounds (the bound values themselves are not shown, only their effect on the average factor). Recall that the original data contains 1,020 examples. Plotting against att2 gives the opportunity to relate the graph to the real data shown previously. As can be seen, the outliers, shown as larger points, sit consistently above the main clusters of points. It is also interesting that the minimum values, where the clusters are, seem to form concave shapes; the local minima presumably correspond to the centers of the clusters. By empirical observation, it is possible to select a threshold above which a point is treated as an outlier. In this case, a point with an outlier factor above 4 is a candidate outlier, although that needs to be understood in the context of the data.
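The core LOF computation can be sketched for a single value of k (the operator itself works over a range between its two bounds and this simplification ignores that; the names are invented for the sketch).

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor for a single value of k. Scores near 1
    indicate cluster members; larger scores suggest outliers."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    n = len(points)
    neighbours, k_dist = [], []
    for i in range(n):
        d = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbours.append([j for _, j in d[:k]])
        k_dist.append(d[k - 1][0])  # distance to the kth nearest neighbour

    # local reachability density of each point
    def lrd(i):
        reach = [max(k_dist[j], dist(points[i], points[j]))
                 for j in neighbours[i]]
        return k / sum(reach)

    lrds = [lrd(i) for i in range(n)]
    # LOF: ratio of the neighbours' average density to the point's own
    return [sum(lrds[j] for j in neighbours[i]) / (k * lrds[i])
            for i in range(n)]

data = [(0.10, 0.10), (0.11, 0.12), (0.09, 0.13), (0.12, 0.09), (0.90, 0.90)]
scores = lof_scores(data, k=2)  # the stray point scores far above 1
```

The absence of a fixed cutoff is visible here too: the scores are relative, and a threshold such as the value of 4 mentioned above has to be chosen empirically for the data at hand.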

The complexity of the operator can adversely affect performance for large datasets and this is an additional point to bear in mind.

Detect Outlier (COF)

If the data has class labels, it makes intuitive sense that an outlier for one class would have different characteristics when compared with another class. The Detect Outlier (COF) operator can be used in this case. The operator is complex in its operation, and the interested reader can research further by referring to A Comparative Study of Outlier Mining and Class Outlier Mining by Motaz K. Saad and Nabil M. Hewahi.

The algorithm has straightforward parameters. The two important ones are the number of neighbors and the number of class outliers. The number of neighbors parameter must be chosen empirically, although experiments indicate that it is not that critical. A very low value will presumably lead to oversensitivity to single points and a very large value may include points that are in different classes.

The number of class outliers parameter simply dictates how many outliers will be found. Note that this is independent of the number of classes: if it is set to 10 and there are five classes, the process may determine two outliers from each class or some other combination. This means the operator always finds outliers but, unlike the distances operator, an outlier factor is also returned for every example marked as an outlier, which allows a threshold to be used as an additional check. All non-outliers have this outlier attribute set to infinity. Generally, the lower this value, the more likely the example is an outlier. By performing an empirical investigation similar to the one for the local outlier factor method, it is possible to determine a range of thresholds.
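The class-aware intuition can be illustrated with a much-simplified sketch. This is not the COF algorithm from the paper cited above; it only captures the idea that each example is judged against neighbors of its own class, and that a fixed number of outliers is returned regardless of how they fall across classes. All names here are invented.

```python
import math

def class_outlier_scores(points, labels, k, n_outliers):
    """Score each example by its average distance to the k nearest
    neighbours sharing its label and flag the n_outliers highest."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    scored = []
    for i, p in enumerate(points):
        # only neighbours with the same class label are considered
        same = [q for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        nearest = sorted(dist(p, q) for q in same)[:k]
        scored.append((sum(nearest) / len(nearest), i))

    # a fixed number of outliers, independent of the number of classes
    flagged = {i for _, i in sorted(scored, reverse=True)[:n_outliers]}
    return [i in flagged for i in range(len(points))]

# two classes; one 'a' example sits far away from the rest of its class
pts = [(0.10, 0.10), (0.12, 0.11), (0.09, 0.12), (1.00, 1.00),
       (0.50, 0.50), (0.52, 0.49), (0.48, 0.51)]
labs = ['a', 'a', 'a', 'a', 'b', 'b', 'b']
flags = class_outlier_scores(pts, labs, k=2, n_outliers=1)
```

As in the operator, the stray 'a' example is flagged because it is far from its own class, even though it would not necessarily be unusual relative to the dataset as a whole.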

The algorithm is relatively expensive in operation and large datasets may require more processing time, so this must be kept in mind when using it. In addition, the operator clearly cannot be used if there is no label in the input data. This means that unlabeled test data cannot be checked for outliers with it, and one of the alternatives must be used instead.
