Outlier and anomaly detection

Anomalies are unusual and unexpected patterns in an observed world. Analyzing, identifying, understanding, and predicting anomalies from seen and unseen data is therefore one of the most important tasks in data mining. Detecting anomalies allows us to extract critical information from data, which can then be used for numerous applications.

While anomaly is the generally accepted term, synonyms such as outlier, discordant observation, exception, aberration, surprise, peculiarity, or contaminant are often used in different application domains. In particular, anomaly and outlier are often used interchangeably. Anomaly detection finds extensive use in fraud detection for credit cards, insurance, or healthcare, intrusion detection for cyber-security, fault detection in safety-critical systems, and military surveillance of enemy activities.

The importance of anomaly detection stems from the fact that, across a variety of application domains, anomalies in data often translate to significant actionable insights. When we start exploring a highly unbalanced dataset, there are three possible interpretations of the dataset based on its kurtosis. Consequently, the following questions need to be answered by means of data exploration before applying any feature engineering:

  • What percentage of the data is present, that is, does not have null or missing values, for each of the available fields? Then try to handle those missing values and interpret them well without losing the data's semantics.
  • What is the correlation between the fields? What is the correlation of each field with the variable to be predicted? What values do they take (that is, categorical or non-categorical, numerical or alphanumeric, and so on)? A quick R sketch of these checks follows this list.
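
The following is a minimal sketch of these two checks in R; the data frame df and its columns (amount, age, label) are hypothetical stand-ins for the real dataset:

# Toy data frame standing in for the dataset (hypothetical values)
df <- data.frame(amount = c(12.5, 80.0, NA, 5.3, 250.0),
                 age    = c(34, 45, 29, NA, 52),
                 label  = c(0, 0, 0, 1, 1))

# Percentage of non-missing values per field
colMeans(!is.na(df)) * 100

# Pairwise correlations among the fields (including the target),
# ignoring missing values
cor(df, use = "pairwise.complete.obs")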

Then find out whether the data distribution is skewed or not. You can identify skewness by looking for outliers or a long tail (slightly skewed to the right, or positively skewed; slightly skewed to the left, or negatively skewed, as shown in Figure 1). Then identify whether the outliers contribute towards making the prediction or not. More statistically, your data exhibits one of three possible kinds of kurtosis, as follows:

  • Mesokurtic if the measure of kurtosis is approximately equal to 3 (as for a normal distribution)
  • Leptokurtic if the measure of kurtosis is more than 3
  • Platykurtic if the measure of kurtosis is less than 3
Figure 1: Different kinds of skewness in an imbalanced dataset
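
As a small illustration, here is a tiny hypothetical R helper that maps a kurtosis value to one of these three categories (the tolerance of 0.5 around 3 is an arbitrary assumption):

kurtosis_type <- function(k, tol = 0.5) {
  if (abs(k - 3) <= tol) "mesokurtic"   # approximately 3
  else if (k > 3) "leptokurtic"         # heavier tails than normal
  else "platykurtic"                    # lighter tails than normal
}

kurtosis_type(3.1)  # "mesokurtic"
kurtosis_type(8.2)  # "leptokurtic"
kurtosis_type(1.9)  # "platykurtic"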

Let's consider an example. Suppose you are interested in fitness walking, and you walked on a sports ground or in the countryside over the last four weeks (excluding the weekends). You spent the following times (in minutes) to finish a 4 km walking track: 15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6, 25.15, 27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71, 35, 39, and 50. Computing and interpreting the skewness and kurtosis of these values using R produces a density plot, as follows.
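
A minimal sketch of this computation, assuming the moments package for skewness() and kurtosis() (any equivalent implementation would do):

# Workout times (minutes) for the 20 weekday walks
times <- c(15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6, 25.15,
           27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71, 35, 39, 50)

library(moments)  # install.packages("moments") if needed
skewness(times)   # > 0 indicates a right-skewed (positively skewed) distribution
kurtosis(times)   # > 3 indicates a leptokurtic distribution

# Density plot of the workout times
plot(density(times), main = "Workout time (minutes)")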

The interpretation of the distribution of the data (workout times) presented in Figure 2 shows that the density plot is skewed to the right, so the distribution is leptokurtic. The data points in the right-most positions can therefore be thought of as unusual or suspicious for our use case, and we could potentially identify or remove them to make our dataset balanced. However, removing them is not the purpose of this project; only identifying them is.

Figure 2: Histogram of the workout times (right-skewed)

Nevertheless, removing the long tail alone cannot eliminate the imbalance completely. Another workaround is outlier detection, and removing those data points can be useful.

Moreover, we can also look at the box plots of each individual feature. A box plot displays the data distribution based on a five-number summary: minimum, first quartile, median, third quartile, and maximum, as shown in Figure 3, where we look for outliers beyond three times the interquartile range (IQR):

Figure 3: Outliers beyond three times the interquartile range (IQR)
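
A minimal sketch of this check in R, reusing the times vector from the walking example (note that boxplot() itself draws whiskers at 1.5 × IQR by default; the 3 × IQR fences below follow the rule stated above):

# Quartiles and the 3 * IQR fences
q <- quantile(times, c(0.25, 0.75))
iqr <- IQR(times)
lower <- q[1] - 3 * iqr
upper <- q[2] + 3 * iqr

# Values falling outside the fences are flagged as outliers
times[times < lower | times > upper]

boxplot(times, main = "Workout time (minutes)")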

Therefore, it would be useful to explore whether removing the long tail provides better predictions for supervised or unsupervised learning. However, there is no concrete recommendation for this highly unbalanced dataset. In short, the skewness analysis does not help us in this regard.

Finally, even if your model cannot provide a perfect classification, the mean squared error (MSE) can provide a clue for finding outliers or anomalies. For example, in our case, even if our model cannot cleanly classify the dataset into fraud and non-fraud cases, the mean MSE is definitely higher for fraudulent transactions than for regular ones. So, even though it may sound naïve, we can still identify outlier instances by applying an MSE threshold above which we consider an instance an outlier. For example, we can treat an instance with an MSE > 0.02 as an anomaly/outlier.
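
A minimal sketch of such thresholding in R. The mse vector here is simulated purely for illustration; in the project, these values would be the per-instance reconstruction errors of a model such as an autoencoder:

set.seed(42)

# Simulated per-instance reconstruction errors (hypothetical values):
# most instances are regular, a few fraudulent ones have larger errors
mse <- c(abs(rnorm(95, mean = 0.005, sd = 0.003)),
         abs(rnorm(5,  mean = 0.05,  sd = 0.02)))

threshold <- 0.02               # the cut-off suggested above
anomalies <- which(mse > threshold)
length(anomalies)               # number of suspected outliers/anomalies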

Now the question is: how can we do this? Well, through this end-to-end project, we will see how to use autoencoders for anomaly detection. We will also see how to use autoencoders to pre-train a classification model. Finally, we'll see how to measure model performance on unbalanced data. Let's get started by learning a bit about autoencoders.
