In this second example, we'll model the opposite of the previous example: instead of modeling typical fraud cases, we'll model the normal, expected behavior of the system. If something cannot be matched against our model of expected behavior, it will be considered anomalous.
We'll work with a publicly available dataset released by Yahoo Labs that is useful for discussing how to detect anomalies in time series data. For Yahoo, the main use case is in detecting unusual traffic on Yahoo servers.
Even though Yahoo announced that their data is publicly available, you have to apply to use it, and it takes about 24 hours before the approval is granted. The dataset is available here:
http://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70
The dataset comprises real traffic to Yahoo services, along with some synthetic data. In total, it contains 367 time series, each of which contains between 741 and 1,680 observations recorded at regular intervals. Each series is written in its own file, one observation per line. Each observation is accompanied by a second column with an indicator: one if the observation was an anomaly, and zero otherwise. The anomalies in the real data were determined by human judgment, while those in the synthetic data were generated algorithmically. A snippet of the synthetic time series data is shown in the following table:
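Assuming the three-column layout of the benchmark files (timestamp, value, and anomaly indicator, which matches how the loading code later indexes the columns), a single line can be parsed into a small record like this hypothetical helper:

```java
// Hypothetical record for one observation of a Yahoo benchmark series.
// Assumed CSV layout per line: timestamp,value,is_anomaly
public class YahooRecord {
    final long timestamp;
    final double value;
    final boolean isAnomaly;

    YahooRecord(String csvLine) {
        String[] cols = csvLine.split(",");
        this.timestamp = Long.parseLong(cols[0]);
        this.value = Double.parseDouble(cols[1]);
        // note: compare string content with equals(), not ==
        this.isAnomaly = cols[2].equals("1");
    }
}
```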
In the following section, we'll learn how to transform time series data into an attribute representation that allows us to apply machine learning algorithms.
Detecting anomalies in raw, streaming time series data requires some data transformations. The most obvious way is to select a time window and sample the time series into segments of fixed length. In the next step, we want to compare a new time series to our previously collected set to detect if something is out of the ordinary.
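The fixed-length sampling step can be sketched with the following standalone snippet; `WindowSampler` and `windows` are hypothetical names, and trailing values that do not fill a complete window are simply dropped here:

```java
import java.util.ArrayList;
import java.util.List;

public class WindowSampler {
    // Split a series into consecutive, non-overlapping windows of fixed length.
    // Trailing values that do not fill a complete window are discarded.
    static List<double[]> windows(double[] series, int winSize) {
        List<double[]> result = new ArrayList<>();
        for (int start = 0; start + winSize <= series.length; start += winSize) {
            double[] w = new double[winSize];
            System.arraycopy(series, start, w, 0, winSize);
            result.add(w);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] series = {1, 2, 3, 4, 5, 6, 7};
        // two complete windows of length 3; the trailing value 7 is dropped
        System.out.println(windows(series, 3).size());
    }
}
```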
The comparison can be done with a variety of techniques, and the possibilities are by no means exhaustive: different approaches focus on detecting different kinds of anomalies (for example, in value, frequency, or distribution). We will focus on a version of the distribution-based approach.
In histogram-based anomaly detection, we split the signal into windows of a selected width, as shown in the following image. For each window, we calculate a histogram, that is, for a selected number of buckets, we count how many values fall into each bucket. The histogram captures the basic distribution of values in the selected time window, as shown at the center of the diagram.
Histograms can then be directly presented as instances, where each bin corresponds to an attribute. Furthermore, we can reduce the number of attributes by applying a dimensionality-reduction technique such as Principal Component Analysis (PCA), which allows us to visualize the reduced-dimension histograms in a plot, as shown at the bottom right of the diagram, where each dot corresponds to a histogram.
In our example, the idea is to observe website traffic for a couple of days and then create histograms over, for example, four-hour time windows to build a library of positive behavior. If a new time window's histogram cannot be matched against the positive library, we can mark it as an anomaly:
To compare a new histogram to the set of existing histograms, we will use a density-based k-nearest neighbors algorithm, Local Outlier Factor (LOF) (Breunig et al., 2000). The algorithm is able to handle clusters with different densities, as shown in the following image. For example, the upper-right cluster is large and widespread compared to the bottom-left cluster, which is smaller and denser:
Let's get started!
In the first step, we'll load the data from the text files into Java objects. The files are stored in a folder; each file contains a single time series, with one value per line. We'll load them into a list of Double values:
String filePath = "chap07/ydata/A1Benchmark/real";
List<List<Double>> rawData = new ArrayList<List<Double>>();
We will need the min and max values for histogram normalization, so let's collect them in this pass over the data:
double max = Double.MIN_VALUE;
double min = Double.MAX_VALUE;
for (int i = 1; i <= 67; i++) {
    List<Double> sample = new ArrayList<Double>();
    BufferedReader reader = new BufferedReader(
        new FileReader(filePath + i + ".csv"));

    boolean isAnomaly = false;
    reader.readLine(); // skip the header line
    while (reader.ready()) {
        String[] line = reader.readLine().split(",");
        double value = Double.parseDouble(line[1]);
        sample.add(value);
        max = Math.max(max, value);
        min = Math.min(min, value);
        // compare string content with equals(), not ==
        if (line[2].equals("1")) {
            isAnomaly = true;
        }
    }
    System.out.println(isAnomaly);
    reader.close();
    rawData.add(sample);
}
We will create a histogram for a selected time window of WIN_SIZE width. The histogram will hold HIST_BINS value buckets. The histograms, each an array of doubles, will be stored in an array list:
int WIN_SIZE = 500;
int HIST_BINS = 20;

int current = 0;
List<double[]> dataHist = new ArrayList<double[]>();
for (List<Double> sample : rawData) {
    double[] histogram = new double[HIST_BINS];
    for (double value : sample) {
        int bin = toBin(normalize(value, min, max), HIST_BINS);
        histogram[bin]++;
        current++;
        if (current == WIN_SIZE) {
            current = 0;
            dataHist.add(histogram);
            histogram = new double[HIST_BINS];
        }
    }
    dataHist.add(histogram);
}
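The normalize and toBin helper methods referenced above are not shown in the snippet; one possible implementation, assuming simple min-max normalization into [0, 1], looks like this:

```java
public class HistogramUtils {
    // Min-max normalization: map value from [min, max] into [0, 1].
    // Assumes max > min.
    static double normalize(double value, double min, double max) {
        return (value - min) / (max - min);
    }

    // Map a normalized value in [0, 1] to a bucket index in [0, bins - 1].
    static int toBin(double normalized, int bins) {
        int bin = (int) (normalized * bins);
        return Math.min(bin, bins - 1); // a value of exactly 1.0 falls into the last bucket
    }
}
```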
The histograms are now complete. The last step is to transform them into Weka's Instance objects. Each histogram value will correspond to one Weka attribute, as follows:
ArrayList<Attribute> attributes = new ArrayList<Attribute>();
for (int i = 0; i < HIST_BINS; i++) {
    attributes.add(new Attribute("Hist-" + i));
}
Instances dataset = new Instances("My dataset", attributes,
    dataHist.size());
for (double[] histogram : dataHist) {
    // Instance is an interface in recent Weka versions;
    // DenseInstance is the concrete implementation
    dataset.add(new DenseInstance(1.0, histogram));
}
The dataset is now loaded and ready to be plugged into an anomaly-detection algorithm.
To demonstrate how LOF calculates scores, we'll first split the dataset into training and testing sets using the testCV(int, int) function. The first parameter specifies the number of folds, while the second parameter specifies which fold to return:
// split data into training and testing sets
Instances trainData = dataset.testCV(2, 0);
Instances testData = dataset.testCV(2, 1);
The LOF algorithm is not a part of the default Weka distribution, but it can be downloaded through Weka's package manager:
http://weka.sourceforge.net/packageMetaData/localOutlierFactor/index.html
The LOF algorithm provides two interfaces: an unsupervised filter that calculates LOF values (known-unknowns) and a supervised k-NN classifier (known-knowns). In our case, we want to calculate the outlier factor, therefore, we'll use the unsupervised filter interface:
import weka.filters.unsupervised.attribute.LOF;
The filter is initialized in the same way as a usual filter. We can specify the number of neighbors k, for example, k=3, with the -min and -max parameters. LOF allows us to specify two different k values, which are used internally as the lower and upper bounds on the number of neighbors used to compute the LOF values:
LOF lof = new LOF();
lof.setInputFormat(trainData);
lof.setOptions(new String[]{"-min", "3", "-max", "3"});
Next, we load the training instances into the filter, where they will serve as a library of positive examples. After we complete the loading, we call the batchFinished() method to initialize internal calculations:
for (Instance inst : trainData) {
    lof.input(inst);
}
lof.batchFinished();
Finally, we can apply the filter to the test data. The filter will process the instances and append an additional attribute at the end containing the LOF score. We can simply output the scores on the console:
Instances testDataLofScore = Filter.useFilter(testData, lof);

for (Instance inst : testDataLofScore) {
    System.out.println(inst.value(inst.numAttributes() - 1));
}
The LOF score of the first couple of test instances is as follows:
1.306740014927325
1.318239332210458
1.0294812291949587
1.1715039094530768
To understand LOF values, we need some background on the LOF algorithm. It compares the density around an instance to the density around its nearest neighbors. The two scores are divided, producing the LOF score. A LOF score around 1 indicates that the two densities are approximately equal, while higher LOF values indicate that the density around the instance is substantially lower than the density around its neighbors. In such cases, the instance can be marked as anomalous.
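To make the score intuition concrete, here is a small standalone sketch of the LOF computation (Breunig et al., 2000) on five one-dimensional points; this is an illustrative re-implementation, not Weka's code, and it ignores distance ties for simplicity:

```java
import java.util.Arrays;

public class SimpleLof {
    // Distance between 1-D points; for histograms, swap in Euclidean distance.
    static double dist(double a, double b) { return Math.abs(a - b); }

    // Indices of the k nearest neighbors of point i (excluding i itself).
    static int[] neighbors(double[] pts, int i, int k) {
        Integer[] idx = new Integer[pts.length];
        for (int j = 0; j < pts.length; j++) idx[j] = j;
        Arrays.sort(idx, (a, b) ->
            Double.compare(dist(pts[i], pts[a]), dist(pts[i], pts[b])));
        int[] nn = new int[k];
        int count = 0;
        for (int j = 0; j < pts.length && count < k; j++) {
            if (idx[j] != i) nn[count++] = idx[j];
        }
        return nn;
    }

    // k-distance: distance to the k-th nearest neighbor.
    static double kDistance(double[] pts, int i, int k) {
        int[] nn = neighbors(pts, i, k);
        return dist(pts[i], pts[nn[k - 1]]);
    }

    // Local reachability density: inverse of the average reachability
    // distance from point i to its neighbors.
    static double lrd(double[] pts, int i, int k) {
        int[] nn = neighbors(pts, i, k);
        double sum = 0;
        for (int o : nn) {
            sum += Math.max(kDistance(pts, o, k), dist(pts[i], pts[o]));
        }
        return nn.length / sum;
    }

    // LOF score: average neighbor density divided by the point's own density.
    static double lof(double[] pts, int i, int k) {
        int[] nn = neighbors(pts, i, k);
        double sum = 0;
        for (int o : nn) sum += lrd(pts, o, k) / lrd(pts, i, k);
        return sum / nn.length;
    }

    public static void main(String[] args) {
        double[] pts = {1.0, 1.1, 1.2, 1.3, 10.0};
        System.out.println(lof(pts, 1, 2)); // inlier: close to 1
        System.out.println(lof(pts, 4, 2)); // outlier: far above 1
    }
}
```

For the point at 10.0, the local density is much lower than that of its neighbors in the tight cluster, so its LOF score is far above 1, while the clustered points score close to 1.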