First, we'll take a look at suspicious behavior detection, where the goal is to learn known patterns of fraud; in other words, to model the known-knowns.
We'll work with a dataset describing insurance transactions publicly available at Oracle Database Online Documentation (2015), as follows:
http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/anomalies.htm
The dataset describes insurance vehicle incident claims for an undisclosed insurance company. It contains 15,430 claims; each claim comprises 33 attributes describing the following components:
The following screenshot shows a sample of the dataset loaded into Weka:
Now the task is to create a model that will be able to identify suspicious claims in the future. The challenging thing about this task is the fact that only 6% of claims are suspicious. If we create a dummy classifier that marks no claim as suspicious, it will be accurate in 94% of cases. Therefore, in this task, we will use different performance measures: precision and recall.
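To make this concrete, here is a minimal plain-Java sketch (not part of the Weka workflow) that derives the dummy classifier's accuracy from the figures above:

```java
public class DummyAccuracy {
    public static void main(String[] args) {
        int total = 15430;                              // claims in the dataset
        int frauds = (int) Math.round(total * 0.06);    // ~6% are suspicious
        int nonFrauds = total - frauds;

        // A dummy classifier labels every claim as "no fraud": it is correct
        // on all legitimate claims and wrong on every fraud.
        double accuracy = (double) nonFrauds / total;   // ~0.94
        double recall = 0.0;                            // TP = 0, no fraud is ever caught

        System.out.printf("accuracy = %.2f, recall = %.2f%n", accuracy, recall);
    }
}
```

High accuracy here tells us almost nothing about how well frauds are detected, which is exactly why we switch to precision and recall.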
Recall the outcome table from Chapter 1, Applied Machine Learning Quick Start, where there are four possible outcomes denoted as true positive, false positive, false negative, and true negative:
| | Classified as fraud | Classified as no fraud |
| --- | --- | --- |
| **Actual: fraud** | TP (true positive) | FN (false negative) |
| **Actual: no fraud** | FP (false positive) | TN (true negative) |
Precision and recall are defined as follows:

Pr = TP / (TP + FP)

Re = TP / (TP + FN)

Note that the dummy classifier achieves Pr = 0 and Re = 0, as it never marks any instance as fraud (TP = 0). In practice, we want to compare classifiers by both numbers; hence we use the F-measure. This is a de facto standard measure that calculates a harmonic mean between precision and recall, as follows:

F = 2 * Pr * Re / (Pr + Re)

Now let's move on to designing a real classifier.
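Both measures, and the F-measure, reduce to simple arithmetic over the outcome counts. The following standalone sketch computes them from hypothetical confusion counts; the numbers are illustrative only and do not come from the claims dataset:

```java
public class Measures {
    static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }
    static double fMeasure(double pr, double re) {
        // harmonic mean; guard against division by zero when both are 0
        return (pr + re == 0) ? 0 : 2 * pr * re / (pr + re);
    }

    public static void main(String[] args) {
        int tp = 30, fp = 10, fn = 70;      // hypothetical confusion counts
        double pr = precision(tp, fp);      // 0.75
        double re = recall(tp, fn);         // 0.30
        // The harmonic mean stays low when either measure is low:
        System.out.printf("Pr=%.2f Re=%.2f F=%.2f%n", pr, re, fMeasure(pr, re));
    }
}
```

Note how the F-measure (about 0.43 here) is pulled towards the weaker of the two measures, unlike a plain average.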
To design a classifier, we can follow the standard supervised learning steps as described in Chapter 1, Applied Machine Learning Quick Start. In this recipe, we will include some additional steps to handle the unbalanced dataset and to evaluate classifiers based on precision and recall. The plan is as follows:
First, let's load the data in .csv format using the CSVLoader class, as follows:
```java
String filePath = "/Users/bostjan/Dropbox/ML Java Book/book/datasets/chap07/claims.csv";
CSVLoader loader = new CSVLoader();
loader.setFieldSeparator(",");
loader.setSource(new File(filePath));
Instances data = loader.getDataSet();
```
Next, we need to make sure all the attributes are nominal. During the data import, Weka applies some heuristics to guess the most probable attribute type, that is, numeric, nominal, string, or date. As heuristics cannot always guess the correct type, we can set types manually, as follows:
```java
NumericToNominal toNominal = new NumericToNominal();
toNominal.setInputFormat(data);
data = Filter.useFilter(data, toNominal);
```
Before we continue, we need to specify the attribute that we will try to predict. We can achieve this by calling the setClassIndex(int)
function:
```java
int CLASS_INDEX = 15;
data.setClassIndex(CLASS_INDEX);
```
Next, we need to remove an attribute describing the policy number as it has no predictive value. We simply apply the Remove
filter, as follows:
```java
Remove remove = new Remove();
remove.setInputFormat(data);
// POLICY_INDEX holds the index of the policy-number attribute
remove.setOptions(new String[]{"-R", "" + POLICY_INDEX});
data = Filter.useFilter(data, remove);
```
Now we are ready to start modeling.
The vanilla approach is to directly apply the lessons demonstrated in Chapter 3, Basic Algorithms – Classification, Regression, Clustering, without any pre-processing and without taking dataset specifics into account. To demonstrate the drawbacks of the vanilla approach, we will simply build a model with default parameters and apply k-fold cross-validation.
First, let's define some classifiers that we want to test:
```java
ArrayList<Classifier> models = new ArrayList<Classifier>();
models.add(new J48());
models.add(new RandomForest());
models.add(new NaiveBayes());
models.add(new AdaBoostM1());
models.add(new Logistic());
```
Next, we create an Evaluation object and perform k-fold cross-validation by calling the crossValidateModel(Classifier, Instances, int, Random, Object...) method, outputting precision, recall, and fMeasure:
```java
int FOLDS = 3;
Evaluation eval = new Evaluation(data);
for (Classifier model : models) {
    eval.crossValidateModel(model, data, FOLDS, new Random(1), new String[]{});
    // FRAUD holds the index of the fraud class value
    System.out.println(model.getClass().getName() + "\n" +
        " Recall: " + eval.recall(FRAUD) + "\n" +
        " Precision: " + eval.precision(FRAUD) + "\n" +
        " F-measure: " + eval.fMeasure(FRAUD));
}
```
The evaluation outputs the following scores:
```
weka.classifiers.trees.J48
 Recall: 0.03358613217768147
 Precision: 0.9117647058823529
 F-measure: 0.06478578892371996
...
weka.classifiers.functions.Logistic
 Recall: 0.037486457204767065
 Precision: 0.2521865889212828
 F-measure: 0.06527070364082249
```
We can see the results are not very promising. Recall, that is, the share of discovered frauds among all frauds, is only a few percent, meaning that only about 3 in 100 frauds are detected. On the other hand, the precision of J48, that is, the accuracy of its alarms, is 91%, meaning that in 9 out of 10 cases when that model marks a claim as fraud, it is correct.
As the number of fraud examples is very small compared to legitimate claims, the learning algorithms struggle with induction. We can help them by providing a dataset where the share of fraudulent and legitimate examples is comparable. This can be achieved with dataset rebalancing.
Weka has a built-in filter, Resample, which produces a random subsample of a dataset using either sampling with replacement or without replacement. The filter can also bias distribution towards a uniform class distribution.
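To build intuition for what a uniform class bias means, here is a conceptual plain-Java sketch of resampling with replacement towards a uniform class distribution. Weka's actual Resample filter is more general (it works on Instances objects and supports partial bias); this helper is purely illustrative:

```java
import java.util.*;

public class RebalanceSketch {
    // Draw an equally sized sample, with replacement, from each class of labels.
    public static List<String> rebalance(List<String> labels, long seed) {
        // group original indices by class label
        Map<String, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.size(); i++)
            byClass.computeIfAbsent(labels.get(i), k -> new ArrayList<>()).add(i);

        Random rnd = new Random(seed);
        List<String> sample = new ArrayList<>();
        // equal share per class (assumes size divides evenly; fine for a sketch)
        int perClass = labels.size() / byClass.size();
        for (List<Integer> indices : byClass.values())
            for (int i = 0; i < perClass; i++)          // sampling with replacement
                sample.add(labels.get(indices.get(rnd.nextInt(indices.size()))));
        return sample;
    }

    public static void main(String[] args) {
        // 94 legitimate and 6 fraudulent labels, mimicking the 6% fraud rate
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < 94; i++) labels.add("no fraud");
        for (int i = 0; i < 6; i++) labels.add("fraud");

        List<String> balanced = rebalance(labels, 1);
        System.out.println(Collections.frequency(balanced, "fraud"));     // 50
        System.out.println(Collections.frequency(balanced, "no fraud"));  // 50
    }
}
```

After rebalancing, both classes contribute half of the sample, so the learner no longer sees fraud as a negligible minority.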
We will proceed by manually implementing k-fold cross-validation. First, we will split the dataset into k equal folds. Each fold will be used for testing in turn, while the remaining folds will be used for learning. To split the dataset into folds, we'll use the StratifiedRemoveFolds filter, which maintains the class distribution within the folds, as follows:
```java
StratifiedRemoveFolds kFold = new StratifiedRemoveFolds();
kFold.setInputFormat(data);

double measures[][] = new double[models.size()][3];

for (int k = 1; k <= FOLDS; k++) {

    // Split data into test and train folds
    kFold.setOptions(new String[]{"-N", "" + FOLDS, "-F", "" + k, "-S", "1"});
    Instances test = Filter.useFilter(data, kFold);

    // the "-V" flag selects the inverse, that is, the remaining folds
    kFold.setOptions(new String[]{"-N", "" + FOLDS, "-F", "" + k, "-S", "1", "-V"});
    Instances train = Filter.useFilter(data, kFold);
```
Next, we can rebalance the train dataset, where the -Z parameter specifies the percentage of the dataset to be resampled, and -B biases the class distribution towards a uniform distribution:
```java
    Resample resample = new Resample();
    resample.setInputFormat(data);
    resample.setOptions(new String[]{"-Z", "100", "-B", "1"}); // with replacement
    Instances balancedTrain = Filter.useFilter(train, resample);
```
Next, we can build classifiers and perform evaluation:
```java
    for (ListIterator<Classifier> it = models.listIterator(); it.hasNext();) {
        Classifier model = it.next();
        model.buildClassifier(balancedTrain);
        eval = new Evaluation(balancedTrain);
        eval.evaluateModel(model, test);

        // sum the results per fold so we can average them later
        measures[it.previousIndex()][0] += eval.recall(FRAUD);
        measures[it.previousIndex()][1] += eval.precision(FRAUD);
        measures[it.previousIndex()][2] += eval.fMeasure(FRAUD);
    }
} // end of the fold loop
```
Finally, we calculate the average and output the best model:
```java
// calculate the average over all folds
for (int i = 0; i < models.size(); i++) {
    measures[i][0] /= 1.0 * FOLDS;
    measures[i][1] /= 1.0 * FOLDS;
    measures[i][2] /= 1.0 * FOLDS;
}

// output results and select the best model
Classifier bestModel = null;
double bestScore = -1;
for (ListIterator<Classifier> it = models.listIterator(); it.hasNext();) {
    Classifier model = it.next();
    double fMeasure = measures[it.previousIndex()][2];
    System.out.println(model.getClass().getName() + "\n" +
        " Recall: " + measures[it.previousIndex()][0] + "\n" +
        " Precision: " + measures[it.previousIndex()][1] + "\n" +
        " F-measure: " + fMeasure);
    if (fMeasure > bestScore) {
        bestScore = fMeasure;
        bestModel = model;
    }
}
System.out.println("Best model: " + bestModel.getClass().getName());
```
Now the performance of the models has significantly improved, as follows:
```
weka.classifiers.trees.J48
 Recall: 0.44204845100610574
 Precision: 0.14570766048577555
 F-measure: 0.21912423640160392
...
weka.classifiers.functions.Logistic
 Recall: 0.7670657247204478
 Precision: 0.13507459756495374
 F-measure: 0.22969038530557626
Best model: weka.classifiers.functions.Logistic
```
What we can see is that all the models have scored significantly better; for instance, the best model, Logistic Regression, correctly discovers 76% of frauds. This comes at the price of more false alarms: only 13% of claims marked as fraud are indeed fraudulent. If an undetected fraud is significantly more expensive than the investigation of a false alarm, then it makes sense to accept the increased number of false alarms.
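To see when this trade-off pays off, consider a back-of-the-envelope cost comparison. The recall and precision figures are the Logistic results from above; the cost figures (10,000 per missed fraud, 100 per investigated alarm) are entirely hypothetical:

```java
public class CostSketch {
    public static void main(String[] args) {
        double claims = 1000, fraudRate = 0.06;
        double recall = 0.767, precision = 0.135;            // Logistic, rebalanced
        double missedCost = 10_000, investigationCost = 100; // hypothetical costs

        double frauds = claims * fraudRate;    // 60 frauds expected per 1000 claims
        double detected = frauds * recall;
        double missed = frauds - detected;
        double flagged = detected / precision; // total claims raised as alarms

        double modelCost = missed * missedCost + flagged * investigationCost;
        double dummyCost = frauds * missedCost; // dummy classifier misses every fraud

        System.out.printf("model: %.0f, dummy: %.0f%n", modelCost, dummyCost);
    }
}
```

Under these assumed costs, the model is several times cheaper than doing nothing, even though most of its alarms are false.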
The overall performance most likely still has some room for improvement; we could perform attribute selection and feature generation, and apply the more complex models discussed in Chapter 3, Basic Algorithms – Classification, Regression, Clustering.