The credit card fraud dataset

Generally, a fraud dataset contains plenty of data for the negative class (non-fraud, or genuine, transactions) and very little or no data for the positive class (fraudulent transactions). This is known as a class imbalance problem in the ML world. We train an AE on the non-fraud data only, so that the encoder learns the features of genuine transactions. The decoder then reconstructs the training instances, and the reconstruction error on the training set is used to choose a threshold. This threshold is then applied to unseen data (the test dataset or otherwise): instances whose reconstruction error is greater than the threshold are identified as fraud.
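The thresholding step described above can be sketched as follows. This is a minimal illustration rather than the chapter's implementation: the error values are made up, the helper names (`mean_squared_error`, `percentile_threshold`, `flag_fraud`) are hypothetical, and picking the threshold as a percentile of the training errors is only one possible assumption. In practice the errors would come from comparing each transaction with the AE's reconstruction of it.

```python
import math


def mean_squared_error(original, reconstructed):
    """Per-instance reconstruction error: mean squared difference over features."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def percentile_threshold(errors, q=95):
    """Return the q-th percentile of the errors (nearest-rank method).

    Assumption: a percentile of the training errors is used as the cut-off;
    other rules (for example, mean plus a multiple of the standard deviation)
    are equally plausible.
    """
    ordered = sorted(errors)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def flag_fraud(errors, threshold):
    """Mark an instance as fraud when its error is greater than the threshold."""
    return [e > threshold for e in errors]


# Hypothetical reconstruction errors on the non-fraud training set.
train_errors = [0.10, 0.12, 0.08, 0.15, 0.11, 0.09, 0.13, 0.14, 0.10, 0.40]
threshold = percentile_threshold(train_errors, q=90)

# Hypothetical reconstruction errors on unseen transactions.
test_errors = [0.09, 0.95, 0.12, 0.30]
flags = flag_fraud(test_errors, threshold)
print(threshold, flags)
```

A lower percentile flags more transactions (more fraud caught, more false alarms), while a higher one flags fewer, so the value of `q` is a tuning decision rather than a fixed constant.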

For the project in this chapter, we will be using a public dataset of credit card transactions sourced from this URL: https://essentials.togaware.com/data/. The dataset was originally made available through the research paper Calibrating Probability with Undersampling for Unbalanced Classification, A. Dal Pozzolo, O. Caelen, R. A. Johnson and G. Bontempi, IEEE Symposium Series on Computational Intelligence (SSCI), Cape Town, South Africa, 2015. It is also available at this URL: http://www.ulb.ac.be/di/map/adalpozz/data/creditcard.Rdata. The dataset was collected and analyzed during a research collaboration between Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

The following are the characteristics of the dataset:

  • The paper made the dataset available as an Rdata file. A CSV-converted version of the dataset is also available on Kaggle and other sites.
  • It contains transactions made by credit cards in September 2013 by European cardholders.
  • The dataset records the transactions that occurred over two days.
  • There are a total of 284,807 transactions in the dataset.
  • The dataset suffers from a severe class imbalance problem. Only 492 transactions (0.172% of the total) are fraudulent.
  • There are a total of thirty features in the dataset, namely V1, V2, ..., V28, Time, and Amount.
  • The variables V1, V2, ..., V28 are the principal components obtained with PCA from the original set of variables.
  • Due to confidentiality, the original set of variables that yielded the principal components is not revealed.
  • The Time feature contains the seconds elapsed between each transaction and the first transaction in the dataset.
  • The Amount feature is the transaction amount.
  • The dependent variable is named Class. Fraudulent transactions are coded as 1 and genuine transactions as 0.
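The figures in the list above can be checked with a few lines of arithmetic. The following is a small sketch, assuming only the counts and column names stated in the list. The rows passed to `non_fraud_rows` are made-up placeholders, and the helper itself is hypothetical; it simply illustrates keeping only the genuine transactions (Class equal to 0) for training.

```python
TOTAL_TRANSACTIONS = 284_807   # from the dataset description
FRAUD_TRANSACTIONS = 492       # from the dataset description

genuine_transactions = TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS
fraud_percentage = 100 * FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS

# The thirty features: V1..V28 (principal components), plus Time and Amount.
feature_names = [f"V{i}" for i in range(1, 29)] + ["Time", "Amount"]
label_name = "Class"


def non_fraud_rows(rows):
    """Keep only the genuine transactions (Class == 0) for training the AE."""
    return [row for row in rows if row[label_name] == 0]


# Made-up placeholder rows, showing only the columns needed for the example.
sample_rows = [
    {"Time": 0.0, "Amount": 149.62, "Class": 0},
    {"Time": 1.0, "Amount": 2.69, "Class": 1},
    {"Time": 2.0, "Amount": 378.66, "Class": 0},
]
training_rows = non_fraud_rows(sample_rows)

print(f"{genuine_transactions} genuine, {FRAUD_TRANSACTIONS} fraud")
print(f"fraud share: {fraud_percentage:.4f}%")
print(len(feature_names), "features;", len(training_rows), "training rows")
```

The computed share is roughly 0.1727%, which the description reports as 0.172%.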

We will now jump into using AEs for credit card fraud detection.
