Description of the dataset and using linear models

For this project, we will use the credit card fraud detection dataset from Kaggle, which can be downloaded from https://www.kaggle.com/dalpozz/creditcardfraud. Since we are using this dataset, it is good practice to cite the following publication:

  • Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, and Gianluca Bontempi, Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.

The dataset contains transactions made with credit cards by European cardholders in September 2013, over a span of only two days. Of the 284,807 transactions in total, only 492 are frauds. The dataset is therefore highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions.
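To make the imbalance concrete, here is a minimal sketch of the arithmetic, using only the counts of 492 frauds out of 284,807 transactions. The variable names are illustrative, and the "always predict non-fraud" baseline is an added illustration, not a model used in this project.

```python
# Counts taken from the dataset description: 492 frauds out of 284,807 transactions.
n_total = 284_807
n_fraud = 492

# Share of the positive class (fraud) among all transactions.
fraud_rate = n_fraud / n_total

# Accuracy of a trivial classifier that labels every transaction as non-fraud.
baseline_accuracy = (n_total - n_fraud) / n_total

print(f"fraud rate: {fraud_rate:.4%}")
print(f"accuracy of always predicting non-fraud: {baseline_accuracy:.4%}")
```

The fraud rate comes out at roughly 0.17%, so a classifier that never flags fraud would still be right on more than 99.8% of transactions. This is why plain accuracy is a poor yardstick on data like this.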

It contains only numerical input variables, most of which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and further background information about the data are not available. The 28 features V1, V2, ..., V28 are the principal components obtained with PCA; the only input features that have not been transformed are Time and Amount. The feature Class is the response variable, and it takes the value 1 in the case of fraud and 0 otherwise. We will see the details later on.
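The column layout described above can be checked with a short sketch like the following. It uses only the Python standard library. The helper names split_columns and class_counts are hypothetical, and both the column order and the file name creditcard.csv in the usage note are assumptions about the downloaded file, not something stated here.

```python
import csv
from collections import Counter


def split_columns(header):
    """Group column names into PCA components, other inputs, and the label."""
    pca = [c for c in header if c.startswith("V") and c[1:].isdigit()]
    label = "Class"
    other = [c for c in header if c not in pca and c != label]
    return pca, other, label


def class_counts(path):
    """Count rows per value of the Class column in a CSV file.

    Values are read as text, so the keys are strings such as "0" and "1".
    """
    with open(path, newline="") as f:
        return Counter(row["Class"] for row in csv.DictReader(f))


# Header built from the description: Time, V1..V28, Amount, Class.
# The exact column order is an assumption.
header = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]
pca, other, label = split_columns(header)
print(len(pca), other, label)
```

Once the file has been downloaded, calling class_counts("creditcard.csv") (assuming that file name) should report how many rows fall into each class, which can be compared against the counts quoted above.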
