Getting the data

The first step in our data analysis pipeline is to get the dataset. We have already cleaned the data and given its attributes meaningful names; you can see the result by opening the german_credit_dataset.csv file. You can also get the original dataset from its source, the Department of Statistics, University of Munich, at the following URL: http://www.statistik.lmu.de/service/datenarchiv/kredit/kredit_e.html.

Download the data, start R in the directory that contains the data file, and run the following commands to get a feel for the data we will be dealing with in the following sections:

> # load the data into a data frame
> credit.df <- read.csv("german_credit_dataset.csv", header = TRUE, sep = ",") 
> # class should be data.frame
> class(credit.df)
[1] "data.frame"
> 
> # get a quick peek at the data
> head(credit.df)

The following figure shows the first six rows of the data. Each column indicates an attribute of the dataset. We will be focusing on each attribute in more detail later.

[Figure: the first six rows of credit.df, as returned by head(credit.df)]
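
If R is not started in the directory that holds the data file, read.csv stops with a fairly terse error. The following is a minimal sketch of a more defensive load, not part of the original pipeline. So that it runs without the real dataset, it writes a tiny stand-in CSV to a temporary file; the path, the column names, and the values are all made up for illustration.

```r
# Hypothetical stand-in for german_credit_dataset.csv: a tiny CSV written to a
# temporary file so that this sketch runs without the real data.
csv.path <- tempfile(fileext = ".csv")
writeLines(c("credit.rating,age,credit.amount",
             "1,21,1049",
             "1,36,2799",
             "0,23,841"), csv.path)

# Fail early with a clear message if the file is not where we expect it
if (!file.exists(csv.path)) {
  stop("Data file not found: ", csv.path,
       " (working directory is ", getwd(), ")")
}

# Same read.csv call as in the text, pointed at the stand-in file
demo.df <- read.csv(csv.path, header = TRUE, sep = ",")

dims <- dim(demo.df)          # number of rows, number of columns
print(dims)
print(head(demo.df, n = 2))   # peek at the first two rows only
```

With the real file, you would replace csv.path with "german_credit_dataset.csv" and drop the writeLines step.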

To get detailed information about the dataset and its attributes, you can use the following code snippet:

> # get dataset detailed info
> str(credit.df)

The preceding code gives you a quick look at the size and structure of the data: the number of records, the number of attributes, and details about each attribute, such as its name, its type, and a few sample values, as you can see in the following screenshot. This gives us a good idea of the different attributes and their data types, so that we know what transformations to apply to them and what statistical methods to use during descriptive analytics.

[Screenshot: output of str(credit.df)]

From the preceding output, you can see that our dataset has a total of 1000 records, each describing one bank customer. Each record has 21 attributes. The data type and sample values for each attribute are also shown in the previous image.
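
The counts and types that str reports can also be computed directly, which is handy when you want to check them in a script rather than read them off the console. The following is a small sketch on a toy data frame; demo.df and its columns are hypothetical stand-ins, not the actual credit.df attributes.

```r
# Toy data frame with hypothetical columns standing in for credit.df
demo.df <- data.frame(
  credit.rating = c(1L, 1L, 0L, 1L),
  age           = c(21L, 36L, 23L, 39L),
  credit.amount = c(1049, 2799, 841, 2122)
)

n.records      <- nrow(demo.df)              # number of records (rows)
n.attributes   <- ncol(demo.df)              # number of attributes (columns)
attr.types     <- sapply(demo.df, class)     # class of each attribute
missing.counts <- colSums(is.na(demo.df))    # missing values per attribute

print(n.records)
print(n.attributes)
print(attr.types)
print(missing.counts)
```

Running the same four expressions on credit.df should agree with the record count, attribute count, and types that str displays.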

Note

Note that R has assigned the int data type to the variables by default, based on their values. We will change some of these types in the data preprocessing phase to reflect what the variables actually mean.
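
As an illustration of the kind of type change this refers to, the following sketch converts an integer-coded categorical variable to a factor. It is only one possible approach and not necessarily the one used in the preprocessing phase; the data frame, the column names, and the choice of which variable is categorical are assumptions made for the example.

```r
# Toy data frame; both columns are read as integers, as in the str() output
demo.df <- data.frame(
  credit.rating = c(1L, 1L, 0L, 1L),    # hypothetical: a coded category
  age           = c(21L, 36L, 23L, 39L) # genuinely numeric
)

# Names of the variables we assume are categorical despite their int type
categorical.vars <- c("credit.rating")

# Convert only those columns to factors; the other columns are left untouched
demo.df[categorical.vars] <- lapply(demo.df[categorical.vars], as.factor)

str(demo.df)
```

Keeping the list of categorical variable names in one vector makes it easy to extend the conversion to more attributes later.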
