Getting the data

The first step in our data analysis pipeline is to get the dataset. We have already cleaned the data and given its attributes meaningful names; you can see the result by opening the german_credit_dataset.csv file. The original dataset is available from the Department of Statistics, University of Munich, at the following URL: http://www.statistik.lmu.de/service/datenarchiv/kredit/kredit_e.html.

Download the data, start R in the directory that contains the data file, and run the following commands to get a feel for the data we will be dealing with in the following sections:

> # load the data into a data frame
> credit.df <- read.csv("german_credit_dataset.csv", header = TRUE, sep = ",")
> # class should be data.frame
> class(credit.df)
[1] "data.frame"
> 
> # get a quick peek at the data
> head(credit.df)

The following figure shows the first six rows of the data. Each column indicates an attribute of the dataset. We will be focusing on each attribute in more detail later.

To get detailed information about the dataset and its attributes, you can use the following code snippet:

> # get dataset detailed info
> str(credit.df)

The preceding code gives you a quick look at the size and structure of the dataset: the number of records, the number of attributes, and, for each attribute, its name, type, and a few sample values, as you can see in the following screenshot. This gives us a good idea of the attributes and their data types, so that we know which transformations to apply to them and which statistical methods to use during descriptive analytics.

From the preceding output, you can see that our dataset has a total of 1000 records, each describing one bank customer. Each record has 21 attributes. The data type and sample values for each attribute are also shown in the preceding screenshot.
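
If you prefer to check the record and attribute counts directly instead of reading them off the str() output, the following is a minimal sketch. It uses a small stand-in data frame, toy.df, whose column names and values are made up for illustration; with the real data you would use credit.df instead, and the text above suggests you should then see 1000 records and 21 attributes.

```r
# Stand-in data frame for illustration only; names and values are hypothetical
toy.df <- data.frame(
  credit.rating = c(1L, 0L, 1L),
  credit.amount = c(1049L, 2799L, 841L)
)

# number of records (rows) and number of attributes (columns)
n.records <- nrow(toy.df)
n.attributes <- ncol(toy.df)

# data type of each attribute, as a named character vector
attribute.types <- sapply(toy.df, class)
```

Calling dim() on the data frame returns both counts at once, rows first and columns second.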

Note

Note that R has assigned the int data type to the variables by default, based on their values. We will change some of these types during the data preprocessing phase, based on the actual semantics of each variable.
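
To illustrate the kind of type change the note refers to, here is a minimal sketch that converts integer-coded categories into factors. The helper to.factors, the data frame toy.df, and its column names are all hypothetical and used only for illustration; this is not necessarily how the preprocessing is done later on.

```r
# Hypothetical helper: convert the named columns of a data frame to factors
to.factors <- function(df, variables) {
  for (variable in variables) {
    df[[variable]] <- as.factor(df[[variable]])
  }
  df
}

# Stand-in data frame for illustration only; names and values are hypothetical
toy.df <- data.frame(
  credit.rating = c(1L, 0L, 1L),          # category code stored as an integer
  credit.amount = c(1049L, 2799L, 841L)   # genuinely numeric quantity
)

# convert only the categorical column; the numeric column stays an integer
toy.df <- to.factors(toy.df, c("credit.rating"))
```

Running str() on the result should now report credit.rating as a factor, while credit.amount remains an int.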
