Chapter 3. Data Wrangling

I assume that by now you are at ease with importing datasets from various sources and exploring the look and feel of the data. Handling missing values, creating dummy variables, and plotting are tasks that an analyst (predictive modeller) performs on almost every dataset to make it model-worthy. So, an aspiring analyst would do well to master these tasks too.

Next in the line of items to master in order to juggle data like a pro is data wrangling. Put simply, it is just a fancy word for the slicing and dicing of data. If you compare the entire predictive modelling process to a complex operation/surgery to be performed on a patient, then the preliminary analysis with a stethoscope and diagnostic checks on the patient is the data cleaning and exploration process, zeroing in on the ailing area and deciding which body part to operate on is data wrangling, and performing the surgery/operation is the modelling process.

The analogy maps out as follows:

  • Surgery/operation → Predictive modelling
  • Diagnostic checks/asking questions to fill missing pieces of information/discarding trivial information → Data exploration/data cleaning
  • Zeroing in on the specific body part/sourcing required pieces such as blood and a catheter → Data wrangling
  • Operating on the area → Modelling the data

A surgeon can vouch for the fact that zeroing in on the specific body part is the most critical piece of the puzzle to solve before one can get to the root of the ailment. The same is the case with data wrangling. The data is not always in one place or in one table; the information you need for your model may be scattered across different datasets. What does one do in such cases? One doesn't always need the entire data. Many a time, one needs only a column, a few rows, or a combination of a few rows and columns. How does one do all this jugglery? That is the crux of this chapter. Apart from this, the chapter tries to provide the reader with all the props needed in their tryst with predictive modelling.

At the end of the chapter, the reader should be comfortable with the following tasks:

  • Subsetting a dataset: Slicing and dicing data, selecting a few rows and columns based on certain conditions, which is similar to filtering in Excel
  • Generating random numbers: An important tool for performing simulations and creating dummy data frames
  • Aggregating data: A technique that helps to group the data by the categories of a categorical variable
  • Sampling data: This is very important before venturing into the actual modelling; dividing a dataset into training and testing data is essential
  • Merging/appending/concatenating datasets: This solves the problem that arises when the data required for modelling is scattered across different datasets

We will be using a variety of public datasets in this chapter. Another good way of demonstrating these concepts is to use dummy datasets created using random numbers; in fact, random numbers are used heavily for this purpose. Accordingly, we will use a mix of public datasets and dummy datasets created using random numbers.
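As a quick taste of the latter approach, the following is a minimal sketch of how a dummy data frame can be built from random numbers (the column names A, B, and C are purely illustrative):

import numpy as np
import pandas as pd

# A 5x3 data frame of standard normal random numbers;
# the column names are purely illustrative
dummy_data = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
dummy_data.head()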

Let us now kick-start the chapter by learning about subsetting a dataset. As it unfolds, one will realize how ubiquitous and indispensable this is.

Subsetting a dataset

As discussed in the introductory section, the task of subsetting a dataset can entail a lot of things. Let us look at them one by one. In order to demonstrate it, let us first import the Customer Churn Model dataset, which we used in the last chapter:

import pandas as pd
data=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Customer Churn Model.txt')
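Although not part of the original snippet, a quick sanity check right after importing helps confirm what we are working with:

# Number of rows and columns, the column names, and the first few rows
print(data.shape)
print(data.columns.values)
data.head()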

Selecting columns

Very frequently, an analyst might come across situations wherein only a handful of columns among a vast number of columns are useful and required in the model. It then becomes important to be able to select particular columns. Let us see how to do that.

If one wishes to select the Account Length variable of the data frame we just imported, one can simply write:

account_length=data['Account Length']
account_length.head()

The square bracket ([ ]) syntax is used to subset a column of a data frame. One just needs to type the appropriate column name inside the square brackets. Selecting one column returns a Series object (a one-dimensional labelled array, the building block of a data frame) consisting of the values of the selected column. The output of the preceding snippet is as follows:


Fig. 3.1: First few entries of the Account Length column

The fact that this process returns a series can be confirmed by typing type(account_length); this returns pandas.core.series.Series as the result.

Selecting multiple columns can be accomplished in a similar fashion. One just needs to add an extra pair of square brackets to indicate that a list of column names is being selected, not just one column.

If one wants to select Account Length, VMail Message, and Day Calls, one can write the code, as follows:

subdata = data[['Account Length','VMail Message','Day Calls']]
subdata.head()

The output of the preceding snippet should be similar to the following screenshot:


Fig. 3.2: First few entries of the Account Length, VMail Message, and Day Calls columns

Unlike the case of selecting a single column, selecting multiple columns returns a data frame as the result, which can be confirmed as follows:

type(subdata)

This returns pandas.core.frame.DataFrame.

One can also create a list of the required columns and pass the list name inside the square brackets to subset a data frame. The following code snippet gives the same result as the preceding one:

wanted_columns=['Account Length','VMail Message','Day Calls']
subdata=data[wanted_columns]
subdata.head()

In some cases, one might want to delete or remove certain columns from the dataset before proceeding to modelling. The approach is the same: subset the data frame down to the columns one wants to keep.

This approach of subsetting columns from data frames works fine when the list of columns is relatively small (3-5 columns). Beyond that, the time consumed in typing column names warrants a more efficient method. The trick is to manually create only the small complement list (the handful of columns to be excluded) and generate the much bigger list of remaining columns with a loop. The complement list of a big table will always be small; hence, this makes the method a tad more efficient.

Let us have a look at the following code snippet to observe how to implement this:

wanted=['Account Length','VMail Message','Day Calls']
column_list=data.columns.values.tolist()
sublist=[x for x in column_list if x not in wanted]
subdata=data[sublist]
subdata.head()

The sublist, as expected, contains all the column names except the ones listed in the wanted list, as shown in the following screenshot:


Fig. 3.3: Column names of the subdata data frame

In the third line of the preceding code snippet, a list comprehension has been used. It is a convenient method to run for loops over lists and get lists as output. Many of you who have experience with Python will know of this. For others, it is not rocket science; it is just a more concise way to write a for loop.
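For completeness, pandas also provides a built-in way to achieve the same outcome without a list comprehension. A minimal sketch using the drop method follows (this alternative is not part of the original snippet):

# Drop the three columns directly; axis=1 refers to columns.
# drop returns a new data frame and leaves data untouched.
subdata = data.drop(['Account Length', 'VMail Message', 'Day Calls'], axis=1)
subdata.head()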

Selecting rows

Selecting rows is similar to selecting columns, in the sense that the same square brackets are used, but instead of column names, row numbers or indices are used. Let us see some examples of how to select a particular number of rows from a data frame:

  • If one wants to select the first 50 rows of the data frame, one can just write:
    data[:50]
  • It is important to note that one needs to pass a range of numbers to subset a data frame over rows. Row slicing is zero-based and excludes the upper limit, so to select 50 rows starting from the 26th row (index 25), we will write:
    data[25:75]
  • If the lower limit is not mentioned, it defaults to the first row of the data. Thus, data[:50] is the same as data[0:50].

In the same way, if the upper limit is not mentioned, it is assumed to be the last row of the dataset. To select all the rows except the first 50 rows, we will write data[50:].

A variety of permutations and combinations can be performed on these rules to fetch the rows that one needs.
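Where explicitness is preferred, the same position-based slices can be written with the iloc indexer. A minimal sketch, equivalent to the slices above:

# Explicit position-based row selection with the iloc indexer
first_50 = data.iloc[:50]            # first 50 rows
middle_50 = data.iloc[25:75]         # 50 rows starting at position 25
all_but_first_50 = data.iloc[50:]    # everything after the first 50 rows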

Another important way to subset a data frame by rows is conditional or Boolean subsetting. In this method, one keeps only the rows that satisfy certain conditions. The condition, an equality or an inequality comparison, is written inside the square brackets. Let us see a few examples of how one can go about implementing this:

  • Suppose one wants to filter the rows that have clocked Total Mins greater than 500 (the Total Mins column is not in the raw dataset; its creation is shown in the Creating new columns section later in this chapter). This can be done as follows:
    data1=data[data['Total Mins']>500]
    data1.shape
  • The newly created data frame, after filtering, has 2720 rows compared to 3333 in the unfiltered data frame. Clearly, the remaining rows have been filtered out by the condition.
  • Let us have a look at another example, where we provide equality as a condition. Let us filter the rows for which the state is VA:
    data1=data[data['State']=='VA']
    data1.shape
  • This data frame contains only 77 rows, while the rest get filtered out.
  • One can combine multiple conditions, as well, using AND (&) and OR (|) operators. To filter all the rows in the state VA that have Total Mins greater than 500, we can write:
    data1=data[(data['Total Mins']>500) & (data['State']=='VA')]
    data1.shape
  • This data frame contains only 64 rows, fewer than either of the previous data frames. It combines two conditions, both of which must be satisfied for a row to pass the filter. The AND operator has a subtractive effect.
  • To filter all the rows that are either in state VA or have Total Mins greater than 500, we can write the following code:
    data1=data[(data['Total Mins']>500) | (data['State']=='VA')]
    data1.shape
  • This data frame has 2733 rows, which is greater than the 2720 rows obtained with just the one filter of Total Mins being greater than 500. The OR operator has an additive effect. A couple of further filtering idioms are sketched right after this list.
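Two more standard pandas filtering idioms, shown here as a minimal sketch (they are not covered in the original snippets, and the state values used are illustrative):

# Membership test: rows whose State is any of the listed values
data_va_ny = data[data['State'].isin(['VA', 'NY'])]

# Negation with ~ : rows whose State is NOT VA
data_not_va = data[~(data['State'] == 'VA')]

data_va_ny.shape, data_not_va.shape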

Selecting a combination of rows and columns

This is the most used form of subsetting a dataset. Earlier in this chapter, we selected three columns of this dataset and called the subsetted data frame subdata. What if we wish to look at specific rows of that subsetted data frame? How can we do that? We just need another pair of square brackets adjacent to the existing one.

Let's say we need to look at the first 50 rows of that subsetted data frame. We can write a snippet, as shown:

subdata_first_50=data[['Account Length','VMail Message','Day Calls']][:50]
subdata_first_50

We can use the already created subdata data frame and subset it for the first 50 rows by typing:

subdata[0:50] or subdata[:50]

Alternatively, one can subset the columns using the list name as explained earlier and then subset for rows.

Another effective (but a little unstable, as its behavior changes based on the version of pandas installed) method to select both rows and columns together is the .ix method. Note that .ix has been deprecated in recent versions of pandas in favour of the .loc and .iloc indexers; equivalents are sketched at the end of this section. Let's see how to use this method.

Basically, in the .ix method, we can provide row and column indices (in layman's terms, row and column numbers) inside the square brackets. The syntax can be summarized as follows:

  • The data frame name is suffixed with .ix
  • Inside the square brackets, specify the row number (or range) and the column number (or range), in that order

Now, let's have a look at a few examples:

  • Selecting the first 100 rows of the first 5 columns:
    data.ix[1:100,1:6]

The output looks similar to the following screenshot:


Fig. 3.4: First 100 rows of the first 5 columns

  • Selecting all rows from the first five columns:
    data.ix[:,1:6] 
  • Selecting first 100 rows from all the columns:
    data.ix[1:100,:]

The row and column numbers/names can be passed as lists, as well. Let's have a look at how it can be done:

  • Selecting the first 100 rows from the 2nd, 5th, and 7th columns:
    data.ix[1:100,[2,5,7]]

The output looks similar to the following screenshot:


Fig. 3.5: First 100 rows of the 2nd, 5th and 7th columns

  • Selecting the 1st, 2nd and 5th rows from the 2nd, 5th and 7th columns:
    data.ix[[1,2,5],[2,5,7]]

The output looks similar to the following screenshot:


Fig. 3.6: 1st, 2nd and 5th rows of the 2nd, 5th and 7th columns

Instead of row and column indices or numbers, we can also write corresponding column names, as shown in the following example:

data.ix[[1,2,5],['Area Code','VMail Plan','Day Mins']]
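Since .ix is deprecated in recent versions of pandas, the preceding selections are better written with the .iloc (purely position-based) and .loc (label-based) indexers. A minimal sketch of the equivalents follows; because .ix mixed label and position semantics, the exact rows returned can differ slightly across versions:

# Purely position-based: first 100 rows of the first 5 columns
data.iloc[0:100, 0:5]

# First 100 rows of the columns at positions 2, 5, and 7
data.iloc[0:100, [2, 5, 7]]

# Label-based: rows by index label, columns by name
data.loc[[1, 2, 5], ['Area Code', 'VMail Plan', 'Day Mins']]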

Creating new columns

Many times during an analysis, we are required to create a new column that is based on some calculation or modification of existing columns, or that contains a constant value, to be used in the modelling. Hence, the knowledge of creating new columns becomes an indispensable tool to learn. Let's see how to do that.

Suppose, in the Customer Churn Model dataset, we want to calculate the total minutes spent during the day, evening, and night. This requires summing up the three columns Day Mins, Eve Mins, and Night Mins. It can be done as shown in the following snippet:

data['Total Mins']=data['Day Mins']+data['Eve Mins']+data['Night Mins']
data['Total Mins'].head() 

The output of the snippet is, as follows:


Fig. 3.7: First few entries of the new Total Mins column
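As mentioned at the start of this section, a new column can also hold a constant value or a simple transformation of an existing column. A minimal sketch (the Source and Total Hours columns introduced here are illustrative, not part of the dataset):

# A column holding the same constant value in every row
data['Source'] = 'churn file'

# A column derived from an existing one by simple arithmetic
data['Total Hours'] = data['Total Mins'] / 60

data[['Total Mins', 'Total Hours', 'Source']].head()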
