I assume that by now you are at ease with importing datasets from various sources and exploring the look and feel of the data. Handling missing values, creating dummy variables, and making plots are tasks that an analyst (predictive modeller) performs on almost every dataset to make it model-worthy, so an aspiring analyst will do well to master these tasks as well.
Next in line of items to master in order to juggle data like a pro is data wrangling. Put simply, it is just a fancy word for the slicing and dicing of data. If you compare the entire predictive modelling process to a complex operation/surgery performed on a patient, then the preliminary analysis with a stethoscope and diagnostic checks on the patient is the data cleaning and exploration process, zeroing in on the ailing area and deciding which body part to operate on is data wrangling, and performing the surgery/operation is the modelling process.
| Surgery/operation | Predictive modelling |
|---|---|
| Diagnostic checks/asking questions to fill missing pieces of information/discarding trivial information | Data exploration/data cleaning |
| Zeroing in on a specific body part/sourcing required pieces like blood, catheter | Data wrangling |
| Operating on the area | Modelling the data |
A surgeon can vouch for the fact that zeroing in on a specific body part is the most critical piece of the puzzle to solve before one gets to the root of the ailment. The same is the case with data wrangling. The data is not always in one place or in one table; the information you need for your model may be scattered across different datasets. What does one do in such cases? One doesn't always need the entire data, either. Many a time, one needs only a column, a few rows, or a combination of a few rows and columns. How to do all this jugglery is the crux of this chapter. Apart from this, the chapter tries to provide the reader with all the props needed in their journey through predictive modelling.
At the end of the chapter, the reader should be comfortable with the following functions:
We will be using a variety of public datasets in this chapter. Another good way of demonstrating these concepts is to use dummy datasets created using random numbers. In fact, random numbers are used heavily for this purpose. We will be using a mix of both public datasets and dummy datasets, created using random numbers.
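When a public dataset is not at hand, a dummy dataset built from random numbers can stand in. The following is a minimal sketch of how such a frame can be created with NumPy and pandas; the column names, value ranges, and state codes here are made up purely for illustration and do not come from any of the book's datasets:

```python
import numpy as np
import pandas as pd

# Fix the seed so the random "data" is reproducible across runs
np.random.seed(42)

# Hypothetical dummy dataset: 100 rows, two numeric columns and one categorical
dummy = pd.DataFrame({
    'Account Length': np.random.randint(1, 244, size=100),
    'Day Calls': np.random.randint(0, 166, size=100),
    'State': np.random.choice(['VA', 'OH', 'NJ'], size=100),
})

print(dummy.shape)   # (100, 3)
print(dummy.head())
```

Such random frames are handy throughout this chapter for trying out subsetting operations without loading any file from disk.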
Let us now kick-start the chapter by learning about subsetting a dataset. As it unfolds, one will realize how ubiquitous and indispensable this is.
As discussed in the introductory section, the task of subsetting a dataset can entail a lot of things. Let us look at them one by one. In order to demonstrate it, let us first import the Customer Churn Model dataset, which we used in the last chapter:
import pandas as pd
data=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Customer Churn Model.txt')
Very frequently, an analyst comes across situations wherein only a handful of columns among a vast number are useful and required in the model. It then becomes important to select particular columns. Let us see how to do that.
If one wishes to select the Account Length
variable of the data frame we just imported, one can simply write:
account_length=data['Account Length']
account_length.head()
The square bracket ([ ]
) syntax is used to subset a column of a data frame. One just needs to type the appropriate column name in the square brackets. Selecting one column returns a Series
object (an object similar to a data frame) consisting of the values of the selected column. The output of the preceding snippet is as follows:
The fact that this process returns a series can be confirmed by typing type(account_length)
, this will return something similar to the following output, as a result:
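If the dataset file is not at hand, the same check can be reproduced on a tiny stand-in frame. The following sketch uses made-up values for two of the dataset's columns; note that single brackets return a Series while double brackets return a DataFrame:

```python
import pandas as pd

# Tiny stand-in for the Customer Churn Model data (values are made up)
data = pd.DataFrame({'Account Length': [128, 107, 137],
                     'Day Calls': [110, 123, 114]})

account_length = data['Account Length']
print(type(account_length))             # <class 'pandas.core.series.Series'>
print(type(data[['Account Length']]))   # double brackets -> DataFrame
```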
Selecting multiple columns can be accomplished in a similar fashion. One just needs to add an extra square bracket to indicate that it is a list of column names that they are selecting and not just one column.
If one wants to select Account Length
, VMail Message
, and Day Calls
, one can write the code, as follows:
subdata = data[['Account Length','VMail Message','Day Calls']]
subdata.head()
The output of the preceding snippet should be similar to the following screenshot:
Unlike in the case of selecting a single column, selecting multiple columns returns a data frame, as can be confirmed by running the following:
type(subdata)
One can also create a list of required columns and pass the list name as the parameter inside the square bracket to subset a data frame. The following code snippet will give the same result, as shown in Fig. 3.3, in the next section:
wanted_columns=['Account Length','VMail Message','Day Calls']
subdata=data[wanted_columns]
subdata.head()
In some cases, one might want to delete or remove certain columns from the dataset before proceeding to modelling. The same approach as in the preceding section can be taken in such cases (pandas also provides a dedicated drop method for this purpose).
This approach of subsetting columns by typing out their names works fine when the list of required columns is relatively small (3-5 columns). Beyond that, the time consumed in typing column names warrants a more efficient method. The trick is to manually write out only the small complement list (the few columns you do not want) and then build the large list of wanted columns with a loop. Since the complement list of a big table will always be small, this keeps the method efficient.
Let us have a look at the following code snippet to observe how to implement this:
wanted=['Account Length','VMail Message','Day Calls']
column_list=data.columns.values.tolist()
sublist=[x for x in column_list if x not in wanted]
subdata=data[sublist]
subdata.head()
The sublist
as expected contains all the column names except the ones listed in the wanted
list, as shown in the following screenshot:
In the third line of the preceding code snippet, a list comprehension has been used. It is a convenient method to run for loops over lists and get lists as output. Many of you, who have experience with Python, will know of this. For others, it is not rocket science; just a better way to run for
loops.
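On a small stand-in frame, the same pattern looks like this; the column names and values below are illustrative only, not taken from the actual dataset:

```python
import pandas as pd

# Stand-in frame: three "wanted" columns plus one extra
data = pd.DataFrame({'Account Length': [128, 107],
                     'VMail Message': [25, 26],
                     'Day Calls': [110, 123],
                     'State': ['KS', 'OH']})

wanted = ['Account Length', 'VMail Message', 'Day Calls']
column_list = data.columns.values.tolist()

# List comprehension: keep every column NOT in the wanted list
sublist = [x for x in column_list if x not in wanted]
subdata = data[sublist]
print(subdata.columns.tolist())  # ['State']
```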
Selecting rows is similar to selecting columns, in the sense that the same square bracket is used, but instead of column names the row number or indices are used. Let us see some examples to know how to select a particular number of rows from a data frame:
To select the first 50 rows of the data frame, one can just write:
data[:50]
To select 50 rows starting from the 26th row (rows 25 to 74), one can write:
data[25:75]
Note that these slices are positional, with an inclusive lower bound and an exclusive upper bound: data[1:50] selects rows 1 through 49 and skips the very first row (row 0), so it is not quite the same as data[:50]. If the lower limit is not mentioned, the slice starts from the first row; if the upper limit is not mentioned, it runs to the last row of the dataset. To select all the rows except the first 50 rows, we will write data[50:].
A variety of permutations and combinations can be performed on these rules to fetch the row that one needs.
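The slice arithmetic is easiest to verify on a small dummy frame. The following sketch uses a made-up ten-row frame rather than the actual dataset:

```python
import pandas as pd

# Ten-row stand-in frame so the slice lengths are easy to check by hand
data = pd.DataFrame({'x': range(10)})

print(len(data[:5]))   # 5 -> rows 0..4 (the "first five rows")
print(len(data[1:5]))  # 4 -> rows 1..4 (row 0 is skipped)
print(len(data[5:]))   # 5 -> everything except the first five rows
```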
Another important way to subset a data frame by rows is conditional or Boolean subsetting. In this method, one filters the rows that satisfy certain conditions. The condition can be an inequality or a comparison written inside the square bracket. Let us see a few examples of how one can go about implementing them.
Suppose we want all the rows for which the value of Total Mins is greater than 500. This can be done as follows:
data1=data[data['Total Mins']>500]
data1.shape
Suppose we want all the rows for which the state is VA:
data1=data[data['State']=='VA']
data1.shape
Multiple conditions can be combined using the AND (&) and OR (|) operators. To filter all the rows in the state VA that have Total Mins greater than 500, we can write:
data1=data[(data['Total Mins']>500) & (data['State']=='VA')]
data1.shape
Only the rows satisfying both conditions are kept, so the AND operator has a subtractive effect. To filter the rows that are either in the state VA or have Total Mins greater than 500, we can write the following code:
data1=data[(data['Total Mins']>500) | (data['State']=='VA')]
data1.shape
The rows satisfying either condition are kept, so the OR operator has an additive effect.
This is the most used form of subsetting a dataset. Earlier in this chapter, we selected three columns of this dataset and called the sub-setted data frame subdata
. What if we wish to look at specific rows of that sub-setted data frame? How can we do that? We just need another square bracket adjacent to the one already existing.
Let's say, we need to look at the first 50 rows of that sub-setted data frame. We can write a snippet, as shown:
subdata_first_50=data[['Account Length','VMail Message','Day Calls']][:50]
subdata_first_50
We can use the already created subdata
data frame and subset it for the first 50 rows by typing:
subdata[:50]
Alternatively, one can subset the columns using the list name as explained earlier and then subset for rows.
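The Boolean and combined row-and-column subsetting above can be exercised end to end on a stand-in frame. The values and state codes below are made up for illustration:

```python
import pandas as pd

# Stand-in frame with illustrative columns and values
data = pd.DataFrame({'Account Length': [128, 107, 137, 84],
                     'Day Calls': [110, 123, 114, 71],
                     'State': ['VA', 'OH', 'VA', 'NJ'],
                     'Total Mins': [707.2, 611.5, 527.2, 498.0]})

# Boolean subsetting: & is subtractive, | is additive
both   = data[(data['Total Mins'] > 550) & (data['State'] == 'VA')]
either = data[(data['Total Mins'] > 550) | (data['State'] == 'VA')]
print(both.shape, either.shape)   # (1, 4) (3, 4)

# Column subset followed by a row slice, chained with two brackets
first_two = data[['Account Length', 'Day Calls']][:2]
print(first_two.shape)            # (2, 2)
```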
Another method to select both rows and columns together is the .ix method. Note, however, that .ix was deprecated and has been removed in modern versions of pandas (its behaviour also varied with the installed version); .loc (label-based) and .iloc (position-based) are the recommended replacements. Let's see how the .ix method was used.
Basically, in the .ix method, we can provide the row and column indices (in layman's terms, the row and column numbers) inside the square bracket. The syntax can be summarized as follows:
data.ix[row indices, column indices]
Now, let's have a look at a few examples:
data.ix[1:100,1:6]
The output looks similar to the following screenshot:
If all the rows are needed, the row index can be left blank:
data.ix[:,1:6]
Similarly, if all the columns are needed, the column index can be left blank:
data.ix[1:100,:]
The row and column numbers/name can be passed off as a list, as well. Let's have a look at how it can be done:
data.ix[1:100,[2,5,7]]
The output looks similar to the following screenshot:
data.ix[[1,2,5],[2,5,7]]
The output looks similar to the following screenshot:
Instead of row and column indices or numbers, we can also write corresponding column names, as shown in the following example:
data.ix[[1,2,5],['Area Code','VMail Plan','Day Mins']]
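Since .ix is gone from current pandas, the same selections are done today with .iloc (by position) and .loc (by label). A minimal sketch on a made-up frame, with column names borrowed from the examples above but values invented for illustration:

```python
import pandas as pd

# Stand-in frame (values are made up)
data = pd.DataFrame({'Area Code': [415, 408, 510],
                     'VMail Plan': ['yes', 'no', 'yes'],
                     'Day Mins': [265.1, 161.6, 243.4]})

# Position-based selection, roughly what data.ix[1:100, 1:6] did by number
print(data.iloc[0:2, 0:2])

# Label-based selection, what data.ix[..., ['Area Code', ...]] did by name
print(data.loc[[0, 2], ['Area Code', 'Day Mins']])
```

Keeping the position-based and label-based cases in separate accessors is exactly what made .loc/.iloc more predictable than .ix, whose guess between the two depended on the index type.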
Many times during an analysis, we are required to create a new column, based either on some calculation or modification of the existing columns, or containing a constant value to be used in the modelling. Hence, knowing how to create new columns is an indispensable skill. Let's see how to do that.
Suppose, in the Customer Churn Model
dataset, we want to calculate the total minutes spent during the day, evening, and night. This requires summing up the 3 columns, which are Day Mins
, Eve Mins
, and Night Mins
. It can be done, as shown in the following snippet:
data['Total Mins']=data['Day Mins']+data['Eve Mins']+data['Night Mins']
data['Total Mins'].head()
The output of the snippet is, as follows:
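The same column creation can be checked on a tiny stand-in frame with made-up minute values (rounding guards against floating-point noise in the sums):

```python
import pandas as pd

# Stand-in frame with the three "minutes" columns (values are made up)
data = pd.DataFrame({'Day Mins':   [265.1, 161.6],
                     'Eve Mins':   [197.4, 195.5],
                     'Night Mins': [244.7, 254.4]})

# New column = element-wise sum of the three existing columns
data['Total Mins'] = data['Day Mins'] + data['Eve Mins'] + data['Night Mins']
print(data['Total Mins'].round(1).tolist())  # [707.2, 611.5]
```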