As we have seen in the chapters on modelling techniques, a predictive model is, in essence, a set of mathematical equations derived using a few lines of code. This code, together with a slide deck highlighting the high-level results from the model, constitutes a project. However, the user of our solution is more interested in finding a solution to the problem they are facing in their business context. It is the responsibility of the analyst or data scientist to deliver the solution in a way that is user-friendly and maximizes output or insights.
There are some general guidelines that can be followed for optimum results in a predictive modelling project. As predictive modelling comprises a mix of computer science techniques, algorithms, statistics, and business-context capabilities, the best practices in predictive modelling are an amalgamation of the best practices in these individual fields.
In this chapter, we will be learning about the best practices adopted in the field of predictive modelling to get the optimum results. The major headings under which all the best practices in the field of predictive analytics/modelling can be clubbed are as follows:
The following are some of the best practices or conventional wisdom amassed over the decade-long existence of predictive modelling.
When one uses Python for predictive modelling, one needs to write small snippets of code. To ensure that one gets the maximum out of these code snippets and that the work is reproducible, one should be aware of, and aspire to follow, the best practices in coding. Some of the best practices for coding are as follows.
There is a trade-off between the elegance and the understandability of a code snippet. As a code snippet becomes more elegant, its understandability for a new user (someone other than the author of the snippet) decreases. Some users are interested only in the end results, but most like to understand what is going on under the hood and want a good understanding of the code.
For the code snippet to be understandable by a new person or the user of the code, it is a common practice to comment on the important lines, if not all the lines, and write the headings for the major chunks of the code. Some of the properties of a comment are as follows:
A comment in Python starts with a # in front of the line; everything after the # on that line is ignored when the code is executed.
Some of the reasons to prefer a commented code are as follows:
The following is an example of a well-commented code snippet, clearly stating the objective of the code in the header and the purpose of each line in the comments. This code has already been used in Chapter 3, Data Wrangling. You can revisit that chapter for more context and then try to understand the code with the comments. Most likely, the code will be easier to understand with the comments:
# appending 333 similar datasets to form a bigger dataset
import pandas as pd  # importing the pandas library

# defining the filepath variable as the folder which has all the small datasets
filepath = 'E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/lotofdata'

# initializing the data_final data frame with the first dataset of the lot
data_final = pd.read_csv(filepath + '/001.csv')
# initializing the data_final_size variable, which counts the number of rows
# in the data_final data frame
data_final_size = len(data_final)

for i in range(2, 334):  # looping over the remaining files, 002.csv to 333.csv
    # the files are named 001.csv, 010.csv, 101.csv, and so on; accordingly,
    # three cases arise while building the filename variable
    if i < 10:
        filename = '00' + str(i) + '.csv'  # i<10 requires two leading zeros
    if 10 <= i < 100:
        filename = '0' + str(i) + '.csv'  # i<100 requires one leading zero
    if i >= 100:
        filename = str(i) + '.csv'  # i>=100 requires no leading zeros
    # the file variable, built from filepath and filename, points to a new
    # file in every iteration
    file = filepath + '/' + filename
    data = pd.read_csv(file)  # the file is read as a data frame called data
    # the data_final_size variable is updated by adding the length of the
    # currently read file
    data_final_size += len(data)
    # concatenating/appending data to the data_final data frame on axis=0,
    # that is, on rows
    data_final = pd.concat([data_final, data], axis=0)

# printing the data_final_size variable containing the number of rows in the
# final data frame
print(data_final_size)
The same code looks as follows in the IPython Notebook:
Any code implements a set of tasks. Many of these tasks are an important part of the overall task at hand but can be segregated from the main code. Such tasks can be defined separately as functions, parameterizing all the possible inputs required for the particular task. These functions can then be called with the particular inputs, and their output can be used in the implementation of the main task.
Functions are useful for a variety of reasons: they keep the code modular and reusable, and a function can be applied to an entire column of a data frame at once using the apply method.
Let us now see a few examples of functions defined to implement small tasks.
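Before those examples, here is a minimal sketch of applying a function to a column with the apply method; the data frame and the conversion function are hypothetical:

```python
import pandas as pd

# a small hypothetical data frame of temperatures in Fahrenheit
df = pd.DataFrame({'temp_f': [32.0, 68.0, 212.0]})

def to_celsius(f):
    # convert a single Fahrenheit value to Celsius
    return (f - 32) * 5.0 / 9.0

# apply calls the function on every value of the column
df['temp_c'] = df['temp_f'].apply(to_celsius)
print(df['temp_c'].tolist())  # → [0.0, 20.0, 100.0]
```

Writing the conversion once as a function and applying it to the whole column avoids repeating the formula for every value.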
This function takes a positive integer greater than 2 as an input and creates a Fibonacci sequence with as many members in the sequence as the positive integer:
def fib(n):
    # the first two members of the sequence
    a, b = 1, 2
    count = 3
    fib_list = [1, 2]
    # keep generating new members until the list has n of them
    while count <= n:
        a, b = b, a + b
        count = count + 1
        fib_list.append(b)
    return fib_list
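A quick call confirms the behaviour; the function definition is repeated here so that the snippet runs on its own:

```python
def fib(n):
    # generate a Fibonacci sequence starting 1, 2 with n members
    a, b = 1, 2
    count = 3
    fib_list = [1, 2]
    while count <= n:
        a, b = b, a + b
        count = count + 1
        fib_list.append(b)
    return fib_list

print(fib(7))  # → [1, 2, 3, 5, 8, 13, 21]
```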
This function calculates the distance of a particular place (defined by latitude and longitude) from a list of possible stations and finds the station that is the closest to the given place. This function can take an array consisting of latitude and longitude and calculate the distances from possible stations for each location. It can also be applied to a column of data consisting of latitude and longitude in each row to find the closest station for each location (defined by latitude and longitude):
import numpy as np  # the function relies on NumPy arrays

def closest_station(lat, longi, stations):
    loc = np.array([lat, longi])
    # broadcasting subtracts the location from every station at once
    deltas = stations - loc[None, :]
    # squared Euclidean distance from the location to each station
    dist2 = (deltas ** 2).sum(1)
    # index of the station with the smallest distance
    return np.argmin(dist2)
An example of the list of stations can be a list of two possible stations, containing the latitudes and longitudes of both the stations, as follows:
stations = np.array([[41.995, -87.933], [41.786, -87.752]])
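Putting the pieces together, we can look up the closest station for a hypothetical location; the function and the station list are repeated here so that the snippet runs on its own:

```python
import numpy as np

def closest_station(lat, longi, stations):
    # squared Euclidean distance from the location to every station
    loc = np.array([lat, longi])
    deltas = stations - loc[None, :]
    dist2 = (deltas ** 2).sum(1)
    return np.argmin(dist2)

stations = np.array([[41.995, -87.933], [41.786, -87.752]])

# a hypothetical location that lies closer to the first station
print(closest_station(41.9, -87.9, stations))  # → 0
```

The function returns the index of the closest station in the stations array, which can then be used to look up that station's coordinates.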
A function can work without any input as well. These functions will perform some task, but will not necessarily return an output. They perform some sort of transformation or manipulation. Functions can also be defined to implement several repetitive tasks at once; for example, applying the same conversion to all the columns of a dataset.
One example of defining such a function is a function that converts several columns of a dataset to a desired data type at once. This is a very common data preparation step, used in almost all predictive modelling projects, because the data types of some of the columns need to be changed to facilitate a particular operation or calculation in the business context.
In the following example, a hypothetical dataset called datafile is opened as a dictionary so that it can be read line by line and subsetted easily over columns. The dataset has the columns Date, Latitude, Longitude, NumA1, and NumA2, which need to be converted to the date, float, float, int, and int data types, respectively. A dictionary consisting of the column names and the required data type of each column is defined. Each column is then converted to the required data type, and the resultant line is appended to the final dataset called data:
import csv
# here, date is assumed to be a string-to-date converter defined elsewhere

def load_data():
    data = []
    # each line of the file is read as a dictionary keyed by column name
    for line in csv.DictReader(open("..../datafile.csv")):
        # convert each named column to its required data type
        for name, converter in {"Date": date, "Latitude": float,
                                "Longitude": float, "NumA1": int,
                                "NumA2": int}.items():
            line[name] = converter(line[name])
        data.append(line)
    return data
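The following is a self-contained sketch of the same pattern; the filename, the sample row, and the parse_date helper are hypothetical stand-ins (parse_date plays the role of the date converter):

```python
import csv
from datetime import datetime

# write a tiny hypothetical CSV so that the example can actually run
with open("datafile.csv", "w") as f:
    f.write("Date,Latitude,Longitude,NumA1,NumA2\n")
    f.write("2015-06-01,41.995,-87.933,3,7\n")

def parse_date(s):
    # a simple string-to-date converter standing in for the date converter
    return datetime.strptime(s, "%Y-%m-%d").date()

def load_data():
    data = []
    for line in csv.DictReader(open("datafile.csv")):
        # convert each named column to its required data type
        for name, converter in {"Date": parse_date, "Latitude": float,
                                "Longitude": float, "NumA1": int,
                                "NumA2": int}.items():
            line[name] = converter(line[name])
        data.append(line)
    return data

data = load_data()
print(data[0]["NumA1"] + data[0]["NumA2"])  # → 10
```

After the conversion, the numeric columns support arithmetic and the Date column supports date operations, which is exactly why the type conversion is done up front.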
As one can see, these small tasks are significant enough on their own to be identified as separate subtasks; in the larger picture, they are parts of the larger task and can be reused later for further analyses.
One of the most essential guidelines for writing legible and easy-to-debug code is to create variables and avoid hard-coding as much as possible.
Some of the benefits of avoiding hard-coding can be listed as follows:
The following code is an example of defining variables and avoiding hard-coding. Here, we define variables for a particular directory path and a couple of filenames. We can then use these variables for subsequent operations, such as reading one of the files. If one of the inputs, such as the directory path, needs to be changed, the change has to be made at only one place, that is, where the variable is defined:
import pandas as pd
import os

# defining the directory path and the filenames as variables
filepath = 'E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Interesting Datasets'
filename1 = 'chopstick-effectiveness.csv'
filename2 = 'pigeon-racing.csv'

# the variables are reused wherever the files are read
df1 = pd.read_csv(os.path.join(filepath, filename1))
df2 = pd.read_csv(os.path.join(filepath, filename2))
df1.head()
df2.head()
Parameters that need to be changed regularly should always be defined as variables. Some code needs to be run repeatedly with a change in only one of the inputs; in such cases too, defining a variable comes in very handy.
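As a minimal sketch of this point (the parameter name and values are hypothetical), a threshold that changes between runs can be defined once at the top and reused everywhere:

```python
# hypothetical classification cutoff that is tuned across repeated runs;
# changing it here changes it everywhere it is used
cutoff = 0.5

def classify(probabilities, cutoff):
    # label each probability 1 or 0 by comparing it with the cutoff
    return [1 if p >= cutoff else 0 for p in probabilities]

print(classify([0.2, 0.7, 0.55], cutoff))  # → [0, 1, 1]
```

To rerun the analysis with a stricter cutoff, only the single assignment at the top needs to change.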
While a code is being developed, changes and improvements are suggested phase-wise and not all at once. It is not possible to write, in one sitting, a perfectly working code with no scope for improvement. However, an intermediate version of the code might be used for demonstration (a proof of concept, or POC; before a project starts officially, evidence is needed to prove that the concept can be put into reality) and testing purposes. Hence, there is a need to follow version control.
Version control essentially means saving a copy of the old code, making a copy of it, renaming the copy, and making changes to the new copy. The new copy is the new version of the code. This new copy can be released as the latest production version once it has been tested after the changes and runs without error. Until then, the latest-but-one version of the code should be used as the production version. Version control can be done manually or by using version-control tools such as GitHub and so on.
As far as possible, try to use a function or method that already exists to perform a particular task in the production version of the code. For a better understanding of how a method works, one can deconstruct the method and try building it up from scratch (as we have done for the logistic regression algorithm in this book), but this should be a part of the exploratory work. In the production version, already existing methods should be used.
For example, to calculate correlation, one should use the already existing method and not reinvent the wheel from scratch. Another example is the groupby functionality in pandas, which splits a dataset into groups based on the categories of a categorical variable. Using existing methods saves time and also increases the elegance of the code snippet. There is an ample number of libraries to choose from for performing tasks in Python. One should choose a library that performs well and is stable across a range of IDEs, interpreters, and operating systems.
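Both points can be sketched with a small hypothetical data frame: the built-in corr method replaces a hand-written correlation formula, and groupby splits the data by a categorical variable:

```python
import pandas as pd

# a small hypothetical dataset with a categorical variable
df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'],
                   'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [2.0, 4.1, 6.0, 8.2]})

# use the existing corr method instead of coding the formula from scratch
print(df['x'].corr(df['y']))

# use groupby to compute a statistic per category of the categorical variable
print(df.groupby('category')['x'].mean())
```

One line each replaces what would otherwise be a dozen lines of manual bookkeeping, and the pandas implementations are well tested across platforms.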