Chapter 9. Best Practices for Predictive Modelling

As we have seen in the chapters on modelling techniques, a predictive model is nothing but a set of mathematical equations derived using a few lines of code. In essence, this code, together with a slide deck highlighting the high-level results from the model, constitutes a project. However, the user of our solution is more interested in solving the problem they are facing in their business context. It is the responsibility of the analyst or data scientist to deliver the solution in a way that is user-friendly and maximizes output or insights.

There are some general guidelines that can be followed for optimum results in a predictive modelling project. As predictive modelling comprises a mix of computer science techniques, algorithms, statistics, and business-context capabilities, the best practices in predictive modelling are a combination of the best practices in each of these individual fields.

In this chapter, we will learn about the best practices adopted in the field of predictive modelling to get optimum results. The major headings under which all the best practices in the field of predictive analytics/modelling can be grouped are as follows:

  • Code: This makes the code legible, reproducible, elegant, and parametric.
  • Data handling: This makes sure the data is read correctly without any loss. Also, it makes preliminary guesses about the data.
  • Algorithms: This explains the math underlying the selected algorithm in a lucid manner and illustrates how the selected algorithm is the best fit for the problem in the business context. In brief, it answers why this is the most suitable algorithm for the given problem.
  • Statistics: This applies the statistical tests relevant to the business context and interprets their result; it interprets the statistical output parameters of the algorithms or models and documents their implications in the business context.
  • Business context/communication: This clearly states the key insights unearthed from the analysis, the improvement or change the model has brought along with its implications in the business context, and the key action items for the business.

The following are some of the best practices and conventional wisdom amassed over the decades-long existence of predictive modelling.

Best practices for coding

When one uses Python for predictive modelling, one needs to write small snippets of code. To get the maximum out of these code snippets and ensure that the work is reproducible, one should be aware of, and aspire to follow, the best practices in coding. Some of the best practices for coding are as follows.

Commenting the code

There is a tradeoff between the elegance and understandability of a code snippet. As a code snippet becomes more elegant, its understandability for a new user (someone other than the author of the snippet) decreases. Some users are interested only in the end results, but most users like to understand what is going on under the hood and want to have a good understanding of the code.

For the code snippet to be understandable by a new person or user of the code, it is common practice to comment on the important lines, if not all the lines, and to write headings for the major chunks of the code. Some of the properties of a comment are as follows:

  • The comment should be succinct, brief, and preferably a one-liner.
  • The comment should be part of the code but, unlike the other parts of the code, it shouldn't be executed. In Python, a line can be commented out by prepending a hash (#) to it.

Some of the reasons to prefer commented code are as follows:

  • Commenting out lines can also be used for testing the code and trying small modifications in a considerably large code snippet.
  • Transferring the understanding of the code to a new person is an integral part of the knowledge-transfer process in project management.

The following is an example of a well-commented code snippet, clearly stating the objective of the code in the header and the purpose of each line in the comments. This code has already been used in Chapter 3, Data Wrangling. You can revisit it for more context and then try to understand the code with the comments. Most likely, it will be easier to understand with the comments:

# appending 333 similar datasets to form a bigger dataset

import pandas as pd          # importing pandas library
filepath='E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/lotofdata' # defining filepath variable as the folder
# which has all the small datasets
data_final=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/lotofdata/001.csv') # initialising the
# data_final data frame with the first dataset of the lot
data_final_size=len(data_final)    # initializing the data_final_size variable which counts the number of rows in the data_final data frame
for i in range(2,334):             # looping over the remaining files, 002.csv to 333.csv (001.csv has already been read above)
    if i<10:
        filename='0'+'0'+str(i)+'.csv' # the files are named 001.csv, 010.csv, 100.csv, and so on. Accordingly, 3 conditions arise for the
        # filename variable. i<10 requires prepending 2 zeros at the beginning.
    if 10<=i<100:
        filename='0'+str(i)+'.csv' # i<100 requires prepending 1 zero at the beginning.
    if i>=100:
        filename=str(i)+'.csv'     # i>=100 requires prepending no zeros at the beginning.

    file=filepath+'/'+filename     # defining the file variable by joining the filepath and filename variables. The file variable points
    # to a new file in every iteration
    data=pd.read_csv(file)         # the file is read as a data frame called data
    data_final_size+=len(data)     # the data_final_size variable is updated by adding the length of the currently read file


    data_final=pd.concat([data_final,data],axis=0)  # concatenating/appending data to the data_final data frame on axis=0 i.e. on rows
print(data_final_size)                              # printing the data_final_size variable containing the number of rows in the final
# data frame

The same code looks as follows in the IPython Notebook:


Fig. 9.1: An example of a well-commented code

Defining functions for substantial individual tasks

Any code implements a set of tasks. Many of these tasks are an important part of the overall task at hand, but can be segregated from the main code. Such tasks can be defined separately as functions, parameterizing all the possible inputs required for the particular task. These functions can then be called with the particular inputs, and their output can be used in the implementation of the main task.

Functions are useful for a variety of reasons, as follows:

  • Functions are also useful when the same task needs to be performed a large number of times with just a minor change in the inputs.
  • Defining separate functions makes the code legible and easier to understand and follow. In the absence of a function, the code becomes cluttered and difficult to follow.
  • If the task performed by the function is a calculation, transformation, or aggregation, then it can be easily applied across columns using the apply method.
  • Debugging also becomes difficult in the absence of functions. Functions can be tested on their own; if they work fine, then we know that the error lies somewhere else in the code.

Let us now see a few examples of a function defined to implement small tasks.

Example 1

This function takes a positive integer greater than 2 as input and creates a Fibonacci sequence with as many members as the value of the integer:

def fib(n):
    a,b=1,2                # initializing the first two members of the sequence
    count=3                # position of the next member to be generated
    fib_list=[1,2]         # the sequence starts with its first two members
    while(count<=n):
        a,b=b,a+b          # the next member is the sum of the previous two
        count=count+1
        fib_list.append(b) # appending the new member to the sequence
    return fib_list
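
As a quick sanity check, calling the function with n=7 returns a list with seven members:

fib(7)    # returns [1, 2, 3, 5, 8, 13, 21]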

Example 2

This function calculates the distance of a particular place (defined by its latitude and longitude) from a list of possible stations and finds the station that is closest to the given place. The function takes a latitude and longitude together with an array of possible stations and calculates the distance to each station. It can also be applied to a column of data consisting of a latitude and longitude in each row to find the closest station for each location:

import numpy as np

def closest_station(lat, longi, stations):
    loc = np.array([lat, longi])       # the given place as a latitude-longitude array
    deltas = stations - loc[None, :]   # element-wise differences from each station
    dist2 = (deltas**2).sum(1)         # squared Euclidean distance to each station
    return np.argmin(dist2)            # index of the station with the smallest distance

An example of the list of stations is a list of two possible stations containing the latitudes and longitudes of both, as follows:

    stations = np.array([[41.995, -87.933],
                         [41.786, -87.752]])
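
As an illustrative sketch of both usages described above, the function can be called for a single place or applied row-wise; the data frame df and its Latitude and Longitude column names here are assumptions:

closest_station(41.9, -87.9, stations)    # returns 0, the index of the closer station

# applying the function to each row of a hypothetical data frame with Latitude and Longitude columns
df['closest']=df.apply(lambda row: closest_station(row['Latitude'], row['Longitude'], stations), axis=1)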

Example 3

A function can work without any input as well. Such functions perform some task, but do not necessarily return an output; they perform some sort of transformation or manipulation. Functions can also be defined to implement several repetitive tasks at once; for example, applying the same conversion to all the columns of a dataset.

One example of such a function is one that converts several columns of a dataset to the desired data types at once. This is a very common and widely used data preparation step in almost all predictive modelling projects, because the data types of some columns need to be changed to facilitate a particular operation or calculation in the business context.

In the following example, a hypothetical dataset called datafile is opened as a dictionary so that it can be read line by line and subsetted easily over columns. The dataset has the columns Date, Latitude, Longitude, NumA1, and NumA2 that need to be converted to the date, float, float, int, and int data types, respectively. A dictionary consisting of the column names and the required data type for each column is defined. Each column is then converted to the required data type, and the resultant line is appended to the final dataset called data:

import csv
from datetime import datetime

def date(s):                           # converter for the Date column; the date format here is an assumption
    return datetime.strptime(s, '%m/%d/%Y').date()

def load_data():
    data = []
    for line in csv.DictReader(open("..../datafile.csv")):      # the path is a placeholder
        for name, converter in {"Date" : date,
                                "Latitude" : float, "Longitude" : float,
                                "NumA1" : int, "NumA2" : int}.items():
            line[name] = converter(line[name])   # converting each field to its required data type
        data.append(line)
    return data
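
If the dataset is already in a pandas data frame, a minimal sketch of the same bulk conversion (assuming the same column names, and letting pandas infer the date format) can use the built-in pd.to_datetime and astype methods:

import pandas as pd

df=pd.read_csv("..../datafile.csv")                     # the placeholder path from the example above
df['Date']=pd.to_datetime(df['Date'])                   # converting the Date column to a datetime type
df=df.astype({'Latitude': float, 'Longitude': float,    # converting several columns in one call
              'NumA1': int, 'NumA2': int})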

As one can see, these small tasks are significant enough on their own to be identified as separate subtasks, but in the larger picture, they are part of the larger task, and their outputs can be used later for further analyses.

Avoid hard-coding of variables as much as possible

One of the most essential guidelines to follow while writing legible and easy-to-debug code is to create variables and avoid hard-coding as much as possible.

Some of the benefits of avoiding hard-coding can be listed as follows:

  • Hard-coding makes it difficult to spot errors and debug the code. If a variable is created, one needs to check for the error at just one place, that is, the place where the variable has been defined. If not, one has to go through the entire code, spot the places where hard-coding has been used, and check for errors at all those places.
  • Also, once defined, a variable can be used at multiple places in the code. Making a change in the script becomes easier, as a variable-related change needs to be made only once, where the variable is defined.

The following code is one example of defining a variable and avoiding hard-coding. Here, we are defining a variable for a particular directory path and a couple of files. We can use these variables for subsequent usage such as reading one of the files. If one of the inputs, such as the directory path, needs to be changed, it can be done by making a change only at one place, that is, where it is defined:

import pandas as pd
import os
filepath='E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Interesting Datasets'
filename1='chopstick-effectiveness.csv'
filename2='pigeon-racing.csv'

df1=pd.read_csv(os.path.join(filepath,filename1))
df2=pd.read_csv(os.path.join(filepath,filename2))

df1.head()
df2.head()

Parameters that need to be changed regularly should always be defined as variables. Some code needs to be run repeatedly with a change in only one of the inputs; in such cases, too, defining a variable comes in very handy.
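
For instance, in the following minimal sketch (the sample fraction and the output file name are assumptions), the parameter that changes between runs sits at the top of the script, so a re-run requires editing only one line:

import pandas as pd

sample_frac=0.3                                          # the only parameter that changes between runs
df=pd.read_csv('pigeon-racing.csv')                      # one of the files used in the example above
df.sample(frac=sample_frac).to_csv('sample_output.csv')  # hypothetical output file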

Version control

While developing code, changes and improvements are suggested phase-wise and not all at once. It is not possible to write, in one sitting, perfectly working code with no scope for improvement. However, the intermediate code might be used for demonstration (proof of concept, or POC: before a project starts officially, evidence is needed to prove that the concept can be put into practice) and testing purposes. Hence, there is a need to follow version control.

Version control essentially means preserving the old code, making a copy of it, renaming the copy, and making the changes to the new copy. The new copy is the new version of the code. This new copy can be released as the latest production version once it has been tested after the changes and runs without error. Until then, the previous version of the code should be used as the production version. Version control can be done manually or by using version control tools such as Git and GitHub.

Using standard libraries, methods, and formulas

As far as possible, try to use an existing function or method to perform a particular task in the production version of the code. For a better understanding of how a method works, one can deconstruct the method and try building it up from scratch (as we have done for the logistic regression algorithm in this book), but this should be a part of the exploratory work. In the production version, already existing methods should be used.

For example, to calculate correlation, one should use the already existing method and not reinvent the wheel from scratch. Another example is the groupby functionality in pandas, used to split a dataset into groups based on the different categories of a categorical variable. This saves time and also increases the elegance of the code snippet. There are ample libraries to choose from to perform a task in Python. One should choose a library that performs well and is stable over a range of IDEs, interpreters, and OSs.
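
As a minimal sketch on a small made-up data frame, both of these reduce to one-liners with the existing pandas methods, with no need to code the formulas from scratch:

import pandas as pd

df=pd.DataFrame({'group': ['a','a','b','b'],     # a small illustrative dataset
                 'x': [1,2,3,4],
                 'y': [2,4,5,9]})
print(df[['x','y']].corr())                      # correlation via the built-in corr method
print(df.groupby('group')['y'].mean())           # splitting into groups and aggregating with groupby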
