Training the model

We will train a model for matching these question pairs. Let's start by importing the relevant libraries, as follows:

import sys
import os
import pandas as pd
import numpy as np
import string
import tensorflow as tf

The following function takes a pandas series of text as input and converts it to a list. Each item is cast to a string, lowercased, stripped of surrounding whitespace, and split into a list of its characters. The resulting list is converted into a NumPy array and returned:

def read_x(x):
    x = np.array([list(str(line).lower().strip()) for line in x.tolist()])
    return x

Next up is a function that takes a pandas series as input, converts it to a list, and returns it as a NumPy array:

def read_y(y):
    return np.asarray(y.tolist())
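As a quick check of what these two helpers return, here is a minimal sketch on toy data. The series contents are hypothetical, chosen so that both questions have the same length after stripping. Questions of differing lengths produce nested lists of unequal length, which recent NumPy versions refuse to stack into a rectangular array, so real data would need padding or truncation first (not covered in this section).

```python
import numpy as np
import pandas as pd


def read_x(x):
    # Lowercase and strip each entry, then split it into single characters.
    return np.array([list(str(line).lower().strip()) for line in x.tolist()])


def read_y(y):
    # Convert the label series to a plain NumPy array.
    return np.asarray(y.tolist())


# Hypothetical toy data: both questions are 5 characters long after stripping,
# so the nested character lists stack into a rectangular 2 x 5 array.
questions = pd.Series(["  Hello ", "WORLD"])
labels = pd.Series([0, 1])

chars = read_x(questions)
y = read_y(labels)

print(chars.shape)        # (2, 5)
print(chars[0].tolist())  # ['h', 'e', 'l', 'l', 'o']
print(y.tolist())         # [0, 1]
```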

The next function splits the data into training and validation sets. Validation data shows how well a model trained on the training data generalizes to unseen data. The validation examples are picked at random by shuffling the indices of the data. The function takes the question pairs and their corresponding labels as input, along with the fraction of the data to hold out for validation:

def split_train_val(x1, x2, y, ratio=0.1):
    indicies = np.arange(x1.shape[0])
    np.random.shuffle(indicies)
    num_train = int(x1.shape[0]*(1-ratio))
    train_indicies = indicies[:num_train]
    val_indicies = indicies[num_train:]

The validation ratio defaults to 10%, so 90% of the shuffled indices go to training and the remaining 10% to validation, separated by slicing the array. Because the indices are already shuffled, slicing them yields a random split. The input data consists of question pairs, with the first question of each pair in x1 and the second in x2, and a label y indicating whether the pair is a duplicate:

    train_x1 = x1[train_indicies, :]
    train_x2 = x2[train_indicies, :]
    train_y = y[train_indicies]

As with the training question pairs and labels, the validation data is selected with the remaining 10% of the shuffled indices:

    val_x1 = x1[val_indicies, :]
    val_x2 = x2[val_indicies, :]
    val_y = y[val_indicies]

    return train_x1, train_x2, train_y, val_x1, val_x2, val_y

The training and validation data are picked from the shuffled indices, and the data is split based on the indices.

Note that both questions of each pair, and their label, have to be picked with the same indices, for training and validation alike, so that the pairs and labels stay aligned.
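Pieced together, the fragments above form a single function. The following is a minimal sketch that assembles them and runs the split on toy data. The toy arrays and the fixed random seed are assumptions made for illustration; the inputs are taken to be rectangular 2D arrays, as the `[indices, :]` slicing requires.

```python
import numpy as np


def split_train_val(x1, x2, y, ratio=0.1):
    # Shuffle the row indices so the validation rows are chosen at random.
    indicies = np.arange(x1.shape[0])
    np.random.shuffle(indicies)
    num_train = int(x1.shape[0] * (1 - ratio))
    train_indicies = indicies[:num_train]
    val_indicies = indicies[num_train:]

    # The same indices select rows from x1, x2, and y,
    # which keeps each pair aligned with its label.
    train_x1 = x1[train_indicies, :]
    train_x2 = x2[train_indicies, :]
    train_y = y[train_indicies]

    val_x1 = x1[val_indicies, :]
    val_x2 = x2[val_indicies, :]
    val_y = y[val_indicies]

    return train_x1, train_x2, train_y, val_x1, val_x2, val_y


# Hypothetical toy data: 20 "questions" of 5 encoded characters each.
np.random.seed(0)  # fixed seed only to make the shuffle repeatable
x1 = np.arange(100).reshape(20, 5)
x2 = x1 + 1000          # row i of x2 is the partner of row i of x1
y = np.arange(20) % 2   # label of row i

train_x1, train_x2, train_y, val_x1, val_x2, val_y = split_train_val(x1, x2, y)
print(train_x1.shape, val_x1.shape)  # (18, 5) (2, 5)
```

Because `x2` is built as `x1 + 1000`, checking that `train_x2 - train_x1` is 1000 everywhere is a simple way to confirm that the pairs were not scrambled by the shuffle.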