Encoding the text

Next, a function is created to convert the list of strings into vectors. The character set is built first by concatenating the lowercase English characters with common punctuation, and a dictionary is created with those characters as keys and integers as the values:

import string
import numpy as np

def get_encoded_x(train_x1, train_x2, test_x1, test_x2):
    chars = string.ascii_lowercase + '? ()=+-_~"`<>,./|[]{}!@#$%^&*:;' + "'"

The preceding set contains just the characters common in English text. Characters from other languages can also be included, to make the approach generic across languages.

Note that this character set can be inferred from the dataset, to include non-English characters.
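
If you want to derive the character set from the data instead, a minimal sketch might look like the following (the helper name infer_chars is hypothetical and not part of this example's code):

def infer_chars(*question_lists):
    # Collect every distinct character that appears in any question
    unique_chars = set()
    for questions in question_lists:
        for line in questions:
            unique_chars.update(line)
    # Sort for a deterministic character-to-integer mapping
    return ''.join(sorted(unique_chars))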

Next, a character map is formed between the set of characters and integers; it is a dictionary mapping each character to a unique integer, as shown in the following code snippet:

    char_map = dict(zip(list(chars), range(len(chars))))

Next, the maximum sentence length is obtained by going through every question in the dataset: a list of individual question lengths is formed, and the maximum value is taken from it:

    max_sent_len = max([len(line) for line in np.concatenate(
        (train_x1, train_x2, test_x1, test_x2))])
    print('max sentence length: {}'.format(max_sent_len))

We need to preset the maximum length of all of the questions, so that every question can be quantized to a vector of that size. Whenever a question is shorter than the maximum length, spaces are simply appended to the text; with a maximum length of 4, for instance, 'ab' is padded to 'ab  ' and encoded as [0, 1, 27, 27], because the space character maps to the integer 27:

    def quantize(line):
        # Pad the question with spaces up to the maximum sentence length
        line_padding = line + ' ' * (max_sent_len - len(line))
        # Encode each character; characters outside the map fall back to space
        encode = [char_map[char] if char in char_map else char_map[' ']
                  for char in line_padding]
        return encode

The train and test question pairs are encoded by calling the preceding quantize function and converting the results to NumPy arrays. Every question is quantized in turn, as shown here:

    train_x1_encoded = np.array([quantize(line) for line in train_x1])
    train_x2_encoded = np.array([quantize(line) for line in train_x2])
    test_x1_encoded = np.array([quantize(line) for line in test_x1])
    test_x2_encoded = np.array([quantize(line) for line in test_x2])
    return (train_x1_encoded, train_x2_encoded, test_x1_encoded,
            test_x2_encoded, max_sent_len, char_map)

In the quantization step, every question is padded with spaces, each character is encoded using the character map, and the resulting lists of integers are converted to a NumPy array. The next function combines the preceding functions to preprocess the data. The data, available as .csv files, is read for both training and testing. Question 1 and question 2 come from different columns of the DataFrame and are split accordingly. The label is binary, indicating whether the question pair is a duplicate or not.

First, load the .csv files of the train and test datasets using the pandas framework:

import pandas as pd

def pre_process():
    train_data = pd.read_csv('train.csv')
    test_data = pd.read_csv('test.csv')

Then, pass the pandas DataFrames to the functions defined at the beginning, to convert the raw text into NumPy arrays, as shown in the following snippet. The question pandas Series are selected from the DataFrame columns before being passed to the functions:

    train_x1 = read_x(train_data['question1'])
    train_x2 = read_x(train_data['question2'])
    train_y = read_y(train_data['is_duplicate'])
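
The read_x and read_y functions were defined at the beginning of this example; if you are reading this section on its own, a minimal sketch of what they might look like is the following (the exact implementations may differ):

def read_x(data):
    # Lowercase every question; missing values become empty strings
    return np.array([str(line).lower() for line in data])

def read_y(data):
    # The labels are already 0/1 integers; convert to a NumPy array
    return np.array(data)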

Convert the test question pairs to NumPy arrays using the same functions, as shown here:

    test_x1 = read_x(test_data['question1'])
    test_x2 = read_x(test_data['question2'])

Next, pass the NumPy arrays of the train and test data for encoding, and get back the encoded question pairs, the maximum sentence length, and the character map:

    train_x1, train_x2, test_x1, test_x2, max_sent_len, char_map = \
        get_encoded_x(train_x1, train_x2, test_x1, test_x2)
    train_x1, train_x2, train_y, val_x1, val_x2, val_y = \
        split_train_val(train_x1, train_x2, train_y)

    return (train_x1, train_x2, train_y, val_x1, val_x2, val_y,
            test_x1, test_x2, max_sent_len, char_map)
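
The split_train_val function holds out part of the training pairs for validation. It is not shown in this section; a minimal sketch, assuming a random 90/10 split (the 0.1 validation fraction is an assumption), could be:

def split_train_val(train_x1, train_x2, train_y, val_fraction=0.1):
    # Shuffle indices so the validation set is a random sample
    indices = np.random.permutation(len(train_y))
    split = int(len(train_y) * (1 - val_fraction))
    train_idx, val_idx = indices[:split], indices[split:]
    return (train_x1[train_idx], train_x2[train_idx], train_y[train_idx],
            train_x1[val_idx], train_x2[val_idx], train_y[val_idx])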

The encoding can be called during the training process. 
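
For example, the preprocessing could be invoked as follows before building the model (a usage sketch):

(train_x1, train_x2, train_y,
 val_x1, val_x2, val_y,
 test_x1, test_x2,
 max_sent_len, char_map) = pre_process()

print('training pairs: {}'.format(len(train_y)))
print('vocabulary size: {}'.format(len(char_map)))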
