How to do it...

The preceding algorithm is implemented in code as follows (the code file is available as Handwritten_text_recognition.ipynb on GitHub):

Download and import the dataset. This dataset contains images of handwritten text and their corresponding ground truth (transcription).
Build a function that resizes pictures without distorting the aspect ratio and pads the remaining area so that all pictures have the same shape:

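import cv2
import numpy as np

def extract_img(img):
     # Start from a white 32 x 128 canvas
     target = np.ones((32,128))*255
     # Use the smaller scale factor so the picture fits in both dimensions
     new_shape1 = 32/img.shape[0]
     new_shape2 = 128/img.shape[1]
     final_shape = min(new_shape1, new_shape2)
     new_x = int(img.shape[0]*final_shape)
     new_y = int(img.shape[1]*final_shape)
     # cv2.resize expects the size as (width, height)
     img2 = cv2.resize(img, (new_y,new_x ))
     # Paste the first channel into the top-left corner; the rest stays white
     target[:new_x,:new_y] = img2[:,:,0]
     # Invert so that the background becomes black
     return 255-target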

In the preceding code, we create a blank picture (named target) and then resize the input picture while maintaining its aspect ratio.

Finally, we paste the rescaled picture on top of the blank one and return the result with a black background (255-target).
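To make the scaling rule concrete, the following minimal sketch isolates the arithmetic that decides the resized shape. The helper name scaled_shape and its parameters are hypothetical and are not part of the recipe; the sketch only mirrors the min-of-two-ratios logic used in extract_img.

```python
def scaled_shape(height, width, target_height=32, target_width=128):
    """Return (height, width) after aspect-preserving scaling into the target box."""
    # The smaller ratio guarantees that neither side exceeds the target
    scale = min(target_height / height, target_width / width)
    return int(height * scale), int(width * scale)

tall = scaled_shape(64, 128)   # height is the limiting side
wide = scaled_shape(50, 400)   # width is the limiting side
print(tall, wide)
```

Whichever side hits its limit first determines the scale, and the other side is padded with the blank background.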

Read the pictures and store them in a list, as shown in the following code:

import cv2

filepath = '/content/words.txt'
count = 0
x = []
y = []
x_new = []
with open(filepath) as f:
     for line in f:
          # Skip empty lines and comment lines
          if not line or line[0]=='#':
               continue
          try:
               lineSplit = line.strip().split(' ')
               fileNameSplit = lineSplit[0].split('-')
               # Build the picture path from the parts of the word ID
               img_path = ('/content/'+fileNameSplit[0]+'/'+fileNameSplit[0]+'-'+
                           fileNameSplit[1]+'/'+lineSplit[0]+'.png')
               # The transcription is the last field on the line
               img_word = lineSplit[-1]
               img = cv2.imread(img_path)
               img2 = extract_img(img)
               x_new.append(img2)
               x.append(img)
               y.append(img_word)
               count+=1
          except Exception:
               # Skip pictures that are missing or cannot be processed
               continue

In the preceding code, we read each picture and transform it with the function that we defined. The input and modified examples for different scenarios are as follows:

Extract the unique characters in the output, as follows:

import itertools
list2d = y
# Sort so that each character keeps the same index across runs
charList = sorted(set(itertools.chain(*list2d)))
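As a small illustration of what this step produces, the sketch below builds a character list from a toy set of words. The helper build_vocab and the sample words are hypothetical. Deriving the blank index from the vocabulary size is an assumption of this sketch; the recipe itself hard-codes the blank as 79.

```python
import itertools

def build_vocab(words):
    """Return the sorted unique characters and an index reserved for the blank."""
    char_list = sorted(set(itertools.chain.from_iterable(words)))
    blank_index = len(char_list)  # one past the last character index
    return char_list, blank_index

char_list, blank_index = build_vocab(['cat', 'act', 'tab'])
print(char_list, blank_index)
```

Sorting the characters keeps the character-to-index mapping stable between runs, which matters if a trained model is reloaded later.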

Create the output ground truth, as demonstrated in the following code:

import numpy as np

num_images = 50000
y2 = []
# Every picture yields 32 time steps from the network
input_lengths = np.ones((num_images,1))*32
label_lengths = np.zeros((num_images,1))
for i in range(num_images):
     # Map each character of the word to its index in charList
     val = list(map(lambda x: charList.index(x), y[i]))
     # Pad to 32 entries with 79, the blank index
     while len(val)<32:
         val.append(79)
     y2.append(val)
     label_lengths[i] = len(y[i])

In the preceding code, we store the index of each character of an output word in a list. Additionally, if the output is shorter than 32 characters, we pad it with 79, which represents the blank value. This assumes that charList holds 79 characters (indices 0 to 78), so that 79 is the extra index reserved for the blank.

Finally, we also store the label length (from the ground truth) and the input length (which is always 32).
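The encoding and padding of a single label can be sketched on a toy vocabulary as follows. The helper encode_label, the three-character vocabulary, and the shortened max_len are hypothetical values chosen only to keep the example readable.

```python
def encode_label(word, char_list, blank_index, max_len=32):
    """Map a word to character indices and pad it with the blank index."""
    indices = [char_list.index(ch) for ch in word]
    # Words longer than max_len are left unpadded in this sketch
    padded = indices + [blank_index] * (max_len - len(indices))
    return padded, len(indices)

padded, label_length = encode_label('cab', ['a', 'b', 'c'], blank_index=3, max_len=5)
print(padded, label_length)
```

The unpadded length is kept separately because the CTC loss needs to know how many entries of the padded label are real characters.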

Convert the input and output into NumPy arrays, as follows:

x = np.asarray(x_new[:num_images])
y2 = np.asarray(y2)
x= x.reshape(x.shape[0], x.shape[1], x.shape[2],1)

Define the objective, as shown here:

outputs = {'ctc': np.zeros([32])}

We initialize 32 zeros because the batch size will be 32. For each sample in the batch, the target loss value is zero.

Define the CTC loss function as follows:

from keras import backend as K

def ctc_loss(args):
     y_pred, labels, input_length, label_length = args
     return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

The preceding function takes the predicted values, the ground truth (labels), the input lengths, and the label lengths as input and calculates the CTC loss value.
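To give some intuition for what the CTC loss measures, the following sketch computes it by brute force for a tiny case: it enumerates every possible alignment, keeps those that collapse to the label, and sums their probabilities. This is an illustrative assumption-laden sketch, not how K.ctc_batch_cost is implemented; the enumeration is exponential in the number of time steps and is only usable for toy inputs. The function names and the uniform probabilities are hypothetical.

```python
import itertools
import math

def collapse(path, blank):
    """Merge repeated symbols, then drop blanks (the CTC collapsing rule)."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

def ctc_probability(probs, label, blank):
    """Sum the probability of every alignment that collapses to label.

    probs[t][k] is the probability of symbol k at time step t.
    """
    num_steps = len(probs)
    num_symbols = len(probs[0])
    total = 0.0
    for path in itertools.product(range(num_symbols), repeat=num_steps):
        if collapse(path, blank) == list(label):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total

# Two time steps, symbols 0 and 1 plus blank (index 2), uniform predictions
probs = [[1/3, 1/3, 1/3], [1/3, 1/3, 1/3]]
p = ctc_probability(probs, label=[0], blank=2)
loss = -math.log(p)  # the CTC loss is the negative log of this probability
print(p, loss)
```

With two time steps, three of the nine alignments collapse to the single-symbol label, so the loss falls as the network shifts probability mass toward those alignments.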

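Define the model as follows:

# Imports assumed by this recipe (standalone Keras 2.x)
from keras.layers import (Input, Conv2D, Activation, MaxPooling2D, Reshape, GRU,
                          add, concatenate, TimeDistributed, Dense, Lambda)
from keras.models import Model

input_data = Input(name='the_input', shape = (32, 128,1), dtype='float32')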

inner = Conv2D(32, (3,3), padding='same')(input_data)
inner = Activation('relu')(inner)
inner = MaxPooling2D(pool_size=(2,2),name='max1')(inner)
inner = Conv2D(64, (3,3), padding='same')(inner)
inner = Activation('relu')(inner)
inner = MaxPooling2D(pool_size=(2,2),name='max2')(inner)
# Note: this layer takes input_data, not inner, so the two blocks above do not feed the output
inner = Conv2D(128, (3,3), padding='same')(input_data)
inner = Activation('relu')(inner)
inner = MaxPooling2D(pool_size=(2,2),name='max3')(inner)
inner = Conv2D(128, (3,3), padding='same')(inner)
inner = Activation('relu')(inner)
inner = MaxPooling2D(pool_size=(2,2),name='max4')(inner)
inner = Conv2D(256, (3,3), padding='same')(inner)
inner = Activation('relu')(inner)
inner = MaxPooling2D(pool_size=(4,2),name='max5')(inner)
inner = Reshape(target_shape = ((32,256)), name='reshape')(inner)

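In the preceding code, we build a CNN that converts a picture of shape 32 x 128 into a feature sequence of shape 32 x 256 (32 time steps with 256 features each). Note that the third convolution is applied to input_data rather than inner, so only the last three convolution blocks feed the reshape layer: their pooling steps reduce the 32 x 128 input to 2 x 16 with 256 channels, which is 8,192 values, the same as 32 x 256.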
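The shape bookkeeping can be checked with plain arithmetic. The sketch below assumes the Keras defaults for MaxPooling2D (non-overlapping windows, no padding) and convolutions with padding='same', which leave the height and width unchanged. The helper pooled_shape is hypothetical.

```python
def pooled_shape(height, width, pool_sizes):
    """Apply non-overlapping pooling windows to a (height, width) pair."""
    for pool_h, pool_w in pool_sizes:
        height //= pool_h
        width //= pool_w
    return height, width

# Pooling layers max3, max4 and max5, which follow the convolution on input_data
h, w = pooled_shape(32, 128, [(2, 2), (2, 2), (4, 2)])
channels = 256  # filters in the last convolution
flat_size = h * w * channels

# For comparison: chaining all five pooling layers from the input
h_all, w_all = pooled_shape(32, 128, [(2, 2), (2, 2), (2, 2), (2, 2), (4, 2)])
print((h, w), flat_size, (h_all, w_all))
```

Under these assumptions, the three pooling steps leave exactly as many values as the 32 x 256 reshape target needs, whereas chaining all five pooling layers would shrink the height to zero.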

gru_1 = GRU(256, return_sequences = True, name = 'gru_1')(inner)
gru_2 = GRU(256, return_sequences = True, go_backwards = True, name = 'gru_2')(inner)
mix_1 = add([gru_1, gru_2])
gru_3 = GRU(256, return_sequences = True, name = 'gru_3')(inner)
gru_4 = GRU(256, return_sequences = True, go_backwards = True, name = 'gru_4')(inner)

The architecture of the model up to the layers defined so far is as follows:

In the preceding code, we pass the features obtained from the CNN into GRU layers. Note that gru_3 and gru_4 also take inner as their input, so mix_1 is computed but is not consumed by any later layer in this listing. The architecture continues from the preceding graph as follows:

In the following code, we concatenate the outputs of the forward GRU (gru_3) and the backward GRU (gru_4) so that features from both directions are taken into account:

merged = concatenate([gru_3, gru_4])

The architecture after adding the preceding layer is as follows:

In the following code, we pass the GRU output features through a dense layer and apply softmax to obtain, at each time step, a probability distribution over the 80 possible output values:

dense = TimeDistributed(Dense(80))(merged)
y_pred = TimeDistributed(Activation('softmax', name='softmax'))(dense)

The architecture of the model continues as follows:

Initialize the variables that are required for the CTC loss:

from keras.optimizers import Adam
Optimizer = Adam()
labels = Input(name = 'the_labels', shape=[32], dtype='float32')
input_length = Input(name='input_length', shape=[1],dtype='int64')
label_length = Input(name='label_length',shape=[1],dtype='int64')
output = Lambda(ctc_loss, output_shape=(1,),name='ctc')([y_pred, labels, input_length, label_length])

In the preceding code, we specify that y_pred (the predicted character values), the actual labels, the input length, and the label length are the inputs to the CTC loss function.

Build and compile the model as follows:

model = Model(inputs = [input_data, labels, input_length, label_length], outputs= output)
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer = Optimizer)

Note that we pass multiple inputs to the model. The loss given to compile simply returns y_pred, because the output of the ctc layer is already the CTC loss value. The CTC calculation is as follows:

Create the following vectors of inputs and outputs:

x2 = 1-np.array(x_new[:num_images])/255
x2 = x2.reshape(x2.shape[0],x2.shape[1],x2.shape[2],1)
y2 = np.array(y2[:num_images])
input_lengths = input_lengths[:num_images]
label_lengths = label_lengths[:num_images]

Fit the model on multiple batches of pictures, as demonstrated in the following code:

import random

for i in range(100):
     # Sample 32 indices, leaving the last 100 pictures out for testing
     samp=random.sample(range(x2.shape[0]-100),32)
     x3 = np.array([x2[j] for j in samp])
     y3 = np.array([y2[j] for j in samp])
     input_lengths2 = np.array([input_lengths[j] for j in samp])
     label_lengths2 = np.array([label_lengths[j] for j in samp])
     inputs = {
         'the_input': x3,
         'the_labels': y3,
         'input_length': input_lengths2,
         'label_length': label_lengths2,
     }
     # Dummy targets: the model output is already the CTC loss
     outputs = {'ctc': np.zeros([32])}
     model.fit(inputs, outputs,batch_size = 32, epochs=1, verbose =2)

In the preceding code, we sample 32 pictures at a time, convert them into an array, and fit the model so that the CTC loss is driven toward zero.

Note that we exclude the last 100 pictures (in x2) from the training input so that we can test the model's accuracy on that data.

Furthermore, we loop over small random batches many times, because loading all the pictures into RAM and converting them into a single array is very likely to crash the system due to the huge memory requirement.
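The hold-out logic in the sampling step can be sketched in isolation as follows. The helper sample_train_indices and the fixed seed are hypothetical; the seed is there only to make the sketch repeatable and is not used in the recipe.

```python
import random

def sample_train_indices(num_total, num_holdout, batch_size, rng):
    """Pick batch_size distinct indices from all but the last num_holdout items."""
    return rng.sample(range(num_total - num_holdout), batch_size)

rng = random.Random(0)  # fixed seed, only for repeatability of this sketch
batch = sample_train_indices(num_total=50000, num_holdout=100, batch_size=32, rng=rng)
print(len(batch), max(batch))
```

Because the sampled range stops 100 items before the end, none of the held-out pictures can appear in a training batch.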

The training loss over increasing epochs is as follows:

Predict the output at each time step for a test picture, using the following code:

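import matplotlib.pyplot as plt

model2 = Model(inputs = input_data, outputs = y_pred)
pred = model2.predict(x2[-5].reshape(1,32,128,1))

# Most likely class at each of the 32 time steps
pred2 = np.argmax(pred[0,:],axis=1)
out = ""
for i in pred2:
  # Skip the blank index
  if(i==79):
    continue
  else:
    out += charList[i]
# Show the same picture that was passed to predict
plt.imshow(x2[-5].reshape(32,128))
plt.title('Predicted word: '+out)
plt.grid(False)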

In the preceding code, we discard the output at a time step if the predicted index is 79, which is the blank.
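The loop in the recipe only removes blanks. Greedy CTC decoding as it is usually described also merges consecutive repeated predictions before dropping the blanks, so that a character predicted over several adjacent time steps is emitted once. The following sketch shows that variant; it is a swapped-in technique rather than the recipe's own method, and the helper name and toy vocabulary are hypothetical.

```python
def ctc_greedy_decode(indices, char_list, blank):
    """Collapse repeated indices, then drop blanks and map to characters."""
    chars = []
    prev = None
    for idx in indices:
        # Emit a character only when it differs from the previous step
        if idx != prev and idx != blank:
            chars.append(char_list[idx])
        prev = idx
    return "".join(chars)

# Per-time-step argmax output for a toy vocabulary; index 2 is the blank
decoded = ctc_greedy_decode([0, 0, 2, 2, 1, 1, 2, 1], char_list=['a', 'b'], blank=2)
print(decoded)
```

A blank between two identical indices keeps them as separate characters, which is how double letters survive the collapsing step.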

A test example and its corresponding prediction (in the title) are as follows:
