How it works...

In step 1, we preprocessed the raw data by removing any punctuation and non-alphanumeric characters, normalizing all the Unicode characters to ASCII, and converting all the data into lowercase. We created lists of German and English phrases and combined them into a DataFrame for easy data manipulation.
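
The cleaning itself can be done with base R. The following is a minimal sketch, assuming the raw phrases are held in character vectors called german_phrases and english_phrases (these names and the column labels are illustrative):

  clean_text <- function(x) {
    x <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")  # normalize Unicode to ASCII
    x <- tolower(x)                                        # convert to lowercase
    x <- gsub("[^a-z0-9 ]", "", x)                         # drop punctuation and non-alphanumeric characters
    trimws(x)
  }

  data <- data.frame(german  = clean_text(german_phrases),
                     english = clean_text(english_phrases),
                     stringsAsFactors = FALSE)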

In a sequence-to-sequence model, both the input and output phrases need to be converted into integer sequences of a fixed length. Thus, in step 2, we calculated the length, in words, of the longest sentence in each list; these values will be used to pad the sentences in their respective languages in the upcoming steps.
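
A simple way to obtain these lengths, assuming the data frame from the previous sketch, is to count the whitespace-separated words in each phrase and take the maximum per language:

  word_count <- function(x) sapply(strsplit(x, " "), length)
  ger_max_len <- max(word_count(data$german))    # longest German sentence, in words
  eng_max_len <- max(word_count(data$english))   # longest English sentence, in words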

Next, in step 3, we created tokenizers for both the German and English phrases. To work with language models, we need to break the input text into tokens. Tokenization produces a word index: a dictionary in which each word is mapped to an integer according to its overall frequency in the dataset.

The num_words argument of text_tokenizer() can be used to define the maximum number of words to keep, based on word frequency.
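
A minimal sketch of this step, assuming the keras R package and the data frame from the earlier sketches (the variable names are illustrative):

  library(keras)

  # Fit one tokenizer per language; num_words could be passed to text_tokenizer()
  # to cap the vocabulary, but is left at its default here
  ger_tokenizer <- text_tokenizer() %>% fit_text_tokenizer(data$german)
  eng_tokenizer <- text_tokenizer() %>% fit_text_tokenizer(data$english)

  # Vocabulary sizes (+1 accounts for the reserved index 0 used for padding)
  ger_vocab_size <- length(ger_tokenizer$word_index) + 1
  eng_vocab_size <- length(eng_tokenizer$word_index) + 1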

In step 4, we created a function that maps text, in both German and English, to sequences of integers using the texts_to_sequences() function. Each integer represents a particular word in the word indexes we created in the preceding step. The function also pads these sequences with zeros so that they all have a uniform length, namely the maximum sentence length for that language. Note that the padding='post' argument of the pad_sequences() function appends the zeros at the end of each sequence. Next, in step 5, we split the data into training and testing datasets and encoded them into sequences by applying the custom encode_pad_sequences() function that we created in the preceding step.
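
The helper described above could look like the following sketch; the function name follows the recipe, while the split shown afterwards is purely illustrative:

  encode_pad_sequences <- function(tokenizer, length, text) {
    seq <- texts_to_sequences(tokenizer, text)                     # words -> integer indices
    seq <- pad_sequences(seq, maxlen = length, padding = "post")   # append zeros up to maxlen
    seq
  }

  # Illustrative usage: encode the German (input) side of a train/test split
  train_idx <- sample(nrow(data), size = 0.8 * nrow(data))
  x_train <- encode_pad_sequences(ger_tokenizer, ger_max_len, data$german[train_idx])
  x_test  <- encode_pad_sequences(ger_tokenizer, ger_max_len, data$german[-train_idx])

Because the model is later compiled with categorical_crossentropy, the English target sequences additionally need to be one-hot encoded (for example, with to_categorical()) before training.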

In step 6, we defined the sequence-to-sequence model's configuration. We used an encoder-decoder LSTM model, in which the input sequence is encoded by an encoder model and the encoded representation is then decoded, word by word, by a decoder model. The encoder consisted of an embedding layer followed by an LSTM layer, whereas the decoder consisted of another LSTM layer followed by a dense layer. The embedding layer in the encoder transforms the input feature space into a latent representation with n dimensions; in our example, it maps each word to 512 latent features. In this architecture, the encoder produces a two-dimensional output whose length is equal to the number of memory units in its LSTM layer, while the decoder expects a three-dimensional input in order to produce a decoded sequence. To bridge this gap, we used layer_repeat_vector(), which repeats the 2D encoder output multiple times to create the 3D input expected by the decoder.
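
Putting these pieces together, the architecture could be defined roughly as follows; the 512 latent dimensions follow the text, while the remaining values (vocabulary sizes and sentence lengths) come from the earlier sketches:

  model <- keras_model_sequential() %>%
    layer_embedding(input_dim = ger_vocab_size, output_dim = 512,
                    input_length = ger_max_len) %>%                  # encoder embedding
    layer_lstm(units = 512) %>%                                      # encoder LSTM: 2D output (batch, units)
    layer_repeat_vector(eng_max_len) %>%                             # repeat the 2D output into a 3D tensor
    layer_lstm(units = 512, return_sequences = TRUE) %>%             # decoder LSTM
    time_distributed(layer_dense(units = eng_vocab_size,
                                 activation = "softmax"))            # per-timestep word probabilities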

In the next step, we compiled the model and trained it on the training data. For model compilation, we used RMSprop as the optimizer and categorical_crossentropy as the loss function. We trained the model for 50 epochs with a batch size of 500, using an 80:20 split between the training and validation data. Finally, we plotted the training and validation loss.
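
In code, compilation and training could look like the following; the optimizer, loss, epochs, batch size, and validation split follow the text, while y_train is assumed to be the one-hot encoded English target sequences:

  model %>% compile(optimizer = "rmsprop",
                    loss = "categorical_crossentropy")

  history <- model %>% fit(
    x_train, y_train,
    epochs = 50,
    batch_size = 500,
    validation_split = 0.2    # 80:20 training/validation split
  )
  plot(history)               # training versus validation loss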

In the last step, we predicted the English translation of a sample German phrase from our test dataset. We created a custom reverse_word_index function to build a key-value mapping from integer indexes to words for both German and English, and then used this mapping to convert the predicted integer sequences back into words.
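
A sketch of this decoding step is shown below; reverse_word_index follows the recipe's naming, and the rest is illustrative. Because index 0 is reserved for padding, the predicted class IDs are shifted by one and zeros are dropped before the lookup:

  reverse_word_index <- function(tokenizer) {
    idx <- tokenizer$word_index
    setNames(names(idx), unlist(idx))        # integer index -> word
  }

  eng_index_to_word <- reverse_word_index(eng_tokenizer)

  pred <- model %>% predict(x_test[1, , drop = FALSE])  # dims: (1, eng_max_len, eng_vocab_size)
  pred_ids <- apply(pred[1, , ], 1, which.max) - 1      # most likely class per position; 0 = padding
  pred_ids <- pred_ids[pred_ids > 0]
  translation <- paste(eng_index_to_word[as.character(pred_ids)], collapse = " ")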
