How to do it...

Before moving on to the model-building part, we need to preprocess the input data. Let's get started:

We start by cleaning the data by removing any punctuation and non-alphanumeric characters, normalizing all the Unicode characters to ASCII, and converting all the data into lowercase:

data_cleaning <- function(sentence) {
 sentence = gsub('[[:punct:] ]+',' ',sentence)
 sentence = gsub("[^[:alnum:]\\-\\.\\s]", " ", sentence)
 sentence = stringi::stri_trans_general(sentence, "latin-ascii")
 sentence = tolower(sentence)
 sentence
}


sentences <- map(sentences,data_cleaning)
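As a quick sanity check of the cleaning rules, the following sketch applies the punctuation and lowercase steps to a made-up sentence using base R only. The helper name clean_basic and the sample text are assumptions for illustration. The stringi transliteration step is left out so that the snippet has no dependencies, and the trimws() call is an addition that data_cleaning() does not make:

```r
# Hypothetical helper: mirrors the punctuation and case steps of data_cleaning()
# but skips stringi::stri_trans_general() so that only base R is needed.
clean_basic <- function(sentence) {
  # Collapse every run of punctuation and spaces into a single space
  sentence <- gsub("[[:punct:] ]+", " ", sentence)
  # Lowercase, then drop the leading/trailing space the step above can leave
  trimws(tolower(sentence))
}

cleaned <- clean_basic("Hello, World! How are you?")
print(cleaned)
```

Without the final trimws(), the cleaned string would end in a space, because the trailing question mark is replaced rather than deleted.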

Next, we create two separate lists of German and English phrases. Later, we will capture the maximum sentence length in each list and use these lengths to pad the sentences:

english_sentences = list()
german_sentences = list()
for(i in seq_along(sentences)){
 current_sentence <- sentences[i]%>%unlist()%>%str_split('\t')
 english_sentences <- append(english_sentences,current_sentence[1])
 german_sentences <- append(german_sentences,current_sentence[2]) 
}

Then, we convert the data into a DataFrame so that it can be manipulated easily:

data <- do.call(rbind, Map(data.frame, "German"=german_sentences,"English"=english_sentences))
head(data,10)

The following screenshot shows the input data in the form of a DataFrame:

Now, we compute the maximum number of words per sentence in the German and English phrases:

german_length = max(sapply(strsplit(as.character(data[,"German"] ), " "), length))
print(paste0("Maximum length of a sentence in German data:",german_length))

eng_length = max(sapply(strsplit(as.character(data[,"English"] ), " "), length))
print(paste0("Maximum length of a sentence in English data:", eng_length))
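One thing to watch with this way of counting is leading whitespace, which the cleaning step can leave behind. The sketch below uses made-up sentences to show that strsplit() produces an empty first token for a string that starts with a space, which inflates the count by one. Trimming first, as shown, is a suggestion rather than part of the recipe:

```r
# Toy sentences (made up); the second has a leading space, the third a trailing one
toy <- c("i am here", " go", "go ")

# Same counting approach as above: split on spaces and count the pieces
word_counts <- sapply(strsplit(toy, " "), length)
print(word_counts)

# Trimming surrounding whitespace first gives the counts one would usually expect
trimmed_counts <- sapply(strsplit(trimws(toy), " "), length)
print(trimmed_counts)
```

A slightly larger maximum length only adds extra padding, so the effect on the model is small, but it is worth knowing where an unexpected length comes from.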

From the following screenshot, we can infer that the maximum length of a sentence in German is 10, whereas for English, it is 6:

Now, we build a function for tokenization and use it to tokenize the German and English phrases:

tokenization <- function(lines){
 tokenizer = text_tokenizer()
 tokenizer = fit_text_tokenizer(tokenizer,lines)
 return(tokenizer)
}
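To make concrete what a fitted tokenizer holds, here is a base R sketch that builds a frequency-ranked word index for a made-up corpus. It is only an approximation of the Keras tokenizer (no filtering or lowercasing, and the order of tied words may differ), but it shows why the vocabulary size is computed as the length of the word index plus one: index 0 is kept free for padding.

```r
# Toy corpus (made up for illustration)
lines <- c("the cat sat", "the dog sat", "the bird")

# Count how often each word occurs and rank by descending frequency
words <- unlist(strsplit(lines, " "))
freq <- sort(table(words), decreasing = TRUE)

# Index 1 goes to the most frequent word; 0 is left free for padding
word_index <- setNames(seq_along(freq), names(freq))
vocab_size <- length(word_index) + 1

print(word_index)
print(vocab_size)
```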

Here, we prepare the German tokenizer:

german_tokenizer = tokenization(data[,"German"])
german_vocab_size = length(german_tokenizer$word_index) + 1

print(paste0('German Vocabulary Size:',german_vocab_size))

From the following screenshot, we can see that the German vocabulary size is 3,542:

Now, we prepare the English tokenizer:

eng_tokenizer = tokenization(data[,"English"])
eng_vocab_size = length(eng_tokenizer$word_index) + 1

print(paste0('English Vocabulary Size:',eng_vocab_size))

From the following screenshot, we can see that the English vocabulary size is 2,189:

Next, we create a function that will encode the phrases into a sequence of integers and pad the sequences to make each phrase uniform in length:

# Function to encode and pad sequences
encode_pad_sequences <- function(tokenizer, length, lines){
 # Encoding text to integers
 seq = texts_to_sequences(tokenizer,lines)
 # Padding text to maximum length sentence
 seq = pad_sequences(seq, maxlen=length, padding='post')
 return(seq)
}
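The effect of post-padding can be seen without Keras. The following sketch uses a hypothetical helper, pad_post, as a base R stand-in for pad_sequences() with padding='post' on a single sequence. It assumes the default truncating = "pre" behavior, where an over-long sequence keeps its last maxlen items:

```r
# Hypothetical base R stand-in for pad_sequences(..., padding = "post")
pad_post <- function(seq, maxlen) {
  # Keep only the last maxlen items, mirroring the default truncating = "pre"
  seq <- tail(seq, maxlen)
  # Append zeros on the right until the sequence has maxlen entries
  c(seq, rep(0L, maxlen - length(seq)))
}

short_seq <- pad_post(c(4L, 7L), 4)  # shorter than maxlen: zeros are appended
long_seq <- pad_post(1:6, 4)         # longer than maxlen: the front is cut off
print(short_seq)
print(long_seq)
```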

Next, we divide the data into training and testing datasets and apply the encode_pad_sequences() function we defined in the previous step to these datasets:

train_data <- data[1:9000,]
test_data <- data[9001:10000,]

We prepare the training and test data:

x_train <- encode_pad_sequences(german_tokenizer,german_length,train_data[,"German"])
y_train <- encode_pad_sequences(eng_tokenizer,eng_length,train_data[,"English"])
y_train <- to_categorical(y_train,num_classes = eng_vocab_size)

x_test <- encode_pad_sequences(german_tokenizer,german_length,test_data[,"German"])
y_test <- encode_pad_sequences(eng_tokenizer,eng_length,test_data[,"English"])
y_test <- to_categorical(y_test,num_classes = eng_vocab_size)
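The to_categorical() call turns each integer in the padded target matrix into a one-hot vector, so the targets become a three-dimensional array of shape (samples, timesteps, vocabulary size). The sketch below reproduces that idea in base R on a tiny made-up matrix. The helper name one_hot is an assumption, and this is an illustration of the encoding rather than the Keras implementation:

```r
# Hypothetical helper: one-hot encode a matrix of class indices that start at 0
one_hot <- function(m, num_classes) {
  out <- array(0, dim = c(nrow(m), ncol(m), num_classes))
  for (i in seq_len(nrow(m))) {
    for (j in seq_len(ncol(m))) {
      # Class k is stored at position k + 1 because R indexes from 1
      out[i, j, m[i, j] + 1] <- 1
    }
  }
  out
}

# Two toy "sentences" of two tokens each; 0 is the padding index
toy_y <- matrix(c(1L, 2L, 3L, 0L), nrow = 2, byrow = TRUE)
encoded <- one_hot(toy_y, num_classes = 4)
print(dim(encoded))
```

This shape is what the final time-distributed softmax layer of the model produces, which is why the targets are encoded this way before training.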

Now, we define the model. We initialize a few parameters that will be fed into the model's configuration:

in_vocab = german_vocab_size
out_vocab = eng_vocab_size
in_timesteps = german_length
out_timesteps = eng_length
units = 512
epochs = 70
batch_size = 200

Here, we configure the layers of the model:

model <- keras_model_sequential()
model %>%
 layer_embedding(in_vocab,units, input_length=in_timesteps, mask_zero=TRUE) %>%
 layer_lstm(units = units) %>%
 layer_repeat_vector(out_timesteps)%>%
 layer_lstm(units,return_sequences = TRUE)%>%
 time_distributed(layer_dense(units = out_vocab, activation='softmax'))
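As a rough cross-check on the architecture, the number of trainable parameters can be estimated by hand. The sketch below assumes the standard parameter-count formulas for embedding, LSTM, and dense layers, and uses the vocabulary sizes reported earlier in this recipe (3,542 and 2,189). The variable names are chosen for illustration only:

```r
# Sizes taken from the text: vocabularies of 3,542 and 2,189 words, and 512 units
in_vocab <- 3542
out_vocab <- 2189
units <- 512

# Embedding: one vector of length `units` per vocabulary entry
embedding_params <- in_vocab * units

# An LSTM has four gates, each with input weights, recurrent weights, and a bias
lstm_params <- function(input_dim, units) 4 * (units * (input_dim + units) + units)
encoder_params <- lstm_params(units, units)  # input is the embedding output
decoder_params <- lstm_params(units, units)  # input is the repeated encoder state

# Dense layer: one weight per (unit, word) pair plus one bias per word
dense_params <- units * out_vocab + out_vocab

total_params <- embedding_params + encoder_params + decoder_params + dense_params
print(total_params)
```

The repeat-vector layer and the time-distributed wrapper add no parameters of their own, so they do not appear in the sum.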

Let's have a look at the summary of the model:

summary(model)

The following screenshot shows the summary of the translation model:

Now, we compile the model so that it is ready for training:

model %>% compile(optimizer = "adam",loss = 'categorical_crossentropy')

Then, we define the callbacks and checkpoints:

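model_name <- "model_nmt"

checkpoint_dir <- "checkpoints_nmt"
# Create the checkpoint directory; no warning is raised if it already exists
dir.create(checkpoint_dir, showWarnings = FALSE)
# {epoch} and {val_loss} are placeholders that Keras fills in when saving
filepath <- file.path(checkpoint_dir, paste0(model_name,"weights.{epoch:02d}-{val_loss:.2f}.hdf5"))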

cp_callback <- list(callback_model_checkpoint(mode = "min",
 filepath = filepath,
 save_best_only = TRUE,
 verbose = 1))
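The placeholders in the checkpoint file path are filled in by Keras at the end of each epoch. To see what a resulting file name looks like, the sketch below imitates the formatting with sprintf(). The epoch number and validation loss are hypothetical values, not results from this model:

```r
# Hypothetical values for one epoch; real values come from training
epoch <- 7L
val_loss <- 0.8349

# %02d pads the epoch to two digits; %.2f rounds the loss to two decimals
suffix <- sprintf("%02d-%.2f", epoch, val_loss)
example_name <- paste0("model_nmt", "weights.", suffix, ".hdf5")
example_path <- file.path("checkpoints_nmt", example_name)
print(example_path)
```

Note that the model name and the word weights are joined without a separator, so the file names start with model_nmtweights.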

Next, we fit the training data to the model:

model %>% fit(x_train,y_train,epochs = epochs,batch_size = batch_size,validation_split = 0.2,callbacks = cp_callback,verbose = 2)

In this step, we generate predictions for test data:

predicted = model %>% predict_classes(x_test)
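The predict_classes() function has been deprecated in later Keras releases, so depending on the installed version it may not be available. The same class indices can be obtained by taking the position of the largest probability along the last axis of the array returned by predict(). The sketch below shows that step in base R on a made-up probability array; the array values are assumptions for illustration:

```r
# Toy softmax output: 1 sample, 2 timesteps, 3 classes (values are made up)
probs <- array(0, dim = c(1, 2, 3))
probs[1, 1, ] <- c(0.1, 0.7, 0.2)
probs[1, 2, ] <- c(0.6, 0.3, 0.1)

# For every (sample, timestep) pair, pick the class with the highest probability.
# which.max() is 1-based, so subtract 1 to get the 0-based word index.
classes <- apply(probs, c(1, 2), which.max) - 1
print(classes)
```

With a trained model, probs would be replaced by the output of predict() on x_test, and classes would play the role of predicted.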

Let's create a function that will create a reversed list of key-value pairs of the word index. We will use this to decode the phrases in German and English:

reverse_word_index <- function(tokenizer){
 reverse_word_index <- names(tokenizer$word_index)
 names(reverse_word_index) <- tokenizer$word_index
 return(reverse_word_index)
}

german_reverse_word_index <- reverse_word_index(german_tokenizer)
eng_reverse_word_index <- reverse_word_index(eng_tokenizer)

Let's define a function that decodes a sequence of word indices back into a phrase. We will use it to look at a sample German phrase from the test data and its predicted English translation:

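index_to_word <- function(data_sample,word_index_dict){
 # Start from an empty string and append one decoded word per position
 phrase = ""
 for(i in 1:length(data_sample)){
  index = data_sample[[i]]
  # A padding index of 0 matches no entry, so only a space is appended
  word = word_index_dict[index]
  # word = if(!is.null(word)) word else "?"
  phrase = paste0(phrase," ",word)
 }
 return(phrase)
}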
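The decoding step can also be written without a loop. The following sketch is a hypothetical vectorised variant, decode_indices, shown on a made-up reverse index. Unlike index_to_word(), it drops the padding zeros before the lookup and looks words up by name rather than by position, so the result has no extra spaces:

```r
# Toy reverse index (made up): names are the indices, values are the words
toy_reverse_index <- c("1" = "the", "2" = "cat", "3" = "sat")

# Hypothetical vectorised decoder that skips the 0 used for padding
decode_indices <- function(indices, reverse_index) {
  kept <- indices[indices > 0]
  # Look words up by name so the result does not depend on element order
  paste(reverse_index[as.character(kept)], collapse = " ")
}

decoded <- decode_indices(c(1, 2, 3, 0, 0), toy_reverse_index)
print(decoded)
```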

Now, we can print some sample German sentences and their original and predicted translations in English:

cat(paste0("The german sample phrase is -->",index_to_word(x_test[90,],german_reverse_word_index)))
cat('\n')
cat(paste0("The actual translation in english is -->",as.character(test_data[90,"English"])))
cat('\n')
cat(paste0("The predicted translation in english is -->",index_to_word(predicted[90,],eng_reverse_word_index)))

The following screenshot shows one example of translation being done by our model. We can see that our model did a great job:

Let's have a look at one more translation, as shown in the following code:

cat(paste0("The german sample phrase is -->",index_to_word(x_test[6,],german_reverse_word_index)))
cat('\n')
cat(paste0("The actual translation in english is -->",as.character(test_data[6,"English"])))
cat('\n')
cat(paste0("The predicted translation in english is -->",index_to_word(predicted[6,],eng_reverse_word_index)))

The following screenshot shows another accurate translation that was done by our model:

Now, let's move on to the nitty-gritty of the model and look at a detailed explanation of how it works.
