How to do it...

Before moving on to the model-building part, we need to preprocess the input data. Let's get started:

  1. We start by cleaning the data: we remove any punctuation and non-alphanumeric characters, normalize all Unicode characters to ASCII, and convert all the text to lowercase:

data_cleaning <- function(sentence) {
  # Replace punctuation (and runs of spaces) with a single space
  sentence = gsub('[[:punct:] ]+', ' ', sentence)
  # Keep only letters, digits, whitespace, periods, and hyphens
  sentence = gsub("[^[:alnum:][:space:].-]", " ", sentence)
  # Transliterate accented characters to plain ASCII
  sentence = stringi::stri_trans_general(sentence, "latin-ascii")
  sentence = tolower(sentence)
  sentence
}

sentences <- map(sentences, data_cleaning)
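
To make sure the cleaning behaves as expected, we can try it on a made-up sentence. The phrase below is purely illustrative and not part of the dataset, and the exact output may vary slightly with your locale:

# Illustration only: clean a hypothetical German phrase
sample_phrase <- "Wie geht's dir? Übrigens, DANKE!"
data_cleaning(sample_phrase)
# Expected output, roughly: "wie geht s dir ubrigens danke "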
  2. Next, we create two separate lists of German and English phrases and capture the maximum sentence length in each of them. We will use these lengths to pad the sentences:

english_sentences = list()
german_sentences = list()

for(i in 1:length(sentences)){
  current_sentence <- sentences[i] %>% unlist() %>% str_split(' ')
  english_sentences <- append(english_sentences, current_sentence[1])
  german_sentences <- append(german_sentences, current_sentence[2])
}

Then, we convert the data into a data frame so that it can be manipulated easily:

data <- do.call(rbind, Map(data.frame, "German" = german_sentences, "English" = english_sentences))
head(data, 10)

The following screenshot shows the input data in the form of a DataFrame:

Now, let's look at the maximum number of words in a sentence across the German and English phrases:

german_length = max(sapply(strsplit(as.character(data[, "German"]), " "), length))
print(paste0("Maximum length of a sentence in German data: ", german_length))

eng_length = max(sapply(strsplit(as.character(data[, "English"]), " "), length))
print(paste0("Maximum length of a sentence in English data: ", eng_length))

From the following screenshot, we can infer that the maximum length of a sentence in German is 10, whereas for English, it is 6:

  3. Now, we build a function for tokenization and use it to tokenize the German and English phrases:

tokenization <- function(lines){
  # Fit a Keras text tokenizer on the given lines
  tokenizer = text_tokenizer()
  tokenizer = fit_text_tokenizer(tokenizer, lines)
  return(tokenizer)
}

Here, we prepare the German tokenizer:

german_tokenizer = tokenization(data[,"German"])
german_vocab_size = length(german_tokenizer$word_index) + 1

print(paste0('German Vocabulary Size:',german_vocab_size))

From the following screenshot, we can see that the German vocabulary size is 3,542:

Now, we prepare the English tokenizer:

eng_tokenizer = tokenization(data[,"English"])
eng_vocab_size = length(eng_tokenizer$word_index) + 1

print(paste0('English Vocabulary Size:',eng_vocab_size))

From the following screenshot, we can see that the English vocabulary size is 2,189:

  4. Next, we create a function that encodes the phrases into sequences of integers and pads the sequences so that every phrase has a uniform length:

# Function to encode and pad sequences
encode_pad_sequences <- function(tokenizer, length, lines){
  # Encode the text as sequences of integers
  seq = texts_to_sequences(tokenizer, lines)
  # Pad each sequence to the maximum sentence length
  seq = pad_sequences(seq, maxlen = length, padding = 'post')
  return(seq)
}
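
To get a feel for what encode_pad_sequences() returns, we can run it on the first couple of German phrases. The exact integers depend on the fitted tokenizer, so the values you see will differ; only the shape of the output is fixed:

# Illustration only: encode and pad the first two German phrases
sample_seq <- encode_pad_sequences(german_tokenizer, german_length, as.character(data[1:2, "German"]))
dim(sample_seq)  # 2 rows, german_length (10) columns
sample_seq       # integer-encoded phrases, zero-padded at the end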
  5. Next, we divide the data into training and testing datasets and apply the encode_pad_sequences() function we defined in step 4 to both of them:

train_data <- data[1:9000,]
test_data <- data[9001:10000,]

We prepare the training and test data:

x_train <- encode_pad_sequences(german_tokenizer, german_length, train_data[, "German"])
y_train <- encode_pad_sequences(eng_tokenizer, eng_length, train_data[, "English"])
# One-hot encode the English target sequences over the English vocabulary
y_train <- to_categorical(y_train, num_classes = eng_vocab_size)

x_test <- encode_pad_sequences(german_tokenizer, german_length, test_data[, "German"])
y_test <- encode_pad_sequences(eng_tokenizer, eng_length, test_data[, "English"])
y_test <- to_categorical(y_test, num_classes = eng_vocab_size)
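
As a quick sanity check, the prepared arrays should have roughly the following dimensions, given 9,000 training pairs, German inputs padded to length 10, and English targets one-hot encoded over the English vocabulary:

dim(x_train)  # expected: 9000 10  (samples x in_timesteps)
dim(y_train)  # expected: 9000 6 <eng_vocab_size>  (samples x out_timesteps x vocabulary)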
  6. Now, we define the model. We initialize a few parameters that will be fed into the model's configuration:

in_vocab = german_vocab_size
out_vocab = eng_vocab_size
in_timesteps = german_length
out_timesteps = eng_length
units = 512
epochs = 70
batch_size = 200

Here, we configure the layers of the model:

model <- keras_model_sequential()
model %>%
  layer_embedding(in_vocab, units, input_length = in_timesteps, mask_zero = TRUE) %>%
  layer_lstm(units = units) %>%
  layer_repeat_vector(out_timesteps) %>%
  layer_lstm(units, return_sequences = TRUE) %>%
  time_distributed(layer_dense(units = out_vocab, activation = 'softmax'))

Let's have a look at the summary of the model:

summary(model)

The following screenshot shows the summary of the translation model:
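
Given in_timesteps = 10, out_timesteps = 6, and units = 512, the layer output shapes in the summary should look roughly as follows; the last dimension of the final layer equals eng_vocab_size:

# Approximate layer output shapes for this configuration:
# layer_embedding          -> (None, 10, 512)   one 512-dimensional vector per German token
# layer_lstm (encoder)     -> (None, 512)       only the final hidden state is kept
# layer_repeat_vector      -> (None, 6, 512)    encoder state repeated for each output step
# layer_lstm (decoder)     -> (None, 6, 512)    return_sequences = TRUE
# time_distributed (dense) -> (None, 6, eng_vocab_size)  softmax over the English vocabulary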

  7. Now, we compile the model and train it:

model %>% compile(optimizer = "adam", loss = 'categorical_crossentropy')

Then, we define the callbacks and checkpoints:

model_name <- "model_nmt"

checkpoint_dir <- "checkpoints_nmt"
dir.create(checkpoint_dir)
filepath <- file.path(checkpoint_dir, paste0(model_name,"weights.{epoch:02d}-{val_loss:.2f}.hdf5",sep=""))

cp_callback <- list(callback_model_checkpoint(mode = "min",
filepath = filepath,
save_best_only = TRUE,
verbose = 1))
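
Optionally, an early-stopping callback can be appended to the same list so that training stops once the validation loss stops improving. This is not part of the original recipe, and the patience of 10 epochs is just an illustrative choice:

# Optional: stop training early when val_loss has not improved for 10 epochs
cp_callback <- append(cp_callback,
                      list(callback_early_stopping(monitor = "val_loss",
                                                   patience = 10)))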

Next, we fit the training data to the model:

model %>% fit(x_train, y_train,
              epochs = epochs,
              batch_size = batch_size,
              validation_split = 0.2,
              callbacks = cp_callback,
              verbose = 2)
  8. In this step, we generate predictions for the test data:

predicted = model %>% predict_classes(x_test)

Let's create a function that reverses the key-value pairs of the word index, so that integer indices map back to words. We will use this to decode the phrases in German and English:

reverse_word_index <- function(tokenizer){
  # Build a lookup vector: values are words, names are their integer indices
  reverse_word_index <- names(tokenizer$word_index)
  names(reverse_word_index) <- tokenizer$word_index
  return(reverse_word_index)
}

german_reverse_word_index <- reverse_word_index(german_tokenizer)
eng_reverse_word_index <- reverse_word_index(eng_tokenizer)
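
To check what the reversed index looks like, we can look up a single integer. The word that comes back depends entirely on the fitted tokenizer; index 1 is typically the most frequent word in the corpus:

# Illustration only: map integer index 1 back to its English word
eng_reverse_word_index[["1"]]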

Let's decode a sample phrase from the test data in German and look at its prediction in English:

index_to_word <- function(data_sample, word_index_dict){
  phrase = list()
  for(i in 1:length(data_sample)){
    index = data_sample[[i]]
    word = word_index_dict[index]
    # word = if(!is.null(word)) word else "?"
    phrase = paste0(phrase, " ", word)
  }
  return(phrase)
}

Now, we can print some sample German sentences and their original and predicted translations in English:

cat(paste0("The german sample phrase is -->",index_to_word(x_test[90,],german_reverse_word_index)))
cat(' ')
cat(paste0("The actual translation in english is -->",as.character(test_data[90,"English"])))
cat(' ')
cat(paste0("The predicted translation in english is -->",index_to_word(predicted[90,],eng_reverse_word_index)))

The following screenshot shows an example of a translation produced by our model. We can see that it did a great job:

Let's have a look at one more translation, as shown in the following code:

cat(paste0("The german sample phrase is -->",index_to_word(x_test[6,],german_reverse_word_index)))
cat(' ')
cat(paste0("The actual translation in english is -->",as.character(test_data[6,"English"])))
cat(' ')
cat(paste0("The predicted translation in english is -->",index_to_word(predicted[6,],eng_reverse_word_index)))

The following screenshot shows another accurate translation that was done by our model:

Now, let's move on to the nitty-gritty of the model and look at a detailed explanation of how it works.
