Before moving on to the model-building part, we need to preprocess the input data. Let's get started:
- We start by cleaning the data: we remove punctuation and other non-alphanumeric characters, transliterate Unicode characters to ASCII, and convert the text to lowercase:
data_cleaning <- function(sentence) {
  # Replace runs of punctuation and spaces with a single space
  sentence = gsub('[[:punct:] ]+', ' ', sentence)
  # Replace any remaining character that is not alphanumeric, whitespace, a period, or a hyphen
  sentence = gsub("[^[:alnum:][:space:].-]", " ", sentence)
  # Transliterate accented and other non-ASCII Latin characters to ASCII
  sentence = stringi::stri_trans_general(sentence, "latin-ascii")
  sentence = tolower(sentence)
  sentence
}
sentences <- map(sentences, data_cleaning)
- Next, we create two separate lists of German and English phrases. Later in this step, we capture the maximum sentence length in each list and use these lengths to pad the sentences:
english_sentences = list()
german_sentences = list()
for(i in seq_along(sentences)){
  # Assumes each element of sentences is a pair: the English phrase first, then the German one
  current_sentence <- sentences[i] %>% unlist()
  english_sentences <- append(english_sentences, current_sentence[1])
  german_sentences <- append(german_sentences, current_sentence[2])
}
Then, we convert the data into a DataFrame so that it can be manipulated easily:
data <- do.call(rbind, Map(data.frame, "German"=german_sentences,"English"=english_sentences))
head(data,10)
The following screenshot shows the input data in the form of a DataFrame:
Now, we compute the maximum number of words per sentence in the German and English phrases:
german_length = max(sapply(strsplit(as.character(data[,"German"] ), " "), length))
print(paste0("Maximum length of a sentence in German data:",german_length))
eng_length = max(sapply(strsplit(as.character(data[,"English"] ), " "), length))
print(paste0("Maximum length of a sentence in English data:", eng_length))
From the following screenshot, we can infer that the maximum length of a sentence in German is 10, whereas for English, it is 6:
- Now, we build a function for tokenization and use it to tokenize the German and English phrases:
tokenization <- function(lines){
tokenizer = text_tokenizer()
tokenizer = fit_text_tokenizer(tokenizer,lines)
return(tokenizer)
}
Here, we prepare the German tokenizer. We add 1 to the size of the word index because index 0 is reserved for padding:
german_tokenizer = tokenization(data[,"German"])
german_vocab_size = length(german_tokenizer$word_index) + 1
print(paste0('German Vocabulary Size:',german_vocab_size))
From the following screenshot, we can see that the German vocabulary size is 3,542:
Now, we prepare the English tokenizer:
eng_tokenizer = tokenization(data[,"English"])
eng_vocab_size = length(eng_tokenizer$word_index) + 1
print(paste0('English Vocabulary Size:',eng_vocab_size))
From the following screenshot, we can see that the English vocabulary size is 2,189:
- Next, we create a function that will encode the phrases into a sequence of integers and pad the sequences to make each phrase uniform in length:
# Function to encode and pad sequences
encode_pad_sequences <- function(tokenizer, length, lines){
# Encoding text to integers
seq = texts_to_sequences(tokenizer,lines)
# Padding text to maximum length sentence
seq = pad_sequences(seq, maxlen=length, padding='post')
return(seq)
}
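To make the effect of post-padding concrete, the following is a minimal base R sketch of what padding with `padding='post'` does to sequences of different lengths. The helper `pad_post()` and the toy sequences are hypothetical and only for illustration; the recipe itself relies on `pad_sequences()` from keras:

```r
# Hypothetical helper: pads each integer sequence with trailing zeros up to maxlen
pad_post <- function(seqs, maxlen) {
  padded <- lapply(seqs, function(s) {
    # Keep only the last maxlen values if a sequence is too long
    # (this mirrors the default truncating = "pre" behavior of pad_sequences())
    s <- tail(s, maxlen)
    c(s, rep(0L, maxlen - length(s)))
  })
  # Stack the padded sequences into a matrix with one row per phrase
  do.call(rbind, padded)
}

toy_sequences <- list(c(4L, 7L), c(2L, 9L, 5L, 1L))
padded <- pad_post(toy_sequences, maxlen = 4)
print(padded)
```

Every row has the same length after padding, which is what allows the phrases to be stacked into a single matrix and fed to the model in batches.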
- Next, we divide the data into training and testing datasets and apply the encode_pad_sequences() function that we defined in the previous step to these datasets:
train_data <- data[1:9000,]
test_data <- data[9001:10000,]
We prepare the training and test data:
x_train <- encode_pad_sequences(german_tokenizer,german_length,train_data[,"German"])
y_train <- encode_pad_sequences(eng_tokenizer,eng_length,train_data[,"English"])
y_train <- to_categorical(y_train,num_classes = eng_vocab_size)
x_test <- encode_pad_sequences(german_tokenizer,german_length,test_data[,"German"])
y_test <- encode_pad_sequences(eng_tokenizer,eng_length,test_data[,"English"])
y_test <- to_categorical(y_test,num_classes = eng_vocab_size)
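After one-hot encoding, the targets should form a three-dimensional array of shape (samples, eng_length, eng_vocab_size), which matches the per-timestep softmax output of the model. The following base R sketch illustrates the idea on a toy matrix; the helper `one_hot()` and the toy data are hypothetical and are not part of the recipe, which uses `to_categorical()`:

```r
# Hypothetical helper: one-hot encodes a matrix of 0-based integer word indices
one_hot <- function(index_matrix, num_classes) {
  out <- array(0, dim = c(nrow(index_matrix), ncol(index_matrix), num_classes))
  for (i in seq_len(nrow(index_matrix))) {
    for (j in seq_len(ncol(index_matrix))) {
      # Shift by 1 because R arrays are 1-based while word indices start at 0
      out[i, j, index_matrix[i, j] + 1] <- 1
    }
  }
  out
}

# Two padded phrases of three timesteps each; 0 marks padding
toy_targets <- matrix(c(3L, 1L, 0L,
                        2L, 2L, 0L), nrow = 2, byrow = TRUE)
encoded <- one_hot(toy_targets, num_classes = 4)
print(dim(encoded))
```

Each timestep of each phrase becomes a vector with a single 1 at the position of its word index, including the padding index 0.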
- Now, we define the model. We initialize a few parameters that will be fed into the model's configuration:
in_vocab = german_vocab_size
out_vocab = eng_vocab_size
in_timesteps = german_length
out_timesteps = eng_length
units = 512
epochs = 70
batch_size = 200
Here, we configure the layers of the model:
model <- keras_model_sequential()
model %>%
layer_embedding(in_vocab,units, input_length=in_timesteps, mask_zero=TRUE) %>%
layer_lstm(units = units) %>%
layer_repeat_vector(out_timesteps)%>%
layer_lstm(units,return_sequences = TRUE)%>%
time_distributed(layer_dense(units = out_vocab, activation='softmax'))
Let's have a look at the summary of the model:
summary(model)
The following screenshot shows the summary of the translation model:
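As a sanity check on the summary, the number of trainable parameters in each layer can be worked out by hand. The sketch below assumes the vocabulary sizes reported earlier (3,542 for German and 2,189 for English), units = 512, and default layer settings with biases enabled; the helper `lstm_params()` is hypothetical. If your vocabulary sizes differ, the totals will differ as well:

```r
in_vocab  <- 3542   # German vocabulary size reported earlier
out_vocab <- 2189   # English vocabulary size reported earlier
units     <- 512

# Embedding: one vector of length `units` per vocabulary entry
embedding_params <- in_vocab * units

# LSTM: four gates, each with input weights, recurrent weights, and a bias
lstm_params <- function(input_dim, units) {
  4 * ((input_dim + units) * units + units)
}
encoder_params <- lstm_params(units, units)  # input is the embedding output
decoder_params <- lstm_params(units, units)  # input is the repeated encoder state

# Dense softmax layer; its weights are shared across all output timesteps
dense_params <- units * out_vocab + out_vocab

total_params <- embedding_params + encoder_params + decoder_params + dense_params
print(total_params)
```

The repeat-vector layer has no parameters of its own, so it does not contribute to the total.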
- Now, we compile the model and train it:
model %>% compile(optimizer = "adam",loss = 'categorical_crossentropy')
Then, we define the callbacks and checkpoints:
model_name <- "model_nmt"
checkpoint_dir <- "checkpoints_nmt"
# Create the checkpoint directory; do not warn if it already exists
dir.create(checkpoint_dir, showWarnings = FALSE)
filepath <- file.path(checkpoint_dir, paste0(model_name,"weights.{epoch:02d}-{val_loss:.2f}.hdf5"))
cp_callback <- list(callback_model_checkpoint(mode = "min",
                                              filepath = filepath,
                                              save_best_only = TRUE,
                                              verbose = 1))
Next, we fit the model to the training data:
model %>% fit(x_train,y_train,epochs = epochs,batch_size = batch_size,validation_split = 0.2,callbacks = cp_callback,verbose = 2)
- In this step, we generate predictions for the test data:
predicted = model %>% predict_classes(x_test)
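If predict_classes() is not available in your version of keras, the same class indices can be derived from the predicted probabilities by taking the position of the largest value along the vocabulary axis. The following is a minimal base R sketch on a toy array; `probs` is a random stand-in for the output of `predict()` and does not come from the trained model:

```r
# Toy stand-in for model predictions: 2 phrases, 3 timesteps, 4 vocabulary entries
set.seed(1)
probs <- array(runif(2 * 3 * 4), dim = c(2, 3, 4))

# which.max() is 1-based, so subtract 1 to get 0-based word indices (0 = padding)
predicted_toy <- apply(probs, c(1, 2), which.max) - 1
print(dim(predicted_toy))
```

With the real model, the probabilities returned by `model %>% predict(x_test)` would take the place of the toy array, giving one word index per timestep for each test phrase.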
Let's create a function that inverts the word index so that each integer index maps back to its word. We will use this to decode the phrases in German and English:
reverse_word_index <- function(tokenizer){
reverse_word_index <- names(tokenizer$word_index)
names(reverse_word_index) <- tokenizer$word_index
return(reverse_word_index)
}
german_reverse_word_index <- reverse_word_index(german_tokenizer)
eng_reverse_word_index <- reverse_word_index(eng_tokenizer)
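To see what the reversed index looks like, here is a small sketch on a made-up word index that has the same shape as `tokenizer$word_index` (a named list that maps words to integers). The words and indices are invented for illustration, and `unlist()` is used here only to make the conversion of the indices to names explicit:

```r
# Made-up word index in the same shape as tokenizer$word_index
toy_word_index <- list(i = 1L, am = 2L, happy = 3L)

# Same idea as reverse_word_index(): words become values, indices become names
toy_reverse <- names(toy_word_index)
names(toy_reverse) <- unlist(toy_word_index)

# Look up the word stored under index 2
print(toy_reverse[["2"]])
```

The result is a character vector whose names are the integer indices, so a predicted index can be turned back into a word with a single lookup.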
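Next, we define a function that converts a sequence of integer indices back into a phrase. We will use it to decode a sample phrase from the test data in German and its prediction in English:
index_to_word <- function(data_sample,word_index_dict){
  phrase = ""
  for(i in seq_along(data_sample)){
    index = data_sample[[i]]
    # Index 0 is padding and has no entry in the reversed word index
    if(index == 0) next
    # Look up the word by name, since the names of the dictionary are the indices
    word = word_index_dict[as.character(index)]
    phrase = paste0(phrase," ",word)
  }
  return(phrase)
}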
Now, we can print some sample German sentences and their original and predicted translations in English:
cat(paste0("The german sample phrase is -->",index_to_word(x_test[90,],german_reverse_word_index)))
cat('\n')
cat(paste0("The actual translation in english is -->",as.character(test_data[90,"English"])))
cat('\n')
cat(paste0("The predicted translation in english is -->",index_to_word(predicted[90,],eng_reverse_word_index)))
The following screenshot shows one example of a translation produced by our model. We can see that the model translated this phrase accurately:
Let's have a look at one more translation, as shown in the following code:
cat(paste0("The german sample phrase is -->",index_to_word(x_test[6,],german_reverse_word_index)))
cat('\n')
cat(paste0("The actual translation in english is -->",as.character(test_data[6,"English"])))
cat('\n')
cat(paste0("The predicted translation in english is -->",index_to_word(predicted[6,],eng_reverse_word_index)))
The following screenshot shows another accurate translation that was done by our model:
Now, let's move on to the nitty-gritty of the model and look at a detailed explanation of how it works.