French to English using NLTK SMT models

We will now walk through an example of statistical machine translation using NLTK. We will use translated TED Talks from https://wit3.fbk.eu/mt.php?release=2015-01 as our training and test dataset. The data contains a parallel English-French corpus of TED Talk transcripts. The complete code and data for this example are available under the Chapter10/ directory of this book's code repository. We will use the IBM lexical alignment models, which are simple statistical translation models. These models take a collection of aligned sentence pairs in the source and target languages and compute the probabilities of word-to-word alignments between them. We will use the basic IBM Model 1, which performs a one-to-one alignment of the source and target sentences: the model produces exactly one target word for each source word, without modeling word reordering, one-to-many translations, or word dropping.

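Before turning to the TED data, the following minimal sketch (adapted from the example in the NLTK documentation, using a toy German-English corpus) shows what IBM Model 1 learns: a table of word-to-word alignment probabilities estimated with expectation maximization (EM):

from nltk.translate.api import AlignedSent
from nltk.translate.ibm1 import IBMModel1

# Toy parallel corpus: each AlignedSent pairs a German sentence (words)
# with its English translation (mots)
bitext = [
    AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']),
    AlignedSent(['das', 'haus', 'ist', 'ja', 'gross'], ['the', 'house', 'is', 'big']),
    AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']),
    AlignedSent(['das', 'haus'], ['the', 'house']),
    AlignedSent(['das', 'buch'], ['the', 'book']),
    AlignedSent(['ein', 'buch'], ['a', 'book']),
]
ibm1 = IBMModel1(bitext, 5)  # train with 5 EM iterations

# P('buch' | 'book') should be high after training
print(ibm1.translation_table['buch']['book'])
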
The nltk.translate package provides the implementation of the IBM alignment models. We will first import these and define a function to read English and corresponding French translated data:

from nltk.translate.ibm1 import IBMModel1
from nltk.translate.api import AlignedSent
import dill as pickle
import random

def read_sents(filename):
    # Read one sentence per line and tokenize it on whitespace
    sents = []
    with open(filename, 'r') as fi:
        for li in fi:
            sents.append(li.split())
    return sents

The AlignedSent class will be used to supply the French-English alignment data during training. read_sents() reads each line from the input file and converts it into a list of tokens, one per sentence.
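
As a quick illustration with a hypothetical sentence pair (not taken from the dataset), an AlignedSent object simply holds the two token lists, exposing its first argument as words and its second as mots:

pair = AlignedSent(['bonjour', 'tout', 'le', 'monde'], ['hello', 'world'])
print(pair.words)  # first argument: ['bonjour', 'tout', 'le', 'monde']
print(pair.mots)   # second argument: ['hello', 'world']

We will now create the alignment data and train the model: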

max_count = 5000
eng_sents_all = read_sents('data/train_en_lines.txt')
fr_sents_all = read_sents('data/train_fr_lines.txt')
eng_sents = eng_sents_all[:max_count]
fr_sents = fr_sents_all[:max_count]
print("Size of English sentences: ", len(eng_sents))
print("Size of French sentences: ", len(fr_sents))

# Pair each French sentence with its English counterpart
aligned_text = []
for i in range(len(eng_sents)):
    al_sent = AlignedSent(fr_sents[i], eng_sents[i])
    aligned_text.append(al_sent)

print("Training SMT model")
ibm_model = IBMModel1(aligned_text, 5)  # 5 EM iterations
print("Training complete")

We use the first 5,000 sentences (max_count) as the training data for faster convergence, though you can change this to train on the complete dataset. We then create the list of French-English sentence pairs with AlignedSent and pass it to IBMModel1 for training. Since we imported dill earlier, we can also persist the trained model, as sketched below.
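
This is a minimal sketch with a hypothetical file path; dill is used in place of the standard pickle module because the model's internal tables are defaultdicts built from lambdas, which plain pickle cannot serialize:

# Save the trained model (hypothetical path)
with open('data/ibm_model.pkl', 'wb') as fo:
    pickle.dump(ibm_model, fo)

# Reload it later without retraining
with open('data/ibm_model.pkl', 'rb') as fi:
    ibm_model = pickle.load(fi)

After training, we will look at how the model performs on the translation task: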

# Pick a random sentence index from the training range
n_random = random.randint(0, max_count - 1)
fr_sent = fr_sents_all[n_random]
eng_sent_actual_tr = eng_sents_all[n_random]

tr_sent = []
for w in fr_sent:
    # translation_table[w] maps candidate English words to P(w | english_word)
    probs = ibm_model.translation_table[w]
    if len(probs) == 0:
        continue
    sorted_words = sorted(probs.items(), key=lambda x: x[1], reverse=True)
    # Take the most probable English word, skipping the NULL token (None)
    for top_word, _ in sorted_words:
        if top_word is not None:
            tr_sent.append(top_word)
            break

print("French sentence: ", " ".join(fr_sent))
print("Translated Eng sentence: ", " ".join(tr_sent))
print("Original translation: ", " ".join(eng_sent_actual_tr))

We pick a random sentence from the list of French sentences and look up the corresponding English words using translation_table. This table stores the probability of an alignment between a given French word and candidate English words. Using these alignment probabilities, we pick the English word that is most likely a translation of each French word, skipping the model's NULL token (None). This lookup is done for every French word in the original sentence to build the English output in tr_sent. Finally, we print the French sentence, the SMT-translated sentence, and the reference translation:

French sentence:  On appelle ça l'accessibilité financière.
Translated Eng sentence:  suggests affordability. works. called called
Original translation:  And it's called affordability.

We can see that the SMT translation gets some of the words right, such as affordability, but the overall output is not meaningful compared to the reference. This can be improved by training on the whole dataset and increasing the number of EM iterations. It should also be noted that we used a simple model that does not consider word order in the target language. The more complex IBM Models 3, 4, and 5 additionally capture word order (distortion) and fertility, that is, the number of target words each source word generates, since the mapping is not always one-to-one. A related HMM-based alignment model, which conditions each word's alignment on the position of the previous one, can also produce better translations.
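
To see what the model has actually learned for an individual word, we can query translation_table directly. The sketch below uses a word from the sample sentence above; any word seen during training works:

# Top three English candidates for a single French word
probs = ibm_model.translation_table['appelle']
top3 = sorted(probs.items(), key=lambda x: x[1], reverse=True)[:3]
for eng_word, p in top3:
    print(eng_word, round(p, 4))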

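The higher-order models share the same training interface, so moving up from Model 1 is a small change. The sketch below trains NLTK's IBM Model 3 on the same aligned data; training is noticeably slower, as the model also estimates fertility and distortion tables:

from nltk.translate.ibm3 import IBMModel3

# Train IBM Model 3 on the same sentence pairs with 5 EM iterations
ibm3_model = IBMModel3(aligned_text, 5)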