Data preparation

First, we will read the source text and the target text, which are in French and English, respectively:

import pandas as pd

frdata = []
endata = []
# Read the parallel corpus: one sentence per line in each file.
with open('data/train_fr_lines.txt') as frfile:
    for li in frfile:
        frdata.append(li)
with open('data/train_en_lines.txt') as enfile:
    for li in enfile:
        endata.append(li)
mtdata = pd.DataFrame({'FR': frdata, 'EN': endata})
# Record the word count of each sentence.
mtdata['FR_len'] = mtdata['FR'].apply(lambda x: len(x.split(' ')))
mtdata['EN_len'] = mtdata['EN'].apply(lambda x: len(x.split(' ')))
print(mtdata['FR'].head(2).values)
print(mtdata['EN'].head(2).values)

Output:

['Voici Bill Lange. Je suis Dave Gallo. '
'Nous allons vous raconter quelques histoires de la mer en vidéo. ']
["This is Bill Lange. I'm Dave Gallo. "
"And we're going to tell you some stories from the sea here in video. "]

As we will be using pre-trained embedding vectors, we will first load them to create a word-to-embedding dictionary that will be used to prepare the input text data for training:

import codecs
import numpy as np

def build_word_vector_matrix(vector_file):
    embedding_index = {}
    with codecs.open(vector_file, 'r', 'utf-8') as f:
        for i, line in enumerate(f):
            sr = line.split()
            word = sr[0]
            embedding = np.asarray(sr[1:], dtype='float32')
            embedding_index[word] = embedding
    return embedding_index

embeddings_index = build_word_vector_matrix('glove.6B.50d.txt')

Since the encoder and decoder inputs are word identifiers, we will create both word-to-ID and ID-to-word mappings that will later be used during training and inference. During training, we will use the word-to-ID mapping; the ID-to-word mapping will be used to recover the translated text during inference:

def build_word2id_mapping(word_counts_dict):
    word2int = {}
    count_threshold = 20
    value = 0
    for word, count in word_counts_dict.items():
        if count >= count_threshold or word in embeddings_index:
            word2int[word] = value
            value += 1
    special_codes = [TOKEN_UNK, TOKEN_PAD, TOKEN_EOS, TOKEN_GO]
    for code in special_codes:
        word2int[code] = len(word2int)
    int2word = {}
    for word, value in word2int.items():
        int2word[value] = word
    return word2int, int2word
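
The word_counts_dict_fr and word_counts_dict_en dictionaries passed to build_word2id_mapping() are not built in this listing. A minimal sketch of how such per-word frequency counts could be computed, assuming simple whitespace tokenization of the loaded text (the helper name build_word_counts is only illustrative), is:

from collections import Counter

# Hypothetical helper (not part of the original listing): counts how many
# times each whitespace-separated token occurs across a list of sentences.
def build_word_counts(sentences):
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return dict(counts)

word_counts_dict_fr = build_word_counts(mtdata['FR'])
word_counts_dict_en = build_word_counts(mtdata['EN'])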

This will map both the input words and the special tokens, TOKEN_UNK, TOKEN_PAD, TOKEN_EOS, and TOKEN_GO, to their corresponding numeric identifiers. The special tokens are defined as the strings UNK, PAD, EOS, and GO, respectively. Note that we only keep words that occur at least count_threshold times, which is set to 20 in the preceding code, or that already have a pre-trained embedding. We will now apply the build_word2id_mapping() and build_embeddings() functions to both the French and English texts:

fr_word2int,fr_int2word = build_word2id_mapping(word_counts_dict_fr)
en_word2int,en_int2word = build_word2id_mapping(word_counts_dict_en)
fr_embeddings_matrix = build_embeddings(fr_word2int)
en_embeddings_matrix = build_embeddings(en_word2int)
print("Length of french word embeddings: ", len(fr_embeddings_matrix))
print("Length of english word embeddings: ", len(en_embeddings_matrix))

Output:

Length of french word embeddings: 19708
Length of english word embeddings: 39614
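
The build_embeddings() helper is not listed in this section. A possible sketch, assuming it stacks the 50-dimensional GloVe vector for each word ID and falls back to a random vector for words (and special tokens) without a pre-trained embedding, is:

EMBEDDING_DIM = 50  # matches glove.6B.50d.txt

# Hypothetical implementation of build_embeddings(): row i of the returned
# matrix holds the embedding of the word whose identifier is i in word2int.
def build_embeddings(word2int):
    embeddings_matrix = np.zeros((len(word2int), EMBEDDING_DIM), dtype='float32')
    for word, idx in word2int.items():
        if word in embeddings_index:
            embeddings_matrix[idx] = embeddings_index[word]
        else:
            # No pre-trained vector available; initialize randomly.
            embeddings_matrix[idx] = np.random.uniform(-1.0, 1.0, EMBEDDING_DIM)
    return embeddings_matrix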

Next, we will define a function to transform both the source and target phrases into numeric identifiers:

def convert_sentence_to_ids(text, word2int, eos=False):
    wordints = []
    word_count = 0
    for sentence in text:
        sentence2ints = []
        for word in sentence.split():
            word_count += 1
            if word in word2int:
                sentence2ints.append(word2int[word])
            else:
                sentence2ints.append(word2int[TOKEN_UNK])
        if eos:
            sentence2ints.append(word2int[TOKEN_EOS])
        wordints.append(sentence2ints)
    return wordints, word_count

As we did previously, we will apply the convert_sentence_to_ids() function to the source and target text:

id_fr, word_count_fr = convert_sentence_to_ids(mtdata_fr, fr_word2int)
id_en, word_count_en = convert_sentence_to_ids(mtdata_en, en_word2int, eos=True)
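
Here, mtdata_fr and mtdata_en are assumed to be plain lists of the French and English sentences; one simple way they could be obtained from the DataFrame loaded earlier is:

# Assumption (not shown in the original listing): the sentence lists are
# taken directly from the DataFrame columns built during data loading.
mtdata_fr = mtdata['FR'].tolist()
mtdata_en = mtdata['EN'].tolist()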

Since sentences/phrases that have many unknown words or tokens are not useful for training, we will remove them from the set:

en_filtered = []
fr_filtered = []
max_en_length = int(mtdata.EN_len.max())
max_fr_length = int(mtdata.FR_len.max())
min_length = 4
unknown_token_en_limit = 10
unknown_token_fr_limit = 10
for count, text in enumerate(id_en):
    unknown_token_en = unknown_tokens(id_en[count], en_word2int)
    unknown_token_fr = unknown_tokens(id_fr[count], fr_word2int)
    en_len = len(id_en[count])
    fr_len = len(id_fr[count])
    if ((unknown_token_en > unknown_token_en_limit) or (unknown_token_fr > unknown_token_fr_limit) or
            (en_len < min_length) or (fr_len < min_length)):
        continue
    fr_filtered.append(id_fr[count])
    en_filtered.append(id_en[count])
print("Length of filtered french/english sentences: ", len(fr_filtered), len(en_filtered))

Output:

Length of filtered french/english sentences: 200404 200404

Note that we remove sentence pairs whose number of unknown tokens exceeds unknown_token_en_limit or unknown_token_fr_limit, which are both set to 10 in the preceding code. Similarly, we remove sentences that are shorter than min_length, which is set to 4 words.
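
The unknown_tokens() helper used in the filtering loop is not defined in this section. A minimal sketch, assuming it simply counts how many IDs in a sentence correspond to TOKEN_UNK, is:

# Hypothetical implementation of unknown_tokens(): counts how many word IDs
# in a sentence map to the unknown-word token.
def unknown_tokens(sentence_ids, word2int):
    unk_id = word2int[TOKEN_UNK]
    return sum(1 for word_id in sentence_ids if word_id == unk_id)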
