Data preparation

First, we will load the original news file and the summaries into a pandas DataFrame:

import gzip
import pandas as pd

titledata = []
artdata = []
# Read the gzipped news articles and their summaries line by line
with gzip.open('data/news.txt.gz') as artfile:
    for li in artfile:
        artdata.append(li)
with gzip.open('data/summary.txt.gz') as titlefile:
    for li in titlefile:
        titledata.append(li)
news = pd.DataFrame({'Text': artdata, 'Summary': titledata})
# Work with a 10% random sample to keep processing manageable
news = news.sample(frac=0.1)
news['Text_len'] = news.Text.apply(lambda x: len(x.split()))
news['Summary_len'] = news.Summary.apply(lambda x: len(x.split()))
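The Text_len and Summary_len columns will later drive the length-based filtering. As a quick, optional check (not part of the original code), we can inspect their distribution:

# Summary statistics (mean, std, quartiles) for the word counts
print(news[['Text_len', 'Summary_len']].describe())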

We will take a look at some sample news Text and Summary values:

print(news['Text'].head(2).values)
print(news['Summary'].head(2).values)

Output:
[b'chinese president hu jintao said here monday that china will work with romania to promote bilateral trade and economic cooperation . ' b'federal reserve policymakers opened a two-day meeting here tuesday to debate us monetary moves , a fed source reported . ']

[b'chinese president meets romanian pm ' b'federal reserve policymakers open two-day meeting ']
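Note the b'' prefixes: because the gzipped files were opened in binary mode, each row is a bytes object. Since the GloVe vocabulary we load next is keyed by str, a normalization step such as the following may be needed before the word lookups later in this section (a hedged sketch, not part of the original code):

# Decode the raw bytes to UTF-8 strings and strip trailing whitespace
news['Text'] = news['Text'].str.decode('utf-8').str.strip()
news['Summary'] = news['Summary'].str.decode('utf-8').str.strip()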

For the word embeddings, we will use the pretrained glove.6B vector corpus. We will load it into an embedding index that maps each word to its vector:

import codecs
import numpy as np

def build_word_vector_matrix(vector_file):
    embedding_index = {}
    with codecs.open(vector_file, 'r', 'utf-8') as f:
        for i, line in enumerate(f):
            sr = line.split()
            # Skip malformed lines that are too short to be a word vector
            if len(sr) < 26:
                continue
            word = sr[0]
            embedding = np.asarray(sr[1:], dtype='float32')
            embedding_index[word] = embedding
    return embedding_index

embeddings_index = build_word_vector_matrix('/Users/i346047/prs/temp/glove.6B.50d.txt')

embeddings_index now contains the mapping from each word to its corresponding word vector. Next, we will create a mapping from words to integer indexes (and vice versa) for all of the words in both the news text and the summaries:

# word_counts_dict (built earlier) maps each word to its frequency across
# the news text and summaries; rare words without a pretrained vector are dropped
word2int = {}
count_threshold = 20
value = 0
for word, count in word_counts_dict.items():
    if count >= count_threshold or word in embeddings_index:
        word2int[word] = value
        value += 1

# TOKEN_UNK, TOKEN_PAD, TOKEN_EOS, and TOKEN_GO are the special-token
# string constants defined in the code
special_codes = [TOKEN_UNK, TOKEN_PAD, TOKEN_EOS, TOKEN_GO]
for code in special_codes:
    word2int[code] = len(word2int)

int2word = {}
for word, value in word2int.items():
    int2word[value] = word
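With the vocabulary fixed, the pretrained vectors can be gathered into a dense embedding matrix indexed by the integer IDs. The following is a minimal sketch; the matrix name and the random initialization for words without a GloVe vector are assumptions, not from the original code:

embedding_dim = 50  # glove.6B.50d vectors have 50 dimensions
nwords = len(word2int)
# Row i holds the vector for the word whose integer ID is i; words without
# a pretrained vector (including the special tokens) get a random vector
word_embedding_matrix = np.random.uniform(-1.0, 1.0,
                                          (nwords, embedding_dim)).astype('float32')
for word, i in word2int.items():
    if word in embeddings_index:
        word_embedding_matrix[i] = embeddings_index[word]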

Note that we also include special codes for words that are not present in the vocabulary (UNK), for padding (PAD), for the end of a sentence (EOS), and for the start of a sentence (GO). These are defined by the corresponding constants in the code. The padding token will be used to pad the news text and summaries to fixed lengths. The start and end-of-sentence tokens are prepended and appended, respectively, to the input news text during training and inference. Next, we will convert the news text and summaries to integer IDs:

def convert_sentence_to_ids(text, eos=False):
    wordints = []
    word_count = 0
    for sentence in text:
        sentence2ints = []
        for word in sentence.split():
            word_count += 1
            # Out-of-vocabulary words are mapped to the UNK token ID
            if word in word2int:
                sentence2ints.append(word2int[word])
            else:
                sentence2ints.append(word2int[TOKEN_UNK])
        # Optionally terminate each sentence with the EOS token
        if eos:
            sentence2ints.append(word2int[TOKEN_EOS])
        wordints.append(sentence2ints)
    return wordints, word_count
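The filtering step further down refers to id_texts, id_summaries, and an unknown_tokens helper. A hedged sketch of how these could be produced (the exact calls in the original code may differ):

# Count how many UNK IDs a sentence contains
def unknown_tokens(sentence):
    return sum(1 for word_id in sentence if word_id == word2int[TOKEN_UNK])

# Convert the raw text to integer IDs; per the explanation above, EOS is
# appended to the input news text
id_texts, text_word_count = convert_sentence_to_ids(news.Text, eos=True)
id_summaries, summary_word_count = convert_sentence_to_ids(news.Summary, eos=False)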

Note that words not in the vocabulary are assigned the UNK token ID. Likewise, we end sentences with the EOS token. Next, we will drop all input texts and summaries that fall outside a specified minimum and maximum sentence length. We will also remove sentences whose number of unknown words exceeds a limit:

news_summaries_filtered = []
news_texts_filtered = []
# Keep sentences within one standard deviation of the mean length
max_text_length = int(news.Text_len.mean() + news.Text_len.std())
max_summary_length = int(news.Summary_len.mean() + news.Summary_len.std())
min_length = 4
unknown_token_text_limit = 10
unknown_token_summary_limit = 4
for count, text in enumerate(id_texts):
    unknown_token_text = unknown_tokens(id_texts[count])
    unknown_token_summary = unknown_tokens(id_summaries[count])
    text_len = len(id_texts[count])
    summary_len = len(id_summaries[count])
    # Drop pairs with too many unknown words
    if (unknown_token_text > unknown_token_text_limit or
            unknown_token_summary > unknown_token_summary_limit):
        continue
    # Drop pairs outside the allowed length range
    if (text_len < min_length or summary_len < min_length or
            text_len > max_text_length or summary_len > max_summary_length):
        continue
    news_summaries_filtered.append(id_summaries[count])
    news_texts_filtered.append(id_texts[count])

We have used minimum and maximum lengths within one standard deviation of the average length for the input texts and summaries. The reader can experiment with different limits and a different number of allowed unknown words by changing the max_text_length, max_summary_length, unknown_token_text_limit, unknown_token_summary_limit, and min_length variables.
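As mentioned earlier, the PAD token is used to bring sentences to a fixed length before they are fed to the model. A minimal sketch of such a helper (pad_sentence_batch is a hypothetical name, not from the original code):

def pad_sentence_batch(sentence_batch):
    # Pad every sentence with the PAD token ID up to the longest sentence in the batch
    max_sentence = max(len(sentence) for sentence in sentence_batch)
    return [sentence + [word2int[TOKEN_PAD]] * (max_sentence - len(sentence))
            for sentence in sentence_batch]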

Now, we will look at model creation.
