How it works...

In Step 1, we preprocessed the raw data in the text and summary by converting all the text to lowercase and expanding contractions such as "don't" and "I'm" to "do not" and "I am", respectively. Then, we removed punctuation, non-alphanumeric characters, and stopwords such as "I", "me", "you", "for", and so on.

Note that we didn't remove any stopwords from the summary because doing so can change the meaning of the summary. For example, a summary that says "not that great" would become just "great" after stopword removal, since "not" and "that" are common stopwords.
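
The following is a minimal sketch of this cleaning step, assuming an illustrative contraction map and stopword list rather than the exact ones used in the recipe:

import re

CONTRACTIONS = {"don't": "do not", "i'm": "i am", "can't": "cannot"}
STOPWORDS = {"i", "me", "you", "for", "the", "a", "an", "and", "not", "that"}

def clean_text(text, remove_stopwords=True):
    # Lowercase and expand contractions
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Drop punctuation and other non-alphanumeric characters
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    words = text.split()
    if remove_stopwords:  # skipped for summaries, as noted above
        words = [w for w in words if w not in STOPWORDS]
    return " ".join(words)

cleaned_review = clean_text("I don't think this is that great, for me!")
cleaned_summary = clean_text("Not that great", remove_stopwords=False)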

In Step 2, we added special start and end tokens to the summary, which in this case is the target text. The start token prompts the model to predict the first word of the target sequence, while the end token signals the end of the sentence.
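
As a small illustration (assuming the summaries live in a pandas DataFrame column named summary, and using _START_ and _END_ as the token strings), this step can look like the following:

import pandas as pd

df = pd.DataFrame({"summary": ["not that great", "works well"]})
df["summary"] = df["summary"].apply(lambda s: "_START_ " + s + " _END_")
print(df["summary"].tolist())
# ['_START_ not that great _END_', '_START_ works well _END_']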

In Step 3, we fixed the maximum lengths of the source and target sentences; these lengths are used in the upcoming steps to pad the sequences so that they are uniform in length. Note that we decided on the maximum lengths based on the length distribution of the majority of the reviews. Next, in Step 4, we created tokenizers for the source and target phrases. Tokenization gives us a word index in which each word is indexed by its overall frequency in the dataset.
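
A sketch of Steps 3 and 4 using the Keras Tokenizer is shown below; the sample texts and the maximum lengths are placeholders, and in practice the lengths come from the length distribution of the data:

from tensorflow.keras.preprocessing.text import Tokenizer

cleaned_reviews = ["product works well would buy", "product stopped working do not buy"]
cleaned_summaries = ["_START_ works well _END_", "_START_ not great _END_"]

max_len_text, max_len_summary = 80, 10   # illustrative caps covering most reviews/summaries

x_tokenizer = Tokenizer()
x_tokenizer.fit_on_texts(cleaned_reviews)        # word_index ordered by frequency
y_tokenizer = Tokenizer(filters='')              # keep the _START_/_END_ tokens intact
y_tokenizer.fit_on_texts(cleaned_summaries)

x_vocab_size = len(x_tokenizer.word_index) + 1   # +1 for the padding index 0
y_vocab_size = len(y_tokenizer.word_index) + 1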

In Step 5, we created a custom function called encode_pad_sequences() to map the reviews and their associated summaries to sequences of integer values. Each integer represents a particular word in the word index created in the preceding step. This function also pads the sequences with zeros so that all the sequences have a uniform length, namely the maximum lengths we fixed for the source and target texts in Step 3. The padding='post' argument of the pad_sequences() function appends the zeros at the end of each sequence. We split the data into training and testing datasets and applied encode_pad_sequences() to them. Note that in this step, we also created additional decoder input data, labelled y1, from the already encoded summaries to train the model. The target data, y2, is one step ahead of this input data and does not include the start token. The same was done for the validation data.
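
The following is a hedged sketch of what encode_pad_sequences() and the y1/y2 offsetting can look like; it reuses the tokenizers and lengths from the previous sketch, and the exact implementation in the recipe may differ:

from tensorflow.keras.preprocessing.sequence import pad_sequences

def encode_pad_sequences(tokenizer, max_len, texts):
    seqs = tokenizer.texts_to_sequences(texts)                  # words -> integer indices
    return pad_sequences(seqs, maxlen=max_len, padding='post')  # zeros appended at the end

X_train = encode_pad_sequences(x_tokenizer, max_len_text, cleaned_reviews)
y_train = encode_pad_sequences(y_tokenizer, max_len_summary, cleaned_summaries)

y1_train = y_train[:, :-1]                                  # decoder input: begins with _START_
y2_train = y_train[:, 1:].reshape(y_train.shape[0], -1, 1)  # target: one step ahead, no _START_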

In Step 6, we configured a stacked LSTM encoder-decoder model architecture. First, we created the encoder network, followed by the decoder network. The encoder LSTM network converts the input review into two state vectors: the hidden state and the cell state. We discarded the output of the encoder and retained only the state information. The decoder LSTM was configured to learn to convert the target sequence, that is, the internal representation of the summary information, into the same sequence offset by one time step in the future; this type of training is known as teacher forcing. The initial states of the decoder LSTM are the state vectors from the encoder, which makes the decoder learn to generate the target at time t+1, given the target at time t, conditioned on the input sequence.
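
A minimal Keras sketch of such a stacked encoder-decoder is shown below; the vocabulary sizes, embedding dimension, and latent dimension are illustrative placeholders rather than the recipe's exact hyperparameters:

from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.models import Model

x_vocab, y_vocab = 10000, 3000        # placeholder vocabulary sizes
emb_dim, latent_dim = 128, 256        # placeholder embedding/latent dimensions
max_len_text = 80

# Encoder: two stacked LSTMs; the outputs are discarded and only the final states are kept
encoder_inputs = Input(shape=(max_len_text,))
enc_emb = Embedding(x_vocab, emb_dim)(encoder_inputs)
enc_out1, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(enc_emb)
enc_out2, state_h, state_c = LSTM(latent_dim, return_sequences=True, return_state=True)(enc_out1)

# Decoder: initialized with the encoder's states; trained with teacher forcing
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(y_vocab, emb_dim)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
dec_out, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])
decoder_dense = Dense(y_vocab, activation='softmax')
decoder_outputs = decoder_dense(dec_out)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)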

In Step 7, we compiled the model and trained it using RMSprop as the optimizer and sparse_categorical_crossentropy as the loss function. This loss function works directly on the integer targets, converting them into one-hot vectors on the fly, which avoids the memory overhead of materializing the full one-hot target arrays. We trained the model for 100 epochs, with a batch size of 200.
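
Continuing the architecture sketch above (and assuming the X_train, y1_train, and y2_train arrays from the Step 5 sketch, plus equivalent validation arrays), the compile and fit calls look roughly like this:

model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
model.fit([X_train, y1_train], y2_train,
          epochs=100, batch_size=200,
          validation_data=([X_val, y1_val], y2_val))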

In Step 8, we created an inference model that generates summaries for unseen input sequences. In this inference mode, we encoded the input sequence and obtained its state vectors. Then, we generated a target sequence of size 1 that contained only the start-of-sequence token (start). Next, this target sequence, along with the state vectors, was fed to the decoder to predict the next word. The predicted word was appended to the target sequence, and the same procedure was repeated until we either obtained the end-of-sequence token (end) or hit the maximum sequence length. This architecture gives the decoder the opportunity to use the previously generated words, along with the source text, as context for generating the next word. Finally, we predicted the summaries for a few sample reviews.
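
The following sketch shows how such an inference setup can be wired up, reusing the layer objects and state tensors from the architecture sketch above; the token strings and helper names are illustrative (the Keras tokenizer lowercases them to _start_ and _end_):

import numpy as np
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

# Encoder inference model: review -> final state vectors
encoder_model = Model(encoder_inputs, [state_h, state_c])

# Decoder inference model: previous word + previous states -> next-word probabilities + new states
dec_h_in = Input(shape=(latent_dim,))
dec_c_in = Input(shape=(latent_dim,))
dec_emb2 = dec_emb_layer(decoder_inputs)
dec_out2, h2, c2 = decoder_lstm(dec_emb2, initial_state=[dec_h_in, dec_c_in])
decoder_model = Model([decoder_inputs, dec_h_in, dec_c_in],
                      [decoder_dense(dec_out2), h2, c2])

def decode_sequence(input_seq, word_index, index_word, max_len_summary):
    h, c = encoder_model.predict(input_seq)
    target_seq = np.array([[word_index['_start_']]])      # seed with the start token
    decoded = []
    for _ in range(max_len_summary):
        probs, h, c = decoder_model.predict([target_seq, h, c])
        token_idx = int(np.argmax(probs[0, -1, :]))
        word = index_word.get(token_idx, '')
        if word == '_end_':                               # stop at the end token
            break
        decoded.append(word)
        target_seq = np.array([[token_idx]])              # feed the prediction back in
    return ' '.join(decoded)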
