State-of-the-art abstractive text summarization

In this section, we will look at two recent papers that describe enhancements to the model used in our news text summarization example from the previous section.

In the first paper, Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond (https://arxiv.org/abs/1602.06023), Ramesh Nallapati et al. from IBM applied the sequence-to-sequence model developed for neural machine translation to text summarization and achieved better performance than the state-of-the-art systems of the time. The model uses a bidirectional GRU-RNN as the encoder and a unidirectional GRU-RNN as the decoder. Note that this is the same model architecture that we used in our news summarization example.
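As a refresher, the following is a minimal Keras sketch of such an encoder-decoder, with a bidirectional GRU encoder and a unidirectional GRU decoder. The vocabulary size and layer dimensions are hypothetical, and the attention mechanism is omitted for brevity; this is an illustration rather than the authors' implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Hypothetical sizes; the real values depend on the dataset and vocabulary.
vocab_size, embed_dim, hidden_dim = 50000, 128, 256

# Encoder: word IDs -> embeddings -> bidirectional GRU.
enc_inputs = layers.Input(shape=(None,), dtype="int32", name="source_ids")
enc_emb = layers.Embedding(vocab_size, embed_dim)(enc_inputs)
enc_seq, fwd_state, bwd_state = layers.Bidirectional(
    layers.GRU(hidden_dim, return_sequences=True, return_state=True)
)(enc_emb)
enc_state = layers.Concatenate()([fwd_state, bwd_state])

# Decoder: a unidirectional GRU initialized with the final encoder state.
dec_inputs = layers.Input(shape=(None,), dtype="int32", name="summary_ids")
dec_emb = layers.Embedding(vocab_size, embed_dim)(dec_inputs)
dec_seq = layers.GRU(hidden_dim * 2, return_sequences=True)(
    dec_emb, initial_state=enc_state
)
logits = layers.Dense(vocab_size)(dec_seq)

model = Model([enc_inputs, dec_inputs], logits)
```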

The main enhancements that they proposed are as follows:

  • In addition to word embeddings, they enriched the input features with POS tags, named entity tags, and the TF-IDF statistics of the words. These features help the model identify the key concepts and entities in the document, improving the generated summaries.
  • These additional features are concatenated with the existing word vectors and fed into the encoder.

The following diagram illustrates the use of the additional features: word embeddings (W), part-of-speech tags (POS), named entity tags (NER), and term frequency-inverse document frequency (TF-IDF) statistics:
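As a complement to the diagram, here is a minimal Keras sketch of how such a feature-rich encoder input could be assembled. The tag-set sizes and embedding dimensions are made up for illustration, and feeding TF-IDF as a raw per-token scalar is a simplification (the paper discretizes it into bins and embeds it):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical vocabulary and tag-set sizes; not taken from the paper.
vocab_size, n_pos_tags, n_ner_tags = 50000, 45, 10
word_dim, pos_dim, ner_dim = 128, 16, 16

word_ids = layers.Input(shape=(None,), dtype="int32", name="word_ids")
pos_ids = layers.Input(shape=(None,), dtype="int32", name="pos_ids")
ner_ids = layers.Input(shape=(None,), dtype="int32", name="ner_ids")
tfidf = layers.Input(shape=(None, 1), name="tfidf")  # one scalar per token

word_emb = layers.Embedding(vocab_size, word_dim)(word_ids)
pos_emb = layers.Embedding(n_pos_tags, pos_dim)(pos_ids)
ner_emb = layers.Embedding(n_ner_tags, ner_dim)(ner_ids)

# Concatenate all features along the last axis; this enriched vector
# replaces the plain word embedding as the input to the encoder RNN.
encoder_input = layers.Concatenate(axis=-1)([word_emb, pos_emb, ner_emb, tfidf])
```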

To handle unseen or out-of-vocabulary (OOV) words in the test data, they used a switching generator-pointer mechanism: at each decoding step, a switch decides whether to generate the next word from the vocabulary or to copy a word directly from the source document at the position selected by the attention. The news summarization example that we described earlier simply replaces OOV words with UNK tokens, which may not produce coherent summaries.
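The following is a simplified sketch of the switching behaviour at a single decoding step. The hard threshold and the greedy argmax choices are illustrative simplifications; in the paper, the switch is a learned sigmoid trained jointly with the rest of the model:

```python
import numpy as np

def switching_decoder_step(p_vocab, attention, source_tokens, p_switch, threshold=0.5):
    """One decoding step with a generate/copy switch (simplified sketch).

    p_vocab:       (vocab_size,) softmax over the decoder vocabulary
    attention:     (src_len,)    attention weights over the source positions
    source_tokens: list of source words, aligned with `attention`
    p_switch:      scalar in [0, 1], probability that the switch says "generate"
    """
    if p_switch >= threshold:
        # Generator is active: emit the most likely vocabulary word.
        return "generate", int(np.argmax(p_vocab))
    # Pointer is active: copy the source word at the most-attended position,
    # which lets the model reproduce OOV words verbatim.
    return "copy", source_tokens[int(np.argmax(attention))]
```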

The architecture, including the switching generator-pointer mechanism, is illustrated in the following figure:

For source documents that are long and contain many sentences, it is also necessary to identify the key sentences that should contribute to the summary. For this, they used a hierarchical attention network with two bidirectional RNNs on the encoding side: one operating at the word level and the other at the sentence level. More details can be found in the paper at https://arxiv.org/abs/1602.06023.
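The following is a minimal Keras sketch of such a two-level encoder, with a word-level bidirectional GRU applied to each sentence and a sentence-level bidirectional GRU over the resulting sentence vectors. The document shape and layer sizes are hypothetical, and the word-level and sentence-level attention layers are omitted:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical shapes: a document padded to n_sents sentences of n_words tokens each.
vocab_size, embed_dim, hidden_dim = 50000, 128, 128
n_sents, n_words = 20, 40

doc_ids = layers.Input(shape=(n_sents, n_words), dtype="int32", name="document_ids")
emb = layers.Embedding(vocab_size, embed_dim)(doc_ids)   # (batch, sents, words, emb)

# Word-level bidirectional RNN, applied to every sentence independently.
word_rnn = layers.Bidirectional(layers.GRU(hidden_dim))
sent_vectors = layers.TimeDistributed(word_rnn)(emb)     # (batch, sents, 2*hidden)

# Sentence-level bidirectional RNN over the sentence vectors.
sent_states = layers.Bidirectional(
    layers.GRU(hidden_dim, return_sequences=True)
)(sent_vectors)                                          # (batch, sents, 2*hidden)
```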

In the second paper, Get to the Point: Summarization with Pointer-Generator Networks, Abigail See et al. from Google used a pointer-generator network similar to the approach taken by Ramesh Nallapati et al. However, instead of invoking the pointer only for OOV words, they combine the copy distribution and the vocabulary distribution at every decoding step. The copy distribution sums the attention over every position where a word occurs in the source document, so a word that is used repeatedly in the source receives a higher probability of being chosen for the summary. The following figure shows how the vocabulary distribution and the attention distribution are combined to generate the final distribution over the summary words. The detailed paper can be found at https://arxiv.org/abs/1704.04368:
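To make this combination concrete, here is a NumPy sketch with purely illustrative numbers. The final distribution is a p_gen-weighted mixture of the vocabulary distribution and the copy (attention) distribution, and attention is summed over repeated occurrences of a source word:

```python
import numpy as np

def final_distribution(p_vocab, attention, source_ids, p_gen, extended_vocab_size):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention over
    every source position where w occurs (simplified sketch).

    source_ids map source tokens into an extended vocabulary, so OOV source
    words receive temporary ids >= len(p_vocab).
    """
    final = np.zeros(extended_vocab_size)
    final[: len(p_vocab)] = p_gen * p_vocab
    # np.add.at accumulates attention for words that occur more than once
    # in the source, which raises their probability of being copied.
    np.add.at(final, source_ids, (1.0 - p_gen) * attention)
    return final

# Tiny usage example with hypothetical numbers:
p_vocab = np.array([0.7, 0.2, 0.1])      # in-vocabulary words 0..2
attention = np.array([0.5, 0.3, 0.2])    # three source positions
source_ids = np.array([1, 3, 1])         # word 3 is an OOV source word
print(final_distribution(p_vocab, attention, source_ids,
                         p_gen=0.6, extended_vocab_size=4))
# -> [0.42, 0.40, 0.06, 0.12], which sums to 1.0
```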
