Summarization using gensim

Gensim has a summarizer that is based on an improved version of the TextRank algorithm by Rada Mihalcea et al. TextRank is a graph-based algorithm. For keyword extraction, the words in the document are the vertices, and the weight of an edge between two words is determined by their co-occurrences in the text. For summarization, the vertices are sentences, and the weight of an edge reflects how similar the two sentences are. In both cases, an algorithm similar to PageRank is used to determine the importance of each vertex, and the summary is extracted by selecting the highest-ranked sentences. Because the summary consists of sentences taken verbatim from the source, TextRank is one example of an extractive summarizer.

We will look at a simple example, using the gensim summarizer. As test data, we will use the nltk product review corpus:

from nltk.corpus import product_reviews_1
from gensim.summarization import summarizer

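We have also imported the gensim summarizer. Note that the gensim.summarization module was removed in gensim 4.0, so these examples require a 3.x release. We will use the summarizer to generate the product review summary: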

# Load the raw review text for one product from the corpus
product_review_raw = product_reviews_1.raw('Apex_AD2600_Progressive_scan_DVD player.txt')
# Request a summary of at most roughly 100 words
product_summary = summarizer.summarize(product_review_raw, word_count=100)
print("Raw Text Length:", len(product_review_raw.split()))
print("Summary Length:", len(product_summary.split()))
print("Summary:", product_summary)

We have chosen one of the products (a DVD player) in the nltk corpus. We limit the summary to 100 words through the word_count parameter of the summarize function, and print the word counts of the original text and of the summary to see the difference between the two:

Raw Text Length: 13014 
Summary Length: 88
Summary: player[+2]##i bought this apex 2600 dvd player for myself at christmas because it got good reviews as a good value for the money on a variety of different sites . remote[-2]##we 've purchased 3 universal remotes so far-all claiming to work " apex " dvd players and none worked . ##after having bought and been disappointed in another brand of dvd player , i purchased the apex ad2600 from amazon and first of all i should say it was delivered much more quickly than i had expected .

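The output shows that the summary has 88 words, compared to 13014 words in the original product reviews. The summarizer selects whole sentences, so the result can fall short of the 100-word limit. Markers such as player[+2]## are the feature and sentiment annotations of the corpus, which appear in the summary because we passed in the raw text. We can also look at the keywords that are extracted by the summarizer: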

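from gensim.summarization import keywords
# keywords() returns one keyword (or key phrase) per line, so split on newlines
keywords(product_review_raw).split("\n")[0:20]

The keywords function extracts the main keywords in the document and returns them as a newline-separated string. We will print the top 20 keywords in the review text: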

['players',
 'dvd player',
 'dvds',
 'play',
 'playing',
 'plays',
 'apex',
 'picture',
 'pictures',
 'pictured',
 'remotes',
 'work',
 'works',
 'working',
 'worked',
 'customer',
 'customers',
 'disks types played',
 'problems',
 'problem']

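The output also shows that some of the keywords are inflected forms of the same word, such as work, works, working, and worked. The keywords function accepts a lemmatize parameter that can reduce such variants.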
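To make the ranking idea behind TextRank more concrete, the following is a minimal sketch of sentence ranking in plain Python. It is not the gensim implementation: we assume a naive regular-expression sentence splitter, the word-overlap similarity from the original TextRank paper as the edge weight (gensim uses a different, BM25-based similarity), and a fixed number of PageRank iterations. The names split_sentences, similarity, rank_sentences, and textrank_summary are hypothetical helpers introduced only for this illustration.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def similarity(words_a, words_b):
    """Word-overlap similarity, normalized by the log of the set sizes."""
    if not words_a or not words_b:
        return 0.0
    denom = math.log(len(words_a)) + math.log(len(words_b))
    if denom == 0:  # both sentences contain a single distinct word
        return 0.0
    return len(words_a & words_b) / denom


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a weighted, PageRank-style iteration."""
    bags = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    n = len(sentences)
    # Edge weights between every pair of distinct sentences
    weights = [
        [similarity(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on a share of its score,
            # proportional to the weight of the edge j -> i
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) + damping * incoming)
        scores = new_scores
    return scores


def textrank_summary(text, max_sentences=2):
    """Return the top-ranked sentences in their original order."""
    sentences = split_sentences(text)
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(top[:max_sentences])]


sample = (
    "The dvd player works well. "
    "The remote for the dvd player stopped working. "
    "I like pizza. "
    "The picture on this dvd player is sharp."
)
print(textrank_summary(sample, max_sentences=3))
```

In this sample, the sentence about pizza shares no words with the other sentences, so it has no edges in the graph, receives the lowest score, and is left out of the three-sentence summary.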

Next, we will look at an abstractive summarizer, using deep learning.
