Exploratory analysis of text

Once we have the tokenized data, one of the most common basic analyses is counting words or tokens and examining their distribution across the document. This tells us more about the main topics in the document. Let's start by analyzing the web text data that comes with NLTK:

>>> import nltk
>>> from nltk.corpus import webtext
>>> webtext_sentences = webtext.sents('firefox.txt')
>>> webtext_words = webtext.words('firefox.txt')
>>> len(webtext_sentences)
1142
>>> len(webtext_words)
102457
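
If the webtext corpus is not already present on your machine, the calls above will raise a LookupError; it can be fetched once with NLTK's downloader:

>>> nltk.download('webtext')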

Note that we have only loaded the text related to the Firefox discussion forum (firefox.txt), though the web text corpus contains other data as well (such as personal advertisements and movie scripts). The preceding output gives the number of sentences and words, respectively, in the entire text. We can also get the size of the vocabulary by passing the word list through a set, as shown in the following code:

>>> vocabulary = set(webtext_words)
>>> len(vocabulary)
8296
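
Note that this count is case-sensitive, so Firefox and firefox count as separate vocabulary entries. As a quick variation (not part of the original listing), we can lowercase each word before building the set to obtain a case-normalized vocabulary size:

>>> len(set(word.lower() for word in webtext_words))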

To get the frequency distribution of the words in the text, we can use the nltk.FreqDist() function, which counts how often each word occurs. Sorting the distribution by frequency gives us the top words in the text, providing a rough idea of the main topics, as shown in the following code:

>>> frequency_dist = nltk.FreqDist(webtext_words)
>>> sorted(frequency_dist, key=frequency_dist.__getitem__, reverse=True)[:30]
['.', 'in', 'to', '"', 'the', "'", 'not', '-', 'when', 'on', 'a', 'is', 't', 'and', 'of', '(', 'page', 'for', 'with', ')', 'window', 'Firefox', 'does', 'from', 'open', ':', 'menu', 'should', 'bar', 'tab']
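
Since nltk.FreqDist is built on Python's collections.Counter, an equivalent and slightly more idiomatic way to obtain the same ranking is its most_common() method, which returns (word, count) pairs:

>>> frequency_dist.most_common(30)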

This gives the top 30 words used in the text, though it is unsurprising that stop words such as the occur frequently in English. However, we can also see that words such as Firefox appear, because the text we used for analysis comes from a discussion forum about the Firefox browser. We can also look at the frequency distribution of words longer than three characters, which excludes words such as the and is, using the following code:

>>> large_words = {k: v for k, v in frequency_dist.items() if len(k) > 3}
>>> frequency_dist = nltk.FreqDist(large_words)
>>> frequency_dist.plot(50, cumulative=False)

Here, we kept only the words longer than three characters, storing them in a dictionary that maps each word to its frequency. This dictionary is then passed to a new NLTK frequency distribution, whose plot() method produces the plot that follows, showing the frequency distribution of the words:

This shows the distribution of frequency counts for the top 50 words. From the frequency distribution, we can generate a word cloud to get an intuitive visualization of the words used in the text. For this, we have to install the wordcloud Python package, as follows:

pip install wordcloud

This will install the wordcloud package, which can generate word clouds by placing words on a canvas randomly, with sizes proportional to their frequency in the text. We will now look at the code for displaying the word cloud:

>>> from wordcloud import WordCloud
>>> wcloud = WordCloud().generate_from_frequencies(frequency_dist)
>>> import matplotlib.pyplot as plt
>>> plt.imshow(wcloud, interpolation='bilinear')
>>> plt.axis("off")
(-0.5, 399.5, 199.5, -0.5)
>>> plt.show()

In the preceding code, we passed in the frequency distribution of words that we obtained earlier, with NLTK. The word cloud generated by the preceding code is shown in the following screenshot: 
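
The tuple printed by plt.axis("off") reflects the default 400 x 200 pixel canvas. For a larger or differently styled cloud, WordCloud accepts width, height, and background_color arguments; the values below are purely illustrative:

>>> wcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(frequency_dist)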

Based on our earlier example from the stop words section, let's look at how the distribution changes after we remove the stop words. After their removal, the word cloud is more in line with the topic of the text, as the following sketch shows:
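
The following is a minimal sketch of how this cloud can be regenerated, assuming sw_l is the stop word list built in the earlier stop words section (reconstructed here from nltk.corpus.stopwords); filtered_freq is a hypothetical name for the new distribution:

>>> from nltk.corpus import stopwords  # may require nltk.download('stopwords')
>>> sw_l = stopwords.words('english')  # assumed to match the earlier sw_l
>>> filtered_freq = nltk.FreqDist(w for w in webtext_words if w not in sw_l and len(w) > 3)
>>> wcloud = WordCloud().generate_from_frequencies(filtered_freq)
>>> plt.imshow(wcloud, interpolation='bilinear')
>>> plt.axis("off")
>>> plt.show()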

In the preceding word cloud, common words such as when, with, from, and so on, have been removed. This can be directly verified in the dictionary of word frequency distribution, using the following code:

>>> words_in_webtext_without_sw = [word for word in webtext_words if word not in sw_l]
>>> 'when' in words_in_webtext_without_sw
False
>>> 'from' in words_in_webtext_without_sw
False

Similarly, we can check the filtered word list, words_in_webtext_without_sw, for the presence of other words that appeared in the word cloud before the removal of stop words. Note that the comparison word not in sw_l is case-sensitive, so capitalized variants such as When would pass through the filter unless each word is lowercased before the check.
