The Naïve Bayes classifier using TextBlob

TextBlob is an interesting library that offers a collection of tools for text processing. It provides a simple API for natural language processing (NLP) tasks, such as classification, noun phrase extraction, part-of-speech tagging, and sentiment analysis.

A few setup steps are involved before TextBlob can be used. Like most NLP libraries, it depends on a set of corpora; therefore, the following installation and configuration sequence needs to be completed first:

  • Installing TextBlob (either via conda or pip)
  • Downloading corpora

Installing TextBlob

Anaconda users can run binstar search -t conda textblob to find the channel from which the package can be installed. More details can be found in Appendix, Go Forth and Explore Visualization.
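Alternatively, the package is available on PyPI, so installing it with pip is a one-liner (shown here for reference):

$ pip install -U textblob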

Downloading corpora

The following command downloads the corpora:

$ python -m textblob.download_corpora

[nltk_data] Downloading package brown to
[nltk_data]     /Users/administrator/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/administrator/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/administrator/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package conll2000 to
[nltk_data]     /Users/administrator/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     /Users/administrator/nltk_data...
[nltk_data]   Unzipping taggers/maxent_treebank_pos_tagger.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/administrator/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.
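Once the corpora are in place, a quick smoke test exercises the tasks listed at the start of this section; the sentence here is just an arbitrary example:

from textblob import TextBlob

blob = TextBlob("TextBlob makes natural language processing simple.")
print(blob.tags)          # list of (word, part-of-speech tag) tuples
print(blob.noun_phrases)  # WordList of extracted noun phrases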

The Naïve Bayes classifier using TextBlob

TextBlob makes it easy to create custom text classifiers. To understand them better, it is worth experimenting with one's own training and test data. In the TextBlob 0.6.0 version, the following classifiers are available:

  • BaseClassifier
  • DecisionTreeClassifier
  • MaxEntClassifier
  • NLTKClassifier *
  • NaiveBayesClassifier
  • PositiveNaiveBayesClassifier

The entry marked with * is an abstract class that wraps around the nltk.classify module.
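All of these classifiers share the same constructor and classify interface, so swapping one for another is usually a one-line change. Here is a minimal sketch using DecisionTreeClassifier on a tiny, made-up training set:

from textblob.classifiers import DecisionTreeClassifier

train = [('I like this new tv show.', 'pos'),
         ('I do not enjoy my job', 'neg')]

dt = DecisionTreeClassifier(train)         # same call pattern as NaiveBayesClassifier
print(dt.classify("This show is great."))  # 'pos' or 'neg'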

For sentiment analysis, one can either train the Naive Bayes classifier on labeled examples or use the rule-based textblob.en.sentiments.PatternAnalyzer. A simple example of the trained approach is as follows:

from textblob.classifiers import NaiveBayesClassifier
from textblob.blob import TextBlob

train = [('I like this new tv show.', 'pos'),
         # similar training sentences with their sentiments go here
        ]
test = [('I do not enjoy my job', 'neg'),
        # similar test sentences with their sentiments go here
       ]

cl = NaiveBayesClassifier(train)
print(cl.classify("The new movie was amazing."))  # prints 'pos' or 'neg'

# Update the classifier with more labeled examples
cl.update(test)

# Classify a TextBlob, then classify it sentence by sentence
blob = TextBlob("The food was good. But the service was horrible. "
                "My father was not pleased.", classifier=cl)
print(blob.classify())

for sentence in blob.sentences:
    print(sentence)
    print(sentence.classify())

Here is the result that will be displayed when the preceding code is run:

pos
neg
The food was good.
pos
But the service was horrible.
neg
My father was not pleased.
pos
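The PatternAnalyzer mentioned earlier takes a different route: instead of training on labeled data, it scores text against a built-in lexicon. A minimal sketch (the exact scores depend on the library version):

from textblob import TextBlob
from textblob.en.sentiments import PatternAnalyzer

blob = TextBlob("The new movie was amazing.", analyzer=PatternAnalyzer())
print(blob.sentiment)  # Sentiment(polarity=..., subjectivity=...), polarity in [-1, 1]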

One can read the training data from a file in either the text format or the JSON format. The sample data in a JSON file is shown here:

[
  {"text": "mission impossible three is awesome btw", "label": "pos"},
  {"text": "brokeback mountain was beautiful", "label": "pos"},
  {"text": "da vinci code is awesome so far", "label": "pos"},
  {"text": "10 things i hate about you + a knight's tale * brokeback mountain", "label": "neg"},
  {"text": "mission impossible 3 is amazing", "label": "pos"},
  {"text": "harry potter = gorgeous", "label": "pos"},
  {"text": "i love brokeback mountain too: ]", "label": "pos"}
]
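TextBlob's classifier loaders also accept delimited files. The following sketch assumes a hypothetical train.csv with one text,label pair per line:

from textblob.classifiers import NaiveBayesClassifier

# train.csv (hypothetical), one "text,label" row per line:
#   I love this sandwich.,pos
#   I do not like this restaurant,neg
with open('train.csv', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="csv")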

from textblob.classifiers import NaiveBayesClassifier
from textblob.blob import TextBlob
from nltk.corpus import stopwords

stop = stopwords.words('english')

pos_dict = {}
neg_dict = {}
with open('/Users/administrator/json_train.json', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="json")
print("Done Training")

with open('/Users/administrator/test_data.txt', 'r') as rp, \
     open('/Users/administrator/results.txt', 'w') as res_writer:
    for line in rp:
        line = line.rstrip('\n')   # strip the trailing newline
        sentvalue = cl.classify(line)
        blob = TextBlob(line)
        sentence = blob.sentences[0]
        # Collect the words tagged 'NN' or 'V' from each classified sentence
        for word, pos in sentence.tags:
            if word in stop or len(word) <= 3:
                continue
            if pos == 'NN' or pos == 'V':
                if sentvalue == 'pos':
                    pos_dict[word.lower()] = word.lower()
                elif sentvalue == 'neg':
                    neg_dict[word.lower()] = word.lower()
        res_writer.write(line + " => sentiment " + sentvalue + "\n")

print("Lengths of positive and negative sentiments", len(pos_dict), len(neg_dict))

The output is as follows:

Lengths of positive and negative sentiments 203 128
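Since pos_dict and neg_dict are used only as sets of words, it is easy to check which words appear on both sides; this is a small sketch building on the dictionaries above:

# Words collected from both positively and negatively classified lines
common = set(pos_dict) & set(neg_dict)
print(len(common), sorted(common)[:10])  # count and a small sample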

We can add more training data from the corpus and evaluate the accuracy of the classifier with the following code:

test=[
("mission impossible three is awesome btw",'pos'),
("brokeback mountain was beautiful",'pos'),
("that and the da vinci code is awesome so far",'pos'),
("10 things i hate about you =",'neg'),
("brokeback mountain is a spectacularly beautiful movie",'pos'),
("mission impossible 3 is amazing",'pos'),
("the actor who plays harry potter sucks",'neg'),
("harry potter = gorgeous",'pos'),
('The beer was good.', 'pos'),
('I do not enjoy my job', 'neg'),
("I ain't feeling very good today.", 'pos'),
("I feel amazing!", 'pos'),
('Gary is a friend of mine.', 'pos'),
("I can't believe I'm doing this.", 'pos'),
("i went to see brokeback mountain, which is beautiful(",'pos'),
("and i love brokeback mountain too: ]",'pos')
]

print("Accuracy: {0}".format(cl.accuracy(test)))

from nltk.corpus import movie_reviews

reviews = [(list(movie_reviews.words(fileid)), category)
           for category in movie_reviews.categories()
           for fileid in movie_reviews.fileids(category)]
new_train, new_test = reviews[0:100], reviews[101:200]

cl.update(new_train)
accuracy = cl.accuracy(test + new_test)
print("Accuracy: {0}".format(accuracy))

# Show the 4 most informative features
cl.show_informative_features(4)

The output would be as follows:

Accuracy: 0.973913043478 
Most Informative Features        
contains(awesome) = True         pos : neg    =     51.9 : 1.0 
contains(with) = True            neg : pos    =     49.1 : 1.0 
contains(for) = True             neg : pos    =     48.6 : 1.0 
contains(on) = True              neg : pos    =     45.2 : 1.0 

Each ratio in the preceding output shows how much more often a feature occurs in one class than in the other. Initially, the training set had 250 samples and the accuracy was 0.813; after another 100 samples from the movie reviews corpus were added, the accuracy went up to 0.974. We then tried different test samples and plotted the sample size versus accuracy, as shown in the following graph:

[Graph: sample size versus accuracy]
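For reference, the following is a minimal sketch of how such a curve could be produced with matplotlib; the sample sizes, the random seed, and the held-out slice are all assumptions:

import random
import matplotlib.pyplot as plt
from nltk.corpus import movie_reviews
from textblob.classifiers import NaiveBayesClassifier

reviews = [(list(movie_reviews.words(fileid)), category)
           for category in movie_reviews.categories()
           for fileid in movie_reviews.fileids(category)]
random.seed(42)
random.shuffle(reviews)   # mix 'pos' and 'neg' reviews before slicing

sizes = [50, 100, 150, 200, 250]   # assumed training-set sizes
heldout = reviews[250:350]         # assumed fixed test slice

accuracies = []
for n in sizes:
    clf = NaiveBayesClassifier(reviews[:n])
    accuracies.append(clf.accuracy(heldout))

plt.plot(sizes, accuracies, marker='o')
plt.xlabel('Training sample size')
plt.ylabel('Accuracy')
plt.title('Sample size versus accuracy')
plt.show()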