So far, we have used the words independent of each other, hoping that a bag-of-words approach would suffice. Intuitively, however, neutral tweets probably contain a higher fraction of nouns, while positive or negative tweets are more colorful, requiring more adjectives and verbs. What if we could use this linguistic information of the tweets as well? If we could find out how many words in a tweet were nouns, verbs, adjectives, and so on, the classifier could maybe take that into account as well.
Determining the word types is what part of speech (POS) tagging is all about. A POS tagger parses a full sentence with the goal of arranging it into a dependency tree, where each node corresponds to a word and the parent-child relationships determine which word it depends on. With this tree, it can then make more informed decisions; for example, whether the word "book" is a noun ("This is a good book.") or a verb ("Could you please book the flight?").
You might have already guessed that NLTK will play a role in this area as well. And indeed, it comes readily packaged with all sorts of parsers and taggers. The POS tagger we will use, nltk.pos_tag(), is actually a full-blown classifier trained using manually annotated sentences from the Penn Treebank Project (http://www.cis.upenn.edu/~treebank). It takes as input a list of word tokens and outputs a list of tuples, each containing a token of the original sentence and its part-of-speech tag:
    >>> import nltk
    >>> nltk.pos_tag(nltk.word_tokenize("This is a good book."))
    [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('book', 'NN'), ('.', '.')]
    >>> nltk.pos_tag(nltk.word_tokenize("Could you please book the flight?"))
    [('Could', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('book', 'NN'), ('the', 'DT'), ('flight', 'NN'), ('?', '.')]
The POS tag abbreviations are taken from the Penn Treebank Project (adapted from http://americannationalcorpus.org/OANC/penn.html):
POS tag | Description | Example
---|---|---
CC | coordinating conjunction | and
CD | cardinal number | 1, third
DT | determiner | the
EX | existential there | *there* is
FW | foreign word | d'hoevre
IN | preposition/subordinating conjunction | in, of, like
JJ | adjective | green
JJR | adjective, comparative | greener
JJS | adjective, superlative | greenest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular or mass | table
NNS | noun plural | tables
NNP | proper noun, singular | John
NNPS | proper noun, plural | Vikings
PDT | predeterminer | both the boys
POS | possessive ending | friend's
PRP | personal pronoun | I, he, it
PRP$ | possessive pronoun | my, his
RB | adverb | however, usually, here
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | give up
TO | to | to go, to him
UH | interjection | uhhuhhuhh
VB | verb, base form | take
VBD | verb, past tense | took
VBG | verb, gerund/present participle | taking
VBN | verb, past participle | taken
VBP | verb, singular, present, non-3rd person | take
VBZ | verb, third person singular, present | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when
With these tags, it is pretty easy to filter the desired words from the output of pos_tag(). We simply have to count all the words whose tags start with NN for nouns, VB for verbs, JJ for adjectives, and RB for adverbs.
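As a quick illustration of this counting step, here is a small sketch. It takes a list of (token, tag) pairs in the format returned by nltk.pos_tag() and computes the fraction of each word type; the helper name count_pos_fractions is our own, not part of NLTK:

```python
def count_pos_fractions(tagged):
    """Compute the fraction of nouns, verbs, adjectives, and adverbs
    from a list of (token, tag) pairs as returned by nltk.pos_tag()."""
    counts = {"nouns": 0, "verbs": 0, "adjectives": 0, "adverbs": 0}
    for token, tag in tagged:
        if tag.startswith("NN"):
            counts["nouns"] += 1
        elif tag.startswith("VB"):
            counts["verbs"] += 1
        elif tag.startswith("JJ"):
            counts["adjectives"] += 1
        elif tag.startswith("RB"):
            counts["adverbs"] += 1
    total = len(tagged)
    return dict((k, v / float(total)) for k, v in counts.items())

# Tagged output for "This is a good book." from the example above
tagged = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
          ('good', 'JJ'), ('book', 'NN'), ('.', '.')]
print(count_pos_fractions(tagged))
```

For this sentence, one out of six tokens is a noun, one a verb, and one an adjective, so each of those fractions is roughly 0.167.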
While the linguistic information that we discussed earlier will most likely help us, there is something better we can use to harvest it: SentiWordNet (http://sentiwordnet.isti.cnr.it). Simply put, it is a 13 MB file that assigns most English words a positive and a negative value. Put more precisely, for every synonym set (synset), it records both a positive and a negative sentiment score. Some examples are as follows:
POS | ID | PosScore | NegScore | SynsetTerms | Description
---|---|---|---|---|---
a | | | | studious#1 | Marked by care and effort; "made a studious attempt to fix the television set"
a | | | | careless#1 | Marked by lack of attention or consideration or forethought or thoroughness; not careful
n | | | | implant#1 | A prosthesis placed permanently in tissue
v | | | | curl#1 | Form a curl, curve, or kink; "the cigar smoke curled up at the ceiling"
With the information in the POS column, we will be able to distinguish between the noun "book" and the verb "book". PosScore and NegScore together will help us determine the neutrality of the word, which is 1 - PosScore - NegScore. SynsetTerms lists all the words in the set that are synonyms. The ID and Description columns can be safely ignored for our purpose.
The synset terms have a number appended, because some terms occur multiple times in different synsets. For example, "fantasize" conveys two quite different meanings, which also leads to different scores:
POS | ID | PosScore | NegScore | SynsetTerms | Description
---|---|---|---|---|---
v | | | | fantasize#1 | Portray in the mind; "he is fantasizing the ideal wife"
v | | | | fantasize#2 | Indulge in fantasies; "he is fantasizing when he says that he plans to start his own company"
To find out which of the synsets to take, we would have to really understand the meaning of the tweets, which is beyond the scope of this chapter. The field of research that focuses on this challenge is called word sense disambiguation. For our task, we take the easy route and simply average the scores over all the synsets in which a term is found. For "fantasize", PosScore would be 0.1875 and NegScore would be 0.0625.
The following function, load_sent_word_net(), does all that for us and returns a dictionary where the keys are strings of the form "word type/word", for example "n/implant", and the values are the positive and negative scores:
    import os
    import csv, collections
    import numpy as np

    def load_sent_word_net():
        # map "word type/word" to a list of (PosScore, NegScore) tuples
        sent_scores = collections.defaultdict(list)

        with open(os.path.join(DATA_DIR,
                  "SentiWordNet_3.0.0_20130122.txt"), "r") as csvfile:
            reader = csv.reader(csvfile, delimiter='\t', quotechar='"')
            for line in reader:
                if line[0].startswith("#"):
                    continue
                if len(line) == 1:
                    continue

                POS, ID, PosScore, NegScore, SynsetTerms, Gloss = line
                if len(POS) == 0 or len(ID) == 0:
                    continue

                for term in SynsetTerms.split(" "):
                    # drop the number at the end of every term
                    term = term.split("#")[0]
                    term = term.replace("-", " ").replace("_", " ")
                    key = "%s/%s" % (POS, term)
                    sent_scores[key].append((float(PosScore),
                                             float(NegScore)))

        for key, value in sent_scores.iteritems():
            sent_scores[key] = np.mean(value, axis=0)

        return sent_scores
Now we have everything in place to create our first vectorizer. The most convenient way to do this is to inherit from BaseEstimator, which requires us to implement the following three methods:
- get_feature_names(): This returns a list of strings of the features that we will return in transform().
- fit(documents, y=None): As we are not implementing a classifier, we can ignore this one and simply return self.
- transform(documents): This returns numpy.array(), containing an array of shape (len(documents), len(get_feature_names())). This means that for every document in documents, it has to return a value for every feature name in get_feature_names().

Let us now implement these methods:
    sent_word_net = load_sent_word_net()

    class LinguisticVectorizer(BaseEstimator):
        def get_feature_names(self):
            return np.array(['sent_neut', 'sent_pos', 'sent_neg',
                             'nouns', 'adjectives', 'verbs', 'adverbs',
                             'allcaps', 'exclamation', 'question',
                             'hashtag', 'mentioning'])

        # we don't fit here but need to return the reference
        # so that it can be used like fit(d).transform(d)
        def fit(self, documents, y=None):
            return self

        def _get_sentiments(self, d):
            sent = tuple(d.split())
            tagged = nltk.pos_tag(sent)

            pos_vals = []
            neg_vals = []

            nouns = 0.
            adjectives = 0.
            verbs = 0.
            adverbs = 0.

            for w, t in tagged:
                p, n = 0, 0
                sent_pos_type = None
                if t.startswith("NN"):
                    sent_pos_type = "n"
                    nouns += 1
                elif t.startswith("JJ"):
                    sent_pos_type = "a"
                    adjectives += 1
                elif t.startswith("VB"):
                    sent_pos_type = "v"
                    verbs += 1
                elif t.startswith("RB"):
                    sent_pos_type = "r"
                    adverbs += 1

                if sent_pos_type is not None:
                    sent_word = "%s/%s" % (sent_pos_type, w)
                    if sent_word in sent_word_net:
                        p, n = sent_word_net[sent_word]

                pos_vals.append(p)
                neg_vals.append(n)

            l = len(sent)
            avg_pos_val = np.mean(pos_vals)
            avg_neg_val = np.mean(neg_vals)

            return [1 - avg_pos_val - avg_neg_val,
                    avg_pos_val, avg_neg_val,
                    nouns / l, adjectives / l, verbs / l, adverbs / l]

        def transform(self, documents):
            obj_val, pos_val, neg_val, nouns, adjectives, \
                verbs, adverbs = np.array([self._get_sentiments(d)
                                           for d in documents]).T

            allcaps = []
            exclamation = []
            question = []
            hashtag = []
            mentioning = []

            for d in documents:
                allcaps.append(np.sum([t.isupper()
                                       for t in d.split() if len(t) > 2]))
                exclamation.append(d.count("!"))
                question.append(d.count("?"))
                hashtag.append(d.count("#"))
                mentioning.append(d.count("@"))

            result = np.array([obj_val, pos_val, neg_val, nouns,
                               adjectives, verbs, adverbs, allcaps,
                               exclamation, question, hashtag,
                               mentioning]).T

            return result
Nevertheless, using these linguistic features in isolation, without the words themselves, will not take us very far. Therefore, we have to combine TfidfVectorizer with the linguistic features. This can be done with scikit-learn's FeatureUnion class. It is initialized the same way as Pipeline, but instead of evaluating the estimators in sequence, each one passing the output of the previous one to the next, FeatureUnion evaluates them in parallel and joins the output vectors afterwards:
    def create_union_model(params=None):
        def preprocessor(tweet):
            tweet = tweet.lower()

            for k in emo_repl_order:
                tweet = tweet.replace(k, emo_repl[k])
            for r, repl in re_repl.iteritems():
                tweet = re.sub(r, repl, tweet)

            return tweet.replace("-", " ").replace("_", " ")

        tfidf_ngrams = TfidfVectorizer(preprocessor=preprocessor,
                                       analyzer="word")
        ling_stats = LinguisticVectorizer()
        all_features = FeatureUnion([('ling', ling_stats),
                                     ('tfidf', tfidf_ngrams)])
        clf = MultinomialNB()
        pipeline = Pipeline([('all', all_features), ('clf', clf)])

        if params:
            pipeline.set_params(**params)

        return pipeline
Training and testing on the combined featurizers gives another 0.6 percent improvement on positive versus negative:
    == Pos vs. neg ==
    0.808   0.016   0.892   0.010
    == Pos/neg vs. irrelevant/neutral ==
    0.794   0.009   0.707   0.033
    == Pos vs. rest ==
    0.886   0.006   0.533   0.026
    == Neg vs. rest ==
    0.881   0.012   0.629   0.037
With these results, we probably do not want to use the positive versus rest and negative versus rest classifiers. Instead, we can first use the classifier that determines whether the tweet contains sentiment at all ("pos/neg versus irrelevant/neutral") and then, when it does, use the positive versus negative classifier to determine the actual sentiment.
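This two-stage scheme can be sketched as follows; the two classifier arguments are stand-ins for the trained "pos/neg versus irrelevant/neutral" and "pos versus neg" models, and the function name two_stage_predict is our own:

```python
def two_stage_predict(tweet, sentiment_clf, polarity_clf):
    """Classify a tweet in two stages: first decide whether it carries
    any sentiment at all, and only then determine the polarity."""
    if not sentiment_clf(tweet):   # "pos/neg vs. irrelevant/neutral"
        return "neutral"
    if polarity_clf(tweet):        # "pos vs. neg"
        return "positive"
    return "negative"

# Toy stand-ins for the trained classifiers, just to show the control flow
has_sentiment = lambda t: "!" in t or "love" in t or "hate" in t
is_positive = lambda t: "love" in t

print(two_stage_predict("I love this book!", has_sentiment, is_positive))
print(two_stage_predict("The meeting is at 5pm", has_sentiment, is_positive))
```

In a real setup, both stand-ins would be replaced by calls to pipeline.predict() on the respective trained pipelines.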