So far, we have used the words independent of each other, hoping that a bag-of-words approach would suffice. Intuitively, however, neutral tweets probably contain a higher fraction of nouns, while positive or negative tweets are more colorful, requiring more adjectives and verbs. What if we could use this linguistic information of the tweets as well? If we could find out how many words in a tweet were nouns, verbs, adjectives, and so on, the classifier could maybe take that into account as well.
Determining the word types is what part of speech (POS) tagging is all about. A POS tagger parses a full sentence with the goal of arranging it into a dependency tree, where each node corresponds to a word and the parent-child relationships determine which word it depends on. With this tree, it can then make more informed decisions; for example, whether the word "book" is a noun ("This is a good book.") or a verb ("Could you please book the flight?").
You might have already guessed that NLTK will play a role in this area as well. And indeed, it comes readily packaged with all sorts of parsers and taggers. The POS tagger we will use, nltk.pos_tag(), is actually a full-blown classifier trained using manually annotated sentences from the Penn Treebank Project (http://www.cis.upenn.edu/~treebank). It takes as input a list of word tokens and outputs a list of tuples, each containing a token of the original sentence and its part-of-speech tag:
    >>> import nltk
    >>> nltk.pos_tag(nltk.word_tokenize("This is a good book."))
    [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('book', 'NN'), ('.', '.')]
    >>> nltk.pos_tag(nltk.word_tokenize("Could you please book the flight?"))
    [('Could', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('book', 'NN'), ('the', 'DT'), ('flight', 'NN'), ('?', '.')]
The POS tag abbreviations are taken from the Penn Treebank Project (adapted from http://americannationalcorpus.org/OANC/penn.html):
POS tag | Description | Example
---|---|---
CC | coordinating conjunction | and
CD | cardinal number | 1, third
DT | determiner | the
EX | existential there | *there* is
FW | foreign word | d'hoevre
IN | preposition/subordinating conjunction | in, of, like
JJ | adjective | green
JJR | adjective, comparative | greener
JJS | adjective, superlative | greenest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular or mass | table
NNS | noun plural | tables
NNP | proper noun, singular | John
NNPS | proper noun, plural | Vikings
PDT | predeterminer | both the boys
POS | possessive ending | friend's
PRP | personal pronoun | I, he, it
PRP$ | possessive pronoun | my, his
RB | adverb | however, usually, here
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | give up
TO | to | to go, to him
UH | interjection | uhhuhhuhh
VB | verb, base form | take
VBD | verb, past tense | took
VBG | verb, gerund/present participle | taking
VBN | verb, past participle | taken
VBP | verb, singular, present, non-3rd person | take
VBZ | verb, third person singular, present | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when
With these tags, it is pretty easy to filter the desired words from the output of pos_tag(). We simply have to count all the words whose tags start with NN for nouns, VB for verbs, JJ for adjectives, and RB for adverbs.
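As a quick illustration of this counting step, here is a small sketch. It takes a list of (token, tag) pairs in the format returned by nltk.pos_tag() and computes the fraction of each word type; the helper name count_pos_fractions is our own, not part of NLTK:

```python
def count_pos_fractions(tagged):
    """Compute the fraction of nouns, verbs, adjectives, and adverbs
    from a list of (token, tag) pairs as returned by nltk.pos_tag()."""
    counts = {"nouns": 0, "verbs": 0, "adjectives": 0, "adverbs": 0}
    for token, tag in tagged:
        if tag.startswith("NN"):
            counts["nouns"] += 1
        elif tag.startswith("VB"):
            counts["verbs"] += 1
        elif tag.startswith("JJ"):
            counts["adjectives"] += 1
        elif tag.startswith("RB"):
            counts["adverbs"] += 1
    total = len(tagged)
    return dict((k, v / float(total)) for k, v in counts.items())

# Tagged output for "This is a good book." from the example above
tagged = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
          ('good', 'JJ'), ('book', 'NN'), ('.', '.')]
print(count_pos_fractions(tagged))
```

For this sentence, one out of six tokens is a noun, one a verb, and one an adjective, so each of those fractions is roughly 0.167.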
While the linguistic information that we discussed earlier will most likely help us, there is something better we can use to harvest it: SentiWordNet (http://sentiwordnet.isti.cnr.it). Simply put, it is a 13 MB file that assigns most English words a positive and a negative value. Put more precisely, for every synonym set (synset), it records both a positive and a negative sentiment score. Some examples are as follows:
POS | ID | PosScore | NegScore | SynsetTerms | Description
---|---|---|---|---|---
a | | | | studious#1 | Marked by care and effort; "made a studious attempt to fix the television set"
a | | | | careless#1 | Marked by lack of attention or consideration or forethought or thoroughness; not careful
n | | | | implant#1 | A prosthesis placed permanently in tissue
v | | | | curl#1 | Form a curl, curve, or kink; "the cigar smoke curled up at the ceiling"
With the information in the POS column, we will be able to distinguish between the noun "book" and the verb "book". PosScore and NegScore together will help us determine the neutrality of the word, which is 1 - PosScore - NegScore. SynsetTerms lists all the words in the set that are synonyms. The ID and Description columns can be safely ignored for our purpose.
The synset terms have a number appended, because some terms occur multiple times in different synsets. For example, "fantasize" conveys two quite different meanings, which also leads to different scores:
POS | ID | PosScore | NegScore | SynsetTerms | Description
---|---|---|---|---|---
v | | | | fantasize#1 | Portray in the mind; "he is fantasizing the ideal wife"
v | | | | fantasize#2 | Indulge in fantasies; "he is fantasizing when he says that he plans to start his own company"
To find out which of the synsets to take, we would have to really understand the meaning of the tweets, which is beyond the scope of this chapter. The field of research that focuses on this challenge is called word sense disambiguation. For our task, we take the easy route and simply average the scores over all the synsets in which a term is found. For "fantasize", PosScore would be 0.1875 and NegScore would be 0.0625.
The following function, load_sent_word_net(), does all that for us and returns a dictionary where the keys are strings of the form "word type/word", for example "n/implant", and the values are the positive and negative scores:
    import os
    import csv, collections
    import numpy as np

    def load_sent_word_net():
        # map "word type/word" to a list of (PosScore, NegScore) tuples
        sent_scores = collections.defaultdict(list)

        with open(os.path.join(DATA_DIR,
                  "SentiWordNet_3.0.0_20130122.txt"), "r") as csvfile:
            reader = csv.reader(csvfile, delimiter='\t', quotechar='"')
            for line in reader:
                if line[0].startswith("#"):
                    continue
                if len(line) == 1:
                    continue

                POS, ID, PosScore, NegScore, SynsetTerms, Gloss = line
                if len(POS) == 0 or len(ID) == 0:
                    continue

                for term in SynsetTerms.split(" "):
                    # drop the number at the end of every term
                    term = term.split("#")[0]
                    term = term.replace("-", " ").replace("_", " ")
                    key = "%s/%s" % (POS, term)
                    sent_scores[key].append((float(PosScore),
                                             float(NegScore)))

        for key, value in sent_scores.iteritems():
            sent_scores[key] = np.mean(value, axis=0)

        return sent_scores
Now we have everything in place to create our first vectorizer. The most convenient way to do this is to inherit from BaseEstimator, which requires us to implement the following three methods:
- get_feature_names(): This returns a list of strings of the features that we will return in transform().
- fit(documents, y=None): As we are not implementing a classifier, we can ignore this one and simply return self.
- transform(documents): This returns numpy.array(), containing an array of shape (len(documents), len(get_feature_names())). This means that for every document in documents, it has to return a value for every feature name in get_feature_names().

Let us now implement these methods:
    sent_word_net = load_sent_word_net()

    class LinguisticVectorizer(BaseEstimator):
        def get_feature_names(self):
            return np.array(['sent_neut', 'sent_pos', 'sent_neg',
                             'nouns', 'adjectives', 'verbs', 'adverbs',
                             'allcaps', 'exclamation', 'question',
                             'hashtag', 'mentioning'])

        # we don't fit here but need to return the reference
        # so that it can be used like fit(d).transform(d)
        def fit(self, documents, y=None):
            return self

        def _get_sentiments(self, d):
            sent = tuple(d.split())
            tagged = nltk.pos_tag(sent)

            pos_vals = []
            neg_vals = []

            nouns = 0.
            adjectives = 0.
            verbs = 0.
            adverbs = 0.

            for w, t in tagged:
                p, n = 0, 0
                sent_pos_type = None
                if t.startswith("NN"):
                    sent_pos_type = "n"
                    nouns += 1
                elif t.startswith("JJ"):
                    sent_pos_type = "a"
                    adjectives += 1
                elif t.startswith("VB"):
                    sent_pos_type = "v"
                    verbs += 1
                elif t.startswith("RB"):
                    sent_pos_type = "r"
                    adverbs += 1

                if sent_pos_type is not None:
                    sent_word = "%s/%s" % (sent_pos_type, w)
                    if sent_word in sent_word_net:
                        p, n = sent_word_net[sent_word]

                pos_vals.append(p)
                neg_vals.append(n)

            l = len(sent)
            avg_pos_val = np.mean(pos_vals)
            avg_neg_val = np.mean(neg_vals)

            return [1 - avg_pos_val - avg_neg_val,
                    avg_pos_val, avg_neg_val,
                    nouns / l, adjectives / l, verbs / l, adverbs / l]

        def transform(self, documents):
            obj_val, pos_val, neg_val, nouns, adjectives, \
                verbs, adverbs = np.array([self._get_sentiments(d)
                                           for d in documents]).T

            allcaps = []
            exclamation = []
            question = []
            hashtag = []
            mentioning = []

            for d in documents:
                allcaps.append(np.sum([t.isupper()
                                       for t in d.split() if len(t) > 2]))
                exclamation.append(d.count("!"))
                question.append(d.count("?"))
                hashtag.append(d.count("#"))
                mentioning.append(d.count("@"))

            result = np.array([obj_val, pos_val, neg_val, nouns,
                               adjectives, verbs, adverbs, allcaps,
                               exclamation, question, hashtag,
                               mentioning]).T

            return result
Nevertheless, using these linguistic features in isolation, without the words themselves, will not take us very far. Therefore, we have to combine TfidfVectorizer with the linguistic features. This can be done with scikit-learn's FeatureUnion class. It is initialized the same way as Pipeline, but instead of evaluating the estimators in sequence, each one passing the output of the previous one to the next, FeatureUnion evaluates them in parallel and joins the output vectors afterwards:
    def create_union_model(params=None):
        def preprocessor(tweet):
            tweet = tweet.lower()

            for k in emo_repl_order:
                tweet = tweet.replace(k, emo_repl[k])
            for r, repl in re_repl.iteritems():
                tweet = re.sub(r, repl, tweet)

            return tweet.replace("-", " ").replace("_", " ")

        tfidf_ngrams = TfidfVectorizer(preprocessor=preprocessor,
                                       analyzer="word")
        ling_stats = LinguisticVectorizer()
        all_features = FeatureUnion([('ling', ling_stats),
                                     ('tfidf', tfidf_ngrams)])
        clf = MultinomialNB()
        pipeline = Pipeline([('all', all_features), ('clf', clf)])

        if params:
            pipeline.set_params(**params)

        return pipeline
Training and testing on the combined featurizers gives another 0.6 percent improvement on positive versus negative:
    == Pos vs. neg ==
    0.808   0.016   0.892   0.010
    == Pos/neg vs. irrelevant/neutral ==
    0.794   0.009   0.707   0.033
    == Pos vs. rest ==
    0.886   0.006   0.533   0.026
    == Neg vs. rest ==
    0.881   0.012   0.629   0.037
With these results, we probably do not want to use the positive versus rest and negative versus rest classifiers. Instead, we can first use the classifier that determines whether the tweet contains sentiment at all ("pos/neg versus irrelevant/neutral") and then, when it does, use the positive versus negative classifier to determine the actual sentiment.
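This two-stage scheme can be sketched as follows; the two classifier arguments are stand-ins for the trained "pos/neg versus irrelevant/neutral" and "pos versus neg" models, and the function name two_stage_predict is our own:

```python
def two_stage_predict(tweet, sentiment_clf, polarity_clf):
    """Classify a tweet in two stages: first decide whether it carries
    any sentiment at all, and only then determine the polarity."""
    if not sentiment_clf(tweet):   # "pos/neg vs. irrelevant/neutral"
        return "neutral"
    if polarity_clf(tweet):        # "pos vs. neg"
        return "positive"
    return "negative"

# Toy stand-ins for the trained classifiers, just to show the control flow
has_sentiment = lambda t: "!" in t or "love" in t or "hate" in t
is_positive = lambda t: "love" in t

print(two_stage_predict("I love this book!", has_sentiment, is_positive))
print(two_stage_predict("The meeting is at 5pm", has_sentiment, is_positive))
```

In a real setup, both stand-ins would be replaced by calls to pipeline.predict() on the respective trained pipelines.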