Our first estimator

Now we have everything in place to create our first vectorizer. The most convenient way is to inherit from BaseEstimator, which requires us to implement the following three methods:

  • get_feature_names(): This returns a list of strings of the names of the features that we will return in transform().
  • fit(documents, y=None): As we are not implementing a classifier, we can ignore this one and simply return self.
  • transform(documents): This returns numpy.array(), containing an array of shape (len(documents), len(get_feature_names())). This means that for every document in documents, it has to return a value for every feature name in get_feature_names().

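Before looking at the real implementation, the three-method contract can be sketched with a minimal toy transformer. DocLengthVectorizer and its single doc_len feature are made up purely for illustration and are not part of our vectorizer:

```python
import numpy as np
from sklearn.base import BaseEstimator

class DocLengthVectorizer(BaseEstimator):
    def get_feature_names(self):
        # names of the columns that transform() will produce
        return np.array(['doc_len'])

    def fit(self, documents, y=None):
        # nothing to learn; return self so that fit(d).transform(d) chains
        return self

    def transform(self, documents):
        # shape: (len(documents), len(get_feature_names()))
        return np.array([[len(d.split())] for d in documents])

docs = ["a b c", "a b"]
X = DocLengthVectorizer().fit(docs).transform(docs)
```

Here, X is a (2, 1) array holding one token count per document; our real vectorizer will follow exactly the same pattern, just with twelve feature columns.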
Here is the implementation:

import numpy as np
import nltk
from sklearn.base import BaseEstimator

sent_word_net = load_sent_word_net()

class LinguisticVectorizer(BaseEstimator):
    def get_feature_names(self):
        return np.array(['sent_neut', 'sent_pos', 'sent_neg',
                         'nouns', 'adjectives', 'verbs', 'adverbs',
                         'allcaps', 'exclamation', 'question', 'hashtag',
                         'mentioning'])

    # we don't fit here but need to return the reference
    # so that it can be used like fit(d).transform(d)
    def fit(self, documents, y=None):
        return self

    def _get_sentiments(self, d):
        sent = tuple(d.split())
        tagged = nltk.pos_tag(sent)

        pos_vals = []
        neg_vals = []

        nouns = 0.
        adjectives = 0.
        verbs = 0.
        adverbs = 0.

        for w, t in tagged:
            p, n = 0, 0
            sent_pos_type = None
            if t.startswith("NN"):
                sent_pos_type = "n"
                nouns += 1
            elif t.startswith("JJ"):
                sent_pos_type = "a"
                adjectives += 1
            elif t.startswith("VB"):
                sent_pos_type = "v"
                verbs += 1
            elif t.startswith("RB"):
                sent_pos_type = "r"
                adverbs += 1

            if sent_pos_type is not None:
                # SentiWordNet keys are of the form "POS type/word"
                sent_word = "%s/%s" % (sent_pos_type, w)

                if sent_word in sent_word_net:
                    p, n = sent_word_net[sent_word]

            pos_vals.append(p)
            neg_vals.append(n)

        l = len(sent)
        avg_pos_val = np.mean(pos_vals)
        avg_neg_val = np.mean(neg_vals)
        return [1 - avg_pos_val - avg_neg_val, avg_pos_val, avg_neg_val,
                nouns / l, adjectives / l, verbs / l, adverbs / l]

    def transform(self, documents):
        (obj_val, pos_val, neg_val, nouns, adjectives,
         verbs, adverbs) = np.array([self._get_sentiments(d)
                                     for d in documents]).T

        allcaps = []
        exclamation = []
        question = []
        hashtag = []
        mentioning = []

        for d in documents:
            allcaps.append(np.sum([t.isupper()
                                   for t in d.split() if len(t) > 2]))

            exclamation.append(d.count("!"))
            question.append(d.count("?"))
            hashtag.append(d.count("#"))
            mentioning.append(d.count("@"))

        result = np.array([obj_val, pos_val, neg_val, nouns, adjectives,
                           verbs, adverbs, allcaps, exclamation,
                           question, hashtag, mentioning]).T

        return result
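Because LinguisticVectorizer follows scikit-learn's transformer contract, it plugs directly into FeatureUnion, which concatenates the outputs of several transformers column-wise. The sketch below uses a hypothetical PunctVectorizer as a lightweight stand-in for our vectorizer (so it runs without NLTK and SentiWordNet) next to a stock TfidfVectorizer:

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for LinguisticVectorizer: only counts
# '!' and '?' per document, but obeys the same transformer contract.
class PunctVectorizer(BaseEstimator):
    def get_feature_names(self):
        return np.array(['exclamation', 'question'])

    def fit(self, documents, y=None):
        return self

    def transform(self, documents):
        return np.array([[d.count("!"), d.count("?")]
                         for d in documents])

union = FeatureUnion([("tfidf", TfidfVectorizer()),
                      ("punct", PunctVectorizer())])
X = union.fit_transform(["great movie!", "was it good?"])
```

X now has one row per document and the TF-IDF columns followed by the two punctuation columns, so the combined matrix can be fed to any classifier in place of the plain TF-IDF features.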