Our first estimator

Now we have everything in place to create our first vectorizer. The most convenient way is to inherit from BaseEstimator, which requires us to implement the following three methods:

  • get_feature_names(): This returns a list of strings of the names of the features that we will return in transform().
  • fit(documents, y=None): As we are not implementing a classifier, we can ignore this one and simply return self.
  • transform(documents): This returns numpy.array(), containing an array of shape (len(documents), len(get_feature_names())). This means that for every document in documents, it has to return a value for every feature name in get_feature_names().

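Before looking at the real implementation, the three-method contract can be sketched with a minimal toy transformer. DocLengthVectorizer and its single doc_len feature are made up purely for illustration and are not part of our vectorizer:

```python
import numpy as np
from sklearn.base import BaseEstimator

class DocLengthVectorizer(BaseEstimator):
    def get_feature_names(self):
        # names of the columns that transform() will produce
        return np.array(['doc_len'])

    def fit(self, documents, y=None):
        # nothing to learn; return self so that fit(d).transform(d) chains
        return self

    def transform(self, documents):
        # shape: (len(documents), len(get_feature_names()))
        return np.array([[len(d.split())] for d in documents])

docs = ["a b c", "a b"]
X = DocLengthVectorizer().fit(docs).transform(docs)
```

Here, X is a (2, 1) array holding one token count per document; our real vectorizer will follow exactly the same pattern, just with twelve feature columns.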
Here is the implementation:

import numpy as np
import nltk
from sklearn.base import BaseEstimator

sent_word_net = load_sent_word_net()

class LinguisticVectorizer(BaseEstimator):
    def get_feature_names(self):
        return np.array(['sent_neut', 'sent_pos', 'sent_neg',
                         'nouns', 'adjectives', 'verbs', 'adverbs',
                         'allcaps', 'exclamation', 'question', 'hashtag',
                         'mentioning'])

    # we don't fit here but need to return the reference
    # so that it can be used like fit(d).transform(d)
    def fit(self, documents, y=None):
        return self

    def _get_sentiments(self, d):
        sent = tuple(d.split())
        tagged = nltk.pos_tag(sent)

        pos_vals = []
        neg_vals = []

        nouns = 0.
        adjectives = 0.
        verbs = 0.
        adverbs = 0.

        for w, t in tagged:
            p, n = 0, 0
            sent_pos_type = None
            if t.startswith("NN"):
                sent_pos_type = "n"
                nouns += 1
            elif t.startswith("JJ"):
                sent_pos_type = "a"
                adjectives += 1
            elif t.startswith("VB"):
                sent_pos_type = "v"
                verbs += 1
            elif t.startswith("RB"):
                sent_pos_type = "r"
                adverbs += 1

            if sent_pos_type is not None:
                # SentiWordNet keys are of the form "POS type/word"
                sent_word = "%s/%s" % (sent_pos_type, w)

                if sent_word in sent_word_net:
                    p, n = sent_word_net[sent_word]

            pos_vals.append(p)
            neg_vals.append(n)

        l = len(sent)
        avg_pos_val = np.mean(pos_vals)
        avg_neg_val = np.mean(neg_vals)
        return [1 - avg_pos_val - avg_neg_val, avg_pos_val, avg_neg_val,
                nouns / l, adjectives / l, verbs / l, adverbs / l]

    def transform(self, documents):
        (obj_val, pos_val, neg_val, nouns, adjectives,
         verbs, adverbs) = np.array([self._get_sentiments(d)
                                     for d in documents]).T

        allcaps = []
        exclamation = []
        question = []
        hashtag = []
        mentioning = []

        for d in documents:
            allcaps.append(np.sum([t.isupper()
                                   for t in d.split() if len(t) > 2]))

            exclamation.append(d.count("!"))
            question.append(d.count("?"))
            hashtag.append(d.count("#"))
            mentioning.append(d.count("@"))

        result = np.array([obj_val, pos_val, neg_val, nouns, adjectives,
                           verbs, adverbs, allcaps, exclamation,
                           question, hashtag, mentioning]).T

        return result
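Because LinguisticVectorizer follows scikit-learn's transformer contract, it plugs directly into FeatureUnion, which concatenates the outputs of several transformers column-wise. The sketch below uses a hypothetical PunctVectorizer as a lightweight stand-in for our vectorizer (so it runs without NLTK and SentiWordNet) next to a stock TfidfVectorizer:

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for LinguisticVectorizer: only counts
# '!' and '?' per document, but obeys the same transformer contract.
class PunctVectorizer(BaseEstimator):
    def get_feature_names(self):
        return np.array(['exclamation', 'question'])

    def fit(self, documents, y=None):
        return self

    def transform(self, documents):
        return np.array([[d.count("!"), d.count("?")]
                         for d in documents])

union = FeatureUnion([("tfidf", TfidfVectorizer()),
                      ("punct", PunctVectorizer())])
X = union.fit_transform(["great movie!", "was it good?"])
```

X now has one row per document and the TF-IDF columns followed by the two punctuation columns, so the combined matrix can be fed to any classifier in place of the plain TF-IDF features.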