Successfully cheating using SentiWordNet

While linguistic information, as mentioned in the preceding section, will most likely help us, there is something better we can use to harvest it: SentiWordNet (http://sentiwordnet.isti.cnr.it). Simply put, it is a 13 MB file that we have to download from the site, unzip, and put into the data directory of the Jupyter notebook. It assigns most English words a positive and a negative value: for every synonym set (synset), it records both the positive and the negative sentiment score. Some examples are as follows:

| POS | ID | PosScore | NegScore | SynsetTerms | Description |
| --- | --- | --- | --- | --- | --- |
| a | 00311354 | 0.25 | 0.125 | studious#1 | Marked by care and effort; made a studious attempt to fix the television set |
| a | 00311663 | 0 | 0.5 | careless#1 | Marked by lack of attention or consideration or forethought or thoroughness; not careful... |
| n | 03563710 | 0 | 0 | implant#1 | A prosthesis placed permanently in tissue |
| v | 00362128 | 0 | 0 | kink#2 curve#5 curl#1 | Form a curl, curve, or kink; the cigar smoke curled up at the ceiling |

With the information in the POS column, we will be able to distinguish between the noun book and the verb book. PosScore and NegScore together will help us to determine the neutrality of the word, which is 1-PosScore-NegScore. SynsetTerms lists all words in the set that are synonyms. We can safely ignore the ID and Description columns.
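The neutrality formula can be checked against the rows shown above. The following is only a minimal sketch; the function name neutrality() is an illustrative assumption and not part of SentiWordNet or of the chapter's code:

```python
def neutrality(pos_score, neg_score):
    # Neutrality as defined in the text: 1 - PosScore - NegScore
    return 1.0 - pos_score - neg_score

# (PosScore, NegScore) pairs copied from the example rows
examples = {
    "a/studious": (0.25, 0.125),
    "a/careless": (0.0, 0.5),
    "n/implant": (0.0, 0.0),
}

for key, (pos_score, neg_score) in examples.items():
    print(key, neutrality(pos_score, neg_score))
```

A word such as implant, with both scores at 0, comes out as fully neutral (1.0), whereas careless is only half neutral.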

The synset terms have a number appended, because some occur multiple times in different synsets. For example, fantasize conveys two different meanings, which also leads to different scores:

| POS | ID | PosScore | NegScore | SynsetTerms | Description |
| --- | --- | --- | --- | --- | --- |
| v | 01636859 | 0.375 | 0 | fantasize#2 fantasise#2 | Portray in the mind; he is fantasizing the ideal wife |
| v | 01637368 | 0 | 0.125 | fantasy#1 fantasize#1 fantasise#1 | Indulge in fantasies; he is fantasizing when he says he plans to start his own company |

To find out which of the synsets to take, we would need to really understand the meaning of the tweets, which is beyond the scope of this chapter. The field of research that focuses on this challenge is called word-sense disambiguation.

For our task, we take the easy route and simply average the scores over all the synsets in which a term is found. For fantasize, PosScore will be (0.375 + 0) / 2 = 0.1875 and NegScore will be (0 + 0.125) / 2 = 0.0625.
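To make the averaging concrete, here is a small sketch that applies it to the two fantasize rows only. The rows list and the average() helper are assumptions made for illustration; the scores themselves are copied from the table:

```python
from collections import defaultdict

# (POS, PosScore, NegScore, SynsetTerms) for the two fantasize synsets
rows = [
    ("v", 0.375, 0.0, "fantasize#2 fantasise#2"),
    ("v", 0.0, 0.125, "fantasy#1 fantasize#1 fantasise#1"),
]

# Collect one (PosScore, NegScore) pair per synset a term occurs in
scores = defaultdict(list)
for pos, pos_score, neg_score, terms in rows:
    for term in terms.split(" "):
        word = term.split("#")[0]  # drop the sense number
        scores["%s/%s" % (pos, word)].append((pos_score, neg_score))

def average(pairs):
    # Component-wise mean of a list of (PosScore, NegScore) pairs
    n = len(pairs)
    return (sum(p for p, _ in pairs) / n, sum(q for _, q in pairs) / n)

averaged = {key: average(pairs) for key, pairs in scores.items()}
print(averaged["v/fantasize"])  # (0.1875, 0.0625)
```

Note that fantasy occurs in only one of the two synsets, so its averaged scores are simply those of that synset.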

The following function, load_sent_word_net(), does all that for us and returns a dictionary where the keys are strings of the form word type/word form, for example, n/implant, and the values are the averaged positive and negative scores:

import collections
import csv
import os

import numpy as np


def load_sent_word_net():
    # making our life easier by using a dictionary that
    # automatically creates an empty list whenever we access
    # a not yet existing key
    sent_scores = collections.defaultdict(list)

    # DATA_DIR is assumed to be defined earlier in the notebook
    path = os.path.join(DATA_DIR, "SentiWordNet_3.0.0_20130122.txt")
    with open(path, "r") as csvfile:
        # the file is tab-separated
        reader = csv.reader(csvfile, delimiter='\t', quotechar='"')
        for line in reader:
            # skip blank lines, comment lines, and lines without fields
            if not line or line[0].startswith("#"):
                continue
            if len(line) == 1:
                continue

            POS, ID, PosScore, NegScore, SynsetTerms, Gloss = line
            if len(POS) == 0 or len(ID) == 0:
                continue
            for term in SynsetTerms.split(" "):
                # drop number at the end of every term
                term = term.split("#")[0]
                term = term.replace("-", " ").replace("_", " ")
                key = "%s/%s" % (POS, term)
                sent_scores[key].append((float(PosScore),
                                         float(NegScore)))

    # average the scores over all synsets in which a term occurs
    for key, value in sent_scores.items():
        sent_scores[key] = np.mean(value, axis=0)

    return sent_scores
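Once the dictionary is loaded, looking up a word is a plain dictionary access on the word type/word form key. The sketch below uses a small hand-built stand-in for the returned dictionary (with tuples instead of NumPy arrays) so that it runs without the SentiWordNet file. The helper lookup_sentiment() and the choice to treat unknown words as (0.0, 0.0) are assumptions for illustration, not part of the chapter's code:

```python
# Hand-built stand-in for the result of load_sent_word_net(),
# using scores that appear in the text
sent_scores = {
    "v/fantasize": (0.1875, 0.0625),
    "n/implant": (0.0, 0.0),
}

def lookup_sentiment(scores, pos, word):
    # Hypothetical helper: return (PosScore, NegScore) for a word,
    # or (0.0, 0.0) if the key is not in the dictionary (assumption)
    pos_score, neg_score = scores.get("%s/%s" % (pos, word), (0.0, 0.0))
    return float(pos_score), float(neg_score)

print(lookup_sentiment(sent_scores, "v", "fantasize"))
print(lookup_sentiment(sent_scores, "n", "unknownword"))
```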