Chapter 7. Sentiment Analysis – I Am Happy

Sentiment analysis is one of the key tasks in NLP. It is defined as the process of determining the sentiment behind a character sequence. It may be used to determine whether the speaker or the person expressing the textual thoughts is in a happy or sad mood, or is expressing a neutral sentiment.

This chapter will include the following topics:

  • Introducing sentiment analysis
  • Sentiment analysis using NER
  • Sentiment analysis using machine learning
  • Evaluation of the NER system

Introducing sentiment analysis

Sentiment analysis may be defined as a task performed on natural language text. Here, computations are performed on the sentences or words expressed in natural language to determine whether they express a positive, negative, or neutral sentiment. Sentiment analysis is a subjective task, since it provides information about the opinion being expressed in the text. It can be framed as a classification problem, in which the classification may be of two types—binary categorization (positive or negative) and multi-class categorization (positive, negative, or neutral). Sentiment analysis is also referred to as text sentiment analysis. It is a text mining approach in which we determine the sentiments or emotions behind the text. When we combine sentiment analysis with topic mining, it is referred to as topic-sentiment analysis. Sentiment analysis can be performed using a lexicon, which may be domain-specific or general purpose in nature. A lexicon may contain lists of positive expressions, negative expressions, neutral expressions, and stop words. When a test sentence arrives, a simple lookup operation can be performed against this lexicon.
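For instance, a minimal lexicon-lookup scorer can be sketched as follows; the tiny positive, negative, and stop word lists used here are hypothetical placeholders rather than a real lexicon:

positive_words = {'good', 'great', 'happy', 'excellent', 'love'}
negative_words = {'bad', 'terrible', 'sad', 'awful', 'hate'}
stop_words = {'a', 'an', 'the', 'is', 'this'}

def lexicon_sentiment(sentence):
    # Drop stop words, then count positive and negative hits in the lexicon.
    tokens = [w.lower() for w in sentence.split() if w.lower() not in stop_words]
    score = sum((w in positive_words) - (w in negative_words) for w in tokens)
    if score > 0:
        return 'positive'
    elif score < 0:
        return 'negative'
    return 'neutral'

print(lexicon_sentiment("This book is really good"))   # positive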

One example of such a word list is the Affective Norms for English Words (ANEW). It is an English word list developed at the University of Florida by Bradley and Lang. It consists of 1,034 words rated on valence, arousal, and dominance. The word list was constructed for academic research purposes rather than commercial use. Other variants are DANEW (Dutch ANEW) and SPANEW (Spanish ANEW).

AFINN consists of 2,477 words (an earlier version had 1,468 words). This word list was created by Finn Årup Nielsen. Its main purpose is to perform sentiment analysis of Twitter texts. A valence value ranging from -5 to +5 is allotted to each word.

The Balance Affective word list consists of 277 English words. The valence code ranges from 1 to 4. 1 means positive, 2 means negative, 3 means anxious, and 4 means neutral.

Berlin Affective Word List (BAWL) consists of 2,200 words in German. Another version of BAWL is the Berlin Affective Word List Reloaded (BAWL-R), which additionally includes arousal ratings for the words.

Bilingual Finnish Affective Norms comprises 210 nouns in both British English and Finnish, including some taboo words.

Compass DeRose Guide to Emotion Words consists of emotional words in English. It was created by Steve J. DeRose. The words are classified into categories, but no valence or arousal ratings are provided.

Dictionary of Affect in Language (DAL) comprises emotional words that can be used for sentiment analysis. It was formed by Cynthia M. Whissell. So, it is also referred to as Whissell's Dictionary of Affect in Language (WDAL).

General Inquirer consists of many dictionaries. In this, the positive list comprises 1915 words and the negative list comprises 2291 words.

Hu-Liu opinion Lexicon (HL) comprises a list of 6800 words (positive and negative).

Leipzig Affective Norms for German (LANG) is a list that consists of 1000 nouns in German, and the rating has been done based on valence, concreteness, and arousal.

Loughran and McDonald Financial Sentiment Dictionaries were created by Tim Loughran and Bill McDonald. These dictionaries consist of positive, negative, and modal words drawn from financial documents.

Moors consists of a list of Dutch words rated for dominance, arousal, and valence.

NRC Emotion Lexicon comprises a list of words with emotion associations, developed through Amazon Mechanical Turk by Saif M. Mohammad.

OpinionFinder's Subjectivity Lexicon comprises a list of 8221 words (positive or negative).

SentiSense comprises 2,190 synsets and 5,496 words based on 14 emotional categories.

Warriner comprises 13,915 English words collected through Amazon Mechanical Turk and rated for dominance, arousal, and valence.

labMT is a word list consisting of 10,000 words.

Let's consider the following example in NLTK, which performs sentiment analysis for movie reviews:

import nltk
import random
from nltk.corpus import movie_reviews
docs = [(list(movie_reviews.words(fid)), cat)
        for cat in movie_reviews.categories()
        for fid in movie_reviews.fileids(cat)]
random.shuffle(docs)

all_tokens = nltk.FreqDist(x.lower() for x in movie_reviews.words())
token_features = [word for word, freq in all_tokens.most_common(2000)]
print(token_features[:100])
    [',', 'the', '.', 'a', 'and', 'of', 'to', "'", 'is', 'in', 's', '"', 'it', 'that', '-', ')', '(', 'as', 'with', 'for', 'his', 'this', 'film', 'i', 'he', 'but', 'on', 'are', 't', 'by', 'be', 'one', 'movie', 'an', 'who', 'not', 'you', 'from', 'at', 'was', 'have', 'they', 'has', 'her', 'all', '?', 'there', 'like', 'so', 'out', 'about', 'up', 'more', 'what', 'when', 'which', 'or', 'she', 'their', ':', 'some', 'just', 'can', 'if', 'we', 'him', 'into', 'even', 'only', 'than', 'no', 'good', 'time', 'most', 'its', 'will', 'story', 'would', 'been', 'much', 'character', 'also', 'get', 'other', 'do', 'two', 'well', 'them', 'very', 'characters', ';', 'first', '--', 'after', 'see', '!', 'way', 'because', 'make', 'life']

def doc_features(doc):
    doc_words = set(doc)
    features = {}
    for word in token_features:
        features['contains(%s)' % word] = (word in doc_words)
    return features

print(doc_features(movie_reviews.words('pos/cv957_8737.txt')))
feature_sets = [(doc_features(d), c) for (d, c) in docs]
train_sets, test_sets = feature_sets[100:], feature_sets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_sets)
print(nltk.classify.accuracy(classifier, test_sets))

    0.86

classifier.show_most_informative_features(5)

     Most Informative Features
contains(damon) = True              pos : neg    =     11.2 : 1.0
contains(outstanding) = True        pos : neg    =     10.6 : 1.0
contains(mulan) = True              pos : neg    =      8.8 : 1.0
contains(seagal) = True             neg : pos    =      8.4 : 1.0
contains(wonderfully) = True        pos : neg    =      7.4 : 1.0

Here, the classifier checks whether each of these informative features is present in a document or not.
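Once trained, the same feature extractor can be reused to classify a fresh, unlabeled review; the review text below is made up purely for illustration:

new_review = "an outstanding film with wonderfully drawn characters".split()
print(classifier.classify(doc_features(new_review)))
# Prints 'pos' or 'neg', depending on the learned feature weights.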

Consider another example of sentiment analysis. First, preprocessing of the text is performed. In this step, individual sentences are identified in the given text. Then, tokens are identified within the sentences. Each token further comprises three entities, namely, the word, the lemma, and the tag.

Let's see the following code in NLTK for the preprocessing of text:

import nltk

class Splitter(object):
    def __init__(self):
        self.nltk_splitter = nltk.data.load('tokenizers/punkt/english.pickle')
        self.nltk_tokenizer = nltk.tokenize.TreebankWordTokenizer()

    def split(self, text):
        # Split raw text into sentences, then split each sentence into tokens.
        sentences = self.nltk_splitter.tokenize(text)
        tokenized_sentences = [self.nltk_tokenizer.tokenize(sent) for sent in sentences]
        return tokenized_sentences


class POSTagger(object):
    def __init__(self):
        pass

    def pos_tag(self, sentences):
        # Tag each tokenized sentence and build (word, lemma, [postag]) triples;
        # the lemma is simply the word form itself here.
        pos = [nltk.pos_tag(sentence) for sentence in sentences]
        pos = [[(word, word, [postag]) for (word, postag) in sentence] for sentence in pos]
        return pos

The lemmas generated will be the same as the word forms, and the tags are the POS tags. Consider the following code, which generates a three-element tuple for each token, that is, the word, the lemma, and the POS tag:

text = """Why are you looking disappointed. We will go to restaurant for dinner."""
splitter = Splitter()
postagger = POSTagger()
splitted_sentences = splitter.split(text)
print(splitted_sentences)
[['Why','are','you','looking','disappointed','.'], ['We','will','go','to','restaurant','for','dinner','.']]

pos_tagged_sentences = postagger.pos_tag(splitted_sentences)

print(pos_tagged_sentences)
[[('Why','Why',['WP']),('are','are',['VBZ']),('you','you',['PRP']),('looking','looking',['VB']),('disappointed','disappointed',['VB']),('.','.',['.'])],[('We','We',['PRP']),('will','will',['VBZ']),('go','go',['VB']),('to','to',['TO']),('restaurant','restaurant',['NN']),('for','for',['IN']),('dinner','dinner',['NN']),('.','.',['.'])]]

We can construct two kinds of dictionaries, one consisting of positive expressions and the other of negative expressions. We can then tag our preprocessed text using these dictionaries.

Let's consider the following NLTK code for tagging using dictionaries:

import yaml

class DictionaryTagger(object):
    def __init__(self, dictionary_paths):
        files = [open(path, 'r') for path in dictionary_paths]
        dictionaries = [yaml.safe_load(dict_file) for dict_file in files]
        for dict_file in files:
            dict_file.close()
        self.dictionary = {}
        self.max_key_size = 0
        for curr_dict in dictionaries:
            for key in curr_dict:
                if key in self.dictionary:
                    self.dictionary[key].extend(curr_dict[key])
                else:
                    self.dictionary[key] = curr_dict[key]
                    self.max_key_size = max(self.max_key_size, len(key))

    def tag(self, postagged_sentences):
        return [self.tag_sentence(sentence) for sentence in postagged_sentences]

    def tag_sentence(self, sentence, tag_with_lemmas=False):
        tag_sentence = []
        N = len(sentence)
        if self.max_key_size == 0:
            self.max_key_size = N
        i = 0
        while i < N:
            j = min(i + self.max_key_size, N)  # avoid overflow
            tagged = False
            while j > i:
                expression_form = ' '.join([word[0] for word in sentence[i:j]]).lower()
                expression_lemma = ' '.join([word[1] for word in sentence[i:j]]).lower()
                if tag_with_lemmas:
                    literal = expression_lemma
                else:
                    literal = expression_form
                if literal in self.dictionary:
                    is_single_token = j - i == 1
                    original_position = i
                    i = j
                    taggings = [tag for tag in self.dictionary[literal]]
                    tagged_expression = (expression_form, expression_lemma, taggings)
                    if is_single_token:
                        # If the tagged literal is a single token, conserve its previous taggings.
                        original_token_tagging = sentence[original_position][2]
                        tagged_expression[2].extend(original_token_tagging)
                    tag_sentence.append(tagged_expression)
                    tagged = True
                else:
                    j = j - 1
            if not tagged:
                tag_sentence.append(sentence[i])
                i += 1
        return tag_sentence

Here, words in the preprocessed text are tagged as positive or negative with the help of dictionaries.
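The dictionary files passed to DictionaryTagger are plain YAML mappings from an expression to a list of tags. The file names and entries below are hypothetical examples of the expected format, shown only to make the class usable:

# positive.yml (hypothetical example contents)
#   nice: [positive]
#   wonderful: [positive]
#   very good: [positive]
#
# negative.yml (hypothetical example contents)
#   disappointed: [negative]
#   bad: [negative]

dicttagger = DictionaryTagger(['positive.yml', 'negative.yml'])
dict_tagged_sentences = dicttagger.tag(pos_tagged_sentences)
print(dict_tagged_sentences)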

Let's see the following code in NLTK, which can be used to compute the number of positive expressions and negative expressions:

def value_of(sentiment):
    if sentiment == 'positive': return 1
    if sentiment == 'negative': return -1
    return 0

def sentiment_score(review):
    return sum([value_of(tag) for sentence in review
                for token in sentence for tag in token[2]])
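With the dictionary tagger instantiated as above, the overall polarity of the preprocessed example text can then be computed; a positive total suggests an overall positive sentiment and a negative total the opposite:

dict_tagged_sentences = dicttagger.tag(pos_tagged_sentences)
print(sentiment_score(dict_tagged_sentences))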

The nltk.sentiment.util module in NLTK can perform sentiment analysis using the Hu-Liu lexicon. It counts the number of positive, negative, and neutral expressions with the help of the lexicon, and then decides on the basis of the majority count whether the text carries a positive, negative, or neutral sentiment. Words that are not available in the lexicon are considered neutral.
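For example, the demo_liu_hu_lexicon() function in nltk.sentiment.util performs exactly this majority count over the Hu-Liu opinion lexicon; the sentence below is just an illustrative input:

import nltk
from nltk.sentiment.util import demo_liu_hu_lexicon

# The Hu-Liu word lists ship as the 'opinion_lexicon' corpus and must be downloaded once.
nltk.download('opinion_lexicon')

demo_liu_hu_lexicon("This book on sentiment analysis is simply wonderful")
# Prints 'Positive', 'Negative', or 'Neutral', based on the majority count of lexicon words.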

Sentiment analysis using NER

NER is the process of finding named entities and then categorizing named entities into different named entity classes. NER can be performed using different techniques, such as the Rule-based approach, List look up approach, and Statistical approaches (Hidden Markov Model, Maximum Entropy Markov Model, Support Vector Machine, Conditional Random Fields, and Decision Trees).

Once named entities have been identified, they may be removed or filtered out from the sentences. Similarly, stop words may also be removed. Sentiment analysis may then be performed on the remaining words, since named entities are words that do not contribute to the sentiment of a sentence.
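A simple way to sketch this filtering step with NLTK is to chunk named entities using nltk.ne_chunk() and drop them, along with stopwords, before a lexicon or classifier is applied (the punkt, averaged_perceptron_tagger, maxent_ne_chunker, words, and stopwords resources need to be downloaded once); the example sentence is invented for illustration:

import nltk
from nltk.corpus import stopwords

def strip_entities_and_stopwords(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)
    stops = set(stopwords.words('english'))
    kept = []
    for node in tree:
        # Subtrees correspond to named entities (PERSON, ORGANIZATION, GPE, and so on).
        if isinstance(node, nltk.Tree):
            continue
        word, tag = node
        if word.lower() not in stops and word.isalpha():
            kept.append(word.lower())
    return kept

print(strip_entities_and_stopwords("John was very happy with the service in London"))
# ['happy', 'service'] or similar, ready for sentiment scoring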

Sentiment analysis using machine learning

The nltk.sentiment.sentiment_analyzer module in NLTK is used to perform sentiment analysis. It is based on machine learning techniques.

Let's see the following code of the nltk.sentiment.sentiment_analyzer module in NLTK:

from __future__ import print_function
from collections import defaultdict

from nltk.classify.util import apply_features, accuracy as eval_accuracy
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import (BigramAssocMeasures, precision as eval_precision,
    recall as eval_recall, f_measure as eval_f_measure)

from nltk.probability import FreqDist

from nltk.sentiment.util import save_file, timer
class SentimentAnalyzer(object):
    """
    A tool for Sentiment Analysis which is based on machine learning techniques.
    """
    def __init__(self, classifier=None):
        self.feat_extractors = defaultdict(list)
        self.classifier = classifier

Consider the following code, which will return all the words (duplicates) from a text:

    def all_words(self, documents, labeled=None):
        all_words = []
        if labeled is None:
            labeled = documents and isinstance(documents[0], tuple)
        if labeled == True:
            for words, sentiment in documents:
                all_words.extend(words)
        elif labeled == False:
            for words in documents:
                all_words.extend(words)
        return all_words

Consider the following code, which will apply the feature extraction function to the text:

    def apply_features(self, documents, labeled=None):
        return apply_features(self.extract_features, documents, labeled)

Consider the following code, which will return the word's features:

    def unigram_word_feats(self, words, top_n=None, min_freq=0):
        unigram_feats_freqs = FreqDist(word for word in words)
        return [w for w, f in unigram_feats_freqs.most_common(top_n)
                if unigram_feats_freqs[w] > min_freq]

The following code returns the bigram features:

    def bigram_collocation_feats(self, documents, top_n=None, min_freq=3,
                                 assoc_measure=BigramAssocMeasures.pmi):
        finder = BigramCollocationFinder.from_documents(documents)
        finder.apply_freq_filter(min_freq)
        return finder.nbest(assoc_measure, top_n)

Let's see the following code, which can be used to classify a given instance using the available feature set:

    def classify(self, instance):
        instance_feats = self.apply_features([instance], labeled=False)
        return self.classifier.classify(instance_feats[0])

Let's see the following code, which can be used for the extraction of features from the text:

    def add_feat_extractor(self, function, **kwargs):
        self.feat_extractors[function].append(kwargs)

    def extract_features(self, document):
        all_features = {}
        for extractor in self.feat_extractors:
            for param_set in self.feat_extractors[extractor]:
                feats = extractor(document, **param_set)
            all_features.update(feats)
        return all_features

Let's see the following code that can be used to perform training on the training set. The save_classifier argument is used to save the trained classifier to a file:

    def train(self, trainer, training_set, save_classifier=None, **kwargs):
        print("Training classifier")
        self.classifier = trainer(training_set, **kwargs)
        if save_classifier:
            save_file(self.classifier, save_classifier)

        return self.classifier

Let's see the following code that can be used to perform testing and performance evaluation of our classifier using test data:

    def evaluate(self, test_set, classifier=None, accuracy=True, f_measure=True,
                 precision=True, recall=True, verbose=False):
        if classifier is None:
            classifier = self.classifier
        print("Evaluating {0} results...".format(type(classifier).__name__))
        metrics_results = {}
        if accuracy == True:
            accuracy_score = eval_accuracy(classifier, test_set)
            metrics_results['Accuracy'] = accuracy_score

        gold_results = defaultdict(set)
        test_results = defaultdict(set)
        labels = set()
        for i, (feats, label) in enumerate(test_set):
            labels.add(label)
            gold_results[label].add(i)
            observed = classifier.classify(feats)
            test_results[observed].add(i)

        for label in labels:
            if precision == True:
                precision_score = eval_precision(gold_results[label],
                    test_results[label])
                metrics_results['Precision [{0}]'.format(label)] = precision_score
            if recall == True:
                recall_score = eval_recall(gold_results[label],
                    test_results[label])
                metrics_results['Recall [{0}]'.format(label)] = recall_score
            if f_measure == True:
                f_measure_score = eval_f_measure(gold_results[label],
                    test_results[label])
                metrics_results['F-measure [{0}]'.format(label)] = f_measure_score

        if verbose == True:
            for result in sorted(metrics_results):
                print('{0}: {1}'.format(result, metrics_results[result]))

        return metrics_results
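A typical way to drive this class, sketched here on the movie_reviews corpus with a Naive Bayes trainer, follows the usage pattern documented for nltk.sentiment; the split sizes are arbitrary and chosen only for illustration:

import random

from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import extract_unigram_feats

# Label each document with its category and make an arbitrary train/test split.
docs = [(list(movie_reviews.words(fid)), cat)
        for cat in movie_reviews.categories()
        for fid in movie_reviews.fileids(cat)]
random.shuffle(docs)
train_docs, test_docs = docs[:1600], docs[1600:]

analyzer = SentimentAnalyzer()
unigram_feats = analyzer.unigram_word_feats(analyzer.all_words(train_docs), min_freq=4)
analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

training_set = analyzer.apply_features(train_docs)
test_set = analyzer.apply_features(test_docs)

classifier = analyzer.train(NaiveBayesClassifier.train, training_set)
for metric, value in sorted(analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(metric, value))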

Twitter can be considered one of the most popular microblogging services; it is used to create messages referred to as tweets. These tweets comprise words that express positive, negative, or neutral sentiments.

For performing sentiment analysis, we can use machine learning classifiers, statistical classifiers, or automated classifiers, such as the Naive Bayes Classifier, Maximum Entropy Classifier, Support Vector Machine Classifier, and so on.

These machine learning classifiers or automated classifiers are used to perform supervised classification, since they require training data for classification.

Let's see the following code in NLTK for feature extraction:

import re

stopWords = []

# If there is an occurrence of two or more of the same character, replace it with two occurrences.
def replaceTwoOrMore(s):
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    return pattern.sub(r"\1\1", s)

def getStopWordList(stopWordListFileName):
    # This function reads the stopwords from a file and builds a list.
    stopWords = []
    stopWords.append('AT_USER')
    stopWords.append('URL')

    fp = open(stopWordListFileName, 'r')
    line = fp.readline()
    while line:
        word = line.strip()
        stopWords.append(word)
        line = fp.readline()
    fp.close()
    return stopWords

def getFeatureVector(tweet):
    featureVector = []
    # Tweets are first split into words.
    words = tweet.split()
    for w in words:
        # Replace two or more repetitions of a character with two occurrences.
        w = replaceTwoOrMore(w)
        # Strip punctuation.
        w = w.strip('\'"?,.')
        # Check whether the word begins with an alphabetic character.
        val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w)
        # If it is a stop word or not a valid word, it is ignored.
        if w in stopWords or val is None:
            continue
        else:
            featureVector.append(w.lower())
    return featureVector
#end

# Tweets are read one by one and then processed.
# processTweet() is assumed to be defined elsewhere; it normalizes a raw tweet
# (for example, lowercasing it and replacing URLs and @usernames with placeholders).
fp = open('data/sampleTweets.txt', 'r')
line = fp.readline()

stopWords = getStopWordList('data/feature_list/stopwords.txt')

while line:
    processedTweet = processTweet(line)
    featureVector = getFeatureVector(processedTweet)
    print(featureVector)
    line = fp.readline()
#end loop
fp.close()

import csv

# Labeled tweets are read one by one from a CSV file and then processed.
inpTweets = csv.reader(open('data/sampleTweets.csv', 'r'), delimiter=',', quotechar='|')
tweets = []
for row in inpTweets:
    sentiment = row[0]
    tweet = row[1]
    processedTweet = processTweet(tweet)
    featureVector = getFeatureVector(processedTweet)
    tweets.append((featureVector, sentiment))

# Feature extraction takes place using the following method.
def extract_features(tweet):
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features

During the training of a classifier, the input to the machine learning algorithm is a label and features. Features are obtained from the feature extractor when the input is given to the feature extractor. During prediction, a label is provided as an output of a classifier model and the input of the classifier model is the features that are obtained using the feature extractor. Let's have a look at a diagram explaining the same process:

Figure: Sentiment analysis using machine learning
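Before a classifier can be trained, the tweet feature vectors collected above have to be turned into an NLTK training set. The following is a hedged sketch of that glue code, assuming the tweets list built earlier; featureList here is simply the vocabulary gathered from all training tweets:

import nltk

# featureList is the vocabulary gathered from every training tweet.
featureList = []
for (words, sentiment) in tweets:
    featureList.extend(words)
featureList = list(set(featureList))

# apply_features lazily maps extract_features over each (words, sentiment) pair,
# producing the training_set expected by the classifiers below.
training_set = nltk.classify.apply_features(extract_features, tweets)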

Now, let's see the following code that can be used to perform sentiment analysis using the Naive Bayes Classifier:

NaiveBClassifier = nltk.NaiveBayesClassifier.train(training_set)

# Testing the classifier
testTweet = 'I liked this book on Sentiment Analysis a lot.'
processedTestTweet = processTweet(testTweet)
print(NaiveBClassifier.classify(extract_features(getFeatureVector(processedTestTweet))))

testTweet = 'I am so badly hurt'
processedTestTweet = processTweet(testTweet)
print(NaiveBClassifier.classify(extract_features(getFeatureVector(processedTestTweet))))

Let's see the following code on sentiment analysis using maximum entropy:

MaxEntClassifier = nltk.classify.maxent.MaxentClassifier.train(training_set, 'GIS',
                    trace=3, encoding=None, labels=None, sparse=True,
                    gaussian_prior_sigma=0, max_iter=10)
testTweet = 'I liked the book on sentiment analysis a lot'
processedTestTweet = processTweet(testTweet)
print(MaxEntClassifier.classify(extract_features(getFeatureVector(processedTestTweet))))
MaxEntClassifier.show_most_informative_features(10)

Evaluation of the NER system

Evaluation with performance metrics shows how well an NER system performs. The output of an NER tagger may be defined as the response, and the human interpretation as the answer key. With these terms, we can provide the following definitions:

  • Correct: If the response is exactly the same as the answer key
  • Incorrect: If the response is not the same as the answer key
  • Missing: If the answer key is found tagged, but the response is not tagged
  • Spurious: If the response is found tagged, but the answer key is not tagged

The performance of an NER-based system can be judged by using the following parameters; a small worked sketch follows the list:

  • Precision (P): P=Correct/(Correct+Incorrect+Missing)
  • Recall (R): R=Correct/(Correct+Incorrect+Spurious)
  • F-Measure: F-Measure = (2*P*R)/(P+R)
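As a worked sketch, invented counts can be plugged directly into these formulas; the numbers below are made up purely for illustration:

def ner_scores(correct, incorrect, missing, spurious):
    # Precision, recall, and F-measure as defined in this chapter.
    precision = correct / float(correct + incorrect + missing)
    recall = correct / float(correct + incorrect + spurious)
    f_measure = (2 * precision * recall) / (precision + recall)
    return precision, recall, f_measure

print(ner_scores(correct=80, incorrect=10, missing=10, spurious=20))
# (0.8, 0.727..., 0.761...)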

Let's see the code for NER using the HMM:

#*******   Function to find all tags in corpus  **********

def find_tag_set(tra_lines):
    global tag_set

    tag_set = [ ]

    for line in tra_lines:
        tok = line.split()
        for t in tok:
            wd = t.split("/")
            if not wd[1] in tag_set:
                tag_set.append(wd[1])

    return

#*******   Function to find frequency of each tag in tagged corpus  **********

def cnt_tag(tr_ln):
    global start_li
    global li
    global tag_set
    global c
    global line_cnt
    global lines

    lines = tr_ln

    start_li = [ ]   # list of starting tags

    find_tag_set(tr_ln)

    line_cnt = 0
    for line in lines:
        tok = line.split()
        x = tok[0].split("/")
        if not x[1] in start_li:
            start_li.append(x[1])
        line_cnt = line_cnt + 1

    find_freq_tag()

    find_freq_srttag()

    return

def find_freq_tag():
    global tag_cnt
    global tag_set
    tag_cnt = {}
    i = 0
    for w in tag_set:
        cal_freq_tag(tag_set[i])
        i = i + 1
        tag_cnt.update({w: freq_tg})
    return


def cal_freq_tag(tg):
    global freq_tg
    global lines
    freq_tg = 0

    for line in lines:
        freq_tg = freq_tg + line.count(tg)

    return

#*******   Function to find frequency of each starting tag in tagged corpus  **********


def find_freq_srttag():
    global lst
    lst = {}        # start probability

    i = 0
    for w in start_li:
        cc = freq_srt_tag(start_li[i])
        prob = cc / line_cnt

        lst.update({start_li[i]: prob})
        i = i + 1
    return

def freq_srt_tag(stg):
    global lines
    freq_srt_tg = 0

    for line in lines:
        tok = line.split()
        if stg in tok[0]:
            freq_srt_tg = freq_srt_tg + 1

    return freq_srt_tg

import tkinter as tk
import vit
import random
import cal_start_p
import calle_prob
import trans_mat
import time
import trans
import dict5
from tkinter import *
from tkinter import ttk
from tkinter.filedialog import askopenfilename
from tkinter.messagebox import showerror
import languagedetect1
import languagedetect3

e_dict = dict()
t_dict = dict()

def calculate1(*args):
    import listbox1

def calculate2(*args):
    import listbox2

def calculate3(*args):
    import listbox3


def dispdlg():
    global file_name
    root = tk.Tk()
    root.withdraw()
    file_name = askopenfilename()
    return

def tranhmm():
    ttk.Style().configure("TButton", padding=6, relief="flat", background="Pink", foreground="Red")
    ttk.Button(mainframe, text="BROWSE", command=find_train_corpus).grid(column=7, row=5, sticky=W)

# The following code will be used to display or accept the testing corpus from the user.
def testhmm():
    ttk.Button(mainframe, text="Develop a new testing Corpus", command=calculate3).grid(column=9, row=5, sticky=E)

    ttk.Button(mainframe, text="BROWSE", command=find_obs).grid(column=9, row=7, sticky=E)


# In HMM, we require parameters such as the start probability, the transition probability,
# and the emission probability. The following code is used to calculate the emission probability matrix.

def cal_emit_mat():
    global emission_probability
    global corpus
    global tlines

    calle_prob.m_prg(e_dict, corpus, tlines)

    emission_probability = e_dict

    return

# to calculate states


def cal_states():
    global states
    global tlines

    cal_start_p.cnt_tag(tlines)

    states = cal_start_p.tag_set

    return

# to take observations


def find_obs():
    global observations
    global test_lines
    global tra
    global w4
    global co
    global wo1
    global wo2
    global testl
    global wo3
    global te
    global definitionText
    global definitionScroll
    global dt2
    global ds2
    global dt11
    global ds11

    wo3 = [ ]
    woo = [ ]
    wo1 = [ ]
    wo2 = [ ]
    co = 0
    w4 = [ ]
    if(flag2 != 0):
        definitionText11.pack_forget()
        definitionScroll11.pack_forget()
    dt1.pack_forget()
    ds1.pack_forget()
    dispdlg()
    f = open(file_name, "r+", encoding='utf-8')
    test_lines = f.readlines()
    f.close()
    fname = "C:/Python32/file_name1"

    for x in states:
        if not x in start_probability:
            start_probability.update({x: 0.0})
    for line in test_lines:
        ob = line.split()
        observations = ob

    fe = open("C:/Python32/output3_file", "w+", encoding='utf-8')
    fe.write("")
    fe.close()
    ff = open("C:/Python32/output4_file", "w+", encoding='utf-8')
    ff.write("")
    ff.close()
    ff7 = open("C:/Python32/output5_file", "w+", encoding='utf-8')
    ff7.write("")
    ff7.close()
    ff8 = open("C:/Python32/output6_file", "w+", encoding='utf-8')
    ff8.write("")
    ff8.close()
    ff81 = open("C:/Python32/output7_file", "w+", encoding='utf-8')
    ff81.write("")
    ff81.close()
    dict5.search_obs_train_corpus(file1, fname, tlines, test_lines, observations,
                                  states, start_probability, transition_probability,
                                  emission_probability)

    f20 = open("C:/Python32/output5_file", "r+", encoding='utf-8')
    te = f20.readlines()
    tee = f20.read()
    f = open(fname, "r+", encoding='utf-8')
    train_llines = f.readlines()

    ds11 = Scrollbar(root)
    dt11 = Text(root, width=10, height=20, fg='black', bg='pink', yscrollcommand=ds11.set)
    ds11.config(command=dt11.yview)
    dt11.insert("1.0", train_llines)
    dt11.insert("1.0", "\n")
    dt11.insert("1.0", "\n")
    dt11.insert("1.0", "******TRAINING SENTENCES******")

    # an example of how to add new text to the text area
    dt11.pack(padx=10, pady=150)
    ds11.pack(padx=10, pady=150)

    ds11.pack(side=LEFT, fill=BOTH)
    dt11.pack(side=LEFT, fill=BOTH, expand=True)

    ds2 = Scrollbar(root)
    dt2 = Text(root, width=10, height=10, fg='black', bg='pink', yscrollcommand=ds2.set)
    ds2.config(command=dt2.yview)
    dt2.insert("1.0", test_lines)
    dt2.insert("1.0", "\n")
    dt2.insert("1.0", "\n")
    dt2.insert("1.0", "*********TESTING SENTENCES*********")

    # an example of how to add new text to the text area
    dt2.pack(padx=10, pady=150)
    ds2.pack(padx=10, pady=150)

    ds2.pack(side=LEFT, fill=BOTH)
    dt2.pack(side=LEFT, fill=BOTH, expand=True)

    definitionScroll = Scrollbar(root)
    definitionText = Text(root, width=10, height=10, fg='black', bg='pink',
                          yscrollcommand=definitionScroll.set)
    definitionScroll.config(command=definitionText.yview)
    definitionText.insert("1.0", te)
    definitionText.insert("1.0", "\n")
    definitionText.insert("1.0", "\n")
    definitionText.insert("1.0", "*********OUTPUT*********")

    # an example of how to add new text to the text area
    definitionText.pack(padx=10, pady=150)
    definitionScroll.pack(padx=10, pady=150)

    definitionScroll.pack(side=LEFT, fill=BOTH)
    definitionText.pack(side=LEFT, fill=BOTH, expand=True)

    l = tk.Label(root, text="NOTE:*****The Entities which are not tagged in Output are not Named Entities*****",
                 fg='black', bg='pink')
    l.place(x=500, y=650, width=500, height=25)

    #ttk.Button(mainframe, text="View Parameters", command=parame).grid(column=11, row=10, sticky=E)
    #definitionText.place(x= 19, y = 200,height=25)

    f20.close()

    f14 = open("C:/Python32/output2_file", "r+", encoding='utf-8')
    testl = f14.readlines()
    for lines in testl:
        toke = lines.split()
        for t in toke:
            w4.append(t)
    f14.close()
    f12 = open("C:/Python32/output_file", "w+", encoding='utf-8')
    f12.write("")
    f12.close()

    ttk.Button(mainframe, text="SAVE OUTPUT", command=save_output).grid(column=11, row=7, sticky=E)
    ttk.Button(mainframe, text="NER EVALUATION", command=evaluate).grid(column=13, row=7, sticky=E)
    ttk.Button(mainframe, text="REFRESH", command=ref).grid(column=15, row=7, sticky=E)

    return

def ref():
    root.destroy()
    import new1
    return

Let's see the following code in Python, which will be used to evaluate the output produced by NER using HMM:

def evaluate():
    global wDict
    global woe
    global woe1
    global woe2
    woe1 = [ ]
    woe = [ ]
    woe2 = [ ]
    ws = [ ]
    wDict = {}
    i = 0
    j = 0
    k = 0
    sp = 0
    f141 = open("C:/Python32/output1_file", "r+", encoding='utf-8')
    tesl = f141.readlines()
    for lines in tesl:
        toke = lines.split()
        for t in toke:
            ws.append(t)
            if t in wDict: wDict[t] += 1
            else: wDict[t] = 1
    for line in tlines:
        tok = line.split()

        for t in tok:
            wd = t.split("/")
            if(wd[1] != 'OTHER'):
                if t in wDict: wDict[t] += 1
                else: wDict[t] = 1
    print("words  in train corpus ", wDict)
    for key in wDict:
        i = i + 1
    print("total words in Dictionary are:", i)
    for line in train_lines:
        toe = line.split()
        for t1 in toe:
            if '/' not in t1:
                sp = sp + 1
                woe2.append(t1)
    print("Spurious words are")
    for w in woe2:
        print(w)
    print("Total spurious words are:", sp)
    for l in te:
        to = l.split()
        for t1 in to:
            if '/' in t1:
                #print(t1)
                if t1 in ws or t1 in wDict:
                    woe.append(t1)
                    j = j + 1
                if t1 not in wDict:
                    wdd = t1.split("/")
                    if wdd[0] not in woe2:
                        woe1.append(t1)
                        k = k + 1
    print("Word found in Dict are:")
    for w in woe:
        print(w)
    print("Word not found in Dict are:")
    for w in woe1:
        print(w)
    print("Total correctly tagged words are:", j)
    print("Total incorrectly tagged words are:", k)
    pr = (j) / (j + k)
    re = (j) / (j + k + sp)
    f141.close()
    root = Tk()
    root.title("NER EVALUATION")
    root.geometry("1000x1000")

    ds21 = Scrollbar(root)
    dt21 = Text(root, width=10, height=10, fg='black', bg='pink', yscrollcommand=ds21.set)
    ds21.config(command=dt21.yview)
    dt21.insert("1.0", (2 * pr * re) / (pr + re))
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", "F-MEASURE=")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", "F-MEASURE=(2*PRECISION*RECALL)/(PRECISION+RECALL)")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", re)
    dt21.insert("1.0", "RECALL=")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", "RECALL= CORRECT/(CORRECT +INCORRECT +SPURIOUS)")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", pr)
    dt21.insert("1.0", "PRECISION=")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", "PRECISION= CORRECT/(CORRECT +INCORRECT +MISSING)")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", "Total No. of Missing words are: 0")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", sp)
    dt21.insert("1.0", "Total No. of Spurious Words are:")
    dt21.insert("1.0", "\n")
    for w in woe2:
        dt21.insert("1.0", w)
        dt21.insert("1.0", " ")
    dt21.insert("1.0", "Total Spurious Words are:")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", k)
    dt21.insert("1.0", "Total No. of Incorrectly tagged words are:")
    dt21.insert("1.0", "\n")
    for w in woe1:
        dt21.insert("1.0", w)
        dt21.insert("1.0", " ")
    dt21.insert("1.0", "Total Incorrectly tagged words are:")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", j)
    dt21.insert("1.0", "Total No. of Correctly tagged words are:")
    dt21.insert("1.0", "\n")
    for w in woe:
        dt21.insert("1.0", w)
        dt21.insert("1.0", " ")
    dt21.insert("1.0", "Total Correctly tagged words are:")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", "\n")
    dt21.insert("1.0", "***************PERFORMANCE EVALUATION OF NERHMM***************")

    # an example of how to add new text to the text area
    dt21.pack(padx=5, pady=5)
    ds21.pack(padx=5, pady=5)

    ds21.pack(side=LEFT, fill=BOTH)
    dt21.pack(side=LEFT, fill=BOTH, expand=True)
    root.mainloop()
    return

def save_output():
    #dispdlg()
    f = open("C:/Python32/save", "w+", encoding='utf-8')
    f20 = open("C:/Python32/output5_file", "r+", encoding='utf-8')
    te = f20.readlines()
    for t in te:
        f.write(t)
    f.close()
    f20.close()

# to calculate start probability matrix

def cal_srt_prob():
    global start_probability

    start_probability = cal_start_p.lst

    return

# to print the Viterbi parameters if required

def pr_param():
    l1 = tk.Label(root, text="HMM Training is going on.....Don't Click any Button!!", fg='black', bg='pink')
    l1.place(x=300, y=150, height=25)

    print("states")
    print(states)
    print(" ")
    print(" ")
    print("start probability")
    print(start_probability)
    print(" ")
    print(" ")
    print("transition probability")
    print(transition_probability)
    print(" ")
    print(" ")
    print("emission probability")
    print(emission_probability)
    l1 = tk.Label(root, text="                                                                                                         ")
    l1.place(x=300, y=150, height=25)
    global flag1
    flag1 = 0
    global flag2
    flag2 = 0
    ttk.Button(mainframe, text="View Parameters", command=parame).grid(column=7, row=5, sticky=W)
    return

def parame():
    global flag2
    flag2 = flag1 + 1
    global definitionText11
    global definitionScroll11
    definitionScroll11 = Scrollbar(root)
    definitionText11 = Text(root, width=10, height=10, fg='black', bg='pink',
                            yscrollcommand=definitionScroll11.set)

    #definitionText.place(x= 19, y = 200,height=25)
    definitionScroll11.config(command=definitionText11.yview)

    definitionText11.delete("1.0", END)   # an example of how to delete all current text
    definitionText11.insert("1.0", emission_probability)
    definitionText11.insert("1.0", "\n")
    definitionText11.insert("1.0", "Emission Probability")
    definitionText11.insert("1.0", "\n")
    definitionText11.insert("1.0", transition_probability)
    definitionText11.insert("1.0", "Transition Probability")
    definitionText11.insert("1.0", "\n")
    definitionText11.insert("1.0", start_probability)
    definitionText11.insert("1.0", "Start Probability")

    # an example of how to add new text to the text area
    definitionText11.pack(padx=10, pady=175)
    definitionScroll11.pack(padx=10, pady=175)

    definitionScroll11.pack(side=LEFT, fill=BOTH)
    definitionText11.pack(side=LEFT, fill=BOTH, expand=True)

    return

# to calculate transition probability matrix


def cat_trans_prob():
    global transition_probability
    global corpus
    global tlines

    trans_mat.main_prg(t_dict, corpus, tlines)

    transition_probability = t_dict
    return

def find_train_corpus():
    global train_lines
    global tlines
    global c
    global corpus
    global words1
    global w1
    global train1
    global fname
    global file1
    global ds1
    global dt1
    global w21
    words1 = [ ]
    c = 0
    w1 = [ ]
    w21 = [ ]
    f11 = open("C:/Python32/output1_file", "w+", encoding='utf-8')
    f11.write("")
    f11.close()
    fr = open("C:/Python32/output_file", "w+", encoding='utf-8')
    fr.write("")
    fr.close()
    fgl = open("C:/Python32/ladetect1", "w+", encoding='utf-8')
    fgl.write("")
    fgl.close()

    fgl = open("C:/Python32/ladetect", "w+", encoding='utf-8')
    fgl.write("")
    fgl.close()
    dispdlg()
    f = open(file_name, "r+", encoding='utf-8')
    train_lines = f.readlines()

    ds1 = Scrollbar(root)
    dt1 = Text(root, width=10, height=10, fg='black', bg='pink', yscrollcommand=ds1.set)
    ds1.config(command=dt1.yview)
    dt1.insert("1.0", train_lines)
    dt1.insert("1.0", "\n")
    dt1.insert("1.0", "\n")
    dt1.insert("1.0", "*********TRAINING SENTENCES*********")

    # an example of how to add new text to the text area
    dt1.pack(padx=10, pady=175)
    ds1.pack(padx=10, pady=175)

    ds1.pack(side=LEFT, fill=BOTH)
    dt1.pack(side=LEFT, fill=BOTH, expand=True)
    fname = "C:/Python32/file_name1"
    f = open(file_name, "r+", encoding='utf-8')
    file1 = file_name
    p = open(fname, "w+", encoding='utf-8')

    corpus = f.read()
    for line in train_lines:
        tok = line.split()
        for t in tok:
            n = t.split()

            le = len(t)
            i = 0
            j = 0
            for n1 in n:
                while(j < le):
                    if(n1[j] != '/'):
                        i = i + 1
                        j = j + 1
                    else:
                        j = j + 1
            if(i == le):
                p.write(t)
                p.write("/OTHER ")          # Handling spurious words
            else:
                p.write(t)
                p.write(" ")

        p.write("\n")

    p.close()
    fname = "C:/Python32/file_name1"
    f00 = open(fname, "r+", encoding='utf-8')
    tlines = f00.readlines()
    for line in tlines:
        tok = line.split()
        for t in tok:
            wd = t.split("/")
            if(wd[1] != 'OTHER'):
                if not wd[0] in words1:
                    words1.append(wd[0])
                    w1.append(wd[1])
    f00.close()

    f157 = open("C:/Python32/input_file", "w+", encoding='utf-8')
    f157.write("")
    f157.close()
    f1 = open("C:/Python32/input_file", "w+", encoding='utf-8')   # input_file has the list of named entities of the training file
    for w in words1:
        f1.write(w)
        f1.write("\n")
    f1.close()
    fr = open("C:/Python32/detect", "w+", encoding='utf-8')
    fr.write("")
    fr.close()

    f.close()

    cal_states()
    cal_emit_mat()
    cal_srt_prob()
    cat_trans_prob()
    pr_param()

    return

root = Tk()
root.title("NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING HIDDEN MARKOV MODEL")
root.geometry("1000x1000")

mainframe = ttk.Frame(root, padding="20 20 12 12")
mainframe.grid(column=0, row=0, sticky=(N, W, E, S))

b = StringVar()
a = StringVar()

ttk.Style().configure("TButton", padding=6, relief="flat", background="Pink", foreground="Red")
ttk.Button(mainframe, text="ANNOTATION", command=calculate1).grid(column=5, row=3, sticky=W)

ttk.Button(mainframe, text="TRAIN HMM", command=tranhmm).grid(column=7, row=3, sticky=E)

ttk.Button(mainframe, text="TEST HMM", command=testhmm).grid(column=9, row=3, sticky=E)

# hmmhelp is assumed to be defined elsewhere in the application.
ttk.Button(mainframe, text="HELP", command=hmmhelp).grid(column=11, row=3, sticky=E)


# To call Viterbi for the particular observations found in find_obs

def call_vitar():
    global test_lines
    global train_lines
    global corpus
    global observations
    global states
    global start_probability
    global transition_probability
    global emission_probability

    find_train_corpus()

    cal_states()
    find_obs()
    cal_emit_mat()
    cal_srt_prob()
    cat_trans_prob()

    # print("Viterbi parameters for the selected corpus")
    # pr_param()

    # ----------------- To add all states not in start probability ----------------

    for x in states:
        if not x in start_probability:
            start_probability.update({x: 0.0})

    for line in test_lines:
        ob = line.split()
        observations = ob
        print(" ")
        print(" ")
        print(line)
        print("**************************")
        print(vit.viterbi(observations, states, start_probability,
                          transition_probability, emission_probability))
    return


root.mainloop()

The preceding code in Python shows how NER is performed using the HMM, and how an NER system is evaluated using performance metrics (Precision, Recall and F-Measure).
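The vit.viterbi() call in the preceding listing follows the conventional (observations, states, start_probability, transition_probability, emission_probability) interface. As an illustration of what such a decoder does, a minimal stand-alone sketch with the same signature might look like the following; it assumes the probability tables are nested dictionaries keyed by tag, which is not confirmed by the vit module itself:

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][state] holds the probability of the best tag path that ends in `state`
    # after the first t+1 observations; path[state] stores that path.
    V = [{}]
    path = {}
    for s in states:
        V[0][s] = start_p.get(s, 0.0) * emit_p.get(s, {}).get(obs[0], 0.0)
        path[s] = [s]
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (V[t - 1][s0] * trans_p.get(s0, {}).get(s, 0.0) *
                 emit_p.get(s, {}).get(obs[t], 0.0), s0)
                for s0 in states)
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    prob, best = max((V[len(obs) - 1][s], s) for s in states)
    return prob, path[best]

# Example call, mirroring the GUI code above:
# prob, tags = viterbi(observations, states, start_probability,
#                      transition_probability, emission_probability)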
