Delta TFIDF

The problem with TFIDF is that it does not differentiate between words on the basis of sentiment. TFIDF is calculated without any knowledge of the sentiment of the documents, so it may not serve as a good discriminating feature for sentiment classification.

Delta TFIDF was proposed by Justin Martineau and Tim Finin in their paper Delta TFIDF: An Improved Feature Space for Sentiment Analysis: http://ebiquity.umbc.edu/_file_directory_/papers/446.pdf.

Delta TFIDF is calculated for each word and document combination as follows:

  • Find Ctd, the number of times the word t occurs in the document d (its term frequency).
  • Find nt, the number of negative documents in which the word occurs.
  • Find pt, the number of positive documents in which the word occurs.

Now, the Delta TFIDF score for a word t in a document d is:

 Ctd * log(nt / pt)

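For example, with made-up counts: a word that occurs twice in a tweet (Ctd = 2) and appears in 30 negative tweets (nt = 30) but only 10 positive tweets (pt = 10) scores 2 * log(30/10) ≈ 2.2, using the natural log as R's log() does. A word occurring in equally many positive and negative documents scores Ctd * log(1) = 0 and contributes nothing.
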
Let us calculate Delta TFIDF in R:

> dtm <- get.dtm('text','id', tweet.final, "weightTf")
> dtm
<<DocumentTermMatrix (documents: 368, terms: 1234)>>
Non-/sparse entries: 2804/451308
Sparsity : 99%
Maximal term length: 19

We get the document term matrix for our whole corpus: the rows are our documents, the columns are our vocabulary, and we use term frequency as the weighting scheme. Our dtm is very sparse; most of its cells are zero. Let us throw away some rare terms to reduce the sparsity of our document term matrix.
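
The get.dtm() function is a helper defined earlier in the book. For reference, here is a minimal sketch of what such a helper might look like with the tm package; the DataframeSource column names and the match.fun() lookup are assumptions, not the book's code:

library(tm)

get.dtm <- function(text.col, id.col, df, weight.scheme) {
  # tm's DataframeSource expects the columns doc_id and text
  docs <- data.frame(doc_id = df[[id.col]],
                     text = df[[text.col]],
                     stringsAsFactors = FALSE)
  corpus <- VCorpus(DataframeSource(docs))
  # resolve the weighting scheme ("weightTf", "weightBin", ...) by name
  DocumentTermMatrix(corpus, control = list(weighting = match.fun(weight.scheme)))
}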

> dtm <- removeSparseTerms(dtm, 0.98)
> dtm
<<DocumentTermMatrix (documents: 368, terms: 58)>>
Non-/sparse entries: 934/20410
Sparsity : 96%
Maximal term length: 11

> dtm.mat <- as.matrix(dtm)

removeSparseTerms(dtm, 0.98) drops every term that is absent from more than 98 percent of the documents, cutting the vocabulary from 1,234 terms down to 58. After reducing the sparsity, we convert our document term matrix to a matrix object.

Now let us split our data into a positive tweets dataset and a negative tweets dataset and get their respective document term matrices:

dtm.pos <- get.dtm('text','id', tweet.final[tweet.final$sentiment == 'Positive',],"weightBin")
dtm.neg <- get.dtm('text','id', tweet.final[tweet.final$sentiment == 'Negative',],"weightBin")

dtm.pos.mat <- as.matrix(dtm.pos)
dtm.neg.mat <- as.matrix(dtm.neg)

dtm.pos.mat and dtm.neg.mat are the matrix representations of the positive and the negative tweet corpuses.

Let us find the document frequency of each word in both corpuses:

pos.words.df <- colSums(dtm.pos.mat)
neg.words.df <- colSums(dtm.neg.mat)

Because we used binary weighting (weightBin), every cell of these matrices is either 0 or 1, so summing up a column counts the documents in which that word occurs; that is, its document frequency. pos.words.df contains the words and their document frequencies in the positive corpus, and neg.words.df contains the same for the negative corpus.
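
To see why column sums give document frequencies, consider a toy binary matrix (the documents and terms below are made up):

# Illustrative only: three documents, two made-up terms, binary weights
toy <- matrix(c(1, 0, 1,
                1, 1, 0),
              nrow = 3,
              dimnames = list(c("d1", "d2", "d3"), c("good", "bad")))
colSums(toy)
# good  bad
#    2    2

Each cell is 1 only when the term occurs in the document, so a column sum counts documents rather than occurrences.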

Let us get all the unique words and the document IDs:

> tot.features <- colnames(dtm.mat)
> doc.ids <- rownames(dtm.mat)

We have all the information needed to calculate our final score.

Let us calculate the Delta TFIDF:

c.dtm.mat <- dtm.mat

for (i in 1:length(tot.features)) {
  for (j in 1:length(doc.ids)) {
    # Number of times the term occurs in the document
    ctd <- dtm.mat[doc.ids[j], tot.features[i]]
    # Number of documents in the positive corpus containing the term
    pt <- pos.words.df[tot.features[i]]
    # Number of documents in the negative corpus containing the term
    nt <- neg.words.df[tot.features[i]]
    score <- ctd * log(nt / pt)
    # A term absent from either corpus gives NA; score it as 0
    if (is.na(score)) {
      score <- 0
    }
    c.dtm.mat[doc.ids[j], tot.features[i]] <- score
  }
}

Our dtm.mat has the term frequency for each word and document. We use pos.words.df and neg.words.df to look up the document frequency of each word in the positive and negative corpuses, and we update c.dtm.mat with the new score. A word that never occurs in one of the two corpuses produces an NA score, which we replace with 0.
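
The double loop above is easy to follow but slow for larger matrices. As a sketch (not the book's code), the same scores can be computed in a vectorized way, because the log ratio depends only on the term:

# Vectorized equivalent of the nested loop above
log.ratio <- log(neg.words.df[tot.features] / pos.words.df[tot.features])
# Terms missing from either corpus yield NA; score them as 0
log.ratio[!is.finite(log.ratio)] <- 0
c.dtm.mat <- sweep(dtm.mat, 2, log.ratio, `*`)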

We have calculated the Delta TFIDF scores, and this brings us to the end of this section. Our data is now ready to be consumed by a model. Let us proceed to build our sentiment classification model.
