Term frequency-inverse document frequency (TFIDF)

The term frequency (tf) of a given word w is the number of times the word occurs in document d, which we can conveniently write as tf(w, d). In our case, this is how many times the word appears in a tweet. Term frequencies are called local weights: they indicate the importance of a word within a single document. They are often normalized by dividing them by the number of words in the document, that is, the document's length. The higher the tf value of a word in a document, the more important the word is to that document.
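To make this concrete, here is a minimal sketch in R. The three toy tweets and the tf helper are illustrative assumptions, not part of our pipeline:

# Three made-up tweets, used only to illustrate the formulas
tweets <- c("the fire is spreading fast",
            "fire crews attack the blaze",
            "stay away from the area")

# Normalized term frequency: occurrences of a word divided by document length
tf <- function(word, doc) {
  tokens <- unlist(strsplit(tolower(doc), "\\s+"))
  sum(tokens == word) / length(tokens)
}

tf("fire", tweets[1])  # 1 occurrence out of 5 tokens = 0.2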

The document frequency, or df, of a given word w is the number of documents in which the word occurs. Document frequency is considered a global weight.

Inverse document frequency, or idf, is calculated as follows:

idf(w) = log(N / (1 + df(w)))

Here, N is the number of documents in the corpus and df(w) is the document frequency of the word. Note that idf depends only on the word and the corpus, not on any particular document.
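Continuing the toy example from above, df and idf can be written directly from these definitions (again, an illustrative sketch rather than our pipeline code):

# Document frequency: the number of documents containing the word
df <- function(word, docs) {
  sum(vapply(docs, function(d) {
    word %in% unlist(strsplit(tolower(d), "\\s+"))
  }, logical(1)))
}

# Smoothed inverse document frequency, following the formula above
idf <- function(word, docs) {
  log(length(docs) / (1 + df(word, docs)))
}

df("the", tweets)     # 3: "the" occurs in every toy tweet
idf("the", tweets)    # log(3 / (1 + 3)), about -0.29: very common, low weight
idf("blaze", tweets)  # log(3 / (1 + 1)), about 0.41: rarer, higher weight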

Term frequency-inverse document frequency, or TFIDF, is the product of tf and idf. It measures the importance of a word in our corpus. Under this scheme, the most frequently occurring words are not given high importance, since they do not carry enough information to differentiate one tweet from another.
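With the toy helpers above, the product is one line; in practice, the tm package can produce a TFIDF-weighted matrix directly, which mirrors how a matrix such as the dtm.mat we inspect below can be built. Note that tm's weightTfIdf uses a base-2 logarithm without the +1 smoothing term, so its values differ slightly from our toy formula:

# TFIDF as the product of tf and idf
tfidf <- function(word, doc, docs) tf(word, doc) * idf(word, docs)
tfidf("blaze", tweets[2], tweets)  # about 0.08
tfidf("the", tweets[2], tweets)    # about -0.06: a frequent word is penalized

# Building a TFIDF-weighted document term matrix with tm
library(tm)
toy.corpus <- VCorpus(VectorSource(tweets))
toy.dtm <- DocumentTermMatrix(toy.corpus,
                              control = list(weighting = weightTfIdf))
toy.mat <- as.matrix(toy.dtm)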

Let us peek at our document term matrix:

> dtm.mat[1:2,10:15]
    Terms
Docs android art artstationhq ashtonsummers attack away
   1       0   0            0             0      1    0
   2       0   0            0             0      1    0
You can see, for example, that the TFIDF value for the word fire in document 2 (a column outside the slice shown above) is 0.27.

Introduction to Information Retrieval, by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze, is a good place to start if you want to dig deeper into word weighting schemes. You can get it at https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf.

TFIDF is a great weighting scheme. It has been used successfully in many text mining and information retrieval projects. We are going to use a modified version of TFIDF, called Delta TFIDF, for our feature generation.
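As a rough preview of Delta TFIDF (Martineau and Finin, 2009), the idea is to replace the single idf term with the difference between a word's idf in positively and negatively labeled training documents, so that words skewed toward one sentiment class receive large absolute weights. The sketch below reuses the toy helpers from earlier; the +1 smoothing is our own addition to avoid division by zero, and the exact formulation we adopt later may differ:

# Sketch of Delta TFIDF: tf times the difference in idf between
# the positive and negative training subsets
delta.tfidf <- function(word, doc, pos.docs, neg.docs) {
  tf(word, doc) *
    (log2(length(pos.docs) / (1 + df(word, pos.docs))) -
     log2(length(neg.docs) / (1 + df(word, neg.docs))))
}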
