TFIDF

TFIDF is a very popular weighting metric used in text mining. It is the product of a term frequency and an inverse document frequency:
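In its standard form, with N the total number of documents in the corpus and DF(t) the number of documents containing the term t, the weight of a term t in a document d is:

TFIDF(t, d) = TF(t, d) × IDF(t),  where  IDF(t) = log(N / DF(t))

A term therefore gets a high weight when it occurs often within a document but rarely across the corpus. (tm's weightTfIdf, which we use below, computes this with a base-2 logarithm and by default normalizes TF by the document length.)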

To begin with, we separate our data into two data frames:

> title.df <- data.subset[,c('ID','TITLE')]
> others.df <- data.subset[,c('ID','PUBLISHER','CATEGORY')]

title.df stores the title and the article ID. others.df stores the article ID, publisher, and category.

We will be using the tm package in R to work with our text data:

library(tm)
title.reader <- readTabular(mapping=list(content="TITLE", id="ID"))
corpus <- Corpus(DataframeSource(title.df), readerControl=list(reader=title.reader))

We create a data frame reader using readTabular, mapping the TITLE column to the document content and the ID column to the document ID. Next, we use the Corpus function to create our text corpus: we pass it a DataframeSource wrapping our title.df data frame, and the title.reader data frame reader through readerControl.
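Note that readTabular was removed from tm in version 0.7. If you are on a newer tm, an equivalent corpus can be built by handing DataframeSource a data frame whose first two columns are named doc_id and text; a minimal sketch, using a hypothetical title.source data frame built from the same data.subset:

library(tm)

# tm >= 0.7: DataframeSource expects the columns doc_id and text
title.source <- data.frame(doc_id = data.subset$ID,
                           text   = data.subset$TITLE,
                           stringsAsFactors = FALSE)
corpus <- Corpus(DataframeSource(title.source))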

Our next step is to do some processing of the text data:

> getTransformations()
[1] "removeNumbers" "removePunctuation" "removeWords" "stemDocument" "stripWhitespace"
>

Calling getTransformations shows us the list of available functions that can be used to transform the text:

corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

We remove punctuation, numbers, and unnecessary whitespace, convert the text to lowercase, and then remove English stop words. Note that the lowercase conversion comes before stop word removal, since the stop word list is itself in lowercase.

Punctuation, numbers, and whitespace are not good features for distinguishing one article from another, hence we remove them.
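To verify the effect of these transformations, we can print a single processed title; content is tm's accessor for a document's text (the output, of course, depends on your data):

> content(corpus[[1]])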

Let's look at the list returned by stopwords("english") in the tm package:

stopwords("english")

[1] "i" "me" "my" "myself" "we" "our" "ours"
[8] "ourselves" "you" "your" "yours" "yourself" "yourselves" "he"
[15] "him" "his" "himself" "she" "her" "hers" "herself"
[22] "it" "its" "itself" "they" "them" "their" "theirs"
[29] "themselves" "what" "which" "who" "whom" "this" "that"
[36] "these" "those" "am" "is" "are" "was" "were"
[43] "be" "been" "being" "have" "has" "had" "having"
[50] "do" "does" "did" "doing" "would" "should" "could"
[57] "ought" "i'm" "you're" "he's" "she's" "it's" "we're"
[64] "they're" "i've" "you've" "we've" "they've" "i'd" "you'd"
[71] "he'd" "she'd" "we'd" "they'd" "i'll" "you'll" "he'll"
[78] "she'll" "we'll" "they'll" "isn't" "aren't" "wasn't" "weren't"
[85] "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't" "won't"

These words are present in most English text, no matter what the content is. They cannot act as good features to distinguish our articles, hence we remove them.

We want our algorithms to treat the words "dog" and "Dog" the same way, hence we convert all the text to lowercase.

Furthermore, we could apply stemming to our words. Stemming reduces a word to its root form, so that, for example, "connected" and "connection" both become "connect".
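Since stemDocument is one of the transformations listed by getTransformations earlier, stemming would be just one more tm_map call; note that stemDocument relies on the SnowballC package being installed:

> # optional: reduce each word to its stem (requires the SnowballC package)
> corpus <- tm_map(corpus, stemDocument)

We do not stem here, and keep the full words in our titles.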

Finally, let's proceed to build our document term matrix:

> dtm <- DocumentTermMatrix(corpus, control=list(wordLengths=c(3,10), weighting=weightTfIdf))
> dtm
<<DocumentTermMatrix (documents: 2638, terms: 6628)>>
Non-/sparse entries: 18317/17466347
Sparsity           : 100%
Maximal term length: 21

> inspect(dtm[1:5, 10:15])
<<DocumentTermMatrix (documents: 5, terms: 6)>>
Non-/sparse entries: 0/30
Sparsity           : 100%
Maximal term length: 9
Sample             :
        Terms
Docs     abbey abbvie abc abcs abdul abenomics
  180407     0      0   0    0     0         0
  306465     0      0   0    0     0         0
  371436     0      0   0    0     0         0
  38081      0      0   0    0     0         0
  410152     0      0   0    0     0         0
>

We use the DocumentTermMatrix function to create our matrix. We pass our text corpus, along with a list for the control parameter. Inside the list, wordLengths = c(3, 10) says that we are interested only in words between 3 and 10 characters long, and weighting = weightTfIdf makes the cell values of our matrix TFIDF weights rather than raw counts (note that weighting takes a function, not a character string).

We can inspect the created document term matrix by calling the inspect function.
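For example, to see which terms carry the most TFIDF weight for a single article (the row index 1 here is arbitrary), we can convert one row of the sparse matrix to a dense matrix and sort it:

> m <- as.matrix(dtm[1, ])
> head(sort(m[1, ], decreasing = TRUE), 5)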

Having created a document term matrix, let's compute the cosine similarity between the articles:

library(slam)  # tcrossprod_simple_triplet_matrix and row_sums come from slam
sim.score <- tcrossprod_simple_triplet_matrix(dtm) /
  (sqrt(row_sums(dtm^2) %*% t(row_sums(dtm^2))))

In the preceding code, we take a document term matrix and produce a document-by-document matrix, which we will call the similarity matrix going forward. Each cell of this matrix holds the cosine similarity between the corresponding pair of documents: the numerator computes all pairwise inner products at once, and the denominator is the outer product of the documents' L2 norms.
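As quick sanity checks, sim.score should be a square matrix with one row and one column per document, and, barring empty documents, each document should have a similarity of 1 with itself:

> dim(sim.score)       # 2638 x 2638: documents by documents
> sim.score[1:3, 1:3]  # the diagonal entries should be 1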

The cosine similarity between two vectors A and B of length n is given as:

cos(A, B) = (A · B) / (‖A‖ × ‖B‖) = (Σᵢ AᵢBᵢ) / ( √(Σᵢ Aᵢ²) × √(Σᵢ Bᵢ²) ), where i runs from 1 to n

Look at the numerator of this equation: it is the inner product of the two vectors. In a vector space model, the inner product of two vectors gives the similarity between them. We divide this inner product by the product of the L2 norms of the individual vectors. This bounds the score between -1 and 1, where -1 indicates that the two documents are completely different (their vectors point in opposite directions). Since TFIDF values are non-negative, both vectors have non-negative entries, and the similarity score is in fact bounded between 0 and 1.
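As a tiny worked example, here is the same formula in base R for two made-up three-element vectors:

> a <- c(1, 2, 0)
> b <- c(2, 1, 1)
> sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
[1] 0.7302967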
