Jaccard's distance

When ranking the matched articles, we also want to take the category and publisher columns into account.

Let's proceed to include those columns:

target.publisher <- match.refined[1,]$PUBLISHER
target.category <- match.refined[1,]$CATEGORY
target.polarity <- match.refined[1,]$polarity

target.title <- match.refined[1,]$TITLE

We need the publisher, category, and sentiment details of the document we are searching for. Fortunately, the first row of our match.refined data frame stores all the details of document 38081, so we retrieve those values from there.

For the rest of the articles, we need to find out if they match the publisher and category of document 38081:

match.refined$is.publisher <- match.refined$PUBLISHER == target.publisher
match.refined$is.publisher <- as.numeric(match.refined$is.publisher)

This creates a new column in match.refined called is.publisher, which records whether each article's publisher is the same as the publisher of the article we are searching for. The comparison yields a Boolean value, which as.numeric then converts to 1 (match) or 0 (no match).

Now for the category:

match.refined$is.category <- match.refined$CATEGORY == target.category
match.refined$is.category <- as.numeric(match.refined$is.category)

We repeat the same steps for the category, creating a new column called is.category to store the category match.

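With the two new columns, we can calculate the Jaccard's distance between document 38081 and every other document in the match.refined data frame. As computed here, the value is the fraction of the two attributes that match, so 1 means both the publisher and the category match and 0 means neither does: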

match.refined$jaccard <- (match.refined$is.publisher + match.refined$is.category)/2
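
To see what this calculation produces, here is a minimal sketch on a small made-up data frame. The toy data frame, its titles, publishers, and categories are invented for illustration and are not part of our dataset; only the column names and the formula mirror the steps above.

```r
# Toy stand-in for match.refined; every value here is invented for illustration.
toy <- data.frame(
  TITLE     = c("target article", "article A", "article B", "article C"),
  PUBLISHER = c("Reuters", "Reuters", "Reuters", "Bloomberg"),
  CATEGORY  = c("b", "b", "t", "t"),
  stringsAsFactors = FALSE
)

# The first row plays the role of the document we are searching for.
target.publisher <- toy[1, ]$PUBLISHER
target.category  <- toy[1, ]$CATEGORY

# 1 when the attribute matches the target, 0 otherwise.
toy$is.publisher <- as.numeric(toy$PUBLISHER == target.publisher)
toy$is.category  <- as.numeric(toy$CATEGORY == target.category)

# Fraction of the two attributes that match the target.
toy$jaccard <- (toy$is.publisher + toy$is.category) / 2

print(toy[, c("TITLE", "is.publisher", "is.category", "jaccard")])
```

With only two attributes, the score can take just three values: 1 (both match), 0.5 (one matches), and 0 (neither matches). The target row always scores 1 against itself.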