Jaccard's distance/index

The Jaccard index measures the similarity between two sets and is the ratio of the size of their intersection to the size of their union. Here we compare only two elements, one for publisher and one for category, so the size of our union is 2. Adding the two Boolean variables gives the numerator, which is the size of the intersection.
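
The arithmetic described above can be sketched in a few lines of R. This is only an illustration under stated assumptions, not the code that produced the output in this section: the toy data frame matches and the variables target.publisher and target.category are hypothetical, and only the column names is.publisher, is.category, and jaccard are taken from the output shown in this section.

```r
# Hypothetical search article attributes (assumed for illustration)
target.publisher <- "PC Magazine"
target.category  <- "t"

# Hypothetical toy results; the values mirror the first three rows of the output
matches <- data.frame(
  PUBLISHER = c("PC Magazine", "Huntington Herald Dispatch", "Design & Trend"),
  CATEGORY  = c("t", "e", "t"),
  stringsAsFactors = FALSE
)

# Boolean indicators: 1 if the attribute equals the search article's, else 0
matches$is.publisher <- as.integer(matches$PUBLISHER == target.publisher)
matches$is.category  <- as.integer(matches$CATEGORY == target.category)

# Intersection = number of matching attributes; the union is fixed at 2
matches$jaccard <- (matches$is.publisher + matches$is.category) / 2

print(matches$jaccard)
```

With these toy rows, a result that matches on both publisher and category scores 1.0, a result that matches on neither scores 0.0, and a result that matches only on category scores 0.5.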

 

Finally, we calculate the absolute difference (the Manhattan distance) between the polarity value of each article in the search results and the polarity value of our search article. We then apply min/max normalization to the difference scores as follows:

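# Absolute difference between each result's polarity and the search article's polarity
match.refined$polaritydiff <- abs(target.polarity - match.refined$polarity$sentiment)

# Min/max normalization: rescale the differences to the range [0, 1]
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
match.refined$polaritydiff <- range01(unlist(match.refined$polaritydiff))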
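
To see what the normalization does, here is a small sketch on three of the sentiment values that appear in the output in this section. The target polarity of 0 and the helper name range01.safe are assumptions made for illustration. The guard for a constant input is also an addition: range01 as written returns NaN when all the differences are equal, because the denominator is then zero.

```r
# Hypothetical variant of range01 that guards against a zero range
range01.safe <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))  # all values equal: map everything to 0
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

target.polarity <- 0                       # assumed polarity of the search article
polarity <- c(0, 0.28460499, -0.06933752)  # sentiment values from the output in this section

# Absolute difference, then rescale to [0, 1]
polaritydiff <- abs(target.polarity - polarity)
scaled <- range01.safe(polaritydiff)
print(scaled)
```

The smallest difference maps to 0 and the largest to 1, so the normalized score depends on which articles happen to be in the result set.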

We proceed to do some cleaning:

> head(match.refined)
      ID    cosine TITLE
1 419826 1.0000000 Report: iWatch Expected at Sept. 9 iPhone Event
2 137901 0.5000000 Local shops stocked with limited-edition LPs for event
3 113526 0.5000000 Blood Moon Event Will Begin Tonight
4 202272 0.5000000 Kim Kardashian attends USC SHOAH Foundation event dedicated to Armenian ...
5 420093 0.5000000 Apple iPad Air 2 To House 2 GB Of RAM; Apple iWatch Likely To Function ...
6 273675 0.4082483 More iWatch release hints, HealthKit lays groundwork
  PUBLISHER                       CATEGORY polarity.element_id polarity.sentence_id polarity.word_count
1 PC Magazine                            t                   1                    1                   7
2 Huntington Herald Dispatch             e                   2                    1                   8
3 Design & Trend                         t                   3                    1                   6
4 Armenpress.am                          e                   4                    1                  10
5 International Business Times AU        t                   5                    1                  13
6 Product Reviews                        t                   6                    1                   7
  polarity.sentiment is.publisher is.category jaccard polaritydiff
1         0.00000000            1           1     1.0   0.00000000
2         0.00000000            0           0     0.0   0.00000000
3         0.00000000            0           1     0.5   0.00000000
4         0.28460499            0           0     0.0   0.17400235
5        -0.06933752            0           1     0.5   0.04239171
6         0.15118579            0           1     0.5   0.09243226
> match.refined$is.publisher = NULL
> match.refined$is.category = NULL
> match.refined$polarity = NULL
> match.refined$sentiment = NULL
> head(match.refined)
      ID    cosine TITLE
1 419826 1.0000000 Report: iWatch Expected at Sept. 9 iPhone Event
2 137901 0.5000000 Local shops stocked with limited-edition LPs for event
3 113526 0.5000000 Blood Moon Event Will Begin Tonight
4 202272 0.5000000 Kim Kardashian attends USC SHOAH Foundation event dedicated to Armenian ...
5 420093 0.5000000 Apple iPad Air 2 To House 2 GB Of RAM; Apple iWatch Likely To Function ...
6 273675 0.4082483 More iWatch release hints, HealthKit lays groundwork
  PUBLISHER                       CATEGORY jaccard polaritydiff
1 PC Magazine                            t     1.0   0.00000000
2 Huntington Herald Dispatch             e     0.0   0.00000000
3 Design & Trend                         t     0.5   0.00000000
4 Armenpress.am                          e     0.0   0.17400235
5 International Business Times AU        t     0.5   0.04239171
6 Product Reviews                        t     0.5   0.09243226

We remove the fields we no longer need from the match.refined data frame. This leaves us with the ID, cosine similarity, title, publisher, category, Jaccard score, and polarity difference.

The last step is ranking these results.
