Searching

Having created our similarity matrix, we can use it to find matches for any given document. This section shows how to perform the search with that matrix.

Once again, the block diagram of step 2 is presented as follows:

We will be using the sim.score created in the previous step to perform the search.

Let's say we want to find similar articles to article 38081:

> match.docs <- sim.score["38081",]
> match.docs
    38081    306465    371436    410152    180407    311113    263442    171310    116144     70584
1.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.1428571 0.1543033
   228128    128325    263795    230506    326375    136203    166993    158814    417839    220118
0.0000000 0.0000000 0.0000000 0.1690309 0.0000000 0.0000000 0.0000000 0.1259882 0.0000000 0.0000000
   276048    307643     38069    349240    192743    131763    156247     16642    354055    410578
0.1336306 0.0000000 0.3779645 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
   196045    393546     35625    370930     41315     35049    104981    276610    196153    367915
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000

We go to our sim.score similarity matrix and pick up row 38081. This row holds the similarity score of every article in the matrix against article 38081.
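To make the row lookup concrete, here is a minimal sketch on a toy matrix. The name toy.sim, the IDs, and the scores are all made up for illustration; the sketch only assumes, as the output above suggests, that the similarity matrix carries the article IDs as row and column names.

```r
# Toy 3 x 3 similarity matrix; IDs and scores are hypothetical.
ids <- c("101", "202", "303")
toy.sim <- matrix(c(1.0, 0.4, 0.0,
                    0.4, 1.0, 0.2,
                    0.0, 0.2, 1.0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(ids, ids))

# A character index selects the row by name and returns a named numeric vector.
toy.match <- toy.sim["202", ]
print(toy.match)

# A numeric index is positional instead: toy.sim[202, ] would fail with
# "subscript out of bounds" because the matrix has only three rows.
```

This is why the article ID is passed as a quoted string: it is matched against the row names rather than treated as a row position.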

Let's now take this row and make a data frame:

> match.df <- data.frame(ID = names(match.docs), cosine = match.docs, stringsAsFactors=FALSE)

> match.df$ID <- as.integer(match.df$ID)
> head(match.df)
           ID cosine
38081   38081      1
306465 306465      0
371436 371436      0
410152 410152      0
180407 180407      0
311113 311113      0

Our match.df data frame now contains all the matching documents for 38081 and their cosine scores. Unsurprisingly, the first row is 38081 itself: every document matches itself perfectly, with a cosine score of 1.

But as we said before, we are going to recommend only the top 30 matches:

> match.refined<-head(match.df[order(-match.df$cosine),],30)
> head(match.refined)
           ID    cosine
38081   38081 1.0000000
38069   38069 0.3779645
231136 231136 0.2672612
334088 334088 0.2672612
276011 276011 0.2519763
394401 394401 0.2390457

Here we sort the match.df data frame in descending order of cosine similarity and extract the top 30 matches using the head function.
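The same sort-and-truncate pattern can be tried on a small named vector. This is only a sketch: toy.scores, toy.df, toy.top, and the cutoff n = 3 are hypothetical stand-ins for match.docs, match.df, match.refined, and the top-30 cutoff.

```r
# Hypothetical similarity scores for one query article (ID "101").
toy.scores <- c("101" = 1.00, "202" = 0.10, "303" = 0.35,
                "404" = 0.00, "505" = 0.35)

# The vector names become the ID column; the scores become the cosine column.
toy.df <- data.frame(ID = names(toy.scores), cosine = toy.scores,
                     stringsAsFactors = FALSE)
toy.df$ID <- as.integer(toy.df$ID)

# order() on the negated scores sorts in descending order;
# head() then keeps the first n rows.
n <- 3
toy.top <- head(toy.df[order(-toy.df$cosine), ], n)
print(toy.top)
```

Articles 303 and 505 tie on cosine score, and order() leaves tied rows in their original relative order, so 303 is listed before 505.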

Now that we have the matching documents, we need to present them in a ranked order. In order to rank the results, we are going to calculate some additional measures and use fuzzy logic to get the final ranking score.

Before we go ahead and calculate the additional measures, let's merge title.df and others.df with match.refined:

> match.refined <- inner_join(match.refined, title.df)
Joining, by = "ID"
> match.refined <- inner_join(match.refined, others.df)
Joining, by = "ID"

> head(match.refined)

      ID    cosine TITLE
1  38081 1.0000000 PRECIOUS-Gold ticks lower, US dollar holds near peak
2  38069 0.3779645 PRECIOUS-Bullion drops nearly 1 pct on dollar, palladium holds near 2-1/2-yr high
3 231136 0.2672612 Dollar steady near 3-1/2 month lows vs. yen, Aussie weaker
4 334088 0.2672612 Canadian dollar falls amid lower than expected GDP data
5 276011 0.2519763 Gold holds near four-month low as ECB move on rates awaited
6 394401 0.2390457 Dollar Tree Will Buy Competitor Family Dollar For $8.5 Billion
          PUBLISHER CATEGORY
1           Reuters        b
2           Reuters        b
3            NASDAQ        b
4          CTV News        b
5 Business Standard        b
6     The Inquisitr        b

We have all the information and the cosine similarity in one data frame now.
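As a rough illustration of what the join does, the following sketch uses two made-up data frames, toy.top and toy.titles, standing in for match.refined and title.df. It assumes the dplyr package is installed and that the ID column is an integer in both tables. The by = "ID" argument is spelled out here, which avoids the "Joining, by" message.

```r
library(dplyr)

# Hypothetical ranked matches, already sorted by cosine score.
toy.top <- data.frame(ID = c(101L, 303L, 505L),
                      cosine = c(1.00, 0.35, 0.35))

# Hypothetical lookup table in a different row order, with one unrelated ID.
toy.titles <- data.frame(ID = c(505L, 101L, 303L, 999L),
                         TITLE = c("Title C", "Title A", "Title B", "Unused"),
                         stringsAsFactors = FALSE)

# inner_join keeps only the IDs present in both tables and
# follows the row order of the first table.
toy.joined <- inner_join(toy.top, toy.titles, by = "ID")
print(toy.joined)
```

Because the result follows the row order of the first table, the ranking by cosine score is still in place after the join. Note that an inner join silently drops any match whose ID is missing from the lookup table.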
