Searching

Having created our similarity matrix, we can use it to find matches for any given document. This section shows how to perform that search.

Once again, the block diagram of step 2 is presented as follows:

We will be using the sim.score created in the previous step to perform the search.

Let's say we want to find similar articles to article 38081:

> match.docs <- sim.score["38081",]
> match.docs
    38081    306465    371436    410152    180407    311113    263442    171310    116144     70584 
1.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.1428571 0.1543033 
   228128    128325    263795    230506    326375    136203    166993    158814    417839    220118 
0.0000000 0.0000000 0.0000000 0.1690309 0.0000000 0.0000000 0.0000000 0.1259882 0.0000000 0.0000000 
   276048    307643     38069    349240    192743    131763    156247     16642    354055    410578 
0.1336306 0.0000000 0.3779645 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 
   196045    393546     35625    370930     41315     35049    104981    276610    196153    367915 
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000

We go to our sim.score similarity matrix and pick up row 38081, storing it in match.docs. This row holds the similarity score of every article against article 38081.
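To make the row lookup concrete, here is a minimal sketch on a toy matrix. The matrix toy.sim, its document IDs, and its scores are invented for illustration; only the indexing pattern mirrors the step above. The row is selected by name, as a character string, rather than by position.

```r
# Hypothetical 3 x 3 similarity matrix; IDs and scores are invented.
ids <- c("101", "202", "303")
toy.sim <- matrix(c(1.0, 0.4, 0.0,
                    0.4, 1.0, 0.2,
                    0.0, 0.2, 1.0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(ids, ids))

# Indexing with a character string selects the row by name.
toy.match <- toy.sim["202", ]

# The result is a named numeric vector: the names are document IDs,
# the values are similarity scores against document 202.
print(toy.match)
```

Because the result keeps the column names, the IDs travel with the scores into the next step.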

Let's now take this row and make a data frame:

> match.df <- data.frame(ID = names(match.docs), cosine = match.docs, stringsAsFactors=FALSE)

> match.df$ID <- as.integer(match.df$ID)
> head(match.df)
           ID cosine
38081   38081      1
306465 306465      0
371436 371436      0
410152 410152      0
180407 180407      0
311113 311113      0

Our match.df data frame now contains all the matching documents for 38081 and their cosine scores. No wonder the first row is 38081; it has to match itself perfectly.
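The conversion from a named vector to a data frame can be sketched on toy data as follows. The vector toy.match and its values are invented. The integer conversion is presumably there so that the ID column has the same type as the ID columns of the data frames joined later; that reading is an assumption, not something the text states.

```r
# Hypothetical named score vector; IDs and values are invented.
toy.match <- c("101" = 0.4, "202" = 1.0, "303" = 0.2)

toy.df <- data.frame(ID = names(toy.match),
                     cosine = toy.match,
                     stringsAsFactors = FALSE)

# names() returns character strings, so the IDs start out as text.
# Converting them gives an integer key column.
toy.df$ID <- as.integer(toy.df$ID)

# The vector's names are also carried over as the row names.
str(toy.df)
```

This is why the printed data frame shows each ID twice: once as the row name and once in the ID column.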

But as we said before, we are going to recommend only the top 30 matches:

> match.refined<-head(match.df[order(-match.df$cosine),],30)
> head(match.refined)
           ID    cosine
38081   38081 1.0000000
38069   38069 0.3779645
231136 231136 0.2672612
334088 334088 0.2672612
276011 276011 0.2519763
394401 394401 0.2390457

Here we sort the match.df data frame in descending order of cosine similarity and extract the top 30 matches using the head function.
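The sort-then-truncate pattern is easier to follow on a few rows. The sketch below uses an invented data frame, toy.df, and a hypothetical cutoff, top.n, of 3 instead of 30.

```r
# Hypothetical scores; IDs and values are invented.
toy.df <- data.frame(ID = c(101L, 202L, 303L, 404L),
                     cosine = c(0.4, 1.0, 0.2, 0.4))

# order() on the negated scores gives the row permutation that sorts
# from highest to lowest; head() then keeps the first top.n rows.
top.n <- 3
toy.top <- head(toy.df[order(-toy.df$cosine), ], top.n)

# Tied scores (IDs 101 and 404) keep their original relative order.
print(toy.top)
```

If fewer than top.n rows exist, head simply returns all of them, so the same line works for small result sets.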

Now that we have the matching documents, we need to present them in a ranked order. In order to rank the results, we are going to calculate some additional measures and use fuzzy logic to get the final ranking score.

Before we go ahead and calculate the additional measures, let's merge title.df and others.df with match.refined:

> match.refined <- inner_join(match.refined, title.df)
Joining, by = "ID"
> match.refined <- inner_join(match.refined, others.df)
Joining, by = "ID"

> head(match.refined)

      ID    cosine                                                                             TITLE
1  38081 1.0000000                              PRECIOUS-Gold ticks lower, US dollar holds near peak
2  38069 0.3779645 PRECIOUS-Bullion drops nearly 1 pct on dollar, palladium holds near 2-1/2-yr high
3 231136 0.2672612                        Dollar steady near 3-1/2 month lows vs. yen, Aussie weaker
4 334088 0.2672612                           Canadian dollar falls amid lower than expected GDP data
5 276011 0.2519763                       Gold holds near four-month low as ECB move on rates awaited
6 394401 0.2390457                    Dollar Tree Will Buy Competitor Family Dollar For $8.5 Billion
          PUBLISHER CATEGORY
1           Reuters        b
2           Reuters        b
3            NASDAQ        b
4          CTV News        b
5 Business Standard        b
6     The Inquisitr        b

We have all the information and the cosine similarity in one data frame now.
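As a rough illustration of what the join does, the sketch below runs inner_join from dplyr on two small, invented tables (toy.top and toy.title are hypothetical names). It also passes the key explicitly through the by argument, which is optional but makes the join column visible in the code.

```r
library(dplyr)

# Hypothetical tables; all IDs, scores, and titles are invented.
toy.top   <- data.frame(ID = c(202L, 101L), cosine = c(1.0, 0.4))
toy.title <- data.frame(ID = c(101L, 202L, 303L),
                        TITLE = c("Title A", "Title B", "Title C"),
                        stringsAsFactors = FALSE)

# Naming the key with `by` makes the join column explicit and
# suppresses the "Joining, by" message.
toy.joined <- inner_join(toy.top, toy.title, by = "ID")

# Only IDs present in both tables survive, in the row order of toy.top.
print(toy.joined)
```

Since the left table's row order is preserved, the results stay sorted by cosine similarity after the join.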
