News aggregator use case and data

We have 1,000 news articles from different publishers. Each article belongs to one of several categories, such as technical, entertainment, and others. Our goal is to alleviate the cold start problem faced by our customers. Simply put, what do we recommend to a customer when we have no information about their preferences? Either we are seeing the customer for the first time, or we have not yet set up a mechanism to capture customer interaction with our products or items.

This data is a subset of the news aggregator dataset from https://archive.ics.uci.edu/ml/datasets/News+Aggregator.

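The subset is stored in the file newsCorpus.csv. Despite the .csv extension, the fields are tab-separated, which is why we read the file with read_tsv.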

Let's quickly look at the data provided:

> library(tidyverse)
> library(tidytext)
> library(tm)
> library(slam)
> 
> 
> cnames <- c('ID' , 'TITLE' , 'URL' , 
+             'PUBLISHER' , 'CATEGORY' , 
+             'STORY' , 'HOSTNAME' ,'TIMESTAMP')
> 
> data <- read_tsv('newsCorpus.csv', 
+                    col_names = cnames,
+                    col_types = cols(
+                    ID = col_integer(),
+                    TITLE = col_character(),
+                    URL = col_character(),
+                    PUBLISHER = col_character(),
+                    CATEGORY = col_character(),
+                    STORY = col_character(),
+                    HOSTNAME = col_character(),
+                    TIMESTAMP = col_double()
+                  )
+                  )
> 
> 
> head(data)
# A tibble: 6 x 8
      ID                                                      TITLE
   <int>                                                      <chr>
1 273675       More iWatch release hints, HealthKit lays groundwork
2 356956    Burger King debuts Proud Whopper to support LGBT rights
3 143853             A-Sides and B-Sides: Record Store Day Lives On
4 376630                         Smallpox virus found on NIH campus
5 160274 iPhone 6 Specs Leak: Curved Glass Display; Q3 Release Date
6 273554                   New Valve VR headset crops up in testing
# ... with 6 more variables: URL <chr>, PUBLISHER <chr>, CATEGORY <chr>, STORY <chr>, HOSTNAME <chr>,
#   TIMESTAMP <dbl>

Every article has the following columns:

ID: A unique identifier
TITLE: The title of the article (free text)
URL: The article's URL
PUBLISHER: Publisher of the article
CATEGORY: The category code under which the article is grouped
STORY: An ID for the group of stories the article belongs to
HOSTNAME: Hostname of the URL
TIMESTAMP: The time the article was published, stored as a number
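
The TIMESTAMP values shown above (for example, 1.395753e+12) are hard to read as dates. The following is a minimal sketch for converting one, assuming the values are milliseconds since the Unix epoch, as the description of the source dataset suggests. The names ts_ms and published are ours and do not appear elsewhere in this chapter.

```r
# Illustrative value copied from the console output (rounded by printing).
ts_ms <- 1.395753e+12

# Assumption: the value is in milliseconds, so divide by 1000 to get seconds,
# then interpret the result as seconds since 1970-01-01 in UTC.
published <- as.POSIXct(ts_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Keep only the calendar date for display.
published_date <- format(published, "%Y-%m-%d")
print(published_date)
```

The same conversion could be applied to the whole column with mutate if readable dates are needed later.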

We pass the cnames vector to read_tsv through the col_names argument to supply the column names listed above, because the file has no header row. In the same call, the col_types argument specifies the data type of each column.
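
To see how col_names and col_types work together, here is a small sketch that can be run without the real file. It writes two made-up rows to a temporary file in the same header-less, tab-separated layout and reads them back. The sample rows, the temporary file, and the name articles are hypothetical; the compact string "iccccccd" is readr's shorthand for one integer column, six character columns, and one double column.

```r
library(readr)

# Two made-up rows in the same eight-column, tab-separated layout (no header).
sample_lines <- c(
  "1\tTitle one\thttp://example.com/a\tPublisher A\tb\tstory1\texample.com\t1395753000000",
  "2\tTitle two\thttp://example.com/b\tPublisher B\tt\tstory2\texample.com\t1403197000000"
)
tmp <- tempfile(fileext = ".tsv")
writeLines(sample_lines, tmp)

cnames <- c("ID", "TITLE", "URL", "PUBLISHER",
            "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP")

# col_names supplies the headings; the compact col_types string fixes the
# type of each column in order instead of letting readr guess.
articles <- read_tsv(tmp, col_names = cnames, col_types = "iccccccd")

# problems() lists any values that could not be parsed with the given types.
parse_issues <- problems(articles)
print(articles)
```

Spelling out the types, as in the longer cols() call above, is more verbose but easier to read when columns are added or reordered.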

The following are some distinct publishers and categories:

> data %>% group_by(PUBLISHER) %>% summarise()

# A tibble: 2,991 x 1
                 PUBLISHER
                     <chr>
 1                 1011now
 2                  10News
 3                    10TV
 4             123Jump.com
 5           12NewsNow.Com
 6               13WHAM-TV
 7       13abc Action News
 8 14 News WFIE Evansville
 9       "24\/7 Wall St."
10     "2DayFM \(blog\)"
# ... with 2,981 more rows

> data %>% group_by(CATEGORY) %>% summarise()
# A tibble: 4 x 1
  CATEGORY
     <chr>
1        b
2        e
3        m
4        t
>

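There are four categories and 2,991 publishers. In the source dataset, the category codes stand for business (b), entertainment (e), health (m), and science and technology (t).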
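
Grouping followed by an empty summarise() is one way to list distinct values. As an alternative sketch, dplyr also offers distinct(), n_distinct(), and count() for the same purpose. The toy tibble below is hypothetical and only stands in for the real data.

```r
library(dplyr)

# Hypothetical toy data standing in for the articles tibble.
toy <- tibble(
  PUBLISHER = c("Reuters", "Reuters", "NASDAQ", "10News"),
  CATEGORY  = c("b", "b", "b", "t")
)

# One row per distinct category.
categories <- toy %>% distinct(CATEGORY)

# Number of distinct publishers, without listing them.
n_publishers <- n_distinct(toy$PUBLISHER)

# Articles per category, largest group first.
per_category <- toy %>% count(CATEGORY, sort = TRUE)

print(categories)
print(n_publishers)
print(per_category)
```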

Let's look a little closer at our publishers:

> publisher.count <- data.frame(data %>% group_by(PUBLISHER) %>% summarise(ct =n()))

> head(publisher.count)
      PUBLISHER ct
1       1011now  1
2        10News  4
3          10TV  2
4   123Jump.com  1
5 12NewsNow.Com  3
6     13WHAM-TV  3

> dim(publisher.count)
[1] 2991    2

> dim(publisher.count[publisher.count$ct <= 10,])
[1] 2820    2

We first count the articles under each publisher. It looks like many publishers have very few articles. To verify this, we count the publishers with ten or fewer articles: 2,820 of the 2,991 publishers, about 94%, fall into this group.
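
The filter in the transcript uses ct <= 10, which also keeps publishers with exactly ten articles. The following sketch, on hypothetical toy counts, shows how <= and < differ and how to express the result as a share of all publishers. The names toy_counts, at_most_ten, fewer_than_ten, and share are ours.

```r
library(dplyr)

# Hypothetical per-publisher article counts.
toy_counts <- tibble(
  PUBLISHER = c("A", "B", "C", "D"),
  ct        = c(1L, 10L, 11L, 90L)
)

# Ten or fewer articles: includes publisher B, which has exactly ten.
at_most_ten <- toy_counts %>% filter(ct <= 10) %>% nrow()

# Strictly fewer than ten articles: excludes publisher B.
fewer_than_ten <- toy_counts %>% filter(ct < 10) %>% nrow()

# Share of publishers with ten or fewer articles (mean of a logical vector).
share <- mean(toy_counts$ct <= 10)

print(c(at_most_ten = at_most_ten, fewer_than_ten = fewer_than_ten))
print(share)
```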

Let's get the top 100 publishers by looking at the number of articles they have published:

> publisher.top <- head(publisher.count[order(-publisher.count$ct),],100)

> head(publisher.top)
            PUBLISHER ct
1937          Reuters 90
309      Businessweek 58
1548           NASDAQ 49
495  Contactmusic.com 48
540        Daily Mail 47
882        GlobalPost 47
>

We can see that Reuters tops the list. For our exercise, we retain only the articles from the top 100 publishers. The publisher.top data frame holds these 100 publishers.
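
Taking the first 100 rows of a sorted data frame cuts ties at the boundary arbitrarily; in the output above, for instance, Daily Mail and GlobalPost both have 47 articles. As a hedged sketch on hypothetical toy counts, the dplyr equivalents below show the difference between a fixed-size cut and slice_max(), which keeps ties by default (slice_max() assumes dplyr 1.0.0 or later).

```r
library(dplyr)

# Hypothetical per-publisher article counts with a tie at 47.
toy_counts <- tibble(
  PUBLISHER = c("A", "B", "C", "D"),
  ct        = c(3L, 90L, 47L, 47L)
)

# Fixed-size cut: sort descending and keep exactly two rows.
top2 <- toy_counts %>% arrange(desc(ct)) %>% head(2)

# Tie-aware cut: keeps every row tied with the second-largest count.
top2_with_ties <- toy_counts %>% slice_max(ct, n = 2)

print(top2)
print(top2_with_ties)
```

Which behaviour is preferable depends on whether a fixed number of publishers matters more than treating equally productive publishers alike.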

For our top 100 publishers, let's now get their articles and other information:

> data.subset <- inner_join(publisher.top, data)
Joining, by = "PUBLISHER"

> head(data.subset)
  PUBLISHER ct     ID                                                       TITLE
1   Reuters 90  38081        PRECIOUS-Gold ticks lower, US dollar holds near peak
2   Reuters 90 306465                UKs FTSE rallies as Rolls-Royce races higher
3   Reuters 90 371436 US economic growth to continue at modest pace - Feds Lacker
4   Reuters 90 410152             Traders pare bets on earlier 2015 Fed rate hike
5   Reuters 90 180407        FOREX-Dollar slides broadly, bullish data helps euro
6   Reuters 90 311113 Fitch Publishes Sector Credit Factors for Japanese Insurers
                                                                                                      URL
1                         http://in.reuters.com/article/2014/03/24/markets-precious-idINL4N0ML03U20140324
2                  http://www.reuters.com/article/2014/06/19/markets-britain-stocks-idUSL6N0P01DM20140619
3                                  http://in.reuters.com/article/2014/07/08/usa-fed-idINW1N0OF00M20140708
4                      http://www.reuters.com/article/2014/08/01/us-usa-fed-futures-idUSKBN0G144U20140801
5                            http://in.reuters.com/article/2014/05/06/markets-forex-idINL6N0NS25P20140506
6 http://in.reuters.com/article/2014/06/24/fitch-publishes-sector-credit-factors-fo-idINFit69752320140624
  CATEGORY                         STORY        HOSTNAME    TIMESTAMP
1        b df099bV_5_nKjKMqxhiVh1yCmHe3M  in.reuters.com 1.395753e+12
2        b dShvKWlyRq_Z3pM1C1lhuwYEY5MvM www.reuters.com 1.403197e+12
3        b dNJB5f4GzH0jTlMeEyWcKVpMod5UM  in.reuters.com 1.404897e+12
4        b dunL-T5pNDVbTpMZnZ-3oAUKlKybM www.reuters.com 1.406926e+12
5        b d8DabtTlhPalvyMKxQ7tSGkTnN_9M  in.reuters.com 1.399369e+12
6        b d3tIMfB2mg-9MZM4G_jGTEiRVl3jM  in.reuters.com 1.403634e+12
> dim(data.subset)
[1] 2638    9
>

We join the publisher.top data frame with data to get all the details for our top 100 publishers. The resulting data.subset has a total of 2,638 articles.
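
The join above lets dplyr pick the key, which it reports as "Joining, by = "PUBLISHER"". The sketch below, on hypothetical toy data, names the key explicitly and also shows semi_join() as an alternative that keeps only the matching articles without adding the ct column. The names toy_top, toy_articles, joined, and filtered are ours.

```r
library(dplyr)

# Hypothetical top publishers with their article counts.
toy_top <- tibble(PUBLISHER = c("Reuters", "NASDAQ"), ct = c(2L, 1L))

# Hypothetical articles, one of which is from a publisher outside the top list.
toy_articles <- tibble(
  ID        = 1:4,
  PUBLISHER = c("Reuters", "10News", "NASDAQ", "Reuters")
)

# Inner join with an explicit key: publisher columns plus article columns.
joined <- inner_join(toy_top, toy_articles, by = "PUBLISHER")

# Semi join: same rows of toy_articles, but no columns from toy_top.
filtered <- semi_join(toy_articles, toy_top, by = "PUBLISHER")

print(joined)
print(filtered)
```

A quick sanity check after such a join is that the number of rows equals the sum of the ct column for the retained publishers.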

Having looked at our data, let's now move on to designing our recommendation engine.
