News aggregator use case and data

We have several thousand news articles from different publishers. Each article belongs to one of a few categories, such as technology and entertainment. Our goal is to alleviate the cold start problem faced by our customers. Simply put, what do we recommend to a customer when we know nothing about their preferences? Either we are seeing the customer for the first time, or we have not yet set up any mechanism to capture customer interaction with our products/items.

This data is a subset of the news aggregator dataset from https://archive.ics.uci.edu/ml/datasets/News+Aggregator.

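Our subset is stored in the file newsCorpus.csv. Despite the .csv extension, the fields are tab-separated, which is why we read the file with read_tsv.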

Let's quickly look at the data provided:

> library(tidyverse)
> library(tidytext)
> library(tm)
> library(slam)
>
>
> cnames <- c('ID' , 'TITLE' , 'URL' ,
+ 'PUBLISHER' , 'CATEGORY' ,
+ 'STORY' , 'HOSTNAME' ,'TIMESTAMP')
>
> data <- read_tsv('newsCorpus.csv',
+ col_names = cnames,
+ col_types = cols(
+ ID = col_integer(),
+ TITLE = col_character(),
+ URL = col_character(),
+ PUBLISHER = col_character(),
+ CATEGORY = col_character(),
+ STORY = col_character(),
+ HOSTNAME = col_character(),
+ TIMESTAMP = col_double()
+ )
+ )
>
>
> head(data)
# A tibble: 6 x 8
ID TITLE
<int> <chr>
1 273675 More iWatch release hints, HealthKit lays groundwork
2 356956 Burger King debuts Proud Whopper to support LGBT rights
3 143853 A-Sides and B-Sides: Record Store Day Lives On
4 376630 Smallpox virus found on NIH campus
5 160274 iPhone 6 Specs Leak: Curved Glass Display; Q3 Release Date
6 273554 New Valve VR headset crops up in testing
# ... with 6 more variables: URL <chr>, PUBLISHER <chr>, CATEGORY <chr>, STORY <chr>, HOSTNAME <chr>,
# TIMESTAMP <dbl>

Every article has the following columns:

  • ID: A unique identifier
  • TITLE: The title of the article (free text)
  • URL: The article's URL
  • PUBLISHER: Publisher of the article
  • CATEGORY: The news category, coded as a single letter: b (business), e (entertainment), m (health), or t (science and technology)
  • STORY: An ID for the group of stories the article belongs to
  • HOSTNAME: Hostname of the URL
  • TIMESTAMP: Approximate publication time, in milliseconds since the Unix epoch (January 1, 1970)
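
TIMESTAMP values such as 1.395753e+12 are hard to read as they stand. The following is a minimal sketch for turning them into dates, assuming the values are milliseconds since the Unix epoch (as the UCI dataset description states). The helper name to_datetime is hypothetical and not part of the original code.

```r
# Hypothetical helper: convert a millisecond Unix timestamp to POSIXct (UTC).
# Assumption: TIMESTAMP holds milliseconds since 1970-01-01, not seconds.
to_datetime <- function(ms) {
  as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")
}

# One of the TIMESTAMP values shown in the data
published <- to_datetime(1.395753e+12)
format(published, "%Y-%m-%d")
```

If the conversion yields dates far outside 2014, the unit assumption is wrong and should be revisited.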

The file has no header row, so we pass the cnames vector to read_tsv through col_names to supply the column headings. We also pass a cols() specification through col_types, which fixes the type of each column instead of leaving read_tsv to guess it.
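
To see how col_names and col_types interact, here is a small self-contained sketch. The two rows and the three-column layout are made up for illustration; only the read_tsv arguments mirror the call above.

```r
library(readr)

# Write two illustrative rows (not from the corpus) to a temporary
# tab-separated file that, like newsCorpus.csv, has no header row.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1\tFirst title\tb", "2\tSecond title\tt"), tmp)

toy <- read_tsv(tmp,
                col_names = c("ID", "TITLE", "CATEGORY"),
                col_types = cols(
                  ID = col_integer(),
                  TITLE = col_character(),
                  CATEGORY = col_character()
                ))

# The column classes follow the col_types specification
sapply(toy, class)
```

Because col_names is a character vector, read_tsv treats the first line of the file as data rather than as headings.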

The following are some distinct publishers and categories:

> data %>% group_by(PUBLISHER) %>% summarise()

# A tibble: 2,991 x 1
PUBLISHER
<chr>
1 1011now
2 10News
3 10TV
4 123Jump.com
5 12NewsNow.Com
6 13WHAM-TV
7 13abc Action News
8 14 News WFIE Evansville
9 "24\/7 Wall St."
10 "2DayFM \(blog\)"
# ... with 2,981 more rows

> data %>% group_by(CATEGORY) %>% summarise()
# A tibble: 4 x 1
CATEGORY
<chr>
1 b
2 e
3 m
4 t
>

There are four categories and 2,991 publishers.
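
The same questions can be answered with count() and n_distinct(), which also report how many articles fall in each group. The sketch below runs on a toy tibble with illustrative rows, and the category.labels lookup is an assumption based on the UCI dataset description.

```r
library(dplyr)

# Toy stand-in for `data`; the rows are illustrative, not taken from the corpus
toy <- tibble(
  PUBLISHER = c("Reuters", "Reuters", "NASDAQ", "Daily Mail"),
  CATEGORY  = c("b", "b", "b", "e")
)

# Assumed meaning of the single-letter codes (per the UCI description)
category.labels <- c(b = "business", e = "entertainment",
                     m = "health", t = "science and technology")

# Number of articles per category, with a readable label attached
category.count <- toy %>%
  count(CATEGORY, name = "ct") %>%
  mutate(LABEL = unname(category.labels[CATEGORY]))

# Number of distinct publishers
n_distinct(toy$PUBLISHER)
```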

Let's look a little closer at our publishers:

> publisher.count <- data.frame(data %>% group_by(PUBLISHER) %>% summarise(ct =n()))

> head(publisher.count)
PUBLISHER ct
1 1011now 1
2 10News 4
3 10TV 2
4 123Jump.com 1
5 12NewsNow.Com 3
6 13WHAM-TV 3

> dim(publisher.count)
[1] 2991 2

> dim(publisher.count[publisher.count$ct <= 10,])
[1] 2820 2

We first find the number of articles under each publisher. It looks like a lot of publishers have very few articles. To validate this, we count the publishers with 10 or fewer articles: 2,820 out of 2,991 publishers, or about 94%, fall into this group.
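
The threshold check can also be written without subsetting the data frame. This is a minimal sketch on toy counts; the publishers and values are illustrative only.

```r
# Toy counts standing in for publisher.count (values are illustrative)
publisher.count <- data.frame(
  PUBLISHER = c("A", "B", "C", "D"),
  ct = c(1, 4, 10, 25)
)

# ct <= 10 also includes publishers with exactly 10 articles
small <- sum(publisher.count$ct <= 10)

# Share of publishers in the long tail
share <- small / nrow(publisher.count)
```

Summing a logical vector counts its TRUE values, so small is the number of publishers at or below the threshold.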

Let's get the top 100 publishers by looking at the number of articles they have published:

> publisher.top <- head(publisher.count[order(-publisher.count$ct),],100)

> head(publisher.top)
PUBLISHER ct
1937 Reuters 90
309 Businessweek 58
1548 NASDAQ 49
495 Contactmusic.com 48
540 Daily Mail 47
882 GlobalPost 47
>

We can see that Reuters tops the list. For our exercise, we will retain only the articles from the top 100 publishers, which are now held in the publisher.top data frame.
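
The same top-n selection can be expressed with dplyr verbs. The sketch below uses toy counts and a hypothetical n.top of 2 instead of 100.

```r
library(dplyr)

# Toy counts standing in for publisher.count (values are illustrative)
publisher.count <- data.frame(
  PUBLISHER = c("A", "B", "C", "D"),
  ct = c(1, 25, 10, 4),
  stringsAsFactors = FALSE
)

n.top <- 2  # the text uses 100

# Sort by article count, largest first, and keep the first n.top rows
publisher.top <- publisher.count %>%
  arrange(desc(ct)) %>%
  head(n.top)
```

Unlike the base R version, arrange() does not keep the original row names, so the output rows are numbered from 1.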

For our top 100 publishers, let's now get their articles and other information:

> data.subset <- inner_join(publisher.top, data)
Joining, by = "PUBLISHER"

> head(data.subset)
PUBLISHER ct ID TITLE
1 Reuters 90 38081 PRECIOUS-Gold ticks lower, US dollar holds near peak
2 Reuters 90 306465 UKs FTSE rallies as Rolls-Royce races higher
3 Reuters 90 371436 US economic growth to continue at modest pace - Feds Lacker
4 Reuters 90 410152 Traders pare bets on earlier 2015 Fed rate hike
5 Reuters 90 180407 FOREX-Dollar slides broadly, bullish data helps euro
6 Reuters 90 311113 Fitch Publishes Sector Credit Factors for Japanese Insurers
URL
1 http://in.reuters.com/article/2014/03/24/markets-precious-idINL4N0ML03U20140324
2 http://www.reuters.com/article/2014/06/19/markets-britain-stocks-idUSL6N0P01DM20140619
3 http://in.reuters.com/article/2014/07/08/usa-fed-idINW1N0OF00M20140708
4 http://www.reuters.com/article/2014/08/01/us-usa-fed-futures-idUSKBN0G144U20140801
5 http://in.reuters.com/article/2014/05/06/markets-forex-idINL6N0NS25P20140506
6 http://in.reuters.com/article/2014/06/24/fitch-publishes-sector-credit-factors-fo-idINFit69752320140624
CATEGORY STORY HOSTNAME TIMESTAMP
1 b df099bV_5_nKjKMqxhiVh1yCmHe3M in.reuters.com 1.395753e+12
2 b dShvKWlyRq_Z3pM1C1lhuwYEY5MvM www.reuters.com 1.403197e+12
3 b dNJB5f4GzH0jTlMeEyWcKVpMod5UM in.reuters.com 1.404897e+12
4 b dunL-T5pNDVbTpMZnZ-3oAUKlKybM www.reuters.com 1.406926e+12
5 b d8DabtTlhPalvyMKxQ7tSGkTnN_9M in.reuters.com 1.399369e+12
6 b d3tIMfB2mg-9MZM4G_jGTEiRVl3jM in.reuters.com 1.403634e+12
> dim(data.subset)
[1] 2638 9
>

We join the top 100 publishers data frame, publisher.top, with data to get all the article details for those publishers. The resulting data.subset has a total of 2,638 articles.
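
Two variations on this join are worth knowing. The sketch below, on toy data with illustrative names, passes the join key explicitly and shows semi_join as an alternative when the ct column is not needed.

```r
library(dplyr)

# Toy stand-ins for publisher.top and data (rows are illustrative)
publisher.top <- data.frame(PUBLISHER = c("A", "B"), ct = c(2, 1),
                            stringsAsFactors = FALSE)
articles <- data.frame(ID = 1:4,
                       PUBLISHER = c("A", "A", "B", "Z"),
                       stringsAsFactors = FALSE)

# Naming the key explicitly avoids the "Joining, by = ..." message
joined <- inner_join(publisher.top, articles, by = "PUBLISHER")

# semi_join filters articles to the top publishers without adding ct
filtered <- semi_join(articles, publisher.top, by = "PUBLISHER")

# Sanity check: the join should return one row per counted article
sum(publisher.top$ct) == nrow(joined)
```

The final comparison is a cheap way to confirm that the join neither dropped nor duplicated articles.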

Having looked at our data, let's now move on to designing our recommendation engine.
