Reading the training data

We can access the reuter_50_50 dataset through the Data Folder link on the UCI Machine Learning Repository page that we linked to earlier. From there, we can download the C50.zip file. When unzipped, it yields a C50 folder that contains the C50train and C50test folders. First, we will read the text files from the C50train folder using the following code:

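# Reading Reuters train data
library(readtext)                      # provides readtext()
setwd("~/Desktop/C50/C50train")
temp <- list.files()                   # one folder per author
k <- 1; tr <- list(); trainx <- list(); trainy <- list()
for (i in seq_along(temp)) {
  # Store the author's name once for each of the 50 articles
  for (j in 1:50) {
    trainy[k] <- temp[i]
    k <- k + 1
  }
  author <- temp[i]
  files <- paste0("~/Desktop/C50/C50train/", author, "/*")
  tr <- readtext(files)                # data frame with doc_id and text columns
  trainx <- rbind(trainx, tr)
}
trainx <- trainx$text                  # keep only the article text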

The preceding code reads the 2,500 articles in the C50train folder into trainx and saves the authors' names in trainy. We start by setting the working directory to the C50train folder using the setwd function. The C50train folder contains 50 folders named after 50 authors, and each folder contains 50 articles written by the corresponding author. We assign a value of 1 to k and initialize tr, trainx, and trainy as empty lists. Then, we loop over the author folders so that trainy stores the author's name for each article and trainx stores the corresponding articles. Note that, after reading these 2,500 text files, trainx also contains the file names. The last line of code keeps only the 2,500 texts and drops the file names, which we will not need.
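The folder-per-author layout can also be read with base R alone. The following is a minimal sketch and not the approach used in this chapter: it swaps readtext for readLines, and the helper name read_author_folders, the toy folder, and the made-up authors and texts are all assumptions introduced for illustration. Unlike the chapter's code, it returns the labels as a character vector rather than a list:

```r
# Hypothetical helper (not from the chapter): read one sub-folder per author
# using base R only, returning article texts and matching author labels.
read_author_folders <- function(root) {
  authors <- list.dirs(root, full.names = FALSE, recursive = FALSE)
  texts <- character(0)
  labels <- character(0)
  for (author in authors) {
    files <- list.files(file.path(root, author), full.names = TRUE)
    # Collapse each file's lines into a single string per article
    docs <- vapply(files,
                   function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
                   character(1), USE.NAMES = FALSE)
    texts <- c(texts, docs)
    labels <- c(labels, rep(author, length(docs)))
  }
  list(x = texts, y = labels)
}

# Toy demonstration on a temporary folder (made-up authors and texts)
root <- file.path(tempdir(), "toy_C50train")
for (author in c("AuthorA", "AuthorB")) {
  dir.create(file.path(root, author), recursive = TRUE, showWarnings = FALSE)
  for (n in 1:2) {
    writeLines(paste("Article", n, "by", author),
               file.path(root, author, paste0(n, ".txt")))
  }
}
toy <- read_author_folders(root)
length(toy$x)   # number of articles read
toy$y           # one author label per article
```

Because the labels are built with rep(author, length(docs)), this sketch does not rely on every folder holding exactly 50 files, which the inner loop over 1:50 in the chapter's code assumes.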

Now, let's look at the content of text file 901 from the train data using the following code:

# Text file 901
trainx[901]
[1] "Drug discovery specialist Chiroscience Group plc said on Monday it is testing two anti-cancer compounds before deciding which will go forward into human trials before the end of the year. Both are MMP inhibitors, the same novel class of drug as British Biotech Plc's potential blockbuster Marimastat, which are believed to stop cancer cells from spreading. In an interview, chief executive John Padfield said Chiroscience hoped to have its own competitor to Marimastat in early trials next year and Phase III trials in 1998."

# Author
trainy[901]
[[1]]
[1] "JonathanBirt"

From the preceding code and output, we can make the following observations:

  • Text file 901 in trainx contains a news item about drug trials by the Chiroscience Group
  • The author of this short article is Jonathan Birt
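Because each author folder contributes 50 consecutive articles, the position of an article in trainx tells us which folder it came from. The short sketch below works through that arithmetic for article 901. The variable names are introduced here for illustration, and the mapping assumes that the folders were read in the order returned by list.files and that each holds exactly 50 files:

```r
# Map a position in trainx back to its author folder and article number,
# assuming 50 articles per author read in folder order.
k <- 901
articles_per_author <- 50
author_index  <- (k - 1) %/% articles_per_author + 1   # which author folder
article_index <- (k - 1) %%  articles_per_author + 1   # which article within it
author_index    # 19
article_index   # 1
```

Under these assumptions, article 901 is the first article read from the 19th author folder, so trainy[901] should match the 19th folder name in temp.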

Having read the text files and author names for the training data, we can repeat this process for the test data.
