Reading the test data

Next, we read the text files from the C50test folder, which sits inside the C50 folder, using the following code:

# Reuters test data
library(readtext)
test_dir <- "~/Desktop/C50/C50test"
authors <- list.files(test_dir)        # one subfolder per author
testx <- NULL; testy <- list()
for (author in authors) {
  # Read every article in this author's folder
  tr <- readtext(file.path(test_dir, author, "*"))
  testx <- rbind(testx, tr)
  # Store one author label per article read (50 per author in this dataset)
  testy <- c(testy, as.list(rep(author, nrow(tr))))
}
# Keep only the article text; the file names are not needed
testx <- testx$text

This code mirrors the code used for the training data; the only change is that testx and testy are now built from the C50test folder. We read 2,500 articles (50 authors with 50 articles each) into testx and store the corresponding authors' names in testy. Once again, the last line keeps only the text of the 2,500 articles and drops the file names, which aren't required for our analysis.
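
The key idea in this step is that every article must end up paired with the name of the folder it came from. The following is a minimal sketch of that pairing on a tiny stand-in dataset, assuming base R only (readLines rather than readtext) so that it runs without the C50 files. The folder name C50test_demo, the author names AuthorA and AuthorB, and the helper read_corpus are made up for illustration and are not part of the dataset or of any package.

```r
# Build a small stand-in for the C50test layout in a temporary directory.
# The author names and article contents are hypothetical.
root <- file.path(tempdir(), "C50test_demo")
unlink(root, recursive = TRUE)
demo_authors <- c("AuthorA", "AuthorB")
for (a in demo_authors) {
  dir.create(file.path(root, a), recursive = TRUE)
  for (n in 1:3) {
    writeLines(paste("Article", n, "by", a),
               file.path(root, a, paste0(n, ".txt")))
  }
}

# Hypothetical helper: read every file in each author folder and keep
# the texts (x) and the author labels (y) aligned by position.
read_corpus <- function(dir) {
  authors <- list.files(dir)
  x <- character(0)
  y <- character(0)
  for (a in authors) {
    paths <- list.files(file.path(dir, a), full.names = TRUE)
    texts <- vapply(paths,
                    function(p) paste(readLines(p), collapse = "\n"),
                    character(1), USE.NAMES = FALSE)
    x <- c(x, texts)
    y <- c(y, rep(a, length(texts)))
  }
  list(x = x, y = y)
}

demo <- read_corpus(root)
length(demo$x) == length(demo$y)   # texts and labels should line up
table(demo$y)                      # number of articles per author
```

Running the same two checks on testx and testy is a quick way to confirm that the real test data loaded as expected before moving on to preprocessing.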

Now that we've created the training and test data, we will carry out data preprocessing so that we can develop an author classification model.
