Now, we will read the text files from the C50test folder located within the C50 folder. We will use the following code to do so:
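# Reuters test data
setwd("~/Desktop/C50/C50test")
temp = list.files(pattern="*.*")   # one subfolder per author
k <- 1; tr <- list(); testx <- list(); testy <- list()
for (i in 1:length(temp)) {
  # Record the author's name once for each of the 50 articles
  for (j in 1:50) {
    testy[k] <- temp[i]
    k <- k+1
  }
  # Read every article in this author's subfolder
  author <- temp[i]
  files <- paste0("~/Desktop/C50/C50test/", author, "/*")
  tr <- readtext(files)
  testx <- rbind(testx, tr)
}
# Keep only the article text and drop the file names
testx <- testx$text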
Here, the only change in this code is that testx and testy are now built from the test data located within the C50test folder. We read 2,500 articles from the C50test folder into testx and save the corresponding authors' names in testy. Once again, the last line of code keeps only the text of the 2,500 test articles and drops the file names, which aren't required for our analysis.
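The inner loop in the code hard-codes 50 articles per author, so texts and labels stay aligned only when every author folder holds exactly 50 files. The following is a minimal sketch of the same folder-per-author idea on a toy directory, where the number of labels follows the number of files actually read. It uses base R's readLines rather than readtext so that it runs without extra packages; the folder name C50toy, the author names, the article contents, and the variables toyx and toyy are all hypothetical and used only for illustration.

```r
# Minimal sketch: toy folder-per-author layout (hypothetical authors and texts)
root <- file.path(tempdir(), "C50toy")
authors <- c("AuthorA", "AuthorB")
for (a in authors) {
  dir.create(file.path(root, a), recursive = TRUE, showWarnings = FALSE)
  for (n in 1:3) {
    writeLines(paste("Article", n, "by", a),
               file.path(root, a, paste0("doc", n, ".txt")))
  }
}

# One subfolder per author, mirroring the layout of C50test
folders <- list.files(root)
toyx <- character(0); toyy <- character(0)
for (author in folders) {
  paths <- list.files(file.path(root, author), full.names = TRUE)
  # Read each file as a single string
  texts <- vapply(paths, function(p) paste(readLines(p), collapse = "\n"),
                  character(1), USE.NAMES = FALSE)
  toyx <- c(toyx, texts)
  # The label count follows the number of files actually read
  toyy <- c(toyy, rep(author, length(texts)))
}

# Texts and labels must line up one-to-one
stopifnot(length(toyx) == length(toyy))
table(toyy)
```

On the real data, an equivalent check would be to confirm that testx and testy both have length 2,500 before moving on to preprocessing.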
Now that we've created the training and test data, we will carry out data preprocessing so that we can develop an author classification model.