Data partitioning

When training the model, we can use validation_split, which holds out a specified fraction of the training data for assessing validation error. The training data in this example contains the first 50 articles from the first author, followed by 50 articles from the second author, and so on. If we set validation_split to 0.2, the model is trained on the first 80% (2,000) of the articles, which come from the first 40 authors, while the last 20% (500 articles), written by the last 10 authors, is used for assessing validation error. As a result, no input from the last 10 authors contributes to model training. To overcome this problem, we randomly partition the training data into train and validation data using the following code:

# Data partition
trainx_org <- trainx
testx_org <- testx
trainy_org <- trainy
set.seed(1234)
ind <- sample(2, nrow(trainx), replace = TRUE, prob = c(0.8, 0.2))
trainx <- trainx_org[ind == 1, ]
validx <- trainx_org[ind == 2, ]
trainy <- trainy_org[ind == 1]
validy <- trainy_org[ind == 2]

As we can see, we have partitioned the data into train and validation sets using an 80:20 split. We also used the set.seed function so that the random partition is reproducible.
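Note that because sample() assigns each row independently with the given probabilities, the realized split is only approximately 80:20. A quick sanity-check sketch (the 2,500-row size is taken from the example above; exact counts will vary with the seed):

```r
# Sanity check: proportion of rows assigned to each partition.
# The split is approximate because each row is drawn independently.
set.seed(1234)
ind <- sample(2, 2500, replace = TRUE, prob = c(0.8, 0.2))
prop.table(table(ind))  # roughly 0.80 and 0.20
```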

After partitioning the training data, we will carry out one-hot encoding on the labels, which represents the correct author with a value of one and all other authors with a value of zero.
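A minimal sketch of this step, assuming the labels in trainy and validy are zero-based integer author IDs (0 through 49) and that the keras package's to_categorical() function is used for the encoding:

```r
library(keras)

# One-hot encode the labels: each label becomes a vector of 50 values,
# with a 1 in the position of the correct author and 0 elsewhere.
trainy <- to_categorical(trainy, num_classes = 50)
validy <- to_categorical(validy, num_classes = 50)
```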
