Partitioning the data

Next, we will partition the data into training and test datasets using the following code:

# Data partition
set.seed(1234)                          # fix the random seed for reproducibility
ind <- sample(2, nrow(data),            # assign each row a label, 1 or 2,
              replace = TRUE,
              prob = c(0.7, 0.3))       # with approximately 70:30 probabilities
training <- data[ind == 1, 1:21]        # independent variables, training set
test <- data[ind == 2, 1:21]            # independent variables, test set
trainingtarget <- data[ind == 1, 22]    # target variable, training set
testtarget <- data[ind == 2, 22]        # target variable, test set

As you can see from the preceding code, we use set.seed with a specific number, 1234 in this case, so that the same samples end up in the training and test datasets every time the code is run. This ensures that the reader can reproduce the same partition. For data partitioning, a 70:30 split is used here, but any other ratio can be chosen. Splitting the data in this way is a standard step in machine learning: the training data is used to develop the model, while the test data is used to assess how well the model performs on data it has never seen. Sometimes, a prediction model performs very well, or even perfectly, on the training data, yet performs very disappointingly when evaluated on unseen test data. In machine learning, this problem is termed overfitting. Evaluating on test data helps to ensure that the prediction model can be reliably used for making the appropriate decisions.
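As a quick illustration of this reproducibility, the short sketch below (a minimal example, not part of the partitioning code itself) resets the seed and repeats the same random draw twice, confirming that the two draws are identical:

# Reproducibility sketch: resetting the seed repeats the same random draw
set.seed(1234)
first_draw <- sample(2, 10, replace = TRUE, prob = c(0.7, 0.3))
set.seed(1234)
second_draw <- sample(2, 10, replace = TRUE, prob = c(0.7, 0.3))
identical(first_draw, second_draw)      # TRUE: both draws match exactly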

We use the names training and test to store the independent variables (the first 21 columns), and the names trainingtarget and testtarget to store the target variable, which is in the 22nd column of the dataset. After data partitioning, we have 1,523 observations in the training data and the remaining 603 observations in the test data. Note that, although we specified a 70:30 split, the actual ratio after data partitioning may not be exactly 70:30, because sample() assigns each row to a partition independently at random.
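The following sketch (assuming the partitioning code above has been run) can be used to confirm the sizes of the resulting sets and the realized split ratio:

# Sanity check: partition sizes and the realized split ratio
dim(training)               # 1,523 rows, 21 columns
dim(test)                   # 603 rows, 21 columns
length(trainingtarget)      # 1,523
length(testtarget)          # 603
prop.table(table(ind))      # close to 0.70 and 0.30, but not exact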
