Model selection and training

When you create and train a model, there are a few things you need to consider.

You need to choose the machine learning algorithm that is appropriate for the task at hand and for the data you are working with. You will then split the data into two or three subsets: training, validation, and test. The right proportions vary with the amount of data you have. For example, if you have 10,000 rows of data, then perhaps 80% for training and 20% for test is good. But if you have 10⁸ rows of data, perhaps 95% for training and 5% for test is better, because even 5% of that is five million rows, which is plenty for a reliable evaluation.
There is one rule that you must always follow to the letter. Whatever fractions you decide to use for your training, validation, and test sets, ALL THE DATA MUST COME FROM THE SAME DATASET. This is very important. You never want to take data from one dataset to train on and then data from a completely different dataset to test on, because the model would then be evaluated on data that may not resemble what it learned from. That will just lead to frustration. Always accumulate datasets large enough to train, validate, and test on!
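As a minimal sketch of the rule above, the following Python snippet shuffles a single dataset and cuts it into disjoint subsets, so every subset is drawn from the same source. The function name `split_dataset`, its parameters, and the 10,000-row list of integers are illustrative assumptions, not part of any particular library.

```python
import random

def split_dataset(rows, train_frac=0.8, val_frac=0.0, seed=0):
    """Shuffle one dataset and cut it into train, validation, and test subsets.

    Hypothetical helper for illustration; the fractions are assumptions.
    """
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a non-empty test set")
    shuffled = list(rows)                  # copy so the caller's data keeps its order
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for a repeatable split
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains becomes the test set
    return train, validation, test

rows = list(range(10_000))  # stand-in for 10,000 rows of real data
train, validation, test = split_dataset(rows, train_frac=0.8)
print(len(train), len(validation), len(test))
```

Passing a non-zero `val_frac` (for example `val_frac=0.1`) carves a validation set out of the same shuffled data. Shuffling before slicing matters when the rows are stored in some meaningful order, such as by date or by class.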
Validation data is used to tune and compare models before you touch the test set, so that the test set stays unseen until the final evaluation. Some people use it, some don't. However you split your data up, you will always have one set to train with and one set to test with. The goal is for your algorithm to be flexible enough to handle data it has not previously seen, and you can't measure that if you are testing with the same data you are developing against. The data can be split in two ways: into training and test sets only, or into training, validation (sometimes called cross-validation), and test sets.
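To make the role of each subset concrete, here is a small sketch in plain Python. It uses a toy one-dimensional k-nearest-neighbours classifier on synthetic data. The classifier, the made-up data, the 60/20/20 split, and the candidate values of `k` are all assumptions chosen for illustration. The validation set picks `k`, and the test set is used only once, after that choice is fixed.

```python
import random

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Fraction of (x, y) pairs in data that k-NN over train labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)

rng = random.Random(42)

# Synthetic, made-up data: the label is 1 when x > 0.5, with about 10% of labels flipped.
points = []
for _ in range(600):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    points.append((x, y))

# One shuffled source, three disjoint subsets (60/20/20).
rng.shuffle(points)
train, validation, test = points[:360], points[360:480], points[480:]

# Tune k on the validation set only (odd values avoid tied votes).
candidates = [1, 5, 15]
best_k = max(candidates, key=lambda k: accuracy(train, validation, k))

# Touch the test set once, after the choice of k is fixed.
test_accuracy = accuracy(train, test, best_k)
print(f"chosen k={best_k}, test accuracy={test_accuracy:.2f}")
```

In the two-way split there is no validation step: you train on one subset and evaluate on the other. That works as long as you are not also using the test results to choose between models.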
