Machine learning application flow

We have looked at the methods machine learning offers and how these methods recognize patterns. In this section, we'll see which flow is taken, or has to be taken, when mining data with machine learning. In each machine learning method, a decision boundary is set based on the model parameters, but adjusting the model parameters is not the only thing we have to care about. There is another troublesome problem, and it is actually the weakest point of machine learning: feature engineering. Deciding which features to create from the raw data, that is, the analysis subject, is a necessary step in building an appropriate classifier. And doing this, just like adjusting the model parameters, requires a massive amount of trial and error. In some cases, feature engineering requires far more effort than deciding a parameter.

Thus, when we simply say "machine learning," there are certain tasks that need to be completed in advance as preprocessing to build an appropriate classifier to deal with actual problems. Generally speaking, these tasks can be summarized as follows:

  • Deciding which machine learning method is suitable for a problem
  • Deciding what features should be used
  • Deciding how the model parameters should be set

Only when these tasks are completed does machine learning become valuable as an application.

So, how do you decide on suitable features and parameters? How do you get a machine to learn? Let's first look at the following diagram, as it makes it easier to grasp the whole picture of machine learning. It summarizes the learning flow:

[Figure: Machine learning application flow]

As you can see from the preceding image, the learning phase of machine learning can be roughly divided into these two steps:

  • Training
  • Testing

Literally, model parameters are updated and adjusted in the training phase, and the machine examines the merit of the model in the test phase. Research or experiments will hardly ever succeed with just one round of training and one test. We need to repeat the cycle of training → test, training → test, and so on until we get the right model.

Let's consider the preceding flowchart in order. First, you need to divide the raw data into two parts: a training dataset and a test dataset. What you need to be very careful about here is keeping the training data and the test data separate. Let's take an example so you can easily imagine what this means: you are trying to predict the daily price of the S&P 500 using machine learning with historical price data. (In fact, predicting the prices of financial instruments with machine learning is one of the most active research fields.)

Given that you have historical stock price data from 2001 to 2015 as raw data, what would happen if you performed the training with all the data from 2001 to 2015 and then ran the test over the same period? Even with simple machine learning and feature engineering, the probability of getting the right prediction could reach 70%, or even 80% or 90%. Then you might think: What a great discovery! The market is actually that simple! Now I can be a billionaire!

But this elation would be short-lived; reality doesn't go that well. If you actually started managing investments with that model, you wouldn't get the performance you expected and would be left confused. This is obvious if you stop and think about it: if the training dataset and the test dataset are the same, you are testing on data for which you already know the answer. Getting high precision is then a natural consequence, as you have predicted a correct answer using the correct answer itself, but it is meaningless as a test. If you would like to evaluate the model properly, be sure to use data from different time periods; for example, use the data from 2001 to 2010 as the training dataset and the data from 2011 to 2015 as the test dataset. In this case, you perform the test using data whose answer you don't already know, so you get a proper prediction precision rate. Now you can avoid going bankrupt by believing in investments that will never go well.
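
As a minimal sketch of such a chronological split, assuming a hypothetical CSV file of daily S&P 500 prices indexed by date (the file name and column names are illustrative, not prescribed):

    import pandas as pd

    # Hypothetical daily S&P 500 price data covering 2001-2015.
    prices = pd.read_csv("sp500_daily.csv", index_col="date", parse_dates=True)

    # Split chronologically: the test period must never overlap the
    # training period, or the test answers are already known.
    train = prices.loc["2001":"2010"]
    test = prices.loc["2011":"2015"]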

So, it is obvious that you should separate the training dataset from the test dataset, and you may not think this is a big problem. However, in real-world data mining it often happens that an experiment is conducted on the same data without this awareness, so please be extra careful. We've talked about this in the context of machine learning, but it also applies to deep learning.

If you divide the whole dataset into two, the first dataset to be used is the training dataset. To get a better precision rate, we first need to think about creating features from the training dataset. This feature engineering partly depends on human experience and intuition, and it can take a long time and a lot of effort before you find the features that give the best results. Also, each machine learning method accepts features in a different data format, because the model theory and formulas are unique to each method. For example, one model may only accept integers, another only non-negative values, and another only real numbers from 0 to 1. Let's look back at the previous example of stock prices. Since the price varies a lot within a broad range, it may be difficult to make a prediction with a model that can only take an integer.

Additionally, we have to make sure the data and the model are compatible. This doesn't mean that you can't use a model restricted to real numbers from 0 to 1 if you want to use stock prices as features. For example, if you divide all the stock price data by the maximum value over a certain period, the data fits into the 0-1 range, so you can use a model that only takes real numbers from 0 to 1. In other words, there is a possibility that you can apply a model if you slightly change the data format. You need to keep this point in mind when you think about feature engineering. Once you have created the features and decided which machine learning method to apply, then you just need to examine it.
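
Continuing the earlier stock price sketch, such rescaling might look like the following. Note that the scaling factor is derived from the training data only; reusing it for the test data avoids leaking information the model shouldn't see:

    # Divide by the maximum observed during the training period so
    # the values fit into the 0-1 range the model requires.
    train_scaled = train / train.max()

    # Reuse the training maximum for the test set rather than
    # computing a new one from the test data.
    test_scaled = test / train.max()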

In machine learning, features are of course important variables in deciding the precision of a model; however, the model itself, in other words the formula inside the algorithm, also has parameters. Adjusting the learning speed or adjusting how many errors are to be tolerated are good examples. The faster the learning speed, the less time the calculation takes, so faster seems better. However, a faster learning speed means the solutions found are only rough ones, so we should be careful not to lose the precision rates we expect. Adjusting the permissible range of errors is effective when noise is mixed into the data. The standard by which a machine judges whether data is "weird" is decided by humans.

Each method, of course, also has its own peculiar parameters. For neural networks, the number of neurons is a good example of such a parameter. And when we think of the kernel trick in SVM, the choice of kernel function is also one of the parameters to be determined. As you can see, there are many parameters that machine learning needs defined, and which values are best cannot be found out in advance. There is even a research field that focuses on how to define model parameters in advance.

Therefore, we need to test many combinations of parameters to examine which combination returns the best precision. Since testing each combination one by one takes a lot of time, the standard flow is to train multiple models with different parameter combinations concurrently and then compare them. A reasonable range for each parameter is usually known in advance, so it's not as though the problem can't be solved within a realistic time frame.
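
As an illustrative sketch of this concurrent search, here is how it might look with scikit-learn's GridSearchCV; the toy dataset and the parameter ranges are assumptions chosen for illustration, not prescribed values:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Toy features and labels standing in for engineered real data.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # Candidate parameter combinations to compare.
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

    # n_jobs=-1 evaluates the candidate models in parallel; each
    # combination is scored and the best one is kept.
    search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)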

When a model that achieves good precision on the training dataset is ready, the next step is the test. The rough flow of the test is to apply the same feature engineering and the same model parameters used on the training dataset, and then verify the precision. There isn't a particularly difficult step in the test, and the calculation doesn't take much time either. This is because the computational cost lies in finding a pattern in the data, in other words, in optimizing the parameters of a formula; once the parameters have been adjusted, the calculation finishes right away, as it only applies the formula to the new dataset. The reason for performing a test is, simply put, to examine whether the model is over-optimized for the training dataset. What does this mean? Well, in machine learning there are two typical patterns where the training set goes well but the test set doesn't.
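
To make the test step concrete, here is a minimal sketch that reuses the search object from the previous sketch; X_test and y_test are hypothetical placeholders for a held-out test set that is assumed to have gone through the same feature engineering as the training data:

    from sklearn.metrics import accuracy_score

    # No optimization happens here: the already-fitted model is
    # simply applied to the unseen test data.
    y_pred = search.best_estimator_.predict(X_test)
    print(accuracy_score(y_test, y_pred))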

The first case is incorrect optimization caused by classifying noisy data mixed into the training dataset. This relates to the adjustment of the permissible range of errors mentioned earlier in this chapter. Data in the real world is not usually clean; there is almost no data that can be properly classified into clean patterns. The prediction of stock prices is a good example again. Stock prices usually fluctuate moderately from previous prices, but sometimes they suddenly surge or drop sharply, and there is, or should be, no regularity in these irregular movements. Another case: if you would like to predict the yield of a crop for a country, the data for a year affected by abnormal weather will be largely different from the normal years' data. These examples are extreme and easy to understand, but most real-world data also contains noise, making it difficult to classify the data into proper patterns. If you just do the training without adjusting the parameters of the machine learning method, the model is forced to classify the noisy data into some pattern. In that case, data from the training dataset might be classified correctly, but because the noise in the training dataset has also been classified, and that noise doesn't exist in the test dataset, the precision in the test will be low.

The second case is incorrect optimization caused by classifying data that is characteristic only of the training dataset. For example, let's think about building an app for English voice input. To build your app, you would prepare pronunciation data for various words as a training dataset. Now, let's assume you prepared plenty of voice data from native British English speakers and were able to create a high-precision model that correctly classified the pronunciations in the training dataset. The next step is a test. Since it's a test, let's use voice data from native American English speakers to provide different data. What would the result be? You probably wouldn't get good precision. Furthermore, if you tried the app on the pronunciation of non-native English speakers, the precision would be much lower. As you know, English is pronounced differently in different regions. If you don't take this into consideration and optimize the model with a training dataset of British English only, then even though you may get a good result on the training set, you won't get a good result on the test set, and the model won't be useful in the actual application.

Both of these problems occur because the machine learning model learns from the training dataset and fits it too closely. This is literally called the overfitting problem, and you should be very careful to avoid it. The difficulty of machine learning is that you have to think about how to avoid overfitting in addition to doing feature engineering. The two problems are partially related, because poor feature engineering can itself lead to overfitting.

To avoid the problem of overfitting, there's not much you can do except increase the amount of data or the number of tests. Since the amount of data is generally limited, increasing the number of tests is the method most often used. The typical example is K-fold cross-validation. In K-fold cross-validation, all the data is first divided into K subsets. Then one of the subsets is picked as the test dataset and the remaining K-1 are used as the training dataset. Cross-validation repeats this verification K times, once for each of the K subsets, and the precision is measured by averaging the K results. The biggest worry is that both the training dataset and the test dataset may happen to show good precision by chance; the probability of this accident is decreased in K-fold cross-validation because it performs the test several times. You can never worry too much about overfitting, so it's necessary to verify results carefully.
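
As a minimal sketch of K-fold cross-validation with scikit-learn, again using a toy dataset as a stand-in for real data:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Toy features and labels standing in for real data.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # cv=5 splits the data into K=5 subsets; each subset serves once
    # as the test set while the other four are used for training.
    scores = cross_val_score(SVC(C=1, kernel="rbf"), X, y, cv=5)
    print(scores, scores.mean())  # the K results and their average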

You have now read through the flow of training and testing and the key points to keep in mind. These two steps mainly focus on data analysis, so if your purpose is to pull meaningful information out of the data you have and make good use of it, this flow is all you need. On the other hand, if you need an application that can cope with newly arriving data, you need an additional process that makes predictions with the model parameters obtained from training and testing. For example, if you would like to extract some information from a dataset of stock prices, analyze it, and write a market report, training and testing would be the last steps. But if you would like to predict future stock prices based on the data and use the model in an investment system, your purpose would be to build an application that uses the model obtained from training and testing to predict prices from the data you newly acquire every day, or at whatever interval you set. In the second case, if you would like to update the model with the newly added data, you need to be careful to complete the model-building calculation before the next batch of data arrives.
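
As a final sketch of that second case, one common approach is to persist the trained model and reload it inside the application; new_features is a hypothetical placeholder for freshly acquired data that is assumed to have gone through the same feature engineering as the training data:

    import joblib

    # Persist the model obtained from training and testing
    # (reusing the `search` object from the earlier sketch).
    joblib.dump(search.best_estimator_, "model.joblib")

    # Later, inside the application, load the model and predict
    # on the newly arriving data.
    model = joblib.load("model.joblib")
    prediction = model.predict(new_features)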
