Evaluating the model

The preceding output shows the transitions from one state to another. For a coverage of 0.1, the QLearning model produced 15,750 transitions across 126 states to reach goal state 37 with optimal rewards. The training set is quite small, so only a few buckets hold actual values; this shows that the size of the training set has a direct impact on the number of states. QLearning converges too quickly on a small training set such as the one used in this example.

For a larger training set, however, QLearning takes longer to converge, but it provides at least one value for each bucket created by the approximation. Even so, looking at those raw values alone makes it difficult to understand the relation between Q-values and states.
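To make the bucket idea concrete, here is a minimal, self-contained sketch (not the chapter's actual implementation; the value range and bucket count are illustrative only) showing how continuous observations could be quantized into discrete state indices. With only a handful of observations, most buckets remain empty:

// Hypothetical sketch: map a continuous observation to one of numBuckets discrete states.
object Quantization {
  def toBucket(value: Double, min: Double, max: Double, numBuckets: Int): Int = {
    val clipped = math.min(math.max(value, min), max)
    val idx = ((clipped - min) / (max - min) * numBuckets).toInt
    math.min(idx, numBuckets - 1) // keep value == max inside the last bucket
  }

  def main(args: Array[String]): Unit = {
    val observations = Seq(1.2, 1.3, 5.7, 9.9) // a tiny training set
    val states = observations.map(toBucket(_, 0.0, 10.0, 32))
    println(s"Occupied buckets: ${states.distinct.size} of 32") // most buckets stay empty
  }
}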

So what if we could see the Q-values per state? Why not! We can plot them on a scatter plot:

Figure 11: Q-value per state
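The data behind such a scatter plot can be produced with a minimal sketch like the following (assuming a plain Map-based Q-table rather than the library's own model type), which collects the best Q-value per state and also derives a simple coverage ratio:

// Hypothetical sketch: given a Q-table keyed by (state, action), collect the best
// Q-value per state for a scatter plot and compute a simple coverage ratio.
object QValueInspection {
  type State = Int
  type Action = Int

  def main(args: Array[String]): Unit = {
    // Toy Q-table; in the chapter, the values come from the trained QLearning model.
    val qTable: Map[(State, Action), Double] =
      Map((0, 1) -> 0.42, (0, 2) -> 0.17, (3, 1) -> 0.91, (37, 0) -> 1.35)

    val numStates = 126 // as reported in the preceding output
    val bestQPerState: Map[State, Double] =
      qTable.groupBy { case ((s, _), _) => s }
            .map { case (s, entries) => s -> entries.values.max }

    bestQPerState.toSeq.sortBy(_._1)
      .foreach { case (s, q) => println(f"state=$s%3d maxQ=$q%.4f") } // scatter-plot data

    val coverage = bestQPerState.size.toDouble / numStates
    println(f"Coverage: $coverage%.3f") // fraction of states with a learned value
  }
}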

Now let us display the profile of the log of the Q-value (QLData.value) as the recursive search (or training) progresses over different episodes or epochs. The test uses a learning rate α = 0.1 and a discount rate γ = 0.9 (see more in the deployment section):

Figure 12: Profile of the logarithmic Q-Value for different epochs during Q-learning training
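As a rough illustration of what happens during each epoch, the sketch below applies the standard Q-learning update Q(s, a) ← Q(s, a) + α(r + γ max Q(s', a') − Q(s, a)) with α = 0.1 and γ = 0.9, and logs the natural log of the updated Q-value as the search progresses. The state space, transitions, and rewards here are toy stand-ins, not the chapter's option-trading model:

import scala.util.Random

// Hypothetical sketch of the Q-learning update with alpha = 0.1 and gamma = 0.9,
// logging log(Q) as the recursive search progresses over several epochs.
object QLearningSketch {
  val alpha = 0.1  // learning rate
  val gamma = 0.9  // discount rate
  val numStates = 126
  val numActions = 4
  val goalState = 37

  def main(args: Array[String]): Unit = {
    val rnd = new Random(42)
    val q = Array.fill(numStates, numActions)(0.0)

    for (epoch <- 1 to 5) {
      var state = rnd.nextInt(numStates) // random initial state, as in the chapter
      var steps = 0
      while (state != goalState && steps < 1000) {
        val action = rnd.nextInt(numActions)   // a real implementation would be epsilon-greedy
        val nextState = rnd.nextInt(numStates) // toy transition; the real model uses market data
        val reward = if (nextState == goalState) 1.0 else 0.0
        val target = reward + gamma * q(nextState).max
        q(state)(action) += alpha * (target - q(state)(action))
        if (q(state)(action) > 0.0)
          println(f"epoch=$epoch step=$steps logQ=${math.log(q(state)(action))}%.4f")
        state = nextState
        steps += 1
      }
      println(s"epoch $epoch reached the goal state after $steps iterations")
    }
  }
}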

The preceding chart illustrates that the Q-value profile is independent of the order of the epochs during training. However, the number of iterations needed to reach the goal state depends on the initial state, which is selected randomly in this example. To gain more insight, inspect the output in your editor or access the API endpoint at http://localhost:9000/api/compute (see the following). Now, what if we display the distribution of values in the model and plot the estimated Q-value for the best policy on a scatter plot for the given configuration parameters?

Figure 13: Maximum reward with quantization 32 using QLearning
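As mentioned above, the same results can also be pulled from the running application. A minimal sketch of querying the endpoint from Scala, assuming the app is already listening on port 9000 and returns its payload as text or JSON:

import scala.io.Source

// Hypothetical sketch: fetch the computed results from the locally running app.
object ComputeClient {
  def main(args: Array[String]): Unit = {
    val url = "http://localhost:9000/api/compute"
    val source = Source.fromURL(url) // simple blocking GET; fine for a quick inspection
    try println(source.mkString)     // print the raw response from the endpoint
    finally source.close()
  }
}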

The final evaluation consists of evaluating the impact of the learning rate and discount rate on the coverage of the training:

Figure 14: Impact of the learning rate and discount rate on the coverage of the training
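A sweep like the one behind Figure 14 can be organized as a simple grid search over the two rates. In the sketch below, trainAndMeasureCoverage is a hypothetical placeholder for a full training run that returns the fraction of states covered; its body is purely illustrative:

// Hypothetical sketch of the parameter sweep behind Figure 14.
// trainAndMeasureCoverage stands in for a full QLearning training run.
object CoverageSweep {
  def trainAndMeasureCoverage(alpha: Double, gamma: Double): Double = {
    // Placeholder: the real evaluation would train the model with the given
    // learning and discount rates and return states-covered / total-states.
    math.max(0.0, 1.0 - alpha) * gamma
  }

  def main(args: Array[String]): Unit = {
    val learningRates = Seq(0.05, 0.1, 0.2, 0.4, 0.6)
    val discountRates = Seq(0.5, 0.7, 0.9, 0.99)
    for {
      alpha <- learningRates
      gamma <- discountRates
    } {
      val coverage = trainAndMeasureCoverage(alpha, gamma)
      println(f"alpha=$alpha%.2f gamma=$gamma%.2f coverage=$coverage%.3f")
    }
  }
}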

The coverage decreases as the learning rate increases. This result confirms the general rule of keeping the learning rate below 0.2. A similar test evaluating the impact of the discount rate on the coverage was inconclusive. There could be thousands of such configuration parameters with different choices and combinations. So, what if we could wrap the whole application as a Scala web app, similar to what we did in Chapter 3, High-Frequency Bitcoin Price Prediction from Historical Data? That would not be a bad idea, so let us dive into it.
