QLearning model creation and training

The QLearning class encapsulates the Q-learning algorithm, more specifically the action-value updating equation. It is a generic, parameterized data transformation of type ETransform (we will see this later on) with an explicit configuration of type QLConfig. The Q-learning model is initialized and trained during the instantiation of the class so that it is in the correct state for runtime prediction.

Therefore, the class instances have only two states: successfully trained and failed training (we'll see this later).

The implementation does not assume that every episode (or training cycle) will be successful. At the completion of training, the ratio of episodes that reached the goal state over the total number of episodes is computed. The client code is responsible for evaluating the quality of the model by testing this ratio (see the model evaluation section).
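For instance, assuming an instance qLearning of the class shown below, and assuming that the trained QLModel exposes this ratio as a coverage field (an assumption made for this sketch only), such a check could look as follows:

// Hypothetical check: accept the model only if the training coverage
// exceeds a domain-specific threshold (0.35 is purely illustrative)
val isQualityModel: Boolean = qLearning.getModel.exists(_.coverage > 0.35)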

The constructor takes the configuration of the algorithm (that is, config), the search space (that is, qlSpace), and the policy (that is, qlPolicy) parameters and creates a Q-learning algorithm:

final class QLearning[T](conf: QLConfig, qlSpace: QLSpace[T], qlPolicy: QLPolicy)
    extends ETransform[QLState[T], QLState[T]](conf) with Monitor[Double]

The model is automatically created (that is, trained) during the instantiation of the class, provided the minimum coverage is reached.
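As a quick usage sketch (assuming a configuration config, a search space qlSpace, and a policy qlPolicy have already been built from the input data; the Int type parameter is illustrative), instantiation, and hence training, looks like this:

// Instantiating the class triggers training; the resulting model is held
// internally as an Option[QLModel]
val qLearning = new QLearning[Int](config, qlSpace, qlPolicy)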

The following train() method is applied to each episode with randomly generated initial states. It then computes the coverage as the ratio of episodes for which the goal state was reached, and compares it with the minCoverage configuration value supplied by the conf object:

private def train: Option[QLModel] = Try {
  // Run one episode per epoch and count how many reached the goal state
  val completions = Range(0, conf.numEpisodes).map(epoch =>
    if (heavyLiftingTrain(-1)) 1 else 0
  ).sum

  // Coverage: ratio of successful episodes over the total number of episodes
  completions.toDouble / conf.numEpisodes
}.filter(_ > conf.minCoverage)
  .map(new QLModel(qlPolicy, _))
  .toOption

In the preceding code block, the heavyLiftingTrain(state0: Int) method does the heavy lifting for each episode (or epoch). It triggers the search from either the initial state, state0, or from a state drawn by a random generator with a new seed if state0 is negative.

At first, it gets all the states adjacent to the current state, and then it selects the most rewarding of them. If that most rewarding state is a goal state, we are done. Otherwise, it recomputes the policy value for the state transition using the reward matrix (that is, QLPolicy.R).
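The recomputation follows the standard Q-learning update rule, which the computation of nq in the method shown below mirrors:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

Here, $\alpha$ is the learning rate (alpha), $\gamma$ is the discount rate (gamma), $r_{t+1}$ is the reward for the transition, and $\max_{a} Q(s_{t+1}, a)$ corresponds to qlSpace.maxQ in the code.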

For the recomputation, it applies this updating formula to the Q-value of the policy; it then invokes the search method with the new state and an incremented iteration counter. Let's see the body of this method:

private def heavyLiftingTrain(state0: Int): Boolean = {
  @scala.annotation.tailrec
  def search(iSt: QLIndexedState[T]): QLIndexedState[T] = {
    val states = qlSpace.nextStates(iSt.state)
    if (states.isEmpty || iSt.iter >= conf.episodeLength)
      QLIndexedState(iSt.state, -1)
    else {
      val state = states.maxBy(s => qlPolicy.EQ(iSt.state.id, s.id))
      if (qlSpace.isGoal(state))
        QLIndexedState(state, iSt.iter)
      else {
        val fromId = iSt.state.id
        val r = qlPolicy.R(fromId, state.id)  // reward for the transition
        val q = qlPolicy.Q(fromId, state.id)  // current Q-value
        // Q-learning updating formula
        val nq = q + conf.alpha * (r + conf.gamma * qlSpace.maxQ(state, qlPolicy) - q)
        count(QVALUE_COUNTER, nq)
        qlPolicy.setQ(fromId, state.id, nq)
        search(QLIndexedState(state, iSt.iter + 1))
      }
    }
  }

  val finalState = search(QLIndexedState(qlSpace.init(state0), 0))
  if (finalState.iter == -1)
    false
  else
    qlSpace.isGoal(finalState.state)
}

Given the policy and the training coverage, the trained model is obtained as follows:

private[this] val model: Option[QLModel] = train

Note that the preceding model is trained using the input data held by the QLPolicy class, which can be retrieved through the inline getInput() method:

def getInput: Seq[QLInput] = qlPolicy.input
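Continuing the earlier sketch, the training input can be inspected as follows:

// Retrieve the input (rewards/transitions) used to build the policy
val trainingInput: Seq[QLInput] = qLearning.getInput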

Now we come to one of the most important steps for our options trading application: retrieving the Q-learning model as an option:

@inline
final def getModel: Option[QLModel] = model

The overall application fails if the model is not defined (see the validateConstraints() method for validation):

@inline
final def isModel: Boolean = model.isDefined

override def toString: String = qlPolicy.toString + qlSpace.toString
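Continuing the sketch above, client code would typically check whether training succeeded before attempting any prediction:

// Pattern matching on the optional model; None indicates that training
// failed to reach the minimum coverage
qLearning.getModel match {
  case Some(model) => println(s"Q-learning model trained: $model")
  case None        => println("Q-learning training failed")
}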

Then, a recursive computation of the next most rewarding state is performed using Scala tail recursion. The idea is to search among all reachable states and recursively select the one with the highest reward under the best policy:

@scala.annotation.tailrec
private def nextState(iSt: QLIndexedState[T]): QLIndexedState[T] = {
  val states = qlSpace.nextStates(iSt.state)
  if (states.isEmpty || iSt.iter >= conf.episodeLength)
    iSt
  else {
    val fromId = iSt.state.id
    val qState = states.maxBy(s => model.map(_.bestPolicy.EQ(fromId, s.id)).getOrElse(-1.0))
    nextState(QLIndexedState[T](qState, iSt.iter + 1))
  }
}

In the preceding code block, the nextState() method retrieves the states that are eligible for a transition from the current state. It then extracts the state, qState, with the most rewarding policy and increments the iteration counter. Finally, it returns the current state if there are no more eligible states or if the method does not converge within the maximum number of iterations allowed by the conf.episodeLength parameter.

Tail recursion: In Scala, tail recursion is a very effective construct for applying an operation to every item of a collection. It optimizes the management of the function's stack frame during recursion. The @tailrec annotation triggers a validation of the conditions necessary for the compiler to optimize the function calls.
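As a minimal illustration (unrelated to the Q-learning code itself), the following tail-recursive function is rewritten as a loop by the compiler because the recursive call is the last expression:

import scala.annotation.tailrec

// Sums a list of integers using an accumulator; @tailrec makes the
// compiler fail if the recursive call is not in tail position
@tailrec
def sum(xs: List[Int], acc: Int = 0): Int = xs match {
  case Nil          => acc
  case head :: tail => sum(tail, acc + head)
}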

Finally, the configuration of the Q-learning algorithm, QLConfig, specifies:

  • The learning rate, alpha
  • The discount rate, gamma
  • The maximum number of states (or length) of an episode, episodeLength
  • The number of episodes (or epochs) used in training, numEpisodes
  • The minimum coverage required to select the best policy, minCoverage

These are shown as follows:

case class QLConfig(alpha: Double, gamma: Double, episodeLength: Int, numEpisodes: Int, minCoverage: Double)
  extends Config {
  import QLConfig._
  check(alpha, gamma, episodeLength, numEpisodes, minCoverage)
}

Now we are almost done, except that the validation has not yet been covered. Let us first see the companion object for the configuration of the Q-learning algorithm. This singleton validates the parameters of the QLConfig class:

private[scalaml] object QLConfig {
  private val NO_MIN_COVERAGE = 0.0
  private val MAX_EPISODES = 1000

  private def check(alpha: Double, gamma: Double,
      episodeLength: Int, numEpisodes: Int,
      minCoverage: Double): Unit = {
    require(alpha > 0.0 && alpha < 1.0,
      s"QLConfig found alpha: $alpha, required > 0.0 and < 1.0")
    require(gamma > 0.0 && gamma < 1.0,
      s"QLConfig found gamma: $gamma, required > 0.0 and < 1.0")
    require(numEpisodes > 2 && numEpisodes < MAX_EPISODES,
      s"QLConfig found numEpisodes: $numEpisodes, required > 2 and < $MAX_EPISODES")
    require(minCoverage >= 0.0 && minCoverage <= 1.0,
      s"QLConfig found minCoverage: $minCoverage, required >= 0.0 and <= 1.0")
  }
}
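For reference, a configuration could be instantiated as follows (the values are purely illustrative and must be tuned for the problem at hand):

val config = QLConfig(
  alpha = 0.5,          // learning rate
  gamma = 0.8,          // discount rate
  episodeLength = 100,  // maximum number of states per episode
  numEpisodes = 200,    // number of training epochs
  minCoverage = 0.1     // minimum ratio of successful episodes
)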

Excellent! We have seen how to implement the Q-learning algorithm in Scala. However, as mentioned, the implementation is based on openly available sources, and the training may not always converge. One important consideration for such an online model is validation. A commercial application (or even a fancy Scala web app, which we will be covering in the next section) may require multiple types of validation mechanism for the state transition, reward, probability, and Q-value matrices.
