Model-free RL

In the previous recipe, Model-based RL using MDPtoolbox, we followed a model-based approach to solve an RL problem. Model-based approaches become impractical as the state and action spaces grow. Model-free reinforcement learning algorithms, on the other hand, rely on trial-and-error interaction between the agent and the environment representing the problem at hand. In this recipe, we will implement a model-free approach to RL using the ReinforcementLearning package in R. This package utilizes a popular model-free algorithm known as Q-learning. Q-learning is an off-policy algorithm: it learns the value of the greedy target policy while the agent follows a separate, exploratory behavior policy to gather experience.
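As a preview of how the package is typically used, the following is a minimal sketch based on the package's documented grid-world workflow. The environment, state and action names, and all control parameter values here are illustrative assumptions for this sketch, not the final solution of this recipe:

# Minimal sketch of the ReinforcementLearning package workflow.
# The grid-world environment and the parameter values are illustrative assumptions.
library(ReinforcementLearning)

# Built-in grid-world environment function shipped with the package
env <- gridworldEnvironment
states <- c("s1", "s2", "s3", "s4")
actions <- c("up", "down", "left", "right")

# Sample experience tuples (State, Action, Reward, NextState) by random interaction
data <- sampleExperience(N = 1000, env = env, states = states, actions = actions)

# Learn the Q-function from the sampled experience via Q-learning
model <- ReinforcementLearning(data,
                               s = "State", a = "Action",
                               r = "Reward", s_new = "NextState",
                               control = list(alpha = 0.1, gamma = 0.5, epsilon = 0.1))

# Inspect the learned state-action values and the derived policy
print(model)
policy(model)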

Q-learning is guaranteed to converge to an optimal policy, provided every state-action pair continues to be visited, but to achieve this it relies on a large number of interactions between the agent and its environment, which makes it computationally heavy. The algorithm looks ahead to the next state and observes the maximum attainable value over all possible actions in that state. It then uses this knowledge to update the action-value of the action taken in the current state, with a specific learning rate, α. In this way, the algorithm learns an optimal evaluation function known as the Q-function, which maps each state-action pair to a value. It is denoted as Q: S × A → V, where V is the value of the expected future rewards for an action, a, executed in the state, s.

The following is the pseudocode for the Q-learning algorithm:

  1. Initialize the value Q(s, a) as 0 for all state-action pairs (s, a) in the table.
  2. Observe the current state, s.
  3. Repeat until convergence:
  • Choose an action, a, and apply it.
  • Get the immediate reward, r.
  • Move to the new state, s'.
  • Update the table entry Q(s, a) according to the following formula:

    Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') − Q(s, a)]

  • Move to the next state, s'. Now, s' becomes the current state, s.
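To make these steps concrete, here is a minimal base-R sketch of tabular Q-learning on a toy problem. The two-state environment, the step() helper, and all parameter values are illustrative assumptions made for this sketch; they are not part of the recipe's data or the ReinforcementLearning package API:

# Minimal tabular Q-learning sketch in base R on an assumed toy environment.
set.seed(42)

states  <- c("s1", "s2")
actions <- c("left", "right")

# Hypothetical deterministic environment:
# "right" moves s1 -> s2 (reward 0) and keeps s2 -> s2 (reward 1); "left" returns to s1 (reward 0).
step <- function(state, action) {
  if (action == "right") {
    if (state == "s1") list(next_state = "s2", reward = 0) else list(next_state = "s2", reward = 1)
  } else {
    list(next_state = "s1", reward = 0)
  }
}

# Q-table initialized to 0 for all state-action pairs
Q <- matrix(0, nrow = length(states), ncol = length(actions),
            dimnames = list(states, actions))

alpha   <- 0.1   # learning rate
gamma   <- 0.9   # discount factor
epsilon <- 0.1   # exploration rate of the epsilon-greedy behavior policy

state <- "s1"
for (i in 1:5000) {
  # Epsilon-greedy action selection: explore with probability epsilon, otherwise exploit
  if (runif(1) < epsilon) {
    action <- sample(actions, 1)
  } else {
    action <- actions[which.max(Q[state, ])]
  }

  out <- step(state, action)

  # Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
  Q[state, action] <- Q[state, action] +
    alpha * (out$reward + gamma * max(Q[out$next_state, ]) - Q[state, action])

  # The new state becomes the current state
  state <- out$next_state
}

print(round(Q, 3))

Note how the update uses the maximum over Q(s', ·), that is, the value of the greedy target policy, even though the agent behaves epsilon-greedily; this decoupling of behavior and target policies is what makes Q-learning off-policy.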
