Getting ready

In this section, we will implement the Q-learning algorithm in R. The agent simultaneously explores the surrounding environment and exploits its existing knowledge; the method is termed off-policy because the value update uses the best action available in the next state, regardless of the action the agent actually takes while exploring. For example, an agent in a particular state first explores the possible actions for transitioning into next states and observes the corresponding rewards, and then exploits its current knowledge to update the existing state-action value using the action that generates the maximum possible reward.
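As a rough illustration of this trade-off, the following sketch shows one common way to balance exploration and exploitation in R, an epsilon-greedy rule applied to a hypothetical row of Q-values (the variable names, the example values, and the 0.1 exploration rate are assumptions for demonstration, not the recipe's actual setup):

```r
# Epsilon-greedy choice over one hypothetical state's Q-values
q_row   <- c(0.2, 0.5, 0.1)   # illustrative Q-values for three actions
epsilon <- 0.1                # probability of exploring a random action

action <- if (runif(1) < epsilon) {
  sample(length(q_row), 1)    # explore: pick any action uniformly at random
} else {
  which.max(q_row)            # exploit: pick the currently best-valued action
}
action
```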

Q-learning returns a 2D Q-table of size (number of states) x (number of actions). The values in the Q-table are updated based on the following formula, where Q(s, a) denotes the value of state s and action a, r' denotes the reward of the next state for a selected action a, γ denotes the discount factor, and α denotes the learning rate:
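With these symbols, and writing s' for the next state and a' for the actions available in it, the standard Q-learning update is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r' + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$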

The framework for Q-learning is shown in the following figure:

Framework of Q-learning
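To make this framework concrete before the implementation, here is a minimal, self-contained sketch that extends the selection rule above into a full tabular Q-learning loop on a small hypothetical 4-state chain environment. The environment (states, actions, the simulateEnvironment reward logic) and all parameter values are assumptions chosen for illustration, not the recipe's actual setup:

```r
# Minimal tabular Q-learning sketch on a hypothetical 4-state, 2-action chain
set.seed(123)

n_states  <- 4            # states 1..4; state 4 is treated as the goal
n_actions <- 2            # action 1 = move left, action 2 = move right
alpha     <- 0.1          # learning rate
gamma     <- 0.9          # discount factor
epsilon   <- 0.1          # exploration probability
episodes  <- 500

# Hypothetical transition/reward model: moving right from state 3 reaches
# the goal (state 4) with reward 1; every other move yields reward 0.
simulateEnvironment <- function(state, action) {
  next_state <- if (action == 2) min(state + 1, n_states) else max(state - 1, 1)
  reward <- if (next_state == n_states) 1 else 0
  list(next_state = next_state, reward = reward)
}

# Q-table of size (number of states) x (number of actions), initialized to 0
Q <- matrix(0, nrow = n_states, ncol = n_actions)

for (episode in seq_len(episodes)) {
  state <- 1
  for (t in seq_len(100)) {          # cap steps so the sketch always terminates
    # Epsilon-greedy action selection (ties broken at random)
    greedy_actions <- which(Q[state, ] == max(Q[state, ]))
    action <- if (runif(1) < epsilon) {
      sample(n_actions, 1)
    } else {
      greedy_actions[sample(length(greedy_actions), 1)]
    }

    step <- simulateEnvironment(state, action)

    # Q-learning update: move Q(s, a) toward r' + gamma * max_a' Q(s', a')
    Q[state, action] <- Q[state, action] +
      alpha * (step$reward + gamma * max(Q[step$next_state, ]) - Q[state, action])

    state <- step$next_state
    if (state == n_states) break     # episode ends at the goal state
  }
}

round(Q, 3)   # learned state-action values
```

In this sketch, values propagate backward from the rewarding transition into state 4, so the right-moving actions end up with the highest Q-values in every state.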