How it works...

In step 1, we defined the set of possible states and actions for this problem. To work with model-free RL, we need a function that mimics the behavior of the environment. In step 2, we formulated the problem by creating a function called gridExampleEnvironment(), which takes a state-action pair as input and returns a list containing the next state and the associated reward. In step 3, we used the sampleExperience() function to generate state-action transition tuples by querying the environment we created in the preceding step. Its input arguments are the number of samples, the environment function, and the sets of states and actions; it returns a dataframe containing the observation sequences experienced from the environment.
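The following R sketch illustrates steps 1 to 3. The body of gridExampleEnvironment() shown here is an illustrative assumption (a small gridworld with a rewarded goal state), since the recipe's exact transition rules are not reproduced in this section; sampleExperience() is the function from the ReinforcementLearning package:

    library(ReinforcementLearning)

    # Step 1: the possible sets of states and actions
    states  <- c("s1", "s2", "s3", "s4")
    actions <- c("up", "down", "left", "right")

    # Step 2: an environment function that takes a state-action pair and
    # returns a list with the next state and the associated reward.
    # NOTE: these transition rules are assumed for illustration only.
    gridExampleEnvironment <- function(state, action) {
      next_state <- state
      if (state == "s1" && action == "down")  next_state <- "s2"
      if (state == "s2" && action == "up")    next_state <- "s1"
      if (state == "s2" && action == "right") next_state <- "s3"
      if (state == "s3" && action == "left")  next_state <- "s2"
      if (state == "s3" && action == "up")    next_state <- "s4"
      # reaching the goal state s4 pays off; every other move costs -1
      reward <- if (next_state == "s4" && state != "s4") 10 else -1
      list(NextState = next_state, Reward = reward)
    }

    # Step 3: query the environment to generate transition tuples
    data <- sampleExperience(N = 1000,
                             env = gridExampleEnvironment,
                             states = states,
                             actions = actions)
    head(data)  # a dataframe with columns State, Action, Reward, NextState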

Once the observation sequence data has been generated, the agent learns an optimal policy from it. To achieve this, in step 4, we used the ReinforcementLearning() function. We can pass a few more arguments to this function to customize the agent's learning behavior; a usage sketch follows the argument list below.

The arguments are as follows:

  • alpha: This is the learning rate, α, which varies between 0 and 1. The higher its value, the faster the agent learns, because new information is weighted more heavily when the Q-values are updated.
  • gamma: This is the discount factor, γ, which can be set to any value between 0 and 1. It determines the importance of future rewards: lower values make the agent short-sighted by considering only immediate rewards, whereas higher values make it strive for greater long-term rewards.
  • epsilon: This is the exploration parameter, ε, which can be set between 0 and 1. In ε-greedy action selection, it is the probability with which the agent picks a random action instead of the currently best-known one.
  • iter: This is the number of times the agent repeatedly passes through the training dataset. It is set to 1 by default.
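Here is a minimal sketch of the step 4 call; the parameter values are illustrative, not necessarily those used in the recipe. In the ReinforcementLearning package, alpha, gamma, and epsilon are passed through the control list, while iter is a direct argument:

    # Step 4: learn the Q-values and policy from the sampled data
    model <- ReinforcementLearning(data,
                                   s = "State", a = "Action",
                                   r = "Reward", s_new = "NextState",
                                   iter = 1,
                                   control = list(alpha = 0.1,
                                                  gamma = 0.5,
                                                  epsilon = 0.1))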

We saw that the result of the learning process contains the state-action table, that is, the Q-value of each state-action pair, as well as an optimal policy that gives the best possible action in each state. We also obtained the overall reward collected under that policy.
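For example, the fitted model can be inspected as follows (the attribute names below follow the package's model object as I recall it; treat them as an assumption if your version differs):

    print(model)   # state-action table, optimal policy, and overall reward
    model$Q        # Q-value of each state-action pair
    model$Policy   # best possible action in each state
    model$Reward   # overall reward collected under the policy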
