How it works...

In step 1, we defined a transition probability matrix for each action. Each entry can be interpreted as the probability of moving from the current state to the next state when that action is taken. For example, if the agent is in state 2 and tries to go LEFT, there is a 90% probability that it will transition to state 1.

The following image represents the transition probability matrix for the LEFT action:
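As a rough sketch of what the image shows, a LEFT matrix of this shape could be defined in R as follows. The values here are hypothetical and assume a four-state world; each row gives the probabilities of landing in states 1 to 4 from that state, so every row sums to 1:

left <- matrix(c(0.9, 0.1, 0.0, 0.0,   # from state 1: LEFT mostly keeps the agent in state 1
                 0.9, 0.1, 0.0, 0.0,   # from state 2: 90% chance of moving to state 1
                 0.0, 0.9, 0.1, 0.0,   # from state 3: 90% chance of moving to state 2
                 0.0, 0.0, 0.9, 0.1),  # from state 4: 90% chance of moving to state 3
               nrow = 4, ncol = 4, byrow = TRUE)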

In step 2, we defined a reward matrix; that is, the scalar reward given to the agent for transitioning from the current state to the next one.

The following image represents the reward matrix:
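Purely as an illustration (these are not the recipe's exact values), a reward matrix can be laid out with one row per state and one column per action, where R[s, a] is the reward for taking action a in state s; MDPtoolbox also accepts an S x S x A array if the reward depends on the landing state. Assuming the same hypothetical four-state world with actions UP, DOWN, LEFT, and RIGHT, and treating state 4 as the goal:

# Hypothetical rewards: -1 per step, 10 for any action taken in the goal state 4
R <- matrix(c(-1, -1, -1, -1,
              -1, -1, -1, -1,
              -1, -1, -1, -1,
              10, 10, 10, 10),
            nrow = 4, ncol = 4, byrow = TRUE)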

In the last step, we solved the RL problem by applying the policy iteration algorithm to the discounted MDP. The mdp_policy_iteration() function returns V, the optimal value function; policy, the optimal policy; iter, the number of iterations performed; and time, the CPU time taken by the program. Policy iteration stops when two successive policies are identical; we can also cap the number of iterations by passing a value to the max_iter argument.
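Putting the pieces together, a minimal self-contained call looks like the following sketch. To keep the snippet short, P and R describe a small hypothetical two-state, two-action MDP rather than the recipe's grid world:

library(MDPtoolbox)

# P[, , a] is the transition probability matrix for action a;
# R[s, a] is the reward for taking action a in state s.
P <- array(0, c(2, 2, 2))
P[, , 1] <- matrix(c(0.5, 0.5,
                     0.8, 0.2), nrow = 2, byrow = TRUE)
P[, , 2] <- matrix(c(0.0, 1.0,
                     0.1, 0.9), nrow = 2, byrow = TRUE)
R <- matrix(c( 5, 10,
              -1,  2), nrow = 2, byrow = TRUE)

# Solve the discounted MDP; max_iter caps the number of policy updates.
solution <- mdp_policy_iteration(P = P, R = R, discount = 0.9, max_iter = 1000)

solution$V       # optimal value function
solution$policy  # optimal action index for each state
solution$iter    # iterations until two successive policies matched
solution$time    # CPU time used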
