How to do it...

In the Getting ready section, we defined our RL problem. We know that to solve a model-based RL problem, we need a transition probability matrix and a reward matrix:

  1. Let's start by defining the transition probability matrices, one for each action. Each row gives the probabilities of moving from the current state to every other state under that action, so each row sums to 1 (a quick check of this appears after the matrices):
# Up
up=matrix(c(0.9, 0.1, 0,   0,
            0.2, 0.7, 0.1, 0,
            0,   0,   0.1, 0.9,
            0,   0,   0,   1),
          nrow=4, ncol=4, byrow=TRUE)

# Down
down=matrix(c(0.1, 0,   0,   0.9,
              0,   0.8, 0.2, 0,
              0,   0.2, 0.8, 0,
              0,   0,   0.8, 0.2),
            nrow=4, ncol=4, byrow=TRUE)

# Left
left=matrix(c(1,   0,   0,   0,
              0.9, 0.1, 0,   0,
              0,   0.8, 0.2, 0,
              0,   0,   0,   1),
            nrow=4, ncol=4, byrow=TRUE)

# Right
right=matrix(c(0.1, 0.9, 0,   0,
               0.1, 0.2, 0.7, 0,
               0,   0,   0.9, 0.1,
               0,   0,   0,   1),
             nrow=4, ncol=4, byrow=TRUE)

Now, let's put all the actions into a single list:

actions = list(up=up, down=down, left=left, right=right)
actions

The following screenshot displays the transition probability matrix for each action:
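Since the solver expects each action's transition matrix to be row-stochastic, it can be worth verifying the row sums before moving on. This quick check is not part of the original recipe; it simply applies rowSums() to every matrix in the actions list and should return TRUE for each action:

# Verify that every row of every transition matrix sums to 1
sapply(actions, function(m) all(abs(rowSums(m) - 1) < 1e-8))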

  2. Now, let's define the rewards and penalties. The only reward, 100, is received in state 4 (the goal state); every other state incurs a penalty of -1 for each action taken:
# Rows correspond to states 1 to 4; columns to the four actions
rewards=matrix(c( -1,  -1,  -1,  -1,
                  -1,  -1,  -1,  -1,
                  -1,  -1,  -1,  -1,
                 100, 100, 100, 100),
               nrow=4, ncol=4, byrow=TRUE)

rewards

The following screenshot shows the reward matrix:
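As an optional extra step (not shown in the original recipe), the MDPtoolbox package provides a helper called mdp_check() that validates whether the transition and reward definitions form a well-posed MDP. A minimal sketch, assuming the actions list and rewards matrix defined above, is as follows:

library(MDPtoolbox)

# mdp_check() returns an empty string when the (P, R) description is valid,
# and an error message otherwise
mdp_check(P=actions, R=rewards)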

  3. Now, we can solve the problem using the mdp_policy_iteration() function from the MDPtoolbox package. This function takes the list of transition probability matrices and the reward matrix as inputs, along with a discount factor:
solved_MDP=mdp_policy_iteration(P=actions, R=rewards, discount = 0.2)
solved_MDP

The following screenshot shows the result of solving the problem:

Let's have a look at the policy given by the policy iteration algorithm:

solved_MDP$policy 
names(actions)[solved_MDP$policy]

The following screenshot shows the policy for our problem:

The policy shown in the previous screenshot gives the best action to take in each of states 1 through 4: the best action in state 1 is down, in state 2 it is left, and in states 3 and 4 it is up. We can inspect the value of each state under this policy using the following code:

solved_MDP$V

The following screenshot shows the value of each state under our policy:

Here, we can see that the value of the last state, state 4, is 125. This agrees with the discount factor of 0.2: staying in state 4 earns 100 per step, so its value is 100 + 0.2 × 100 + 0.2² × 100 + … = 100 / (1 − 0.2) = 125.
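As an optional cross-check that is not part of the original recipe, the same MDP can also be solved with value iteration, which MDPtoolbox implements as mdp_value_iteration(). Reusing the actions list and rewards matrix from the previous steps, the resulting policy and state values should agree with those obtained from policy iteration:

# Optional: solve the same MDP with value iteration and compare
solved_VI = mdp_value_iteration(P=actions, R=rewards, discount = 0.2)

solved_VI$policy                   # should match solved_MDP$policy
names(actions)[solved_VI$policy]
solved_VI$V                        # should be close to solved_MDP$V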
