How to do it...

In the Getting ready section, we defined our RL problem. We know that to solve a model-based RL problem, we need a transition probability matrix and a reward matrix:

  1. Let's start by defining the transition probability matrices, one for each action. Each row gives the probabilities of moving from the current state to every other state under that action, so each row sums to 1 (a quick check of this appears after the matrices):
# Up
up=matrix(c(0.9, 0.1, 0,   0,
            0.2, 0.7, 0.1, 0,
            0,   0,   0.1, 0.9,
            0,   0,   0,   1),
          nrow=4, ncol=4, byrow=TRUE)

# Down
down=matrix(c(0.1, 0,   0,   0.9,
              0,   0.8, 0.2, 0,
              0,   0.2, 0.8, 0,
              0,   0,   0.8, 0.2),
            nrow=4, ncol=4, byrow=TRUE)

# Left
left=matrix(c(1,   0,   0,   0,
              0.9, 0.1, 0,   0,
              0,   0.8, 0.2, 0,
              0,   0,   0,   1),
            nrow=4, ncol=4, byrow=TRUE)

# Right
right=matrix(c(0.1, 0.9, 0,   0,
               0.1, 0.2, 0.7, 0,
               0,   0,   0.9, 0.1,
               0,   0,   0,   1),
             nrow=4, ncol=4, byrow=TRUE)

Now, let's put all the actions into a single list:

actions = list(up=up, down=down, left=left, right=right)
actions

The following screenshot displays the transition probability matrix for each action:
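Since the solver expects each action's transition matrix to be row-stochastic, it can be worth verifying the row sums before moving on. This quick check is not part of the original recipe; it simply applies rowSums() to every matrix in the actions list and should return TRUE for each action:

# Verify that every row of every transition matrix sums to 1
sapply(actions, function(m) all(abs(rowSums(m) - 1) < 1e-8))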

  2. Now, let's define the rewards and penalties. The only reward, 100, is received in state 4 (the goal state); every other state incurs a penalty of -1 for each action taken:
# Rows correspond to states 1 to 4; columns to the four actions
rewards=matrix(c( -1,  -1,  -1,  -1,
                  -1,  -1,  -1,  -1,
                  -1,  -1,  -1,  -1,
                 100, 100, 100, 100),
               nrow=4, ncol=4, byrow=TRUE)

rewards

The following screenshot shows the reward matrix:
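As an optional extra step (not shown in the original recipe), the MDPtoolbox package provides a helper called mdp_check() that validates whether the transition and reward definitions form a well-posed MDP. A minimal sketch, assuming the actions list and rewards matrix defined above, is as follows:

library(MDPtoolbox)

# mdp_check() returns an empty string when the (P, R) description is valid,
# and an error message otherwise
mdp_check(P=actions, R=rewards)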

  3. Now, we can solve the problem using the mdp_policy_iteration() function from the MDPtoolbox package. This function takes the list of transition probability matrices and the reward matrix as inputs, along with a discount factor:
solved_MDP=mdp_policy_iteration(P=actions, R=rewards, discount = 0.2)
solved_MDP

The following screenshot shows the result of solving the problem:

Let's have a look at the policy given by the policy iteration algorithm:

solved_MDP$policy 
names(actions)[solved_MDP$policy]

The following screenshot shows the policy for our problem:

The policy shown in the previous screenshot gives the best action to take in each of states 1 through 4: the best action in state 1 is down, in state 2 it is left, and in states 3 and 4 it is up. We can inspect the value of each state under this policy using the following code:

solved_MDP$V

The following screenshot shows the value of each state under our policy:

Here, we can see that the value of the last state, state 4, is 125. This agrees with the discount factor of 0.2: staying in state 4 earns 100 per step, so its value is 100 + 0.2 × 100 + 0.2² × 100 + … = 100 / (1 − 0.2) = 125.
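As an optional cross-check that is not part of the original recipe, the same MDP can also be solved with value iteration, which MDPtoolbox implements as mdp_value_iteration(). Reusing the actions list and rewards matrix from the previous steps, the resulting policy and state values should agree with those obtained from policy iteration:

# Optional: solve the same MDP with value iteration and compare
solved_VI = mdp_value_iteration(P=actions, R=rewards, discount = 0.2)

solved_VI$policy                   # should match solved_MDP$policy
names(actions)[solved_VI$policy]
solved_VI$V                        # should be close to solved_MDP$V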
