Markov decision processes and Bellman equations

A Markov decision process (MDP) formally describes an environment for reinforcement learning, in which:

  • The environment is fully observable
  • The current state completely characterizes the process (which means the future state depends entirely on the current state rather than on historic states or values)
  • Almost all RL problems can be formalized as MDPs (for example, optimal control primarily deals with continuous MDPs)

Central idea of MDP: An MDP works on the simple Markovian property of a state; that is, St+1 depends entirely on the latest state St rather than on any historic dependencies. In the following equation, the current state captures all the relevant information from the history, which means the current state is a sufficient statistic of the future:
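The equation referred to here is the standard statement of the Markov property:

P[St+1 | St] = P[St+1 | S1, S2, ..., St]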

An intuitive sense of this property can be gained from the autonomous helicopter example: the helicopter's next move, whether to the right, to the left, a pitch, a roll, and so on, depends entirely on its current position, rather than on where it was five minutes before.

Modeling of MDP: RL problems model the world using the MDP formulation as a five-tuple (S, A, {Psa}, γ, R):

  • S - Set of states (for example, the set of possible orientations of the helicopter)
  • A - Set of actions (for example, the set of all possible directions in which the control stick can be pushed)
  • Psa - State transition distributions (or state transition probability distributions): for each state s and action a, Psa gives the probability of moving to each possible next state, as required by the Markov process
  • γ - Discount factor, typically with 0 ≤ γ < 1
  • R - Reward function, R : S → ℝ, mapping states to real numbers (either positive or negative)

Returns are calculated by discounting the future rewards until the terminal state is reached.

Bellman Equations for MDP: Bellman equations are used for the mathematical formulation of an MDP, which is then solved to obtain the optimal policies for the environment. Bellman equations are also known as dynamic programming equations and are a necessary condition for optimality in the mathematical optimization method known as dynamic programming. Bellman equations are linear equations which can be solved for the entire environment. However, the time complexity of solving these equations is O(n³), which becomes computationally very expensive when the number of states in an environment is large; sometimes it is not even feasible to explore all the states because the environment itself is very large. In those scenarios, we need to look at other ways of solving the problem.

In Bellman equations, the value function can be decomposed into two parts:

  • The immediate reward Rt+1 received on entering the successor state
  • The discounted value of the successor state, γ·v(St+1), covering everything you will get from that timestep onwards:
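Putting the two parts together gives the standard Bellman expectation form:

v(St) = E[ Rt+1 + γ·v(St+1) | St ]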

Grid world example of MDP: Robot navigation tasks live in the following type of grid world. An obstacle occupies cell (2,2), through which the robot cannot navigate. We would like the robot to move to the upper-right cell (4,3); when it reaches that position, it receives a reward of +1. The robot should avoid cell (4,2), because if it moves into that cell, it receives a -1 reward.

The robot can be in any of the following states and can take the following actions:

  • S - 11 states (every cell except cell (2,2), which contains the obstacle)
  • A = {N-north, S-south, E-east, W-west}

In the real world, robot movements are noisy, and a robot may not be able to move exactly where it has been asked to: some of its wheels may slip, its parts may be loosely connected, its actuators may be imprecise, and so on. When asked to move 1 meter, it may not move exactly 1 meter; instead, it may move 90-105 centimeters, and so on.

In a simplified grid world, the stochastic dynamics of the robot can be modeled as follows. If we command the robot to go north, there is a 10% chance that it drifts towards the left and a 10% chance that it drifts towards the right; only 80% of the time does it actually go north. If the robot bounces off a wall (or the obstacle), it simply stays in the same position and nothing else happens.

Every state in this grid world example is represented by (x, y) coordinates. Say the robot is at state (3,1) and we ask it to move north; then the state transition probabilities are as follows. The probability of staying in the same position is 0 for the robot.
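Concretely, applying the 80/10/10 dynamics above, the transition probabilities from state (3,1) under the action north are:

P(3,1),N((3,2)) = 0.8   (intended move north)
P(3,1),N((2,1)) = 0.1   (drift west)
P(3,1),N((4,1)) = 0.1   (drift east)
P(3,1),N((3,1)) = 0     (no chance of staying put)

with probability 0 for every other state.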

As we know, the state transition probabilities for any state-action pair sum up to 1: Σs' Psa(s') = 1.

Reward function:

All the other states carry a small negative reward, which effectively charges the robot for battery or fuel consumption as it moves around the grid. This discourages solutions that waste moves or time on the way to the +1 goal, and encourages the robot to reach the goal as quickly as possible while using as little fuel as possible.
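As a concrete sketch of such a reward function (the exact small negative value is not specified here; -0.02 is a commonly used choice and is only an assumption):

R((4,3)) = +1
R((4,2)) = -1
R(s) = -0.02 for all other states s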

The world ends when the robot reaches either the +1 or the -1 state; no more rewards are possible after reaching either of them. These are called absorbing states: zero-cost states in which the robot stays forever.

MDP working model:

  • At state s0
  • Choose action a0
  • Get to s1 ~ Ps0,a0
  • Choose action a1
  • Get to s2 ~ Ps1,a1
  • ... and so on

After a while, the robot takes all the rewards collected along this trajectory and sums them up, with discounting, to obtain the total payoff:
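Summing the discounted rewards along the trajectory gives:

R(s0) + γ·R(s1) + γ²·R(s2) + γ³·R(s3) + ...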

The discount factor mirrors an economic intuition, in which one dollar earned today is more valuable than one dollar earned tomorrow.

The robot needs to choose actions over time (a0, a1, a2, ....) to maximize the expected payoff:
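Formally, the objective is to pick the actions that maximize the expected sum of discounted rewards:

max over a0, a1, a2, ... of E[ R(s0) + γ·R(s1) + γ²·R(s2) + ... ]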

Over time, a reinforcement learning algorithm learns a policy π : S → A, which is a mapping from states to actions; it gives the recommended action the robot should take based on the state it is currently in.

Optimal Policy for Grid World: A policy maps from states to actions, which means that, if you are in a particular state, it tells you which action to take. The following policy is the optimal policy, the one that maximizes the expected value of the total payoff, that is, the sum of discounted rewards. A policy always looks at the current state rather than previous states, which is the Markovian property.

One tricky spot is position (3,1): the optimal policy says to go left (west) rather than north, even though the northern route passes through fewer states, because going north risks drifting into the -1 state. Going left may take longer, but it arrives at the destination safely without falling into the negative trap. Policies like this come out of the computation; they are not obvious to humans, but a computer is very good at coming up with them.

Define: Vπ, V*, π*

Vπ - For any given policy π, the value function is Vπ : S → ℝ such that Vπ(s) is the expected total payoff obtained by starting in state s and executing π.
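The remaining two quantities are defined in the standard way:

V*(s) = max over π of Vπ(s)   (the best expected total payoff achievable from state s)
π* = arg max over π of Vπ(s)   (a policy that attains V*)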

Random policy for the grid world: The following is an example of a random policy and its value function. This policy is a rather bad one, with negative values. For any policy, we can write down the value function for that particular policy.

In plain English, the Bellman equation says that the value of the current state is equal to the immediate reward plus the discount factor applied to the expected total payoff of the new states S', weighted by the probability of transitioning into those states under the policy's action:
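Written out for a fixed policy π, this is:

Vπ(s) = R(s) + γ · Σs' Psπ(s)(s') · Vπ(s')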

Bellman equations are used to solve for the value function of a policy in closed form: given a fixed policy, they tell us how to solve the value function equations.

Bellman equations impose a set of linear constraints on value functions. It turns out that we can solve for the value function at every state s by solving a set of linear equations.

Example of Bellman equations with a grid world problem:

The chosen policy for cell (3,1) is to move north. However, the system is stochastic: about 80 percent of the time the robot moves in the commanded direction, and 20 percent of the time it drifts sideways, either left (10 percent) or right (10 percent).
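Plugging these numbers into the Bellman equation for state (3,1) under the north action gives:

Vπ((3,1)) = R((3,1)) + γ · [ 0.8·Vπ((3,2)) + 0.1·Vπ((2,1)) + 0.1·Vπ((4,1)) ]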

Similar equations can be written for all 11 states of the MDP within the grid. We obtain the following system, from which we can solve for all the unknown values using standard methods for systems of linear equations:

  • 11 equations
  • 11 unknown value function variables
  • 11 constraints

This is an n-equations-in-n-unknowns problem, which we can solve exactly as a system of linear equations, giving an exact, closed-form solution for Vπ over all the states of the grid.
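To make this concrete, here is a minimal sketch (not the book's code) that builds and solves these 11 Bellman equations for the grid world with NumPy. The discount factor (0.99), the per-step reward (-0.02), and the evaluated policy (always command north) are illustrative assumptions, since the text does not fix them.

import numpy as np

GAMMA = 0.99                 # assumed discount factor
STEP_REWARD = -0.02          # assumed small negative reward for non-terminal states
OBSTACLE = (2, 2)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}

# 11 valid states: all cells of the 4x3 grid except the obstacle
states = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != OBSTACLE]
idx = {s: i for i, s in enumerate(states)}

MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
# Commanding one direction drifts sideways 10% each way (80% intended direction)
DRIFT = {'N': ('W', 'E'), 'S': ('E', 'W'), 'E': ('N', 'S'), 'W': ('S', 'N')}


def step(s, a):
    """Deterministic move; bounce off walls and the obstacle (stay in place)."""
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    if nxt == OBSTACLE or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return s
    return nxt


# Fixed policy to evaluate: always command "north" (purely illustrative assumption)
policy = {s: 'N' for s in states}

n = len(states)
P = np.zeros((n, n))   # P[i, j] = probability of moving from state i to state j under the policy
R = np.zeros(n)        # immediate reward of each state

for s in states:
    i = idx[s]
    if s in TERMINALS:
        R[i] = TERMINALS[s]   # absorbing: its row of P stays zero, so V(s) = R(s)
        continue
    R[i] = STEP_REWARD
    a = policy[s]
    left, right = DRIFT[a]
    for action, prob in [(a, 0.8), (left, 0.1), (right, 0.1)]:
        P[i, idx[step(s, action)]] += prob

# Bellman equations in matrix form: V = R + gamma * P V  =>  (I - gamma * P) V = R
V = np.linalg.solve(np.eye(n) - GAMMA * P, R)

for s in sorted(states, key=lambda s: (-s[1], s[0])):   # print top row first
    print(f"V{s} = {V[idx[s]]:+.3f}")

Here, np.linalg.solve performs the direct O(n³) solve mentioned earlier; for 11 states this is instantaneous, but it is exactly the step that becomes expensive when the state space grows large.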
