Model-based RL using MDPtoolbox

RL is a general-purpose framework for artificial intelligence. It is used to solve sequential decision-making problems. In RL, the computer is given a goal to achieve and learns how to accomplish that goal through interaction with its environment. A typical RL setup consists of five components, known as the Agent, Environment, Action, State, and Reward.

In RL, an agent interacts with the environment by choosing an action from a set of Actions (A). Based on the action taken by the agent, the environment transitions from the current state to a new state, where each state belongs to the set of States (S) of the environment. The transition generates a feedback Reward signal (a scalar quantity) from the environment. The reward is a measure of the agent's performance, and its value depends on the current state and the action performed. This cycle of the agent choosing actions and the environment responding with new states and rewards continues until the agent learns an optimal behavior that reaches the terminal state from any initial state while maximizing the cumulative reward, $G_t$.

The following diagram is an illustration of an RL setup:

The cumulative reward, $G_t$, can be expressed as follows:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$

Here, $G_t$ is the cumulative reward, $\gamma$ is the discount factor, and $t$ is the timestep.
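
As a quick illustration, the discounted return can be computed directly from a reward sequence. The following R snippet is a minimal sketch with made-up reward values and an assumed discount factor of 0.9:

```r
# Hypothetical reward sequence R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
rewards <- c(1, 0, 0, 5)
gamma <- 0.9

# G_t = sum over k of gamma^k * R_{t+k+1}
G_t <- sum(gamma^(seq_along(rewards) - 1) * rewards)
G_t  # 1*1 + 0.9*0 + 0.81*0 + 0.729*5 = 4.645
```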

The RL setup follows particular assumptions: the agent interacts with the environment sequentially, the time space is discrete, and the transitions follow the Markov property; that is, the environment's future state, $s'$, depends only on the current state, $s$. A Markov process is a memoryless random process; that is, a sequence of random states with the Markov property. The Markov decision process (MDP) extends this into a framework for modeling decision-making and specifies the mathematical structure used to find a solution to an RL problem.

It is a tuple of five components, (S, A, P, R, γ):

  • S: The set of states, $s \in S$.
  • A: The set of actions, $a \in A$.
  • P: The transition probability, $P_{ss'}^{a} = P(S_{t+1} = s' \mid S_t = s, A_t = a)$, which is the probability of reaching $s'$ after taking action $a$ in state $s$.
  • R: The reward function, $R_{ss'}^{a} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']$, which is the expected reward when moving from $s$ to $s'$ using action $a$.
  • γ: The discount factor, $\gamma \in [0, 1]$.
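
To make the tuple concrete, the components can be encoded numerically. The following R sketch builds a small, hypothetical two-state, two-action MDP in the format used by the MDPtoolbox package (P as an S x S x A array, R as an S x A matrix) and validates it with mdp_check(); the state names and numbers are illustrative only:

```r
library(MDPtoolbox)

# Transition probabilities P: an S x S x A array (2 states, 2 actions, made-up values)
P <- array(0, dim = c(2, 2, 2))
P[, , 1] <- matrix(c(0.8, 0.2,
                     0.1, 0.9), nrow = 2, byrow = TRUE)  # transitions under action 1
P[, , 2] <- matrix(c(0.5, 0.5,
                     0.6, 0.4), nrow = 2, byrow = TRUE)  # transitions under action 2

# Rewards R: an S x A matrix (reward for taking action a in state s)
R <- matrix(c( 5, 10,
              -1,  2), nrow = 2, byrow = TRUE)

# mdp_check() should return an empty string if (P, R) describe a valid MDP
mdp_check(P, R)
```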

Using MDP, we can find a policy, $\pi$, that maximizes the expected long-term reward, $\mathbb{E}[G_t]$, with discount factor, $\gamma$ (it determines how strongly future rewards are discounted). A policy defines the best action that an agent should take based on the current state; it maps states to actions. The function that estimates the long-term reward for an agent starting from a state, s, and following a policy, $\pi$, is known as a value function.

There are two types of value functions:

  • The state-value function, $v_\pi(s)$: For an MDP, it is the expected return for an agent beginning from state s and following the policy, $\pi$:

    $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

  • The action-value function, $q_\pi(s, a)$: For an MDP, it is the expected return for an agent beginning from state s, taking an action, a, and then following the policy, $\pi$:

    $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
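
For a small, finite MDP and a fixed policy, the state-value function can be computed exactly, because the expected return satisfies the linear relationship $v_\pi = R_\pi + \gamma P_\pi v_\pi$ (the Bellman expectation equation introduced later in this recipe). The following base R sketch uses hypothetical transition probabilities and expected rewards under some fixed policy $\pi$:

```r
# Exact evaluation of v_pi for a fixed policy on a tiny 2-state MDP (illustrative values)
gamma <- 0.9
P_pi  <- matrix(c(0.8, 0.2,
                  0.6, 0.4), nrow = 2, byrow = TRUE)  # state transitions under policy pi
R_pi  <- c(5, 2)                                      # expected immediate rewards under pi

# Solve (I - gamma * P_pi) v = R_pi for the state values
v_pi <- solve(diag(2) - gamma * P_pi, R_pi)
v_pi
```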

Among all the possible value functions, an optimal value function, $v_*(s)$, exists, which yields the highest expected return for all states. The policy that corresponds to the optimal value function is known as the optimal policy, $\pi_*$.

$v_*(s)$ and $q_*(s, a)$ can be expressed as follows:

$$v_*(s) = \max_{\pi} v_\pi(s)$$
$$q_*(s, a) = \max_{\pi} q_\pi(s, a)$$

An optimal policy can be found by maximizing over $q_*(s, a)$, and it can be described as follows:

$$\pi_*(s) = \arg\max_{a} q_*(s, a)$$
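
In code, extracting an optimal policy from the optimal action-value function amounts to a row-wise argmax over an S x A matrix of action values. Here is a minimal R sketch with a hypothetical q* matrix:

```r
# Hypothetical optimal action values q*(s, a): rows = states, columns = actions
q_star <- matrix(c(43.2, 45.1,
                   38.7, 36.9), nrow = 2, byrow = TRUE)

# pi*(s) = argmax_a q*(s, a): index of the best action in each state
pi_star <- apply(q_star, 1, which.max)
pi_star  # here: action 2 in state 1, action 1 in state 2
```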

Using the Bellman equation, we can find the optimal value function. The Bellman expectation equation defines a value function as the sum of the immediate reward received for transitioning from the current state, s, using action, a, and the discounted value of the next state, s':

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$
$$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$


The Bellman optimality equations for $v_*(s)$ and $q_*(s, a)$ are given as follows:

$$v_*(s) = \max_{a} q_*(s, a)$$
$$q_*(s, a) = \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma v_*(s') \right]$$

Also, the optimal state-value and action-value functions are recursively related by the Bellman optimality equations, as per the following equation:

$$v_*(s) = \max_{a} \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma v_*(s') \right]$$

From this, we get the following:

$$q_*(s, a) = \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma \max_{a'} q_*(s', a') \right]$$

There are many ways to solve the Bellman optimality equations, such as value iteration, policy iteration, SARSA, and Q-learning. RL techniques can be categorized into model-based and model-free approaches. Model-based algorithms rely on an explicit model of the environment, that is, a representation of the environment as an MDP that provides the state transition probabilities and rewards. Such MDPs can be solved by algorithms such as value iteration and policy iteration.
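
To illustrate how the Bellman optimality backup drives one of these solvers, here is a minimal, hand-rolled value iteration sketch in base R (not the MDPtoolbox implementation). It assumes P is an S x S x A transition array and R is an S x A reward matrix, as in the earlier sketch, and uses a simple convergence threshold:

```r
# Minimal value iteration: repeatedly apply the Bellman optimality backup
# v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s, s', a) * v(s') ]
value_iteration <- function(P, R, gamma, epsilon = 1e-6) {
  S <- dim(P)[1]
  A <- dim(P)[3]
  v <- rep(0, S)
  repeat {
    # Q[s, a] = immediate reward plus discounted expected value of the successor state
    Q <- sapply(1:A, function(a) R[, a] + gamma * as.vector(P[, , a] %*% v))
    v_new <- apply(Q, 1, max)
    if (max(abs(v_new - v)) < epsilon) break
    v <- v_new
  }
  list(V = v_new, policy = apply(Q, 1, which.max))
}
```

MDPtoolbox ships ready-made solvers for the same problem, such as mdp_value_iteration() and mdp_policy_iteration().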

On the other hand, model-free algorithms do not rely on an explicit model of the environment. Instead, they try to learn an optimal policy directly from the agent's dynamic interaction with the environment. In this recipe, we will solve an RL problem using a model-based policy iteration algorithm.
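
As a preview of the approach, the sketch below runs MDPtoolbox's policy iteration on the package's built-in forest-management example; the recipe itself may define its own transition and reward matrices, and the discount factor of 0.9 is an assumption here:

```r
library(MDPtoolbox)

# Built-in example MDP: a small forest-management problem
# mdp_example_forest() returns a list with a transition array P and a reward matrix R
forest <- mdp_example_forest()

# Model-based solution via policy iteration (discount factor 0.9 assumed)
solution <- mdp_policy_iteration(forest$P, forest$R, discount = 0.9)

solution$policy  # optimal action for each state
solution$V       # optimal state-value function
solution$iter    # number of policy-improvement iterations
```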
