Performing model-based learning

As the name suggests, the learning is augmented using a predefined model. Here, the model is represented in the form of transition probabilities, and the key objective is to determine the optimal policy and value functions using these predefined model attributes (that is, transition probability matrices, or TPMs). A policy is the rule by which an agent chooses its actions as it traverses multiple states. In other words, a policy identifies the best action for the agent to take in a given state in order to move to the next state.

The objective of the policy is to maximize the cumulative reward of transitioning from the start state to the destination state, defined as follows, where P(s) is the cumulative policy P from a start state s, and R is the reward of transitioning from state s_t to state s_{t+1} by performing an action a_t.
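One plausible way to write this objective, using the symbols named above, is sketched below. The finite horizon N, the undiscounted sum, and the conditioning on s_0 = s are assumptions made for illustration rather than details taken from the text.

```latex
% Assumed form: maximize the expected sum of rewards collected
% over N transitions, starting from state s.
P(s) = \max \; \mathbb{E}\left[ \sum_{t=0}^{N-1} R(s_t, a_t, s_{t+1}) \;\middle|\; s_0 = s \right]
```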

There are two types of value function: the state-value function and the state-action value function. For a given policy, the state-value function is the expected cumulative reward of being in a particular state (including the start state), whereas the state-action value function is the expected cumulative reward of being in a particular state (including the start state) and taking a particular action.
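The two value functions can be written side by side as in the sketch below. The names V and Q, the superscript P for the policy being followed, and the finite horizon N are notational assumptions; the only difference between the two definitions is that the second also fixes the first action.

```latex
% State-value function: expected cumulative reward from state s under policy P.
V^{P}(s) = \mathbb{E}_{P}\left[ \sum_{t=0}^{N-1} R(s_t, a_t, s_{t+1}) \;\middle|\; s_0 = s \right]

% State-action value function: the same expectation, with the first action fixed to a.
Q^{P}(s, a) = \mathbb{E}_{P}\left[ \sum_{t=0}^{N-1} R(s_t, a_t, s_{t+1}) \;\middle|\; s_0 = s,\; a_0 = a \right]
```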

A policy is said to be optimal if it returns the maximum expected cumulative reward. The state-value and state-action value functions that correspond to an optimal policy are termed the optimal state-value function and the optimal state-action value function.

In model-based learning, the following iterative steps are performed in order to obtain an optimum policy, as shown in the following figure:

Iterative steps to find an optimum policy

In this section, we shall evaluate the policy using the state-value function. In each iteration, the policy is evaluated using the Bellman equation, as follows, where V_i denotes the value at iteration i, P denotes an arbitrary policy for a given state s and action a, T denotes the transition probability from state s to state s' due to an action a, R denotes the reward at state s' when arriving from state s after an action a, and γ denotes a discount factor in the range (0,1). The discount factor gives rewards earned in earlier steps more weight than rewards earned in later steps.
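The evaluation step can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the text: it assumes the standard form of the Bellman update, V_{i+1}(s) = sum_a P(s,a) sum_s' T(s,a,s') [R(s,a,s') + γ V_i(s')], and it assumes that the policy, the transition probabilities, and the rewards are stored as NumPy arrays. The function name `evaluate_policy`, the stopping tolerance, and the two-state toy problem are all hypothetical.

```python
import numpy as np


def evaluate_policy(policy, T, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iteratively evaluate a fixed policy with the Bellman update (a sketch).

    Assumed array layout (not taken from the text):
      policy[s, a] : probability of choosing action a in state s
      T[s, a, s2]  : probability of moving from s to s2 under action a
      R[s, a, s2]  : reward received for that transition
    """
    n_states = T.shape[0]
    V = np.zeros(n_states)  # V_0: start from all-zero values
    for _ in range(max_iter):
        # For each (s, a): sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V_i(s'))
        q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)
        # Weight each action by the policy to obtain V_{i+1}(s)
        V_new = np.sum(policy * q, axis=1)
        # Stop once successive iterates are (almost) identical
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical toy problem: two states, two actions (0 = stay, 1 = move).
T_toy = np.zeros((2, 2, 2))
for s in range(2):
    T_toy[s, 0, s] = 1.0        # "stay" keeps the agent in state s
    T_toy[s, 1, 1 - s] = 1.0    # "move" switches to the other state

R_toy = np.zeros((2, 2, 2))
R_toy[:, :, 1] = 1.0            # reward of 1 for arriving in state 1

policy_toy = np.array([[0.0, 1.0],   # in state 0: always move
                       [0.0, 1.0]])  # in state 1: always move

V_toy = evaluate_policy(policy_toy, T_toy, R_toy, gamma=0.9)
print(V_toy)
```

Under the always-move policy, the toy problem satisfies V(0) = 1 + γ V(1) and V(1) = γ V(0), so the iterates should approach V(0) = 1 / (1 - γ²), which gives a quick sanity check on the update.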

