Policy

In RL lingo, we call a strategy a policy. The goal of RL is to discover a good policy. One of the most common ways to find one is by observing the long-term consequences of actions in each state. The short-term consequence is easy to calculate: it's just the reward. Although performing an action yields an immediate reward, it is not always a good idea to greedily choose the action with the best reward. That is a lesson in life too, because the most immediately rewarding thing to do may not be the most satisfying in the long run. The best possible policy is called the optimal policy, and finding it is often the holy grail of RL. As Figure 3 shows, a policy tells us the action to take, given any state:

Figure 3: A policy defines an action to be taken in a given state
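To make the short-term versus long-term distinction concrete, here is a minimal sketch (not from the text) that scores a sequence of rewards by its discounted return; the discount factor gamma and the example reward sequences are assumptions chosen purely for illustration:

def discounted_return(rewards, gamma=0.9):
    # Sum of rewards, each weighted down by how far in the future it arrives.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

impatient = [10, 0, 0, 0]   # big reward now, nothing later
patient = [0, 0, 0, 20]     # wait a few steps for a larger payoff

print(discounted_return(impatient))  # 10.0
print(discounted_return(patient))    # 20 * 0.9**3, roughly 14.58

Here the greedy choice (taking the 10 immediately) looks best step by step, yet the patient sequence has the higher long-term value.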

We have already seen one type of policy, in which the agent always chooses the action with the greatest immediate reward, called the greedy policy. Another simple example of a policy is arbitrarily choosing an action, called the random policy. If you come up with a policy to solve an RL problem, it is often a good idea to double-check that your learned policy performs better than both the random and the greedy policies.
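As a concrete reference point, here is a minimal sketch of these two baseline policies in Python (the function and variable names are illustrative, not from the text); each one simply maps a state to an action:

import random

def random_policy(state, actions):
    # Random policy: ignore the state, pick any available action uniformly.
    return random.choice(actions)

def greedy_policy(state, actions, immediate_reward):
    # Greedy policy: pick the action whose immediate reward is highest.
    return max(actions, key=lambda a: immediate_reward(state, a))

A learned policy that cannot beat both of these cheap baselines is usually a sign that something is wrong with the learning setup.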

In addition, we will see how to develop a more robust kind of policy using policy gradients, in which a neural network learns a policy for picking actions by adjusting its weights through gradient descent, using feedback from the environment. We will see that, although both approaches have their uses, the policy-gradient approach is more direct and optimistic.
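As a preview, here is a minimal sketch of the policy-gradient idea, in the spirit of the REINFORCE algorithm. To keep it self-contained it uses a tiny tabular softmax policy rather than a neural network, and every name, constant, and the fake episode below are assumptions for illustration only:

import numpy as np

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # policy parameters (the "weights")

def policy(state):
    # Softmax over this state's action preferences: a probability per action.
    prefs = theta[state]
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def reinforce_step(episode, learning_rate=0.1, gamma=0.9):
    # Nudge theta so that actions followed by high return become more likely.
    returns = 0.0
    # Walk the episode backwards, accumulating the discounted return.
    for state, action, reward in reversed(episode):
        returns = reward + gamma * returns
        probs = policy(state)
        # Gradient of log pi(action | state) for a softmax policy:
        # one-hot(action) minus the action probabilities.
        grad_log = -probs
        grad_log[action] += 1.0
        theta[state] += learning_rate * returns * grad_log

# One fake episode of (state, action, reward) triples, just to show the call:
episode = [(0, 1, 0.0), (2, 0, 1.0)]
reinforce_step(episode)

The key design choice is that the environment's feedback (the return) directly scales the gradient step, so the policy is shaped by consequences rather than by a separately learned value table.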
