Reinforcement learning basics

Before we dive deep into the details of reinforcement learning, I would like to cover some of the basics necessary for understanding the various nuts and bolts of RL methodologies. These basics appear across various sections of this chapter, and we will explain them in detail whenever required:

  • Environment: This is any system that has states, and mechanisms to transition between states. For example, the environment for a robot is the landscape or facility it operates in.
  • Agent: This is an automated system that interacts with the environment.
  • State: The state of the environment or system is the set of variables or features that fully describe the environment.
  • Goal or absorbing state or terminal state: This is the state that provides a higher discounted cumulative reward than any other state. A high cumulative reward prevents the best policy from being dependent on the initial state during training. Whenever the agent reaches its goal, one episode finishes.
  • Action: This defines the transition between states. The agent is responsible for performing, or at least recommending an action. Upon execution of the action, the agent collects a reward (or punishment) from the environment.
  • Policy: This defines the action to be selected and executed for any state of the environment. In other words, policy is the agent's behavior; it is a map from state to action. Policies could be either deterministic or stochastic.
  • Best policy: This is the policy generated through training. It defines the model in Q-learning and is constantly updated with each new episode.
  • Rewards: This quantifies the positive or negative interaction of the agent with the environment. Rewards are usually the immediate earnings the agent collects on reaching each state.
  • Returns or value function: A value function (also called the return) is a prediction of the future rewards obtainable from each state. It is used to evaluate the goodness or badness of states, based on which the agent chooses the next best state to move to (the return and value formulas are shown just after this list).
  • Episode: This defines the number of steps necessary to reach the goal state from an initial state. Episodes are also known as trials.
  • Horizon: This is the number of future steps or actions used in the maximization of the reward. The horizon can be infinite, in which case future rewards are discounted so that the value of the policy converges.
  • Exploration versus Exploitation: RL is a type of trial-and-error learning. The goal is to find the best policy and, at the same time, remain alert enough to explore some unknown policies. A classic example is treasure hunting: if we just go greedily to the known locations (exploitation), we fail to look for other places where hidden treasure might also exist (exploration). By exploring unknown states and taking chances, even when the immediate rewards are low, we might achieve greater goals. In other words, we escape a local optimum in order to reach the global optimum (which is exploration), rather than focusing purely on short-term immediate rewards (which is exploitation). Here are a couple of examples to explain the difference; a small ε-greedy Q-learning sketch later in this section shows how this trade-off is implemented in practice:
    • Restaurant selection: By exploring unknown restaurants once in a while, we might find a much better one than our regular favorite restaurant:
      • Exploitation: Going to your favorite restaurant
      • Exploration: Trying a new restaurant
    • Oil drilling example: By exploring new untapped locations, we may get new insights that are more beneficial than just drilling in the same place:
      • Exploitation: Drilling for oil at the best-known location
      • Exploration: Drilling at a new location
  • State-value versus state-action value function: The action-value Q represents the expected return (cumulative discounted reward) an agent receives when taking action a in state s and then behaving according to a certain policy π(a|s) afterwards (where π(a|s) is the probability of taking action a in state s).

The state-value, in turn, is the expected return an agent receives from being in state s and behaving under the policy π(a|s). More specifically, the state-value is an expectation over the action-values under that policy:
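
In the conventional notation (assumed here; the discount factor γ is the one mentioned under Horizon, and these are the standard textbook definitions rather than anything specific to this chapter):

    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

    Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right]

    V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right] = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)

Here the first line is the return promised under the Returns or value function bullet, and the last line makes explicit that the state-value averages the action-values according to the probabilities π(a|s).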

  • On-policy versus off-policy TD control: An off-policy learner learns the value of the optimal policy independently of the agent's actions; Q-learning is an off-policy learner (a short Q-learning sketch appears just after the prediction and control discussion below). An on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps.
  • Prediction and control problems: Prediction asks how well I do under a given policy: that is, if someone gives me a policy and I follow it, how much reward will I get? In control, the problem is to find the best policy so that I can maximize the reward.
  • Prediction: Evaluation of the values of states for a given policy. For example: for the uniform random policy, what is the value function for all states? (A short policy-evaluation sketch follows these bullets.)
  • Control: Optimize the future by finding the best policy. For example: what is the optimal value function over all possible policies, and what is the optimal policy?
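
As an illustration of the prediction problem, the following minimal sketch evaluates the uniform random policy on a tiny, made-up chain MDP (the environment, its rewards, and all names here are illustrative assumptions, not an example from this book). It repeatedly applies the Bellman expectation backup until the state values stop changing:

    import numpy as np

    # Hypothetical chain MDP for illustration: states 0..3, actions 0 = left,
    # 1 = right, state 3 is terminal, and every step costs a reward of -1.
    n_states, n_actions, gamma = 4, 2, 1.0

    def step_model(s, a):
        """Return (next_state, reward) for the assumed chain dynamics."""
        if s == 3:                                   # absorbing/terminal state
            return s, 0.0
        s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
        return s_next, -1.0

    # Iterative policy evaluation (prediction) for the uniform random policy.
    V = np.zeros(n_states)
    for _ in range(1000):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                s_next, r = step_model(s, a)
                V_new[s] += 0.5 * (r + gamma * V[s_next])   # pi(a|s) = 0.5 each
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < 1e-6:
            break

    print(V)   # converges to roughly [-12, -10, -6, 0] for this chain

Each entry of V answers the prediction question above: it is the expected return from that state if the agent keeps following the uniform random policy.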

Usually, in reinforcement learning, we need to solve the prediction problem first in order to solve the control problem afterwards, as we need to evaluate policies before we can figure out the best or optimal one.
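
To tie the exploration/exploitation trade-off and off-policy TD control together, here is a minimal tabular Q-learning sketch. It assumes a hypothetical environment object env with reset() and step(action) methods (returning an integer state, a reward, and a done flag) and small discrete state and action spaces; the interface, names, and hyperparameter values are illustrative assumptions rather than a prescribed implementation:

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.99, epsilon=0.1):
        """Tabular Q-learning with epsilon-greedy exploration (minimal sketch)."""
        Q = np.zeros((n_states, n_actions))           # action-value table
        for _ in range(episodes):
            state = env.reset()                       # start a new episode
            done = False
            while not done:
                # Exploration versus exploitation: with probability epsilon take
                # a random action (explore), otherwise the greedy one (exploit).
                if np.random.rand() < epsilon:
                    action = np.random.randint(n_actions)
                else:
                    action = int(np.argmax(Q[state]))
                next_state, reward, done = env.step(action)
                # Off-policy update: the target uses the max over next actions,
                # regardless of which action the behavior policy actually takes.
                target = reward + gamma * np.max(Q[next_state]) * (not done)
                Q[state, action] += alpha * (target - Q[state, action])
                state = next_state
        return Q

The ε-greedy choice is the exploration step mentioned earlier, while the max inside the update target is what makes Q-learning off-policy; the learned greedy policy is simply np.argmax(Q, axis=1).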

  • RL Agent Taxonomy: An RL agent includes one or more of the following components:
    • Policy: Agent's behavior function (map from state to action); Policies can be either deterministic or stochastic
    • Value function: How good each state is, or a prediction of the expected future reward for each state
    • Model: Agent's representation of the environment. A model predicts what the environment will do next:
      • Transitions: P predicts the next state (that is, the dynamics)
      • Rewards: R predicts the next (immediate) reward
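
In the standard MDP notation (used here as a convention; it is not defined elsewhere in this excerpt), these two model components are:

    \mathcal{P}^{a}_{ss'} = \mathbb{P}\left[ S_{t+1} = s' \mid S_t = s, A_t = a \right]

    \mathcal{R}^{a}_{s} = \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a \right]

That is, P gives the probability of moving to state s' after taking action a in state s, and R gives the expected immediate reward for that state-action pair.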

Let us explain the various categories possible in the RL agent taxonomy, based on combinations of the policy, value function, and model components, with the following maze example. In the maze, you have both a start and a goal; the agent needs to reach the goal as quickly as possible, taking a path that gains the maximum total reward and the minimum total negative reward. Broadly, this problem can be solved in five categorical ways:

  • Value based
  • Policy based
  • Actor critic
  • Model free
  • Model based