Chapter 11. In conclusion: A review and roadmap

In this final chapter we’ll first take a moment to briefly review what we’ve learned, highlighting and distilling what we think are the most important skills and concepts to take away. We have covered the fundamentals of reinforcement learning, and if you have made it this far and have engaged with the projects, you’re well-positioned to implement many other algorithms and techniques.

This book is a course on the fundamentals of deep reinforcement learning, not a textbook or reference. That means we could not possibly have introduced all there is to know about DRL, and we had to make tough choices about what to leave out. There are a number of exciting topics in DRL we wish we could have included, and there are some topics that, despite being “industry standards,” were inappropriate to include in a project-focused introductory book like this one. However, we wanted to leave you with a roadmap for where to go from here with your new skills.

In the second part of this chapter, we’ll introduce, at a high level, some topics, techniques, and algorithms in DRL that are worth knowing if you’re serious about continuing in the DRL field. We didn’t cover these areas because most of them involve advanced mathematics that we do not expect readers of this book to be familiar with, and we did not have the space to teach more mathematics.

11.1. What did we learn?

Deep reinforcement learning is the combination of deep learning and reinforcement learning. Reinforcement learning is a framework for solving control tasks, which are problems in which an agent can take actions that lead to positive or negative rewards, given some environment. The environment is the universe in which the agent acts. The agent can either have full access to the state of the environment, or it may only have partial access to the state of the environment, which is called partial observability. The environment evolves in discrete time steps according to some dynamical rules, and at each time step the agent takes an action that may influence the next state. After taking each action, the agent receives feedback in the form of a reward signal. We described a mathematical formalization of this called the Markov decision process (MDP).

An MDP is a mathematical structure that includes a set of states, S, that the environment can be in, and a set of actions, A, that the agent can take, which may depend on the particular state of the environment. There is a reward function, R(st,at,st+1), that produces the reward signal, given the transition from the current state to the next state and the agent’s action. The environment may evolve deterministically or stochastically, but in either case, the agent initially does not know the dynamical rules of the environment, so all state transitions must be described probabilistically from the perspective of the agent.

We therefore have a conditional probability distribution over next states, st+1, given the current state and the action taken by the agent at the current time step, denoted Pr(St+1|st,at). The agent follows some policy, π, which is a function that maps the current state st to a probability distribution over actions: π: S → Pr(A). The objective of the agent is to take actions that maximize the time-discounted cumulative rewards over some time horizon. The time-discounted cumulative reward is termed the return (often denoted with the character G or R), and is equal to

Gt = rt + γrt+1 + γ²rt+2 + γ³rt+3 + ⋯

That is, the return Gt at time t is the sum of discounted rewards for each time step until the end of the episode (for episodic environments), or until the sequence converges (for non-episodic environments). The γ factor is the discount rate, a parameter in the interval (0,1) that determines how quickly the sequence converges, and thus how much the future is discounted. A discount rate close to 1 means that future rewards are given similar weight to immediate rewards (optimizing over the long term), whereas a low discount rate leads to preferring only short-term rewards.
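
To make the return concrete, here is a minimal Python sketch (our illustration, not code from earlier chapters) that computes Gt for an arbitrary list of rewards and an arbitrary discount rate:

# A minimal sketch: compute the return G_t = r_t + γ*r_{t+1} + γ²*r_{t+2} + ...
def discounted_return(rewards, gamma=0.9):
    G = 0.0
    for r in reversed(rewards):   # work backward so each reward is discounted once per step of delay
        G = r + gamma * G
    return G

print(discounted_return([1, 0, 0, 5], gamma=0.9))   # = 1 + 0.9*0 + 0.81*0 + 0.729*5 ≈ 4.645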

A derived concept from this basic MDP framework is that of a value function. A value function assigns a value to either states or state-action pairs (i.e., a value for taking an action in a given state), with the former being called a state-value function (or often just the value function) and the latter being the action value or Q function. The value of a state is simply the expected return given that the agent starts in that state and follows some policy π, so value functions implicitly depend on the policy. Similarly, the action value or Q value of a state-action pair is the expected return given that the agent takes the action in that state and follows the policy π until the end. A state that puts the agent close to winning a game, for example, would be assigned a high state value, assuming the underlying policy is reasonable. We denote the value function as Vπ(s), with the subscript π indicating the dependence of the value on the underlying policy, and the Q function as Qπ(s,a), although we often drop the π subscript for convenience.
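
In symbols, using the notation above and the definition of the return Gt, these two functions are just conditional expectations of the return under the policy π:

Vπ(s) = Eπ[Gt | st = s]
Qπ(s,a) = Eπ[Gt | st = s, at = a]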

Right now we understand the function Qπ(s,a) as being some sort of black box that tells us the exact expected return for taking action a in state s, but, of course, we do not have access to such an all-knowing function; we have to estimate it. In this book we used neural networks to estimate value functions and policy functions, although any suitable function approximator could work. In the case of a neural network-based Qπ(s,a), we trained neural networks to predict the expected rewards. The value functions are defined and approximated recursively, such that Vπ(s) is updated as

Vπ(s) = rt + γVπ(s′)

where s′ refers to the next state, or st+1. For example, in Gridworld, landing on the goal tile results in +10, landing on the pit (losing the game) results in –10, and all other non-terminal moves are penalized at –1. If the agent is two steps away from the winning goal tile, the final state is valued at Vπ(s3) = 10. Then, if γ = 0.9, the previous state is valued at Vπ(s2) = r2 + 0.9Vπ(s3) = –1 + 9 = 8, and the state before that at Vπ(s1) = r1 + 0.9Vπ(s2) = –1 + 7.2 = 6.2. As you can see, states farther from the winning state are valued less.
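
The same backup can be written as a few lines of Python. This is only an illustration of the arithmetic above; the state names and rewards are the ones from the example, not code from the Gridworld project:

# Back up values from the +10 goal, as in the example above
gamma = 0.9
V_s3 = 10.0                      # the final state, two steps from the start of the example
V_s2 = -1.0 + gamma * V_s3       # -1 + 9.0 = 8.0
V_s1 = -1.0 + gamma * V_s2       # -1 + 7.2 = 6.2
print(V_s1, V_s2, V_s3)          # ≈ 6.2 8.0 10.0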

Training a reinforcement learning agent, then, just amounts to successfully training a neural network to either approximate the value function (so the agent will choose actions that lead to high-value states) or to directly approximate the policy function by observing rewards after actions and reinforcing actions based on the rewards received. Both approaches have their pros and cons, but often we combine learning both a policy and a value function together, which is called an actor-critic algorithm, where the actor refers to the policy and the critic refers to the value function.
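
To make the actor-critic idea concrete, here is a minimal PyTorch sketch of a two-headed network. The layer sizes, state dimension, and number of actions are arbitrary placeholders, not the architecture from any particular chapter:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """A small two-headed network: the actor head outputs a policy, the critic head a state value."""
    def __init__(self, state_dim=4, n_actions=4, hidden=64):
        super().__init__()
        self.shared = nn.Linear(state_dim, hidden)     # shared feature layer
        self.actor = nn.Linear(hidden, n_actions)      # policy head (logits over actions)
        self.critic = nn.Linear(hidden, 1)             # value head (scalar V(s))

    def forward(self, state):
        x = F.relu(self.shared(state))
        policy = F.softmax(self.actor(x), dim=-1)      # probability distribution over actions
        value = self.critic(x)                         # estimated state value
        return policy, value

state = torch.randn(1, 4)                              # a dummy state vector
probs, value = ActorCritic()(state)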

11.2. The uncharted topics in deep reinforcement learning

The Markov decision process framework and the value and policy functions we just reviewed were detailed in chapters 2–5. We then spent the rest of the book implementing more sophisticated techniques for successfully training value and policy functions in difficult environments (e.g., environments with sparse rewards) and environments with multiple interacting agents. Unfortunately, there were many exciting things we didn’t have the space to cover, so we’ll end the book with a brief tour of some other areas in deep reinforcement learning you might want to explore. We’ll only give a taste of a few topics we think are worth knowing about, and we hope you will look into these areas more deeply on your own.

11.2.1. Prioritized experience replay

We briefly mentioned the idea of prioritized replay earlier in the book when we decided to add multiple copies of the same experience to the replay memory if the experience led to a winning state. Since winning states are rare, and we want our agent to learn from these informative events, we thought that adding multiple copies would ensure that each training epoch would include a few of these winning events. This was a very unsophisticated means of prioritizing experiences in the replay based on how informative they are in training the agent.

The term prioritized experience replay generally refers to a specific implementation introduced in an academic paper titled “Prioritized Experience Replay” by Tom Schaul et al. (2015), and it uses a much more sophisticated mechanism to prioritize experiences. In their implementation, all experiences are recorded just once, unlike our approach, but rather than selecting a mini-batch from the replay completely randomly (i.e., uniformly), they preferentially select experiences that are more informative. They defined informative experiences as not merely those that led to a winning state like we did, but rather those where the DQN had a high error in predicting the reward. In essence, the model preferentially trains on the most surprising experiences. As the model trains, however, the once surprising experiences become less surprising, and the preferences get continually reweighted. This leads to substantially improved training performance. This kind of prioritized experience replay is standard practice for value-based reinforcement learning, whereas policy-based reinforcement learning still tends to rely on using multiple parallelized agents and environments.
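
To sketch the mechanism, here is a toy prioritized replay buffer in Python. This is our simplified illustration, not Schaul et al.’s implementation: the real version uses a sum-tree for efficient sampling and adds importance-sampling weights to correct for the biased sampling, both of which are omitted here.

import numpy as np

class SimplePrioritizedReplay:
    """Toy prioritized replay: sample transitions in proportion to their last TD error."""
    def __init__(self, alpha=0.6, eps=1e-5):
        self.buffer, self.priorities = [], []
        self.alpha, self.eps = alpha, eps              # alpha controls how strongly errors skew sampling

    def add(self, transition, td_error):
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        p = np.array(self.priorities)
        p = p / p.sum()                                 # sampling probabilities
        idx = np.random.choice(len(self.buffer), batch_size, p=p)
        return [self.buffer[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):        # reweight after training on the batch
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha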

11.2.2. Proximal policy optimization (PPO)

We mostly implemented deep Q-networks (DQNs) in this book rather than policy methods, and for good reason. The (deep) policy functions we implemented in chapters 4 and 5 were rather unsophisticated and would not work very well for more complex environments. The problem is not with the policy networks themselves but with the training algorithm: the simple REINFORCE algorithm we used is fairly unstable, and when the rewards vary significantly from action to action, it does not lead to stable results. We need a training algorithm that enforces smoother, more constrained updates to the policy network.
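
To see where the instability comes from, recall that the core of REINFORCE is a single loss term: the log-probability of each action taken, scaled by the return that followed it, so noisy returns translate directly into noisy gradients. A minimal PyTorch sketch with placeholder tensors (the names and random data are ours, standing in for real episode data):

import torch

# log_probs: log π(a_t | s_t) for the actions actually taken in an episode
# returns:   the discounted return G_t observed after each action
log_probs = torch.randn(10, requires_grad=True)
returns = torch.randn(10)

loss = -(log_probs * returns).sum()   # reinforce each action in proportion to its return
loss.backward()                       # the gradients are as noisy as the returns themselves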

Proximal policy optimization (PPO) is a more advanced training algorithm for policy methods that allows for far more stable training. It was introduced in the paper “Proximal Policy Optimization Algorithms” by John Schulman et al. (2017) at OpenAI. We did not cover PPO in this book because, while the algorithm itself is relatively simple, understanding it requires mathematical machinery that is outside the scope of this introductory book. Making deep Q-learning more stable required only a few intuitive upgrades, like adding a target network and implementing double Q-learning, which is why we preferred value learning over policy methods in this book. However, in many cases directly learning a policy function is more advantageous than learning a value function, such as in environments with a continuous action space, since we cannot build a DQN that outputs a separate Q value for each of infinitely many actions.
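
For a taste of what PPO adds, here is a sketch of its clipped surrogate objective in PyTorch. This is only the loss term from Schulman et al.’s paper, applied to placeholder tensors; a real implementation also needs advantage estimation, a value loss, an entropy bonus, and machinery for collecting rollouts.

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate objective from the PPO paper (to be minimized)."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # π_new(a|s) / π_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # clipping keeps each policy update small

# Placeholder data just to show the call
new_lp, old_lp = torch.randn(32, requires_grad=True), torch.randn(32)
adv = torch.randn(32)
loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()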

11.2.3. Hierarchical reinforcement learning and the options framework

When a child learns to walk, they aren’t thinking about which individual muscle fibers to activate and for how long. Likewise, when a businessperson is debating a business decision with colleagues, they aren’t thinking about the individual sequence of sounds they need to make for the other people to understand their business strategy. Our actions exist at various levels of abstraction, from moving individual muscles up to grand schemes. It’s like noticing that a story is made up of individual letters, but those letters are composed into words, which are composed into sentences, paragraphs, and so on. The writer may be thinking of the next scene in general terms, and only once that’s decided will they actually get to typing individual characters.

All the agents we’ve implemented in this book operate at the level of typing individual characters, so to speak; they are incapable of thinking at a higher level. Hierarchical reinforcement learning is an approach to solving this problem, allowing agents to build up higher-level actions from lower-level ones. Rather than our Gridworld agent deciding one step at a time what to do, it might survey the board and decide on a higher-level sequence of actions. It might learn reusable sequences such as “move all the way up” or “move around an obstacle” that can be applied in a variety of game states.

The success of deep learning in reinforcement learning is due to its ability to represent complex high-dimensional states in a hierarchy of higher-level state representations. In hierarchical reinforcement learning, the goal is to extend this to representing states and actions hierarchically. One popular approach to this is called the options framework.

Consider Gridworld, which has four primitive actions (up, right, left, and down), each lasting one time step. In the options framework, there are options rather than just primitive actions. An option is the combination of an option policy (which, like a regular policy, takes a state and returns a probability distribution over actions), a termination condition, and an input set (a subset of states). The idea is that a particular option gets triggered when the agent encounters a state in the option’s input set, and that option’s policy is run until the termination condition is met, at which point a different option may be selected. These option policies might be simpler than the single, big, deep neural network policies we implemented in this book, and by intelligently selecting among these higher-level options, we gain efficiency by not having to run a more computationally intensive policy for every primitive step.
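
Schematically, an option can be represented as a bundle of three callables plus a small loop that runs the option’s policy until it terminates. The Python sketch below is our own schematic rendering of the framework, not a standard API; the state, step_fn, and callables are placeholders:

from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    name: str
    initiation_set: Callable[[object], bool]   # states in which this option may be triggered
    policy: Callable[[object], int]            # the option policy: maps a state to a primitive action
    terminates: Callable[[object], bool]       # the termination condition

def run_option(state, step_fn, option, max_steps=50):
    """Run one option's policy until its termination condition (or a step limit) is met."""
    for _ in range(max_steps):
        if option.terminates(state):
            break
        action = option.policy(state)
        state = step_fn(state, action)         # step_fn is the environment's transition function
    return state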

11.2.4. Model-based planning

We already discussed the idea of models in reinforcement learning in two contexts. In the first, a model is simply another term for an approximating function like a neural network. We sometimes just refer to our neural network as a model, since it approximates or models the value function or the policy function.

The other context is when we refer to model-based versus model-free learning. In both cases we are using a neural network as a model of the value function or a policy, but in this case model-based means the agent is making decisions based on an explicitly constructed model of the dynamics of the environment itself rather than just its value function. In model-free learning, all we care about is learning to accurately predict rewards, which may or may not require a deep understanding of how the environment actually works. In model-based learning, we actually want to learn how the environment works. Metaphorically, in model-free learning we are satisfied knowing that there is something called gravity that makes things fall, and we make use of this phenomenon, but in model-based learning we want to actually approximate the laws of gravity.

Our model-free DQN worked surprisingly well, especially when combined with other advances like curiosity, so what is the advantage of explicitly learning a model of the environment? With an explicit and accurate environment model, the agent can learn to make long-term plans rather than just deciding which next action to take. By using its environment model to predict the future several time steps ahead, it can evaluate the long-term consequences of its immediate actions, and this can lead to faster learning (due to increased sample efficiency). This is related to, but not necessarily the same as, the hierarchical reinforcement learning we discussed, since hierarchical reinforcement learning does not necessarily depend on an environment model. But with an environment model, the agent can plan out a sequence of primitive actions to accomplish some higher-level goal.

The simplest way to train an environment model is to have a separate deep learning module that predicts future states. In fact, we did just that in chapter 8 on curiosity-based learning, but we did not use the environment model to plan or look into the future; we only used it to explore surprising states. But with a model, M(st), that takes a state and returns a predicted next state, st+1, we could take that predicted next state and feed it back into the model to get the predicted state st+2, and so on. How far into the future we can predict depends on the inherent randomness in the environment and the accuracy of the model, but even if we could only predict accurately a few time steps ahead, this would be immensely useful.
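
As a sketch of what such a rollout might look like, the PyTorch code below defines a placeholder dynamics model and feeds its own predictions back in for a few steps. In practice the model would be trained on real transitions and would usually condition on the action as well:

import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Placeholder learned model M(s_t) -> predicted s_{t+1}."""
    def __init__(self, state_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, state_dim))

    def forward(self, state):
        return self.net(state)

def rollout(model, state, n_steps=5):
    """Feed the model's predictions back into itself to imagine n_steps into the future."""
    trajectory = [state]
    with torch.no_grad():
        for _ in range(n_steps):
            state = model(state)            # the predicted next state becomes the new input
            trajectory.append(state)
    return trajectory

imagined = rollout(DynamicsModel(), torch.randn(1, 4), n_steps=5)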

11.2.5. Monte Carlo tree search (MCTS)

Many games have a finite set of actions and a finite length, such as chess, Go, and Tic-Tac-Toe. The Deep Blue algorithm that IBM developed to play chess didn’t use machine learning at all; it was a brute force algorithm that used a form of tree search. Consider the game of Tic-Tac-Toe. It is a two-player game typically played on a square 3 × 3 grid where player 1 places an X-shaped token and player 2 places an O-shaped token. The goal of the game is to be the first to get three of your tokens lined up in a row, column, or diagonal.

The game is so simple that human strategy also generally involves a limited tree search. If you’re player 2 and there’s already one opposing token on the grid, you can consider each of your possible responses in the open spaces, each of your opponent’s possible replies to those, and so on until the end of the game. Of course, even for a 3 × 3 board, the first move has nine possible actions, then eight possible actions for player 2, then seven possible actions for player 1 again, so the number of possible trajectories (the game tree) becomes quite large. But a brute-force exhaustive search like this could be guaranteed to win at Tic-Tac-Toe, assuming the opponent isn’t using the same approach.
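
To get a feel for how small this game tree really is, here is a brute-force Python sketch (the board encoding is our own) that enumerates every complete game of Tic-Tac-Toe from the empty board:

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None (board is a 9-char string)."""
    lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
    for a, b, c in lines:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def count_games(board=' ' * 9, player='X'):
    """Exhaustively enumerate every legal game continuation from this position."""
    if winner(board) or ' ' not in board:
        return 1                                    # terminal position: a win or a draw
    total = 0
    for i in range(9):
        if board[i] == ' ':                         # try every open square
            total += count_games(board[:i] + player + board[i+1:], 'O' if player == 'X' else 'X')
    return total

print(count_games())    # 255,168 complete games, tiny by the standards of chess or Go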

For a game like chess, the game tree is far too large for a completely brute-force search; one must necessarily limit the number of potential moves to consider. Deep Blue used a tree-search algorithm that is more efficient than exhaustive search but still involved no learning. It still amounted to searching possible trajectories and computing which ones led to winning states.

Another approach is Monte Carlo tree search (MCTS), in which you randomly sample a set of potential actions and expand the tree from there, rather than considering all possible actions. The AlphaGo algorithm developed by DeepMind to play the game Go used a deep neural network to decide which actions were worth doing a tree search on and also to evaluate the value of selected moves. AlphaGo therefore combined brute-force search with deep neural networks to get the best of both. These kinds of combination algorithms are currently state of the art for games in the class of chess and Go.
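
The sketch below is a toy, self-contained Monte Carlo tree search for Tic-Tac-Toe. Unlike AlphaGo, it uses purely random rollouts and a simple upper-confidence-bound rule instead of neural networks to choose and evaluate moves, but it shows the four standard phases: selection, expansion, simulation, and backpropagation.

import math, random

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None (board is a 9-char string)."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def moves(board):
    return [i for i in range(9) if board[i] == ' ']

def play(board, i, player):
    return board[:i] + player + board[i+1:]

def rollout(board, player):
    """Simulation phase: play uniformly random moves to the end; score from X's point of view."""
    while winner(board) is None and moves(board):
        board = play(board, random.choice(moves(board)), player)
        player = 'O' if player == 'X' else 'X'
    w = winner(board)
    return 0 if w is None else (1 if w == 'X' else -1)

class Node:
    def __init__(self, board, player):
        self.board, self.player = board, player   # player = whose turn it is at this node
        self.children = {}                        # move index -> child Node
        self.visits, self.value = 0, 0.0          # value accumulated from X's point of view

def ucb(parent, child, c=1.4):
    """Selection rule: exploit the child's average value, explore rarely visited children."""
    avg = child.value / child.visits
    if parent.player == 'O':                      # O prefers children with low value for X
        avg = -avg
    return avg + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(root_board, root_player='X', n_sims=2000):
    root = Node(root_board, root_player)
    for _ in range(n_sims):
        node, path = root, [root]
        # 1. Selection: descend while the current node is fully expanded
        while node.children and len(node.children) == len(moves(node.board)):
            node = max(node.children.values(), key=lambda ch: ucb(node, ch))
            path.append(node)
        # 2. Expansion: add one untried child, unless the node is terminal
        if winner(node.board) is None and moves(node.board):
            m = random.choice([mv for mv in moves(node.board) if mv not in node.children])
            child = Node(play(node.board, m, node.player), 'O' if node.player == 'X' else 'X')
            node.children[m] = child
            path.append(child)
            node = child
        # 3. Simulation: random playout from the new node
        result = rollout(node.board, node.player)
        # 4. Backpropagation: update every node along the path
        for n in path:
            n.visits += 1
            n.value += result
    return max(root.children, key=lambda mv: root.children[mv].visits)

print(mcts(' ' * 9))   # index of X's most-visited opening move (often 4, the center)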

11.3. The end

Thank you for reading our book! We really hope you have learned a satisfying amount about deep reinforcement learning. Please reach out to us in the forums at Manning.com with any questions or comments. We look forward to hearing from you.
