
2. RL Theory and Algorithms

This chapter covers how Reinforcement Learning works and explains the concepts behind it, including the different algorithms that form the basis of Reinforcement Learning.
The chapter explains these algorithms, but to start with, you will learn why Reinforcement Learning can be hard and see some different scenarios. The chapter also covers different ways that Reinforcement Learning can be implemented.
Along the way, the chapter formulates the Markov Decision Process (MDP) and describes it. The chapter also covers SARSA and touches on temporal differences. Then, the chapter touches on Q Learning and dynamic programming.

Theoretical Basis of Reinforcement Learning

This section touches on the theoretical basis of Reinforcement Learning. Figure 2-1 shows how you are going to implement MDP, which is described later.
Figure 2-1. Theoretical basis of MDP
Environments in Reinforcement Learning are represented by the Markov Decision Process (discussed later in this chapter).
  • S is a finite set of states. A is a finite set of actions.
  • T : S × A × S → [0, 1] is a transition model that maps (state, action, state) triples to probabilities.
  • T(s, a, s′) is the probability that you’ll land in state s′ if you were in state s and took action a.
In terms of conditional probabilities, the following is true:
T(s, a, s′) = P(s′ | s, a)
R : S × S → ℝ is a reward function that gives a real number that represents the amount of reward (or punishment) the environment will grant for a state transition. R(s, s′) is the reward received after transitioning to state s′ from state s.
If the transition model is known to the agent, i.e., the agent knows where it would probably go from where it stands, it’s fairly easy for the agent to know how to act in a way that maximizes its expected utility from its experience with the environment.
We can define the expected utility for the agent to be the accumulated rewards it gets throughout its experience with the environment. If the agent goes through the states s0, s1, …, sn−1, sn, you can formally define its expected utility as follows:
∑_{t=1}^{n} γ^t E[R(s_{t−1}, s_t)]
where γ is a discount factor used to decrease the values (and hence the importance) of rewards that lie further in the future, and E is the expected value.
The problem arises when the agent has no clue about the probabilistic model behind the transitions, and this is where RL comes in. The RL problem can now be formally defined as the problem of learning a set of parameters in order to maximize the expected utility.
RL comes in two flavors:
  • Model-based : The agent attempts to sample and learn the probabilistic model and use it to determine the best actions it can take. In this flavor, the set of parameters that was vaguely referred to is the MDP model.
  • Model-free : The agent doesn’t bother with the MDP model and instead attempts to develop a control function that looks at the state and decides the best action to take. In that case, the parameters to be learned are the ones that define the control function.
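To make the definitions above concrete, here is a minimal sketch of a tiny MDP in Python. The states, actions, probabilities, and rewards are invented for illustration; only the structure (a transition model T mapping (state, action) pairs to next-state probabilities, rewards on transitions, and a γ-discounted sum of rewards) follows the definitions above.
import random

# Hypothetical two-state MDP, purely for illustration.
# T[(s, a)] maps each possible next state s' to T(s, a, s') = P(s' | s, a).
T = {
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"):   {"s0": 0.1, "s1": 0.9},
    ("s1", "stay"): {"s1": 1.0},
}
R = {("s0", "s1"): 1.0}   # R(s, s'); transitions not listed here give reward 0
GAMMA = 0.9               # discount factor

def step(s, a):
    # Sample s' according to T(s, a, .) and return (s', R(s, s')).
    nxt = T[(s, a)]
    s_ = random.choices(list(nxt.keys()), weights=list(nxt.values()))[0]
    return s_, R.get((s, s_), 0.0)

def discounted_return(states):
    # Sum of gamma^t * R(s_{t-1}, s_t) along one sampled trajectory
    # (a single sample of the expectation in the formula above).
    return sum(GAMMA ** t * R.get((states[t - 1], states[t]), 0.0)
               for t in range(1, len(states)))

s = "s0"
trajectory = [s]
for _ in range(5):
    s, _r = step(s, "go")
    trajectory.append(s)
print(trajectory, discounted_return(trajectory))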

Where Reinforcement Learning Is Used

This section discusses the different fields of Reinforcement Learning, as shown in Figure 2-2.
Figure 2-2. Different fields of Reinforcement Learning

Manufacturing

In manufacturing, factory robots use Reinforcement Learning to pick an object from one box and place it in another container.
Whether it fails or succeeds at the delivery, the robot learns from the outcome, with the end goal of performing the task with the greatest accuracy.

Inventory Management

In terms of inventory management, Reinforcement Learning can be used to reduce the transit time involved in stocking and retrieving products in warehouses and to utilize warehouse space optimally.

Delivery Management

Reinforcement Learning is applied to solve the problem of split delivery vehicle routing. Q Learning is used to decide which customers a single vehicle should serve.

Finance Sector

Reinforcement Learning is being used in the finance sector, for example in accounting and in developing and evaluating trading strategies.

Why Is Reinforcement Learning Difficult?

One of the toughest parts of Reinforcement Learning is having to map the environment and include all possible moves. For example, consider a board game.
You have to apply artificial intelligence to what is learned. In theory, Reinforcement Learning should be a perfect fit, but a board game involves a huge number of state jumps and complex moves, so applying Reinforcement Learning by itself becomes difficult.
To get the best results, we apply a rule-based engine with Reinforcement Learning. If we don’t apply a rule-based engine, there are so many options in board games that the agent will take forever to discover the path.
First of all, we apply simple rules so that the AI learns quickly and then, as the complexity increases, we apply Reinforcement Learning.
Figure 2-3 shows how applying Reinforcement Learning can be difficult.
Figure 2-3. Reinforcement Learning with rules

Preparing the Machine

Before you can run the examples, you need to perform certain steps to install the software. The examples in this book use the Anaconda version of Python, so this section explains how to find and download it. First, you need to open a terminal. The process of starting the terminal is shown in Figure 2-4.
Figure 2-4. Opening the terminal
Next, you need to update the packages. Write the following command in the terminal to do so. See Figure 2-5.
sudo apt-get update
Figure 2-5. Updating the packages
After you run the update command, the package lists are refreshed, as shown in Figure 2-6.
Figure 2-6. Everything has been updated
Now you can use another command for installing the required packages. Figure 2-7 shows the process.
sudo apt-get install golang python3-dev python-dev libcupti-dev libjpeg-turbo8-dev make tmux htop chromium-browser git cmake zlib1g-dev libjpeg-dev xvfb libav-tools xorg-dev python-opengl libboost-all-dev libsdl2-dev swig
Figure 2-7. Fetching the updates
As shown in Figure 2-8, you’ll need to type y and then press Enter to continue.
Figure 2-8. Continue with the installation
In the next step, the essential packages are downloaded and updated accordingly, as shown in Figure 2-9.
Figure 2-9. Downloading and extracting the packages
Next, you need to download the Anaconda distribution of Python. Open a browser window in Ubuntu (this example uses Mozilla Firefox) and search for the Anaconda installer, as shown in Figure 2-10.
Figure 2-10. Downloading Anaconda
Now you have to find the download that’s appropriate for your particular operating system. The Anaconda page is shown in Figure 2-11.
Figure 2-11. Anaconda page
Select the appropriate distribution of Anaconda, as shown in Figure 2-12.
Figure 2-12. Selecting the Anaconda version
Save the file next, as shown in Figure 2-13.
Figure 2-13. Saving the file
Now, using the terminal, change into the Downloads folder and check that the file was saved. See Figure 2-14.
Figure 2-14. Getting inside the downloads folder
You now have to use the bash command to run the shell script (see Figure 2-15):
bash Anaconda3-4.4.0-Linux-x86_64.sh
Figure 2-15. Running the shell script
When the installer prompts you, type yes and press Enter. Anaconda will be installed into the home location, as shown in Figure 2-16.
Figure 2-16. Setting up the Anaconda environment
The next step, shown in Figure 2-17, will install all the important packages for Anaconda so that it is configured properly.
Figure 2-17. Installing the key packages for Anaconda
After the Anaconda installation is complete, you need to open a new terminal to set up your Anaconda environment. You have to create a new environment for Anaconda using the conda create command (see Figure 2-18).
This command keeps all the packages in an isolated place.
conda create --name universe python=3.6 anaconda
Figure 2-18. Creating an environment
In the next step, the Anaconda environment will install the necessary packages. See Figure 2-19.
Figure 2-19. The packages for installing or updating Anaconda
Type y and then press Enter to continue. Then the entire process will be complete after every package is updated in the environment. You can now activate the environment. See Figure 2-20.
Figure 2-20. The packages for installing or updating Anaconda
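The exact activation command depends on your conda version; for the environment named universe created above, it is typically one of the following:
source activate universe
conda activate universe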
Some additional updates might need to be installed. You also need to install Swig, as shown in Figure 2-21.
conda install pip six libgcc swig
Figure 2-21. Installing Swig too
You will also have to install OpenCV in order to update certain packages, as shown in Figure 2-22.
Figure 2-22. Installing OpenCV
If there are updates to OpenCV, type y to install them too. See Figure 2-23.
Figure 2-23. Installing OpenCV
Next, you need to install TensorFlow. This chapter shows how to install the CPU version. See Figure 2-24.
pip install --upgrade tensorflow
Figure 2-24. Installing TensorFlow
Figure 2-25 shows the packages being installed for TensorFlow.
Figure 2-25. TensorFlow installs the packages
The next step, shown in Figure 2-26, asks for the privileges to install the other packages. Type y to continue.
Figure 2-26. Package installation happens
In the next section, we install Docker. We will first learn what Docker is.

Installing Docker

Docker packages applications into containers, which makes it a good option when you want to run your containers in the cloud. Developers generally use Docker because the entire architecture can be hosted on a single developer machine, which minimizes setup work. Enterprises use Docker to maintain an agile environment, and operators generally use Docker to run, manage, and keep an eye on apps effectively.
Now you will install Docker, as it is essential for OpenAI Gym and Universe to work: Universe runs its training environments inside Docker containers, which keeps the simulations responsive while using relatively few resources.
The command to be entered in the terminal is shown here:
$ sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
The next command to enter is:
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
This uses curl to download Docker's GPG key over HTTPS and adds it to your system's trusted keys.
Now add the Docker repository using this command:
$ sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
Type this command to update the package lists so the new repository is picked up, as shown in Figure 2-27:
$ sudo apt-get update
Figure 2-27. Updating the package
Type this command to install Docker, as shown in Figure 2-28:
$ sudo apt-get install docker-ce
Figure 2-28. Docker installation
To test Docker, use these commands (see Figure 2-29):
$ sudo service docker start
$ sudo docker run hello-world
Figure 2-29. Testing docker

An Example of Reinforcement Learning with Python

This section goes through an example of Reinforcement Learning and explains the flow of the algorithm. You’ll see how Reinforcement Learning can be applied. This section uses an open source GitHub repo that has a very good example of Reinforcement Learning. You will need to clone it to work with it.
The GitHub repo link is https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow . Within the Ubuntu module, get inside the terminal and start cloning the repo, as shown in Figure 2-30.
Figure 2-30. Cloning the repo
Figure 2-31 shows how the repo is replicated.
Figure 2-31. Replication of the repo
You will next get inside the folder that you used, as shown in Figure 2-32.
Figure 2-32. Getting inside the folder
We are working with a Reinforcement Learning scenario in which the letter O represents a wanderer. The wanderer wants to reach the treasure T as fast as it can.
The condition looks like this:
O-----T
The wanderer tries to find the quickest path to reach the treasure. During each episode, the steps the wanderer takes to reach the treasure are counted. With each episode, the condition improves and the number of steps declines.
Here are some of the basic steps in terms of Reinforcement Learning:
  • The program first defines the available actions, as actions are central to Reinforcement Learning.
  • The available actions for this wanderer are moving left or right:
    ACTIONS = ['left','right']
  • The wanderer can be considered the agent.
  • The number of states (the length of the one-dimensional world) is limited to 6 in this example:
    N_STATES = 6
Now you need to apply hyperparameters for Reinforcement Learning.

What Are Hyperparameters?

Hyperparameters are variables that are set before the model's parameters are learned. Generally, they are distinct from the parameters of the model for the underlying system under analysis.
We introduce epsilon, alpha, and gamma.
  • Epsilon is the greedy factor
  • Alpha is the learning rate
  • Gamma is the discount factor
The maximum number of episodes in this case is 13. The refresh time controls how often the displayed scenario is refreshed.
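These settings correspond to the constants defined near the top of the full code listing later in this section:
EPSILON = 0.9       # greedy policy factor
ALPHA = 0.1         # learning rate
GAMMA = 0.9         # discount factor
MAX_EPISODES = 13   # maximum number of episodes
FRESH_TIME = 0.3    # refresh time for one move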

Writing the Code

To create the process from which the computer learns, we have to formulate a table. This process is known as Q Learning, and the table is called a Q table. (You will learn more about Q Learning in the next chapter.) All the key elements are stored in the Q table, and all the decisions are made based on the Q table.
def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),     # q_table initial values
        columns=actions,    # action names
    )
    # print(table)    # show table
    return table
Now we have to take actions. To do so, we use this code:
def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or (state_actions.all() == 0):  # act non-greedy or state-action have no value
        action_name = np.random.choice(ACTIONS)
    else:   # act greedy
        action_name = state_actions.argmax()
    return action_name
Now we create the environment and determine how the agents will work within the environment:
def get_env_feedback(S, A):
    # This is how the agent will interact with the environment
    if A == 'right':    # move right
        if S == N_STATES - 2:   # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R
This function prints the wanderer and treasure hunt conditions:
def update_env(S, episode, step_counter):
    # This is how the environment is updated
    env_list = ['-']*(N_STATES-1) + ['T']   # '---------T' our environment
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode+1, step_counter)
        print(' {}'.format(interaction), end='')
        time.sleep(2)
        print('                                 ', end='')
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print(' {}'.format(interaction), end='')
        time.sleep(FRESH_TIME)
The rl() method runs the Q Learning scenario, which we discuss in the next chapter:
def rl():
    # main part of RL loop
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)  # take action & get next state and reward
            q_predict = q_table.ix[S, A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal
            else:
                q_target = R     # next state is terminal
                is_terminated = True    # terminate this episode
            q_table.ix[S, A] += ALPHA * (q_target - q_predict)  # update
            S = S_  # move to next state
            update_env(S, episode, step_counter+1)
            step_counter += 1
    return q_table
if __name__ == "__main__":
    q_table = rl()
    print(' Q-table: ')
    print(q_table)
The full code looks like this:
import numpy as np
import pandas as pd
import time
np.random.seed(2)  # reproducible
N_STATES = 6   # the length of the 1 dimensional world
ACTIONS = ['left', 'right']     # available actions
EPSILON = 0.9   # greedy policy
ALPHA = 0.1     # learning rate
GAMMA = 0.9    # discount factor
MAX_EPISODES = 13   # maximum episodes
FRESH_TIME = 0.3    # fresh time for one move
def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),     # q_table initial values
        columns=actions,    # action names
    )
    # print(table)    # show table
    return table
def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or (state_actions.all() == 0):  # act non-greedy or state-action have no value
        action_name = np.random.choice(ACTIONS)
    else:   # act greedy
        action_name = state_actions.argmax()
    return action_name
def get_env_feedback(S, A):
    # This is how the agent will interact with the environment
    if A == 'right':    # move right
        if S == N_STATES - 2:   # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R
def update_env(S, episode, step_counter):
    # This is how the environment is updated
    env_list = ['-']*(N_STATES-1) + ['T']   # '---------T' our environment
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode+1, step_counter)
        print(' {}'.format(interaction), end='')
        time.sleep(2)
        print('                                 ', end='')
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print(' {}'.format(interaction), end='')
        time.sleep(FRESH_TIME)
def rl():
    # main part of RL loop
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)  # take action & get next state and reward
            q_predict = q_table.ix[S, A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal
            else:
                q_target = R     # next state is terminal
                is_terminated = True    # terminate this episode
            q_table.ix[S, A] += ALPHA * (q_target - q_predict)  # update
            S = S_  # move to next state
            update_env(S, episode, step_counter+1)
            step_counter += 1
    return q_table
if __name__ == "__main__":
    q_table = rl()
    print(' Q-table: ')
    print(q_table)
Let’s now run the program and analyze the output. You need to get inside the cloned GitHub repo and into the required folder, as shown in Figure 2-33.
Figure 2-33. Getting inside the cloned repo
Now you need to get inside the directory to run the program, as shown in Figure 2-34.
Figure 2-34. Checking the directory
Now you have to run the program called treasure_on_right.py, which places the treasure to the right of the agent. See Figure 2-35.
Figure 2-35. Running the Python file
The program is running iterations, as shown in Figure 2-36.
Figure 2-36. As the iteration happens
As the program and the simulation complete, the final result is a Q table: for each state, the values reflect what the agent has learned about the worth of moving left versus moving right. Figure 2-37 shows the completed Q table.
Figure 2-37. The Q table created as a result

What Is MDP?

MDP (Markov Decision Process) is a framework of mathematical formulas and models for decision making in situations where outcomes are partly random and partly under the control of a decision maker.
MDPs have many different applications , as shown in Figure 2-38.
Figure 2-38. MDP and its applications
Every state in MDP satisfies the Markov property.

The Markov Property

In the world of Reinforcement Learning, the Markov property refers to the memory-less property of a stochastic process. Stochastic means the process is described by random variables: we are not storing a fixed value of a variable, because it changes on each iteration. The memory-less part means that what happens next depends only on the current state. See Figure 2-39.
Figure 2-39. The Markov property process
We talk about the Markov Chain in the next section.

The Markov Chain

A Markov process with a discrete state space or a discrete index set is known as a Markov Chain. The Markov Chain works in two ways, as shown in Figure 2-40.
Figure 2-40. Markov Chain
Let’s look at Markov Chains using an example. This example compares sales of Rin detergent versus the other detergents in the market. Assume that sales of Rin make up 20 percent of total detergent sales, which means the rest make up 80 percent. People who use Rin detergent are defined as A; the others are A′.
Now we define a rule. Of the people who use Rin detergent, 90% of them continue to use it after a week whereas 10% shift to another brand.
Similarly, 70% of the people who use another detergent shift to Rin after a week, and the rest continue to use the other detergent.
To analyze these conditions, we need a state diagram. See Figure 2-41.
Figure 2-41. Rin detergent state diagram
In the state diagram, we have created a scenario where the circular points represent states. From this state diagram, we have to assign a transition probability matrix.
The transition probability matrix we get from the state diagram is shown in Figure 2-42.
Figure 2-42. The transition probability matrix
To determine the use of Rin after one week (and then after two weeks), we have to apply a principle. This principle is common to each and every Markov process you try.
It can be shown as a connected graph, as shown in Figure 2-43.
Figure 2-43. A connected graph
From the origin, we have two paths: one for Rin detergent (through A) and the other for the rest (that is, A′). Here is how the path is created.
  1. From the origin, we create a path for A, so we have to focus on the transition probability matrix.
  2. We trace the path of A.
  3. From the starting market share, the detergent Rin has a market value of 20%.
  4. From the starting point A, we focus on the transition probability matrix.
There is a 90% probability of staying on A, so the other 10% change to the alternate path (to A′).
Figure 2-44 shows this path calculation graphically.
Figure 2-44. Path calculation
The total path probability is determined as follows: P = .2 * .9 + .8 * .7 = .18 + .56 = .74.
This is the percentage of people using Rin after one week.
This formula can also be conceptualized in terms of the starting market share (S0) and the transition probability matrix (P):
S0 * P = market share after one week
See Figure 2-45.
Figure 2-45. The matrix created for the next week
The calculation is:
.2 * .9 + .8 * .7 = .74
.2 * .1 + .8 * .3 = .26
S1 = [ .74  .26 ]
This is the first state matrix: after one week, Rin detergent has 74% of the market, and the other brands make up the remaining 26%.
Now try to find the percentage of people using Rin detergent after two weeks. Figure 2-46 shows the calculation that we need to do after two weeks.
Figure 2-46. The next transition matrix
So the result is:
   A      A′
= [ .848  .152 ]
After two weeks, 84.8% of the people will use Rin and 15.2% will use other detergents.
One question you might have is whether the sale of Rin will ever grow to 100% of the market. It will not: as we keep iterating, the state matrix becomes stationary after a certain number of iterations and finally settles at
   A      A′
= [ .875  .125 ]
which is the fixed point of this transition matrix, given the retention and switching rates above.
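A quick way to check these numbers is to iterate the transition matrix with NumPy. This is a small verification sketch using the shares and probabilities given above (it is not part of the book's repo):
import numpy as np

P = np.array([[0.9, 0.1],    # from A:  stay with Rin, switch away
              [0.7, 0.3]])   # from A': switch to Rin, stay away
S = np.array([0.2, 0.8])     # initial market share [A, A']

for week in range(1, 4):
    S = S @ P
    print(f"week {week}: {S.round(3)}")   # [.74 .26], [.848 .152], ...

# Iterating long enough converges to the stationary distribution.
for _ in range(100):
    S = S @ P
print("stationary:", S.round(3))          # approximately [.875 .125]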
After going through the basics of the Markov state and the Markov Chain, it’s time to focus on MDPs again.

MDPs

Almost all Reinforcement Learning problems can be formalized as MDPs. MDPs describe the kind of environment in which Reinforcement Learning is applied; at their core, they extend the Markov process.
A state (St) is Markov if and only if it meets the criteria shown in Figure 2-47.
Figure 2-47. The Markov state property
The state captures all relevant information from the history. We do not have to retain everything in history because only the previous state determines what will happen now.
For a Markov state (s) and successor state (s’), the state transition probability is defined in Figure 2-48.
Figure 2-48. The transition probability
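In standard notation (this is the conventional form, not a transcription of the figures), the Markov property and the state transition probability are written as follows:
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, S_2, …, S_t]
P_{ss′} = P[S_{t+1} = s′ | S_t = s]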
An MDP is a Markov reward process with a decision component added. It is a type of environment where all the states are Markov.
An MDP is a five-tuple <S, A, P, R, γ>:
  • S is the finite set of states
  • A is the finite set of actions
  • P is the state transition probability matrix
  • R is the reward function
  • γ (gamma) is the discount factor
A policy (π) is a distribution over actions in a given state. A policy is a function, or a decision-making process, that governs the transitions from one state to another.
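In the same standard notation, a policy can be written as:
π(a | s) = P[A_t = a | S_t = s]
that is, the probability of taking action a when in state s.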

SARSA

SARSA stands for State Action Reward next State and next Action. It is a different kind of Reinforcement Learning approach and is generally derived from temporal difference learning. We’ll discuss temporal difference learning first.

Temporal Difference Learning

Temporal difference learning bases its updates on nearby, successive states. We generally apply it when we are in a state and want to know what happens in the states that follow.
The general idea is that we want to predict the best path over a period of time.
We go from state S0 to state SF, receiving a reward in each state, and we try to predict the discounted sum of rewards. See Figure 2-49.
Figure 2-49. State transition
We start by looking at the Markov Chain, as shown in Figure 2-50.
Figure 2-50. The Markov Chain
Figure 2-51. The value function
The equation states that the value function maps the state to some number. This number is set to 0 if it is in the final state (see Figure 2-51).
For any state, the value is the expected value of the reward (r) and the discounted value of the ending state.
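As a minimal illustration of this idea (a sketch, not code from the book's repo), a tabular TD(0) update of the value function can be written in Python like this:
def td0_update(V, s, r, s_, terminal, alpha=0.1, gamma=0.9):
    # One temporal-difference update:
    #   V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
    v_s = V.get(s, 0.0)
    v_next = 0.0 if terminal else V.get(s_, 0.0)   # final states have value 0
    V[s] = v_s + alpha * (r + gamma * v_next - v_s)
    return V
Each observed transition (s, r, s′) nudges the estimate of V(s) toward the reward plus the discounted value of the state that follows, which is exactly the prediction described above.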

How SARSA Works

Now we get into SARSA. SARSA is known as an on-policy Reinforcement Learning method. On-policy means the agent learns only from its own experience, that is, from the actions it actually takes.
It accumulates updates over one or more steps and learns from those experiences.
From the current state, we choose an action and move to the next state. At the next state, we choose another action, and we use the current state and current action together with the next state and next action. We then update the corresponding Q value.
Here is the algorithm:
  1. Initialize Q(s, a) arbitrarily.
  2. For each episode, initialize s and choose a from s using the policy derived from Q (for example, ε-greedy).
  3. Take action a and observe r and s′.
  4. Choose a′ from s′ using the policy derived from Q (for example, ε-greedy).
  5. Update:
     Q(s, a) ← Q(s, a) + α[r + γQ(s′, a′) − Q(s, a)]
     s ← s′; a ← a′
  6. Repeat steps 3 through 5 until s is terminal, then start the next episode.
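A compact, self-contained sketch of this loop (independent of the repo code shown later; the environment functions env_reset and env_step are assumed to exist) might look like this in Python:
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.9

def epsilon_greedy(Q, s, actions):
    # Act greedily with probability EPSILON, otherwise explore randomly
    # (the same convention as the EPSILON used elsewhere in this chapter).
    if random.random() < EPSILON:
        return max(actions, key=lambda a: Q[(s, a)])
    return random.choice(actions)

def sarsa_episode(env_reset, env_step, actions, Q=None):
    # Run one SARSA episode; env_step(s, a) is assumed to return (s_, r, done).
    if Q is None:
        Q = defaultdict(float)
    s = env_reset()
    a = epsilon_greedy(Q, s, actions)
    done = False
    while not done:
        s_, r, done = env_step(s, a)
        a_ = epsilon_greedy(Q, s_, actions)
        target = r if done else r + GAMMA * Q[(s_, a_)]
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])   # SARSA update
        s, a = s_, a_
    return Q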
     

Q Learning

Q Learning is a model-free Reinforcement Learning technique. Figure 2-52 illustrates the general procedure for Q Learning.
Figure 2-52. The Q Learning process

What Is Q?

Q can be stated as a function of two parameters, s and a; it can also be represented as a table indexed by state and action.
Q represents the value of taking action a in state s.
Q[s, a] = Immediate reward + discounted reward
The immediate reward is the reward the agent receives when it performs an action and moves from one state to another.
The discounted reward is the reward expected from future states, reduced by the discount factor.

How to Use Q

In practice, we come across scenarios where we need to use the Q table (or the Q values), and this is where Q is applied.
We want to know which action to take, or which policy to implement, when we are in state s, and we use the Q table to get the best result.
If we are in state s, we need to determine which action is the best. We do not change s, but we go through all the values of a and determine which one is the largest. That will be the action we should take. Mathematically, this idea is represented as shown in Figures 2-53 and 2-54.
Figure 2-53. The policy equation
Figure 2-54. How policy works
For MDP, the policy we should implement depends on the current state. We maximize the rewards to get the optimal solution.
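As a small sketch of this argmax idea (the Q table below is a made-up NumPy array, not the one produced by the chapter's program):
import numpy as np

# Rows are states, columns are actions ['left', 'right'] -- illustrative values only.
Q = np.array([[0.00, 0.004],
              [0.00, 0.025],
              [0.03, 0.110]])

ACTIONS = ['left', 'right']

def greedy_action(Q, s):
    # Keep s fixed and pick the action whose Q value is largest.
    return ACTIONS[int(np.argmax(Q[s]))]

print(greedy_action(Q, 1))   # 'right'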

SARSA Implementation in Python

Recall that SARSA is an on-policy Reinforcement Learning approach.
For example, SARSA can be used to solve a maze. Using the SARSA approach, we cannot compare two different maze environments; the agent learns only from the maze it is actually working in, so we stick to the one maze used in this example.
The best thing about SARSA is that it learns by comparing the current state and action with the next state and action. We accumulate all of these experiences and learn from them.
Let's break this idea down more. The update is done on a Q table by comparing the changes in subsequent steps and then making a decision. This idea is illustrated in Figure 2-55.
Figure 2-55. Updating results using the SARSA table
The learning method in Python is different for SARSA. It looks like this:
def learn(self, s, a, r, s_, a_):
This method depends on the state, the action, the reward, the next state, and the next action.
If we compare the algorithm and convert it to Python, the construct for this equation is shown in Figure 2-56.
Figure 2-56. The SARSA equation
It’s converted to the following:
q_target = r + self.gamma * self.q_table.ix[s_, a_]
The difference between this equation and Q Learning is the change in this equation:
q_target = r + self.gamma * self.q_table.ix[s_, :].max()
The max() value is present in Q Learning but not in SARSA.
The logic for implementing a policy using SARSA is shown here:
# on-policy
class SarsaTable(RL):
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        super(SarsaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)
    def learn(self, s, a, r, s_, a_):
        self.check_state_exist(s_)
        q_predict = self.q_table.ix[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.ix[s_, a_]  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.ix[s, a] += self.lr * (q_target - q_predict)  # update
The learning process is somewhat different than with Q Learning. The logic works according to the principle discussed previously.
We combine the state and action of the current status with the next state and next action. This in turn updates the Q table. This is the way the learning works.
def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()
        # RL choose action based on observation
        action = RL.choose_action(str(observation))
        while True:
            # fresh env
            env.render()
            # RL take action and get next observation and reward
            observation_, reward, done = env.step(action)
            # RL choose action based on next observation
            action_ = RL.choose_action(str(observation_))
            # RL learn from this transition (s, a, r, s, a) ==> Sarsa
            RL.learn(str(observation), action, reward, str(observation_), action_)
            # swap observation and action
            observation = observation_
            action = action_
            # break while loop when end of this episode
            if done:
                break
Here is the code for creating the maze:
import numpy as np
import time
import sys
if sys.version_info.major == 2:
    import Tkinter as tk
else:
    import tkinter as tk
UNIT = 40   # pixels
MAZE_H = 4  # grid height
MAZE_W = 4  # grid width
class Maze(tk.Tk, object):
    def __init__(self):
        super(Maze, self).__init__()
        self.action_space = ['u', 'd', 'l', 'r']
        self.n_actions = len(self.action_space)
        self.title('maze')
        self.geometry('{0}x{1}'.format(MAZE_H * UNIT, MAZE_H * UNIT))
        self._build_maze()
    def _build_maze(self):
        self.canvas = tk.Canvas(self, bg='white',
                           height=MAZE_H * UNIT,
                           width=MAZE_W * UNIT)
        # create grids
        for c in range(0, MAZE_W * UNIT, UNIT):
            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT
            self.canvas.create_line(x0, y0, x1, y1)
        for r in range(0, MAZE_H * UNIT, UNIT):
            x0, y0, x1, y1 = 0, r, MAZE_H * UNIT, r
            self.canvas.create_line(x0, y0, x1, y1)
        # create origin
        origin = np.array([20, 20])
        # hell
        hell1_center = origin + np.array([UNIT * 2, UNIT])
        self.hell1 = self.canvas.create_rectangle(
            hell1_center[0] - 15, hell1_center[1] - 15,
            hell1_center[0] + 15, hell1_center[1] + 15,
            fill='black')
        # hell
        hell2_center = origin + np.array([UNIT, UNIT * 2])
        self.hell2 = self.canvas.create_rectangle(
            hell2_center[0] - 15, hell2_center[1] - 15,
            hell2_center[0] + 15, hell2_center[1] + 15,
            fill='black')
        # create oval
        oval_center = origin + UNIT * 2
        self.oval = self.canvas.create_oval(
            oval_center[0] - 15, oval_center[1] - 15,
            oval_center[0] + 15, oval_center[1] + 15,
            fill='yellow')
        # create red rect
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')
        # pack all
        self.canvas.pack()
    def reset(self):
        self.update()
        time.sleep(0.5)
        self.canvas.delete(self.rect)
        origin = np.array([20, 20])
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')
        # return observation
        return self.canvas.coords(self.rect)
    def step(self, action):
        s = self.canvas.coords(self.rect)
        base_action = np.array([0, 0])
        if action == 0:   # up
            if s[1] > UNIT:
                base_action[1] -= UNIT
        elif action == 1:   # down
            if s[1] < (MAZE_H - 1) * UNIT:
                base_action[1] += UNIT
        elif action == 2:   # right
            if s[0] < (MAZE_W - 1) * UNIT:
                base_action[0] += UNIT
        elif action == 3:   # left
            if s[0] > UNIT:
                base_action[0] -= UNIT
        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent
        s_ = self.canvas.coords(self.rect)  # next state
        # reward function
        if s_ == self.canvas.coords(self.oval):
            reward = 1
            done = True
        elif s_ in [self.canvas.coords(self.hell1), self.canvas.coords(self.hell2)]:
            reward = -1
            done = True
        else:
            reward = 0
            done = False
        return s_, reward, done
    def render(self):
        time.sleep(0.1)
        self.update()

The Entire Reinforcement Logic in Python

When you are implementing the algorithm in Python, the structure looks like the following. The content is in the repo.
import numpy as np
import pandas as pd
class RL(object):
    def __init__(self, action_space, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = action_space  # a list
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy
        self.q_table = pd.DataFrame(columns=self.actions)
    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # append new state to q table
            self.q_table = self.q_table.append(
                pd.Series(
                    [0]*len(self.actions),
                    index=self.q_table.columns,
                    name=state,
                )
            )
    def choose_action(self, observation):
        self.check_state_exist(observation)
        # action selection
        if np.random.rand() < self.epsilon:
            # choose best action
            state_action = self.q_table.ix[observation, :]
            state_action = state_action.reindex(np.random.permutation(state_action.index))     # some actions have the same value
            action = state_action.argmax()
        else:
            # choose random action
            action = np.random.choice(self.actions)
        return action
    def learn(self, *args):
        pass
# off-policy
class QLearningTable(RL):
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        super(QLearningTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)
    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.ix[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.ix[s_, :].max()  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.ix[s, a] += self.lr * (q_target - q_predict)  # update
# on-policy
class SarsaTable(RL):
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        super(SarsaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)
    def learn(self, s, a, r, s_, a_):
        self.check_state_exist(s_)
        q_predict = self.q_table.ix[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.ix[s_, a_]  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.ix[s, a] += self.lr * (q_target - q_predict)  # update
The learning process in its entirety looks like this in the code (run_this.py):
from maze_env import Maze
from RL_brain import SarsaTable
def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()
        # RL choose action based on observation
        action = RL.choose_action(str(observation))
        while True:
            # fresh env
            env.render()
            # RL take action and get next observation and reward
            observation_, reward, done = env.step(action)
            # RL choose action based on next observation
            action_ = RL.choose_action(str(observation_))
            # RL learn from this transition (s, a, r, s, a) ==> Sarsa
            RL.learn(str(observation), action, reward, str(observation_), action_)
            # swap observation and action
            observation = observation_
            action = action_
            # break while loop when end of this episode
            if done:
                break
    # end of game
    print('game over')
    env.destroy()
if __name__ == "__main__":
    env = Maze()
    RL = SarsaTable(actions=list(range(env.n_actions)))
    env.after(100, update)
    env.mainloop()
Let’s run the program and check it.
You can do this in the Anaconda environment, as shown in Figure 2-57.
Figure 2-57. Activating the environment
You then have to consider the SARSA maze, as shown in Figure 2-58.
Figure 2-58. Considering the SARSA maze
Now you have to call the run_this.py file to get the program running, as shown in Figure 2-59.
Figure 2-59. Running run_this.py
To run the program from the terminal, use this command:
python run_this.py
After running the code, the program will play the maze, as shown in Figure 2-60.
Figure 2-60. The program playing maze

Dynamic Programming in Reinforcement Learning

Problems that are sequential or temporal can be solved using dynamic programming. If you have a complex problem, you break it down into subproblems, solve those subproblems, and finally combine the solutions to solve the overall problem. Dynamic programming applies when the problem has optimal substructure (the principle of optimality holds) and when subproblem solutions can be cached and reused. See Figure 2-61.
Figure 2-61. Dynamic problem-solving approach
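To connect dynamic programming back to this chapter's examples, here is a small value-iteration sketch for the six-state corridor used earlier. The reward and transition structure mirror get_env_feedback, but the code itself is illustrative rather than taken from the repo:
N_STATES = 6     # corridor o-----T, treasure at the right end
GAMMA = 0.9
ACTIONS = ['left', 'right']

def next_state_and_reward(s, a):
    # Deterministic model of the corridor, mirroring get_env_feedback.
    if a == 'right':
        return ('terminal', 1) if s == N_STATES - 2 else (s + 1, 0)
    return (max(s - 1, 0), 0)

# Value iteration: repeatedly back up V(s) = max over a of [r + gamma * V(s')].
V = {s: 0.0 for s in range(N_STATES - 1)}
V['terminal'] = 0.0
for _ in range(50):                      # 50 sweeps is plenty for 5 states
    for s in range(N_STATES - 1):
        backups = []
        for a in ACTIONS:
            s_, r = next_state_and_reward(s, a)
            backups.append(r + GAMMA * V[s_])
        V[s] = max(backups)
print(V)   # values increase toward the treasure state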

Conclusion

This chapter went through different algorithms related to Reinforcement Learning. You also saw a simple example of Reinforcement Learning using Python. You then learned about SARSA with the help of an example in Python. The chapter ended by discussing dynamic programming basics.